data_juicer.ops.filter.llm_analysis_filter module

class data_juicer.ops.filter.llm_analysis_filter.LLMAnalysisFilter(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, max_score: float = 1.0, is_hf_model: bool = False, *, api_endpoint: str | None = None, response_path: str | None = None, input_keys: List[str] = ['text'], field_names: List[str] = ['Text'], system_prompt: str | None = None, input_template: str | None = None, field_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, dim_required_keys: List[str] | None = None, **kwargs)[source]

Bases: Filter

Base filter class for leveraging LLMs to filter samples. Provides foundational functionality for dimensional scoring (1-5) and tagging.

DEFAULT_SYSTEM_PROMPT = 'You are a meticulous data quality assessor for LLM training. Analyze each data sample across multiple quality dimensions and provide numerical scores, tags, and reasoning. Follow these guidelines:\n\n1. Evaluation Dimensions\nScore each dimension (1-5 scale: 1=lowest, 5=highest):\n- Clarity: How easy is the sample to understand?\n- Relevance: How relevant is the sample to the intended task or topic?\n- Usefulness: How helpful or valuable is the information in the sample?\n- Fluency: How natural and well-written is the sample (grammar, style)?\n\n2. Tagging:\nAssign descriptive tags to categorize the data sample (string or list of string).  Examples include:\n- "Topic": The main subject of the sample (e.g., "Machine Learning", "Historical Event").\n- "Style":  The writing style or genre (e.g., "Informational", "Narrative", "Technical").\n3. Scoring Protocol\n- Base scores and tags on concrete evidence from the text.\n- Flag samples needing human review (confidence <90%).\n- Compare with similar data points for consistency.\n- Penalize hallucination/misinformation severely (if applicable).\n\n4. Output Format\njson\n{\n  "dimension_scores": {\n    "clarity": ,\n    "relevance": ,\n    "usefulness": ,\n    "fluency":\n  },\n  "tags": {\n    "topic": ,\n    "style":\n  },\n  "flags": ["syntax_error", "insufficient_information", ...],\n  "rationale": "Concise analysis of quality dimensions and tagging decisions.",\n  "recommendation": ["keep", "review", "discard"]\n}\n\n5. Special Instructions\n- Prioritize accuracy and relevance over stylistic qualities.\n- Contextualize cultural references appropriately.\n- Clearly justify your scores, tags, and flags in the rationale.\n- Response a json dict\n\nExample Response:\n\njson\n{\n  "dimension_scores": {\n    "clarity": 4,\n    "relevance": 5,\n    "usefulness": 3,\n    "fluency": 4\n  },\n  "tags": {\n    "topic": "Artificial Intelligence",\n    "style": "Informational"\n  },\n  "flags": ["minor_grammar_issues"],\n  "rationale": "The text is highly relevant and generally well-written, but suffers from some minor grammar issues and could be more useful with additional examples.  The topic is clearly Artificial Intelligence, and the difficulty is appropriate for an intermediate audience.",\n  "recommendation": "review"\n}\n'
DEFAULT_INPUT_TEMPLATE = "# Data\n'''\n{data}\n'''\n\n# Response\njson\n"
DEFAULT_FIELD_TEMPLATE = '**{field_name}**\n{field_data}'
DEFAULT_DIM_REQUIRED_KEYS = ['clarity', 'relevance', 'usefulness', 'fluency']
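The two default templates compose the model input: each (field name, field data) pair is rendered with DEFAULT_FIELD_TEMPLATE, and the joined blocks fill the {data} slot of DEFAULT_INPUT_TEMPLATE. A minimal sketch of that composition (the double-newline join between fields and the skipping of absent keys are assumptions, not confirmed behavior):

```python
FIELD_TEMPLATE = '**{field_name}**\n{field_data}'
INPUT_TEMPLATE = "# Data\n'''\n{data}\n'''\n\n# Response\njson\n"

def build_prompt(input_keys, field_names, sample):
    # Render one block per requested key, skipping keys absent from the sample.
    blocks = [
        FIELD_TEMPLATE.format(field_name=name, field_data=sample[key])
        for key, name in zip(input_keys, field_names)
        if key in sample
    ]
    return INPUT_TEMPLATE.format(data='\n\n'.join(blocks))

prompt = build_prompt(['text'], ['Text'], {'text': 'LLMs compress the web.'})
```

With the defaults (input_keys=['text'], field_names=['Text']), this yields a single **Text** block wrapped in the # Data fence that the system prompt expects.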
__init__(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, max_score: float = 1.0, is_hf_model: bool = False, *, api_endpoint: str | None = None, response_path: str | None = None, input_keys: List[str] = ['text'], field_names: List[str] = ['Text'], system_prompt: str | None = None, input_template: str | None = None, field_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, dim_required_keys: List[str] | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • api_or_hf_model – API or huggingface model name.

  • min_score – The min score threshold to keep the sample.

  • max_score – The max score threshold to keep the sample.

  • is_hf_model – If True, use a Hugging Face model. Otherwise, use the API.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • input_keys – Subset of keys in the sample. Supports data with multiple fields, such as ‘query’, ‘analysis’ and ‘answer’ in RFT data.

  • field_names – Corresponding field names for input keys.

  • system_prompt – System prompt for the task.

  • input_template – Template for building the model input.

  • field_template – Template for each field in the prompt.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • enable_vllm – If True, use vLLM to load the Hugging Face or local LLM. Otherwise, use the API for inference.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call, e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}.

  • dim_required_keys – A list of keys used to calculate the average dimension score; only the dimension scores associated with these keys are included in the average.

  • kwargs – Extra keyword arguments.

build_input(sample)[source]
parse_output(raw_output)[source]
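parse_output has to recover the JSON dict from the raw completion (which may be wrapped in a code fence) and reduce the dimension scores to a single value. A hedged sketch, assuming the score is the plain average over dim_required_keys (the regex-based extraction and the helper name parse_analysis are illustrative, not the library's API):

```python
import json
import re

def parse_analysis(raw_output,
                   dim_required_keys=('clarity', 'relevance', 'usefulness', 'fluency')):
    # Grab the outermost JSON object, tolerating a surrounding ```json fence.
    match = re.search(r'\{.*\}', raw_output, re.DOTALL)
    if match is None:
        return None, None
    record = json.loads(match.group(0))
    dims = record.get('dimension_scores', {})
    scores = [dims[k] for k in dim_required_keys if k in dims]
    avg = sum(scores) / len(scores) if scores else None
    return avg, record

raw = ('```json\n{"dimension_scores": {"clarity": 4, "relevance": 5, '
       '"usefulness": 3, "fluency": 4}, "tags": {"topic": "AI"}, '
       '"recommendation": "keep"}\n```')
avg, record = parse_analysis(raw)
# avg == 4.0
```

A try_num retry loop around the API call would re-prompt whenever this parse step raises or returns None.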
generate_llm_analysis(sample, rank)[source]
compute_stats_single(sample, rank=None, context=False)[source]

Compute stats for the sample, which are used as the metric to decide whether to filter this sample.

Parameters:
  • sample – input sample.

  • context – whether to store context information of intermediate vars in the sample temporarily.

Returns:

sample with computed stats

process_single(sample, rank=None)[source]

For sample level, sample --> Boolean.

Parameters:

sample – sample to decide whether to filter

Returns:

true for keeping and false for filtering
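The keep decision compares the stored score against the [min_score, max_score] window. A sketch assuming the mean 1-5 dimension score is normalized to 0-1 before comparison (consistent with the 0.5/1.0 defaults, but the scale factor and helper name should_keep are assumptions):

```python
def should_keep(dim_scores, min_score=0.5, max_score=1.0, scale=5.0):
    # Normalize the mean 1-5 score into [0, 1], then apply the window.
    avg = sum(dim_scores.values()) / len(dim_scores)
    return min_score <= avg / scale <= max_score

# 4+5+3+4 averages to 4.0, i.e. 0.8 after normalization: kept.
keep = should_keep({'clarity': 4, 'relevance': 5, 'usefulness': 3, 'fluency': 4})
```

Restricting dim_scores to dim_required_keys before calling such a check reproduces the dim_required_keys behavior described above.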