data_juicer.ops.filter.llm_analysis_filter module¶
- class data_juicer.ops.filter.llm_analysis_filter.LLMAnalysisFilter(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, max_score: float = 1.0, is_hf_model: bool = False, *, api_endpoint: str | None = None, response_path: str | None = None, input_keys: List[str] = ['text'], field_names: List[str] = ['Text'], system_prompt: str | None = None, input_template: str | None = None, field_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, dim_required_keys: List[str] | None = None, **kwargs)[source]¶
Bases: Filter
Base filter class for leveraging LLMs to analyze and filter data samples.
This operator uses an LLM to score and tag each sample across multiple quality dimensions. It supports both API-based and Hugging Face models. The LLM evaluates the sample on clarity, relevance, usefulness, and fluency, providing scores from 1 to 5. Tags are assigned to categorize the sample, and a recommendation is made to keep, review, or discard the sample. The average score is computed based on the required dimension keys. Samples are kept if their average score falls within the specified min and max score thresholds. The key metric ‘llm_analysis_score’ is cached in the sample’s stats.
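The keep/discard decision reduces to a threshold check on the averaged dimension scores. The snippet below is a minimal sketch of that logic, not the operator's actual implementation; in particular, rescaling the 1-5 average into the [0, 1] range used by the default `min_score`/`max_score` thresholds is an assumption.

```python
# Minimal sketch of the filtering criterion described above.
# Assumption: the 1-5 dimension average is normalized to [0, 1]
# to match the default min_score/max_score thresholds.
dimension_scores = {"clarity": 4, "relevance": 5, "usefulness": 3, "fluency": 4}
dim_required_keys = ["clarity", "relevance", "usefulness", "fluency"]

avg = sum(dimension_scores[k] for k in dim_required_keys) / len(dim_required_keys)
llm_analysis_score = avg / 5.0  # assumed rescaling

min_score, max_score = 0.5, 1.0
keep_sample = min_score <= llm_analysis_score <= max_score
```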
- DEFAULT_SYSTEM_PROMPT = 'You are a meticulous data quality assessor for LLM training. Analyze each data sample across multiple quality dimensions and provide numerical scores, tags, and reasoning. Follow these guidelines:\n\n1. Evaluation Dimensions\nScore each dimension (1-5 scale: 1=lowest, 5=highest):\n- Clarity: How easy is the sample to understand?\n- Relevance: How relevant is the sample to the intended task or topic?\n- Usefulness: How helpful or valuable is the information in the sample?\n- Fluency: How natural and well-written is the sample (grammar, style)?\n\n2. Tagging:\nAssign descriptive tags to categorize the data sample (string or list of string). Examples include:\n- "Topic": The main subject of the sample (e.g., "Machine Learning", "Historical Event").\n- "Style": The writing style or genre (e.g., "Informational", "Narrative", "Technical").\n3. Scoring Protocol\n- Base scores and tags on concrete evidence from the text.\n- Flag samples needing human review (confidence <90%).\n- Compare with similar data points for consistency.\n- Penalize hallucination/misinformation severely (if applicable).\n\n4. Output Format\njson\n{\n "dimension_scores": {\n "clarity": ,\n "relevance": ,\n "usefulness": ,\n "fluency":\n },\n "tags": {\n "topic": ,\n "style":\n },\n "flags": ["syntax_error", "insufficient_information", ...],\n "rationale": "Concise analysis of quality dimensions and tagging decisions.",\n "recommendation": ["keep", "review", "discard"]\n}\n\n5. Special Instructions\n- Prioritize accuracy and relevance over stylistic qualities.\n- Contextualize cultural references appropriately.\n- Clearly justify your scores, tags, and flags in the rationale.\n- Response a json dict\n\nExample Response:\n\njson\n{\n "dimension_scores": {\n "clarity": 4,\n "relevance": 5,\n "usefulness": 3,\n "fluency": 4\n },\n "tags": {\n "topic": "Artificial Intelligence",\n "style": "Informational"\n },\n "flags": ["minor_grammar_issues"],\n "rationale": "The text is highly relevant and generally well-written, but suffers from some minor grammar issues and could be more useful with additional examples. The topic is clearly Artificial Intelligence, and the difficulty is appropriate for an intermediate audience.",\n "recommendation": "review"\n}\n'¶
- DEFAULT_INPUT_TEMPLATE = "# Data\n'''\n{data}\n'''\n\n# Response\njson\n"¶
- DEFAULT_FIELD_TEMPLATE = '**{field_name}**\n{field_data}'¶
- DEFAULT_DIM_REQUIRED_KEYS = ['clarity', 'relevance', 'usefulness', 'fluency']¶
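To see how the default templates fit together, the sketch below renders a multi-field sample with `DEFAULT_FIELD_TEMPLATE` and wraps the result in `DEFAULT_INPUT_TEMPLATE`. The template strings are the defaults listed above; joining the rendered fields with a newline is an assumption about the implementation.

```python
# Sketch of prompt construction from the documented default templates.
DEFAULT_INPUT_TEMPLATE = "# Data\n'''\n{data}\n'''\n\n# Response\njson\n"
DEFAULT_FIELD_TEMPLATE = "**{field_name}**\n{field_data}"

sample = {"query": "What is dropout?", "answer": "A regularization technique."}
input_keys = ["query", "answer"]
field_names = ["Query", "Answer"]

rendered_fields = [
    DEFAULT_FIELD_TEMPLATE.format(field_name=name, field_data=sample[key])
    for key, name in zip(input_keys, field_names)
]
# Assumption: rendered fields are joined by newlines before templating.
prompt = DEFAULT_INPUT_TEMPLATE.format(data="\n".join(rendered_fields))
print(prompt)
```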
- __init__(api_or_hf_model: str = 'gpt-4o', min_score: float = 0.5, max_score: float = 1.0, is_hf_model: bool = False, *, api_endpoint: str | None = None, response_path: str | None = None, input_keys: List[str] = ['text'], field_names: List[str] = ['Text'], system_prompt: str | None = None, input_template: str | None = None, field_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, enable_vllm: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, dim_required_keys: List[str] | None = None, **kwargs)[source]¶
Initialization method.
- Parameters:
api_or_hf_model – API or Hugging Face model name.
min_score – The min score threshold to keep the sample.
max_score – The max score threshold to keep the sample.
is_hf_model – If True, use Hugging Face Transformers to load the Hugging Face or local LLM.
api_endpoint – URL endpoint for the API.
response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.
input_keys – Subset of keys in the sample. Supports data with multiple fields, such as ‘query’, ‘analysis’, and ‘answer’ in RFT data (see the construction example after this parameter list).
field_names – Corresponding field names for input keys.
system_prompt – System prompt for the task.
input_template – Template for building the model input.
field_template – Template for each field in the prompt.
try_num – The number of retry attempts when there is an API call error or output parsing error.
enable_vllm – If True, use vLLM to load the Hugging Face or local LLM.
model_params – Parameters for initializing the API model.
sampling_params – Extra parameters passed to the API call, e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}.
dim_required_keys – A list of keys used to calculate the average dimension score; only the dimension scores associated with these keys are included in the average.
kwargs – Extra keyword arguments.
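A hedged construction example follows, using only the parameters documented above. The multi-field RFT-style keys and the particular choice of dim_required_keys are illustrative, not prescriptive.

```python
from data_juicer.ops.filter.llm_analysis_filter import LLMAnalysisFilter

# Illustrative configuration for multi-field (RFT-style) samples.
# The model name and sampling parameters are placeholders.
op = LLMAnalysisFilter(
    api_or_hf_model="gpt-4o",
    min_score=0.5,
    max_score=1.0,
    input_keys=["query", "answer"],
    field_names=["Query", "Answer"],
    dim_required_keys=["relevance", "usefulness"],
    try_num=3,
    sampling_params={"temperature": 0.9, "top_p": 0.95},
)
```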
- compute_stats_single(sample, rank=None, context=False)[source]¶
Compute stats for the sample, which are used as metrics to decide whether to filter this sample.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
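Usage might look like the sketch below, assuming API credentials for the configured model are available in the environment. It also assumes the sample carries an (initially empty) stats dict under Data-Juicer's stats field and that the averaged score is cached under the ‘llm_analysis_score’ key mentioned in the class description.

```python
from data_juicer.ops.filter.llm_analysis_filter import LLMAnalysisFilter
from data_juicer.utils.constant import Fields  # assumed home of the stats-field constant

op = LLMAnalysisFilter(api_or_hf_model="gpt-4o", min_score=0.5, max_score=1.0)

# Assumption: an empty stats dict is an acceptable starting point.
sample = {"text": "Dropout randomly zeroes activations during training.", Fields.stats: {}}

# Compute the LLM-based quality stats for this single sample.
sample = op.compute_stats_single(sample)

# The averaged dimension score is cached in the sample's stats.
print(sample[Fields.stats]["llm_analysis_score"])
```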