data_juicer.ops.aggregator.most_relevant_entities_aggregator module
- class data_juicer.ops.aggregator.most_relevant_entities_aggregator.MostRelevantEntitiesAggregator(api_model: str = 'gpt-4o', entity: str = None, query_entity_type: str = None, input_key: str = 'event_description', output_key: str = 'most_relevant_entities', max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]
Bases:
Aggregator
Extract the entities most closely related to a given entity from a set of texts, and sort them in descending order of importance.
- DEFAULT_SYSTEM_TEMPLATE = '给定与`{entity}`相关的一些文档,总结一些与`{entity}`最为相关的`{entity_type}`。\n要求:\n- 不用包含与{entity}为同一{entity_type}的{entity_type}。\n- 请按照人物的重要性进行排序,**越重要人物在列表越前面**。\n- 你的返回格式如下:\n## 分析\n你对各个{entity_type}与{entity}关联度的分析\n## 列表\n人物1, 人物2, 人物3, ...'
- DEFAULT_INPUT_TEMPLATE = '`{entity}`的相关文档:\n{sub_docs}\n\n与`{entity}`最相关的一些`{entity_type}`:\n'
- DEFAULT_OUTPUT_PATTERN = '\\#\\#\\s*列表\\s*(.*?)\\Z'
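The defaults above imply a round trip: the input template is filled with the entity, the entity type, and the sub-documents, and the model's reply is expected to end with a `## 列表` ("list") section that the output pattern captures. The following is a minimal sketch of that round trip, not the operator's actual implementation. The helper names `render_input` and `parse_entities`, the use of `str.format`, the `re.DOTALL` flag, and the comma-based splitting are all assumptions made for illustration.

```python
import re

# Template and pattern strings copied from the class attributes above.
INPUT_TEMPLATE = '`{entity}`的相关文档:\n{sub_docs}\n\n与`{entity}`最相关的一些`{entity_type}`:\n'
OUTPUT_PATTERN = '\\#\\#\\s*列表\\s*(.*?)\\Z'


def render_input(entity, entity_type, sub_docs):
    """Hypothetical helper: fill the input template.

    Assumption: placeholders are filled with str.format and the
    sub-documents are joined with newlines.
    """
    return INPUT_TEMPLATE.format(
        entity=entity,
        entity_type=entity_type,
        sub_docs='\n'.join(sub_docs),
    )


def parse_entities(response):
    """Hypothetical helper: pull the ranked list out of a model reply.

    Assumption: re.DOTALL is used so that (.*?) can span line breaks,
    and entities are separated by ASCII or full-width commas.
    """
    match = re.search(OUTPUT_PATTERN, response, re.DOTALL)
    if match is None:
        return []
    parts = re.split(r'[,,]', match.group(1))
    return [name.strip() for name in parts if name.strip()]


prompt = render_input('孙悟空', '人物', ['doc A', 'doc B'])

# A reply shaped like the format requested by DEFAULT_SYSTEM_TEMPLATE.
response = '## 分析\n(analysis text)\n## 列表\n唐僧, 猪八戒, 沙僧\n'
entities = parse_entities(response)
```

Because the pattern ends in `\Z`, everything after the `## 列表` header up to the end of the reply is captured, so a reply with trailing commentary after the list would leak into the last entity. A custom `output_pattern` can be supplied if the model's reply format differs.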
- __init__(api_model: str = 'gpt-4o', entity: str = None, query_entity_type: str = None, input_key: str = 'event_description', output_key: str = 'most_relevant_entities', max_token_num: Annotated[int, Gt(gt=0)] | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]
Initialization method.
- Parameters:
api_model -- API model name.
entity -- The given entity.
query_entity_type -- The type of queried relevant entities.
input_key -- The input key in the meta field of the samples. It is "event_description" by default.
output_key -- The output key in the aggregation field of the samples. It is "most_relevant_entities" by default.
max_token_num -- The maximum total number of tokens across the sub-documents. No limit is applied if it is None.
api_endpoint -- URL endpoint for the API.
response_path -- Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt_template -- The system prompt template.
input_template -- The input template.
output_pattern -- The output pattern.
try_num -- The number of retry attempts when there is an API call error or output parsing error.
model_params -- Parameters for initializing the API model.
sampling_params -- Extra parameters passed to the API call, e.g. {'temperature': 0.9, 'top_p': 0.95}.
kwargs -- Extra keyword arguments.
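As an illustration of how the parameters above fit together, the sketch below collects a set of keyword arguments that mirror the documented signature. The entity, entity type, and `max_token_num` values are hypothetical, and the `build_aggregator` helper is not part of the library. It imports the class lazily so the argument dictionary can be inspected without Data-Juicer installed or an API key configured.

```python
# Keyword arguments mirroring the documented __init__ signature.
# The concrete values for entity, query_entity_type, and max_token_num
# are illustrative assumptions, not recommended settings.
aggregator_kwargs = {
    "api_model": "gpt-4o",
    "entity": "孙悟空",
    "query_entity_type": "人物",
    "input_key": "event_description",
    "output_key": "most_relevant_entities",
    "max_token_num": 4096,  # must be > 0, or None for no limit
    "try_num": 3,           # must be > 0
    "sampling_params": {"temperature": 0.9, "top_p": 0.95},
}


def build_aggregator(kwargs):
    """Hypothetical helper: construct the operator from a kwargs dict.

    The import is deferred so that this module loads even when
    data_juicer is not installed.
    """
    from data_juicer.ops.aggregator.most_relevant_entities_aggregator import (
        MostRelevantEntitiesAggregator,
    )
    return MostRelevantEntitiesAggregator(**kwargs)
```

Calling `build_aggregator(aggregator_kwargs)` requires Data-Juicer and working API credentials. Parameters left out of the dictionary, such as `api_endpoint` or `output_pattern`, keep the defaults listed above.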