data_juicer.ops.mapper.relation_identity_mapper module¶

class data_juicer.ops.mapper.relation_identity_mapper.RelationIdentityMapper(api_model: str = 'gpt-4o', source_entity: str = None, target_entity: str = None, *, output_key: str = 'role_relation', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]¶

Bases: Mapper

Identify the relation between two entities in a given text.

This operator uses an API model to analyze the relationship between two specified entities in the text. It constructs a prompt with the provided system and input templates, then sends it to the API model for analysis. The output is parsed using a regular expression to extract the relationship. If the two entities are the same, the relationship is identified as “another identity.” The result is stored in the meta field under the key ‘role_relation’ by default. The operator retries the API call up to a specified number of times in case of errors. If drop_text is set to True, the original text is removed from the sample after processing.

DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定关于{entity1}和{entity2}的文本信息。判断{entity1}和{entity2}之间的关系。\n要求：\n- 关系用一个或多个词语表示，必要时可以加一个形容词来描述这段关系\n- 输出关系时不要参杂任何标点符号\n- 需要你进行合理的推理才能得出结论\n- 如果两个人物身份是同一个人，输出关系为：另一个身份\n- 输出格式为：\n分析推理：...\n所以{entity2}是{entity1}的：...\n- 注意输出的是{entity2}是{entity1}的什么关系，而不是{entity1}是{entity2}的什么关系'¶

DEFAULT_INPUT_TEMPLATE = '关于{entity1}和{entity2}的文本信息：\n```\n{text}\n```\n'¶

DEFAULT_OUTPUT_PATTERN_TEMPLATE = '\n \\s*分析推理：\\s*(.*?)\\s*\n \\s*所以{entity2}是{entity1}的：\\s*(.*?)\\Z\n '¶

__init__(api_model: str = 'gpt-4o', source_entity: str = None, target_entity: str = None, *, output_key: str = 'role_relation', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]¶

Initialization method. :param api_model: API model name. :param source_entity: The source entity of the relation to be

identified.

Parameters:

target_entity – The target entity of the relation to be identified.
output_key – The output key in the meta field in the samples. It is ‘role_relation’ in default.
api_endpoint – URL endpoint for the API.
response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.
system_prompt_template – System prompt template for the task.
input_template – Template for building the model input.
output_pattern_template – Regular expression template for parsing model output.
try_num – The number of retry attempts when there is an API call error or output parsing error.
drop_text – If drop the text in the output.
model_params – Parameters for initializing the API model.
sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}
kwargs – Extra keyword arguments.

parse_output(raw_output)[source]¶

process_single(sample, rank=None)[source]¶

For sample level, sample –> sample

Parameters:: sample – sample to process
Returns:: processed sample