data_juicer.ops.mapper.dialog_topic_detection_mapper module

class data_juicer.ops.mapper.dialog_topic_detection_mapper.DialogTopicDetectionMapper(api_model: str = 'gpt-4o', topic_candidates: List[str] | None = None, max_round: Annotated[int, Ge(ge=0)] = 10, *, labels_key: str = 'dialog_topic_labels', analysis_key: str = 'dialog_topic_labels_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, candidate_template: str | None = None, analysis_template: str | None = None, labels_template: str | None = None, analysis_pattern: str | None = None, labels_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Mapper to generate the user's topic labels in a dialog. Input is taken from history_key, query_key, and response_key. Output is a list of labels and a list of analyses, one per user query in the dialog.

DEFAULT_SYSTEM_PROMPT = '请判断用户和LLM多轮对话中用户所讨论的话题。\n要求:\n- 针对用户的每个query,需要先进行分析,然后列出用户正在讨论的话题,下面是一个样例,请模仿样例格式输出。\n用户:你好,今天我们来聊聊秦始皇吧。\n话题分析:用户提到秦始皇,这是中国历史上第一位皇帝。\n话题类别:历史\nLLM:当然可以,秦始皇是中国历史上第一个统一全国的皇帝,他在公元前221年建立了秦朝,并采取了一系列重要的改革措施,如统一文字、度量衡和货币等。\n用户:秦始皇修建的长城和现在的长城有什么区别?\n话题分析:用户提到秦始皇修建的长城,并将其与现代长城进行比较,涉及建筑历史和地理位置。\n话题类别:历史LLM:秦始皇时期修建的长城主要是为了抵御北方游牧民族的入侵,它的规模和修建技术相对较为简陋。现代人所看到的长城大部分是明朝时期修建和扩建的,明长城不仅规模更大、结构更坚固,而且保存得比较完好。\n用户:有意思,那么长城的具体位置在哪些省份呢?\n话题分析:用户询问长城的具体位置,涉及到地理知识。\n话题类别:地理\nLLM:长城横跨中国北方多个省份,主要包括河北、山西、内蒙古、宁夏、陕西、甘肃和北京等。每一段长城都建在关键的战略位置,以便最大限度地发挥其防御作用。\n'
DEFAULT_QUERY_TEMPLATE = '用户:{query}\n'
DEFAULT_RESPONSE_TEMPLATE = 'LLM:{response}\n'
DEFAULT_CANDIDATES_TEMPLATE = '备选话题类别:[{candidate_str}]'
DEFAULT_ANALYSIS_TEMPLATE = '话题分析:{analysis}\n'
DEFAULT_LABELS_TEMPLATE = '话题类别:{labels}\n'
DEFAULT_ANALYSIS_PATTERN = '话题分析:(.*?)\n'
DEFAULT_LABELS_PATTERN = '话题类别:(.*?)($|\n)'
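The default patterns above can be exercised directly with Python's re module. A minimal sketch (variable names here are illustrative, not the mapper's internals) showing how a model response in the default format is parsed:

```python
import re

# Default patterns as documented above (note the fullwidth colon ':').
ANALYSIS_PATTERN = '话题分析:(.*?)\n'
LABELS_PATTERN = '话题类别:(.*?)($|\n)'

# A response that follows the format requested by DEFAULT_SYSTEM_PROMPT.
response = '话题分析:用户询问长城的具体位置,涉及到地理知识。\n话题类别:地理\n'

analysis_match = re.search(ANALYSIS_PATTERN, response)
labels_match = re.search(LABELS_PATTERN, response)

# Fall back to an empty string when the response does not match the pattern.
analysis = analysis_match.group(1) if analysis_match else ''
labels = labels_match.group(1) if labels_match else ''
```

Here `analysis` captures the sentence after 话题分析: and `labels` captures the category after 话题类别:.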
__init__(api_model: str = 'gpt-4o', topic_candidates: List[str] | None = None, max_round: Annotated[int, Ge(ge=0)] = 10, *, labels_key: str = 'dialog_topic_labels', analysis_key: str = 'dialog_topic_labels_analysis', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, query_template: str | None = None, response_template: str | None = None, candidate_template: str | None = None, analysis_template: str | None = None, labels_template: str | None = None, analysis_pattern: str | None = None, labels_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model -- API model name.

  • topic_candidates -- Candidate topic labels to choose from. If None, open-domain topic labels are generated.

  • max_round -- The maximum number of dialog rounds used to build the prompt.

  • labels_key -- The key name in the meta field to store the output labels. Defaults to 'dialog_topic_labels'.

  • analysis_key -- The key name in the meta field to store the corresponding analysis. Defaults to 'dialog_topic_labels_analysis'.

  • api_endpoint -- URL endpoint for the API.

  • response_path -- Path to extract content from the API response. Defaults to 'choices.0.message.content'.

  • system_prompt -- System prompt for the task.

  • query_template -- Template for query part to build the input prompt.

  • response_template -- Template for response part to build the input prompt.

  • candidate_template -- Template for topic candidates to build the input prompt.

  • analysis_template -- Template for analysis part to build the input prompt.

  • labels_template -- Template for labels part to build the input prompt.

  • analysis_pattern -- Regular expression pattern used to parse the topic analysis from the model response.

  • labels_pattern -- Regular expression pattern used to parse the topic labels from the model response.

  • try_num -- Number of retry attempts when an API call or output-parsing error occurs.

  • model_params -- Parameters for initializing the API model.

  • sampling_params -- Extra parameters passed to the API call, e.g. {'temperature': 0.9, 'top_p': 0.95}.

  • kwargs -- Extra keyword arguments.
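To make the template parameters concrete, the following sketch shows how the documented default query/response templates could assemble a dialog prompt, with the last max_round rounds kept. The helper function and its signature are hypothetical, for illustration only, not the mapper's API:

```python
# Default templates as documented above.
QUERY_TEMPLATE = '用户:{query}\n'
RESPONSE_TEMPLATE = 'LLM:{response}\n'


def build_dialog_prompt(history, query, max_round=10):
    """Hypothetical helper: history is a list of (query, response) pairs."""
    parts = []
    # Keep only the most recent max_round rounds of the dialog.
    for q, r in history[-max_round:]:
        parts.append(QUERY_TEMPLATE.format(query=q))
        parts.append(RESPONSE_TEMPLATE.format(response=r))
    # Append the current user query, which the model is asked to label.
    parts.append(QUERY_TEMPLATE.format(query=query))
    return ''.join(parts)


history = [('你好,今天我们来聊聊秦始皇吧。', '当然可以。')]
prompt = build_dialog_prompt(history, '长城在哪些省份?', max_round=10)
```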

build_input(history, query)[source]
parse_output(response)[source]
process_single(sample, rank=None)[source]

Processes at sample level: sample --> sample.

Parameters:

sample -- sample to process

Returns:

processed sample
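As a rough illustration of the output shape, the labels and analyses land under the configured labels_key and analysis_key in the sample's meta field, one entry per user query. The sample dict below is hypothetical, assuming the default key names:

```python
# A minimal stand-in for a dataset sample with an empty meta field.
sample = {'meta': {}}

# One label and one analysis per user query in the dialog.
labels_per_query = ['历史', '地理']
analysis_per_query = [
    '用户提到秦始皇,这是中国历史上第一位皇帝。',
    '用户询问长城的具体位置,涉及到地理知识。',
]

# Store under the default key names ('dialog_topic_labels' and
# 'dialog_topic_labels_analysis'); both lists align index-by-index.
sample['meta']['dialog_topic_labels'] = labels_per_query
sample['meta']['dialog_topic_labels_analysis'] = analysis_per_query
```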