data_juicer.ops.mapper.query_topic_detection_mapper module

class data_juicer.ops.mapper.query_topic_detection_mapper.QueryTopicDetectionMapper(hf_model: str = 'dstefa/roberta-base_topic_classification_nyt_news', zh_to_en_hf_model: str | None = 'Helsinki-NLP/opus-mt-zh-en', model_params: Dict = {}, zh_to_en_model_params: Dict = {}, *, label_key: str = 'query_topic_label', score_key: str = 'query_topic_label_score', **kwargs)[source]

Bases: Mapper

Predicts a topic label and its score for each query. The input is read from the specified query key. The predicted label and score are stored in the Data-Juicer meta field, by default under ‘query_topic_label’ and ‘query_topic_label_score’. The operator uses a Hugging Face model for topic classification. If a Chinese-to-English translation model is provided, the query is first translated from Chinese to English before the topic is predicted.

  • Uses a Hugging Face model for topic classification.

  • Optionally translates Chinese queries to English using another Hugging Face model.

  • Stores the predicted topic label in ‘query_topic_label’.

  • Stores the corresponding score in ‘query_topic_label_score’.
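The flow described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the function `detect_topics`, the stand-in models `fake_translate` and `fake_classify`, and the sample layout with `"query"` and `"meta"` keys are all hypothetical. The only assumption about real models is that the classifier returns one dict per query with `"label"` and `"score"` entries, as Hugging Face text-classification pipelines do.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical key names for this sketch; Data-Juicer defines its own.
QUERY_KEY = "query"
META_KEY = "meta"


def detect_topics(
    samples: Dict[str, list],
    classify: Callable[[List[str]], List[dict]],
    translate: Optional[Callable[[List[str]], List[str]]] = None,
    label_key: str = "query_topic_label",
    score_key: str = "query_topic_label_score",
) -> Dict[str, list]:
    """Translate queries if a translator is given, classify them,
    and store each label and score in the matching meta dict."""
    queries = samples[QUERY_KEY]
    if translate is not None:
        queries = translate(queries)
    results = classify(queries)
    for meta, result in zip(samples[META_KEY], results):
        meta[label_key] = result["label"]
        meta[score_key] = result["score"]
    return samples


# Stand-in for a Chinese-to-English translation model (assumption).
def fake_translate(queries: List[str]) -> List[str]:
    lookup = {"足球比赛": "football match"}
    return [lookup.get(q, q) for q in queries]


# Stand-in for a topic classifier (assumption).
def fake_classify(queries: List[str]) -> List[dict]:
    return [
        {"label": "Sports" if "football" in q else "Other", "score": 0.9}
        for q in queries
    ]


samples = {QUERY_KEY: ["足球比赛", "stock prices"], META_KEY: [{}, {}]}
out = detect_topics(samples, fake_classify, translate=fake_translate)
```

Passing `translate=None` skips the translation step, which mirrors setting `zh_to_en_hf_model` to None. Changing `label_key` or `score_key` only changes the names under which the results are stored.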

__init__(hf_model: str = 'dstefa/roberta-base_topic_classification_nyt_news', zh_to_en_hf_model: str | None = 'Helsinki-NLP/opus-mt-zh-en', model_params: Dict = {}, zh_to_en_model_params: Dict = {}, *, label_key: str = 'query_topic_label', score_key: str = 'query_topic_label_score', **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID used to predict the topic label.

  • zh_to_en_hf_model – Hugging Face model ID for Chinese-to-English translation. If not None, the query is translated from Chinese to English before the topic is predicted.

  • model_params – Model parameters for hf_model.

  • zh_to_en_model_params – Model parameters for zh_to_en_hf_model.

  • label_key – The key name in the meta field under which the output label is stored. Defaults to ‘query_topic_label’.

  • score_key – The key name in the meta field under which the corresponding label score is stored. Defaults to ‘query_topic_label_score’.

  • kwargs – Extra keyword arguments.

process_batched(samples, rank=None)[source]