data_juicer.ops.mapper package

Submodules

data_juicer.ops.mapper.audio_ffmpeg_wrapped_mapper module

class data_juicer.ops.mapper.audio_ffmpeg_wrapped_mapper.AudioFFmpegWrappedMapper(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Bases: Mapper

Simple wrapper for FFmpeg audio filters.

__init__(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • filter_name – ffmpeg audio filter name.

  • filter_kwargs – keyword-arguments passed to ffmpeg filter.

  • global_args – list-arguments passed to ffmpeg command-line.

  • capture_stderr – whether to capture stderr.

  • overwrite_output – whether to overwrite output file.

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
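
A minimal usage sketch, calling the op directly on one sample the way the unit tests do. The audio path is a placeholder, and 'audios' is assumed to be the default audio key:

```python
from data_juicer.ops.mapper.audio_ffmpeg_wrapped_mapper import (
    AudioFFmpegWrappedMapper,
)

# Apply ffmpeg's 'atempo' filter to slow each audio down to 2/3 speed.
op = AudioFFmpegWrappedMapper(
    filter_name='atempo',
    filter_kwargs={'tempo': 0.666},
)
sample = {'audios': ['path/to/input.wav']}  # placeholder path
processed = op.process_single(sample)
print(processed['audios'])  # paths of the filtered audio files
```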

data_juicer.ops.mapper.calibrate_qa_mapper module

class data_juicer.ops.mapper.calibrate_qa_mapper.CalibrateQAMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Mapper to calibrate question-answer pairs based on reference text.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对【问题】和【回答】进行校准,使其更加详细、准确。\n按照以下格式输出:\n【问题】\n校准后的问题\n【回答】\n校准后的回答'
DEFAULT_INPUT_TEMPLATE = '{reference}\n{qa_pair}'
DEFAULT_REFERENCE_TEMPLATE = '【参考信息】\n{}'
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'
DEFAULT_OUTPUT_PATTERN = '【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'
__init__(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for the calibration task.

  • input_template – Template for building the model input.

  • reference_template – Template for formatting the reference text.

  • qa_pair_template – Template for formatting question-answer pairs.

  • output_pattern – Regular expression for parsing model output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

build_input(sample)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
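
A hedged usage sketch. It assumes a reachable OpenAI-compatible endpoint with credentials supplied via the environment (e.g. OPENAI_API_KEY), and that the reference text sits under the default 'text' key while the QA pair sits under 'query'/'response':

```python
from data_juicer.ops.mapper.calibrate_qa_mapper import CalibrateQAMapper

op = CalibrateQAMapper(
    api_model='gpt-4o',
    sampling_params={'temperature': 0.3},  # illustrative
)
sample = {
    'text': '阿里巴巴集团创立于1999年……',  # reference text (placeholder)
    'query': '阿里巴巴是哪一年成立的?',    # question to calibrate
    'response': '1999年。',                # answer to calibrate
}
calibrated = op.process_single(sample)
```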

data_juicer.ops.mapper.calibrate_query_mapper module

class data_juicer.ops.mapper.calibrate_query_mapper.CalibrateQueryMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: CalibrateQAMapper

Mapper to calibrate query in question-answer pairs based on reference text.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对问答对中的【问题】进行校准,        使其更加详细、准确,且仍可以由原答案回答。只输出校准后的问题,不要输出多余内容。'
parse_output(raw_output)[source]

data_juicer.ops.mapper.calibrate_response_mapper module

class data_juicer.ops.mapper.calibrate_response_mapper.CalibrateResponseMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: CalibrateQAMapper

Mapper to calibrate response in question-answer pairs based on reference text.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对问答对中的【回答】进行校准,        使其更加详细、准确,且仍可以回答原问题。只输出校准后的回答,不要输出多余内容。'
parse_output(raw_output)[source]

data_juicer.ops.mapper.chinese_convert_mapper module

data_juicer.ops.mapper.chinese_convert_mapper.prepare_converter(mode)[source]
class data_juicer.ops.mapper.chinese_convert_mapper.ChineseConvertMapper(mode: str = 's2t', *args, **kwargs)[source]

Bases: Mapper

Mapper to convert Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji.

__init__(mode: str = 's2t', *args, **kwargs)[source]

Initialization method.

Parameters:
  • mode

    Choose the mode to convert Chinese:

    s2t: Simplified Chinese to Traditional Chinese,

    t2s: Traditional Chinese to Simplified Chinese,

    s2tw: Simplified Chinese to Traditional Chinese (Taiwan Standard),

    tw2s: Traditional Chinese (Taiwan Standard) to Simplified Chinese,

    s2hk: Simplified Chinese to Traditional Chinese (Hong Kong variant),

    hk2s: Traditional Chinese (Hong Kong variant) to Simplified Chinese,

    s2twp: Simplified Chinese to Traditional Chinese (Taiwan Standard) with Taiwanese idiom,

    tw2sp: Traditional Chinese (Taiwan Standard) to Simplified Chinese with Mainland Chinese idiom,

    t2tw: Traditional Chinese to Traditional Chinese (Taiwan Standard),

    tw2t: Traditional Chinese (Taiwan Standard) to Traditional Chinese,

    hk2t: Traditional Chinese (Hong Kong variant) to Traditional Chinese,

    t2hk: Traditional Chinese to Traditional Chinese (Hong Kong variant),

    t2jp: Traditional Chinese Characters (Kyūjitai) to New Japanese Kanji,

    jp2t: New Japanese Kanji (Shinjitai) to Traditional Chinese Characters,

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
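
A minimal sketch of batched usage; samples is assumed to be a dict of columns keyed by the default text key 'text':

```python
from data_juicer.ops.mapper.chinese_convert_mapper import ChineseConvertMapper

op = ChineseConvertMapper(mode='t2s')  # Traditional -> Simplified
samples = {'text': ['這是繁體中文', '簡繁轉換測試']}
result = op.process_batched(samples)
print(result['text'])  # ['这是繁体中文', '简繁转换测试']
```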

data_juicer.ops.mapper.clean_email_mapper module

class data_juicer.ops.mapper.clean_email_mapper.CleanEmailMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean email addresses in text samples.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
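
A short sketch showing the default behavior and a custom pattern/replacement; the same pattern/repl interface applies to CleanIpMapper below. The regex is illustrative:

```python
from data_juicer.ops.mapper.clean_email_mapper import CleanEmailMapper

# Default: matched email addresses are removed (replaced with '').
op = CleanEmailMapper()
samples = {'text': ['contact me at someone@example.com please']}
print(op.process_batched(samples)['text'])

# Custom: mask instead of delete, with a user-supplied pattern.
op_mask = CleanEmailMapper(pattern=r'\S+@\S+\.\S+', repl='[EMAIL]')
samples = {'text': ['contact me at someone@example.com please']}
print(op_mask.process_batched(samples)['text'])
```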

data_juicer.ops.mapper.clean_html_mapper module

class data_juicer.ops.mapper.clean_html_mapper.CleanHtmlMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to clean HTML code in text samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.clean_ip_mapper module

class data_juicer.ops.mapper.clean_ip_mapper.CleanIpMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean IPv4 and IPv6 addresses in text samples.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.expand_macro_mapper module

class data_juicer.ops.mapper.expand_macro_mapper.ExpandMacroMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to expand macro definitions in the document body of LaTeX samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.extract_entity_attribute_mapper module

class data_juicer.ops.mapper.extract_entity_attribute_mapper.ExtractEntityAttributeMapper(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = '__dj__main_entities__', attribute_key: str = '__dj__attributes__', attribute_desc_key: str = '__dj__attribute_descriptions__', support_text_key: str = '__dj__attribute_support_texts__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract attributes for given entities from the text

DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定一段文本,从文本中总结{entity}的{attribute},并且从原文摘录最能说明该{attribute}的代表性示例。\n要求:\n- 摘录的示例应该简短。\n- 遵循如下的回复格式:\n# {entity}\n## {attribute}:\n...\n### 代表性示例摘录1:\n```\n...\n```\n### 代表性示例摘录2:\n```\n...\n```\n...\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_ATTR_PATTERN_TEMPLATE = '\\#\\#\\s*{attribute}:\\s*(.*?)(?=\\#\\#\\#|\\Z)'
DEFAULT_DEMON_PATTERN = '\\#\\#\\#\\s*代表性示例摘录(\\d+):\\s*```\\s*(.*?)```\\s*(?=\\#\\#\\#|\\Z)'
__init__(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = '__dj__main_entities__', attribute_key: str = '__dj__attributes__', attribute_desc_key: str = '__dj__attribute_descriptions__', support_text_key: str = '__dj__attribute_support_texts__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • query_entities – Entity list to be queried.

  • query_attributes – Attribute list to be queried.

  • entity_key – The field name to store the given main entity for attribute extraction. It’s “__dj__main_entities__” by default.

  • attribute_key – The field name to store the given attribute to be extracted. It’s “__dj__attributes__” by default.

  • attribute_desc_key – The field name to store the extracted attribute description. It’s “__dj__attribute_descriptions__” by default.

  • support_text_key – The field name to store the attribute support text extracted from the raw text. It’s “__dj__attribute_support_texts__” by default.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt_template – System prompt template for the task. It needs to be instantiated with the given entity and attribute.

  • input_template – Template for building the model input.

  • attr_pattern_template – Pattern for parsing the attribute from the output. It needs to be instantiated with the given attribute.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

  • demo_pattern – Pattern for parsing the demonstration from the output to support the attribute.

parse_output(raw_output, attribute_name)[source]
process_single(sample, rank=None)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
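
A hedged sketch of direct usage, assuming a reachable OpenAI-compatible API with credentials in the environment; the entity, attribute, and text are placeholders:

```python
from data_juicer.ops.mapper.extract_entity_attribute_mapper import (
    ExtractEntityAttributeMapper,
)

op = ExtractEntityAttributeMapper(
    api_model='gpt-4o',
    query_entities=['李莲花'],      # entities to query (placeholder)
    query_attributes=['性格特点'],  # attributes to query (placeholder)
)
sample = {'text': '……李莲花淡然一笑……'}  # source text (placeholder)
result = op.process_single(sample)
# Extracted descriptions and supporting excerpts are written back into
# the sample under the configured keys, e.g.
# '__dj__attribute_descriptions__' and '__dj__attribute_support_texts__'.
```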

data_juicer.ops.mapper.extract_entity_relation_mapper module

class data_juicer.ops.mapper.extract_entity_relation_mapper.ExtractEntityRelationMapper(api_model: str = 'gpt-4o', entity_types: List[str] | None = None, *, entity_key: str = '__dj__entity__', relation_key: str = '__dj__relation__', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, tuple_delimiter: str | None = None, record_delimiter: str | None = None, completion_delimiter: str | None = None, max_gleaning: Annotated[int, Ge(ge=0)] = 1, continue_prompt: str | None = None, if_loop_prompt: str | None = None, entity_pattern: str | None = None, relation_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract entities and relations in the text for knowledge graph.

DEFAULT_PROMPT_TEMPLATE = '-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n\n-Steps-\n1. Identify all entities. For each identified entity, extract the following information:\n- entity_name: Name of the entity\n- entity_type: One of the following types: [{entity_types}]\n- entity_description: Comprehensive description of the entity\'s attributes and activities\nFormat each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>\n\n2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.\nFor each pair of related entities, extract the following information:\n- source_entity: name of the source entity, as identified in step 1\n- target_entity: name of the target entity, as identified in step 1\n- relationship_description: explanation as to why you think the source entity and the target entity are related to each other\n- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity\n- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details\nFormat each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)\n\n3. Return output in the language of the given text as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.\n\n4. When finished, output {completion_delimiter}\n\n######################\n-Examples-\n######################\nExample 1:\n\nEntity_types: [person, technology, mission, organization, location]\nText:\n```\nwhile Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor\'s authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan\'s shared commitment to discovery was an unspoken rebellion against Cruz\'s narrowing vision of control and order.\n\nThen Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. “If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.”\n\nThe underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor\'s, a wordless clash of wills softening into an uneasy truce.\n\nIt was a small transformation, barely perceptible, but one that Alex noted with an inward nod. 
They had all been brought here by different paths\n```\n################\nOutput:\n("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}\n("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}\n("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}\n("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}\n("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor\'s authoritarian certainty and observes changes in Taylor\'s attitude towards the device."{tuple_delimiter}"power dynamics, perspective shift"{tuple_delimiter}7){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz\'s vision."{tuple_delimiter}"shared goals, rebellion"{tuple_delimiter}6){record_delimiter}\n("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}"conflict resolution, mutual respect"{tuple_delimiter}8){record_delimiter}\n("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan\'s commitment to discovery is in rebellion against Cruz\'s vision of control and order."{tuple_delimiter}"ideological conflict, rebellion"{tuple_delimiter}5){record_delimiter}\n("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}"reverence, technological significance"{tuple_delimiter}9){record_delimiter}\n#############################\nExample 2:\n\nEntity_types: [人物, 技术, 任务, 组织, 
地点]\nText:\n```\n他们不再是单纯的执行者;他们已成为某个超越星辰与条纹的领域的信息守护者。这一使命的提升不能被规则和既定协议所束缚——它需要一种新的视角,一种新的决心。\n\n随着与华盛顿的通讯在背景中嗡嗡作响,对话中的紧张情绪通过嘟嘟声和静电噪音贯穿始终。团队站立着,一股不祥的气息笼罩着他们。显然,他们在接下来几个小时内做出的决定可能会重新定义人类在宇宙中的位置,或者将他们置于无知和潜在危险之中。\n\n随着与星辰的联系变得更加牢固,小组开始处理逐渐成形的警告,从被动接受者转变为积极参与者。梅瑟后来的直觉占据了上风——团队的任务已经演变,不再仅仅是观察和报告,而是互动和准备。一场蜕变已经开始,而“杜尔塞行动”则以他们大胆的新频率震动,这种基调不是由世俗设定的\n```\n#############\nOutput:\n("entity"{tuple_delimiter}"华盛顿"{tuple_delimiter}"地点"{tuple_delimiter}"华盛顿是正在接收通讯的地方,表明其在决策过程中的重要性。"){record_delimiter}\n("entity"{tuple_delimiter}"杜尔塞行动"{tuple_delimiter}"任务"{tuple_delimiter}"杜尔塞行动被描述为一项已演变为互动和准备的任务,显示出目标和活动的重大转变。"){record_delimiter}\n("entity"{tuple_delimiter}"团队"{tuple_delimiter}"组织"{tuple_delimiter}"团队被描绘成一群从被动观察者转变为积极参与者的人,展示了他们角色的动态变化。"){record_delimiter}\n("relationship"{tuple_delimiter}"团队"{tuple_delimiter}"华盛顿"{tuple_delimiter}"团队收到来自华盛顿的通讯,这影响了他们的决策过程。"{tuple_delimiter}"决策、外部影响"{tuple_delimiter}7){record_delimiter}\n("relationship"{tuple_delimiter}"团队"{tuple_delimiter}"杜尔塞行动"{tuple_delimiter}"团队直接参与杜尔塞行动,执行其演变后的目标和活动。"{tuple_delimiter}"任务演变、积极参与"{tuple_delimiter}9){completion_delimiter}\n#############################\nExample 3:\n\nEntity_types: [person, role, technology, organization, event, location, concept]\nText:\n```\ntheir voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.\n\n"It\'s like it\'s learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers\' a whole new meaning."\n\nAlex surveyed his team—each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."\n\nTogether, they stood on the edge of the unknown, forging humanity\'s response to a message from the heavens. 
The ensuing silence was palpable—a collective introspection about their role in this grand cosmic play, one that could rewrite human history.\n\nThe encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation\n```\n#############\nOutput:\n("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}\n("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}\n("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}\n("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}\n("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}\n("entity"{tuple_delimiter}"Humanity\'s Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity\'s Response is the collective action taken by Alex\'s team in response to a message from an unknown intelligence."){record_delimiter}\n("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}"communication, learning process"{tuple_delimiter}9){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}"leadership, exploration"{tuple_delimiter}10){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity\'s Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity\'s Response to the unknown intelligence."{tuple_delimiter}"collective action, cosmic significance"{tuple_delimiter}8){record_delimiter}\n("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}"power dynamics, autonomy"{tuple_delimiter}7){record_delimiter}\n#############################\n-Real Data-\n######################\nEntity_types: [{entity_types}]\nText:\n```\n{input_text}\n```\n######################\nOutput:\n'
DEFAULT_CONTINUE_PROMPT = 'MANY entities were missed in the last extraction.  Add them below using the same format:\n'
DEFAULT_IF_LOOP_PROMPT = 'It appears some entities may have still been missed.  Answer YES | NO if there are still entities that need to be added.\n'
DEFAULT_ENTITY_TYPES = ['organization', 'person', 'geo', 'event']
DEFAULT_TUPLE_DELIMITER = '<|>'
DEFAULT_RECORD_DELIMITER = '##'
DEFAULT_COMPLETION_DELIMITER = '<|COMPLETE|>'
DEFAULT_ENTITY_PATTERN = '\\("entity"(.*?)\\)'
DEFAULT_RELATION_PATTERN = '\\("relationship"(.*?)\\)'
__init__(api_model: str = 'gpt-4o', entity_types: List[str] | None = None, *, entity_key: str = '__dj__entity__', relation_key: str = '__dj__relation__', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, tuple_delimiter: str | None = None, record_delimiter: str | None = None, completion_delimiter: str | None = None, max_gleaning: Annotated[int, Ge(ge=0)] = 1, continue_prompt: str | None = None, if_loop_prompt: str | None = None, entity_pattern: str | None = None, relation_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • entity_types – Pre-defined entity types for the knowledge graph.

  • entity_key – The field name to store the entities. It’s “__dj__entity__” by default.

  • relation_key – The field name to store the relations between entities. It’s “__dj__relation__” by default.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • prompt_template – The template of input prompt.

  • tuple_delimiter – Delimiter to separate items in outputs.

  • record_delimiter – Delimiter to separate records in outputs.

  • completion_delimiter – To mark the end of the output.

  • max_gleaning – The maximum number of extra LLM calls for gleaning entities and relations.

  • continue_prompt – The prompt for gleaning entities and relations.

  • if_loop_prompt – The prompt to determine whether to stop gleaning.

  • entity_pattern – Regular expression for parsing entity record.

  • relation_pattern – Regular expression for parsing relation record.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
add_message(messages, role, content)[source]
light_rag_extraction(messages, rank=None)[source]
process_single(sample, rank=None)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
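
A hedged sketch under the same API assumptions as above; the entity types and text are placeholders:

```python
from data_juicer.ops.mapper.extract_entity_relation_mapper import (
    ExtractEntityRelationMapper,
)

op = ExtractEntityRelationMapper(
    api_model='gpt-4o',
    entity_types=['person', 'organization', 'event'],
    max_gleaning=1,  # at most one extra gleaning round
)
sample = {'text': 'Alex led the team while Taylor examined the device ...'}
result = op.process_single(sample)
# Parsed records are stored under the entity_key and relation_key fields,
# '__dj__entity__' and '__dj__relation__' by default.
```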

data_juicer.ops.mapper.extract_event_mapper module

class data_juicer.ops.mapper.extract_event_mapper.ExtractEventMapper(api_model: str = 'gpt-4o', *, event_desc_key: str = '__dj__event_description__', relevant_char_key: str = '__dj__relevant_characters__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract events and relevant characters in the text

DEFAULT_SYSTEM_PROMPT = '给定一段文本,对文本的情节进行分点总结,并抽取与情节相关的人物。\n要求:\n- 尽量不要遗漏内容,不要添加文本中没有的情节,符合原文事实\n- 联系上下文说明前因后果,但仍然需要符合事实\n- 不要包含主观看法\n- 注意要尽可能保留文本的专有名词\n- 注意相关人物需要在对应情节中出现\n- 只抽取情节中的主要人物,不要遗漏情节的主要人物\n- 总结格式如下:\n### 情节1:\n- **情节描述**: ...\n- **相关人物**:人物1,人物2,人物3,...\n### 情节2:\n- **情节描述**: ...\n- **相关人物**:人物1,人物2,...\n### 情节3:\n- **情节描述**: ...\n- **相关人物**:人物1,...\n...\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_OUTPUT_PATTERN = '\n        \\#\\#\\#\\s*情节(\\d+):\\s*\n        -\\s*\\*\\*情节描述\\*\\*\\s*:\\s*(.*?)\\s*\n        -\\s*\\*\\*相关人物\\*\\*\\s*:\\s*(.*?)(?=\\#\\#\\#|\\Z)\n    '
__init__(api_model: str = 'gpt-4o', *, event_desc_key: str = '__dj__event_description__', relevant_char_key: str = '__dj__relevant_characters__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • event_desc_key – The field name to store the event descriptions. It’s “__dj__event_description__” by default.

  • relevant_char_key – The field name to store the characters relevant to the events. It’s “__dj__relevant_characters__” by default.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for the task.

  • input_template – Template for building the model input.

  • output_pattern – Regular expression for parsing model output.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
process_batched(samples, rank=None)[source]

data_juicer.ops.mapper.extract_keyword_mapper module

class data_juicer.ops.mapper.extract_keyword_mapper.ExtractKeywordMapper(api_model: str = 'gpt-4o', *, keyword_key: str = '__dj__keyword__', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, completion_delimiter: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Generate keywords for the text

DEFAULT_PROMPT_TEMPLATE = '-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n\n-Steps-\n1. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.\nFormat the content-level key words as ("content_keywords" <high_level_keywords>)\n\n3. Return output in the language of the given text.\n\n4. When finished, output {completion_delimiter}\n\n######################\n-Examples-\n######################\nExample 1:\n\nText:\n```\nwhile Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor\'s authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan\'s shared commitment to discovery was an unspoken rebellion against Cruz\'s narrowing vision of control and order.\n\nThen Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. “If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.”\n\nThe underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor\'s, a wordless clash of wills softening into an uneasy truce.\n\nIt was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths\n```\n################\nOutput:\n("content_keywords" "power dynamics, ideological conflict, discovery, rebellion"){completion_delimiter}\n#############################\nExample 2:\n\nText:\n```\n他们不再是单纯的执行者;他们已成为某个超越星辰与条纹的领域的信息守护者。这一使命的提升不能被规则和既定协议所束缚——它需要一种新的视角,一种新的决心。\n\n随着与华盛顿的通讯在背景中嗡嗡作响,对话中的紧张情绪通过嘟嘟声和静电噪音贯穿始终。团队站立着,一股不祥的气息笼罩着他们。显然,他们在接下来几个小时内做出的决定可能会重新定义人类在宇宙中的位置,或者将他们置于无知和潜在危险之中。\n\n随着与星辰的联系变得更加牢固,小组开始处理逐渐成形的警告,从被动接受者转变为积极参与者。梅瑟后来的直觉占据了上风——团队的任务已经演变,不再仅仅是观察和报告,而是互动和准备。一场蜕变已经开始,而“杜尔塞行动”则以他们大胆的新频率震动,这种基调不是由世俗设定的\n```\n#############\nOutput:\n("content_keywords" "任务演变, 决策制定, 积极参与, 宇宙意义"){completion_delimiter}\n#############################\nExample 3:\n\nEntity_types: [person, role, technology, organization, event, location, concept]\nText:\n```\ntheir voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.\n\n"It\'s like it\'s learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers\' a whole new meaning."\n\nAlex surveyed his team—each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."\n\nTogether, they stood on the edge of the unknown, forging humanity\'s response to a message from the heavens. 
The ensuing silence was palpable—a collective introspection about their role in this grand cosmic play, one that could rewrite human history.\n\nThe encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation\n```\n#############\nOutput:\n("content_keywords" "first contact, control, communication, cosmic significance"){completion_delimiter}\n-Real Data-\n######################\nText:\n```\n{input_text}\n```\n######################\nOutput:\n'
DEFAULT_COMPLETION_DELIMITER = '<|COMPLETE|>'
DEFAULT_OUTPUT_PATTERN = '\\("content_keywords"(.*?)\\)'
__init__(api_model: str = 'gpt-4o', *, keyword_key: str = '__dj__keyword__', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, completion_delimiter: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • keyword_key – The field name to store the keywords. It’s “__dj__keyword__” by default.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • prompt_template – The template of input prompt.

  • completion_delimiter – To mark the end of the output.

  • output_pattern – Regular expression for parsing keywords.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.extract_nickname_mapper module

class data_juicer.ops.mapper.extract_nickname_mapper.ExtractNicknameMapper(api_model: str = 'gpt-4o', *, nickname_key: str = '__dj__nickname__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract nickname relationship in the text.

DEFAULT_SYSTEM_PROMPT = '给定你一段文本,你的任务是将人物之间的称呼方式(昵称)提取出来。\n要求:\n- 需要给出说话人对被称呼人的称呼,不要搞反了。\n- 相同的说话人和被称呼人最多给出一个最常用的称呼。\n- 请不要输出互相没有昵称的称呼方式。\n- 输出格式如下:\n```\n### 称呼方式1\n- **说话人**:...\n- **被称呼人**:...\n- **...对...的昵称**:...\n### 称呼方式2\n- **说话人**:...\n- **被称呼人**:...\n- **...对...的昵称**:...\n### 称呼方式3\n- **说话人**:...\n- **被称呼人**:...\n- **...对...的昵称**:...\n...\n```\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_OUTPUT_PATTERN = '\n        \\#\\#\\#\\s*称呼方式(\\d+)\\s*\n        -\\s*\\*\\*说话人\\*\\*\\s*:\\s*(.*?)\\s*\n        -\\s*\\*\\*被称呼人\\*\\*\\s*:\\s*(.*?)\\s*\n        -\\s*\\*\\*(.*?)对(.*?)的昵称\\*\\*\\s*:\\s*(.*?)(?=\\#\\#\\#|\\Z) # for double check\n    '
__init__(api_model: str = 'gpt-4o', *, nickname_key: str = '__dj__nickname__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • nickname_key – The field name to store the nickname relationship. It’s “__dj__nickname__” by default.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for the task.

  • input_template – Template for building the model input.

  • output_pattern – Regular expression for parsing model output.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.extract_support_text_mapper module

class data_juicer.ops.mapper.extract_support_text_mapper.ExtractSupportTextMapper(api_model: str = 'gpt-4o', *, summary_key: str = '__dj__event_description__', support_text_key: str = '__dj__support_text__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract support sub text for a summary.

DEFAULT_SYSTEM_PROMPT = '你将扮演一个文本摘录助手的角色。你的主要任务是基于给定的文章(称为“原文”)以及对原文某个部分的简短描述或总结(称为“总结”),准确地识别并提取出与该总结相对应的原文片段。\n要求:\n- 你需要尽可能精确地匹配到最符合总结内容的那部分内容\n- 如果存在多个可能的答案,请选择最贴近总结意思的那个\n- 下面是一个例子帮助理解这一过程:\n### 原文:\n《红楼梦》是中国古典小说四大名著之一,由清代作家曹雪芹创作。它讲述了贾宝玉、林黛玉等人的爱情故事及四大家族的兴衰历程。书中通过复杂的人物关系展现了封建社会的各种矛盾冲突。其中关于贾府内部斗争的部分尤其精彩,特别是王熙凤与尤二姐之间的争斗,生动描绘了权力争夺下的女性形象。此外,《红楼梦》还以其精美的诗词闻名,这些诗词不仅增添了文学色彩,也深刻反映了人物的性格特点和命运走向。\n\n### 总结:\n描述了书中的两个女性角色之间围绕权力展开的竞争。\n\n### 原文摘录:\n其中关于贾府内部斗争的部分尤其精彩,特别是王熙凤与尤二姐之间的争斗,生动描绘了权力争夺下的女性形象。'
DEFAULT_INPUT_TEMPLATE = '### 原文:\n{text}\n\n### 总结:\n{summary}\n\n### 原文摘录:\n'
__init__(api_model: str = 'gpt-4o', *, summary_key: str = '__dj__event_description__', support_text_key: str = '__dj__support_text__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • summary_key – The field name to store the input summary. Nested keys such as “__dj__stats__.text_len” are supported. It’s “__dj__event_description__” by default.

  • support_text_key – The field name to store the output support text for the summary. It’s “__dj__support_text__” by default.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for the task.

  • input_template – Template for building the model input.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the original text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

process_single(sample, rank=None)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.fix_unicode_mapper module

class data_juicer.ops.mapper.fix_unicode_mapper.FixUnicodeMapper(normalization: str | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to fix unicode errors in text samples.

__init__(normalization: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • normalization – the specified form of Unicode normalization mode, which can be one of [‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’], default ‘NFC’.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
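
A small sketch; the mojibake string is a classic illustrative example:

```python
from data_juicer.ops.mapper.fix_unicode_mapper import FixUnicodeMapper

op = FixUnicodeMapper(normalization='NFC')
samples = {'text': ['The Mona Lisa doesnÃ¢â‚¬â„¢t have eyebrows.']}
print(op.process_batched(samples)['text'])
# ['The Mona Lisa doesn’t have eyebrows.']
```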

data_juicer.ops.mapper.generate_qa_from_examples_mapper module

class data_juicer.ops.mapper.generate_qa_from_examples_mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to generate question and answer pairs from examples. You should configure an empty dataset in your yaml config file:

```
generated_dataset_config:
  type: 'EmptyFormatter'  # use 'RayEmptyFormatter' when ray is enabled
  length: ${The number of generated samples}
  feature_keys: ${text key}
```

The number of samples generated is determined by the length of the empty dataset.

DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例数据的输入和输出,按照你的理解,总结出相应规矩,然后写出一个新的【问题】和【回答】。注意,新生成的【问题】和【回答】需要满足如下要求:\n1. 生成的【问题】和【回答】不能与输入的【问题】和【回答】一致,但是需要保持格式相同。\n2. 生成的【问题】不一定要局限于输入【问题】的话题或领域,生成的【回答】需要正确回答生成的【问题】。\n3. 提供的【问题】和【回答】可能是多轮对话,生成的【问题】和【回答】也可以是多轮,但是需要保持格式相同。\n4. 生成的【问题】和【回答】必须成对出现,而且【问题】需要在【回答】之前。\n'
DEFAULT_INPUT_TEMPLATE = '{}'
DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据:\n{}'
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'
DEFAULT_OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'
__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Huggingface model ID.

  • seed_file – Path to the seed file in chatml format.

  • example_num – The number of selected examples. Randomly select N examples from “seed_file” and put them into prompt as QA examples.

  • similarity_threshold – The similarity score threshold between the generated samples and the seed examples. Range from 0 to 1. Samples with similarity score less than this threshold will be kept.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the input prompt. It must include one placeholder ‘{}’, which will be replaced by example_num formatted examples defined by example_template.

  • example_template – Template for formatting one QA example. It must include one placeholder ‘{}’, which will be replaced by one formatted qa_pair.

  • qa_pair_template – Template for formatting a single QA pair within each example. Must include two placeholders ‘{}’ for the question and answer.

  • output_pattern – Regular expression pattern to extract questions and answers from model response.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation. e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

build_input(qa_examples)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
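
A hedged config sketch tying the empty dataset to this op; the seed file path and generated count are placeholders, and the invocation comment is illustrative:

```yaml
# run with: dj-process --config this_file.yaml  (illustrative)
generated_dataset_config:
  type: 'EmptyFormatter'   # use 'RayEmptyFormatter' when ray is enabled
  length: 100              # number of samples to generate
  feature_keys: ['text']
process:
  - generate_qa_from_examples_mapper:
      hf_model: 'Qwen/Qwen2.5-7B-Instruct'
      seed_file: 'path/to/seed_qa.chatml.jsonl'  # placeholder
      example_num: 3
      similarity_threshold: 0.7
```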

data_juicer.ops.mapper.generate_qa_from_text_mapper module

class data_juicer.ops.mapper.generate_qa_from_text_mapper.GenerateQAFromTextMapper(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to generate question and answer pairs from text. Recommended models:

  • alibaba-pai/pai-llama3-8b-doc2qa

  • alibaba-pai/pai-baichuan2-7b-doc2qa

  • alibaba-pai/pai-qwen1_5-4b-doc2qa

  • alibaba-pai/pai-qwen1_5-7b-doc2qa

  • alibaba-pai/pai-qwen1_5-1b8-doc2qa

  • alibaba-pai/pai-qwen1_5-0b5-doc2qa

These recommended models are all trained on Chinese data and are suitable for Chinese text.

__init__(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Huggingface model ID.

  • output_pattern – Regular expression pattern to extract questions and answers from model response.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation, e.g. {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

The default data format parsed by this interface is as follows:

Model Input:

    蒙古国的首都是乌兰巴托(Ulaanbaatar)
    冰岛的首都是雷克雅未克(Reykjavik)

Model Output:

    蒙古国的首都是乌兰巴托(Ulaanbaatar)
    冰岛的首都是雷克雅未克(Reykjavik)
    Human: 请问蒙古国的首都是哪里?
    Assistant: 你好,根据提供的信息,蒙古国的首都是乌兰巴托(Ulaanbaatar)。
    Human: 冰岛的首都是哪里呢?
    Assistant: 冰岛的首都是雷克雅未克(Reykjavik)。
    …

parse_output(raw_output)[source]
process_batched(samples, rank=None)[source]
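
A hedged sketch of direct usage; it assumes the model can be fetched from Huggingface and that parsed QA pairs land under the default 'query'/'response' keys:

```python
from data_juicer.ops.mapper.generate_qa_from_text_mapper import (
    GenerateQAFromTextMapper,
)

op = GenerateQAFromTextMapper(
    hf_model='alibaba-pai/pai-qwen1_5-7b-doc2qa',
)
samples = {
    'text': ['蒙古国的首都是乌兰巴托(Ulaanbaatar)\n'
             '冰岛的首都是雷克雅未克(Reykjavik)'],
}
result = op.process_batched(samples, rank=0)
# Each parsed QA pair becomes one output sample.
```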

data_juicer.ops.mapper.image_blur_mapper module

class data_juicer.ops.mapper.image_blur_mapper.ImageBlurMapper(p: float = 0.2, blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur images.

__init__(p: float = 0.2, blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • p – Probability of the image being blurred.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample --> sample

Parameters:

sample – sample to process

Returns:

processed sample
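
A minimal sketch; the image path is a placeholder and 'images' is assumed to be the default image key:

```python
from data_juicer.ops.mapper.image_blur_mapper import ImageBlurMapper

op = ImageBlurMapper(p=1.0, blur_type='gaussian', radius=2)  # always blur
sample = {'images': ['path/to/image.jpg']}  # placeholder path
processed = op.process_single(sample)
```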

data_juicer.ops.mapper.image_captioning_from_gpt4v_mapper module

data_juicer.ops.mapper.image_captioning_from_gpt4v_mapper.call_gpt_vision_api(api_key, system_prompt, user_prompt, base64_image, max_tokens=500, temperature=1.0, model='gpt-4-vision-preview')[source]
class data_juicer.ops.mapper.image_captioning_from_gpt4v_mapper.ImageCaptioningFromGPT4VMapper(mode: str = 'description', api_key: str = '', max_token: int = 500, temperature: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 1.0, system_prompt: str = '', user_prompt: str = '', user_prompt_key: str | None = None, keep_original_sample: bool = True, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose texts are generated based on gpt-4-vision and the image.

__init__(mode: str = 'description', api_key: str = '', max_token: int = 500, temperature: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 1.0, system_prompt: str = '', user_prompt: str = '', user_prompt_key: str | None = None, keep_original_sample: bool = True, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • mode – mode of text generated from images, can be one of [‘reasoning’, ‘description’, ‘conversation’, ‘custom’]

  • api_key – the API key to authenticate the request.

  • max_token – the maximum number of tokens to generate. Default is 500.

  • temperature – controls the randomness of the output (range from 0 to 1). Default is 1.0.

  • system_prompt – a string prompt used to set the context of the conversation and provide global guidance or rules for gpt-4-vision so that it can generate responses in the expected way. The parameter is only used when mode is set to ‘custom’.

  • user_prompt – a string prompt to guide the generation of gpt-4-vision for each sample. It’s “” by default, which means no prompt is provided.

  • user_prompt_key – the key name of the field in samples that stores the prompt for each sample. It’s used to set different prompts for different samples. If it’s None, the prompt in the user_prompt parameter is used. It’s None by default.

  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated text in the final datasets and the original text will be removed. It’s True by default.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.image_captioning_mapper module

class data_juicer.ops.mapper.image_captioning_mapper.ImageCaptioningMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on another model and the image.

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_img2seq – model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each image

  • keep_candidate_mode

    retain strategy for the generated $caption_num$ candidates.

    ‘random_any’: Retain a random one from the generated captions

    ‘similar_one_simhash’: Retain the generated one that is most similar to the original caption

    ‘all’: Retain all generated captions by concatenation

Note

This is a batched_OP, whose input and output type are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote caption_num as $M$. For ‘random_any’ and ‘similar_one_simhash’ modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False; for ‘all’ mode, it’s $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated captions in the final datasets and the original captions will be removed. It’s True by default.

  • prompt – a string prompt to guide the generation of the blip2 model for all samples globally. It’s None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores the prompt for each sample. It’s used to set different prompts for different samples. If it’s None, the prompt in the “prompt” parameter is used. It’s None by default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]

Note

This is a batched_OP, whose input and output type are both list. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for ‘random_any’ and ‘similar_one_simhash’ modes, and $(1+M)Nb$ for ‘all’ mode.

Parameters:

samples

Returns:
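
A hedged sketch of batched usage illustrating the sample-count arithmetic from the note above; the path is a placeholder, and '<__dj__image>' is assumed to be the image special token in the text:

```python
from data_juicer.ops.mapper.image_captioning_mapper import ImageCaptioningMapper

op = ImageCaptioningMapper(
    hf_img2seq='Salesforce/blip2-opt-2.7b',
    caption_num=3,
    keep_candidate_mode='random_any',  # keep one of the 3 candidates
    keep_original_sample=True,
)
samples = {
    'text': ['<__dj__image> an old caption'],
    'images': [['path/to/image.jpg']],  # one image list per sample
}
result = op.process_batched(samples, rank=0)
# With keep_original_sample=True and 'random_any', each input sample
# yields the original plus one captioned copy (2Nb samples in total).
```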

data_juicer.ops.mapper.image_diffusion_mapper module

class data_juicer.ops.mapper.image_diffusion_mapper.ImageDiffusionMapper(hf_diffusion: str = 'CompVis/stable-diffusion-v1-4', trust_remote_code: bool = False, torch_dtype: str = 'fp32', revision: str = 'main', strength: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.8, guidance_scale: float = 7.5, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, caption_key: str | None = None, hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', *args, **kwargs)[source]

Bases: Mapper

Generate images with a diffusion model.

__init__(hf_diffusion: str = 'CompVis/stable-diffusion-v1-4', trust_remote_code: bool = False, torch_dtype: str = 'fp32', revision: str = 'main', strength: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.8, guidance_scale: float = 7.5, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, caption_key: str | None = None, hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_diffusion – diffusion model name on huggingface to generate the image.

  • torch_dtype – the floating point type used to load the diffusion model. Can be one of [‘fp32’, ‘fp16’, ‘bf16’]

  • revision – The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier allowed by Git.

  • strength – Indicates the extent to which the reference image is transformed. Must be between 0 and 1. The image is used as a starting point, and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 therefore essentially ignores the reference image.

  • guidance_scale – A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • aug_num – The number of images to be produced by the stable-diffusion model.

Note

This is a batched_OP, whose input and output type are both list. Suppose there are $N$ lists of input samples with batch size $b$, and denote aug_num as $M$. The total number of samples after generation is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.

Parameters:
  • caption_key – the key name of the field in samples that stores the caption for each image. It can be a string if there is only one image in each sample; otherwise, it should be a list. If it’s None, ImageDiffusionMapper will produce a caption for each image.

  • hf_img2seq – model name on huggingface to generate caption if caption_key is None.

process_batched(samples, rank=None, context=False)[source]

Note

This is a batched_OP, whose input and output type are both list. Suppose there are $N$ input sample lists with batch size $b$, and denote aug_num as $M$. The total number of samples after generation is $(1+M)Nb$.

Parameters:

samples

Returns:
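
A hedged sketch under the same conventions; with aug_num=2 and keep_original_sample=True, each input sample should yield three output samples:

```python
from data_juicer.ops.mapper.image_diffusion_mapper import ImageDiffusionMapper

op = ImageDiffusionMapper(
    hf_diffusion='CompVis/stable-diffusion-v1-4',
    torch_dtype='fp16',
    strength=0.8,
    aug_num=2,         # two augmented images per input image
    caption_key=None,  # let the hf_img2seq model caption the images first
)
samples = {
    'text': ['<__dj__image> a photo'],
    'images': [['path/to/image.jpg']],  # placeholder
}
result = op.process_batched(samples, rank=0)  # (1+2)*N*b samples out
```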

data_juicer.ops.mapper.image_face_blur_mapper module

class data_juicer.ops.mapper.image_face_blur_mapper.ImageFaceBlurMapper(cv_classifier: str = '', blur_type: str = 'gaussian', radius: Annotated[float, Ge(ge=0)] = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur faces detected in images.

__init__(cv_classifier: str = '', blur_type: str = 'gaussian', radius: Annotated[float, Ge(ge=0)] = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • cv_classifier – OpenCV classifier path for face detection. By default, we will use ‘haarcascade_frontalface_alt.xml’.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
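
Example (a minimal usage sketch; the image path is an illustrative placeholder):

from data_juicer.ops.mapper.image_face_blur_mapper import ImageFaceBlurMapper

op = ImageFaceBlurMapper(blur_type='gaussian', radius=5)
sample = {'images': ['./group_photo.jpg']}  # placeholder path
result = op.process_single(sample)  # faces found by the OpenCV classifier are blurred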

data_juicer.ops.mapper.image_tagging_mapper module

class data_juicer.ops.mapper.image_tagging_mapper.ImageTaggingMapper(tag_field_name: str = '__dj__image_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate image tags.

__init__(tag_field_name: str = '__dj__image_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • tag_field_name – the field name to store the tags. It's "__dj__image_tags__" by default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
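
Example (a minimal usage sketch; the image path and the printed tags are illustrative assumptions, and running it downloads the underlying tagging model):

from data_juicer.ops.mapper.image_tagging_mapper import ImageTaggingMapper

op = ImageTaggingMapper()
sample = {'images': ['./street.jpg']}  # placeholder path
tagged = op.process_single(sample)
print(tagged['__dj__image_tags__'])  # e.g. [['car', 'person', 'street']]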

data_juicer.ops.mapper.nlpaug_en_mapper module

class data_juicer.ops.mapper.nlpaug_en_mapper.NlpaugEnMapper(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to simply augment samples in English based on nlpaug library.

__init__(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)[source]

Initialization method. All augmentation methods use their default parameters. We recommend enabling only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might change significantly.

Parameters:
  • sequential – whether to combine all augmentation methods into a sequence. If True, a sample will be augmented by all enabled augmentation methods in sequence. If False, each enabled augmentation method generates its augmented samples independently.

  • aug_num – the number of augmented samples to generate. If sequential is True, a total of aug_num augmented samples will be generated. If False, (aug_num * number of enabled methods) augmented samples will be generated.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated texts will remain in the final dataset and the original texts will be removed. It's True by default.

  • delete_random_word – whether to enable the augmentation method that deletes random words from the original texts. e.g. "I love LLM" –> "I LLM"

  • swap_random_word – whether to enable the augmentation method that swaps random contiguous words in the original texts. e.g. "I love LLM" –> "Love I LLM"

  • spelling_error_word – whether to enable the augmentation method that simulates spelling errors for words in the original texts. e.g. "I love LLM" –> "Ai love LLM"

  • split_random_word – whether to enable the augmentation method that randomly splits words with whitespace in the original texts. e.g. "I love LLM" –> "I love LL M"

  • keyboard_error_char – whether to enable the augmentation method that simulates keyboard errors for characters in the original texts. e.g. "I love LLM" –> "I ;ov4 LLM"

  • ocr_error_char – whether to enable the augmentation method that simulates OCR errors for characters in the original texts. e.g. "I love LLM" –> "I 10ve LLM"

  • delete_random_char – whether to enable the augmentation method that deletes random characters from the original texts. e.g. "I love LLM" –> "I oe LLM"

  • swap_random_char – whether to enable the augmentation method that swaps random contiguous characters in the original texts. e.g. "I love LLM" –> "I ovle LLM"

  • insert_random_char – whether to enable the augmentation method that inserts random characters into the original texts. e.g. "I love LLM" –> "I ^lKove LLM"

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
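
Example (a minimal usage sketch with two word-level methods enabled):

from data_juicer.ops.mapper.nlpaug_en_mapper import NlpaugEnMapper

op = NlpaugEnMapper(
    sequential=False,
    aug_num=1,
    keep_original_sample=True,
    delete_random_word=True,
    spelling_error_word=True,
)
samples = {'text': ['I love LLM']}
out = op.process_batched(samples)
# Expect 1 original + 1 * 2 enabled methods = 3 texts in out['text'].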

data_juicer.ops.mapper.nlpcda_zh_mapper module

class data_juicer.ops.mapper.nlpcda_zh_mapper.NlpcdaZhMapper(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to simply augment samples in Chinese based on nlpcda library.

__init__(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)[source]

Initialization method. All augmentation methods use their default parameters. We recommend enabling only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might change significantly. Notice: some augmentation methods may not work for certain special texts, so no augmented texts may be generated.

Parameters:
  • sequential – whether to combine all augmentation methods into a sequence. If True, a sample will be augmented by all enabled augmentation methods in sequence. If False, each enabled augmentation method generates its augmented samples independently.

  • aug_num – the number of augmented samples to generate. If sequential is True, a total of aug_num augmented samples will be generated. If False, (aug_num * number of enabled methods) augmented samples will be generated.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated texts will remain in the final dataset and the original texts will be removed. It's True by default.

  • replace_similar_word – whether to enable the augmentation method that replaces random words with their similar words in the original texts. e.g. "这里一共有5种不同的数据增强方法" –> "这边一共有5种不同的数据增强方法"

  • replace_homophone_char – whether to enable the augmentation method that replaces random characters with their homophones in the original texts. e.g. "这里一共有5种不同的数据增强方法" –> "这里一共有5种不同的濖据增强方法"

  • delete_random_char – whether to enable the augmentation method that deletes random characters from the original texts. e.g. "这里一共有5种不同的数据增强方法" –> "这里一共有5种不同的数据增强"

  • swap_random_char – whether to enable the augmentation method that swaps random contiguous characters in the original texts. e.g. "这里一共有5种不同的数据增强方法" –> "这里一共有5种不同的数据强增方法"

  • replace_equivalent_num – whether to enable the augmentation method that replaces random numbers with equivalent representations in the original texts. Notice: only applies to numbers for now. e.g. "这里一共有5种不同的数据增强方法" –> "这里一共有伍种不同的数据增强方法"

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
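
Example (a minimal usage sketch; as noted above, some methods may generate nothing for particular texts):

from data_juicer.ops.mapper.nlpcda_zh_mapper import NlpcdaZhMapper

op = NlpcdaZhMapper(
    aug_num=1,
    keep_original_sample=True,
    replace_similar_word=True,
    replace_equivalent_num=True,
)
samples = {'text': ['这里一共有5种不同的数据增强方法']}
out = op.process_batched(samples)  # original text plus up to 2 augmented variants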

data_juicer.ops.mapper.optimize_qa_mapper module

class data_juicer.ops.mapper.optimize_qa_mapper.OptimizeQAMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to optimize question-answer pairs.

DEFAULT_SYSTEM_PROMPT = '请优化输入的问答对,使【问题】和【回答】都更加详细、准确。必须按照以下标记格式,直接输出优化后的问答对:\n【问题】\n优化后的问题\n【回答】\n优化后的回答'
DEFAULT_INPUT_TEMPLATE = '以下是原始问答对:\n{}'
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'
DEFAULT_OUTPUT_PATTERN = '.*?【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'
__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID.

  • system_prompt – System prompt for guiding the optimization task.

  • input_template – Template for building the input for the model. Please make sure the template contains one placeholder ‘{}’, which corresponds to the question and answer pair generated by param qa_pair_template.

  • qa_pair_template – Template for formatting the question and answer pair. Please make sure the template contains two ‘{}’ to format question and answer.

  • output_pattern – Regular expression pattern to extract question and answer from model response.

  • enable_vllm – Whether to use VLLM for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation (e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}).

  • kwargs – Extra keyword arguments.

build_input(sample)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
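
Example (a minimal usage sketch; it assumes the question and answer are stored under the default 'query' and 'response' keys, and running it loads the Hugging Face model):

from data_juicer.ops.mapper.optimize_qa_mapper import OptimizeQAMapper

op = OptimizeQAMapper(
    hf_model='Qwen/Qwen2.5-7B-Instruct',
    sampling_params={'temperature': 0.9, 'top_p': 0.95},
)
sample = {
    'query': 'What is data augmentation?',
    'response': 'A way to expand training data.',
}
optimized = op.process_single(sample)  # both fields are rewritten in place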

data_juicer.ops.mapper.optimize_query_mapper module

class data_juicer.ops.mapper.optimize_query_mapper.OptimizeQueryMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: OptimizeQAMapper

Mapper to optimize query in question-answer pairs.

DEFAULT_SYSTEM_PROMPT = '优化问答对中的【问题】,将其更加详细具体,但仍可以由原答案回答。只输出优化后的【问题】,不要输出多余内容。'
parse_output(raw_output)[source]

data_juicer.ops.mapper.optimize_response_mapper module

class data_juicer.ops.mapper.optimize_response_mapper.OptimizeResponseMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: OptimizeQAMapper

Mapper to optimize response in question-answer pairs.

DEFAULT_SYSTEM_PROMPT = '请优化问答对中的回答,将其更加详细具体,但仍可以回答原问题。只输出优化后的回答,不要输出多余内容。'
parse_output(raw_output)[source]

data_juicer.ops.mapper.pair_preference_mapper module

class data_juicer.ops.mapper.pair_preference_mapper.PairPreferenceMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, rejected_key: str = 'rejected_response', reason_key: str = 'reason', try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Mapper to construct paired preference samples.

DEFAULT_SYSTEM_PROMPT = '你的任务是根据参考信息修改问答对中的回答,在语言风格、事实性、人物身份、立场等任一方面与原回答相反。必须按照以下标记格式输出,不要输出其他多余内容。\n【回答】\n生成的新回答\n【原因】\n生成该回答的原因'
DEFAULT_INPUT_TEMPLATE = '【参考信息】\n{reference}\n\n以下是原始问答对:\n【问题】\n{query}\n【回答】\n{response}'
DEFAULT_OUTPUT_PATTERN = '.*?【回答】\\s*(.*?)\\s*【原因】\\s*(.*)'
__init__(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, rejected_key: str = 'rejected_response', reason_key: str = 'reason', try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the model input. It must contain the placeholders '{query}' and '{response}', and can optionally include '{reference}'.

  • output_pattern – Regular expression for parsing model output.

  • rejected_key – The field name in the sample to store the generated rejected response. Defaults to ‘rejected_response’.

  • reason_key – The field name in the sample to store the reason for generating the response. Defaults to ‘reason’.

  • try_num – The number of retries for the API call in case of response parsing failure. Defaults to 3.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

build_input(sample)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
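
Example (a minimal usage sketch; it assumes an OpenAI-compatible API key is configured in the environment and that the reference text lives under the default text key):

from data_juicer.ops.mapper.pair_preference_mapper import PairPreferenceMapper

op = PairPreferenceMapper(api_model='gpt-4o', try_num=3)
sample = {
    'text': 'Reference information about the topic ...',  # formatted as {reference}
    'query': 'The original question ...',
    'response': 'The original (chosen) answer ...',
}
out = op.process_single(sample)
# out['rejected_response'] and out['reason'] hold the generated fields.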

data_juicer.ops.mapper.punctuation_normalization_mapper module

class data_juicer.ops.mapper.punctuation_normalization_mapper.PunctuationNormalizationMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to normalize unicode punctuations to English punctuations in text samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
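
Example (a minimal usage sketch; the expected output reflects the unicode-to-ASCII punctuation mapping):

from data_juicer.ops.mapper.punctuation_normalization_mapper import PunctuationNormalizationMapper

op = PunctuationNormalizationMapper()
samples = {'text': ['“Smart quotes”, ‘apostrophes’ and … ellipses']}
print(op.process_batched(samples))
# expected: {'text': ['"Smart quotes", \'apostrophes\' and ... ellipses']}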

data_juicer.ops.mapper.python_file_mapper module

class data_juicer.ops.mapper.python_file_mapper.PythonFileMapper(file_path: str = '', function_name: str = 'process_single', batched: bool = False, **kwargs)[source]

Bases: Mapper

Mapper for executing Python function defined in a file.

__init__(file_path: str = '', function_name: str = 'process_single', batched: bool = False, **kwargs)[source]

Initialization method.

Parameters:
  • file_path – The path to the Python file containing the function to be executed.

  • function_name – The name of the function defined in the file to be executed.

  • batched – A boolean indicating whether to process input data in batches.

  • kwargs – Additional keyword arguments passed to the parent class.

process_single(sample)[source]

Invoke the loaded function with the provided sample.

process_batched(samples)[source]

Invoke the loaded function with the provided samples.
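
Example (a minimal usage sketch; './my_mapper.py' and its contents are hypothetical):

# Suppose ./my_mapper.py contains:
#
#     def process_single(sample):
#         sample['text'] = sample['text'].upper()
#         return sample

from data_juicer.ops.mapper.python_file_mapper import PythonFileMapper

op = PythonFileMapper(file_path='./my_mapper.py', function_name='process_single')
print(op.process_single({'text': 'hello'}))  # expected: {'text': 'HELLO'}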

data_juicer.ops.mapper.python_lambda_mapper module

class data_juicer.ops.mapper.python_lambda_mapper.PythonLambdaMapper(lambda_str: str = '', batched: bool = False, **kwargs)[source]

Bases: Mapper

Mapper for executing Python lambda function on data samples.

__init__(lambda_str: str = '', batched: bool = False, **kwargs)[source]

Initialization method.

Parameters:
  • lambda_str – A string representation of the lambda function to be executed on data samples. If empty, the identity function is used.

  • batched – A boolean indicating whether to process input data in batches.

  • kwargs – Additional keyword arguments passed to the parent class.

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

process_batched(samples)[source]
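
Example (a minimal usage sketch; the lambda must accept one sample dict and return a dict):

from data_juicer.ops.mapper.python_lambda_mapper import PythonLambdaMapper

op = PythonLambdaMapper(
    lambda_str="lambda sample: {**sample, 'text': sample['text'].strip()}"
)
print(op.process_single({'text': '  hello world  '}))  # expected: {'text': 'hello world'}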

data_juicer.ops.mapper.relation_identity_mapper module

class data_juicer.ops.mapper.relation_identity_mapper.RelationIdentityMapper(api_model: str = 'gpt-4o', source_entity: str | None = None, target_entity: str | None = None, input_key: str | None = None, output_key: str | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Mapper to identify the relation between two entities in the text.

DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定关于{entity1}和{entity2}的文本信息。判断{entity1}和{entity2}之间的关系。\n要求:\n- 关系用一个或多个词语表示,必要时可以加一个形容词来描述这段关系\n- 输出关系时不要参杂任何标点符号\n- 需要你进行合理的推理才能得出结论\n- 如果两个人物身份是同一个人,输出关系为:另一个身份\n- 输出格式为:\n分析推理:...\n所以{entity2}是{entity1}的:...\n- 注意输出的是{entity2}是{entity1}的什么关系,而不是{entity1}是{entity2}的什么关系'
DEFAULT_INPUT_TEMPLATE = '关于{entity1}和{entity2}的文本信息:\n```\n{text}\n```\n'
DEFAULT_OUTPUT_PATTERN_TEMPLATE = '\n        \\s*分析推理:\\s*(.*?)\\s*\n        \\s*所以{entity2}是{entity1}的:\\s*(.*?)\\Z\n    '
__init__(api_model: str = 'gpt-4o', source_entity: str | None = None, target_entity: str | None = None, input_key: str | None = None, output_key: str | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • source_entity – The source entity of the relation to be identified.

  • target_entity – The target entity of the relation to be identified.

  • input_key – The input field key in the samples. Supports nested keys such as "__dj__stats__.text_len". It is text_key by default.

  • output_key – The output field key in the samples. Supports nested keys such as "__dj__stats__.text_len". It is input_key by default.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt_template – System prompt template for the task.

  • input_template – Template for building the model input.

  • output_pattern_template – Regular expression template for parsing model output.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.remove_bibliography_mapper module

class data_juicer.ops.mapper.remove_bibliography_mapper.RemoveBibliographyMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to remove the bibliography at the end of documents in LaTeX samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.remove_comments_mapper module

class data_juicer.ops.mapper.remove_comments_mapper.RemoveCommentsMapper(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove comments in different kinds of documents.

Only 'tex' is supported for now.

__init__(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • doc_type – Type of document to remove comments.

  • inline – Whether to remove inline comments.

  • multiline – Whether to remove multiline comments.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.remove_header_mapper module

class data_juicer.ops.mapper.remove_header_mapper.RemoveHeaderMapper(drop_no_head: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove headers at the beginning of documents in LaTeX samples.

__init__(drop_no_head: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • drop_no_head – whether to drop sample texts without headers.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.remove_long_words_mapper module

class data_juicer.ops.mapper.remove_long_words_mapper.RemoveLongWordsMapper(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove long words within a specific range.

__init__(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_len – The minimum word length in this op; words will be filtered out if their length is below this parameter.

  • max_len – The maximum word length in this op; words will be filtered out if their length exceeds this parameter.

  • args – extra args

  • kwargs – extra args

should_keep_long_word(word)[source]
process_batched(samples)[source]
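
Example (a minimal usage sketch):

from data_juicer.ops.mapper.remove_long_words_mapper import RemoveLongWordsMapper

op = RemoveLongWordsMapper(min_len=2, max_len=15)
samples = {'text': ['a floccinaucinihilipilification example']}
print(op.process_batched(samples))
# expected: {'text': ['example']} - 'a' is too short, the 29-char word too long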

data_juicer.ops.mapper.remove_non_chinese_character_mapper module

class data_juicer.ops.mapper.remove_non_chinese_character_mapper.RemoveNonChineseCharacterlMapper(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove non-Chinese characters in text samples.

__init__(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_alphabet – whether to keep alphabet

  • keep_number – whether to keep number

  • keep_punc – whether to keep punctuation

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.remove_repeat_sentences_mapper module

data_juicer.ops.mapper.remove_repeat_sentences_mapper.split_sentence(text)[source]
class data_juicer.ops.mapper.remove_repeat_sentences_mapper.RemoveRepeatSentencesMapper(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove repeat sentences in text samples.

__init__(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lowercase – Whether to convert sample text to lower case

  • ignore_special_character – Whether to ignore special characters when judging repeated sentences. Special characters are all characters except Chinese characters, letters and numbers.

  • min_repeat_sentence_length – Sentences shorter than this length will not be deduplicated. If ignore_special_character is set to True, then special characters are not included in this length.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.remove_specific_chars_mapper module

class data_juicer.ops.mapper.remove_specific_chars_mapper.RemoveSpecificCharsMapper(chars_to_remove: str | List[str] = '◆●■►▼▲▴∆▻▷❖♡□', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean specific chars in text samples.

__init__(chars_to_remove: str | List[str] = '◆●■►▼▲▴∆▻▷❖♡□', *args, **kwargs)[source]

Initialization method.

Parameters:
  • chars_to_remove – a list or a string including all characters that need to be removed from text.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
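
Example (a minimal usage sketch):

from data_juicer.ops.mapper.remove_specific_chars_mapper import RemoveSpecificCharsMapper

op = RemoveSpecificCharsMapper(chars_to_remove='◆●■►')
samples = {'text': ['◆ item one ● item two']}
print(op.process_batched(samples))
# expected: {'text': [' item one  item two']}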

data_juicer.ops.mapper.remove_table_text_mapper module

class data_juicer.ops.mapper.remove_table_text_mapper.RemoveTableTextMapper(min_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 2, max_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 20, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove table texts from text samples.

A regular expression is used to remove tables whose column counts fall within the specified range.

__init__(min_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 2, max_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 20, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_col – The min number of columns of table to remove.

  • max_col – The max number of columns of table to remove.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]

data_juicer.ops.mapper.remove_words_with_incorrect_substrings_mapper module

class data_juicer.ops.mapper.remove_words_with_incorrect_substrings_mapper.RemoveWordsWithIncorrectSubstringsMapper(lang: str = 'en', tokenization: bool = False, substrings: List[str] | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove words with incorrect substrings.

__init__(lang: str = 'en', tokenization: bool = False, substrings: List[str] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – the language of the samples.

  • tokenization – whether to use a model to tokenize the documents.

  • substrings – The incorrect substrings in words.

  • args – extra args

  • kwargs – extra args

should_keep_word_with_incorrect_substrings(word, substrings)[source]
process_batched(samples)[source]

data_juicer.ops.mapper.replace_content_mapper module

class data_juicer.ops.mapper.replace_content_mapper.ReplaceContentMapper(pattern: str | List[str] | None = None, repl: str | List[str] = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to replace all content in the text that matches a specific regular expression pattern with a designated replacement string.

__init__(pattern: str | List[str] | None = None, repl: str | List[str] = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern(s) to search for within text

  • repl – replacement string(s), default is empty string

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
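
Example (a minimal usage sketch; the email-masking pattern is an illustrative choice):

from data_juicer.ops.mapper.replace_content_mapper import ReplaceContentMapper

op = ReplaceContentMapper(pattern=r'[\w.]+@[\w.]+', repl='[EMAIL]')
samples = {'text': ['contact me at alice@example.com']}
print(op.process_batched(samples))
# expected: {'text': ['contact me at [EMAIL]']}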

data_juicer.ops.mapper.sentence_split_mapper module

class data_juicer.ops.mapper.sentence_split_mapper.SentenceSplitMapper(lang: str = 'en', *args, **kwargs)[source]

Bases: Mapper

Mapper to split text samples to sentences.

__init__(lang: str = 'en', *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – the language of the text, which determines the sentence-splitting model to use.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
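
Example (a minimal usage sketch; the sentence-splitting model is downloaded on first use):

from data_juicer.ops.mapper.sentence_split_mapper import SentenceSplitMapper

op = SentenceSplitMapper(lang='en')
samples = {'text': ['Hello world. How are you?']}
out = op.process_batched(samples)
# each text is rewritten with one sentence per line, e.g. 'Hello world.\nHow are you?'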

data_juicer.ops.mapper.text_chunk_mapper module

class data_juicer.ops.mapper.text_chunk_mapper.TextChunkMapper(max_len: Annotated[int, Gt(gt=0)] | None = None, split_pattern: str | None = '\\n\\n', overlap_len: Annotated[int, Ge(ge=0)] = 0, tokenizer: str | None = None, trust_remote_code: bool = False, *args, **kwargs)[source]

Bases: Mapper

Split input text to chunks.

__init__(max_len: Annotated[int, Gt(gt=0)] | None = None, split_pattern: str | None = '\\n\\n', overlap_len: Annotated[int, Ge(ge=0)] = 0, tokenizer: str | None = None, trust_remote_code: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • max_len – Split the text into multiple texts with at most this length, if not None.

  • split_pattern – Make sure to split at this pattern if it is not None, and force a cut if the length still exceeds max_len.

  • overlap_len – Overlap length of the split texts if they are not split at the split pattern.

  • tokenizer – The tokenizer name of Hugging Face tokenizers. The text length will be calculated as the token number if a tokenizer is offered; otherwise, the text length equals the string length. Supports tiktoken tokenizers (such as gpt-4o), dashscope tokenizers (such as qwen2.5-72b-instruct) and huggingface tokenizers.

  • trust_remote_code – whether to trust remote code when loading the huggingface tokenizer.

  • args – extra args

  • kwargs – extra args

recursively_chunk(text)[source]
get_text_chunks(text, rank=None)[source]
process_batched(samples, rank=None)[source]
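
Example (a minimal usage sketch: chunk at blank lines and force-cut oversized chunks):

from data_juicer.ops.mapper.text_chunk_mapper import TextChunkMapper

op = TextChunkMapper(max_len=200, split_pattern='\n\n', overlap_len=20)
samples = {'text': ['first paragraph ...\n\nsecond paragraph ...']}
chunks = op.process_batched(samples)  # one output sample per chunk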

data_juicer.ops.mapper.video_captioning_from_audio_mapper module

class data_juicer.ops.mapper.video_captioning_from_audio_mapper.VideoCaptioningFromAudioMapper(keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to caption a video according to its audio streams, based on the Qwen-Audio model.

__init__(keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the captioned samples will remain in the final dataset and the original sample will be removed. It's True by default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]

data_juicer.ops.mapper.video_captioning_from_frames_mapper module

class data_juicer.ops.mapper.video_captioning_from_frames_mapper.VideoCaptioningFromFramesMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated to a single string.

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_img2seq – model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each video

  • keep_candidate_mode

    retain strategy for the generated $caption_num$ candidates.

    ’random_any’: Retain the random one from generated captions

    ’similar_one_simhash’: Retain the generated one that is most

    similar to the original caption

    ’all’: Retain all generated captions by concatenation

Note

This is a batched_OP, whose input and output are both lists. Suppose there are $N$ input sample lists, each with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False for the 'random_any' and 'similar_one_simhash' modes, and $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False for the 'all' mode.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated captions will remain in the final dataset and the original captions will be removed. It's True by default.

  • prompt – a string prompt to guide the generation of the image-to-text model for all samples globally. It's None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores a prompt for each sample. It's used to set different prompts for different samples. If it's None, the prompt in the parameter "prompt" is used. It's None by default.

  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip – flip frame images horizontally (left to right).

  • vertical_flip – flip frame images vertically (top to bottom).

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None, context=False)[source]
Parameters:

samples

Returns:

Note

This is a batched_OP, whose input and output are both lists. Suppose there are $N$ input sample lists, each with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one' modes, and $(1+M)Nb$ for the 'all' mode.

data_juicer.ops.mapper.video_captioning_from_summarizer_mapper module

class data_juicer.ops.mapper.video_captioning_from_summarizer_mapper.VideoCaptioningFromSummarizerMapper(hf_summarizer: str | None = None, trust_remote_code: bool = False, consider_video_caption_from_video: bool = True, consider_video_caption_from_audio: bool = True, consider_video_caption_from_frames: bool = True, consider_video_tags_from_audio: bool = True, consider_video_tags_from_frames: bool = True, vid_cap_from_vid_args: Dict | None = None, vid_cap_from_frm_args: Dict | None = None, vid_tag_from_aud_args: Dict | None = None, vid_tag_from_frm_args: Dict | None = None, keep_tag_num: Annotated[int, Gt(gt=0)] = 5, keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, …)

__init__(hf_summarizer: str | None = None, trust_remote_code: bool = False, consider_video_caption_from_video: bool = True, consider_video_caption_from_audio: bool = True, consider_video_caption_from_frames: bool = True, consider_video_tags_from_audio: bool = True, consider_video_tags_from_frames: bool = True, vid_cap_from_vid_args: Dict | None = None, vid_cap_from_frm_args: Dict | None = None, vid_tag_from_aud_args: Dict | None = None, vid_tag_from_frm_args: Dict | None = None, keep_tag_num: Annotated[int, Gt(gt=0)] = 5, keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_summarizer – the summarizer model used to summarize texts generated by other methods.

  • consider_video_caption_from_video – whether to consider the video caption generated from video directly in the summarization process. Default: True.

  • consider_video_caption_from_audio – whether to consider the video caption generated from audio streams in the video in the summarization process. Default: True.

  • consider_video_caption_from_frames – whether to consider the video caption generated from sampled frames from the video in the summarization process. Default: True.

  • consider_video_tags_from_audio – whether to consider the video tags generated from audio streams in the video in the summarization process. Default: True.

  • consider_video_tags_from_frames – whether to consider the video tags generated from sampled frames from the video in the summarization process. Default: True.

  • vid_cap_from_vid_args – the arg dict for video captioning from the video directly, whose keys are the arg names and values are the arg values. Default: None.

  • vid_cap_from_frm_args – the arg dict for video captioning from frames sampled from the video, whose keys are the arg names and values are the arg values. Default: None.

  • vid_tag_from_aud_args – the arg dict for video tagging from the audio streams in the video, whose keys are the arg names and values are the arg values. Default: None.

  • vid_tag_from_frm_args – the arg dict for video tagging from frames sampled from the video, whose keys are the arg names and values are the arg values. Default: None.

  • keep_tag_num – the max number N of tags from sampled frames to keep. Too many tags might negatively influence the summarized text, so we only keep the N most frequent tags. Default: 5.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the summarized captions will remain in the final dataset and the original captions will be removed. It's True by default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]

data_juicer.ops.mapper.video_captioning_from_video_mapper module

class data_juicer.ops.mapper.video_captioning_from_video_mapper.VideoCaptioningFromVideoMapper(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on a video-to-text model and sampled video frames.

__init__(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_video_blip – video-blip model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each video

  • keep_candidate_mode

    retain strategy for the generated $caption_num$ candidates.

    ’random_any’: Retain the random one from generated captions

    ’similar_one_simhash’: Retain the generated one that is most

    similar to the original caption

    ’all’: Retain all generated captions by concatenation

Note

This is a batched_OP, whose input and output are both lists. Suppose there are $N$ input sample lists, each with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False for the 'random_any' and 'similar_one_simhash' modes, and $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False for the 'all' mode.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the generated captions will remain in the final dataset and the original captions will be removed. It's True by default.

  • prompt – a string prompt to guide the generation of the video-blip model for all samples globally. It's None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores a prompt for each sample. It's used to set different prompts for different samples. If it's None, the prompt in the parameter "prompt" is used. It's None by default.

  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip – flip frame images horizontally (left to right).

  • vertical_flip – flip frame images vertically (top to bottom).

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None, context=False)[source]
Parameters:

samples

Returns:

Note

This is a batched_OP, whose input and output are both lists. Suppose there are $N$ input sample lists, each with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for the 'random_any' and 'similar_one' modes, and $(1+M)Nb$ for the 'all' mode.

data_juicer.ops.mapper.video_extract_frames_mapper module

class data_juicer.ops.mapper.video_extract_frames_mapper.VideoExtractFramesMapper(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, frame_dir: str | None = None, frame_key='__dj__video_frames__', *args, **kwargs)[source]

Bases: Mapper

Mapper to extract frames from video files according to specified methods.

Extracted frames data format: a dictionary mapping each video key to the directory where its extracted frames are saved, following the structure:

{
    "video_key_1": "/${frame_dir}/video_key_1_filename/",
    "video_key_2": "/${frame_dir}/video_key_2_filename/",
    ...
}

__init__(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, frame_dir: str | None = None, frame_key='__dj__video_frames__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • frame_sampling_method – sampling method for extracting frames from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. If "duration" > 0, frame_sampling_method acts on every segment. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration. If “duration” > 0, frame_num is the number of frames per segment.

  • duration – The duration of each segment in seconds. If 0, frames are extracted from the entire video. If duration > 0, the video is segmented into multiple segments based on duration, and frames are extracted from each segment.

  • frame_dir – Output directory to save extracted frames. If None, a default directory based on the video file path is used.

  • frame_key – The name of the field used to save the generated frame info.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.video_face_blur_mapper module

class data_juicer.ops.mapper.video_face_blur_mapper.VideoFaceBlurMapper(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur faces detected in videos.

__init__(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • cv_classifier – OpenCV classifier path for face detection. By default, we will use ‘haarcascade_frontalface_alt.xml’.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.video_ffmpeg_wrapped_mapper module

class data_juicer.ops.mapper.video_ffmpeg_wrapped_mapper.VideoFFmpegWrappedMapper(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Bases: Mapper

Simple wrapper for FFmpeg video filters.

__init__(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • filter_name – ffmpeg video filter name.

  • filter_kwargs – keyword-arguments passed to ffmpeg filter.

  • global_args – list-arguments passed to ffmpeg command-line.

  • capture_stderr – whether to capture stderr.

  • overwrite_output – whether to overwrite output file.

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
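
Example (a minimal usage sketch using the standard ffmpeg 'scale' filter; the video path is a placeholder):

from data_juicer.ops.mapper.video_ffmpeg_wrapped_mapper import VideoFFmpegWrappedMapper

op = VideoFFmpegWrappedMapper(
    filter_name='scale',
    filter_kwargs={'width': 1280, 'height': 720},
)
sample = {'videos': ['./clip.mp4']}  # placeholder path
result = op.process_single(sample)  # the sample now points at the rescaled video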

data_juicer.ops.mapper.video_remove_watermark_mapper module

class data_juicer.ops.mapper.video_remove_watermark_mapper.VideoRemoveWatermarkMapper(roi_strings: List[str] = ['0,0,0.1,0.1'], roi_type: str = 'ratio', roi_key: str | None = None, frame_num: Annotated[int, Gt(gt=0)] = 10, min_frame_threshold: Annotated[int, Gt(gt=0)] = 7, detection_method: str = 'pixel_value', *args, **kwargs)[source]

Bases: Mapper

Remove the watermarks in videos given regions.

__init__(roi_strings: List[str] = ['0,0,0.1,0.1'], roi_type: str = 'ratio', roi_key: str | None = None, frame_num: Annotated[int, Gt(gt=0)] = 10, min_frame_threshold: Annotated[int, Gt(gt=0)] = 7, detection_method: str = 'pixel_value', *args, **kwargs)[source]

Initialization method.

Parameters:
  • roi_strings – a given list of regions the watermarks locate. The format of each can be “x1, y1, x2, y2”, “(x1, y1, x2, y2)”, or “[x1, y1, x2, y2]”.

  • roi_type – the roi string type. When the type is 'pixel', (x1, y1) and (x2, y2) are the pixel locations of the top-left and bottom-right corners respectively. If the roi_type is 'ratio', the coordinates are normalized by the widths and heights.

  • roi_key – the key name of the field in samples that stores roi_strings for each sample. It's used to set different ROIs for different samples. If it's None, the ROIs in the parameter "roi_strings" are used. It's None by default.

  • frame_num – the number of frames to be extracted uniformly from the video to detect the pixels of watermark.

  • min_frame_threshold – a coordinate is considered to be the location of a watermark pixel when it is detected as such in no fewer than min_frame_threshold frames.

  • detection_method – the method used to detect watermark pixels. If it is 'pixel_value', we consider the distribution of pixel values in each frame. If it is 'pixel_diversity', we consider the pixel diversity across different frames. In 'pixel_diversity' mode, min_frame_threshold is ignored and frame_num must be greater than 1.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.video_resize_aspect_ratio_mapper module

data_juicer.ops.mapper.video_resize_aspect_ratio_mapper.rescale(width, height, ori_ratio, min_ratio, max_ratio, strategy)[source]
class data_juicer.ops.mapper.video_resize_aspect_ratio_mapper.VideoResizeAspectRatioMapper(min_ratio: str = '9/21', max_ratio: str = '21/9', strategy: str = 'increase', *args, **kwargs)[source]

Bases: Mapper

Mapper to resize videos by aspect ratio. AspectRatio = W / H.

STRATEGY = ['decrease', 'increase']
__init__(min_ratio: str = '9/21', max_ratio: str = '21/9', strategy: str = 'increase', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_ratio – The minimum aspect ratio to enforce; videos with an aspect ratio below min_ratio will be resized to match this minimum ratio. The ratio should be provided as a string in the format "9:21" or "9/21".

  • max_ratio – The maximum aspect ratio to enforce; videos with an aspect ratio above max_ratio will be resized to match this maximum ratio. The ratio should be provided as a string in the format "21:9" or "21/9".

  • strategy – The resizing strategy to apply when adjusting the video dimensions. It can be either ‘decrease’ to reduce the dimension or ‘increase’ to enlarge it. Accepted values are [‘decrease’, ‘increase’].

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
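
Example (a minimal usage sketch; the video path is a placeholder):

from data_juicer.ops.mapper.video_resize_aspect_ratio_mapper import VideoResizeAspectRatioMapper

# Clamp aspect ratios into [3/4, 4/3], enlarging a dimension when out of range.
op = VideoResizeAspectRatioMapper(min_ratio='3/4', max_ratio='4/3', strategy='increase')
sample = {'videos': ['./portrait.mp4']}  # placeholder path
result = op.process_single(sample)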

data_juicer.ops.mapper.video_resize_resolution_mapper module

class data_juicer.ops.mapper.video_resize_resolution_mapper.VideoResizeResolutionMapper(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, force_original_aspect_ratio: str = 'disable', force_divisible_by: Annotated[int, Gt(gt=0)] = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to resize video resolution. We leave deep-learning-based super resolution for future work.

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, force_original_aspect_ratio: str = 'disable', force_divisible_by: Annotated[int, Gt(gt=0)] = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_width – Videos with width less than ‘min_width’ will be mapped to videos with equal or bigger width.

  • max_width – Videos with width more than 'max_width' will be mapped to videos with equal or smaller width.

  • min_height – Videos with height less than ‘min_height’ will be mapped to videos with equal or bigger height.

  • max_height – Videos with height more than ‘max_height’ will be mapped to videos with equal or smaller height.

  • force_original_aspect_ratio – Enable decreasing or increasing output video width or height if necessary to keep the original aspect ratio, including [‘disable’, ‘decrease’, ‘increase’].

  • force_divisible_by – Ensures that both output dimensions, width and height, are divisible by the given integer when used together with force_original_aspect_ratio. Must be a positive even number.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.video_split_by_duration_mapper module

data_juicer.ops.mapper.video_split_by_duration_mapper.create_replacer(replacements)[source]
class data_juicer.ops.mapper.video_split_by_duration_mapper.VideoSplitByDurationMapper(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to split video by duration.

__init__(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • split_duration – duration of each video split in seconds.

  • min_last_split_duration – The minimum allowable duration in seconds for the last video split. If the duration of the last split is less than this value, it will be discarded.

  • keep_original_sample – whether to keep the original sample. If it's set to False, only the cut samples will remain in the final dataset and the original sample will be removed. It's True by default.

  • args – extra args

  • kwargs – extra args

split_videos_by_duration(video_key, container)[source]
process_batched(samples)[source]
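
Example (a minimal usage sketch; the path and the '<__dj__video>' token usage are illustrative assumptions):

from data_juicer.ops.mapper.video_split_by_duration_mapper import VideoSplitByDurationMapper

# Cut each video into 10-second splits, dropping a trailing split under 2 seconds.
op = VideoSplitByDurationMapper(
    split_duration=10,
    min_last_split_duration=2,
    keep_original_sample=False,
)
samples = {
    'text': ['<__dj__video> a long clip'],
    'videos': [['./long.mp4']],  # placeholder path
}
out = op.process_batched(samples)  # one output sample per split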

data_juicer.ops.mapper.video_split_by_key_frame_mapper module

data_juicer.ops.mapper.video_split_by_key_frame_mapper.create_replacer(replacements)[source]
class data_juicer.ops.mapper.video_split_by_key_frame_mapper.VideoSplitByKeyFrameMapper(keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to split video by key frame.

__init__(keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it's set to False, only the split samples will remain in the final dataset and the original sample will be removed. It's True by default.

  • args – extra args

  • kwargs – extra args

get_split_key_frame(video_key, container)[source]
process_batched(samples)[source]

data_juicer.ops.mapper.video_split_by_scene_mapper module

data_juicer.ops.mapper.video_split_by_scene_mapper.replace_func(match, scene_counts_iter)[source]
class data_juicer.ops.mapper.video_split_by_scene_mapper.VideoSplitBySceneMapper(detector: str = 'ContentDetector', threshold: Annotated[float, Ge(ge=0)] = 27.0, min_scene_len: Annotated[int, Ge(ge=0)] = 15, show_progress: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to cut videos into scene clips.

avaliable_detectors = {'AdaptiveDetector': ['window_width', 'min_content_val', 'weights', 'luma_only', 'kernel_size', 'video_manager', 'min_delta_hsv'], 'ContentDetector': ['weights', 'luma_only', 'kernel_size'], 'ThresholdDetector': ['fade_bias', 'add_final_scene', 'method', 'block_size']}
__init__(detector: str = 'ContentDetector', threshold: Annotated[float, Ge(ge=0)] = 27.0, min_scene_len: Annotated[int, Ge(ge=0)] = 15, show_progress: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • detector – Algorithm from scenedetect.detectors. Should be one of ['ContentDetector', 'ThresholdDetector', 'AdaptiveDetector'].

  • threshold – Threshold passed to the detector.

  • min_scene_len – Minimum length of any scene.

  • show_progress – Whether to show progress from scenedetect.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.video_tagging_from_audio_mapper module

class data_juicer.ops.mapper.video_tagging_from_audio_mapper.VideoTaggingFromAudioMapper(hf_ast: str = 'MIT/ast-finetuned-audioset-10-10-0.4593', trust_remote_code: bool = False, tag_field_name: str = '__dj__video_audio_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video tags from audio streams extracted from videos using the Audio Spectrogram Transformer.

__init__(hf_ast: str = 'MIT/ast-finetuned-audioset-10-10-0.4593', trust_remote_code: bool = False, tag_field_name: str = '__dj__video_audio_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_ast – path to the HF model to tag from audios.

  • trust_remote_code – whether to trust the remote code of HF models

  • tag_field_name – the field name to store the tags. It's "__dj__video_audio_tags__" by default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.video_tagging_from_frames_mapper module

class data_juicer.ops.mapper.video_tagging_from_frames_mapper.VideoTaggingFromFramesMapper(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, tag_field_name: str = '__dj__video_frame_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video tags from frames extracted from videos.

__init__(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, tag_field_name: str = '__dj__video_frame_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of ["all_keyframes", "uniform"]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: "all_keyframes".

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is "uniform". If it's 1, only the middle frame will be extracted. If it's 2, only the first and the last frames will be extracted. If it's larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • tag_field_name – the field name to store the tags. It's "__dj__video_frame_tags__" by default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

data_juicer.ops.mapper.whitespace_normalization_mapper module

class data_juicer.ops.mapper.whitespace_normalization_mapper.WhitespaceNormalizationMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to normalize different kinds of whitespaces to whitespace ‘ ‘ (0x20) in text samples.

Different kinds of whitespaces can be found here: https://en.wikipedia.org/wiki/Whitespace_character

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
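
Example (a minimal usage sketch; the inputs are a no-break space, a thin space, and an ideographic space):

from data_juicer.ops.mapper.whitespace_normalization_mapper import WhitespaceNormalizationMapper

op = WhitespaceNormalizationMapper()
samples = {'text': ['no\u00a0break\u2009thin\u3000wide']}
print(op.process_batched(samples))
# expected: {'text': ['no break thin wide']}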

Module contents

class data_juicer.ops.mapper.AudioFFmpegWrappedMapper(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Bases: Mapper

Simple wrapper for FFmpeg audio filters.

__init__(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • filter_name – ffmpeg audio filter name.

  • filter_kwargs – keyword-arguments passed to ffmpeg filter.

  • global_args – list-arguments passed to ffmpeg command-line.

  • capture_stderr – whether to capture stderr.

  • overwrite_output – whether to overwrite output file.

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.CalibrateQAMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Mapper to calibrate question-answer pairs based on reference text.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对【问题】和【回答】进行校准,使其更加详细、准确。\n按照以下格式输出:\n【问题】\n校准后的问题\n【回答】\n校准后的回答'
DEFAULT_INPUT_TEMPLATE = '{reference}\n{qa_pair}'
DEFAULT_REFERENCE_TEMPLATE = '【参考信息】\n{}'
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'
DEFAULT_OUTPUT_PATTERN = '【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'
__init__(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for the calibration task.

  • input_template – Template for building the model input.

  • reference_template – Template for formatting the reference text.

  • qa_pair_template – Template for formatting question-answer pairs.

  • output_pattern – Regular expression for parsing model output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

build_input(sample)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
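
To illustrate how DEFAULT_OUTPUT_PATTERN splits a calibrated reply into the new question and answer, here is a small regex sketch; re.DOTALL is an assumption here so that ‘.’ spans newlines, and the operator’s own parse_output may differ in detail:

```
import re

# DEFAULT_OUTPUT_PATTERN from CalibrateQAMapper, shown above.
pattern = r'【问题】\s*(.*?)\s*【回答】\s*(.*)'
raw_output = '【问题】\n地球的卫星是什么?\n【回答】\n地球唯一的天然卫星是月球。'
match = re.search(pattern, raw_output, re.DOTALL)
if match:
    question, answer = match.group(1), match.group(2)
    print(question)  # the calibrated question
    print(answer)    # the calibrated answer
```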

class data_juicer.ops.mapper.CalibrateQueryMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: CalibrateQAMapper

Mapper to calibrate query in question-answer pairs based on reference text.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对问答对中的【问题】进行校准,        使其更加详细、准确,且仍可以由原答案回答。只输出校准后的问题,不要输出多余内容。'
parse_output(raw_output)[source]
class data_juicer.ops.mapper.CalibrateResponseMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, reference_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: CalibrateQAMapper

Mapper to calibrate response in question-answer pairs based on reference text.

DEFAULT_SYSTEM_PROMPT = '请根据提供的【参考信息】对问答对中的【回答】进行校准,        使其更加详细、准确,且仍可以回答原问题。只输出校准后的回答,不要输出多余内容。'
parse_output(raw_output)[source]
class data_juicer.ops.mapper.ChineseConvertMapper(mode: str = 's2t', *args, **kwargs)[source]

Bases: Mapper

Mapper to convert Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji.

__init__(mode: str = 's2t', *args, **kwargs)[source]

Initialization method.

Parameters:
  • mode

    Choose the mode to convert Chinese:

    s2t: Simplified Chinese to Traditional Chinese,

    t2s: Traditional Chinese to Simplified Chinese,

    s2tw: Simplified Chinese to Traditional Chinese (Taiwan Standard),

    tw2s: Traditional Chinese (Taiwan Standard) to Simplified Chinese,

    s2hk: Simplified Chinese to Traditional Chinese (Hong Kong variant),

    hk2s: Traditional Chinese (Hong Kong variant) to Simplified Chinese,

    s2twp: Simplified Chinese to Traditional Chinese (Taiwan Standard) with Taiwanese idiom,

    tw2sp: Traditional Chinese (Taiwan Standard) to Simplified Chinese with Mainland Chinese idiom,

    t2tw: Traditional Chinese to Traditional Chinese (Taiwan Standard),

    tw2t: Traditional Chinese (Taiwan standard) to Traditional Chinese,

    hk2t: Traditional Chinese (Hong Kong variant) to Traditional Chinese,

    t2hk: Traditional Chinese to Traditional Chinese (Hong Kong variant),

    t2jp: Traditional Chinese Characters (Kyūjitai) to New Japanese Kanji,

    jp2t: New Japanese Kanji (Shinjitai) to Traditional Chinese Characters,

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
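
A minimal sketch of the s2t mode, assuming the default ‘text’ field; the expected output follows OpenCC’s s2t conversion table:

```
from data_juicer.ops.mapper import ChineseConvertMapper

op = ChineseConvertMapper(mode='s2t')  # Simplified -> Traditional
samples = {'text': ['数据处理']}
print(op.process_batched(samples))
# expected: {'text': ['數據處理']}
```
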
class data_juicer.ops.mapper.CleanCopyrightMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to clean copyright comments at the beginning of the text samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.CleanEmailMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean email in text samples.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
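
The pattern/repl pair behaves like re.sub over the text field. A minimal sketch, assuming the default pattern matches standard e-mail addresses:

```
from data_juicer.ops.mapper import CleanEmailMapper

# Replace matched e-mail addresses with a placeholder instead of deleting them.
op = CleanEmailMapper(repl='[EMAIL]')
samples = {'text': ['contact me at alice@example.com, please']}
print(op.process_batched(samples))
# expected: {'text': ['contact me at [EMAIL], please']}
```
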
class data_juicer.ops.mapper.CleanHtmlMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to clean html code in text samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.CleanIpMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean IPv4 and IPv6 addresses in text samples.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.CleanLinksMapper(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean links like http/https/ftp in text samples.

__init__(pattern: str | None = None, repl: str = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern to search for within text.

  • repl – replacement string, default is empty string.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.ExpandMacroMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to expand macro definitions in the document body of LaTeX samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.ExtractEntityAttributeMapper(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = '__dj__main_entities__', attribute_key: str = '__dj__attributes__', attribute_desc_key: str = '__dj__attribute_descriptions__', support_text_key: str = '__dj__attribute_support_texts__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract attributes for given entities from the text

DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定一段文本,从文本中总结{entity}的{attribute},并且从原文摘录最能说明该{attribute}的代表性示例。\n要求:\n- 摘录的示例应该简短。\n- 遵循如下的回复格式:\n# {entity}\n## {attribute}:\n...\n### 代表性示例摘录1:\n```\n...\n```\n### 代表性示例摘录2:\n```\n...\n```\n...\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_ATTR_PATTERN_TEMPLATE = '\\#\\#\\s*{attribute}:\\s*(.*?)(?=\\#\\#\\#|\\Z)'
DEFAULT_DEMON_PATTERN = '\\#\\#\\#\\s*代表性示例摘录(\\d+):\\s*```\\s*(.*?)```\\s*(?=\\#\\#\\#|\\Z)'
__init__(api_model: str = 'gpt-4o', query_entities: List[str] = [], query_attributes: List[str] = [], *, entity_key: str = '__dj__main_entities__', attribute_key: str = '__dj__attributes__', attribute_desc_key: str = '__dj__attribute_descriptions__', support_text_key: str = '__dj__attribute_support_texts__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, attr_pattern_template: str | None = None, demo_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • query_entities – Entity list to be queried.

  • query_attributes – Attribute list to be queried.

  • entity_key – The field name to store the given main entities for attribute extraction. It defaults to “__dj__main_entities__”.

  • attribute_key – The field name to store the given attributes to be extracted. It defaults to “__dj__attributes__”.

  • attribute_desc_key – The field name to store the extracted attribute descriptions. It defaults to “__dj__attribute_descriptions__”.

  • support_text_key – The field name to store the attribute support texts extracted from the raw text. It defaults to “__dj__attribute_support_texts__”.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt_template – System prompt template for the task. Need to be specified by given entity and attribute.

  • input_template – Template for building the model input.

  • attr_pattern_template – Pattern for parsing the attribute from output. Need to be specified by given attribute.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

  • demo_pattern – Pattern for parsing the demonstration excerpts from the output that support the attribute.

parse_output(raw_output, attribute_name)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
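
A minimal construction sketch; the entity and attribute values are illustrative, and API credentials are assumed to be configured for the underlying model wrapper:

```
from data_juicer.ops.mapper import ExtractEntityAttributeMapper

op = ExtractEntityAttributeMapper(
    api_model='gpt-4o',
    query_entities=['李白'],    # entities to query
    query_attributes=['性格'],  # attributes to extract for each entity
)
sample = {'text': '李白,字太白,号青莲居士,性格豪放不羁……'}
result = op.process_single(sample)
# The extracted fields land in the keys configured above, e.g.
# result['__dj__main_entities__'], result['__dj__attributes__'],
# result['__dj__attribute_descriptions__'],
# result['__dj__attribute_support_texts__']
```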

class data_juicer.ops.mapper.ExtractEntityRelationMapper(api_model: str = 'gpt-4o', entity_types: List[str] | None = None, *, entity_key: str = '__dj__entity__', relation_key: str = '__dj__relation__', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, tuple_delimiter: str | None = None, record_delimiter: str | None = None, completion_delimiter: str | None = None, max_gleaning: Annotated[int, Ge(ge=0)] = 1, continue_prompt: str | None = None, if_loop_prompt: str | None = None, entity_pattern: str | None = None, relation_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract entities and relations in the text for knowledge graph.

DEFAULT_PROMPT_TEMPLATE = '-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n\n-Steps-\n1. Identify all entities. For each identified entity, extract the following information:\n- entity_name: Name of the entity\n- entity_type: One of the following types: [{entity_types}]\n- entity_description: Comprehensive description of the entity\'s attributes and activities\nFormat each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>\n\n2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.\nFor each pair of related entities, extract the following information:\n- source_entity: name of the source entity, as identified in step 1\n- target_entity: name of the target entity, as identified in step 1\n- relationship_description: explanation as to why you think the source entity and the target entity are related to each other\n- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity\n- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details\nFormat each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)\n\n3. Return output in the language of the given text as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.\n\n4. When finished, output {completion_delimiter}\n\n######################\n-Examples-\n######################\nExample 1:\n\nEntity_types: [person, technology, mission, organization, location]\nText:\n```\nwhile Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor\'s authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan\'s shared commitment to discovery was an unspoken rebellion against Cruz\'s narrowing vision of control and order.\n\nThen Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. “If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.”\n\nThe underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor\'s, a wordless clash of wills softening into an uneasy truce.\n\nIt was a small transformation, barely perceptible, but one that Alex noted with an inward nod. 
They had all been brought here by different paths\n```\n################\nOutput:\n("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}\n("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}\n("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}\n("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}\n("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor\'s authoritarian certainty and observes changes in Taylor\'s attitude towards the device."{tuple_delimiter}"power dynamics, perspective shift"{tuple_delimiter}7){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz\'s vision."{tuple_delimiter}"shared goals, rebellion"{tuple_delimiter}6){record_delimiter}\n("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}"conflict resolution, mutual respect"{tuple_delimiter}8){record_delimiter}\n("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan\'s commitment to discovery is in rebellion against Cruz\'s vision of control and order."{tuple_delimiter}"ideological conflict, rebellion"{tuple_delimiter}5){record_delimiter}\n("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}"reverence, technological significance"{tuple_delimiter}9){record_delimiter}\n#############################\nExample 2:\n\nEntity_types: [人物, 技术, 任务, 组织, 
地点]\nText:\n```\n他们不再是单纯的执行者;他们已成为某个超越星辰与条纹的领域的信息守护者。这一使命的提升不能被规则和既定协议所束缚——它需要一种新的视角,一种新的决心。\n\n随着与华盛顿的通讯在背景中嗡嗡作响,对话中的紧张情绪通过嘟嘟声和静电噪音贯穿始终。团队站立着,一股不祥的气息笼罩着他们。显然,他们在接下来几个小时内做出的决定可能会重新定义人类在宇宙中的位置,或者将他们置于无知和潜在危险之中。\n\n随着与星辰的联系变得更加牢固,小组开始处理逐渐成形的警告,从被动接受者转变为积极参与者。梅瑟后来的直觉占据了上风——团队的任务已经演变,不再仅仅是观察和报告,而是互动和准备。一场蜕变已经开始,而“杜尔塞行动”则以他们大胆的新频率震动,这种基调不是由世俗设定的\n```\n#############\nOutput:\n("entity"{tuple_delimiter}"华盛顿"{tuple_delimiter}"地点"{tuple_delimiter}"华盛顿是正在接收通讯的地方,表明其在决策过程中的重要性。"){record_delimiter}\n("entity"{tuple_delimiter}"杜尔塞行动"{tuple_delimiter}"任务"{tuple_delimiter}"杜尔塞行动被描述为一项已演变为互动和准备的任务,显示出目标和活动的重大转变。"){record_delimiter}\n("entity"{tuple_delimiter}"团队"{tuple_delimiter}"组织"{tuple_delimiter}"团队被描绘成一群从被动观察者转变为积极参与者的人,展示了他们角色的动态变化。"){record_delimiter}\n("relationship"{tuple_delimiter}"团队"{tuple_delimiter}"华盛顿"{tuple_delimiter}"团队收到来自华盛顿的通讯,这影响了他们的决策过程。"{tuple_delimiter}"决策、外部影响"{tuple_delimiter}7){record_delimiter}\n("relationship"{tuple_delimiter}"团队"{tuple_delimiter}"杜尔塞行动"{tuple_delimiter}"团队直接参与杜尔塞行动,执行其演变后的目标和活动。"{tuple_delimiter}"任务演变、积极参与"{tuple_delimiter}9){completion_delimiter}\n#############################\nExample 3:\n\nEntity_types: [person, role, technology, organization, event, location, concept]\nText:\n```\ntheir voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.\n\n"It\'s like it\'s learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers\' a whole new meaning."\n\nAlex surveyed his team—each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."\n\nTogether, they stood on the edge of the unknown, forging humanity\'s response to a message from the heavens. 
The ensuing silence was palpable—a collective introspection about their role in this grand cosmic play, one that could rewrite human history.\n\nThe encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation\n```\n#############\nOutput:\n("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}\n("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}\n("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}\n("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}\n("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}\n("entity"{tuple_delimiter}"Humanity\'s Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity\'s Response is the collective action taken by Alex\'s team in response to a message from an unknown intelligence."){record_delimiter}\n("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}"communication, learning process"{tuple_delimiter}9){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}"leadership, exploration"{tuple_delimiter}10){record_delimiter}\n("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity\'s Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity\'s Response to the unknown intelligence."{tuple_delimiter}"collective action, cosmic significance"{tuple_delimiter}8){record_delimiter}\n("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}"power dynamics, autonomy"{tuple_delimiter}7){record_delimiter}\n#############################\n-Real Data-\n######################\nEntity_types: [{entity_types}]\nText:\n```\n{input_text}\n```\n######################\nOutput:\n'
DEFAULT_CONTINUE_PROMPT = 'MANY entities were missed in the last extraction.  Add them below using the same format:\n'
DEFAULT_IF_LOOP_PROMPT = 'It appears some entities may have still been missed.  Answer YES | NO if there are still entities that need to be added.\n'
DEFAULT_ENTITY_TYPES = ['organization', 'person', 'geo', 'event']
DEFAULT_TUPLE_DELIMITER = '<|>'
DEFAULT_RECORD_DELIMITER = '##'
DEFAULT_COMPLETION_DELIMITER = '<|COMPLETE|>'
DEFAULT_ENTITY_PATTERN = '\\("entity"(.*?)\\)'
DEFAULT_RELATION_PATTERN = '\\("relationship"(.*?)\\)'
__init__(api_model: str = 'gpt-4o', entity_types: List[str] | None = None, *, entity_key: str = '__dj__entity__', relation_key: str = '__dj__relation__', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, tuple_delimiter: str | None = None, record_delimiter: str | None = None, completion_delimiter: str | None = None, max_gleaning: Annotated[int, Ge(ge=0)] = 1, continue_prompt: str | None = None, if_loop_prompt: str | None = None, entity_pattern: str | None = None, relation_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • entity_types – Pre-defined entity types for the knowledge graph.

  • entity_key – The field name to store the entities. It defaults to “__dj__entity__”.

  • relation_key – The field name to store the relations between entities. It defaults to “__dj__relation__”.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • prompt_template – The template of input prompt.

  • tuple_delimiter – Delimiter to separate items in outputs.

  • record_delimiter – Delimiter to separate records in outputs.

  • completion_delimiter – To mark the end of the output.

  • max_gleaning – the maximum number of extra LLM calls for gleaning additional entities and relations.

  • continue_prompt – the prompt for gleaning entities and relations.

  • if_loop_prompt – the prompt to determine whether to stop gleaning.

  • entity_pattern – Regular expression for parsing entity record.

  • relation_pattern – Regular expression for parsing relation record.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
add_message(messages, role, content)[source]
light_rag_extraction(messages, rank=None)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
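
To see how the default delimiters and patterns fit together, here is a regex sketch over a fabricated model reply; re.DOTALL is an assumption, and the operator’s own parse_output may differ in detail:

```
import re

TUPLE = '<|>'                                  # DEFAULT_TUPLE_DELIMITER
ENTITY_PATTERN = r'\("entity"(.*?)\)'          # DEFAULT_ENTITY_PATTERN
RELATION_PATTERN = r'\("relationship"(.*?)\)'  # DEFAULT_RELATION_PATTERN

raw = ('("entity"<|>"Alex"<|>"person"<|>"Team member")##'
       '("relationship"<|>"Alex"<|>"Taylor"<|>"Colleagues"<|>"teamwork"<|>7)'
       '<|COMPLETE|>')

for body in re.findall(ENTITY_PATTERN, raw, re.DOTALL):
    name, etype, desc = [f.strip('"') for f in body.split(TUPLE)[1:]]
    print(name, etype, desc)       # Alex person Team member

for body in re.findall(RELATION_PATTERN, raw, re.DOTALL):
    src, dst, desc, kw, strength = [f.strip('"') for f in body.split(TUPLE)[1:]]
    print(src, dst, kw, strength)  # Alex Taylor teamwork 7
```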

class data_juicer.ops.mapper.ExtractEventMapper(api_model: str = 'gpt-4o', *, event_desc_key: str = '__dj__event_description__', relevant_char_key: str = '__dj__relevant_characters__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract events and relevant characters in the text

DEFAULT_SYSTEM_PROMPT = '给定一段文本,对文本的情节进行分点总结,并抽取与情节相关的人物。\n要求:\n- 尽量不要遗漏内容,不要添加文本中没有的情节,符合原文事实\n- 联系上下文说明前因后果,但仍然需要符合事实\n- 不要包含主观看法\n- 注意要尽可能保留文本的专有名词\n- 注意相关人物需要在对应情节中出现\n- 只抽取情节中的主要人物,不要遗漏情节的主要人物\n- 总结格式如下:\n### 情节1:\n- **情节描述**: ...\n- **相关人物**:人物1,人物2,人物3,...\n### 情节2:\n- **情节描述**: ...\n- **相关人物**:人物1,人物2,...\n### 情节3:\n- **情节描述**: ...\n- **相关人物**:人物1,...\n...\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_OUTPUT_PATTERN = '\n        \\#\\#\\#\\s*情节(\\d+):\\s*\n        -\\s*\\*\\*情节描述\\*\\*\\s*:\\s*(.*?)\\s*\n        -\\s*\\*\\*相关人物\\*\\*\\s*:\\s*(.*?)(?=\\#\\#\\#|\\Z)\n    '
__init__(api_model: str = 'gpt-4o', *, event_desc_key: str = '__dj__event_description__', relevant_char_key: str = '__dj__relevant_characters__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • event_desc_key – The field name to store the event descriptions. It defaults to “__dj__event_description__”.

  • relevant_char_key – The field name to store the characters relevant to the events. It defaults to “__dj__relevant_characters__”.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for the task.

  • input_template – Template for building the model input.

  • output_pattern – Regular expression for parsing model output.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
process_batched(samples, rank=None)[source]
class data_juicer.ops.mapper.ExtractKeywordMapper(api_model: str = 'gpt-4o', *, keyword_key: str = '__dj__keyword__', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, completion_delimiter: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Generate keywords for the text

DEFAULT_PROMPT_TEMPLATE = '-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n\n-Steps-\n1. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.\nFormat the content-level key words as ("content_keywords" <high_level_keywords>)\n\n3. Return output in the language of the given text.\n\n4. When finished, output {completion_delimiter}\n\n######################\n-Examples-\n######################\nExample 1:\n\nText:\n```\nwhile Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor\'s authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan\'s shared commitment to discovery was an unspoken rebellion against Cruz\'s narrowing vision of control and order.\n\nThen Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. “If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.”\n\nThe underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor\'s, a wordless clash of wills softening into an uneasy truce.\n\nIt was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths\n```\n################\nOutput:\n("content_keywords" "power dynamics, ideological conflict, discovery, rebellion"){completion_delimiter}\n#############################\nExample 2:\n\nText:\n```\n他们不再是单纯的执行者;他们已成为某个超越星辰与条纹的领域的信息守护者。这一使命的提升不能被规则和既定协议所束缚——它需要一种新的视角,一种新的决心。\n\n随着与华盛顿的通讯在背景中嗡嗡作响,对话中的紧张情绪通过嘟嘟声和静电噪音贯穿始终。团队站立着,一股不祥的气息笼罩着他们。显然,他们在接下来几个小时内做出的决定可能会重新定义人类在宇宙中的位置,或者将他们置于无知和潜在危险之中。\n\n随着与星辰的联系变得更加牢固,小组开始处理逐渐成形的警告,从被动接受者转变为积极参与者。梅瑟后来的直觉占据了上风——团队的任务已经演变,不再仅仅是观察和报告,而是互动和准备。一场蜕变已经开始,而“杜尔塞行动”则以他们大胆的新频率震动,这种基调不是由世俗设定的\n```\n#############\nOutput:\n("content_keywords" "任务演变, 决策制定, 积极参与, 宇宙意义"){completion_delimiter}\n#############################\nExample 3:\n\nEntity_types: [person, role, technology, organization, event, location, concept]\nText:\n```\ntheir voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.\n\n"It\'s like it\'s learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers\' a whole new meaning."\n\nAlex surveyed his team—each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."\n\nTogether, they stood on the edge of the unknown, forging humanity\'s response to a message from the heavens. 
The ensuing silence was palpable—a collective introspection about their role in this grand cosmic play, one that could rewrite human history.\n\nThe encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation\n```\n#############\nOutput:\n("content_keywords" "first contact, control, communication, cosmic significance"){completion_delimiter}\n-Real Data-\n######################\nText:\n```\n{input_text}\n```\n######################\nOutput:\n'
DEFAULT_COMPLETION_DELIMITER = '<|COMPLETE|>'
DEFAULT_OUTPUT_PATTERN = '\\("content_keywords"(.*?)\\)'
__init__(api_model: str = 'gpt-4o', *, keyword_key: str = '__dj__keyword__', api_endpoint: str | None = None, response_path: str | None = None, prompt_template: str | None = None, completion_delimiter: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • keyword_key – The field name to store the keywords. It defaults to “__dj__keyword__”.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • prompt_template – The template of input prompt.

  • completion_delimiter – To mark the end of the output.

  • output_pattern – Regular expression for parsing keywords.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ExtractNicknameMapper(api_model: str = 'gpt-4o', *, nickname_key: str = '__dj__nickname__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract nickname relationship in the text.

DEFAULT_SYSTEM_PROMPT = '给定你一段文本,你的任务是将人物之间的称呼方式(昵称)提取出来。\n要求:\n- 需要给出说话人对被称呼人的称呼,不要搞反了。\n- 相同的说话人和被称呼人最多给出一个最常用的称呼。\n- 请不要输出互相没有昵称的称呼方式。\n- 输出格式如下:\n```\n### 称呼方式1\n- **说话人**:...\n- **被称呼人**:...\n- **...对...的昵称**:...\n### 称呼方式2\n- **说话人**:...\n- **被称呼人**:...\n- **...对...的昵称**:...\n### 称呼方式3\n- **说话人**:...\n- **被称呼人**:...\n- **...对...的昵称**:...\n...\n```\n'
DEFAULT_INPUT_TEMPLATE = '# 文本\n```\n{text}\n```\n'
DEFAULT_OUTPUT_PATTERN = '\n        \\#\\#\\#\\s*称呼方式(\\d+)\\s*\n        -\\s*\\*\\*说话人\\*\\*\\s*:\\s*(.*?)\\s*\n        -\\s*\\*\\*被称呼人\\*\\*\\s*:\\s*(.*?)\\s*\n        -\\s*\\*\\*(.*?)对(.*?)的昵称\\*\\*\\s*:\\s*(.*?)(?=\\#\\#\\#|\\Z) # for double check\n    '
__init__(api_model: str = 'gpt-4o', *, nickname_key: str = '__dj__nickname__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • nickname_key – The field name to store the nickname relationships. It defaults to “__dj__nickname__”.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for the task.

  • input_template – Template for building the model input.

  • output_pattern – Regular expression for parsing model output.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ExtractSupportTextMapper(api_model: str = 'gpt-4o', *, summary_key: str = '__dj__event_description__', support_text_key: str = '__dj__support_text__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Extract support sub text for a summary.

DEFAULT_SYSTEM_PROMPT = '你将扮演一个文本摘录助手的角色。你的主要任务是基于给定的文章(称为“原文”)以及对原文某个部分的简短描述或总结(称为“总结”),准确地识别并提取出与该总结相对应的原文片段。\n要求:\n- 你需要尽可能精确地匹配到最符合总结内容的那部分内容\n- 如果存在多个可能的答案,请选择最贴近总结意思的那个\n- 下面是一个例子帮助理解这一过程:\n### 原文:\n《红楼梦》是中国古典小说四大名著之一,由清代作家曹雪芹创作。它讲述了贾宝玉、林黛玉等人的爱情故事及四大家族的兴衰历程。书中通过复杂的人物关系展现了封建社会的各种矛盾冲突。其中关于贾府内部斗争的部分尤其精彩,特别是王熙凤与尤二姐之间的争斗,生动描绘了权力争夺下的女性形象。此外,《红楼梦》还以其精美的诗词闻名,这些诗词不仅增添了文学色彩,也深刻反映了人物的性格特点和命运走向。\n\n### 总结:\n描述了书中的两个女性角色之间围绕权力展开的竞争。\n\n### 原文摘录:\n其中关于贾府内部斗争的部分尤其精彩,特别是王熙凤与尤二姐之间的争斗,生动描绘了权力争夺下的女性形象。'
DEFAULT_INPUT_TEMPLATE = '### 原文:\n{text}\n\n### 总结:\n{summary}\n\n### 原文摘录:\n'
__init__(api_model: str = 'gpt-4o', *, summary_key: str = '__dj__event_description__', support_text_key: str = '__dj__support_text__', api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • summary_key – The field name to store the input summary. Nested keys such as “__dj__stats__.text_len” are supported. It defaults to “__dj__event_description__”.

  • support_text_key – The field name to store the output support text for the summary. It defaults to “__dj__support_text__”.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for the task.

  • input_template – Template for building the model input.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – Whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.FixUnicodeMapper(normalization: str | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to fix unicode errors in text samples.

__init__(normalization: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • normalization – the specified form of Unicode normalization mode, which can be one of [‘NFC’, ‘NFKC’, ‘NFD’, ‘NFKD’]; defaults to ‘NFC’.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
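
A minimal sketch fixing a classic mojibake sequence; the expected output assumes ftfy-style repair:

```
from data_juicer.ops.mapper import FixUnicodeMapper

op = FixUnicodeMapper(normalization='NFC')
# 'â€™' is the classic UTF-8-read-as-Latin-1 mojibake for the apostrophe.
samples = {'text': ['The Mona Lisa doesnâ€™t have eyebrows.']}
print(op.process_batched(samples))
# expected: {'text': ['The Mona Lisa doesn’t have eyebrows.']}
```
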
class data_juicer.ops.mapper.GenerateQAFromExamplesMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to generate question and answer pairs from examples. You should configure an empty dataset in your yaml config file:

```
generated_dataset_config:
  type: 'EmptyFormatter'  # use RayEmptyFormatter when enabling ray
  length: ${The number of generated samples}
  feature_keys: ${text key}
```

The number of samples generated is determined by the length of the empty dataset.

DEFAULT_SYSTEM_PROMPT = '请你仔细观察多个示例数据的输入和输出,按照你的理解,总结出相应规矩,然后写出一个新的【问题】和【回答】。注意,新生成的【问题】和【回答】需要满足如下要求:\n1. 生成的【问题】和【回答】不能与输入的【问题】和【回答】一致,但是需要保持格式相同。\n2. 生成的【问题】不一定要局限于输入【问题】的话题或领域,生成的【回答】需要正确回答生成的【问题】。\n3. 提供的【问题】和【回答】可能是多轮对话,生成的【问题】和【回答】也可以是多轮,但是需要保持格式相同。\n4. 生成的【问题】和【回答】必须成对出现,而且【问题】需要在【回答】之前。\n'
DEFAULT_INPUT_TEMPLATE = '{}'
DEFAULT_EXAMPLE_TEMPLATE = '\n如下是一条示例数据:\n{}'
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}\n'
DEFAULT_OUTPUT_PATTERN = '【问题】(.*?)【回答】(.*?)(?=【问题】|$)'
__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, seed_file: str = '', example_num: Annotated[int, Gt(gt=0)] = 3, similarity_threshold: float = 0.7, system_prompt: str | None = None, input_template: str | None = None, example_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID.

  • seed_file – Path to the seed file in chatml format.

  • example_num – The number of selected examples. Randomly select N examples from “seed_file” and put them into prompt as QA examples.

  • similarity_threshold – The similarity score threshold between the generated samples and the seed examples. Range from 0 to 1. Samples with similarity score less than this threshold will be kept.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the input prompt. It must include one placeholder ‘{}’, which will be replaced by example_num formatted examples defined by example_template.

  • example_template – Template for formatting one QA example. It must include one placeholder ‘{}’, which will be replaced by one formatted qa_pair.

  • qa_pair_template – Template for formatting a single QA pair within each example. Must include two placeholders ‘{}’ for the question and answer.

  • output_pattern – Regular expression pattern to extract questions and answers from model response.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

build_input(qa_examples)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
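
A regex sketch of how DEFAULT_OUTPUT_PATTERN above extracts QA pairs from a generated reply; re.DOTALL is an assumption so that ‘.’ spans newlines:

```
import re

pattern = r'【问题】(.*?)【回答】(.*?)(?=【问题】|$)'  # DEFAULT_OUTPUT_PATTERN
raw = '【问题】\n1加1等于几?\n【回答】\n1加1等于2。\n'
for question, answer in re.findall(pattern, raw, re.DOTALL):
    print(question.strip(), '->', answer.strip())
```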

class data_juicer.ops.mapper.GenerateQAFromTextMapper(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to generate question and answer pairs from text. Recommended models:

  • ‘alibaba-pai/pai-llama3-8b-doc2qa’

  • ‘alibaba-pai/pai-baichuan2-7b-doc2qa’

  • ‘alibaba-pai/pai-qwen1_5-4b-doc2qa’

  • ‘alibaba-pai/pai-qwen1_5-7b-doc2qa’

  • ‘alibaba-pai/pai-qwen1_5-1b8-doc2qa’

  • ‘alibaba-pai/pai-qwen1_5-0b5-doc2qa’

These recommended models are all trained on Chinese data and are suitable for Chinese text.

__init__(hf_model: str = 'alibaba-pai/pai-qwen1_5-7b-doc2qa', *, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID.

  • output_pattern – Regular expression pattern to extract questions and answers from model response.

  • enable_vllm – Whether to use vllm for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation, e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

The default data format parsed by this interface is as follows:

Model Input:

蒙古国的首都是乌兰巴托(Ulaanbaatar)
冰岛的首都是雷克雅未克(Reykjavik)

Model Output:

蒙古国的首都是乌兰巴托(Ulaanbaatar)
冰岛的首都是雷克雅未克(Reykjavik)
Human: 请问蒙古国的首都是哪里?
Assistant: 你好,根据提供的信息,蒙古国的首都是乌兰巴托(Ulaanbaatar)。
Human: 冰岛的首都是哪里呢?
Assistant: 冰岛的首都是雷克雅未克(Reykjavik)。
…

parse_output(raw_output)[source]
process_batched(samples, rank=None)[source]
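
The default output_pattern is not reproduced above, so the regex below is only an illustrative assumption for splitting the Human/Assistant transcript into QA pairs:

```
import re

# Hypothetical pattern for the transcript format shown above.
pattern = r'Human:(.*?)Assistant:(.*?)(?=Human:|\Z)'
raw = ('Human: 请问蒙古国的首都是哪里? '
       'Assistant: 你好,根据提供的信息,蒙古国的首都是乌兰巴托(Ulaanbaatar)。 '
       'Human: 冰岛的首都是哪里呢? '
       'Assistant: 冰岛的首都是雷克雅未克(Reykjavik)。')
for question, answer in re.findall(pattern, raw, re.DOTALL):
    print(question.strip(), '->', answer.strip())
```
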
class data_juicer.ops.mapper.ImageBlurMapper(p: float = 0.2, blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur images.

__init__(p: float = 0.2, blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • p – Probability of the image being blurred.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageCaptioningFromGPT4VMapper(mode: str = 'description', api_key: str = '', max_token: int = 500, temperature: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 1.0, system_prompt: str = '', user_prompt: str = '', user_prompt_key: str | None = None, keep_original_sample: bool = True, any_or_all: str = 'any', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose texts are generated based on gpt-4-vision and the image.

__init__(mode: str = 'description', api_key: str = '', max_token: int = 500, temperature: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 1.0, system_prompt: str = '', user_prompt: str = '', user_prompt_key: str | None = None, keep_original_sample: bool = True, any_or_all: str = 'any', *args, **kwargs)[source]

Initialization method.

Parameters:
  • mode – mode of the text generated from images; can be one of [‘reasoning’, ‘description’, ‘conversation’, ‘custom’]

  • api_key – the API key to authenticate the request.

  • max_token – the maximum number of tokens to generate. Default is 500.

  • temperature – controls the randomness of the output (range from 0 to 1). Default is 1.0.

  • system_prompt – a string prompt used to set the context of a conversation and provide global guidance or rules for the gpt4-vision so that it can generate responses in the expected way. If mode set to custom, the parameter will be used.

  • user_prompt – a string prompt to guide the generation of gpt4-vision for each sample. It defaults to “”, which means no prompt is provided.

  • user_prompt_key – the key name of the field in samples that stores the prompt for each sample. It’s used to set different prompts for different samples. If it’s None, the prompt in the parameter “user_prompt” is used. It defaults to None.

  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated text in the final datasets and the original text will be removed. It’s True in default.

  • any_or_all – keep this sample with ‘any’ or ‘all’ strategy of all images. ‘any’: keep this sample if any images meet the condition. ‘all’: keep this sample only if all images meet the condition.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.ImageCaptioningMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on another model and the image.

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_img2seq – model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each image

  • keep_candidate_mode

    retain strategy for the generated $caption_num$ candidates.

    ‘random_any’: Retain a random one from the generated captions

    ‘similar_one_simhash’: Retain the generated one that is most similar to the original caption

    ‘all’: Retain all generated captions by concatenation

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ lists of input samples, each with batch size $b$, and denote caption_num as $M$. For ‘random_any’ and ‘similar_one_simhash’ modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False; for ‘all’ mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated captions in the final datasets and the original captions will be removed. It’s True in default.

  • prompt – a string prompt to guide the generation of blip2 model for all samples globally. It’s None in default, which means no prompt provided.

  • prompt_key – the key name of fields in samples to store prompts for each sample. It’s used for set different prompts for different samples. If it’s none, use prompt in parameter “prompt”. It’s None in default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The total number of samples after generation is $2Nb$ for ‘random_any’ and ‘similar_one_simhash’ modes, and $(1+M)Nb$ for ‘all’ mode.

Parameters:

samples

Returns:
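
A quick arithmetic check of the sample counts described in the Notes above (a sketch, not part of the API):

```
# N input sample lists, batch size b, caption_num M, keep_original_sample=True
N, b, M = 1, 4, 2
print(2 * N * b)        # 8  -> 'random_any' / 'similar_one_simhash' modes
print((1 + M) * N * b)  # 12 -> 'all' mode
```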

class data_juicer.ops.mapper.ImageDiffusionMapper(hf_diffusion: str = 'CompVis/stable-diffusion-v1-4', trust_remote_code: bool = False, torch_dtype: str = 'fp32', revision: str = 'main', strength: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.8, guidance_scale: float = 7.5, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, caption_key: str | None = None, hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', *args, **kwargs)[source]

Bases: Mapper

Generate image by diffusion model

__init__(hf_diffusion: str = 'CompVis/stable-diffusion-v1-4', trust_remote_code: bool = False, torch_dtype: str = 'fp32', revision: str = 'main', strength: Annotated[float, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=0), Le(le=1)])] = 0.8, guidance_scale: float = 7.5, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, caption_key: str | None = None, hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_diffusion – diffusion model name on huggingface to generate the image.

  • torch_dtype – the floating point type used to load the diffusion model. Can be one of [‘fp32’, ‘fp16’, ‘bf16’]

  • revision – The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier allowed by Git.

  • strength – Indicates the extent to transform the reference image. Must be between 0 and 1. The reference image is used as a starting point, and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 therefore essentially ignores the reference image.

  • guidance_scale – A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.

  • aug_num – The image number to be produced by stable-diffusion model.

  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated images in the final datasets and the original images will be removed. It defaults to True.

  • caption_key – the key name of fields in samples to store captions for each images. It can be a string if there is only one image in each sample. Otherwise, it should be a list. If it’s none, ImageDiffusionMapper will produce captions for each images.

  • hf_img2seq – model name on huggingface to generate caption if caption_key is None.

process_batched(samples, rank=None, context=False)[source]

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote aug_num as $M$. The total number of samples after generation is $(1+M)Nb$.

Parameters:

samples

Returns:

class data_juicer.ops.mapper.ImageFaceBlurMapper(cv_classifier: str = '', blur_type: str = 'gaussian', radius: Annotated[float, Ge(ge=0)] = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur faces detected in images.

__init__(cv_classifier: str = '', blur_type: str = 'gaussian', radius: Annotated[float, Ge(ge=0)] = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • cv_classifier – OpenCV classifier path for face detection. By default, we will use ‘haarcascade_frontalface_alt.xml’.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.ImageTaggingMapper(tag_field_name: str = '__dj__image_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate image tags.

__init__(tag_field_name: str = '__dj__image_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • tag_field_name – the field name to store the tags. It defaults to “__dj__image_tags__”.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.NlpaugEnMapper(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to simply augment samples in English based on the nlpaug library.

__init__(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, delete_random_word: bool = False, swap_random_word: bool = False, spelling_error_word: bool = False, split_random_word: bool = False, keyboard_error_char: bool = False, ocr_error_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, insert_random_char: bool = False, *args, **kwargs)[source]

Initialization method. All augmentation methods use their default parameters. We recommend using only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might be changed significantly.

Parameters:
  • sequential – whether to combine all augmentation methods into a sequence. If it’s True, a sample will be augmented by all enabled augmentation methods sequentially. If it’s False, each enabled augmentation method generates its augmented samples independently.

  • aug_num – number of augmented samples to be generated. If sequential is True, there will be aug_num augmented samples in total. If it’s False, there will be (aug_num * #enabled_aug_methods) augmented samples.

  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated texts in the final datasets and the original texts will be removed. It’s True by default.

  • delete_random_word – whether to enable the augmentation method of deleting random words from the original texts. e.g. “I love LLM” –> “I LLM”

  • swap_random_word – whether to enable the augmentation method of swapping random contiguous words in the original texts. e.g. “I love LLM” –> “Love I LLM”

  • spelling_error_word – whether to enable the augmentation method of simulating spelling errors for words in the original texts. e.g. “I love LLM” –> “Ai love LLM”

  • split_random_word – whether to enable the augmentation method of splitting words randomly with whitespaces in the original texts. e.g. “I love LLM” –> “I love LL M”

  • keyboard_error_char – whether to enable the augmentation method of simulating keyboard errors for characters in the original texts. e.g. “I love LLM” –> “I ;ov4 LLM”

  • ocr_error_char – whether to enable the augmentation method of simulating OCR errors for characters in the original texts. e.g. “I love LLM” –> “I 10ve LLM”

  • delete_random_char – whether to enable the augmentation method of deleting random characters from the original texts. e.g. “I love LLM” –> “I oe LLM”

  • swap_random_char – whether to enable the augmentation method of swapping random contiguous characters in the original texts. e.g. “I love LLM” –> “I ovle LLM”

  • insert_random_char – whether to enable the augmentation method of inserting random characters into the original texts. e.g. “I love LLM” –> “I ^lKove LLM”

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
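
Example

A sketch that enables two word-level methods independently (sequential=False); it assumes the optional nlpaug dependency is installed and that batched text ops take the dict-of-lists sample format:

    from data_juicer.ops.mapper import NlpaugEnMapper

    op = NlpaugEnMapper(
        sequential=False,           # each enabled method augments on its own
        aug_num=1,                  # 1 augmented sample per enabled method
        keep_original_sample=True,
        delete_random_word=True,
        swap_random_word=True,
    )
    samples = {'text': ['I love LLM']}
    out = op.process_batched(samples)
    # 1 original + 1 * 2 enabled methods = 3 texts in out['text']
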
class data_juicer.ops.mapper.NlpcdaZhMapper(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to simply augment samples in Chinese based on the nlpcda library.

__init__(sequential: bool = False, aug_num: Annotated[int, Gt(gt=0)] = 1, keep_original_sample: bool = True, replace_similar_word: bool = False, replace_homophone_char: bool = False, delete_random_char: bool = False, swap_random_char: bool = False, replace_equivalent_num: bool = False, *args, **kwargs)[source]

Initialization method. All augmentation methods use their default parameters. We recommend using only 1-3 augmentation methods at a time; otherwise, the semantics of the samples might be changed significantly. Notice: some augmentation methods might not work for certain special texts, so no augmented text may be generated.

Parameters:
  • sequential – whether to combine all augmentation methods into a sequence. If it’s True, a sample will be augmented by all enabled augmentation methods sequentially. If it’s False, each enabled augmentation method generates its augmented samples independently.

  • aug_num – number of augmented samples to be generated. If sequential is True, there will be aug_num augmented samples in total. If it’s False, there will be (aug_num * #enabled_aug_methods) augmented samples.

  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated texts in the final datasets and the original texts will be removed. It’s True by default.

  • replace_similar_word – whether to enable the augmentation method of replacing random words with their similar words in the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这边一共有5种不同的数据增强方法”

  • replace_homophone_char – whether to enable the augmentation method of replacing random characters with their homophones in the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的濖据增强方法”

  • delete_random_char – whether to enable the augmentation method of deleting random characters from the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的数据增强”

  • swap_random_char – whether to enable the augmentation method of swapping random contiguous characters in the original texts. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有5种不同的数据强增方法”

  • replace_equivalent_num – whether to enable the augmentation method of replacing random numbers with their equivalent representations in the original texts. Notice: only applies to numbers for now. e.g. “这里一共有5种不同的数据增强方法” –> “这里一共有伍种不同的数据增强方法”

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.OptimizeQAMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: Mapper

Mapper to optimize question-answer pairs.

DEFAULT_SYSTEM_PROMPT = '请优化输入的问答对,使【问题】和【回答】都更加详细、准确。必须按照以下标记格式,直接输出优化后的问答对:\n【问题】\n优化后的问题\n【回答】\n优化后的回答'
DEFAULT_INPUT_TEMPLATE = '以下是原始问答对:\n{}'
DEFAULT_QA_PAIR_TEMPLATE = '【问题】\n{}\n【回答】\n{}'
DEFAULT_OUTPUT_PATTERN = '.*?【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'
__init__(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Initialization method.

Parameters:
  • hf_model – Hugging Face model ID.

  • system_prompt – System prompt for guiding the optimization task.

  • input_template – Template for building the input for the model. Please make sure the template contains one placeholder ‘{}’, which corresponds to the question and answer pair generated by param qa_pair_template.

  • qa_pair_template – Template for formatting the question and answer pair. Please make sure the template contains two ‘{}’ to format question and answer.

  • output_pattern – Regular expression pattern to extract question and answer from model response.

  • enable_vllm – Whether to use VLLM for inference acceleration.

  • model_params – Parameters for initializing the model.

  • sampling_params – Sampling parameters for text generation (e.g., {‘temperature’: 0.9, ‘top_p’: 0.95}).

  • kwargs – Extra keyword arguments.

build_input(sample)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
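
Example

A standalone sketch of how DEFAULT_OUTPUT_PATTERN can pull the optimized question and answer out of a model response; the re.DOTALL flag is an assumption here so that ‘.’ also spans the newlines in the reply:

    import re

    DEFAULT_OUTPUT_PATTERN = '.*?【问题】\\s*(.*?)\\s*【回答】\\s*(.*)'
    raw_output = '【问题】\nWhat is data-juicer?\n【回答】\nA data processing system for LLMs.'

    match = re.match(DEFAULT_OUTPUT_PATTERN, raw_output, re.DOTALL)
    if match:
        question, answer = match.group(1), match.group(2)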

class data_juicer.ops.mapper.OptimizeQueryMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: OptimizeQAMapper

Mapper to optimize query in question-answer pairs.

DEFAULT_SYSTEM_PROMPT = '优化问答对中的【问题】,将其更加详细具体,但仍可以由原答案回答。只输出优化后的【问题】,不要输出多余内容。'
parse_output(raw_output)[source]
class data_juicer.ops.mapper.OptimizeResponseMapper(hf_model: str = 'Qwen/Qwen2.5-7B-Instruct', *, system_prompt: str | None = None, input_template: str | None = None, qa_pair_template: str | None = None, output_pattern: str | None = None, enable_vllm: bool = False, model_params: Dict | None = None, sampling_params: Dict | None = None, **kwargs)[source]

Bases: OptimizeQAMapper

Mapper to optimize response in question-answer pairs.

DEFAULT_SYSTEM_PROMPT = '请优化问答对中的回答,将其更加详细具体,但仍可以回答原问题。只输出优化后的回答,不要输出多余内容。'
parse_output(raw_output)[source]
class data_juicer.ops.mapper.PairPreferenceMapper(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, rejected_key: str = 'rejected_response', reason_key: str = 'reason', try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Mapper to construct paired preference samples.

DEFAULT_SYSTEM_PROMPT = '你的任务是根据参考信息修改问答对中的回答,在语言风格、事实性、人物身份、立场等任一方面与原回答相反。必须按照以下标记格式输出,不要输出其他多余内容。\n【回答】\n生成的新回答\n【原因】\n生成该回答的原因'
DEFAULT_INPUT_TEMPLATE = '【参考信息】\n{reference}\n\n以下是原始问答对:\n【问题】\n{query}\n【回答】\n{response}'
DEFAULT_OUTPUT_PATTERN = '.*?【回答】\\s*(.*?)\\s*【原因】\\s*(.*)'
__init__(api_model: str = 'gpt-4o', *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt: str | None = None, input_template: str | None = None, output_pattern: str | None = None, rejected_key: str = 'rejected_response', reason_key: str = 'reason', try_num: Annotated[int, Gt(gt=0)] = 3, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt – System prompt for guiding the generation task.

  • input_template – Template for building the model input. It must contain placeholders ‘{query}’ and ‘{response}’, and can optionally include ‘{reference}’.

  • output_pattern – Regular expression for parsing model output.

  • rejected_key – The field name in the sample to store the generated rejected response. Defaults to ‘rejected_response’.

  • reason_key – The field name in the sample to store the reason for generating the response. Defaults to ‘reason’.

  • try_num – The number of retries for the API call in case of response parsing failure. Defaults to 3.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

build_input(sample)[source]
parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
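
Example

A sketch that renders DEFAULT_INPUT_TEMPLATE into the text sent to the API model; the reference, query and response values are invented for the example:

    # Fill the placeholders the mapper expects: reference, query, response.
    DEFAULT_INPUT_TEMPLATE = ('【参考信息】\n{reference}\n\n'
                              '以下是原始问答对:\n【问题】\n{query}\n【回答】\n{response}')
    model_input = DEFAULT_INPUT_TEMPLATE.format(
        reference='The reference document text.',
        query='What does the document describe?',
        response='A short answer.',
    )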

class data_juicer.ops.mapper.PunctuationNormalizationMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to normalize unicode punctuations to English punctuations in text samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.PythonFileMapper(file_path: str = '', function_name: str = 'process_single', batched: bool = False, **kwargs)[source]

Bases: Mapper

Mapper for executing a Python function defined in a file.

__init__(file_path: str = '', function_name: str = 'process_single', batched: bool = False, **kwargs)[source]

Initialization method.

Parameters:
  • file_path – The path to the Python file containing the function to be executed.

  • function_name – The name of the function defined in the file to be executed.

  • batched – A boolean indicating whether to process input data in batches.

  • kwargs – Additional keyword arguments passed to the parent class.

process_single(sample)[source]

Invoke the loaded function with the provided sample.

process_batched(samples)[source]

Invoke the loaded function with the provided samples.
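
Example

A minimal end-to-end sketch: write a one-function module, then point the mapper at it. The file name and function body are invented for the example:

    from data_juicer.ops.mapper import PythonFileMapper

    # A module exposing the function to be executed on each sample.
    with open('my_op.py', 'w') as f:
        f.write(
            'def process_single(sample):\n'
            "    sample['text'] = sample['text'].strip()\n"
            '    return sample\n'
        )

    op = PythonFileMapper(file_path='my_op.py', function_name='process_single')
    result = op.process_single({'text': '  hello  '})  # {'text': 'hello'}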

class data_juicer.ops.mapper.PythonLambdaMapper(lambda_str: str = '', batched: bool = False, **kwargs)[source]

Bases: Mapper

Mapper for executing a Python lambda function on data samples.

__init__(lambda_str: str = '', batched: bool = False, **kwargs)[source]

Initialization method.

Parameters:
  • lambda_str – A string representation of the lambda function to be executed on data samples. If empty, the identity function is used.

  • batched – A boolean indicating whether to process input data in batches.

  • kwargs – Additional keyword arguments passed to the parent class.

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

process_batched(samples)[source]
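
Example

A minimal sketch of the lambda variant; the lambda receives a sample dict and, as assumed here, returns the (possibly modified) dict:

    from data_juicer.ops.mapper import PythonLambdaMapper

    # Lowercase the text field of every sample.
    op = PythonLambdaMapper(
        lambda_str="lambda sample: {**sample, 'text': sample['text'].lower()}"
    )
    result = op.process_single({'text': 'Hello World'})  # {'text': 'hello world'}
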
class data_juicer.ops.mapper.RelationIdentityMapper(api_model: str = 'gpt-4o', source_entity: str | None = None, target_entity: str | None = None, input_key: str | None = None, output_key: str | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Bases: Mapper

Mapper to identify the relation between two entities in the text.

DEFAULT_SYSTEM_PROMPT_TEMPLATE = '给定关于{entity1}和{entity2}的文本信息。判断{entity1}和{entity2}之间的关系。\n要求:\n- 关系用一个或多个词语表示,必要时可以加一个形容词来描述这段关系\n- 输出关系时不要参杂任何标点符号\n- 需要你进行合理的推理才能得出结论\n- 如果两个人物身份是同一个人,输出关系为:另一个身份\n- 输出格式为:\n分析推理:...\n所以{entity2}是{entity1}的:...\n- 注意输出的是{entity2}是{entity1}的什么关系,而不是{entity1}是{entity2}的什么关系'
DEFAULT_INPUT_TEMPLATE = '关于{entity1}和{entity2}的文本信息:\n```\n{text}\n```\n'
DEFAULT_OUTPUT_PATTERN_TEMPLATE = '\n        \\s*分析推理:\\s*(.*?)\\s*\n        \\s*所以{entity2}是{entity1}的:\\s*(.*?)\\Z\n    '
__init__(api_model: str = 'gpt-4o', source_entity: str | None = None, target_entity: str | None = None, input_key: str | None = None, output_key: str | None = None, *, api_endpoint: str | None = None, response_path: str | None = None, system_prompt_template: str | None = None, input_template: str | None = None, output_pattern_template: str | None = None, try_num: Annotated[int, Gt(gt=0)] = 3, drop_text: bool = False, model_params: Dict = {}, sampling_params: Dict = {}, **kwargs)[source]

Initialization method.

Parameters:
  • api_model – API model name.

  • source_entity – The source entity of the relation to be identified.

  • target_entity – The target entity of the relation to be identified.

  • input_key – The input field key in the samples. Nested keys such as “__dj__stats__.text_len” are supported. It defaults to text_key.

  • output_key – The output field key in the samples. Nested keys such as “__dj__stats__.text_len” are supported. It defaults to input_key.

  • api_endpoint – URL endpoint for the API.

  • response_path – Path to extract content from the API response. Defaults to ‘choices.0.message.content’.

  • system_prompt_template – System prompt template for the task.

  • input_template – Template for building the model input.

  • output_pattern_template – Regular expression template for parsing model output.

  • try_num – The number of retry attempts when there is an API call error or output parsing error.

  • drop_text – whether to drop the text in the output.

  • model_params – Parameters for initializing the API model.

  • sampling_params – Extra parameters passed to the API call. e.g {‘temperature’: 0.9, ‘top_p’: 0.95}

  • kwargs – Extra keyword arguments.

parse_output(raw_output)[source]
process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.RemoveBibliographyMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to remove bibliography at the end of documents in Latex samples.

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveCommentsMapper(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove comments in different kinds of documents.

Only ‘tex’ is supported for now.

__init__(doc_type: str | List[str] = 'tex', inline: bool = True, multiline: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • doc_type – Type of document to remove comments.

  • inline – Whether to remove inline comments.

  • multiline – Whether to remove multiline comments.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveHeaderMapper(drop_no_head: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove headers at the beginning of documents in Latex samples.

__init__(drop_no_head: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • drop_no_head – whether to drop sample texts without headers.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveLongWordsMapper(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove words whose lengths fall outside a specific range.

__init__(min_len: int = 1, max_len: int = 9223372036854775807, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_len – The min word length in this op; words will be filtered out if their length is below this parameter.

  • max_len – The max word length in this op; words will be filtered out if their length exceeds this parameter.

  • args – extra args

  • kwargs – extra args

should_keep_long_word(word)[source]
process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveNonChineseCharacterlMapper(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove non-Chinese characters in text samples.

__init__(keep_alphabet: bool = True, keep_number: bool = True, keep_punc: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_alphabet – whether to keep alphabet

  • keep_number – whether to keep number

  • keep_punc – whether to keep punctuation

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveRepeatSentencesMapper(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove repeat sentences in text samples.

__init__(lowercase: bool = False, ignore_special_character: bool = True, min_repeat_sentence_length: int = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lowercase – Whether to convert sample text to lower case

  • ignore_special_character – Whether to ignore special characters when judging repeated sentences. Special characters are all characters except Chinese characters, letters and numbers.

  • min_repeat_sentence_length – Sentences shorter than this length will not be deduplicated. If ignore_special_character is set to True, then special characters are not included in this length.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveSpecificCharsMapper(chars_to_remove: str | List[str] = '◆●■►▼▲▴∆▻▷❖♡□', *args, **kwargs)[source]

Bases: Mapper

Mapper to clean specific chars in text samples.

__init__(chars_to_remove: str | List[str] = '◆●■►▼▲▴∆▻▷❖♡□', *args, **kwargs)[source]

Initialization method.

Parameters:
  • chars_to_remove – a list or a string including all characters that need to be removed from text.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
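
Example

A minimal sketch with a custom character set; batched text ops take the dict-of-lists sample format, an assumption of this example:

    from data_juicer.ops.mapper import RemoveSpecificCharsMapper

    # Strip the two listed bullet characters from all texts.
    op = RemoveSpecificCharsMapper(chars_to_remove='◆●')
    samples = {'text': ['◆ item one ● item two']}
    out = op.process_batched(samples)
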
class data_juicer.ops.mapper.RemoveTableTextMapper(min_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 2, max_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 20, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove table texts from text samples.

A regular expression is used to remove tables whose column numbers fall within the specified range.

__init__(min_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 2, max_col: Annotated[int, FieldInfo(annotation=NoneType, required=True, metadata=[Ge(ge=2), Le(le=20)])] = 20, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_col – The min number of columns of table to remove.

  • max_col – The max number of columns of table to remove.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.RemoveWordsWithIncorrectSubstringsMapper(lang: str = 'en', tokenization: bool = False, substrings: List[str] | None = None, *args, **kwargs)[source]

Bases: Mapper

Mapper to remove words with incorrect substrings.

__init__(lang: str = 'en', tokenization: bool = False, substrings: List[str] | None = None, *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – the language of the samples.

  • tokenization – whether to use a model to tokenize the documents.

  • substrings – The incorrect substrings in words.

  • args – extra args

  • kwargs – extra args

should_keep_word_with_incorrect_substrings(word, substrings)[source]
process_batched(samples)[source]
class data_juicer.ops.mapper.ReplaceContentMapper(pattern: str | List[str] | None = None, repl: str | List[str] = '', *args, **kwargs)[source]

Bases: Mapper

Mapper to replace all content in the text that matches a specific regular expression pattern with a designated replacement string.

__init__(pattern: str | List[str] | None = None, repl: str | List[str] = '', *args, **kwargs)[source]

Initialization method.

Parameters:
  • pattern – regular expression pattern(s) to search for within text

  • repl – replacement string(s), default is empty string

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
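
Example

A sketch that masks e-mail addresses through the regex parameters above; the pattern itself is just an illustration:

    from data_juicer.ops.mapper import ReplaceContentMapper

    # Replace anything that looks like an e-mail address with a placeholder.
    op = ReplaceContentMapper(
        pattern=r'[\w.+-]+@[\w-]+\.[\w.]+',
        repl='<EMAIL>',
    )
    samples = {'text': ['contact me at jane.doe@example.com']}
    out = op.process_batched(samples)  # text becomes 'contact me at <EMAIL>'
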
class data_juicer.ops.mapper.SentenceSplitMapper(lang: str = 'en', *args, **kwargs)[source]

Bases: Mapper

Mapper to split text samples to sentences.

__init__(lang: str = 'en', *args, **kwargs)[source]

Initialization method.

Parameters:
  • lang – the language in which sentences of the text will be split.

  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
class data_juicer.ops.mapper.TextChunkMapper(max_len: Annotated[int, Gt(gt=0)] | None = None, split_pattern: str | None = '\\n\\n', overlap_len: Annotated[int, Ge(ge=0)] = 0, tokenizer: str | None = None, trust_remote_code: bool = False, *args, **kwargs)[source]

Bases: Mapper

Split input text into chunks.

__init__(max_len: Annotated[int, Gt(gt=0)] | None = None, split_pattern: str | None = '\\n\\n', overlap_len: Annotated[int, Ge(ge=0)] = 0, tokenizer: str | None = None, trust_remote_code: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • max_len – Split text into multi texts with this max len if not None.

  • split_pattern – Make sure to split at this pattern if it is not None, and force a cut if the length exceeds max_len.

  • overlap_len – Overlap length between the split texts when the text is not split by the split pattern.

  • tokenizer – The tokenizer name of Hugging Face tokenizers. The text length will be calculated as the token number if a tokenizer is offered; otherwise, the text length equals the string length. Supports tiktoken tokenizers (such as gpt-4o), dashscope tokenizers (such as qwen2.5-72b-instruct) and huggingface tokenizers.

  • args – extra args

  • kwargs – extra args

Trust_remote_code:

whether to trust remote code when loading the huggingface tokenizer

recursively_chunk(text)[source]
get_text_chunks(text, rank=None)[source]
process_batched(samples, rank=None)[source]
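
Example

A sketch of character-based chunking: with no tokenizer given, length is plain string length per the parameter notes above:

    from data_juicer.ops.mapper import TextChunkMapper

    # Split on blank lines, force-cut chunks longer than 50 characters,
    # and overlap forced cuts by 5 characters.
    op = TextChunkMapper(max_len=50, split_pattern='\n\n', overlap_len=5)
    samples = {'text': ['first paragraph\n\nsecond paragraph, which runs on a little longer']}
    out = op.process_batched(samples)
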
class data_juicer.ops.mapper.VideoCaptioningFromAudioMapper(keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to caption a video according to its audio streams, based on the Qwen-Audio model.

__init__(keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only captioned samples in the final datasets and the original sample will be removed. It’s True by default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]
class data_juicer.ops.mapper.VideoCaptioningFromFramesMapper(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string.

__init__(hf_img2seq: str = 'Salesforce/blip2-opt-2.7b', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_img2seq – model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each video

  • keep_candidate_mode

    retain strategy for the generated $caption_num$ candidates.

    ‘random_any’: Retain a random one from the generated captions

    ‘similar_one_simhash’: Retain the generated one that is most similar to the original caption

    ‘all’: Retain all generated captions by concatenation

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ lists of input samples, each with batch size $b$, and denote caption_num as $M$. For ‘random_any’ and ‘similar_one_simhash’ modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False; for ‘all’ mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated captions in the final datasets and the original captions will be removed. It’s True by default.

  • prompt – a string prompt to guide the generation of the image-to-text model for all samples globally. It’s None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores prompts for each sample. It’s used to set different prompts for different samples. If it’s None, use the prompt in parameter “prompt”. It’s None by default.

  • frame_sampling_method – sampling method for extracting frames from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip – flip frame video horizontally (left to right).

  • vertical_flip – flip frame video vertically (top to bottom).

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None, context=False)[source]
Parameters:

samples

Returns:

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The number of total samples after generation is $2Nb$ for ‘random_any’ and ‘similar_one_simhash’ modes, and $(1+M)Nb$ for ‘all’ mode.

class data_juicer.ops.mapper.VideoCaptioningFromSummarizerMapper(hf_summarizer: str | None = None, trust_remote_code: bool = False, consider_video_caption_from_video: bool = True, consider_video_caption_from_audio: bool = True, consider_video_caption_from_frames: bool = True, consider_video_tags_from_audio: bool = True, consider_video_tags_from_frames: bool = True, vid_cap_from_vid_args: Dict | None = None, vid_cap_from_frm_args: Dict | None = None, vid_tag_from_aud_args: Dict | None = None, vid_tag_from_frm_args: Dict | None = None, keep_tag_num: Annotated[int, Gt(gt=0)] = 5, keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, …)

__init__(hf_summarizer: str | None = None, trust_remote_code: bool = False, consider_video_caption_from_video: bool = True, consider_video_caption_from_audio: bool = True, consider_video_caption_from_frames: bool = True, consider_video_tags_from_audio: bool = True, consider_video_tags_from_frames: bool = True, vid_cap_from_vid_args: Dict | None = None, vid_cap_from_frm_args: Dict | None = None, vid_tag_from_aud_args: Dict | None = None, vid_tag_from_frm_args: Dict | None = None, keep_tag_num: Annotated[int, Gt(gt=0)] = 5, keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_summarizer – the summarizer model used to summarize texts generated by other methods.

  • consider_video_caption_from_video – whether to consider the video caption generated from video directly in the summarization process. Default: True.

  • consider_video_caption_from_audio – whether to consider the video caption generated from audio streams in the video in the summarization process. Default: True.

  • consider_video_caption_from_frames – whether to consider the video caption generated from sampled frames from the video in the summarization process. Default: True.

  • consider_video_tags_from_audio – whether to consider the video tags generated from audio streams in the video in the summarization process. Default: True.

  • consider_video_tags_from_frames – whether to consider the video tags generated from sampled frames from the video in the summarization process. Default: True.

  • vid_cap_from_vid_args – the arg dict for video captioning from video directly, whose keys are the arg names and values are the arg values. Default: None.

  • vid_cap_from_frm_args – the arg dict for video captioning from sampled frames of the video, whose keys are the arg names and values are the arg values. Default: None.

  • vid_tag_from_aud_args – the arg dict for video tagging from audio streams in the video, whose keys are the arg names and values are the arg values. Default: None.

  • vid_tag_from_frm_args – the arg dict for video tagging from sampled frames of the video, whose keys are the arg names and values are the arg values. Default: None.

  • keep_tag_num – max number N of tags from sampled frames to keep. Too many tags might negatively influence the summarized text, so we only keep the N most frequent tags. Default: 5.

  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only summarized captions in the final datasets and the original captions will be removed. It’s True by default.

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None)[source]
class data_juicer.ops.mapper.VideoCaptioningFromVideoMapper(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to generate samples whose captions are generated based on a video-to-text model and sampled video frame.

__init__(hf_video_blip: str = 'kpyu/video-blip-opt-2.7b-ego4d', trust_remote_code: bool = False, caption_num: Annotated[int, Gt(gt=0)] = 1, keep_candidate_mode: str = 'random_any', keep_original_sample: bool = True, prompt: str | None = None, prompt_key: str | None = None, frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, horizontal_flip: bool = False, vertical_flip: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_video_blip – video-blip model name on huggingface to generate caption

  • caption_num – how many candidate captions to generate for each video

  • keep_candidate_mode

    retain strategy for the generated $caption_num$ candidates.

    ‘random_any’: Retain a random one from the generated captions

    ‘similar_one_simhash’: Retain the generated one that is most similar to the original caption

    ‘all’: Retain all generated captions by concatenation

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ lists of input samples, each with batch size $b$, and denote caption_num as $M$. For ‘random_any’ and ‘similar_one_simhash’ modes, the total number of samples after generation is $2Nb$ when keep_original_sample is True and $Nb$ when it is False; for ‘all’ mode, it is $(1+M)Nb$ when keep_original_sample is True and $MNb$ when it is False.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only generated captions in the final datasets and the original captions will be removed. It’s True by default.

  • prompt – a string prompt to guide the generation of the video-blip model for all samples globally. It’s None by default, which means no prompt is provided.

  • prompt_key – the key name of the field in samples that stores prompts for each sample. It’s used to set different prompts for different samples. If it’s None, use the prompt in parameter “prompt”. It’s None by default.

  • frame_sampling_method – sampling method for extracting frames from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • horizontal_flip – flip frame video horizontally (left to right).

  • vertical_flip – flip frame video vertically (top to bottom).

  • args – extra args

  • kwargs – extra args

process_batched(samples, rank=None, context=False)[source]
Parameters:

samples

Returns:

Note

This is a batched_OP, whose input and output types are both lists. Suppose there are $N$ input sample lists with batch size $b$, and denote caption_num as $M$. The number of total samples after generation is $2Nb$ for ‘random_any’ and ‘similar_one_simhash’ modes, and $(1+M)Nb$ for ‘all’ mode.

class data_juicer.ops.mapper.VideoExtractFramesMapper(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, frame_dir: str | None = None, frame_key='__dj__video_frames__', *args, **kwargs)[source]

Bases: Mapper

Mapper to extract frames from video files according to specified methods.

Extracted Frames Data Format:

The data format for the extracted frames is a dictionary mapping each video key to the directory where its extracted frames are saved. The dictionary follows the structure:

{
    “video_key_1”: “/${frame_dir}/video_key_1_filename/”,
    “video_key_2”: “/${frame_dir}/video_key_2_filename/”,
    …
}

__init__(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, duration: float = 0, frame_dir: str | None = None, frame_key='__dj__video_frames__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • frame_sampling_method – sampling method for extracting frames from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. If “duration” > 0, frame_sampling_method acts on every segment. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration. If “duration” > 0, frame_num is the number of frames per segment.

  • duration – The duration of each segment in seconds. If 0, frames are extracted from the entire video. If duration > 0, the video is segmented into multiple segments based on duration, and frames are extracted from each segment.

  • frame_dir – Output directory to save extracted frames. If None, a default directory based on the video file path is used.

  • frame_key – The name of field to save generated frames info.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoFFmpegWrappedMapper(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Bases: Mapper

Simple wrapper for FFmpeg video filters.

__init__(filter_name: str | None = None, filter_kwargs: Dict | None = None, global_args: List[str] | None = None, capture_stderr: bool = True, overwrite_output: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • filter_name – ffmpeg video filter name.

  • filter_kwargs – keyword-arguments passed to ffmpeg filter.

  • global_args – list-arguments passed to ffmpeg command-line.

  • capture_stderr – whether to capture stderr.

  • overwrite_output – whether to overwrite output file.

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample
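
Example

A sketch wrapping ffmpeg’s ‘scale’ video filter; the video path and the default ‘videos’ field name are assumptions of this example:

    from data_juicer.ops.mapper import VideoFFmpegWrappedMapper

    # Rescale each video to a width of 448; height=-2 lets ffmpeg pick an
    # even height that preserves the aspect ratio.
    op = VideoFFmpegWrappedMapper(
        filter_name='scale',
        filter_kwargs={'width': 448, 'height': -2},
    )
    sample = {'videos': ['path/to/clip.mp4']}
    processed = op.process_single(sample)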

class data_juicer.ops.mapper.VideoFaceBlurMapper(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to blur faces detected in videos.

__init__(cv_classifier: str = '', blur_type: str = 'gaussian', radius: float = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • cv_classifier – OpenCV classifier path for face detection. By default, we will use ‘haarcascade_frontalface_alt.xml’.

  • blur_type – Type of blur kernel, including [‘mean’, ‘box’, ‘gaussian’].

  • radius – Radius of blur kernel.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoRemoveWatermarkMapper(roi_strings: List[str] = ['0,0,0.1,0.1'], roi_type: str = 'ratio', roi_key: str | None = None, frame_num: Annotated[int, Gt(gt=0)] = 10, min_frame_threshold: Annotated[int, Gt(gt=0)] = 7, detection_method: str = 'pixel_value', *args, **kwargs)[source]

Bases: Mapper

Remove the watermarks in videos given regions.

__init__(roi_strings: List[str] = ['0,0,0.1,0.1'], roi_type: str = 'ratio', roi_key: str | None = None, frame_num: Annotated[int, Gt(gt=0)] = 10, min_frame_threshold: Annotated[int, Gt(gt=0)] = 7, detection_method: str = 'pixel_value', *args, **kwargs)[source]

Initialization method.

Parameters:
  • roi_strings – a given list of regions the watermarks locate. The format of each can be “x1, y1, x2, y2”, “(x1, y1, x2, y2)”, or “[x1, y1, x2, y2]”.

  • roi_type – the roi string type. When the type is ‘pixel’, (x1, y1), (x2, y2) are the locations of pixels in the top left corner and the bottom right corner respectively. If the roi_type is ‘ratio’, the coordinates are normalized by the width and height of the video.

  • roi_key – the key name of the field in samples that stores roi_strings for each sample. It’s used to set different rois for different samples. If it’s None, use the rois in parameter “roi_strings”. It’s None by default.

  • frame_num – the number of frames to be extracted uniformly from the video to detect the pixels of watermark.

  • min_frame_threshold – a coordinate is considered to be the location of a watermark pixel when it is detected as such in no fewer than min_frame_threshold frames.

  • detection_method – the method to detect the pixels of the watermark. If it is ‘pixel_value’, we consider the distribution of pixel values in each frame. If it is ‘pixel_diversity’, we consider the pixel diversity across different frames. min_frame_threshold is unused and frame_num must be greater than 1 in ‘pixel_diversity’ mode.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoResizeAspectRatioMapper(min_ratio: str = '9/21', max_ratio: str = '21/9', strategy: str = 'increase', *args, **kwargs)[source]

Bases: Mapper

Mapper to resize videos by aspect ratio. AspectRatio = W / H.

STRATEGY = ['decrease', 'increase']
__init__(min_ratio: str = '9/21', max_ratio: str = '21/9', strategy: str = 'increase', *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_ratio – The minimum aspect ratio to enforce; videos with an aspect ratio below min_ratio will be resized to match this minimum ratio. The ratio should be provided as a string in the format “9:21” or “9/21”.

  • max_ratio – The maximum aspect ratio to enforce; videos with an aspect ratio above max_ratio will be resized to match this maximum ratio. The ratio should be provided as a string in the format “21:9” or “21/9”.

  • strategy – The resizing strategy to apply when adjusting the video dimensions. It can be either ‘decrease’ to reduce the dimension or ‘increase’ to enlarge it. Accepted values are [‘decrease’, ‘increase’].

  • args – extra args

  • kwargs – extra args

process_single(sample)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoResizeResolutionMapper(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, force_original_aspect_ratio: str = 'disable', force_divisible_by: Annotated[int, Gt(gt=0)] = 2, *args, **kwargs)[source]

Bases: Mapper

Mapper to resize the resolution of videos. We leave deep-learning-based super resolution for future work.

__init__(min_width: int = 1, max_width: int = 9223372036854775807, min_height: int = 1, max_height: int = 9223372036854775807, force_original_aspect_ratio: str = 'disable', force_divisible_by: Annotated[int, Gt(gt=0)] = 2, *args, **kwargs)[source]

Initialization method.

Parameters:
  • min_width – Videos with width less than ‘min_width’ will be mapped to videos with equal or bigger width.

  • max_width – Videos with width more than ‘max_width’ will be mapped to videos with equal or smaller width.

  • min_height – Videos with height less than ‘min_height’ will be mapped to videos with equal or bigger height.

  • max_height – Videos with height more than ‘max_height’ will be mapped to videos with equal or smaller height.

  • force_original_aspect_ratio – Enable decreasing or increasing output video width or height if necessary to keep the original aspect ratio, including [‘disable’, ‘decrease’, ‘increase’].

  • force_divisible_by – Ensures that both output dimensions, width and height, are divisible by the given integer when used together with force_original_aspect_ratio. Must be a positive even number.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoSplitByDurationMapper(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to split video by duration.

__init__(split_duration: float = 10, min_last_split_duration: float = 0, keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • split_duration – duration of each video split in seconds.

  • min_last_split_duration – The minimum allowable duration in seconds for the last video split. If the duration of the last split is less than this value, it will be discarded.

  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only cut samples in the final datasets and the original sample will be removed. It’s True by default.

  • args – extra args

  • kwargs – extra args

split_videos_by_duration(video_key, container)[source]
process_batched(samples)[source]
class data_juicer.ops.mapper.VideoSplitByKeyFrameMapper(keep_original_sample: bool = True, *args, **kwargs)[source]

Bases: Mapper

Mapper to split video by key frame.

__init__(keep_original_sample: bool = True, *args, **kwargs)[source]

Initialization method.

Parameters:
  • keep_original_sample – whether to keep the original sample. If it’s set to False, there will be only split samples in the final datasets and the original sample will be removed. It’s True by default.

  • args – extra args

  • kwargs – extra args

get_split_key_frame(video_key, container)[source]
process_batched(samples)[source]
class data_juicer.ops.mapper.VideoSplitBySceneMapper(detector: str = 'ContentDetector', threshold: Annotated[float, Ge(ge=0)] = 27.0, min_scene_len: Annotated[int, Ge(ge=0)] = 15, show_progress: bool = False, *args, **kwargs)[source]

Bases: Mapper

Mapper to cut videos into scene clips.

avaliable_detectors = {'AdaptiveDetector': ['window_width', 'min_content_val', 'weights', 'luma_only', 'kernel_size', 'video_manager', 'min_delta_hsv'], 'ContentDetector': ['weights', 'luma_only', 'kernel_size'], 'ThresholdDetector': ['fade_bias', 'add_final_scene', 'method', 'block_size']}
__init__(detector: str = 'ContentDetector', threshold: Annotated[float, Ge(ge=0)] = 27.0, min_scene_len: Annotated[int, Ge(ge=0)] = 15, show_progress: bool = False, *args, **kwargs)[source]

Initialization method.

Parameters:
  • detector – Algorithm from scenedetect.detectors. Should be one of [‘ContentDetector’, ‘ThresholdDetector’, ‘AdaptiveDetector’].

  • threshold – Threshold passed to the detector.

  • min_scene_len – Minimum length of any scene.

  • show_progress – Whether to show progress from scenedetect.

  • args – extra args

  • kwargs – extra args

process_single(sample, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoTaggingFromAudioMapper(hf_ast: str = 'MIT/ast-finetuned-audioset-10-10-0.4593', trust_remote_code: bool = False, tag_field_name: str = '__dj__video_audio_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video tags from audio streams extracted from the video, using the Audio Spectrogram Transformer.

__init__(hf_ast: str = 'MIT/ast-finetuned-audioset-10-10-0.4593', trust_remote_code: bool = False, tag_field_name: str = '__dj__video_audio_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • hf_ast – path to the HF model to tag from audios.

  • trust_remote_code – whether to trust the remote code of HF models

  • tag_field_name – the field name to store the tags. It’s “__dj__video_audio_tags__” by default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.VideoTaggingFromFramesMapper(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, tag_field_name: str = '__dj__video_frame_tags__', *args, **kwargs)[source]

Bases: Mapper

Mapper to generate video tags from frames extracted from the video.

__init__(frame_sampling_method: str = 'all_keyframes', frame_num: Annotated[int, Gt(gt=0)] = 3, tag_field_name: str = '__dj__video_frame_tags__', *args, **kwargs)[source]

Initialization method.

Parameters:
  • frame_sampling_method – sampling method for extracting frame images from the videos. Should be one of [“all_keyframes”, “uniform”]. The former extracts all key frames (the number of which depends on the duration of the video) and the latter extracts a specified number of frames uniformly from the video. Default: “all_keyframes”.

  • frame_num – the number of frames to be extracted uniformly from the video. Only works when frame_sampling_method is “uniform”. If it’s 1, only the middle frame will be extracted. If it’s 2, only the first and the last frames will be extracted. If it’s larger than 2, in addition to the first and the last frames, other frames will be extracted uniformly within the video duration.

  • tag_field_name – the field name to store the tags. It’s “__dj__video_frame_tags__” by default.

  • args – extra args

  • kwargs – extra args

process_single(sample, rank=None, context=False)[source]

For sample level, sample –> sample

Parameters:

sample – sample to process

Returns:

processed sample

class data_juicer.ops.mapper.WhitespaceNormalizationMapper(*args, **kwargs)[source]

Bases: Mapper

Mapper to normalize different kinds of whitespaces to whitespace ‘ ‘ (0x20) in text samples.

Different kinds of whitespaces can be found here: https://en.wikipedia.org/wiki/Whitespace_character

__init__(*args, **kwargs)[source]

Initialization method.

Parameters:
  • args – extra args

  • kwargs – extra args

process_batched(samples)[source]
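
Example

A final sketch: exotic whitespace code points are mapped to the plain space 0x20; the dict-of-lists sample format is assumed:

    from data_juicer.ops.mapper import WhitespaceNormalizationMapper

    # U+00A0 (no-break space) and U+2009 (thin space) become ordinary spaces.
    op = WhitespaceNormalizationMapper()
    samples = {'text': ['hello\u00a0world\u2009!']}
    out = op.process_batched(samples)  # {'text': ['hello world !']}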