data_juicer.ops package
Subpackages
- data_juicer.ops.aggregator package
- Submodules
- data_juicer.ops.aggregator.entity_attribute_aggregator module
EntityAttributeAggregator
EntityAttributeAggregator.DEFAULT_SYSTEM_TEMPLATE
EntityAttributeAggregator.DEFAULT_EXAMPLE_PROMPT
EntityAttributeAggregator.DEFAULT_INPUT_TEMPLATE
EntityAttributeAggregator.DEFAULT_OUTPUT_PATTERN_TEMPLATE
EntityAttributeAggregator.__init__()
EntityAttributeAggregator.parse_output()
EntityAttributeAggregator.attribute_summary()
EntityAttributeAggregator.process_single()
- data_juicer.ops.aggregator.meta_tags_aggregator module
MetaTagsAggregator
MetaTagsAggregator.DEFAULT_SYSTEM_PROMPT
MetaTagsAggregator.DEFAULT_INPUT_TEMPLATE
MetaTagsAggregator.DEFAULT_TARGET_TAG_TEMPLATE
MetaTagsAggregator.DEFAULT_TAG_TEMPLATE
MetaTagsAggregator.DEFAULT_OUTPUT_PATTERN
MetaTagsAggregator.__init__()
MetaTagsAggregator.parse_output()
MetaTagsAggregator.meta_map()
MetaTagsAggregator.process_single()
- data_juicer.ops.aggregator.most_relevant_entities_aggregator module
MostRelevantEntitiesAggregator
MostRelevantEntitiesAggregator.DEFAULT_SYSTEM_TEMPLATE
MostRelevantEntitiesAggregator.DEFAULT_INPUT_TEMPLATE
MostRelevantEntitiesAggregator.DEFAULT_OUTPUT_PATTERN
MostRelevantEntitiesAggregator.__init__()
MostRelevantEntitiesAggregator.parse_output()
MostRelevantEntitiesAggregator.query_most_relevant_entities()
MostRelevantEntitiesAggregator.process_single()
- data_juicer.ops.aggregator.nested_aggregator module
- Module contents
NestedAggregator
MetaTagsAggregator
MetaTagsAggregator.DEFAULT_SYSTEM_PROMPT
MetaTagsAggregator.DEFAULT_INPUT_TEMPLATE
MetaTagsAggregator.DEFAULT_TARGET_TAG_TEMPLATE
MetaTagsAggregator.DEFAULT_TAG_TEMPLATE
MetaTagsAggregator.DEFAULT_OUTPUT_PATTERN
MetaTagsAggregator.__init__()
MetaTagsAggregator.parse_output()
MetaTagsAggregator.meta_map()
MetaTagsAggregator.process_single()
EntityAttributeAggregator
EntityAttributeAggregator.DEFAULT_SYSTEM_TEMPLATE
EntityAttributeAggregator.DEFAULT_EXAMPLE_PROMPT
EntityAttributeAggregator.DEFAULT_INPUT_TEMPLATE
EntityAttributeAggregator.DEFAULT_OUTPUT_PATTERN_TEMPLATE
EntityAttributeAggregator.__init__()
EntityAttributeAggregator.parse_output()
EntityAttributeAggregator.attribute_summary()
EntityAttributeAggregator.process_single()
MostRelevantEntitiesAggregator
MostRelevantEntitiesAggregator.DEFAULT_SYSTEM_TEMPLATE
MostRelevantEntitiesAggregator.DEFAULT_INPUT_TEMPLATE
MostRelevantEntitiesAggregator.DEFAULT_OUTPUT_PATTERN
MostRelevantEntitiesAggregator.__init__()
MostRelevantEntitiesAggregator.parse_output()
MostRelevantEntitiesAggregator.query_most_relevant_entities()
MostRelevantEntitiesAggregator.process_single()
- data_juicer.ops.common package
- Submodules
- data_juicer.ops.common.helper_func module
- data_juicer.ops.common.prompt2prompt_pipeline module
rescale_noise_cfg()
Prompt2PromptPipeline
P2PCrossAttnProcessor
AttentionControl
create_controller()
EmptyControl
AttentionStore
LocalBlend
AttentionControlEdit
AttentionReplace
AttentionRefine
AttentionReweight
update_alpha_time_word()
get_time_words_attention_alpha()
get_word_inds()
get_replacement_mapper_()
get_replacement_mapper()
get_equalizer()
ScoreParams
get_matrix()
get_traceback_matrix()
global_align()
get_aligned_sequences()
get_mapper()
get_refinement_mapper()
- data_juicer.ops.common.special_characters module
- Module contents
- data_juicer.ops.deduplicator package
- Submodules
- data_juicer.ops.deduplicator.document_deduplicator module
- data_juicer.ops.deduplicator.document_minhash_deduplicator module
- data_juicer.ops.deduplicator.document_simhash_deduplicator module
- data_juicer.ops.deduplicator.image_deduplicator module
- data_juicer.ops.deduplicator.ray_basic_deduplicator module
- data_juicer.ops.deduplicator.ray_bts_minhash_deduplicator module
IdGenerator
EdgeBuffer
BTSUnionFind
BTSUnionFind.__init__()
BTSUnionFind.add_key_value_pairs()
BTSUnionFind.flush_key_value_pairs()
BTSUnionFind.balanced_union_find()
BTSUnionFind.distribute_edge()
BTSUnionFind.set_edge_buffer()
BTSUnionFind.edge_redistribution()
BTSUnionFind.communication()
BTSUnionFind.find()
BTSUnionFind.union()
BTSUnionFind.union_list()
BTSUnionFind.rebalancing()
BTSUnionFind.squeeze()
BTSUnionFind.dup_idx()
get_remote_classes()
RayBTSMinhashDeduplicator
- data_juicer.ops.deduplicator.ray_document_deduplicator module
- data_juicer.ops.deduplicator.ray_image_deduplicator module
- data_juicer.ops.deduplicator.ray_video_deduplicator module
- data_juicer.ops.deduplicator.video_deduplicator module
- Module contents
- data_juicer.ops.filter package
- Submodules
- data_juicer.ops.filter.alphanumeric_filter module
- data_juicer.ops.filter.audio_duration_filter module
- data_juicer.ops.filter.audio_nmf_snr_filter module
- data_juicer.ops.filter.audio_size_filter module
- data_juicer.ops.filter.average_line_length_filter module
- data_juicer.ops.filter.character_repetition_filter module
- data_juicer.ops.filter.flagged_words_filter module
- data_juicer.ops.filter.general_field_filter module
- data_juicer.ops.filter.image_aesthetics_filter module
- data_juicer.ops.filter.image_aspect_ratio_filter module
- data_juicer.ops.filter.image_face_count_filter module
- data_juicer.ops.filter.image_face_ratio_filter module
- data_juicer.ops.filter.image_nsfw_filter module
- data_juicer.ops.filter.image_pair_similarity_filter module
- data_juicer.ops.filter.image_shape_filter module
- data_juicer.ops.filter.image_size_filter module
- data_juicer.ops.filter.image_text_matching_filter module
- data_juicer.ops.filter.image_text_similarity_filter module
- data_juicer.ops.filter.image_watermark_filter module
- data_juicer.ops.filter.language_id_score_filter module
- data_juicer.ops.filter.llm_difficulty_score_filter module
LLMDifficultyScoreFilter
LLMDifficultyScoreFilter.DEFAULT_SYSTEM_PROMPT
LLMDifficultyScoreFilter.DEFAULT_INPUT_TEMPLATE
LLMDifficultyScoreFilter.DEFAULT_FIELD_TEMPLATE
LLMDifficultyScoreFilter.__init__()
LLMDifficultyScoreFilter.build_input()
LLMDifficultyScoreFilter.parse_output()
LLMDifficultyScoreFilter.compute_stats_single()
LLMDifficultyScoreFilter.process_single()
- data_juicer.ops.filter.llm_quality_score_filter module
LLMQualityScoreFilter
LLMQualityScoreFilter.DEFAULT_SYSTEM_PROMPT
LLMQualityScoreFilter.DEFAULT_INPUT_TEMPLATE
LLMQualityScoreFilter.DEFAULT_FIELD_TEMPLATE
LLMQualityScoreFilter.__init__()
LLMQualityScoreFilter.build_input()
LLMQualityScoreFilter.parse_output()
LLMQualityScoreFilter.compute_stats_single()
LLMQualityScoreFilter.process_single()
- data_juicer.ops.filter.maximum_line_length_filter module
- data_juicer.ops.filter.perplexity_filter module
- data_juicer.ops.filter.phrase_grounding_recall_filter module
- data_juicer.ops.filter.special_characters_filter module
- data_juicer.ops.filter.specified_field_filter module
- data_juicer.ops.filter.specified_numeric_field_filter module
- data_juicer.ops.filter.stopwords_filter module
- data_juicer.ops.filter.suffix_filter module
- data_juicer.ops.filter.text_action_filter module
- data_juicer.ops.filter.text_entity_dependency_filter module
- data_juicer.ops.filter.text_length_filter module
- data_juicer.ops.filter.text_pair_similarity_filter module
- data_juicer.ops.filter.token_num_filter module
- data_juicer.ops.filter.video_aesthetics_filter module
- data_juicer.ops.filter.video_aspect_ratio_filter module
- data_juicer.ops.filter.video_duration_filter module
- data_juicer.ops.filter.video_frames_text_similarity_filter module
- data_juicer.ops.filter.video_motion_score_filter module
- data_juicer.ops.filter.video_motion_score_raft_filter module
- data_juicer.ops.filter.video_nsfw_filter module
- data_juicer.ops.filter.video_ocr_area_ratio_filter module
- data_juicer.ops.filter.video_resolution_filter module
- data_juicer.ops.filter.video_tagging_from_frames_filter module
- data_juicer.ops.filter.video_watermark_filter module
- data_juicer.ops.filter.word_repetition_filter module
- data_juicer.ops.filter.words_num_filter module
- Module contents
AlphanumericFilter
AudioDurationFilter
AudioNMFSNRFilter
AudioSizeFilter
AverageLineLengthFilter
CharacterRepetitionFilter
FlaggedWordFilter
ImageAestheticsFilter
ImageAspectRatioFilter
ImageFaceCountFilter
ImageFaceRatioFilter
ImageNSFWFilter
ImagePairSimilarityFilter
ImageShapeFilter
ImageSizeFilter
ImageTextMatchingFilter
ImageTextSimilarityFilter
ImageWatermarkFilter
LanguageIDScoreFilter
LLMQualityScoreFilter
LLMQualityScoreFilter.DEFAULT_SYSTEM_PROMPT
LLMQualityScoreFilter.DEFAULT_INPUT_TEMPLATE
LLMQualityScoreFilter.DEFAULT_FIELD_TEMPLATE
LLMQualityScoreFilter.__init__()
LLMQualityScoreFilter.build_input()
LLMQualityScoreFilter.parse_output()
LLMQualityScoreFilter.compute_stats_single()
LLMQualityScoreFilter.process_single()
LLMDifficultyScoreFilter
LLMDifficultyScoreFilter.DEFAULT_SYSTEM_PROMPT
LLMDifficultyScoreFilter.DEFAULT_INPUT_TEMPLATE
LLMDifficultyScoreFilter.DEFAULT_FIELD_TEMPLATE
LLMDifficultyScoreFilter.__init__()
LLMDifficultyScoreFilter.build_input()
LLMDifficultyScoreFilter.parse_output()
LLMDifficultyScoreFilter.compute_stats_single()
LLMDifficultyScoreFilter.process_single()
MaximumLineLengthFilter
PerplexityFilter
PhraseGroundingRecallFilter
SpecialCharactersFilter
SpecifiedFieldFilter
SpecifiedNumericFieldFilter
StopWordsFilter
SuffixFilter
TextActionFilter
TextEntityDependencyFilter
TextLengthFilter
TextPairSimilarityFilter
TokenNumFilter
VideoAestheticsFilter
VideoAspectRatioFilter
VideoDurationFilter
VideoFramesTextSimilarityFilter
VideoMotionScoreFilter
VideoMotionScoreRaftFilter
VideoNSFWFilter
VideoOcrAreaRatioFilter
VideoResolutionFilter
VideoTaggingFromFramesFilter
VideoWatermarkFilter
WordRepetitionFilter
WordsNumFilter
GeneralFieldFilter
- data_juicer.ops.grouper package
- data_juicer.ops.mapper package
- Subpackages
- Submodules
- data_juicer.ops.mapper.audio_add_gaussian_noise_mapper module
- data_juicer.ops.mapper.audio_ffmpeg_wrapped_mapper module
- data_juicer.ops.mapper.calibrate_qa_mapper module
CalibrateQAMapper
CalibrateQAMapper.DEFAULT_SYSTEM_PROMPT
CalibrateQAMapper.DEFAULT_INPUT_TEMPLATE
CalibrateQAMapper.DEFAULT_REFERENCE_TEMPLATE
CalibrateQAMapper.DEFAULT_QA_PAIR_TEMPLATE
CalibrateQAMapper.DEFAULT_OUTPUT_PATTERN
CalibrateQAMapper.__init__()
CalibrateQAMapper.build_input()
CalibrateQAMapper.parse_output()
CalibrateQAMapper.process_single()
- data_juicer.ops.mapper.calibrate_query_mapper module
- data_juicer.ops.mapper.calibrate_response_mapper module
- data_juicer.ops.mapper.chinese_convert_mapper module
- data_juicer.ops.mapper.clean_copyright_mapper module
- data_juicer.ops.mapper.clean_email_mapper module
- data_juicer.ops.mapper.clean_html_mapper module
- data_juicer.ops.mapper.clean_ip_mapper module
- data_juicer.ops.mapper.clean_links_mapper module
- data_juicer.ops.mapper.dialog_intent_detection_mapper module
DialogIntentDetectionMapper
DialogIntentDetectionMapper.DEFAULT_SYSTEM_PROMPT
DialogIntentDetectionMapper.DEFAULT_QUERY_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_RESPONSE_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_CANDIDATES_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_ANALYSIS_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_LABELS_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_ANALYSIS_PATTERN
DialogIntentDetectionMapper.DEFAULT_LABELS_PATTERN
DialogIntentDetectionMapper.__init__()
DialogIntentDetectionMapper.build_input()
DialogIntentDetectionMapper.parse_output()
DialogIntentDetectionMapper.process_single()
- data_juicer.ops.mapper.dialog_sentiment_detection_mapper module
DialogSentimentDetectionMapper
DialogSentimentDetectionMapper.DEFAULT_SYSTEM_PROMPT
DialogSentimentDetectionMapper.DEFAULT_QUERY_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_RESPONSE_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_CANDIDATES_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_ANALYSIS_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_LABELS_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_ANALYSIS_PATTERN
DialogSentimentDetectionMapper.DEFAULT_LABELS_PATTERN
DialogSentimentDetectionMapper.__init__()
DialogSentimentDetectionMapper.build_input()
DialogSentimentDetectionMapper.parse_output()
DialogSentimentDetectionMapper.process_single()
- data_juicer.ops.mapper.dialog_sentiment_intensity_mapper module
DialogSentimentIntensityMapper
DialogSentimentIntensityMapper.DEFAULT_SYSTEM_PROMPT
DialogSentimentIntensityMapper.DEFAULT_QUERY_TEMPLATE
DialogSentimentIntensityMapper.DEFAULT_RESPONSE_TEMPLATE
DialogSentimentIntensityMapper.DEFAULT_ANALYSIS_TEMPLATE
DialogSentimentIntensityMapper.DEFAULT_INTENSITY_TEMPLATE
DialogSentimentIntensityMapper.DEFAULT_ANALYSIS_PATTERN
DialogSentimentIntensityMapper.DEFAULT_INTENSITY_PATTERN
DialogSentimentIntensityMapper.__init__()
DialogSentimentIntensityMapper.build_input()
DialogSentimentIntensityMapper.parse_output()
DialogSentimentIntensityMapper.process_single()
- data_juicer.ops.mapper.dialog_topic_detection_mapper module
DialogTopicDetectionMapper
DialogTopicDetectionMapper.DEFAULT_SYSTEM_PROMPT
DialogTopicDetectionMapper.DEFAULT_QUERY_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_RESPONSE_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_CANDIDATES_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_ANALYSIS_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_LABELS_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_ANALYSIS_PATTERN
DialogTopicDetectionMapper.DEFAULT_LABELS_PATTERN
DialogTopicDetectionMapper.__init__()
DialogTopicDetectionMapper.build_input()
DialogTopicDetectionMapper.parse_output()
DialogTopicDetectionMapper.process_single()
- data_juicer.ops.mapper.expand_macro_mapper module
- data_juicer.ops.mapper.extract_entity_attribute_mapper module
ExtractEntityAttributeMapper
ExtractEntityAttributeMapper.DEFAULT_SYSTEM_PROMPT_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_INPUT_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_ATTR_PATTERN_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_DEMON_PATTERN
ExtractEntityAttributeMapper.__init__()
ExtractEntityAttributeMapper.parse_output()
ExtractEntityAttributeMapper.process_single()
- data_juicer.ops.mapper.extract_entity_relation_mapper module
ExtractEntityRelationMapper
ExtractEntityRelationMapper.DEFAULT_PROMPT_TEMPLATE
ExtractEntityRelationMapper.DEFAULT_CONTINUE_PROMPT
ExtractEntityRelationMapper.DEFAULT_IF_LOOP_PROMPT
ExtractEntityRelationMapper.DEFAULT_ENTITY_TYPES
ExtractEntityRelationMapper.DEFAULT_TUPLE_DELIMITER
ExtractEntityRelationMapper.DEFAULT_RECORD_DELIMITER
ExtractEntityRelationMapper.DEFAULT_COMPLETION_DELIMITER
ExtractEntityRelationMapper.DEFAULT_ENTITY_PATTERN
ExtractEntityRelationMapper.DEFAULT_RELATION_PATTERN
ExtractEntityRelationMapper.__init__()
ExtractEntityRelationMapper.parse_output()
ExtractEntityRelationMapper.add_message()
ExtractEntityRelationMapper.light_rag_extraction()
ExtractEntityRelationMapper.process_single()
- data_juicer.ops.mapper.extract_event_mapper module
- data_juicer.ops.mapper.extract_keyword_mapper module
- data_juicer.ops.mapper.extract_nickname_mapper module
- data_juicer.ops.mapper.extract_support_text_mapper module
- data_juicer.ops.mapper.extract_tables_from_html_mapper module
- data_juicer.ops.mapper.fix_unicode_mapper module
- data_juicer.ops.mapper.generate_qa_from_examples_mapper module
GenerateQAFromExamplesMapper
GenerateQAFromExamplesMapper.DEFAULT_SYSTEM_PROMPT
GenerateQAFromExamplesMapper.DEFAULT_INPUT_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_EXAMPLE_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_QA_PAIR_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_OUTPUT_PATTERN
GenerateQAFromExamplesMapper.__init__()
GenerateQAFromExamplesMapper.build_input()
GenerateQAFromExamplesMapper.parse_output()
GenerateQAFromExamplesMapper.process_single()
- data_juicer.ops.mapper.generate_qa_from_text_mapper module
- data_juicer.ops.mapper.image_blur_mapper module
- data_juicer.ops.mapper.image_captioning_from_gpt4v_mapper module
- data_juicer.ops.mapper.image_captioning_mapper module
- data_juicer.ops.mapper.image_diffusion_mapper module
- data_juicer.ops.mapper.image_face_blur_mapper module
- data_juicer.ops.mapper.image_remove_background_mapper module
- data_juicer.ops.mapper.image_segment_mapper module
- data_juicer.ops.mapper.image_tagging_mapper module
- data_juicer.ops.mapper.imgdiff_difference_area_generator_mapper module
- data_juicer.ops.mapper.imgdiff_difference_caption_generator_mapper module
- data_juicer.ops.mapper.mllm_mapper module
- data_juicer.ops.mapper.nlpaug_en_mapper module
- data_juicer.ops.mapper.nlpcda_zh_mapper module
- data_juicer.ops.mapper.optimize_qa_mapper module
- data_juicer.ops.mapper.optimize_query_mapper module
- data_juicer.ops.mapper.optimize_response_mapper module
- data_juicer.ops.mapper.pair_preference_mapper module
- data_juicer.ops.mapper.punctuation_normalization_mapper module
- data_juicer.ops.mapper.python_file_mapper module
- data_juicer.ops.mapper.python_lambda_mapper module
- data_juicer.ops.mapper.query_intent_detection_mapper module
- data_juicer.ops.mapper.query_sentiment_detection_mapper module
- data_juicer.ops.mapper.query_topic_detection_mapper module
- data_juicer.ops.mapper.relation_identity_mapper module
- data_juicer.ops.mapper.remove_bibliography_mapper module
- data_juicer.ops.mapper.remove_comments_mapper module
- data_juicer.ops.mapper.remove_header_mapper module
- data_juicer.ops.mapper.remove_long_words_mapper module
- data_juicer.ops.mapper.remove_non_chinese_character_mapper module
- data_juicer.ops.mapper.remove_repeat_sentences_mapper module
- data_juicer.ops.mapper.remove_specific_chars_mapper module
- data_juicer.ops.mapper.remove_table_text_mapper module
- data_juicer.ops.mapper.remove_words_with_incorrect_substrings_mapper module
- data_juicer.ops.mapper.replace_content_mapper module
- data_juicer.ops.mapper.sdxl_prompt2prompt_mapper module
- data_juicer.ops.mapper.sentence_augmentation_mapper module
- data_juicer.ops.mapper.sentence_split_mapper module
- data_juicer.ops.mapper.text_chunk_mapper module
- data_juicer.ops.mapper.video_captioning_from_audio_mapper module
- data_juicer.ops.mapper.video_captioning_from_frames_mapper module
- data_juicer.ops.mapper.video_captioning_from_summarizer_mapper module
- data_juicer.ops.mapper.video_captioning_from_video_mapper module
- data_juicer.ops.mapper.video_extract_frames_mapper module
- data_juicer.ops.mapper.video_face_blur_mapper module
- data_juicer.ops.mapper.video_ffmpeg_wrapped_mapper module
- data_juicer.ops.mapper.video_remove_watermark_mapper module
- data_juicer.ops.mapper.video_resize_aspect_ratio_mapper module
- data_juicer.ops.mapper.video_resize_resolution_mapper module
- data_juicer.ops.mapper.video_split_by_duration_mapper module
- data_juicer.ops.mapper.video_split_by_key_frame_mapper module
- data_juicer.ops.mapper.video_split_by_scene_mapper module
- data_juicer.ops.mapper.video_tagging_from_audio_mapper module
- data_juicer.ops.mapper.video_tagging_from_frames_mapper module
- data_juicer.ops.mapper.whitespace_normalization_mapper module
- Module contents
AudioAddGaussianNoiseMapper
AudioFFmpegWrappedMapper
CalibrateQAMapper
CalibrateQAMapper.DEFAULT_SYSTEM_PROMPT
CalibrateQAMapper.DEFAULT_INPUT_TEMPLATE
CalibrateQAMapper.DEFAULT_REFERENCE_TEMPLATE
CalibrateQAMapper.DEFAULT_QA_PAIR_TEMPLATE
CalibrateQAMapper.DEFAULT_OUTPUT_PATTERN
CalibrateQAMapper.__init__()
CalibrateQAMapper.build_input()
CalibrateQAMapper.parse_output()
CalibrateQAMapper.process_single()
CalibrateQueryMapper
CalibrateResponseMapper
ChineseConvertMapper
CleanCopyrightMapper
CleanEmailMapper
CleanHtmlMapper
CleanIpMapper
CleanLinksMapper
DialogIntentDetectionMapper
DialogIntentDetectionMapper.DEFAULT_SYSTEM_PROMPT
DialogIntentDetectionMapper.DEFAULT_QUERY_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_RESPONSE_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_CANDIDATES_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_ANALYSIS_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_LABELS_TEMPLATE
DialogIntentDetectionMapper.DEFAULT_ANALYSIS_PATTERN
DialogIntentDetectionMapper.DEFAULT_LABELS_PATTERN
DialogIntentDetectionMapper.__init__()
DialogIntentDetectionMapper.build_input()
DialogIntentDetectionMapper.parse_output()
DialogIntentDetectionMapper.process_single()
DialogSentimentDetectionMapper
DialogSentimentDetectionMapper.DEFAULT_SYSTEM_PROMPT
DialogSentimentDetectionMapper.DEFAULT_QUERY_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_RESPONSE_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_CANDIDATES_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_ANALYSIS_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_LABELS_TEMPLATE
DialogSentimentDetectionMapper.DEFAULT_ANALYSIS_PATTERN
DialogSentimentDetectionMapper.DEFAULT_LABELS_PATTERN
DialogSentimentDetectionMapper.__init__()
DialogSentimentDetectionMapper.build_input()
DialogSentimentDetectionMapper.parse_output()
DialogSentimentDetectionMapper.process_single()
DialogSentimentIntensityMapper
DialogSentimentIntensityMapper.DEFAULT_SYSTEM_PROMPT
DialogSentimentIntensityMapper.DEFAULT_QUERY_TEMPLATE
DialogSentimentIntensityMapper.DEFAULT_RESPONSE_TEMPLATE
DialogSentimentIntensityMapper.DEFAULT_ANALYSIS_TEMPLATE
DialogSentimentIntensityMapper.DEFAULT_INTENSITY_TEMPLATE
DialogSentimentIntensityMapper.DEFAULT_ANALYSIS_PATTERN
DialogSentimentIntensityMapper.DEFAULT_INTENSITY_PATTERN
DialogSentimentIntensityMapper.__init__()
DialogSentimentIntensityMapper.build_input()
DialogSentimentIntensityMapper.parse_output()
DialogSentimentIntensityMapper.process_single()
DialogTopicDetectionMapper
DialogTopicDetectionMapper.DEFAULT_SYSTEM_PROMPT
DialogTopicDetectionMapper.DEFAULT_QUERY_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_RESPONSE_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_CANDIDATES_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_ANALYSIS_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_LABELS_TEMPLATE
DialogTopicDetectionMapper.DEFAULT_ANALYSIS_PATTERN
DialogTopicDetectionMapper.DEFAULT_LABELS_PATTERN
DialogTopicDetectionMapper.__init__()
DialogTopicDetectionMapper.build_input()
DialogTopicDetectionMapper.parse_output()
DialogTopicDetectionMapper.process_single()
Difference_Area_Generator_Mapper
ExtractEntityAttributeMapper
ExtractEntityAttributeMapper.DEFAULT_SYSTEM_PROMPT_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_INPUT_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_ATTR_PATTERN_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_DEMON_PATTERN
ExtractEntityAttributeMapper.__init__()
ExtractEntityAttributeMapper.parse_output()
ExtractEntityAttributeMapper.process_single()
ExtractEntityRelationMapper
ExtractEntityRelationMapper.DEFAULT_PROMPT_TEMPLATE
ExtractEntityRelationMapper.DEFAULT_CONTINUE_PROMPT
ExtractEntityRelationMapper.DEFAULT_IF_LOOP_PROMPT
ExtractEntityRelationMapper.DEFAULT_ENTITY_TYPES
ExtractEntityRelationMapper.DEFAULT_TUPLE_DELIMITER
ExtractEntityRelationMapper.DEFAULT_RECORD_DELIMITER
ExtractEntityRelationMapper.DEFAULT_COMPLETION_DELIMITER
ExtractEntityRelationMapper.DEFAULT_ENTITY_PATTERN
ExtractEntityRelationMapper.DEFAULT_RELATION_PATTERN
ExtractEntityRelationMapper.__init__()
ExtractEntityRelationMapper.parse_output()
ExtractEntityRelationMapper.add_message()
ExtractEntityRelationMapper.light_rag_extraction()
ExtractEntityRelationMapper.process_single()
ExtractEventMapper
ExtractKeywordMapper
ExtractNicknameMapper
ExtractSupportTextMapper
ExtractTablesFromHtmlMapper
FixUnicodeMapper
GenerateQAFromExamplesMapper
GenerateQAFromExamplesMapper.DEFAULT_SYSTEM_PROMPT
GenerateQAFromExamplesMapper.DEFAULT_INPUT_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_EXAMPLE_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_QA_PAIR_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_OUTPUT_PATTERN
GenerateQAFromExamplesMapper.__init__()
GenerateQAFromExamplesMapper.build_input()
GenerateQAFromExamplesMapper.parse_output()
GenerateQAFromExamplesMapper.process_single()
GenerateQAFromTextMapper
HumanPreferenceAnnotationMapper
ImageBlurMapper
ImageCaptioningFromGPT4VMapper
ImageCaptioningMapper
ImageDiffusionMapper
ImageFaceBlurMapper
ImageRemoveBackgroundMapper
ImageSegmentMapper
ImageTaggingMapper
MllmMapper
NlpaugEnMapper
NlpcdaZhMapper
OptimizeQAMapper
OptimizeQueryMapper
OptimizeResponseMapper
PairPreferenceMapper
PunctuationNormalizationMapper
PythonFileMapper
PythonLambdaMapper
QuerySentimentDetectionMapper
QueryIntentDetectionMapper
QueryTopicDetectionMapper
RelationIdentityMapper
RemoveBibliographyMapper
RemoveCommentsMapper
RemoveHeaderMapper
RemoveLongWordsMapper
RemoveNonChineseCharacterlMapper
RemoveRepeatSentencesMapper
RemoveSpecificCharsMapper
RemoveTableTextMapper
RemoveWordsWithIncorrectSubstringsMapper
ReplaceContentMapper
SDXLPrompt2PromptMapper
SentenceAugmentationMapper
SentenceSplitMapper
TextChunkMapper
VideoCaptioningFromAudioMapper
VideoCaptioningFromFramesMapper
VideoCaptioningFromSummarizerMapper
VideoCaptioningFromVideoMapper
VideoExtractFramesMapper
VideoFFmpegWrappedMapper
VideoFaceBlurMapper
VideoRemoveWatermarkMapper
VideoResizeAspectRatioMapper
VideoResizeResolutionMapper
VideoSplitByDurationMapper
VideoSplitByKeyFrameMapper
VideoSplitBySceneMapper
VideoTaggingFromAudioMapper
VideoTaggingFromFramesMapper
WhitespaceNormalizationMapper
- data_juicer.ops.selector package
- Submodules
- data_juicer.ops.selector.frequency_specified_field_selector module
- data_juicer.ops.selector.random_selector module
- data_juicer.ops.selector.range_specified_field_selector module
- data_juicer.ops.selector.tags_specified_field_selector module
- data_juicer.ops.selector.topk_specified_field_selector module
- Module contents
Submodules
data_juicer.ops.base_op module
- data_juicer.ops.base_op.catch_map_batches_exception(method, skip_op_error=False, op_name=None)
Provides sample-level fault tolerance for batched map operations.
- data_juicer.ops.base_op.catch_map_single_exception(method, return_sample=True, skip_op_error=False, op_name=None)
Provides sample-level fault tolerance for single-sample map operations. The input sample is expected to have batch_size = 1.
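The fault-tolerance idea can be illustrated with a minimal sketch. This is not the data_juicer implementation: the name `catch_single_exception_sketch`, the logging call, and the choice to return the input sample unchanged on failure are all assumptions made for illustration.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def catch_single_exception_sketch(method, skip_op_error=False, op_name=None):
    """Hypothetical stand-in, not the data_juicer implementation."""
    op_name = op_name or getattr(method, "__name__", "unknown_op")

    @functools.wraps(method)
    def wrapper(sample, *args, **kwargs):
        try:
            return method(sample, *args, **kwargs)
        except Exception as exc:
            if not skip_op_error:
                raise
            # Assumed fallback: log the failure and return the input unchanged.
            logger.error("Error in %s: %s", op_name, exc)
            return sample

    return wrapper


def upper_text(sample):
    # Raises KeyError when the sample has no "text" field.
    return {"text": sample["text"].upper()}


safe_upper = catch_single_exception_sketch(upper_text, skip_op_error=True)
ok = safe_upper({"text": "abc"})
bad = safe_upper({"no_text": 1})
```

With `skip_op_error=False` the same wrapper re-raises, so a failing sample stops the run instead of being passed through.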
- class data_juicer.ops.base_op.OP(*args, **kwargs)
Bases: object
- __init__(*args, **kwargs)
Base class of operators.
- Parameters:
text_key – the key name of field that stores sample texts to be processed.
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
index_key – if not None, index the samples before processing
batch_size – the batch size for processing
work_dir – the working directory for this operator
- remove_extra_parameters(param_dict, keys=None)
Call self.remove_extra_parameters(locals()) at the beginning of a mapper op's __init__ to conveniently obtain the dict of the op's init parameters.
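A rough sketch of the `locals()` pattern described here follows. The helper `remove_extra_parameters_sketch` and the class `DemoMapper` are hypothetical; the filtering rule (drop `self` and underscore-prefixed names) is an assumption, and the `keys` argument is left out because its semantics are not described above.

```python
def remove_extra_parameters_sketch(param_dict):
    """Hypothetical helper: reduce a locals() snapshot to the init arguments."""
    # Assumed rule: drop `self` and any private or dunder names such as `__class__`.
    return {
        name: value
        for name, value in param_dict.items()
        if name != "self" and not name.startswith("_")
    }


class DemoMapper:
    def __init__(self, lang="en", min_len=10):
        # Take the snapshot first, before any other local variable is defined.
        self.init_params = remove_extra_parameters_sketch(locals())


params = DemoMapper(lang="zh").init_params
```

Calling the helper as the first statement matters: any local variable created earlier in `__init__` would otherwise end up in the parameter dict.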
- class data_juicer.ops.base_op.Mapper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts data editing.
- Parameters:
text_key – the key name of field that stores sample texts to be processed.
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.base_op.Filter(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that removes specific info.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- compute_stats_single(sample, context=False)
Compute stats for the sample; these are used as metrics to decide whether to filter the sample out.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
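The two-step pattern (compute stats first, then decide from them) might look like the sketch below. It is written in plain Python without importing data_juicer; the `"stats"` key, the `text_len` metric, and the `keep_sample` helper are assumptions for illustration.

```python
def compute_stats_single_sketch(sample, context=False):
    """Attach a text-length stat to the sample (hypothetical metric)."""
    # `context` is accepted to mirror the signature but ignored in this sketch.
    stats = sample.setdefault("stats", {})  # "stats" is an assumed key name
    if "text_len" not in stats:  # skip the work if the stat already exists
        stats["text_len"] = len(sample["text"])
    return sample


def keep_sample(sample, min_len=5):
    """Decide from the stored stat whether the sample survives filtering."""
    return sample["stats"]["text_len"] >= min_len


samples = [{"text": "hi"}, {"text": "hello world"}]
kept = [s for s in map(compute_stats_single_sketch, samples) if keep_sample(s)]
```

Storing the stat on the sample keeps the decision step cheap and lets the computed values be inspected later.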
- class data_juicer.ops.base_op.Deduplicator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts deduplication.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- compute_hash(sample)
Compute hash values for the sample.
- Parameters:
sample – input sample
- Returns:
sample with computed hash value.
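As an illustration of hash-based deduplication, the sketch below hashes each sample's text and keeps the first sample per hash. The key names `"text"` and `"hash"`, the use of SHA-256, and the `drop_duplicates` helper are assumptions, not the library's behavior.

```python
import hashlib


def compute_hash_sketch(sample, text_key="text", hash_key="hash"):
    """Store a digest of the sample text on the sample (assumed key names)."""
    digest = hashlib.sha256(sample[text_key].encode("utf-8")).hexdigest()
    sample[hash_key] = digest
    return sample


def drop_duplicates(samples):
    """Keep only the first sample seen for each hash value."""
    seen, unique = set(), []
    for sample in map(compute_hash_sketch, samples):
        if sample["hash"] not in seen:
            seen.add(sample["hash"])
            unique.append(sample)
    return unique


unique = drop_duplicates([{"text": "a"}, {"text": "b"}, {"text": "a"}])
```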
- class data_juicer.ops.base_op.Selector(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts selection at the dataset level.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.base_op.Grouper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that groups samples.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.base_op.Aggregator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that aggregates samples.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
data_juicer.ops.load module
data_juicer.ops.mixins module
- class data_juicer.ops.mixins.EventDrivenMixin(*args, **kwargs)
Bases: object
Mixin for event-driven capabilities in operations.
This mixin provides functionality for registering event handlers, triggering events, and managing event polling.
- register_event_handler(event_type: str, handler: Callable)
Register a handler for a specific event type.
- Parameters:
event_type – Type of event to handle
handler – Callback function to handle the event
- trigger_event(event_type: str, data: Dict)
Trigger an event and call all registered handlers.
- Parameters:
event_type – Type of event to trigger
data – Event data to pass to handlers
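The register-then-trigger pattern described by these two methods can be sketched in a few lines. This is a minimal stand-in for illustration, not Data Juicer's implementation: the class name `SketchEventBus` and its internal `_handlers` dict are assumptions, and only the two method names and their parameters are taken from the signatures above.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class SketchEventBus:
    """Hypothetical stand-in that mimics the documented method signatures."""

    def __init__(self) -> None:
        # Maps each event type to the handlers registered for it.
        self._handlers: Dict[str, List[Callable]] = defaultdict(list)

    def register_event_handler(self, event_type: str, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def trigger_event(self, event_type: str, data: Dict) -> None:
        # Call every handler registered for this event type, in registration order.
        for handler in self._handlers.get(event_type, []):
            handler(data)


received = []
bus = SketchEventBus()
bus.register_event_handler("job_done", received.append)
bus.register_event_handler("job_done", lambda d: received.append(d["job_id"]))
bus.trigger_event("job_done", {"job_id": 7})
bus.trigger_event("unknown_event", {"job_id": 8})  # no handlers registered, so nothing runs
```

Keeping one handler list per event type lets several independent callbacks react to the same event without knowing about each other.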
- start_polling(event_type: str, poll_func: Callable, interval: int = 60)[source]¶
Start polling for a specific event type.
- Parameters:
event_type – Type of event to poll for
poll_func – Function to call for polling
interval – Polling interval in seconds
- stop_polling(event_type: str)[source]¶
Stop polling for a specific event type.
- Parameters:
event_type – Type of event to stop polling for
- wait_for_completion(condition_func: Callable[[], bool], timeout: int = 3600, poll_interval: int = 10, error_message: str = 'Operation timed out')[source]¶
Wait for a condition to be met.
- Parameters:
condition_func – Function that returns True when condition is met
timeout – Maximum time to wait in seconds
poll_interval – Polling interval in seconds
error_message – Error message to raise on timeout
- Raises:
TimeoutError – If the condition is not met within the timeout
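The polling behaviour documented for `wait_for_completion` can be approximated as follows. This is a sketch under stated assumptions, not the library's code: the function name `wait_for_completion_sketch` is hypothetical, and the use of `time.monotonic()` for the deadline is a choice made here. The parameter names, defaults, and the `TimeoutError` come from the signature above.

```python
import time
from typing import Callable


def wait_for_completion_sketch(
    condition_func: Callable[[], bool],
    timeout: float = 3600,
    poll_interval: float = 10,
    error_message: str = "Operation timed out",
) -> None:
    """Poll condition_func until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while not condition_func():
        if time.monotonic() >= deadline:
            raise TimeoutError(error_message)
        time.sleep(poll_interval)


# Usage: the condition becomes true on the third check.
checks = {"count": 0}


def ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


wait_for_completion_sketch(ready, timeout=5, poll_interval=0.01)
```

A monotonic clock is used so that the deadline is unaffected by system clock adjustments during a long wait.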
- class data_juicer.ops.mixins.NotificationMixin(*args, **kwargs)[source]¶
Bases:
object
Mixin for sending notifications through various channels.
This mixin provides functionality for sending notifications via email, Slack, DingTalk, and other platforms.
Notification configuration can be specified as a “notification_config” parameter within an operator (for backward compatibility):
```yaml
process:
  - some_mapper:
      notification_config:
        enabled: true
        email:
          # … email settings …
```
For security best practices, sensitive information like passwords and tokens should be provided via environment variables:
Email: set ‘DATA_JUICER_EMAIL_PASSWORD’ environment variable or service-specific ‘DATA_JUICER_SMTP_SERVER_NAME_PASSWORD’
Slack: set ‘DATA_JUICER_SLACK_WEBHOOK’ environment variable
DingTalk: set ‘DATA_JUICER_DINGTALK_TOKEN’ and ‘DATA_JUICER_DINGTALK_SECRET’ environment variables
For even more secure email authentication, you can use TLS client certificates instead of passwords:
1. Generate a client certificate and key (example using OpenSSL):
```bash
# Generate a private key
openssl genrsa -out client.key 2048

# Generate a certificate signing request (CSR)
openssl req -new -key client.key -out client.csr

# Generate a self-signed certificate
openssl x509 -req -days 365 -in client.csr -signkey client.key -out client.crt
```
2. Configure your SMTP server to accept this client certificate for authentication.
3. Configure Data Juicer to use certificate authentication:
```yaml
notification:
  enabled: true
  email:
    use_cert_auth: true
    client_cert_file: "/path/to/client.crt"
    client_key_file: "/path/to/client.key"
    smtp_server: "smtp.example.com"
    smtp_port: 587
    sender_email: "notifications@example.com"
    recipients: ["recipient@example.com"]
```
Or use environment variables:
```bash
export DATA_JUICER_EMAIL_CERT="/path/to/client.crt"
export DATA_JUICER_EMAIL_KEY="/path/to/client.key"
```
For maximum connection security, you can use a direct SSL connection instead of STARTTLS by enabling the ‘use_ssl’ option:
```yaml
notification:
  enabled: true
  email:
    use_ssl: true
    smtp_port: 465  # Common port for SMTP over SSL
    # … other email configuration …
```
This establishes an encrypted connection from the beginning, rather than starting with an unencrypted connection and upgrading to TLS as with STARTTLS. Note that this option can be combined with certificate authentication for maximum security.
The email notification system supports various email server configurations through a flexible configuration system. Here are some examples for different servers:
Standard SMTP with STARTTLS:
```yaml
notification:
  enabled: true
  email:
    smtp_server: "smtp.example.com"
    smtp_port: 587
    username: "your.username@example.com"
    sender_email: "your.username@example.com"
    sender_name: "Your Name"  # Optional
    recipients: ["recipient1@example.com", "recipient2@example.com"]
```
Direct SSL Connection (e.g., Gmail):
```yaml
notification:
  enabled: true
  email:
    smtp_server: "smtp.gmail.com"
    smtp_port: 465
    use_ssl: true
    username: "your.username@gmail.com"
    sender_email: "your.username@gmail.com"
    sender_name: "Your Name"
    recipients: ["recipient1@example.com", "recipient2@example.com"]
```
Alibaba Email Server:
```yaml
notification:
  enabled: true
  email:
    smtp_server: "smtp.alibaba-inc.com"
    smtp_port: 465
    username: "your.username@alibaba-inc.com"
    sender_email: "your.username@alibaba-inc.com"
    sender_name: "Your Name"
    recipient_separator: ";"  # Use semicolons to separate recipients
    recipients: ["recipient1@example.com", "recipient2@example.com"]
```
Environment variable usage examples:
```bash
# General email password
export DATA_JUICER_EMAIL_PASSWORD="your_email_password"

# Server-specific passwords (preferred for clarity)
export DATA_JUICER_SMTP_GMAIL_COM_PASSWORD="your_gmail_password"
export DATA_JUICER_SMTP_ALIBABA_INC_COM_PASSWORD="your_alibaba_password"

# Slack webhook
export DATA_JUICER_SLACK_WEBHOOK="your_slack_webhook_url"

# DingTalk credentials
export DATA_JUICER_DINGTALK_TOKEN="your_dingtalk_token"
export DATA_JUICER_DINGTALK_SECRET="your_dingtalk_secret"
```
If environment variables are not set, the system will fall back to using values from the configuration file, but this is less secure and not recommended for production environments.
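The environment-variable lookup with a configuration-file fallback can be sketched as below. This is an illustration under assumptions, not the mixin's actual code: the helper names are hypothetical, the server-specific variable name is inferred from the two examples above (`smtp.gmail.com` and `smtp.alibaba-inc.com`), the lookup order (server-specific, then general, then config) is assumed, and the `password` config key is a placeholder.

```python
import os
import re
from typing import Mapping, Optional


def server_password_var(smtp_server: str) -> str:
    """Build a server-specific variable name, e.g. smtp.gmail.com ->
    DATA_JUICER_SMTP_GMAIL_COM_PASSWORD (pattern inferred from the examples)."""
    token = re.sub(r"[^A-Za-z0-9]+", "_", smtp_server).strip("_").upper()
    return f"DATA_JUICER_{token}_PASSWORD"


def resolve_email_password(
    email_cfg: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first password found; the lookup order is an assumption."""
    candidates = [
        env.get(server_password_var(email_cfg.get("smtp_server", ""))),
        env.get("DATA_JUICER_EMAIL_PASSWORD"),
        email_cfg.get("password"),  # hypothetical config key, used as the fallback
    ]
    return next((value for value in candidates if value), None)


cfg = {"smtp_server": "smtp.gmail.com", "password": "from_config"}
from_config = resolve_email_password(cfg, env={})
from_env = resolve_email_password(cfg, env={"DATA_JUICER_EMAIL_PASSWORD": "general"})
```

Passing `env` explicitly keeps the function easy to test without modifying the real process environment.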
- send_notification(message: str, notification_type: str | None = None, **kwargs)[source]¶
Send a notification message.
- Parameters:
message – The message to send
notification_type – The type of notification to send: email, Slack, or DingTalk. If None, nothing is sent.
**kwargs – Additional arguments to pass to the notification handler. These can override any configuration settings for this specific notification.
- Returns:
True if the notification was sent successfully, else False
- Return type:
bool
data_juicer.ops.op_fusion module¶
- data_juicer.ops.op_fusion.fuse_operators(ops, probe_res=None)[source]¶
Fuse the input ops list and return the fused ops list.
- Parameters:
ops – the corresponding list of op objects.
probe_res – the probed speed for each OP from Monitor.
- Returns:
a list of fused op objects.
- data_juicer.ops.op_fusion.fuse_filter_group(original_filter_group)[source]¶
Fuse single filter group and return the fused filter group.
- Parameters:
original_filter_group – the original filter group, including op definitions and objects.
- Returns:
the fused definitions and objects of the input filter group.
- class data_juicer.ops.op_fusion.FusedFilter(name: str, fused_filters: List)[source]¶
Bases:
Filter
A fused operator for filters.
- class data_juicer.ops.op_fusion.GeneralFusedOP(batch_size: int = 1, fused_op_list: List | None = None, *args, **kwargs)[source]¶
Bases:
OP
An explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing.
- __init__(batch_size: int = 1, fused_op_list: List | None = None, *args, **kwargs)[source]¶
Base class of operators.
- Parameters:
text_key – the key name of field that stores sample texts to be processed.
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
index_key – index the samples before process if not None
batch_size – the batch size for processing
work_dir – the working directory for this operator
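The idea of running several sequential operations on the same batch can be illustrated with plain functions. This is only a sketch of the concept, not how `GeneralFusedOP` is implemented: the columnar batch layout (a dict of lists), the helper `run_fused`, and both example operations are assumptions made for this example.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]


def run_fused(batch: Batch, fused_op_list: List[Callable[[Batch], Batch]]) -> Batch:
    """Apply each op to the batch in order, feeding each result to the next op."""
    for op in fused_op_list:
        batch = op(batch)
    return batch


def lowercase_text(batch: Batch) -> Batch:
    # Mapper-like step: edit the text column, keep other columns unchanged.
    return {**batch, "text": [t.lower() for t in batch["text"]]}


def drop_short_text(batch: Batch) -> Batch:
    # Filter-like step: keep only the rows whose text has at least 5 characters.
    keep = [len(t) >= 5 for t in batch["text"]]
    return {key: [v for v, k in zip(values, keep) if k] for key, values in batch.items()}


batch = {"text": ["Hello World", "Hi", "DATA"], "id": [1, 2, 3]}
result = run_fused(batch, [lowercase_text, drop_short_text])
```

Because every step receives the output of the previous one, the batch is passed through all operations before the next batch is touched.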
Module contents¶
- data_juicer.ops.load_ops(process_list)[source]¶
Load op list according to the process list from config file.
- Parameters:
process_list – A process list. Each item is an op name and its arguments.
- Returns:
The op instance list.
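A process list of the shape described above, where each item pairs an op name with its arguments, could be turned into instances roughly like this. The sketch does not reproduce `load_ops`: the registry, the two operator classes, and the function name `load_ops_sketch` are hypothetical stand-ins, and each item is assumed to be a single-key mapping.

```python
from typing import Any, Dict, List, Optional


class SketchLengthFilter:
    """Hypothetical operator used only for this illustration."""

    def __init__(self, min_len: int = 0) -> None:
        self.min_len = min_len


class SketchLowerMapper:
    """Hypothetical operator that takes no arguments."""


# Hypothetical name-to-class registry standing in for the real operator lookup.
SKETCH_REGISTRY = {
    "sketch_length_filter": SketchLengthFilter,
    "sketch_lower_mapper": SketchLowerMapper,
}


def load_ops_sketch(process_list: List[Dict[str, Optional[Dict[str, Any]]]]) -> List[Any]:
    ops = []
    for item in process_list:
        # Each item is assumed to hold exactly one op name and its arguments.
        op_name, args = next(iter(item.items()))
        ops.append(SKETCH_REGISTRY[op_name](**(args or {})))
    return ops


ops = load_ops_sketch([
    {"sketch_length_filter": {"min_len": 3}},
    {"sketch_lower_mapper": None},  # an op listed without arguments
])
```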
- class data_juicer.ops.Filter(*args, **kwargs)[source]¶
Bases:
OP
- __init__(*args, **kwargs)[source]¶
Base class that removes specific info.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- compute_stats_single(sample, context=False)[source]¶
Compute stats for the sample, which are used as metrics to decide whether to filter this sample.
- Parameters:
sample – input sample.
context – whether to store context information of intermediate vars in the sample temporarily.
- Returns:
sample with computed stats
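The two-step flow of a filter (compute a stat, then decide from it) might look like the following. This standalone sketch does not subclass the real `Filter`: the class name, the `stats` and `text_len` field names, and the `keep` method are assumptions, and `context` is accepted but unused here.

```python
from typing import Any, Dict

Sample = Dict[str, Any]


class TextLengthFilterSketch:
    def __init__(self, min_len: int = 5, text_key: str = "text") -> None:
        self.min_len = min_len
        self.text_key = text_key

    def compute_stats_single(self, sample: Sample, context: bool = False) -> Sample:
        # "stats" and "text_len" are hypothetical field names.
        stats = sample.setdefault("stats", {})
        if "text_len" not in stats:  # skip the work if the stat is already present
            stats["text_len"] = len(sample[self.text_key])
        return sample

    def keep(self, sample: Sample) -> bool:
        # Hypothetical decision step based on the stat computed above.
        return sample["stats"]["text_len"] >= self.min_len


flt = TextLengthFilterSketch(min_len=5)
samples = [{"text": "short"}, {"text": "hi"}, {"text": "long enough"}]
kept = [s["text"] for s in map(flt.compute_stats_single, samples) if flt.keep(s)]
```

Storing the stat on the sample separates measurement from the keep-or-drop decision, so the same value can be reused or inspected later.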
- class data_juicer.ops.Mapper(*args, **kwargs)[source]¶
Bases:
OP
- __init__(*args, **kwargs)[source]¶
Base class that conducts data editing.
- Parameters:
text_key – the key name of field that stores sample texts to be processed.
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.Deduplicator(*args, **kwargs)[source]¶
Bases:
OP
- __init__(*args, **kwargs)[source]¶
Base class that conducts deduplication.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- compute_hash(sample)[source]¶
Compute hash values for the sample.
- Parameters:
sample – input sample
- Returns:
sample with computed hash value.
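As a rough illustration of how a per-sample hash supports deduplication, the sketch below hashes each text and keeps the first sample seen for every hash. It is not a Data Juicer deduplicator: the class name, the `hash` field name, the choice of SHA-256, and the `drop_duplicates` helper are all assumptions.

```python
import hashlib
from typing import Any, Dict, List

Sample = Dict[str, Any]


class ExactDedupSketch:
    def __init__(self, text_key: str = "text") -> None:
        self.text_key = text_key

    def compute_hash(self, sample: Sample) -> Sample:
        # Store the digest under a hypothetical "hash" field and return the sample.
        text = sample[self.text_key]
        sample["hash"] = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return sample


def drop_duplicates(samples: List[Sample], dedup: ExactDedupSketch) -> List[Sample]:
    seen = set()
    unique = []
    for sample in samples:
        sample = dedup.compute_hash(sample)
        if sample["hash"] not in seen:  # keep only the first sample per hash
            seen.add(sample["hash"])
            unique.append(sample)
    return unique


data = [{"text": "a cat"}, {"text": "a dog"}, {"text": "a cat"}]
unique_texts = [s["text"] for s in drop_duplicates(data, ExactDedupSketch())]
```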
- class data_juicer.ops.Selector(*args, **kwargs)[source]¶
Bases:
OP
- __init__(*args, **kwargs)[source]¶
Base class that conducts selection at the dataset level.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.Grouper(*args, **kwargs)[source]¶
Bases:
OP
- __init__(*args, **kwargs)[source]¶
Base class that groups samples.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses
- class data_juicer.ops.Aggregator(*args, **kwargs)[source]¶
Bases:
OP
- __init__(*args, **kwargs)[source]¶
Base class that aggregates samples.
- Parameters:
text_key – the key name of field that stores sample texts to be processed
image_key – the key name of field that stores sample image list to be processed
audio_key – the key name of field that stores sample audio list to be processed
video_key – the key name of field that stores sample video list to be processed
query_key – the key name of field that stores sample queries
response_key – the key name of field that stores responses
history_key – the key name of field that stores history of queries and responses