data_juicer.ops package
Subpackages
- data_juicer.ops.aggregator package
- Submodules
- data_juicer.ops.aggregator.entity_attribute_aggregator module
EntityAttributeAggregator
EntityAttributeAggregator.DEFAULT_SYSTEM_TEMPLATE
EntityAttributeAggregator.DEFAULT_EXAMPLE_PROMPT
EntityAttributeAggregator.DEFAULT_INPUT_TEMPLATE
EntityAttributeAggregator.DEFAULT_OUTPUT_PATTERN_TEMPLATE
EntityAttributeAggregator.__init__()
EntityAttributeAggregator.parse_output()
EntityAttributeAggregator.attribute_summary()
EntityAttributeAggregator.process_single()
- data_juicer.ops.aggregator.most_relavant_entities_aggregator module
MostRelavantEntitiesAggregator
MostRelavantEntitiesAggregator.DEFAULT_SYSTEM_TEMPLATE
MostRelavantEntitiesAggregator.DEFAULT_INPUT_TEMPLATE
MostRelavantEntitiesAggregator.DEFAULT_OUTPUT_PATTERN
MostRelavantEntitiesAggregator.__init__()
MostRelavantEntitiesAggregator.parse_output()
MostRelavantEntitiesAggregator.query_most_relavant_entities()
MostRelavantEntitiesAggregator.process_single()
- data_juicer.ops.aggregator.nested_aggregator module
- Module contents
NestedAggregator
EntityAttributeAggregator
EntityAttributeAggregator.DEFAULT_SYSTEM_TEMPLATE
EntityAttributeAggregator.DEFAULT_EXAMPLE_PROMPT
EntityAttributeAggregator.DEFAULT_INPUT_TEMPLATE
EntityAttributeAggregator.DEFAULT_OUTPUT_PATTERN_TEMPLATE
EntityAttributeAggregator.__init__()
EntityAttributeAggregator.parse_output()
EntityAttributeAggregator.attribute_summary()
EntityAttributeAggregator.process_single()
MostRelavantEntitiesAggregator
MostRelavantEntitiesAggregator.DEFAULT_SYSTEM_TEMPLATE
MostRelavantEntitiesAggregator.DEFAULT_INPUT_TEMPLATE
MostRelavantEntitiesAggregator.DEFAULT_OUTPUT_PATTERN
MostRelavantEntitiesAggregator.__init__()
MostRelavantEntitiesAggregator.parse_output()
MostRelavantEntitiesAggregator.query_most_relavant_entities()
MostRelavantEntitiesAggregator.process_single()
- data_juicer.ops.common package
- data_juicer.ops.deduplicator package
- Submodules
- data_juicer.ops.deduplicator.document_deduplicator module
- data_juicer.ops.deduplicator.document_minhash_deduplicator module
- data_juicer.ops.deduplicator.document_simhash_deduplicator module
- data_juicer.ops.deduplicator.image_deduplicator module
- data_juicer.ops.deduplicator.ray_basic_deduplicator module
- data_juicer.ops.deduplicator.ray_document_deduplicator module
- data_juicer.ops.deduplicator.ray_image_deduplicator module
- data_juicer.ops.deduplicator.ray_video_deduplicator module
- data_juicer.ops.deduplicator.video_deduplicator module
- Module contents
- data_juicer.ops.filter package
- Submodules
- data_juicer.ops.filter.alphanumeric_filter module
- data_juicer.ops.filter.audio_duration_filter module
- data_juicer.ops.filter.audio_nmf_snr_filter module
- data_juicer.ops.filter.audio_size_filter module
- data_juicer.ops.filter.average_line_length_filter module
- data_juicer.ops.filter.character_repetition_filter module
- data_juicer.ops.filter.flagged_words_filter module
- data_juicer.ops.filter.image_aesthetics_filter module
- data_juicer.ops.filter.image_aspect_ratio_filter module
- data_juicer.ops.filter.image_face_count_filter module
- data_juicer.ops.filter.image_face_ratio_filter module
- data_juicer.ops.filter.image_nsfw_filter module
- data_juicer.ops.filter.image_pair_similarity_filter module
- data_juicer.ops.filter.image_shape_filter module
- data_juicer.ops.filter.image_size_filter module
- data_juicer.ops.filter.image_text_matching_filter module
- data_juicer.ops.filter.image_text_similarity_filter module
- data_juicer.ops.filter.image_watermark_filter module
- data_juicer.ops.filter.language_id_score_filter module
- data_juicer.ops.filter.maximum_line_length_filter module
- data_juicer.ops.filter.perplexity_filter module
- data_juicer.ops.filter.phrase_grounding_recall_filter module
- data_juicer.ops.filter.special_characters_filter module
- data_juicer.ops.filter.specified_field_filter module
- data_juicer.ops.filter.specified_numeric_field_filter module
- data_juicer.ops.filter.stopwords_filter module
- data_juicer.ops.filter.suffix_filter module
- data_juicer.ops.filter.text_action_filter module
- data_juicer.ops.filter.text_entity_dependency_filter module
- data_juicer.ops.filter.text_length_filter module
- data_juicer.ops.filter.token_num_filter module
- data_juicer.ops.filter.video_aesthetics_filter module
- data_juicer.ops.filter.video_aspect_ratio_filter module
- data_juicer.ops.filter.video_duration_filter module
- data_juicer.ops.filter.video_frames_text_similarity_filter module
- data_juicer.ops.filter.video_motion_score_filter module
- data_juicer.ops.filter.video_motion_score_raft_filter module
- data_juicer.ops.filter.video_nsfw_filter module
- data_juicer.ops.filter.video_ocr_area_ratio_filter module
- data_juicer.ops.filter.video_resolution_filter module
- data_juicer.ops.filter.video_tagging_from_frames_filter module
- data_juicer.ops.filter.video_watermark_filter module
- data_juicer.ops.filter.word_repetition_filter module
- data_juicer.ops.filter.words_num_filter module
- Module contents
AlphanumericFilter
AudioDurationFilter
AudioNMFSNRFilter
AudioSizeFilter
AverageLineLengthFilter
CharacterRepetitionFilter
FlaggedWordFilter
ImageAestheticsFilter
ImageAspectRatioFilter
ImageFaceCountFilter
ImageFaceRatioFilter
ImageNSFWFilter
ImagePairSimilarityFilter
ImageShapeFilter
ImageSizeFilter
ImageTextMatchingFilter
ImageTextSimilarityFilter
ImageWatermarkFilter
LanguageIDScoreFilter
MaximumLineLengthFilter
PerplexityFilter
PhraseGroundingRecallFilter
SpecialCharactersFilter
SpecifiedFieldFilter
SpecifiedNumericFieldFilter
StopWordsFilter
SuffixFilter
TextActionFilter
TextEntityDependencyFilter
TextLengthFilter
TokenNumFilter
VideoAestheticsFilter
VideoAspectRatioFilter
VideoDurationFilter
VideoFramesTextSimilarityFilter
VideoMotionScoreFilter
VideoMotionScoreRaftFilter
VideoNSFWFilter
VideoOcrAreaRatioFilter
VideoResolutionFilter
VideoTaggingFromFramesFilter
VideoWatermarkFilter
WordRepetitionFilter
WordsNumFilter
- data_juicer.ops.grouper package
- data_juicer.ops.mapper package
- Submodules
- data_juicer.ops.mapper.audio_ffmpeg_wrapped_mapper module
- data_juicer.ops.mapper.calibrate_qa_mapper module
CalibrateQAMapper
CalibrateQAMapper.DEFAULT_SYSTEM_PROMPT
CalibrateQAMapper.DEFAULT_INPUT_TEMPLATE
CalibrateQAMapper.DEFAULT_REFERENCE_TEMPLATE
CalibrateQAMapper.DEFAULT_QA_PAIR_TEMPLATE
CalibrateQAMapper.DEFAULT_OUTPUT_PATTERN
CalibrateQAMapper.__init__()
CalibrateQAMapper.build_input()
CalibrateQAMapper.parse_output()
CalibrateQAMapper.process_single()
- data_juicer.ops.mapper.calibrate_query_mapper module
- data_juicer.ops.mapper.calibrate_response_mapper module
- data_juicer.ops.mapper.chinese_convert_mapper module
- data_juicer.ops.mapper.clean_copyright_mapper module
- data_juicer.ops.mapper.clean_email_mapper module
- data_juicer.ops.mapper.clean_html_mapper module
- data_juicer.ops.mapper.clean_ip_mapper module
- data_juicer.ops.mapper.clean_links_mapper module
- data_juicer.ops.mapper.expand_macro_mapper module
- data_juicer.ops.mapper.extract_entity_attribute_mapper module
ExtractEntityAttributeMapper
ExtractEntityAttributeMapper.DEFAULT_SYSTEM_PROMPT_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_INPUT_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_ATTR_PATTERN_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_DEMON_PATTERN
ExtractEntityAttributeMapper.__init__()
ExtractEntityAttributeMapper.parse_output()
ExtractEntityAttributeMapper.process_single()
- data_juicer.ops.mapper.extract_entity_relation_mapper module
ExtractEntityRelationMapper
ExtractEntityRelationMapper.DEFAULT_PROMPT_TEMPLATE
ExtractEntityRelationMapper.DEFAULT_CONTINUE_PROMPT
ExtractEntityRelationMapper.DEFAULT_IF_LOOP_PROMPT
ExtractEntityRelationMapper.DEFAULT_ENTITY_TYPES
ExtractEntityRelationMapper.DEFAULT_TUPLE_DELIMITER
ExtractEntityRelationMapper.DEFAULT_RECORD_DELIMITER
ExtractEntityRelationMapper.DEFAULT_COMPLETION_DELIMITER
ExtractEntityRelationMapper.DEFAULT_ENTITY_PATTERN
ExtractEntityRelationMapper.DEFAULT_RELATION_PATTERN
ExtractEntityRelationMapper.__init__()
ExtractEntityRelationMapper.parse_output()
ExtractEntityRelationMapper.add_message()
ExtractEntityRelationMapper.light_rag_extraction()
ExtractEntityRelationMapper.process_single()
- data_juicer.ops.mapper.extract_event_mapper module
- data_juicer.ops.mapper.extract_keyword_mapper module
- data_juicer.ops.mapper.extract_nickname_mapper module
- data_juicer.ops.mapper.extract_support_text_mapper module
- data_juicer.ops.mapper.fix_unicode_mapper module
- data_juicer.ops.mapper.generate_qa_from_examples_mapper module
GenerateQAFromExamplesMapper
GenerateQAFromExamplesMapper.DEFAULT_SYSTEM_PROMPT
GenerateQAFromExamplesMapper.DEFAULT_INPUT_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_EXAMPLE_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_QA_PAIR_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_OUTPUT_PATTERN
GenerateQAFromExamplesMapper.__init__()
GenerateQAFromExamplesMapper.build_input()
GenerateQAFromExamplesMapper.parse_output()
GenerateQAFromExamplesMapper.process_single()
- data_juicer.ops.mapper.generate_qa_from_text_mapper module
- data_juicer.ops.mapper.image_blur_mapper module
- data_juicer.ops.mapper.image_captioning_from_gpt4v_mapper module
- data_juicer.ops.mapper.image_captioning_mapper module
- data_juicer.ops.mapper.image_diffusion_mapper module
- data_juicer.ops.mapper.image_face_blur_mapper module
- data_juicer.ops.mapper.image_tagging_mapper module
- data_juicer.ops.mapper.nlpaug_en_mapper module
- data_juicer.ops.mapper.nlpcda_zh_mapper module
- data_juicer.ops.mapper.optimize_qa_mapper module
- data_juicer.ops.mapper.optimize_query_mapper module
- data_juicer.ops.mapper.optimize_response_mapper module
- data_juicer.ops.mapper.pair_preference_mapper module
- data_juicer.ops.mapper.punctuation_normalization_mapper module
- data_juicer.ops.mapper.python_file_mapper module
- data_juicer.ops.mapper.python_lambda_mapper module
- data_juicer.ops.mapper.relation_identity_mapper module
- data_juicer.ops.mapper.remove_bibliography_mapper module
- data_juicer.ops.mapper.remove_comments_mapper module
- data_juicer.ops.mapper.remove_header_mapper module
- data_juicer.ops.mapper.remove_long_words_mapper module
- data_juicer.ops.mapper.remove_non_chinese_character_mapper module
- data_juicer.ops.mapper.remove_repeat_sentences_mapper module
- data_juicer.ops.mapper.remove_specific_chars_mapper module
- data_juicer.ops.mapper.remove_table_text_mapper module
- data_juicer.ops.mapper.remove_words_with_incorrect_substrings_mapper module
- data_juicer.ops.mapper.replace_content_mapper module
- data_juicer.ops.mapper.sentence_split_mapper module
- data_juicer.ops.mapper.text_chunk_mapper module
- data_juicer.ops.mapper.video_captioning_from_audio_mapper module
- data_juicer.ops.mapper.video_captioning_from_frames_mapper module
- data_juicer.ops.mapper.video_captioning_from_summarizer_mapper module
- data_juicer.ops.mapper.video_captioning_from_video_mapper module
- data_juicer.ops.mapper.video_extract_frames_mapper module
- data_juicer.ops.mapper.video_face_blur_mapper module
- data_juicer.ops.mapper.video_ffmpeg_wrapped_mapper module
- data_juicer.ops.mapper.video_remove_watermark_mapper module
- data_juicer.ops.mapper.video_resize_aspect_ratio_mapper module
- data_juicer.ops.mapper.video_resize_resolution_mapper module
- data_juicer.ops.mapper.video_split_by_duration_mapper module
- data_juicer.ops.mapper.video_split_by_key_frame_mapper module
- data_juicer.ops.mapper.video_split_by_scene_mapper module
- data_juicer.ops.mapper.video_tagging_from_audio_mapper module
- data_juicer.ops.mapper.video_tagging_from_frames_mapper module
- data_juicer.ops.mapper.whitespace_normalization_mapper module
- Module contents
AudioFFmpegWrappedMapper
CalibrateQAMapper
CalibrateQAMapper.DEFAULT_SYSTEM_PROMPT
CalibrateQAMapper.DEFAULT_INPUT_TEMPLATE
CalibrateQAMapper.DEFAULT_REFERENCE_TEMPLATE
CalibrateQAMapper.DEFAULT_QA_PAIR_TEMPLATE
CalibrateQAMapper.DEFAULT_OUTPUT_PATTERN
CalibrateQAMapper.__init__()
CalibrateQAMapper.build_input()
CalibrateQAMapper.parse_output()
CalibrateQAMapper.process_single()
CalibrateQueryMapper
CalibrateResponseMapper
ChineseConvertMapper
CleanCopyrightMapper
CleanEmailMapper
CleanHtmlMapper
CleanIpMapper
CleanLinksMapper
ExpandMacroMapper
ExtractEntityAttributeMapper
ExtractEntityAttributeMapper.DEFAULT_SYSTEM_PROMPT_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_INPUT_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_ATTR_PATTERN_TEMPLATE
ExtractEntityAttributeMapper.DEFAULT_DEMON_PATTERN
ExtractEntityAttributeMapper.__init__()
ExtractEntityAttributeMapper.parse_output()
ExtractEntityAttributeMapper.process_single()
ExtractEntityRelationMapper
ExtractEntityRelationMapper.DEFAULT_PROMPT_TEMPLATE
ExtractEntityRelationMapper.DEFAULT_CONTINUE_PROMPT
ExtractEntityRelationMapper.DEFAULT_IF_LOOP_PROMPT
ExtractEntityRelationMapper.DEFAULT_ENTITY_TYPES
ExtractEntityRelationMapper.DEFAULT_TUPLE_DELIMITER
ExtractEntityRelationMapper.DEFAULT_RECORD_DELIMITER
ExtractEntityRelationMapper.DEFAULT_COMPLETION_DELIMITER
ExtractEntityRelationMapper.DEFAULT_ENTITY_PATTERN
ExtractEntityRelationMapper.DEFAULT_RELATION_PATTERN
ExtractEntityRelationMapper.__init__()
ExtractEntityRelationMapper.parse_output()
ExtractEntityRelationMapper.add_message()
ExtractEntityRelationMapper.light_rag_extraction()
ExtractEntityRelationMapper.process_single()
ExtractEventMapper
ExtractKeywordMapper
ExtractNicknameMapper
ExtractSupportTextMapper
FixUnicodeMapper
GenerateQAFromExamplesMapper
GenerateQAFromExamplesMapper.DEFAULT_SYSTEM_PROMPT
GenerateQAFromExamplesMapper.DEFAULT_INPUT_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_EXAMPLE_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_QA_PAIR_TEMPLATE
GenerateQAFromExamplesMapper.DEFAULT_OUTPUT_PATTERN
GenerateQAFromExamplesMapper.__init__()
GenerateQAFromExamplesMapper.build_input()
GenerateQAFromExamplesMapper.parse_output()
GenerateQAFromExamplesMapper.process_single()
GenerateQAFromTextMapper
ImageBlurMapper
ImageCaptioningFromGPT4VMapper
ImageCaptioningMapper
ImageDiffusionMapper
ImageFaceBlurMapper
ImageTaggingMapper
NlpaugEnMapper
NlpcdaZhMapper
OptimizeQAMapper
OptimizeQueryMapper
OptimizeResponseMapper
PairPreferenceMapper
PunctuationNormalizationMapper
PythonFileMapper
PythonLambdaMapper
RelationIdentityMapper
RemoveBibliographyMapper
RemoveCommentsMapper
RemoveHeaderMapper
RemoveLongWordsMapper
RemoveNonChineseCharacterlMapper
RemoveRepeatSentencesMapper
RemoveSpecificCharsMapper
RemoveTableTextMapper
RemoveWordsWithIncorrectSubstringsMapper
ReplaceContentMapper
SentenceSplitMapper
TextChunkMapper
VideoCaptioningFromAudioMapper
VideoCaptioningFromFramesMapper
VideoCaptioningFromSummarizerMapper
VideoCaptioningFromVideoMapper
VideoExtractFramesMapper
VideoFFmpegWrappedMapper
VideoFaceBlurMapper
VideoRemoveWatermarkMapper
VideoResizeAspectRatioMapper
VideoResizeResolutionMapper
VideoSplitByDurationMapper
VideoSplitByKeyFrameMapper
VideoSplitBySceneMapper
VideoTaggingFromAudioMapper
VideoTaggingFromFramesMapper
WhitespaceNormalizationMapper
- data_juicer.ops.selector package
- Submodules
- data_juicer.ops.selector.frequency_specified_field_selector module
- data_juicer.ops.selector.random_selector module
- data_juicer.ops.selector.range_specified_field_selector module
- data_juicer.ops.selector.topk_specified_field_selector module
- Module contents
Submodules
data_juicer.ops.base_op module
- data_juicer.ops.base_op.catch_map_batches_exception(method)
For batched-map sample-level fault tolerance.
- data_juicer.ops.base_op.catch_map_single_exception(method, return_sample=True)
For single-map sample-level fault tolerance. The input sample is expected to have batch_size = 1.
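The fault-tolerance idea behind these two wrappers can be illustrated with a small decorator. This is only a sketch of the general pattern, not the library's implementation: the name `catch_single_exception_sketch`, the `fallback` parameter, and the toy `uppercase_text` function are all assumptions made for the example.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def catch_single_exception_sketch(method, fallback=None):
    """Hypothetical stand-in: wrap a per-sample method so that one bad
    sample is logged and replaced by `fallback` instead of aborting the run."""

    @functools.wraps(method)
    def wrapper(sample, *args, **kwargs):
        try:
            return method(sample, *args, **kwargs)
        except Exception as exc:  # deliberately broad: this is the fault barrier
            logger.error("Skipping sample after error in %s: %s",
                         method.__name__, exc)
            return fallback

    return wrapper


def uppercase_text(sample):
    # Raises AttributeError when "text" is not a string.
    return {**sample, "text": sample["text"].upper()}


safe_uppercase = catch_single_exception_sketch(uppercase_text, fallback={})
good = safe_uppercase({"text": "hello"})  # processed normally
bad = safe_uppercase({"text": None})      # error is caught, fallback returned
```

Returning a fallback value keeps the output aligned with the input, so a caller that maps the function over many samples does not have to handle exceptions itself.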
- class data_juicer.ops.base_op.OP(*args, **kwargs)
Bases: object
- __init__(*args, **kwargs)
Base class of operators.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
index_key – if not None, index the samples before processing
- remove_extra_parameters(param_dict, keys=None)
At the beginning of the op's __init__, call self.remove_extra_parameters(locals()) to get the op's init parameter dict for convenience.
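The `locals()` idiom mentioned for `remove_extra_parameters` can be sketched as follows. The filtering rule shown here (dropping `self` and underscore-prefixed names when `keys` is None, otherwise dropping the listed keys) is an assumption for illustration, and `remove_extra_parameters_sketch` and `ToyOp` are hypothetical names rather than library code.

```python
def remove_extra_parameters_sketch(param_dict, keys=None):
    """Hypothetical helper: keep only the entries that look like init parameters."""
    if keys is None:
        # Assumed default: drop `self` and private/dunder names such as `__class__`.
        return {k: v for k, v in param_dict.items()
                if k != "self" and not k.startswith("_")}
    # Otherwise drop exactly the names listed in `keys`.
    return {k: v for k, v in param_dict.items() if k not in keys}


class ToyOp:
    def __init__(self, min_len=10, max_len=100):
        # Capture locals() first, before any other local variable is created.
        self.init_params = remove_extra_parameters_sketch(locals())


op = ToyOp(min_len=5)
```

Calling the helper as the first statement matters: any local variable defined earlier in `__init__` would otherwise end up in the captured dict.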
- class data_juicer.ops.base_op.Mapper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts data editing.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- class data_juicer.ops.base_op.Filter(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that removes specific info.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- compute_stats_single(sample, context=False)
Compute stats for the sample, which are used as a metric to decide whether to filter this sample.
- Parameters:
sample – the input sample
context – whether to temporarily store context information of intermediate variables in the sample
- Returns:
the sample with computed stats
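The two-step shape of a filter (first compute a stat, then decide from it) can be sketched with a standalone toy class. It does not subclass the library's `Filter`. The `stats` field name, the `keep` method, and the `ToyTextLengthFilter` class are assumptions introduced for illustration only.

```python
class ToyTextLengthFilter:
    """Standalone illustration of the compute-stats-then-decide pattern."""

    def __init__(self, min_len=3, text_key="text", stats_key="stats"):
        self.min_len = min_len
        self.text_key = text_key
        self.stats_key = stats_key  # hypothetical field that holds the stats

    def compute_stats_single(self, sample, context=False):
        # `context` is accepted to mirror the documented signature but is
        # unused in this sketch.
        stats = sample.setdefault(self.stats_key, {})
        if "text_len" not in stats:  # skip recomputation if already present
            stats["text_len"] = len(sample[self.text_key])
        return sample

    def keep(self, sample):
        # Hypothetical decision step: read the stat and return a boolean.
        return sample[self.stats_key]["text_len"] >= self.min_len


op = ToyTextLengthFilter(min_len=3)
samples = [{"text": "hi"}, {"text": "hello"}]
kept = [s for s in map(op.compute_stats_single, samples) if op.keep(s)]
```

Keeping the stat on the sample, rather than deciding inline, means the metric can be inspected or reused later without recomputing it.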
- class data_juicer.ops.base_op.Deduplicator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts deduplication.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- compute_hash(sample)
Compute hash values for the sample.
- Parameters:
sample – the input sample
- Returns:
the sample with the computed hash value
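A minimal sketch of hash-based exact deduplication, assuming the hash is a SHA-256 digest of the text stored under a `hash` field. The hash function, the field name, and both helper functions are choices made for this example, not the library's behavior.

```python
import hashlib


def compute_hash_sketch(sample, text_key="text", hash_key="hash"):
    """Attach a SHA-256 hex digest of the text to the sample and return it."""
    digest = hashlib.sha256(sample[text_key].encode("utf-8")).hexdigest()
    sample[hash_key] = digest
    return sample


def drop_exact_duplicates(samples):
    """Keep the first sample seen for each distinct hash value."""
    seen = set()
    unique = []
    for sample in map(compute_hash_sketch, samples):
        if sample["hash"] not in seen:
            seen.add(sample["hash"])
            unique.append(sample)
    return unique


docs = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
unique_docs = drop_exact_duplicates(docs)
```

Separating the per-sample hashing step from the dataset-level comparison mirrors the documented split, where `compute_hash` only annotates a sample and the duplicate removal happens afterwards.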
- class data_juicer.ops.base_op.Selector(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts dataset-level selection.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- class data_juicer.ops.base_op.Grouper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that groups samples.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- class data_juicer.ops.base_op.Aggregator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that aggregates samples.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
data_juicer.ops.load module
data_juicer.ops.op_fusion module
- data_juicer.ops.op_fusion.fuse_operators(ops, probe_res=None)
Fuse the input list of ops and return the fused list.
- Parameters:
ops – the list of op objects to fuse
probe_res – the probed speed for each OP from Monitor
- Returns:
a list of fused op objects
- data_juicer.ops.op_fusion.fuse_filter_group(original_filter_group)
Fuse a single filter group and return the fused filter group.
- Parameters:
original_filter_group – the original filter group, including op definitions and objects
- Returns:
the fused definitions and objects of the input filter group
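As a rough illustration of what fusing an op list can mean, the sketch below merges each run of adjacent filters into a single entry and leaves other ops in place. The grouping rule, the `ToyOp` class, and `fuse_consecutive_filters` are assumptions for this example. The criteria the library actually applies, including any use of `probe_res`, are not described above and are not modeled here.

```python
class ToyOp:
    """Hypothetical op record with a name and a kind ("mapper" or "filter")."""

    def __init__(self, name, kind):
        self.name = name
        self.kind = kind


def fuse_consecutive_filters(ops):
    """Return a new list where each run of 2+ adjacent filters is one entry."""
    fused, run = [], []

    def flush():
        # Emit the pending run: merged if it has several filters, as-is otherwise.
        if len(run) > 1:
            merged_name = "fused(" + ",".join(o.name for o in run) + ")"
            fused.append(ToyOp(merged_name, "filter"))
        else:
            fused.extend(run)
        run.clear()

    for op in ops:
        if op.kind == "filter":
            run.append(op)
        else:
            flush()
            fused.append(op)
    flush()
    return fused


pipeline = [
    ToyOp("clean", "mapper"),
    ToyOp("len", "filter"),
    ToyOp("lang", "filter"),
    ToyOp("split", "mapper"),
    ToyOp("ppl", "filter"),
]
fused_names = [op.name for op in fuse_consecutive_filters(pipeline)]
```

Only adjacent filters are merged here so that the relative order of mappers and filters is preserved.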
Module contents
- data_juicer.ops.load_ops(process_list)
Load the op list according to the process list from the config file.
- Parameters:
process_list – a process list; each item is an op name and its arguments
- Returns:
the list of op instances
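The mapping from a process list to op instances can be sketched with a small name-to-class registry. Everything here is hypothetical: the `REGISTRY` table, the `register` decorator, the toy op classes, and the assumption that each process-list item is a single-entry dict of the form `{op_name: args}`.

```python
REGISTRY = {}  # hypothetical table mapping op names to classes


def register(name):
    """Class decorator that records a class under the given op name."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


@register("toy_length_filter")
class ToyLengthFilter:
    def __init__(self, min_len=0):
        self.min_len = min_len


@register("toy_lowercase_mapper")
class ToyLowercaseMapper:
    def __init__(self):
        pass


def load_ops_sketch(process_list):
    """Instantiate one op per item, in the order given."""
    ops = []
    for item in process_list:
        # Assumed item shape: a single-entry dict {op_name: args or None}.
        (op_name, args), = item.items()
        ops.append(REGISTRY[op_name](**(args or {})))
    return ops


ops = load_ops_sketch([
    {"toy_lowercase_mapper": None},
    {"toy_length_filter": {"min_len": 5}},
])
```

An unknown op name raises `KeyError` in this sketch, which surfaces a misspelled config entry at load time rather than during processing.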
- class data_juicer.ops.Filter(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that removes specific info.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- compute_stats_single(sample, context=False)
Compute stats for the sample, which are used as a metric to decide whether to filter this sample.
- Parameters:
sample – the input sample
context – whether to temporarily store context information of intermediate variables in the sample
- Returns:
the sample with computed stats
- class data_juicer.ops.Mapper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts data editing.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- class data_juicer.ops.Deduplicator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts deduplication.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- compute_hash(sample)
Compute hash values for the sample.
- Parameters:
sample – the input sample
- Returns:
the sample with the computed hash value
- class data_juicer.ops.Selector(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that conducts dataset-level selection.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- class data_juicer.ops.Grouper(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that groups samples.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses
- class data_juicer.ops.Aggregator(*args, **kwargs)
Bases: OP
- __init__(*args, **kwargs)
Base class that aggregates samples.
- Parameters:
text_key – the key name of the field that stores the sample texts to be processed
image_key – the key name of the field that stores the sample image list to be processed
audio_key – the key name of the field that stores the sample audio list to be processed
video_key – the key name of the field that stores the sample video list to be processed
query_key – the key name of the field that stores the sample queries
response_key – the key name of the field that stores the responses
history_key – the key name of the field that stores the history of queries and responses