Welcome to data-juicer’s documentation!
Tutorial
We will give a tutorial at KDD’24, “Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases”; see more details here!
API Reference
- data_juicer.core
- data_juicer.ops
- data_juicer.ops.filter
AlphanumericFilter
AudioDurationFilter
AudioNMFSNRFilter
AudioSizeFilter
AverageLineLengthFilter
CharacterRepetitionFilter
FlaggedWordFilter
ImageAestheticsFilter
ImageAspectRatioFilter
ImageFaceCountFilter
ImageFaceRatioFilter
ImageNSFWFilter
ImagePairSimilarityFilter
ImageShapeFilter
ImageSizeFilter
ImageTextMatchingFilter
ImageTextSimilarityFilter
ImageWatermarkFilter
LanguageIDScoreFilter
LLMQualityScoreFilter
LLMDifficultyScoreFilter
MaximumLineLengthFilter
PerplexityFilter
PhraseGroundingRecallFilter
SpecialCharactersFilter
SpecifiedFieldFilter
SpecifiedNumericFieldFilter
StopWordsFilter
SuffixFilter
TextActionFilter
TextEntityDependencyFilter
TextLengthFilter
TextPairSimilarityFilter
TokenNumFilter
VideoAestheticsFilter
VideoAspectRatioFilter
VideoDurationFilter
VideoFramesTextSimilarityFilter
VideoMotionScoreFilter
VideoMotionScoreRaftFilter
VideoNSFWFilter
VideoOcrAreaRatioFilter
VideoResolutionFilter
VideoTaggingFromFramesFilter
VideoWatermarkFilter
WordRepetitionFilter
WordsNumFilter
- data_juicer.ops.mapper
AudioFFmpegWrappedMapper
CalibrateQAMapper
CalibrateQueryMapper
CalibrateResponseMapper
ChineseConvertMapper
CleanCopyrightMapper
CleanEmailMapper
CleanHtmlMapper
CleanIpMapper
CleanLinksMapper
DialogIntentDetectionMapper
DialogSentimentDetectionMapper
DialogSentimentIntensityMapper
DialogTopicDetectionMapper
ExpandMacroMapper
ExtractEntityAttributeMapper
ExtractEntityRelationMapper
ExtractEventMapper
ExtractKeywordMapper
ExtractNicknameMapper
ExtractSupportTextMapper
FixUnicodeMapper
GenerateQAFromExamplesMapper
GenerateQAFromTextMapper
ImageBlurMapper
ImageCaptioningFromGPT4VMapper
ImageCaptioningMapper
ImageDiffusionMapper
ImageFaceBlurMapper
ImageRemoveBackgroundMapper
ImageSegmentMapper
ImageTaggingMapper
MllmMapper
NlpaugEnMapper
NlpcdaZhMapper
OptimizeQAMapper
OptimizeQueryMapper
OptimizeResponseMapper
PairPreferenceMapper
PunctuationNormalizationMapper
PythonFileMapper
PythonLambdaMapper
QuerySentimentDetectionMapper
QueryIntentDetectionMapper
QueryTopicDetectionMapper
RelationIdentityMapper
RemoveBibliographyMapper
RemoveCommentsMapper
RemoveHeaderMapper
RemoveLongWordsMapper
RemoveNonChineseCharacterlMapper
RemoveRepeatSentencesMapper
RemoveSpecificCharsMapper
RemoveTableTextMapper
RemoveWordsWithIncorrectSubstringsMapper
ReplaceContentMapper
SDXLPrompt2PromptMapper
SentenceAugmentationMapper
SentenceSplitMapper
TextChunkMapper
VideoCaptioningFromAudioMapper
VideoCaptioningFromFramesMapper
VideoCaptioningFromSummarizerMapper
VideoCaptioningFromVideoMapper
VideoExtractFramesMapper
VideoFFmpegWrappedMapper
VideoFaceBlurMapper
VideoRemoveWatermarkMapper
VideoResizeAspectRatioMapper
VideoResizeResolutionMapper
VideoSplitByDurationMapper
VideoSplitByKeyFrameMapper
VideoSplitBySceneMapper
VideoTaggingFromAudioMapper
VideoTaggingFromFramesMapper
WhitespaceNormalizationMapper
- data_juicer.ops.deduplicator
- data_juicer.ops.selector
- data_juicer.ops.common
- data_juicer.analysis
- data_juicer.config
- data_juicer.format