Welcome to data-juicer’s documentation!
Tutorial
We will give a tutorial at KDD’24, "Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases". See more details here!
- data_juicer.core
- data_juicer.ops
- data_juicer.ops.filter
ImageTextSimilarityFilter
VideoAspectRatioFilter
ImageTextMatchingFilter
ImageNSFWFilter
TokenNumFilter
TextLengthFilter
SpecifiedNumericFieldFilter
AudioNMFSNRFilter
VideoAestheticsFilter
PerplexityFilter
PhraseGroundingRecallFilter
MaximumLineLengthFilter
AverageLineLengthFilter
SpecifiedFieldFilter
VideoTaggingFromFramesFilter
TextEntityDependencyFilter
VideoResolutionFilter
AlphanumericFilter
ImageWatermarkFilter
ImageAestheticsFilter
AudioSizeFilter
StopWordsFilter
CharacterRepetitionFilter
ImageShapeFilter
VideoDurationFilter
TextActionFilter
VideoOcrAreaRatioFilter
VideoNSFWFilter
SpecialCharactersFilter
VideoFramesTextSimilarityFilter
ImageAspectRatioFilter
AudioDurationFilter
LanguageIDScoreFilter
SuffixFilter
ImageSizeFilter
VideoWatermarkFilter
WordsNumFilter
ImageFaceCountFilter
ImageFaceRatioFilter
FlaggedWordFilter
WordRepetitionFilter
VideoMotionScoreFilter
ImagePairSimilarityFilter
- data_juicer.ops.mapper
VideoCaptioningFromAudioMapper
VideoTaggingFromAudioMapper
ImageCaptioningFromGPT4VMapper
PunctuationNormalizationMapper
RemoveBibliographyMapper
SentenceSplitMapper
VideoSplitBySceneMapper
CleanIpMapper
CleanLinksMapper
RemoveHeaderMapper
RemoveTableTextMapper
VideoRemoveWatermarkMapper
RemoveRepeatSentencesMapper
ImageDiffusionMapper
ImageFaceBlurMapper
VideoFFmpegWrappedMapper
ChineseConvertMapper
NlpcdaZhMapper
OptimizeInstructionMapper
ImageBlurMapper
CleanCopyrightMapper
RemoveNonChineseCharacterlMapper
VideoSplitByKeyFrameMapper
RemoveSpecificCharsMapper
VideoResizeAspectRatioMapper
CleanHtmlMapper
WhitespaceNormalizationMapper
VideoTaggingFromFramesMapper
RemoveCommentsMapper
ExpandMacroMapper
ExtractQAMapper
ImageCaptioningMapper
RemoveWordsWithIncorrectSubstringsMapper
VideoCaptioningFromVideoMapper
VideoCaptioningFromSummarizerMapper
GenerateInstructionMapper
FixUnicodeMapper
NlpaugEnMapper
VideoCaptioningFromFramesMapper
RemoveLongWordsMapper
VideoResizeResolutionMapper
CleanEmailMapper
ReplaceContentMapper
AudioFFmpegWrappedMapper
VideoSplitByDurationMapper
VideoFaceBlurMapper
ImageTaggingMapper
- data_juicer.ops.deduplicator
- data_juicer.ops.selector
- data_juicer.ops.common
- data_juicer.analysis
- data_juicer.config
- data_juicer.format