Welcome to data-juicer’s documentation!
Tutorial
We will give a tutorial at KDD’24, “Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases”; see more details here!
- data_juicer.core
- data_juicer.ops
- data_juicer.ops.filter
  - AlphanumericFilter
  - AudioDurationFilter
  - AudioNMFSNRFilter
  - AudioSizeFilter
  - AverageLineLengthFilter
  - CharacterRepetitionFilter
  - FlaggedWordFilter
  - ImageAestheticsFilter
  - ImageAspectRatioFilter
  - ImageFaceCountFilter
  - ImageFaceRatioFilter
  - ImageNSFWFilter
  - ImagePairSimilarityFilter
  - ImageShapeFilter
  - ImageSizeFilter
  - ImageTextMatchingFilter
  - ImageTextSimilarityFilter
  - ImageWatermarkFilter
  - LanguageIDScoreFilter
  - MaximumLineLengthFilter
  - PerplexityFilter
  - PhraseGroundingRecallFilter
  - SpecialCharactersFilter
  - SpecifiedFieldFilter
  - SpecifiedNumericFieldFilter
  - StopWordsFilter
  - SuffixFilter
  - TextActionFilter
  - TextEntityDependencyFilter
  - TextLengthFilter
  - TokenNumFilter
  - VideoAestheticsFilter
  - VideoAspectRatioFilter
  - VideoDurationFilter
  - VideoFramesTextSimilarityFilter
  - VideoMotionScoreFilter
  - VideoMotionScoreRaftFilter
  - VideoNSFWFilter
  - VideoOcrAreaRatioFilter
  - VideoResolutionFilter
  - VideoTaggingFromFramesFilter
  - VideoWatermarkFilter
  - WordRepetitionFilter
  - WordsNumFilter
- data_juicer.ops.mapper
  - AudioFFmpegWrappedMapper
  - CalibrateQAMapper
  - CalibrateQueryMapper
  - CalibrateResponseMapper
  - ChineseConvertMapper
  - CleanCopyrightMapper
  - CleanEmailMapper
  - CleanHtmlMapper
  - CleanIpMapper
  - CleanLinksMapper
  - ExpandMacroMapper
  - ExtractEntityAttributeMapper
  - ExtractEntityRelationMapper
  - ExtractEventMapper
  - ExtractKeywordMapper
  - ExtractNicknameMapper
  - FixUnicodeMapper
  - GenerateQAFromExamplesMapper
  - GenerateQAFromTextMapper
  - ImageBlurMapper
  - ImageCaptioningFromGPT4VMapper
  - ImageCaptioningMapper
  - ImageDiffusionMapper
  - ImageFaceBlurMapper
  - ImageTaggingMapper
  - NlpaugEnMapper
  - NlpcdaZhMapper
  - OptimizeQAMapper
  - OptimizeQueryMapper
  - OptimizeResponseMapper
  - PunctuationNormalizationMapper
  - RemoveBibliographyMapper
  - RemoveCommentsMapper
  - RemoveHeaderMapper
  - RemoveLongWordsMapper
  - RemoveNonChineseCharacterlMapper
  - RemoveRepeatSentencesMapper
  - RemoveSpecificCharsMapper
  - RemoveTableTextMapper
  - RemoveWordsWithIncorrectSubstringsMapper
  - ReplaceContentMapper
  - SentenceSplitMapper
  - TextChunkMapper
  - VideoCaptioningFromAudioMapper
  - VideoCaptioningFromFramesMapper
  - VideoCaptioningFromSummarizerMapper
  - VideoCaptioningFromVideoMapper
  - VideoFFmpegWrappedMapper
  - VideoFaceBlurMapper
  - VideoRemoveWatermarkMapper
  - VideoResizeAspectRatioMapper
  - VideoResizeResolutionMapper
  - VideoSplitByDurationMapper
  - VideoSplitByKeyFrameMapper
  - VideoSplitBySceneMapper
  - VideoTaggingFromAudioMapper
  - VideoTaggingFromFramesMapper
  - WhitespaceNormalizationMapper
- data_juicer.ops.deduplicator
- data_juicer.ops.selector
- data_juicer.ops.common
- data_juicer.analysis
- data_juicer.config
- data_juicer.format