KDD 2024 Hands-on Tutorial
Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases
Date & Time: 10:00 AM - 1:00 PM, August 25, 2024
Location: Room 124-125, Centre de Convencions Internacional de Barcelona
In the era of foundation models, efficiently processing multi-modal data
is crucial.
This tutorial covers key techniques for multi-modal data processing and
introduces the open-source Data-Juicer system, designed to tackle the
complexities of data variety, quality, and scale.
Participants will learn how to use Data-Juicer's operators and tools
for formatting, mapping, filtering, deduplicating, and selecting
multi-modal data efficiently and effectively.
They will also become familiar with the Data-Juicer Sandbox Lab, where
users can easily experiment with diverse data recipes that represent
methodical sequences of operators and streamline the creation of
scalable data processing pipelines.
This experience both solidifies the concepts discussed and provides
a space for innovation and exploration, highlighting how data recipes
can be optimized and deployed in high-performance distributed
environments.
By the end of this tutorial, attendees will be equipped with the
practical knowledge and skills to navigate multi-modal data
processing for foundation models. They will leave with actionable
knowledge of an industrial open-source system and an enriched
perspective on the importance of high-quality data in AI, poised to
implement sustainable and scalable solutions in their projects.
The system and related materials are available at
https://github.com/modelscope/data-juicer.
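
To make the idea of a data recipe as a methodical sequence of operators more concrete, here is a minimal sketch in plain Python. It does not use the Data-Juicer API; every name in it (normalize_whitespace, min_length_filter, exact_dedup, run_recipe, and the "text" and "image" fields) is a hypothetical stand-in chosen for illustration, and the real system's operators and configuration format may look quite different.

```python
from typing import Callable, Dict, List

# Hypothetical sample layout: a caption plus a path to its paired image.
Sample = Dict[str, str]
Operator = Callable[[List[Sample]], List[Sample]]


def normalize_whitespace(samples: List[Sample]) -> List[Sample]:
    """Mapper: return copies of the samples with whitespace runs collapsed."""
    return [{**s, "text": " ".join(s["text"].split())} for s in samples]


def min_length_filter(min_words: int) -> Operator:
    """Filter factory: keep samples whose text has at least min_words words."""
    def op(samples: List[Sample]) -> List[Sample]:
        return [s for s in samples if len(s["text"].split()) >= min_words]
    return op


def exact_dedup(samples: List[Sample]) -> List[Sample]:
    """Deduplicator: keep the first occurrence of each (text, image) pair."""
    seen = set()
    kept = []
    for s in samples:
        key = (s["text"], s.get("image", ""))
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def run_recipe(samples: List[Sample], recipe: List[Operator]) -> List[Sample]:
    """Apply each operator in order; the ordered list is the 'recipe'."""
    for op in recipe:
        samples = op(samples)
    return samples


# An illustrative recipe: map, then filter, then deduplicate.
RECIPE: List[Operator] = [normalize_whitespace, min_length_filter(3), exact_dedup]

DATA: List[Sample] = [
    {"text": "a  cat sitting on a mat", "image": "cat.jpg"},
    {"text": "a cat sitting on a mat", "image": "cat.jpg"},
    {"text": "dog", "image": "dog.jpg"},
]

RESULT = run_recipe(DATA, RECIPE)
```

In this toy setup the order of operators matters: normalizing whitespace before deduplicating lets the first two samples be recognized as identical, which is one reason a recipe is treated as an ordered sequence rather than an unordered set of operators.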