KDD 2024 Hands-on Tutorial

Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases

Date & Time: 10:00 AM - 1:00 PM, August 25, 2024

Location: Room 124-125, Centre de Convencions Internacional de Barcelona

In the era of foundation models, efficiently processing multi-modal data is crucial. This tutorial covers key techniques for multi-modal data processing and introduces the open-source Data-Juicer system, designed to tackle the complexities of data variety, quality, and scale. Participants will learn how to use Data-Juicer's operators and tools for formatting, mapping, filtering, deduplicating, and selecting multi-modal data efficiently and effectively. They will also become familiar with the Data-Juicer Sandbox Lab, where users can experiment with diverse data recipes, which are methodical sequences of operators, and streamline the creation of scalable data processing pipelines. This hands-on experience reinforces the concepts discussed and provides a space for innovation and exploration, highlighting how data recipes can be optimized and deployed in high-performance distributed environments.
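To make the idea of a data recipe concrete, the following is a minimal plain-Python sketch of a recipe as an ordered sequence of operators (a mapper, a filter, and a deduplicator). It does not use Data-Juicer's actual API: the sample layout, the operator names, and the `run_recipe` helper are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: NOT Data-Juicer's API. All names are hypothetical.
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]  # assumed layout: {"text": str, "images": list of paths}


def normalize_whitespace(sample: Sample) -> Sample:
    """Mapper: returns an edited copy of a single sample."""
    return {**sample, "text": " ".join(str(sample["text"]).split())}


def has_min_words(min_words: int) -> Callable[[Sample], bool]:
    """Filter factory: the returned predicate keeps samples with enough words."""
    def keep(sample: Sample) -> bool:
        return len(str(sample["text"]).split()) >= min_words
    return keep


def deduplicate(samples: List[Sample]) -> List[Sample]:
    """Deduplicator: keeps the first sample for each (lowercased text, images) key."""
    seen = set()
    unique = []
    for sample in samples:
        key = (str(sample["text"]).lower(), tuple(sample["images"]))
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def run_recipe(samples: List[Sample], recipe: List[Tuple[str, Callable]]) -> List[Sample]:
    """Applies each (kind, operator) step of the recipe in order."""
    for kind, op in recipe:
        if kind == "map":
            samples = [op(s) for s in samples]
        elif kind == "filter":
            samples = [s for s in samples if op(s)]
        elif kind == "dedup":
            samples = op(samples)
        else:
            raise ValueError(f"unknown operator kind: {kind}")
    return samples


recipe = [
    ("map", normalize_whitespace),
    ("filter", has_min_words(3)),
    ("dedup", deduplicate),
]

data = [
    {"text": "A  cat on   a mat", "images": ["a.jpg"]},
    {"text": "a cat on a mat", "images": ["a.jpg"]},
    {"text": "hi", "images": []},
]

result = run_recipe(data, recipe)
print(result)
```

Operator order matters in this sketch: normalizing whitespace before deduplication lets the first two samples collapse into one, which would not happen if the steps were swapped.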

By the end of this tutorial, attendees will have the practical knowledge and skills to navigate multi-modal data processing for foundation models. They will leave with hands-on experience of an industrial open-source system and an enriched perspective on the importance of high-quality data in AI, ready to implement sustainable and scalable solutions in their own projects.

The system and related materials are available at https://github.com/modelscope/data-juicer.

(20 min) | Introduction and Overview: Multi-modal Data Processing and the Data-Juicer System
(20 min) | Building Blocks of Data Processing: Data-Juicer’s Operators
(20 min) | Composing Atomic Capabilities: Data-Juicer’s Data Recipes
(30 min) | Exploring Data Recipes: The Data-Juicer Sandbox Lab
(30 min) | From Exploration to Production: High-Performance Data Factory
(45 min) | Use Cases: From Text to Video Data Processing
(15 min) | Resources and Conclusion

We are the Data-Juicer team from Alibaba Tongyi.