# Demos This folder contains some demos, which allow users to easily experience the basic functions and tools of Data-Juicer. ## Usage Use `app.py` in the subdirectory of `demos` to run the demos. ```shell cd streamlit run app.py ``` ## Available Demos - Data (`data`) - This folder contains some sample datasets. - Overview scan (`overview_scan`) - This demo introduces the basic concepts and functions of Data-Juicer, such as features, configuration, operators, and so on. - Data process loop (`data_process_loop`) - This demo analyzes and processes a dataset, providing a comparison of statistical information before and after the processing. - Data visualization diversity (`data_visualization_diversity`) - This demo analyzes the verb-noun structure of the CFT dataset and plots its diversity in sunburst format. - Data visualization op effect (`data_visualization_op_effect`) - This demo analyzes the statistics of dataset, and displays the effect of each Filter op by setting different thresholds. - Data visualization statistics (`data_visualization_statistics`) - This demo analyzes the dataset and obtain up to 13 statistics. - Process CFT Chinese data (`process_cft_zh_data`) - This demos analyzes and processes part of Chinese dataset in Alpaca-CoT to show how to process IFT or CFT data for LLM fine-tuning. - Process SCI data (`process_sci_data`) - This demos analyzes and processes part of arXiv dataset to show how to process scientific literature data for LLM pre-training. - Process code data (`process_code_data`) - This demos analyzes and processes part of Stack-Exchange dataset to show how to process code data for LLM pre-training. - Text quality classifier (`tool_quality_classifier`) - This demo provides 3 text quality classifier to score the dataset. - Dataset splitting by language (`tool_dataset_splitting_by_language`) - This demo splits a dataset to different sub-datasets by language. - Data mixture (`data_mixture`) - This demo selects and mixes samples from multiple datasets and exports them into a new dataset.