# Refine Alpaca-CoT Config Files This folder contains some configuration files to allow users to easily and quickly refine [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT). ## Preprocess The raw data files can be downloaded from [Alpaca-CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT) on HuggingFace. ### Convert raw Alpaca-CoT data to jsonl Use [raw_alpaca_cot_merge_add_meta.py](../../../tools/preprocess/raw_alpaca_cot_merge_add_meta.py) to select `instruction`, `input` and `output` columns and merge them to `text` field with a space, and add extra [ META ]( #meta_info) info to dataset: ```shell python tools/preprocess/raw_alpaca_cot_merge_add_meta.py \ --src_dir \ --target_dir \ --num_proc ``` ### Split datasets to sub-datasets by language Use [dataset_split_by_language.py](../../../tools/preprocess/dataset_split_by_language.py) to split the dataset to EN and ZH sub-datasets: ```shell python tools/preprocess/dataset_split_by_language.py \ --src_dir \ --target_dir \ --suffixes jsonl \ --num_proc ``` ## Process After preprocess, modify the dataset path in [alpaca-cot-en-refine.yaml](alpaca-cot-en-refine.yaml) and [alpaca-cot-zh-refine.yaml](alpaca-cot-zh-refine.yaml), and then execute the following command to reproduce the processing flow of refined Alpaca-CoT. ```shell # refine English dataset python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml # refine Chinese dataset python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml ``` ### Meta Info Each sample in refined data of Alpaca-CoT contains meta info listed as below: #### Alpaca-CoT original meta info * Language Tags: - EN: Instruction datasets in English - CN: Instruction datasets in Chinese - ML: [Multi-lingual] Instruction datasets in multiple languages * Task Tags - MT: [Multi-task] Datasets containing multiple tasks - TS: [Task-specific] Datasets tailored for specific tasks * Generation-method: - HG: [Human Generated Dataset] Datasets created by humans - SI: [Self-Instruct] Datasets generated using self-instruct methods - MIX: [Mixed Dataset] Dataset contains both human and machine generated data - COL: [Collection of Dataset] Dataset made from a collection of other datasets #### Data-Juicer Meta info * `Dataset`: dataset name in Alpaca-CoT * `origin_path`: original file path in Alpaca-CoT * `IFT`: tagged as Instruct Fine-Tuning datasets * `CFT`: tagged as Chat Fine-Tuning datasets * `CFT-SR`: tagged as Single-round Dialog datasets * `CFT-MR`: tagged as Multi-round Dialog datasets * `CFT-P`: tagged as Preference datasets #### Refined Alpaca-CoT dataset Meta info | | Task | Gen | Lang | Dataset | IFT | CFT-SR | CFT-MR | CFT-P | |---------------------- |------- |----- |------- |---------------------- |----- |--------- |--------- |---------------- | | Chain-of-Thought | MT | HG | EN/CN | Chain-of-Thought | ✅ | | | | | GPT4all | MT | COL | EN | GPT4all | ✅ | ✅ | | | | GPTeacher | MT | SI | EN | GPTeacher | | ✅ | | | | Guanaco | MT | SI | ML | Guanaco | | ✅ | | | | HC3 | TS | MIX | EN/CN | HC3 | | ✅ | | ✅ | | alpaca | MT | SI | EN | alpaca | | ✅ | | | | Natural-Instructions | MT | COL | ML | Natural-Instructions | ✅ | | | | | belle_cn | TS/MT | SI | CN | belle_cn | | ✅ | | | | instinwild | MT | SI | EN/CN | instinwild | | ✅ | | | | prosocial-dialog | TS | MIX | EN | prosocial-dialog | | ✅ | | | | finance | TS | COL | EN | finance | | ✅ | | | | xP3 | MT | COL | ML | xP3 | ✅ | | | | | firefly | MT | COL | CN | firefly | ✅ | | | | | instruct | MT | COL | EN | instruct | | ✅ | | | | CodeAlpaca | TS | SI | EN | CodeAlpaca | ✅ | | | | | alpacaGPT4 | MT | SI | EN/CN | alpacaGPT4 | | ✅ | | ✅ | | webGPT | TS | MIX | EN | webGPT | ✅ | | | ✅ | | dolly | TS | HG | EN | dolly | | ✅ | | | | baize | MT | COL | EN | baize | | ✅ | | | | hh-rlhf | TS | MIX | EN | hh-rlhf | | ✅ | ✅ | ✅ | | OIG | MT | COL | EN | OIG | | ✅ | | | | GAOKAO | MT | COL | CN | GAOKAO | ✅ | | | | | camel | MT | SI | EN | camel | ✅ | | | | | FLAN-Muffin | MT | COL | EN | FLAN-Muffin | ✅ | | | | | COIG | MT | COL | CN | COIG | | ✅ | | | | gpt4tools | MT | SI | EN | gpt4tools | ✅ | | | | | ShareGPT | MT | MIX | EN | ShareGPT | | ✅ | ✅ | | | Auto-CoT | MT | COL | EN | Auto-CoT | ✅ | | | | | MOSS | TS | SI | EN/CN | MOSS | | ✅ | | | | ultrachat | TS | SI | EN | ultrachat | | ✅ | | | | Chinese-medical | TS | COL | CN | Chinese-medical | | ✅ | | | | CSL | MT | COL | CN | CSL | ✅ | | | | | pCLUE | MT | COL | CN | pCLUE | ✅ | | | | | news_commentary | TS | COL | CN | news_commentary | ✅ | | | | | StackExchange | MT | COL | EN | StackExchange | | ✅ | | ✅ | | ConvAI2 | TS | HG | EN | ConvAI2 | | ✅ | | | | FastChat | MT | SI | EN | FastChat | | ✅ | | | | Tabular-LLM-Data | MT | COL | EN/CN | Tabular-LLM-Data | ✅ | | | |