# Refine Alpaca-CoT Config Files

This folder contains configuration files that allow users to easily and quickly refine the Alpaca-CoT dataset.

## Preprocess

The raw data files can be downloaded from [Alpaca-CoT on HuggingFace](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT).
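
If you prefer scripting the download, a minimal sketch using `huggingface_hub` is shown below; the `QingyiSi/Alpaca-CoT` repo id is the dataset's HuggingFace location, and the local path is a placeholder to adapt:

```python
# Sketch: fetch the raw Alpaca-CoT files with huggingface_hub.
# local_dir is a placeholder; adapt it to your environment.
from huggingface_hub import snapshot_download

src_dir = snapshot_download(
    repo_id="QingyiSi/Alpaca-CoT",
    repo_type="dataset",
    local_dir="./Alpaca-CoT",  # used as <Alpaca-CoT_src_dir> below
)
print(src_dir)
```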

### Convert raw Alpaca-CoT data to jsonl

Use `tools/preprocess/raw_alpaca_cot_merge_add_meta.py` to select the instruction, input, and output columns, merge them into the text field separated by spaces, and add extra [meta info](#meta-info) to the dataset:

```shell
python tools/preprocess/raw_alpaca_cot_merge_add_meta.py    \
    --src_dir             <Alpaca-CoT_src_dir>              \
    --target_dir          <target_dir>                      \
    --num_proc            <num_proc>
```
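
For reference, the core of the merge step looks roughly like the sketch below. This is an illustration of the behavior described above, not the script itself, and the field names assume the usual Alpaca-CoT schema:

```python
# Sketch of the merge: join instruction/input/output into "text" with spaces
# and attach the per-dataset meta info (illustrative, not the actual script).
def merge_sample(sample: dict, meta: dict) -> dict:
    parts = [sample.get(k, "") for k in ("instruction", "input", "output")]
    return {
        "text": " ".join(p for p in parts if p),  # skip empty columns
        "meta": meta,  # e.g. {"Dataset": "alpaca", "origin_path": "..."}
    }

print(merge_sample(
    {"instruction": "Translate to French:", "input": "Hello", "output": "Bonjour"},
    {"Dataset": "alpaca"},
))
```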

### Split datasets to sub-datasets by language

Use `tools/preprocess/dataset_split_by_language.py` to split the dataset into English (EN) and Chinese (ZH) sub-datasets:

```shell
python tools/preprocess/dataset_split_by_language.py    \
    --src_dir             <src_dir>                     \
    --target_dir          <target_dir>                  \
    --suffixes            jsonl                         \
    --num_proc            <num_proc>
```
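
The split is driven by per-sample language identification. The toy sketch below conveys the idea with a naive CJK-character heuristic; the actual tool's detection method may differ:

```python
# Toy sketch: route jsonl samples into ZH/EN files by a naive CJK ratio.
# Illustration only; the real tool may use a proper language identifier.
import json
from pathlib import Path

def looks_chinese(text: str, threshold: float = 0.3) -> bool:
    """Naive heuristic: fraction of CJK characters in the text."""
    if not text:
        return False
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text) >= threshold

def split_file(src: Path, target_dir: Path) -> None:
    target_dir.mkdir(parents=True, exist_ok=True)
    with open(src, encoding="utf-8") as fin, \
         open(target_dir / f"{src.stem}_zh.jsonl", "w", encoding="utf-8") as fzh, \
         open(target_dir / f"{src.stem}_en.jsonl", "w", encoding="utf-8") as fen:
        for line in fin:
            sample = json.loads(line)
            target = fzh if looks_chinese(sample.get("text", "")) else fen
            target.write(json.dumps(sample, ensure_ascii=False) + "\n")
```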

## Process

After preprocessing, modify the `dataset_path` in `alpaca-cot-en-refine.yaml` and `alpaca-cot-zh-refine.yaml`, and then execute the following commands to reproduce the processing flow of the refined Alpaca-CoT dataset:

```shell
# refine English dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml

# refine Chinese dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml
```
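
After a run finishes, a quick way to see how much the recipe filtered out is to compare line counts of the input file and the exported file. The paths below are placeholders:

```python
# Compare sample counts before and after refining (placeholder paths).
def count_lines(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

before = count_lines("alpaca-cot-en.jsonl")         # preprocessed input
after = count_lines("alpaca-cot-en-refined.jsonl")  # recipe's exported output
print(f"kept {after}/{before} samples ({after / before:.1%})")
```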

## Meta Info

Each sample in the refined Alpaca-CoT data contains the meta info listed below; a sketched sample follows the two lists.

### Alpaca-CoT original meta info

- Language Tags:
  - EN: Instruction datasets in English
  - CN: Instruction datasets in Chinese
  - ML: [Multi-lingual] Instruction datasets in multiple languages
- Task Tags:
  - MT: [Multi-task] Datasets containing multiple tasks
  - TS: [Task-specific] Datasets tailored for specific tasks
- Generation Method:
  - HG: [Human Generated Dataset] Datasets created by humans
  - SI: [Self-Instruct] Datasets generated using self-instruct methods
  - MIX: [Mixed Dataset] Datasets containing both human- and machine-generated data
  - COL: [Collection of Dataset] Datasets made from a collection of other datasets

### Data-Juicer meta info

- Dataset: dataset name in Alpaca-CoT
- origin_path: original file path in Alpaca-CoT
- IFT: tagged as Instruct Fine-Tuning datasets
- CFT: tagged as Chat Fine-Tuning datasets
  - CFT-SR: tagged as Single-round Dialog datasets
  - CFT-MR: tagged as Multi-round Dialog datasets
  - CFT-P: tagged as Preference datasets
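
Putting the two tag sets together, a refined sample's meta might look like the following. The values and exact key names here are illustrative assumptions; check the released data for the authoritative schema:

```python
# Hypothetical refined sample; all tag values are illustrative only.
sample = {
    "text": "Give three tips for staying healthy. ...",
    "meta": {
        "Dataset": "alpaca",                        # dataset name in Alpaca-CoT
        "origin_path": "alpaca/alpaca_data.json",   # original file path
        "Task": "MT",                               # multi-task
        "Gen": "SI",                                # self-instruct
        "Lang": "EN",                               # English
        "IFT": True,                                # instruct fine-tuning data
        "CFT-SR": False,                            # not single-round dialog
        "CFT-MR": False,                            # not multi-round dialog
        "CFT-P": False,                             # not preference data
    },
}
```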

### Refined Alpaca-CoT dataset meta info

| Task | Gen | Lang | Dataset |
|:----:|:---:|:-----:|---------|
| MT | HG | EN/CN | Chain-of-Thought |
| MT | COL | EN | GPT4all |
| MT | SI | EN | GPTeacher |
| MT | SI | ML | Guanaco |
| TS | MIX | EN/CN | HC3 |
| MT | SI | EN | alpaca |
| MT | COL | ML | Natural-Instructions |
| TS/MT | SI | CN | belle_cn |
| MT | SI | EN/CN | instinwild |
| TS | MIX | EN | prosocial-dialog |
| TS | COL | EN | finance |
| MT | COL | ML | xP3 |
| MT | COL | CN | firefly |
| MT | COL | EN | instruct |
| TS | SI | EN | CodeAlpaca |
| MT | SI | EN/CN | alpacaGPT4 |
| TS | MIX | EN | webGPT |
| TS | HG | EN | dolly |
| MT | COL | EN | baize |
| TS | MIX | EN | hh-rlhf |
| MT | COL | EN | OIG |
| MT | COL | CN | GAOKAO |
| MT | SI | EN | camel |
| MT | COL | EN | FLAN-Muffin |
| MT | COL | CN | COIG |
| MT | SI | EN | gpt4tools |
| MT | MIX | EN | ShareGPT |
| MT | COL | EN | Auto-CoT |
| TS | SI | EN/CN | MOSS |
| TS | SI | EN | ultrachat |
| TS | COL | CN | Chinese-medical |
| MT | COL | CN | CSL |
| MT | COL | CN | pCLUE |
| TS | COL | CN | news_commentary |
| MT | COL | EN | StackExchange |
| TS | HG | EN | ConvAI2 |
| MT | SI | EN | FastChat |
| MT | COL | EN/CN | Tabular-LLM-Data |
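
These per-dataset tags make it easy to carve out subsets of the refined data. For example, a minimal sketch that keeps only English instruct-fine-tuning samples; the key names ("Lang", "IFT") are assumed from the tags documented above, and the file path is a placeholder:

```python
# Sketch: select English instruct-fine-tuning samples via the meta tags.
# Key names are assumed from the documentation above; path is a placeholder.
import json

subset = []
with open("alpaca-cot-en-refined.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        meta = sample.get("meta", {})
        if meta.get("Lang") == "EN" and meta.get("IFT"):
            subset.append(sample)
print(f"selected {len(subset)} samples")
```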