Refine Alpaca-CoT Config Files¶
This folder contains some configuration files to allow users to easily and quickly refine Alpaca-CoT.
Preprocess¶
The raw data files can be downloaded from Alpaca-CoT on HuggingFace.
Convert raw Alpaca-CoT data to jsonl¶
Use raw_alpaca_cot_merge_add_meta.py to select instruction
, input
and output
columns and merge them to text
field with a space, and add extra [ META ](https://github.com/modelscope/data-juicer/blob/main/configs/data_juicer_recipes/alpaca_cot/ #meta_info) info to dataset:
python tools/preprocess/raw_alpaca_cot_merge_add_meta.py \
--src_dir <Alpaca-CoT_src_dir> \
--target_dir <target_dir> \
--num_proc <num_proc>
Split datasets to sub-datasets by language¶
Use dataset_split_by_language.py to split the dataset to EN and ZH sub-datasets:
python tools/preprocess/dataset_split_by_language.py \
--src_dir <src_dir> \
--target_dir <target_dir> \
--suffixes jsonl \
--num_proc <num_proc>
Process¶
After preprocess, modify the dataset path in alpaca-cot-en-refine.yaml and alpaca-cot-zh-refine.yaml, and then execute the following command to reproduce the processing flow of refined Alpaca-CoT.
# refine English dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml
# refine Chinese dataset
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml
Meta Info ¶
Each sample in refined data of Alpaca-CoT contains meta info listed as below:
Alpaca-CoT original meta info¶
Language Tags:
EN: Instruction datasets in English
CN: Instruction datasets in Chinese
ML: [Multi-lingual] Instruction datasets in multiple languages
Task Tags
MT: [Multi-task] Datasets containing multiple tasks
TS: [Task-specific] Datasets tailored for specific tasks
Generation-method:
HG: [Human Generated Dataset] Datasets created by humans
SI: [Self-Instruct] Datasets generated using self-instruct methods
MIX: [Mixed Dataset] Dataset contains both human and machine generated data
COL: [Collection of Dataset] Dataset made from a collection of other datasets
Data-Juicer Meta info¶
Dataset
: dataset name in Alpaca-CoTorigin_path
: original file path in Alpaca-CoTIFT
: tagged as Instruct Fine-Tuning datasetsCFT
: tagged as Chat Fine-Tuning datasetsCFT-SR
: tagged as Single-round Dialog datasetsCFT-MR
: tagged as Multi-round Dialog datasetsCFT-P
: tagged as Preference datasets
Refined Alpaca-CoT dataset Meta info¶
Task |
Gen |
Lang |
Dataset |
IFT |
CFT-SR |
CFT-MR |
CFT-P |
|
---|---|---|---|---|---|---|---|---|
Chain-of-Thought |
MT |
HG |
EN/CN |
Chain-of-Thought |
✅ |
|||
GPT4all |
MT |
COL |
EN |
GPT4all |
✅ |
✅ |
||
GPTeacher |
MT |
SI |
EN |
GPTeacher |
✅ |
|||
Guanaco |
MT |
SI |
ML |
Guanaco |
✅ |
|||
HC3 |
TS |
MIX |
EN/CN |
HC3 |
✅ |
✅ |
||
alpaca |
MT |
SI |
EN |
alpaca |
✅ |
|||
Natural-Instructions |
MT |
COL |
ML |
Natural-Instructions |
✅ |
|||
belle_cn |
TS/MT |
SI |
CN |
belle_cn |
✅ |
|||
instinwild |
MT |
SI |
EN/CN |
instinwild |
✅ |
|||
prosocial-dialog |
TS |
MIX |
EN |
prosocial-dialog |
✅ |
|||
finance |
TS |
COL |
EN |
finance |
✅ |
|||
xP3 |
MT |
COL |
ML |
xP3 |
✅ |
|||
firefly |
MT |
COL |
CN |
firefly |
✅ |
|||
instruct |
MT |
COL |
EN |
instruct |
✅ |
|||
CodeAlpaca |
TS |
SI |
EN |
CodeAlpaca |
✅ |
|||
alpacaGPT4 |
MT |
SI |
EN/CN |
alpacaGPT4 |
✅ |
✅ |
||
webGPT |
TS |
MIX |
EN |
webGPT |
✅ |
✅ |
||
dolly |
TS |
HG |
EN |
dolly |
✅ |
|||
baize |
MT |
COL |
EN |
baize |
✅ |
|||
hh-rlhf |
TS |
MIX |
EN |
hh-rlhf |
✅ |
✅ |
✅ |
|
OIG |
MT |
COL |
EN |
OIG |
✅ |
|||
GAOKAO |
MT |
COL |
CN |
GAOKAO |
✅ |
|||
camel |
MT |
SI |
EN |
camel |
✅ |
|||
FLAN-Muffin |
MT |
COL |
EN |
FLAN-Muffin |
✅ |
|||
COIG |
MT |
COL |
CN |
COIG |
✅ |
|||
gpt4tools |
MT |
SI |
EN |
gpt4tools |
✅ |
|||
ShareGPT |
MT |
MIX |
EN |
ShareGPT |
✅ |
✅ |
||
Auto-CoT |
MT |
COL |
EN |
Auto-CoT |
✅ |
|||
MOSS |
TS |
SI |
EN/CN |
MOSS |
✅ |
|||
ultrachat |
TS |
SI |
EN |
ultrachat |
✅ |
|||
Chinese-medical |
TS |
COL |
CN |
Chinese-medical |
✅ |
|||
CSL |
MT |
COL |
CN |
CSL |
✅ |
|||
pCLUE |
MT |
COL |
CN |
pCLUE |
✅ |
|||
news_commentary |
TS |
COL |
CN |
news_commentary |
✅ |
|||
StackExchange |
MT |
COL |
EN |
StackExchange |
✅ |
✅ |
||
ConvAI2 |
TS |
HG |
EN |
ConvAI2 |
✅ |
|||
FastChat |
MT |
SI |
EN |
FastChat |
✅ |
|||
Tabular-LLM-Data |
MT |
COL |
EN/CN |
Tabular-LLM-Data |
✅ |