Auto Evaluation Toolkit
Automatically evaluate your model and monitor metric changes during the training process.
Preparation
1. Prepare multiple GPU machines (at least 2: one for evaluation, the others for training).
2. Mount a shared file system (e.g., NAS) to the same path (e.g., /mnt/shared) on all of the above machines (an illustrative setup sketch follows this list).
3. Install Data-Juicer in the shared file system (e.g., /mnt/shared/code/data-juicer).
4. Install the third-party dependencies (Megatron-LM and HELM) according to thirdparty/README.md on each machine.
5. Prepare your dataset and tokenizer, and preprocess the dataset with Megatron-LM into mmap format (see the README of Megatron-LM for more details) in the shared file system (e.g., /mnt/shared/dataset).
6. Run Megatron-LM on the training machines and save the checkpoints to the shared file system (e.g., /mnt/shared/checkpoints).
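To make these steps concrete, here is a minimal sketch of what the setup might look like on each machine, assuming an NFS-based NAS and the example paths above. The mount source, install command, and directory names are illustrative choices, not requirements of the toolkit:

    # Illustrative setup on each machine; adjust the NFS source and paths to your environment
    sudo mount -t nfs <nas_host>:/export/shared /mnt/shared   # same mount point on every machine

    # Data-Juicer installed once in the shared file system (step 3)
    cd /mnt/shared/code/data-juicer && pip install -v -e .

    # Artifacts exchanged between training and evaluation through the shared file system
    ls /mnt/shared/dataset       # mmap-format dataset produced by Megatron-LM preprocessing (step 5)
    ls /mnt/shared/checkpoints   # checkpoints written by the Megatron-LM training job (step 6)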
Usage
Use evaluator.py to automatically evaluate your models with HELM and the OpenAI API.
    python tools/evaluator.py \
        --config <config> \
        --begin-iteration <begin_iteration> \
        [--end-iteration <end_iteration>] \
        [--iteration-interval <iteration_interval>] \
        [--check-interval <check_interval>] \
        [--model-type <model_type>] \
        [--eval-type <eval_type>]
- config: a yaml file containing the various settings required to run the evaluation (see the Configuration section below for details)
- begin_iteration: iteration of the first checkpoint to be evaluated
- end_iteration: iteration of the last checkpoint to be evaluated. If not set, the evaluator continuously monitors the training process and evaluates newly generated checkpoints.
- iteration_interval: iteration interval between two checkpoints; the default is 1000 iterations
- check_interval: time interval between two checks; the default is 30 minutes
- model_type: type of your model; megatron and huggingface are supported for now
  - megatron: evaluate Megatron-LM checkpoints (default)
  - huggingface: evaluate a HuggingFace model; only the gpt eval type is supported
- eval_type: type of evaluation to run; helm and gpt are supported for now
  - helm: evaluate your model with HELM (default); you can change the benchmarks to run by modifying the HELM spec template file
  - gpt: evaluate your model with the OpenAI API; more details can be found in gpt_eval/README.md
For example,

    python evaluator.py --config <config_file> --begin-iteration 2000 --iteration-interval 1000 --check-interval 10

will use HELM to evaluate a Megatron-LM checkpoint every 1000 iterations starting from iteration 2000, and check every 10 minutes whether a new checkpoint meets the condition.
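Likewise, a bounded run that evaluates a fixed range of checkpoints and scores them with the OpenAI API instead of HELM could look like the following sketch; the config path is only a placeholder, while the flags are the ones described above:

    # Evaluate checkpoints 2000, 3000, ..., 10000 once and exit (no continuous monitoring),
    # using the gpt eval type instead of HELM
    python tools/evaluator.py \
        --config /mnt/shared/configs/auto_eval.yaml \
        --begin-iteration 2000 \
        --end-iteration 10000 \
        --iteration-interval 1000 \
        --model-type megatron \
        --eval-type gpt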
After running evaluator.py, you can use recorder/wandb_writer.py to visualize the evaluation results; more details can be found in recorder/README.md.
Configuration
The format of config_file is as follows:
    auto_eval:
      project_name: <str>   # your project name
      model_name: <str>     # your model name
      cache_dir: <str>      # path of the cache dir
      megatron:
        process_num: <int>        # number of processes used to run Megatron-LM
        megatron_home: <str>      # root dir of Megatron-LM
        checkpoint_path: <str>    # path of the checkpoint dir
        tokenizer_type: <str>     # gpt2 and sentencepiece are supported for now
        vocab_path: <str>         # path to the vocab file (for the gpt2 tokenizer type)
        merge_path: <str>         # path to the merge file (for the gpt2 tokenizer type)
        tokenizer_path: <str>     # path to the model file (for the sentencepiece tokenizer type)
        max_tokens: <int>         # max number of tokens to generate in inference
        token_per_iteration: <float>  # billions of tokens per iteration
      helm:
        helm_spec_template_path: <str>  # path of the HELM spec template file; default is tools/evaluator/config/helm_spec_template.conf
        helm_output_path: <str>         # path of the HELM output dir
        helm_env_name: <str>            # name of the HELM conda env
      gpt_evaluation:
        # openai config
        openai_api_key: <str>       # your API key
        openai_organization: <str>  # your organization
        # files config
        question_file: <str>   # default is tools/evaluator/gpt_eval/config/question.jsonl
        baseline_file: <str>   # default is tools/evaluator/gpt_eval/answer/openai/gpt-3.5-turbo.jsonl
        prompt_file: <str>     # default is tools/evaluator/gpt_eval/config/prompt.jsonl
        reviewer_file: <str>   # default is tools/evaluator/gpt_eval/config/reviewer.jsonl
        answer_file: <str>     # path to the generated answer file
        result_file: <str>     # path to the generated review file
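As a concrete illustration, the snippet below writes a filled-in config to the shared file system so that every machine reads the same file. All values (names, paths, numbers) are examples chosen to match the /mnt/shared layout used earlier, and the keys with documented defaults are omitted on the assumption that those defaults then apply:

    # Example only: every value below is a placeholder matching the /mnt/shared layout above
    cat > /mnt/shared/configs/auto_eval.yaml << 'EOF'
    auto_eval:
      project_name: my-llm-eval
      model_name: my-llm-1b3
      cache_dir: /mnt/shared/cache
      megatron:
        process_num: 8
        megatron_home: /mnt/shared/code/Megatron-LM
        checkpoint_path: /mnt/shared/checkpoints
        tokenizer_type: sentencepiece
        tokenizer_path: /mnt/shared/dataset/tokenizer.model
        max_tokens: 512
        token_per_iteration: 0.002   # e.g. batch size 1024 x sequence length 2048 ~ 0.002B tokens per iteration
      helm:
        helm_spec_template_path: tools/evaluator/config/helm_spec_template.conf
        helm_output_path: /mnt/shared/helm_output
        helm_env_name: helm          # name of the conda env prepared for HELM
      gpt_evaluation:
        openai_api_key: <your_api_key>
        openai_organization: <your_organization>
        answer_file: /mnt/shared/gpt_eval/answers/my-llm-1b3.jsonl
        result_file: /mnt/shared/gpt_eval/results/my-llm-1b3.jsonl
        # question_file, baseline_file, prompt_file, and reviewer_file are left at their documented defaults
    EOF

With such a file in place, the evaluator.py commands above can point --config at /mnt/shared/configs/auto_eval.yaml.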