Auto Evaluation Toolkit
Automatically evaluate your model and monitor metric changes during the training process.
Preparation
1. Prepare multiple GPU machines (at least 2: one for evaluation, the others for training).
2. Mount a shared file system (e.g., NAS) to the same path (e.g., /mnt/shared) on all of the above machines (an illustrative setup sketch follows this list).
3. Install Data-Juicer in the shared file system (e.g., /mnt/shared/code/data-juicer).
4. Install the third-party dependencies (Megatron-LM and HELM) according to thirdparty/README.md on each machine.
5. Prepare your dataset and tokenizer, and preprocess the dataset with Megatron-LM into mmap format (see the README of Megatron-LM for more details) in the shared file system (e.g., /mnt/shared/dataset).
6. Run Megatron-LM on the training machines and save the checkpoints to the shared file system (e.g., /mnt/shared/checkpoints).
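To make these steps concrete, here is a minimal sketch of what the setup might look like on each machine, assuming an NFS-based NAS and the example paths above. The mount source, install command, and directory names are illustrative choices, not requirements of the toolkit:

    # Illustrative setup on each machine; adjust the NFS source and paths to your environment
    sudo mount -t nfs <nas_host>:/export/shared /mnt/shared   # same mount point on every machine

    # Data-Juicer installed once in the shared file system (step 3)
    cd /mnt/shared/code/data-juicer && pip install -v -e .

    # Artifacts exchanged between training and evaluation through the shared file system
    ls /mnt/shared/dataset       # mmap-format dataset produced by Megatron-LM preprocessing (step 5)
    ls /mnt/shared/checkpoints   # checkpoints written by the Megatron-LM training job (step 6)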
Usage
Use evaluator.py to automatically evaluate your models with HELM and the OpenAI API.
    python tools/evaluator.py \
        --config <config> \
        --begin-iteration <begin_iteration> \
        [--end-iteration <end_iteration>] \
        [--iteration-interval <iteration_interval>] \
        [--check-interval <check_interval>] \
        [--model-type <model_type>] \
        [--eval-type <eval_type>]
- config: a yaml file containing the various settings required to run the evaluation (see the Configuration section below for details)
- begin_iteration: iteration of the first checkpoint to be evaluated
- end_iteration: iteration of the last checkpoint to be evaluated. If not set, the evaluator continuously monitors the training process and evaluates newly generated checkpoints.
- iteration_interval: iteration interval between two checkpoints; the default is 1000 iterations
- check_interval: time interval between two checks; the default is 30 minutes
- model_type: type of your model; megatron and huggingface are supported for now
  - megatron: evaluate Megatron-LM checkpoints (default)
  - huggingface: evaluate a HuggingFace model; only the gpt eval type is supported
- eval_type: type of evaluation to run; helm and gpt are supported for now
  - helm: evaluate your model with HELM (default); you can change the benchmarks to run by modifying the HELM spec template file
  - gpt: evaluate your model with the OpenAI API; more details can be found in gpt_eval/README.md
For example,

    python evaluator.py --config <config_file> --begin-iteration 2000 --iteration-interval 1000 --check-interval 10

will use HELM to evaluate a Megatron-LM checkpoint every 1000 iterations starting from iteration 2000, and check every 10 minutes whether a new checkpoint meets the condition.
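Likewise, a bounded run that evaluates a fixed range of checkpoints and scores them with the OpenAI API instead of HELM could look like the following sketch; the config path is only a placeholder, while the flags are the ones described above:

    # Evaluate checkpoints 2000, 3000, ..., 10000 once and exit (no continuous monitoring),
    # using the gpt eval type instead of HELM
    python tools/evaluator.py \
        --config /mnt/shared/configs/auto_eval.yaml \
        --begin-iteration 2000 \
        --end-iteration 10000 \
        --iteration-interval 1000 \
        --model-type megatron \
        --eval-type gpt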
After running evaluator.py, you can use recorder/wandb_writer.py to visualize the evaluation results; more details can be found in recorder/README.md.
Configuration
The format of config_file is as follows:
    auto_eval:
      project_name: <str>   # your project name
      model_name: <str>     # your model name
      cache_dir: <str>      # path of the cache dir
      megatron:
        process_num: <int>        # number of processes used to run Megatron-LM
        megatron_home: <str>      # root dir of Megatron-LM
        checkpoint_path: <str>    # path of the checkpoint dir
        tokenizer_type: <str>     # gpt2 and sentencepiece are supported for now
        vocab_path: <str>         # path to the vocab file (for the gpt2 tokenizer type)
        merge_path: <str>         # path to the merge file (for the gpt2 tokenizer type)
        tokenizer_path: <str>     # path to the model file (for the sentencepiece tokenizer type)
        max_tokens: <int>         # max number of tokens to generate in inference
        token_per_iteration: <float>  # billions of tokens per iteration
      helm:
        helm_spec_template_path: <str>  # path of the HELM spec template file; default is tools/evaluator/config/helm_spec_template.conf
        helm_output_path: <str>         # path of the HELM output dir
        helm_env_name: <str>            # name of the HELM conda env
      gpt_evaluation:
        # openai config
        openai_api_key: <str>       # your API key
        openai_organization: <str>  # your organization
        # files config
        question_file: <str>   # default is tools/evaluator/gpt_eval/config/question.jsonl
        baseline_file: <str>   # default is tools/evaluator/gpt_eval/answer/openai/gpt-3.5-turbo.jsonl
        prompt_file: <str>     # default is tools/evaluator/gpt_eval/config/prompt.jsonl
        reviewer_file: <str>   # default is tools/evaluator/gpt_eval/config/reviewer.jsonl
        answer_file: <str>     # path to the generated answer file
        result_file: <str>     # path to the generated review file
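As a concrete illustration, the snippet below writes a filled-in config to the shared file system so that every machine reads the same file. All values (names, paths, numbers) are examples chosen to match the /mnt/shared layout used earlier, and the keys with documented defaults are omitted on the assumption that those defaults then apply:

    # Example only: every value below is a placeholder matching the /mnt/shared layout above
    cat > /mnt/shared/configs/auto_eval.yaml << 'EOF'
    auto_eval:
      project_name: my-llm-eval
      model_name: my-llm-1b3
      cache_dir: /mnt/shared/cache
      megatron:
        process_num: 8
        megatron_home: /mnt/shared/code/Megatron-LM
        checkpoint_path: /mnt/shared/checkpoints
        tokenizer_type: sentencepiece
        tokenizer_path: /mnt/shared/dataset/tokenizer.model
        max_tokens: 512
        token_per_iteration: 0.002   # e.g. batch size 1024 x sequence length 2048 ~ 0.002B tokens per iteration
      helm:
        helm_spec_template_path: tools/evaluator/config/helm_spec_template.conf
        helm_output_path: /mnt/shared/helm_output
        helm_env_name: helm          # name of the conda env prepared for HELM
      gpt_evaluation:
        openai_api_key: <your_api_key>
        openai_organization: <your_organization>
        answer_file: /mnt/shared/gpt_eval/answers/my-llm-1b3.jsonl
        result_file: /mnt/shared/gpt_eval/results/my-llm-1b3.jsonl
        # question_file, baseline_file, prompt_file, and reviewer_file are left at their documented defaults
    EOF

With such a file in place, the evaluator.py commands above can point --config at /mnt/shared/configs/auto_eval.yaml.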