Evaluation Results Recorder¶
Record your evaluation results to W&B (wandb) with `wandb_writer.py`.
With `wandb_writer.py`, you can:
- visualize how your model's evaluation metrics change over the course of training (see the sketch below)
- build a leaderboard to compare the metrics of different models
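Conceptually, each recorded evaluation becomes a set of wandb metrics logged against the number of training tokens, which wandb then renders as curves. The snippet below is only a rough illustration of that idea using the public wandb Python API; the project name, model name, and scores are made-up examples, and the actual logic in `wandb_writer.py` may differ.

```python
import wandb

# Illustration only: how evaluation scores can become wandb time series.
# Project name, model name, and scores are made-up examples.
run = wandb.init(project="my-llm-evals", name="llama-7B")
for tokens_b, scores in [
    (100, {"mmlu.EM": 0.31, "hellaswag.EM": 0.70}),
    (200, {"mmlu.EM": 0.35, "hellaswag.EM": 0.75}),
]:
    # Logging a token count alongside the scores lets wandb plot metric-vs-tokens curves.
    wandb.log({**scores, "tokens (B)": tokens_b})
run.finish()
```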
Usage¶
```
python wandb_writer.py --config <config_file> [--print-only]
```
- `config_file`: path to the configuration file (see Configuration for details)
- `--print-only`: only print the results to the command line; do not write to wandb
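A typical workflow is to run with `--print-only` first to verify that the results are parsed as expected, then rerun without the flag to write them to wandb.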
Configuration¶
We provide three example files in the `config` folder, one for each of the three cases below.
The general format is as follows:
```yaml
project: <str>  # your wandb project name
base_url: <str> # your wandb instance url
# other specific configuration items
```
Parse from HELM output¶
The following configuration parses evaluation results from a HELM output folder and records them to wandb.
```yaml
# general configurations
# ...
evals: # evaluations to record
  - eval_type: helm # only support helm for now
    model_name: <str> # your model name
    source: helm # use helm to parse from helm output directory
    helm_output_dir: <your helm output dir path>
    helm_suite_name: <your helm suite name>
    token_per_iteration: <tokens per iteration in billions>
    benchmarks: # benchmark metrics to be recorded, and below are some examples
      - name: mmlu
        metrics:
          - EM
      - name: boolq
        metrics:
          - EM
      - name: narrative_qa
        metrics:
          - F1
      - name: hellaswag
        metrics:
          - EM
      - ...
```
If the `benchmarks` field is not provided, we use the 16 core metrics of HELM as the default benchmarks: mmlu.EM, raft.EM, imdb.EM, truthful_qa.EM, summarization_cnndm.ROUGE-2, summarization_xsum.ROUGE-2, boolq.EM, msmarco_trec.NDCG@10, msmarco_regular.RR@10, narrative_qa.F1, natural_qa_closedbook.F1, natural_qa_openbook_longans.F1, civil_comments.EM, hellaswag.EM, openbookqa.EM.
Parse from configuration file¶
Metric scores can also be given directly in the configuration file, as in the following example.
```yaml
# general configurations
# ...
evals: # evaluations to record
  - eval_type: helm
    model_name: llama-7B # your model name
    source: file # use file to parse from configuration
    token_num: 1000
    eval_result: # evaluation results to be recorded
      mmlu:
        EM: 0.345
      boolq:
        EM: 0.751
      narrative_qa:
        F1: 0.524
      hellaswag:
        EM: 0.747
      ...
```
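For reference, the sketch below shows how a `source: file` entry like the one above could be read and pushed to wandb using the public PyYAML and wandb APIs. The config file name is hypothetical and this is only an illustration; the actual implementation in `wandb_writer.py` may differ.

```python
import wandb
import yaml

# Illustration only; the real wandb_writer.py logic may differ.
with open("config/file_example.yaml") as f:  # hypothetical config file name
    cfg = yaml.safe_load(f)

for ev in cfg["evals"]:
    if ev.get("source") != "file":
        continue
    # Flatten {"mmlu": {"EM": 0.345}, ...} into {"mmlu.EM": 0.345, ...}.
    flat = {
        f"{benchmark}.{metric}": score
        for benchmark, metrics in ev["eval_result"].items()
        for metric, score in metrics.items()
    }
    run = wandb.init(project=cfg["project"], name=ev["model_name"], reinit=True)
    wandb.log({**flat, "token_num": ev["token_num"]})
    run.finish()
```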
Make a leaderboard¶
The following configuration is used to make a leaderboard.
```yaml
# general configurations
# ...
leaderboard: True
leaderboard_metrics: # metrics required for the leaderboard
  - mmlu.EM
  - boolq.EM
  - quac.F1
  - hellaswag.EM
  - ...
excluded_models: # models that do not participate in the leaderboard
  - <model to exclude>
  - ...
```
If the `leaderboard_metrics` field is not provided, we use the 16 core metrics of HELM as the default leaderboard metrics; they are the same as the default benchmark metrics listed above.
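For reference, a leaderboard is essentially a table of models versus metrics. The sketch below shows one way such a table could be assembled with the public wandb API; the model names and scores are made-up examples, and the actual implementation in `wandb_writer.py` may differ.

```python
import wandb

# Illustration only: a leaderboard as a wandb Table of models x metrics.
# Model names and scores below are made-up examples.
leaderboard_metrics = ["mmlu.EM", "boolq.EM", "quac.F1", "hellaswag.EM"]
results = {
    "model-A": {"mmlu.EM": 0.345, "boolq.EM": 0.751, "quac.F1": 0.395, "hellaswag.EM": 0.747},
    "model-B": {"mmlu.EM": 0.469, "boolq.EM": 0.781, "quac.F1": 0.423, "hellaswag.EM": 0.792},
}

table = wandb.Table(columns=["model"] + leaderboard_metrics)
for model, scores in results.items():
    table.add_data(model, *[scores[m] for m in leaderboard_metrics])

run = wandb.init(project="my-llm-evals", name="leaderboard")
wandb.log({"leaderboard": table})
run.finish()
```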