# Evaluation Results Recorder

Record your evaluation results to [W&B](https://wandb.ai/) (wandb) with [`wandb_writer.py`](wandb_writer.py). With `wandb_writer.py`, you can:

- visualize how your model's evaluation metrics change over the course of training

  ![Metrics](../../../docs/imgs/eval-02.png "change of metrics")

- build a leaderboard to compare the metrics of different models

  ![Leaderboard](../../../docs/imgs/eval-01.png "Leaderboard")

## Usage

```shell
python wandb_writer.py --config <config_file> [--print-only]
```

- `config_file`: path to the configuration file (see [Configuration](#configuration) for details)
- `--print-only`: only print the results to the command line instead of writing them to wandb

## Configuration

We provide three example files in the [`config`](config) folder, one for each of the cases below. The general format is as follows:

```yaml
project:  # your wandb project name
base_url: # your wandb instance url
# other specific configuration items
```

### Parse from HELM output

The following configuration parses evaluation results from a HELM output folder and records them to wandb.

```yaml
# general configurations
# ...
evals: # evaluations to record
  - eval_type: helm # only helm is supported for now
    model_name: # your model name
    source: helm # use helm to parse from a helm output directory
    helm_output_dir: # path to the HELM output directory
    helm_suite_name: # name of the HELM suite
    token_per_iteration: # number of tokens per training iteration
    benchmarks: # benchmark metrics to be recorded; below are some examples
      - name: mmlu
        metrics:
          - EM
      - name: boolq
        metrics:
          - EM
      - name: narrative_qa
        metrics:
          - F1
      - name: hellaswag
        metrics:
          - EM
      - ...
```

> If the `benchmarks` field is not provided, we use 16 core metrics of HELM as the default benchmarks:
> ```
> mmlu.EM, raft.EM, imdb.EM, truthful_qa.EM, summarization_cnndm.ROUGE-2, summarization_xsum.ROUGE-2, boolq.EM, msmarco_trec.NDCG@10, msmarco_regular.RR@10, narrative_qa.F1, natural_qa_closedbook.F1, natural_qa_openbook_longans.F1, civil_comments.EM, hellaswag.EM, openbookqa.EM
> ```

### Parse from configuration file

Metric scores can also be given directly in the configuration file, as in the following example.

```yaml
# general configurations
# ...
evals: # evaluations to record
  - eval_type: helm
    model_name: llama-7B # your model name
    source: file # use file to parse from the configuration itself
    token_num: 1000
    eval_result: # evaluation results to be recorded
      mmlu:
        EM: 0.345
      boolq:
        EM: 0.751
      narrative_qa:
        F1: 0.524
      hellaswag:
        EM: 0.747
      ...
```

### Make leaderboard

The following configuration is used to build a leaderboard.

```yaml
# general configurations
# ...
leaderboard: True
leaderboard_metrics: # metrics required for the leaderboard
  - mmlu.EM
  - boolq.EM
  - quac.F1
  - hellaswag.EM
  - ...
excluded_models: # models that do not participate in the leaderboard
  -
  - ...
```

> If the `leaderboard_metrics` field is not provided, we use 16 core metrics of HELM as the default leaderboard metrics; they are the same as the default benchmark metrics.
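
### Putting it together

For orientation, a complete configuration that records one model's scores and builds a leaderboard might look like the sketch below. This assumes the `evals` and `leaderboard` sections can live in the same file, as the examples above suggest; the project name, URL, model name, scores, and file name are all placeholders.

```yaml
# my_eval.yaml -- illustrative values only
project: my-eval-project            # your wandb project name
base_url: https://wandb.example.com # your wandb instance url

evals:
  - eval_type: helm
    model_name: llama-7B
    source: file
    token_num: 1000
    eval_result:
      mmlu:
        EM: 0.345
      boolq:
        EM: 0.751

leaderboard: True
leaderboard_metrics:
  - mmlu.EM
  - boolq.EM
```

You can preview the parsed results before writing anything, then record them:

```shell
# dry run: print the parsed results without touching wandb
python wandb_writer.py --config my_eval.yaml --print-only

# record the results to wandb
python wandb_writer.py --config my_eval.yaml
```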
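
Under the hood, recording boils down to standard wandb API calls. The following is a minimal sketch of that idea, not the actual implementation of `wandb_writer.py`; the host, project name, run name, metric values, and step are illustrative.

```python
import os
import wandb

# Point the client at a self-hosted instance (corresponds to `base_url`).
os.environ["WANDB_BASE_URL"] = "https://wandb.example.com"

# One run per model; the run name is what appears in the wandb UI.
run = wandb.init(project="my-eval-project", name="llama-7B")

# Log each benchmark metric at the corresponding token count so that
# curves from different checkpoints line up on a shared x-axis.
run.log({"mmlu.EM": 0.345, "boolq.EM": 0.751}, step=1000)

run.finish()
```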