# GPT EVAL: Evaluate your model with OpenAI API
## Quick Start
1. Prepare your model and the baseline model.
    - your model: Huggingface and Megatron-LM format models are supported; other models will be supported in future releases
    - baseline model: a Huggingface, Megatron-LM, or OpenAI model

    Evaluating Megatron-LM models requires a customized Megatron-LM, which is provided in `thirdparty`.
2. Generate answers using `answer_generator.py` for both your model and the baseline model.
    1. Prepare the benchmark dataset. The toolkit provides Vicuna Bench (`config/question.jsonl`), and you can also create a custom dataset to generate answers. A custom dataset must be a single file in JSONL format, and each JSON object in it contains 3 attributes (a sample line is sketched after this list):
        - question_id: int type
        - text: the specific content of the question, string type
        - category: the type of the question, string type
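        For example, a custom benchmark file could contain lines like the following; the questions and categories here are made-up placeholders, not entries from the shipped Vicuna Bench:

        ```json
        {"question_id": 1, "text": "Explain the difference between supervised and unsupervised learning.", "category": "knowledge"}
        {"question_id": 2, "text": "Write a short poem about autumn.", "category": "writing"}
        ```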
    2. Build the config file (`config.yaml`). The format of the file is as follows:

        ```yaml
        answer_generation:
          model_name: <str>
          question_file: <str>     # path of the benchmark dataset file
          answer_file: <str>       # path of the answer file generated by the model
          batch_size: <int>        # batch size when generating answers
          max_tokens: <int>        # maximum token size for each generated answer
          temperature: <float>
          # Choose one of the following configurations according to your model type
          # Config for huggingface
          huggingface:
            model_path: <str>      # path of your model
            tokenizer_path: <str>  # path of your tokenizer
          # Config for megatron-lm
          megatron:
            megatron_home: <str>   # root dir of Megatron-LM code
            process_num: <int>     # number of processes to run megatron
            checkpoint_path: <str> # megatron checkpoint dir path
            tokenizer_type: <str>  # only support 'gpt2' and 'sentencepiece' for now
            vocab_path: <str>      # path to the vocab file for gpt2 tokenizer
            merge_path: <str>      # path to the merge file for gpt2 tokenizer
            tokenizer_path: <str>  # path to the tokenizer model for sentencepiece tokenizer
            iteration: <int>       # iteration of the checkpoint to load
          # Config for openai
          openai:
            openai_organization: <str>
            openai_api_key: <str>
            model: <str>           # the type of model, e.g., gpt-3.5-turbo
            max_retry: <int>       # the maximum number of retries when API access fails
        ```
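        As a concrete illustration, a config for a local Huggingface model might look like the sketch below; the model name, paths, and parameter values are placeholders to replace with your own:

        ```yaml
        answer_generation:
          model_name: my-llama-7b                  # placeholder label for your model
          question_file: config/question.jsonl     # shipped Vicuna Bench
          answer_file: answers/my-llama-7b.jsonl   # where the generated answers will be written
          batch_size: 8
          max_tokens: 512
          temperature: 0.7
          huggingface:
            model_path: /path/to/my-llama-7b       # placeholder local model directory
            tokenizer_path: /path/to/my-llama-7b   # placeholder tokenizer directory
        ```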
    3. Run the script.

        ```shell
        python answer_generator.py --config <path to config.yaml>
        ```
3. Get OpenAI API evaluation results via `gpt_evaluator.py`.
    1. Prepare dependencies. Make sure the following files are ready:
        - question_file: the benchmark dataset file from the previous step
        - answer_file: the answer file of your model from the previous step
        - baseline_file: the answer file of the baseline model from the previous step
        - prompt_file: a file containing multiple prompt templates; the toolkit provides a sample file (`config/prompt.jsonl`)
        - reviewer_file: a file containing multiple reviewer templates (including the model type and other parameters used in the OpenAI API request); the toolkit provides a sample file (`config/reviewer.json`)
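        The sample files define the exact schema of these templates. Purely as a hypothetical illustration of what a reviewer entry carries (the field names below are assumptions, not the toolkit's documented format; check `config/reviewer.json` for the real one), it might bundle the OpenAI model type with the request parameters used for the review call:

        ```json
        {"reviewer_id": "gpt-4-reviewer", "metadata": {"model": "gpt-4", "temperature": 0.2, "max_tokens": 1024}}
        ```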
    2. Build the config file (`config.yaml`). The format of the file is as follows:

        ```yaml
        gpt_evaluation:
          openai_organization: <str>
          openai_api_key: <str>
          question_file: <str>
          answer_file: <str>
          baseline_file: <str>
          prompt_file: <str>
          reviewer_file: <str>
          result_file: <str>  # path of the evaluation result
        ```
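        For example, to compare the answers generated above against a baseline, a filled-in config could be sketched as follows; every value is a placeholder:

        ```yaml
        gpt_evaluation:
          openai_organization: org-xxxx              # placeholder
          openai_api_key: sk-xxxx                    # placeholder; keep real keys out of version control
          question_file: config/question.jsonl
          answer_file: answers/my-llama-7b.jsonl     # your model's answers from the previous step
          baseline_file: answers/baseline.jsonl      # the baseline model's answers
          prompt_file: config/prompt.jsonl
          reviewer_file: config/reviewer.json
          result_file: results/review.jsonl          # where the evaluation result will be written
        ```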
    3. Run the script.

        ```shell
        python gpt_evaluator.py --config <path to config.yaml>
        ```