# GPT EVAL: Evaluate your model with OpenAI API

## Quick Start

1. Prepare your model and the baseline model.

   - your model: Huggingface and Megatron-LM format models are supported; other model formats will be supported in future releases
   - baseline model: a Huggingface, Megatron-LM, or OpenAI model

   > Evaluating Megatron-LM models requires a customized Megatron-LM, which is provided in [`thirdparty`](../../../thirdparty/).

2. Generate answers using [`answer_generator.py`](answer_generator.py) for both your model and the baseline model.

   1. Prepare the benchmark dataset.

      The toolkit provides Vicuna Bench ([`config/question.jsonl`](./config/question.jsonl)), and you can also create a custom dataset to generate answers from. A custom dataset must be a single file in jsonl format, where each json object contains 3 attributes (an example line is shown at the end of this document):

      - question_id: int type
      - text: the specific content of the question, string type
      - category: the type of the question, string type

   2. Build the config file (`config.yaml`). The format of the file is as follows (a filled-in example for the Huggingface case is given at the end of this document):

      ```yaml
      answer_generation:
        model_name:
        question_file:        # path of the benchmark dataset file
        answer_file:          # path of the answer file generated by the model
        batch_size:           # batch size when generating answers
        max_tokens:           # maximum number of tokens for each generated answer
        temperature:
        # Choose one of the following configurations according to your model type
        # Config for huggingface
        huggingface:
          model_path:         # path of your model
          tokenizer_path:     # path of your tokenizer
        # Config for megatron-lm
        megatron:
          megatron_home:      # root dir of the Megatron-LM code
          process_num:        # number of processes to run megatron
          checkpoint_path:    # megatron checkpoint dir path
          tokenizer_type:     # only 'gpt2' and 'sentencepiece' are supported for now
          vocab_path:         # path to the vocab file for the gpt2 tokenizer
          merge_path:         # path to the merge file for the gpt2 tokenizer
          tokenizer_path:     # path to the tokenizer model for the sentencepiece tokenizer
          iteration:          # iteration of the checkpoint to load
        # Config for openai
        openai:
          openai_organization:
          openai_api_key:
          model:              # the type of model, e.g., gpt-3.5-turbo
          max_retry:          # the maximum number of retries when an api access fails
      ```

   3. Run the script.

      ```shell
      python answer_generator.py --config <path_to_config.yaml>
      ```

3. Get the OpenAI API evaluation results via [`gpt_evaluator.py`](gpt_evaluator.py).

   1. Prepare the dependencies. Make sure the following files are ready:

      - question_file: the benchmark dataset file from the previous step
      - answer_file: the answer file of your model from the previous step
      - baseline_file: the answer file of the baseline model from the previous step
      - prompt_file: a file containing multiple prompt templates; the toolkit provides a sample file ([`config/prompt.jsonl`](config/prompt.jsonl))
      - reviewer_file: a file containing multiple reviewer templates (including the model type and other parameters used in the OpenAI API requests); the toolkit provides a sample file ([`config/reviewer.jsonl`](config/reviewer.jsonl))

   2. Build the config file (`config.yaml`). The format of the file is as follows (a filled-in example is given at the end of this document):

      ```yaml
      gpt_evaluation:
        openai_organization:
        openai_api_key:
        question_file:
        answer_file:
        baseline_file:
        prompt_file:
        reviewer_file:
        result_file:    # path of the evaluation result
      ```

   3. Run the script.

      ```shell
      python gpt_evaluator.py --config <path_to_config.yaml>
      ```
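
For reference, each line of a benchmark dataset file is a standalone json object with the three attributes described in step 2.1. The two lines below are only an illustration of the expected shape; the ids, question texts, and categories are made up, not taken from the provided Vicuna Bench:

```jsonl
{"question_id": 1, "text": "How can I improve my time management skills?", "category": "generic"}
{"question_id": 2, "text": "Write a short story about a robot learning to paint.", "category": "writing"}
```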
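
As a concrete sketch of the answer-generation config in step 2.2, the example below fills in the Huggingface branch. The model name, paths, and numeric values are placeholders chosen for illustration and should be replaced with your own:

```yaml
answer_generation:
  model_name: my-model                       # placeholder model name
  question_file: config/question.jsonl       # the provided Vicuna Bench
  answer_file: answers/my-model.jsonl        # placeholder output path
  batch_size: 8
  max_tokens: 512
  temperature: 0.7
  huggingface:
    model_path: /path/to/your/model          # placeholder
    tokenizer_path: /path/to/your/tokenizer  # placeholder
```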
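
Similarly, a filled-in evaluation config for step 3.2 might look like the following; the API credentials and all paths outside `config/` are placeholders:

```yaml
gpt_evaluation:
  openai_organization: org-xxxxxxxx            # placeholder
  openai_api_key: sk-xxxxxxxx                  # placeholder
  question_file: config/question.jsonl
  answer_file: answers/my-model.jsonl          # answers of your model from step 2
  baseline_file: answers/baseline-model.jsonl  # answers of the baseline model from step 2
  prompt_file: config/prompt.jsonl             # sample prompt templates shipped with the toolkit
  reviewer_file: config/reviewer.jsonl         # sample reviewer templates shipped with the toolkit
  result_file: results/my-model-review.jsonl   # placeholder output path
```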