# GPT EVAL: Evaluate your model with OpenAI API
## Quick Start
1. Prepare your model and the baseline model.
    - your model: Huggingface and Megatron-LM format models are supported; other models will be supported in future releases
    - baseline model: a Huggingface, Megatron-LM, or OpenAI model

    Evaluating Megatron-LM models requires a customized Megatron-LM, which is provided in `thirdparty`.
2. Generate answers using `answer_generator.py` for both your model and the baseline model.
    1. Prepare the benchmark dataset. The toolkit provides Vicuna Bench (`config/question.jsonl`), and you can also create a custom dataset to generate answers. A custom dataset must be a single file in JSONL format, and each JSON object in it contains 3 attributes (a sample line is sketched after this list):
        - question_id: int type
        - text: the specific content of the question, string type
        - category: the type of the question, string type
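        For example, a custom benchmark file could contain lines like the following; the questions and categories here are made-up placeholders, not entries from the shipped Vicuna Bench:

        ```json
        {"question_id": 1, "text": "Explain the difference between supervised and unsupervised learning.", "category": "knowledge"}
        {"question_id": 2, "text": "Write a short poem about autumn.", "category": "writing"}
        ```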
    2. Build the config file (`config.yaml`). The format of the file is as follows:

        ```yaml
        answer_generation:
          model_name: <str>
          question_file: <str>     # path of the benchmark dataset file
          answer_file: <str>       # path of the answer file generated by the model
          batch_size: <int>        # batch size when generating answers
          max_tokens: <int>        # maximum token size for each generated answer
          temperature: <float>
          # Choose one of the following configurations according to your model type
          # Config for huggingface
          huggingface:
            model_path: <str>      # path of your model
            tokenizer_path: <str>  # path of your tokenizer
          # Config for megatron-lm
          megatron:
            megatron_home: <str>   # root dir of Megatron-LM code
            process_num: <int>     # number of processes to run megatron
            checkpoint_path: <str> # megatron checkpoint dir path
            tokenizer_type: <str>  # only support 'gpt2' and 'sentencepiece' for now
            vocab_path: <str>      # path to the vocab file for gpt2 tokenizer
            merge_path: <str>      # path to the merge file for gpt2 tokenizer
            tokenizer_path: <str>  # path to the tokenizer model for sentencepiece tokenizer
            iteration: <int>       # iteration of the checkpoint to load
          # Config for openai
          openai:
            openai_organization: <str>
            openai_api_key: <str>
            model: <str>           # the type of model, e.g., gpt-3.5-turbo
            max_retry: <int>       # the maximum number of retries when API access fails
        ```
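        As a concrete illustration, a config for a local Huggingface model might look like the sketch below; the model name, paths, and parameter values are placeholders to replace with your own:

        ```yaml
        answer_generation:
          model_name: my-llama-7b                  # placeholder label for your model
          question_file: config/question.jsonl     # shipped Vicuna Bench
          answer_file: answers/my-llama-7b.jsonl   # where the generated answers will be written
          batch_size: 8
          max_tokens: 512
          temperature: 0.7
          huggingface:
            model_path: /path/to/my-llama-7b       # placeholder local model directory
            tokenizer_path: /path/to/my-llama-7b   # placeholder tokenizer directory
        ```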
    3. Run the script.

        ```shell
        python answer_generator.py --config <path to config.yaml>
        ```
3. Get OpenAI API evaluation results via `gpt_evaluator.py`.
    1. Prepare dependencies. Make sure the following files are ready:
        - question_file: the benchmark dataset file from the previous step
        - answer_file: the answer file of your model from the previous step
        - baseline_file: the answer file of the baseline model from the previous step
        - prompt_file: a file containing multiple prompt templates; the toolkit provides a sample file (`config/prompt.jsonl`)
        - reviewer_file: a file containing multiple reviewer templates (including the model type and other parameters used in the OpenAI API request); the toolkit provides a sample file (`config/reviewer.json`)
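        The sample files define the exact schema of these templates. Purely as a hypothetical illustration of what a reviewer entry carries (the field names below are assumptions, not the toolkit's documented format; check `config/reviewer.json` for the real one), it might bundle the OpenAI model type with the request parameters used for the review call:

        ```json
        {"reviewer_id": "gpt-4-reviewer", "metadata": {"model": "gpt-4", "temperature": 0.2, "max_tokens": 1024}}
        ```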
    2. Build the config file (`config.yaml`). The format of the file is as follows:

        ```yaml
        gpt_evaluation:
          openai_organization: <str>
          openai_api_key: <str>
          question_file: <str>
          answer_file: <str>
          baseline_file: <str>
          prompt_file: <str>
          reviewer_file: <str>
          result_file: <str>  # path of the evaluation result
        ```
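        For example, to compare the answers generated above against a baseline, a filled-in config could be sketched as follows; every value is a placeholder:

        ```yaml
        gpt_evaluation:
          openai_organization: org-xxxx              # placeholder
          openai_api_key: sk-xxxx                    # placeholder; keep real keys out of version control
          question_file: config/question.jsonl
          answer_file: answers/my-llama-7b.jsonl     # your model's answers from the previous step
          baseline_file: answers/baseline.jsonl      # the baseline model's answers
          prompt_file: config/prompt.jsonl
          reviewer_file: config/reviewer.json
          result_file: results/review.jsonl          # where the evaluation result will be written
        ```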
    3. Run the script.

        ```shell
        python gpt_evaluator.py --config <path to config.yaml>
        ```