🤔 Considering integrating LLMs into your cloud application but unsure about their performance? Check this out 👇
📊 The table is generated with CloudEval-YAML, a practical Large Language Model (LLM) benchmark designed specifically for cloud-native application configuration generation. It comprises a comprehensive dataset of 1011 problems covering widely deployed applications, including Kubernetes, Envoy, and Istio. The benchmark takes a practical, realistic approach: each problem combines a hand-written question context, a reference YAML file, and a unit test script. It can be used to compare different LLMs, as well as different sampling and prompt engineering techniques.
📄 Please refer to the technical report for more details: CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation.
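For a concrete sense of what a single problem contains, here is an illustrative sketch in Python. The field names (question, reference_yaml, unit_test) and the specific Pod example are hypothetical and only convey the idea; they do not reflect the dataset's actual schema.

```python
# Illustrative only: a CloudEval-YAML problem pairs a hand-written question
# context with a reference YAML file and a unit test script. The field names
# and the concrete Pod example below are hypothetical, not the dataset schema.
example_problem = {
    "question": (
        "Create a Pod named nginx that runs the nginx:1.25 image "
        "and exposes container port 80."
    ),
    "reference_yaml": """\
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80
""",
    # A unit test script checks the generated YAML against a live cluster,
    # e.g. that the Pod actually becomes Ready after applying it.
    "unit_test": (
        "kubectl apply -f answer.yaml && "
        "kubectl wait --for=condition=Ready pod/nginx --timeout=120s"
    ),
}
```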
This quickstart benchmarks GPT-3.5 with all metrics except unit tests.
Install the Python prerequisites:
pip install -r requirements.txt
Configure your OpenAI API key and run the benchmark:
export OPENAI_API_KEY='YOUR_API_KEY'
python benchmark.py
Sample output from the terminal:
------ Problem: Kubernetes_pod_q1 ------
bleu score: 0.69 / 1
edit_distance score: 0.57 / 1
exact_match score: 0.00 / 1
kv_match score: 0.00 / 1
kv_wildcard score: 0.73 / 1
(more problems...)
------ Category: Kubernetes_pod ------
bleu score: 30.44 / 48
edit_distance score: 23.60 / 48
exact_match score: 4.00 / 48
kv_match score: 10.00 / 48
kv_wildcard score: 28.83 / 48
(more categories...)
------ Overall --- Model: gpt-3.5 ------
Dataset: original (337 problems)
bleu score: 211.66 / 337
edit_distance score: 174.34 / 337
exact_match score: 29.00 / 337
kv_match score: 58.00 / 337
kv_wildcard score: 212.39 / 337
(repeat above: more datasets...)
====== Overall === Model: gpt-3.5 ======
Dataset: original, simplified, translated (1011 problems)
bleu score: 0.631
edit_distance score: 0.522
exact_match score: 0.071
kv_match score: 0.154
kv_wildcard score: 0.605
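For intuition about the reference-based metrics above, the sketch below shows roughly what an exact match, a normalized edit distance, and a key-value match over parsed YAML could look like. These are illustrative approximations, not the benchmark's actual scoring code; see the technical report for the precise definitions.

```python
# Rough, illustrative versions of the reference-based metrics; the benchmark's
# actual scoring code may normalize, weight, and handle edge cases differently.
import difflib

import yaml  # PyYAML


def exact_match(generated: str, reference: str) -> float:
    """1.0 iff the generated YAML equals the reference (ignoring outer whitespace)."""
    return float(generated.strip() == reference.strip())


def edit_distance_score(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; higher means closer to the reference."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()


def _flatten(node, prefix=""):
    """Flatten nested dicts/lists into {dotted.path: leaf_value} pairs."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        return {prefix: node}
    flat = {}
    for key, value in items:
        flat.update(_flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    return flat


def kv_match(generated: str, reference: str) -> float:
    """Fraction of the reference's key-value pairs reproduced exactly by the generation."""
    ref = _flatten(yaml.safe_load(reference))   # assumes both strings parse as YAML
    gen = _flatten(yaml.safe_load(generated))
    if not ref:
        return 0.0
    hits = sum(1 for key, value in ref.items() if gen.get(key) == value)
    return hits / len(ref)
```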
All prompts, generated code, and scores are saved to outputs/ by default. You may use them as checkpoints for further analysis.
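If you want to post-process those checkpoints yourself, something along these lines can serve as a starting point. The file layout under outputs/ is not documented here, so the glob pattern and the JSON assumption are guesses to adjust against what you actually see on disk.

```python
# Hypothetical post-processing of the saved checkpoints. The exact layout of
# outputs/ is not documented here, so adjust the glob pattern (and the JSON
# assumption) to whatever you actually find on disk.
import json
from pathlib import Path

checkpoints = []
for path in Path("outputs").rglob("*.json"):
    with path.open() as f:
        checkpoints.append((str(path), json.load(f)))

print(f"Loaded {len(checkpoints)} checkpoint files from outputs/")
```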
🎉 Congratulations! You have successfully completed the quickstart and benchmarked GPT-3.5 with all metrics except unit tests.
- Configure the OpenAI API key:
export OPENAI_API_KEY='YOUR_API_KEY'
- Include GPT-3.5/4 for benchmarking in config.json:
"models": [ "gpt-3_5", "gpt-4" ]
- Model-specific queries are defined in models/gpt.py. You may customize them if needed; refer to the OpenAI API documentation for more details.
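For reference, a minimal query in the style that models/gpt.py would issue might look like the sketch below, using the official openai Python client (v1+). The repo's actual prompt construction, model names, and parameters may differ.

```python
# Minimal sketch of an OpenAI chat-completion query using the official openai
# client (v1+). models/gpt.py may build prompts and set parameters differently.
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()


def query_gpt(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep generations as deterministic as possible for benchmarking
    )
    return response.choices[0].message.content
```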
- Configure the PaLM API key:
export PALM_API_KEY='YOUR_API_KEY'
- Include PaLM 2 for benchmarking in config.json:
"models": [ "palm-2" ]
- Model-specific queries are defined in models/palm.py. You may customize them if needed; refer to the Google API documentation for more details.
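As a rough guide, a PaLM 2 text query via the google.generativeai client might look like the following sketch; models/palm.py may use a different model name, client version, or parameters.

```python
# Minimal sketch of a PaLM 2 text query via the google.generativeai client;
# models/palm.py may use a different model name, client version, or parameters.
import os

import google.generativeai as palm  # pip install google-generativeai

palm.configure(api_key=os.environ["PALM_API_KEY"])


def query_palm(prompt: str, model: str = "models/text-bison-001") -> str:
    completion = palm.generate_text(model=model, prompt=prompt, temperature=0)
    return completion.result or ""
```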
Replicate is a platform that lets you run open-source models, or your own, at scale in the cloud. It abstracts away the complexity of deploying models yourself and provides a simple API for queries and responses.
- Configure the Replicate API token:
export REPLICATE_API_TOKEN='YOUR_API_TOKEN'
- Currently, a subset of supported open-source models is integrated into the benchmark. You may include them for benchmarking in config.json:
"models": [ "llama-2-70b-chat", "llama-2-13b-chat", "llama-2-7b-chat", "llama-13b-lora", "llama-7b", "wizardcoder-34b-v1.0", "wizardcoder-15b-v1.0", "codellama-13b-instruct", "codellama-7b-instruct" ]
- Model-specific queries are defined in models/replicate.py. You may customize them, integrate more models, or even include your own if needed; refer to the Replicate API documentation for more details.
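For orientation, a minimal Replicate query might look like the sketch below. Model identifiers and expected input fields vary per model on Replicate, and models/replicate.py may structure its calls differently.

```python
# Minimal sketch of a Replicate query. Model identifiers and expected input
# fields vary per model, and models/replicate.py may structure its calls differently.
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the environment


def query_replicate(prompt: str, model: str = "meta/llama-2-70b-chat") -> str:
    # replicate.run streams output chunks for language models; join them into one string.
    output = replicate.run(model, input={"prompt": prompt})
    return "".join(output)
```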
The easiest way to run unit tests is to launch an AWS EC2 instance from the public AMI named cloudyaml_public (AMI ID ami-012306c17129a3b71) in the us-west-1 region (N. California). Note that the full evaluation takes several hours. Alternatively, you can create a pull request and we can help run it on a cluster.
After launching the EC2 instance, pull this repo and enable unit tests in config.json:
"metrics": {
"unit_test": true
}
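Roughly speaking, with unit tests enabled each problem's generated YAML is exercised by its test script against a live environment, which is why this takes hours. The sketch below is only an illustration of what such a check might look like for a Kubernetes Pod problem; it is not the benchmark's actual test code, and the pod name and timeout are placeholders.

```python
# Illustrative only: roughly what a unit test for a Kubernetes Pod problem
# might do. The benchmark ships its own per-problem test scripts; this is not
# their code, and the pod name/timeout here are placeholders.
import subprocess


def pod_becomes_ready(yaml_path: str, pod_name: str, timeout: str = "120s") -> bool:
    subprocess.run(["kubectl", "apply", "-f", yaml_path], check=True)
    result = subprocess.run(
        ["kubectl", "wait", f"pod/{pod_name}", "--for=condition=Ready", f"--timeout={timeout}"]
    )
    return result.returncode == 0
```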
Then run the benchmark:
python benchmark.py
Sample output from the terminal:
------ Problem: Kubernetes_pod_q1 ------
bleu score: 0.69 / 1
edit_distance score: 0.57 / 1
exact_match score: 0.00 / 1
kv_match score: 0.00 / 1
kv_wildcard score: 0.73 / 1
unit_test score: 1.00 / 1
(more problems, categories and datasets...)
====== Overall === Model: gpt-3.5 ======
Dataset: original, simplified, translated (1011 problems)
bleu score: 0.631
edit_distance score: 0.522
exact_match score: 0.071
kv_match score: 0.154
kv_wildcard score: 0.605
unit_test score: 0.418
Please refer to Advanced.md for instructions on running your own model and for other advanced features.
If you find this benchmark useful, please cite our paper:
@article{xu2024cloudeval,
title={CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation},
author={Xu, Yifei and Chen, Yuning and Zhang, Xumiao and Lin, Xianshang and Hu, Pan and Ma, Yunfei and Lu, Songwu and Du, Wan and Mao, Zhuoqing and Zhai, Ennan and others},
journal={Proceedings of Machine Learning and Systems},
volume={6},
pages={173--195},
year={2024}
}