AlphaCodium: LLM driven, test-based, multi-stage code generation #764
Labels
code-generation
code generation models and tools like copilot and aider
Code-Interpreter
OpenAI Code-Interpreter
dataset
public datasets and embeddings
finetuning
Tools for finetuning of LLMs e.g. SFT or RLHF
llm
Large Language Models
New-Label
Choose this option if the existing labels are insufficient to describe the content accurately
Papers
Research papers
prompt-engineering
Developing and optimizing prompts to efficiently use language models for various applications and re
Research
personal research notes for a topic
AlphaCodium
Description
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
Official Implementation by Tal Ridnik, Dedy Kredo, Itamar Friedman at CodiumAI
Abstract
Code generation problems differ from common natural language problems - they require matching the exact syntax of the target language, identifying happy paths and edge cases, paying attention to numerous small details in the problem spec, and addressing other code-specific issues and requirements. Hence, many of the optimizations and tricks that have been successful in natural language generation may not be effective for code tasks.
In this work, we propose a new approach to code generation by LLMs, which we call AlphaCodium - a test-based, multi-stage, code-oriented iterative flow that improves the performance of LLMs on code problems.
We tested AlphaCodium on a challenging code generation dataset called CodeContests, which includes competitive programming problems from platforms such as Codeforces. The proposed flow consistently and significantly improves results.
On the validation set, for example, GPT-4 accuracy (pass@5) increased from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow.
Installation
pip install -r requirements.txt
Duplicate the file alpha_codium/settings/.secrets_template.toml, rename it as .secrets.toml, and fill in your OpenAI API key. Then download the processed CodeContests dataset and place the extracted folder in the root of the project; the run commands below reference this dataset folder.
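As a rough sketch, the .secrets.toml file might look like the lines below; the section and key names are assumptions, so copy the exact structure from .secrets_template.toml:
[openai]
key = "sk-..."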
How to run
Configuration
The file alpha_codium/settings/configuration.toml contains the configuration for the project. In the config section, you can choose the model you want to use ("gpt-4", "gpt-3.5-turbo-16k", or others).
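For illustration only, the relevant part of configuration.toml might look like the snippet below; the exact key names are assumptions, so check the shipped file for the authoritative ones:
[config]
model = "gpt-4"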
Solving a specific problem from CodeContest
To solve a specific problem with AlphaCodium, from the root folder run:
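A typical invocation looks like the following sketch; the alpha_codium.solve_problem module name is an assumption based on the project layout, while the parameters are the ones described below:
python -m alpha_codium.solve_problem --dataset_name /path/to/dataset --split_name valid --problem_number 0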
dataset_name is the path to the dataset folder you downloaded in the installation step.
problem_number is the zero-based index of the problem within the chosen split.
split_name can be either valid or test.
The solve, self_reflection, possible_solutions, generate_ai_tests, initial_code_generation, public_tests, and ai_tests sections of the configuration file let you adjust the different stages of the flow.
Each run logs its results to alpha_codium/example.log. Reviewing the log file is a good way to understand what is going on in each stage of the flow.
Solving an entire CodeContest dataset split
To solve the entire dataset with AlphaCodium, from the root folder run:
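As a sketch (the alpha_codium.solve_dataset module name is an assumption; the parameters are described below):
python -m alpha_codium.solve_dataset --dataset_name /path/to/dataset --split_name valid --database_solution_path /path/to/solutions_dir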
split_name can be either valid or test.
database_solution_path is the path to the directory where the solutions will be saved.
The dataset section in the configuration file contains the configuration for the running and evaluation of a dataset.
dataset.num_iterations defines the number of iterations for each problem (pass@K). For a large number of iterations, it is recommended to introduce some randomness and different options for each iteration to achieve top results.
Running the evaluation
Once you generate a solution for the entire dataset (valid or test), you can evaluate it by running:
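For example (the alpha_codium.evaluate_dataset module name is an assumption; the dataset and solution paths are the same ones used above):
python -m alpha_codium.evaluate_dataset --dataset_name /path/to/dataset --split_name valid --database_solution_path /path/to/solutions_dir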
Solving a new problem (CodeContest format)
To solve a custom problem with AlphaCodium, first create a json file that includes the CodeContest problem fields, and then from the root folder run:
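As a sketch (the alpha_codium.solve_my_problem module name is an assumption; the my_problem_json_file parameter is described below):
python -m alpha_codium.solve_my_problem --my_problem_json_file /path/to/my_problem.json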
my_problem_json_file is the path to the custom problem json file.
See my_problem_example.json for an example of a custom problem. The json file should include the following fields (a sample file is sketched after this list):
name is the name of the problem.
description is a description of the problem.
public_tests, with the following fields: input is a list of strings that represent the input, and output is a list of strings that represent the output.
private_tests, which follows the same structure as public_tests.
generated_tests, which follows the same structure as public_tests.
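A hypothetical my_problem.json with made-up values might look like this; see my_problem_example.json in the repo for the authoritative format:
{
  "name": "square-a-number",
  "description": "Read an integer n and print n squared.",
  "public_tests": {"input": ["2\n"], "output": ["4\n"]},
  "private_tests": {"input": ["3\n"], "output": ["9\n"]},
  "generated_tests": {"input": ["10\n"], "output": ["100\n"]}
}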
Technical Q&A
Aggregating some technical questions we received about this project:
Q: How much time did you spend on "prompt engineering" compared to "flow engineering"?
A: Structured output almost completely eliminates the need for simple prompt engineering. We estimate that ~95% of the time was spent on higher-level design, reasoning, and injecting data at the correct places, ..., a.k.a. "flow engineering".
Q: How do you know that there wasn't a data leakage?
A: The test set of the CodeContests dataset comprises problems published after September 2021, while the GPT-4 model variant we used (gpt-4-0613) has a data cutoff of September 2021. Hence, there is no data leakage for GPT-4 on the test set. For other models like DeepSeek, we cannot be sure. However, note that our main result is a comparison of "direct prompt" vs. "AlphaCodium flow". Data leakage would help both approaches, so the relative improvement of the AlphaCodium flow is still valid.
Q: Is this project relevant only to specific programming languages?
A: No. The proposed flow is language agnostic. We generated solutions in Python, but the flow can be applied to any language.
Q: How did you manage the context window?
A: We used models with a context window of 8192 tokens, and we did not encounter cases where it did not suffice. However, we clearly observed that as the context used in practice grows larger (say, above 4000 tokens), the model starts to "ignore" some of the information in it. Hence, there is a clear tradeoff: injecting the results of previous stages into the context may help the model generate better code, but it may also cause the model to ignore specific details and nuances from the problem description.
Q: Is this work "realistic" in terms of the number of LLM calls?
A: In comparison to AlphaCode, we make four orders of magnitude (!) fewer calls (AlphaCodium makes 15-20 calls per solution). Yet we acknowledge that for some applications this may still be too much, and more optimizations are needed. However, we believe that many of the ideas and principles we acquired in this work are broadly applicable, even when the number of calls is further limited.
Q: Why do you iterate only on the generated code, and not on the AI-generated tests?
A: For code problems in CodeContests, the tests are a list of input-output pairs. Hence, you don't really learn anything new when you "fix" a test - you just change its output to the prediction of the generated code. Instead of fixing tests, we preferred to always try and fix the code, while using "test anchors". (see the paper for more details). However, for other code generation tasks, where the tests are more complex and contain runnable code, iterating on the tests, in addition to iterating on the generated code, may be beneficial.
Broader Applicability
While this work presents results on the CodeContests dataset, we believe that it has broader applicability.
First and foremost, we feel that the proposed AlphaCodium flow, with reasonable adjustments, can be used as a more general framework for other code generation tasks.
Secondly, many of the design concepts, principles, and tricks we acquired in this work are broadly applicable as-is to general code generation tasks. For example:
When asking the model to divide the generated code into small sub-functions, with meaningful names and functionality, we observe better-produced code, with fewer bugs and higher success rates for the iterative fixing stages (see the illustrative sketch below).
This list is partial; see the paper for more details. The code provided in this repo can be used as a reference for better understanding the proposed concepts, and for applying them to other code generation tasks.
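As a hypothetical illustration of the "small sub-functions" principle (this is not code from the repo, just a sketch of the style of solution it encourages), a generated solution might be structured like this:
def read_input():
    # First line: n; second line: n space-separated integers.
    n = int(input())
    return list(map(int, input().split()))[:n]

def max_pair_sum(values):
    # Core logic isolated in a small, independently testable function.
    top_two = sorted(values, reverse=True)[:2]
    return sum(top_two)

def solve():
    values = read_input()
    print(max_pair_sum(values))

if __name__ == "__main__":
    solve()
Keeping input parsing, core logic, and orchestration in separate small functions makes each piece easier for the model (and the iterative fixing stages) to reason about and repair.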
Example Problem
In this section, we present an example of a full problem from the CodeContests dataset (test set, problem 1), in order to demonstrate the complexity of the problems in the dataset and the challenges they pose to LLMs.
Acknowledgments
Our processed CodeContests dataset is based on the original CodeContests dataset. We removed the train set (which is not relevant to our work) and did some post-processing and cleaning of the validation and test sets.
Citation
Suggested labels
{'label-name': 'iterative-flow', 'label-description': 'Describes a multi-stage code-oriented iterative process for improving LLM performances on code problems.', 'confidence': 74.22}