Here is a presentation of AlphaCodium. With AlphaCodium, GPT-4o achieved 54% accuracy on CodeContests, compared to 48% for GPT-4T.
Code generation problems are different from common natural language problems—they require matching the exact syntax of the target language, identifying happy paths and edge cases, paying attention to many fine details in the problem specification, and solving other code-specific problems and requirements. As a result, many optimizations and tricks that have proven successful in natural language generation may not be effective for code-related tasks.
The developers propose a new approach to LLM code generation called AlphaCodium – a test-based, multi-step, code-oriented, iterative flow that improves LLM performance on code problems.
They tested AlphaCodium on a challenging code generation dataset called CodeContests, which includes competitive programming problems from platforms like Codeforces. The proposed flow improves results in a consistent and meaningful way. On the validation set, for example, GPT-4 accuracy (pass@5) increased from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow.
They believe that many of the principles and best practices learned through their work are broadly applicable to general code generation tasks.
Here is the AlphaCodium leaderboard with results for the new GPT models and Claude 3 Opus. GPT-4o is currently the best-performing model with AlphaCodium.
Installing AlphaCodium
1. Configure the virtual environment
```
python3 -m venv venv
source ./venv/bin/activate
```
and run: pip install -r requirements.txt.
2. Duplicate the file alpha_codium/settings/.secrets_template.toml, rename it to alpha_codium/settings/.secrets.toml, and provide your OpenAI API key:
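The renamed file should end up containing something like the following (a sketch only; take the exact section and key names from the template file itself):

```
[openai]
key = "<your OpenAI API key>"
```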
3. Download the CodeContests validation and test datasets from Hugging Face, extract the zip file, and place the extracted folder in the root of the project.
How to run
Configuration
The file alpha_codium/settings/configuration.toml contains the project configuration. In the configuration section, you can choose the model you want to use ("gpt-4", "gpt-3.5-turbo-16k", or others).
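For example, switching models would look roughly like this (a sketch only; the exact section and key names should be taken from the configuration.toml shipped with the repository):

```
[config]
model = "gpt-4"
```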
Solve a specific problem from CodeContest
To solve a specific problem with AlphaCodium, run the following from the root folder:
```
python -m alpha_codium.solve_problem \
  --dataset_name /path/to/dataset \
  --split_name test \
  --problem_number 0
```
- dataset_name is the path to the dataset folder you downloaded during the installation step.
- Note that the validation set contains 117 problems and the test set contains 165 problems, so the problem_number parameter must be set accordingly (zero-based).
- split_name can be either valid or test.
- The following sections of the configuration file (solve, self_reflection, possible_solutions, generate_ai_tests, initial_code_generation, public_tests, ai_tests) allow you to adjust the configuration of the different stages of the flow.
- Each run saves its results in a file called alpha_codium/example.log. Examining the log file is a good way to understand what is happening at each stage of the flow.
Example problem (test set, problem number 12):
Solving a full CodeContests dataset split
To solve the entire dataset with AlphaCodium, run the following from the root folder:
```
python -m alpha_codium.solve_dataset \
  --dataset_name /path/to/dataset \
  --split_name test \
  --database_solution_path /path/to/output/dir/dataset_output.json
```
- split_name can be either valid or test.
- database_solution_path is the path to the directory where the solutions will be saved.
- The dataset section of the configuration file contains the settings for running and evaluating a dataset split.
- Note that this is a time-consuming process, which can take several days with large models (e.g., GPT-4) and multiple iterations per problem.
- dataset.num_iterations defines the number of iterations for each problem (pass@K). For a large number of iterations, it is recommended to introduce some randomness and vary the options between iterations to get the best results.
Running an evaluation
Once you have generated solutions for the entire dataset (valid or test split), you can evaluate them by running:
```
python -m alpha_codium.evaluate_dataset \
  --dataset_name /path/to/dataset \
  --split_name test \
  --database_solution_path /path/to/output/dir/dataset_output.json
```
Solving a new problem (CodeContest format)
To solve a custom problem with AlphaCodium, first create a JSON file that includes the CodeContests problem fields, then run the following from the root folder:
```
python -m alpha_codium.solve_my_problem \
  --my_problem_json_file /path/to/my_problem.json
```
The my_problem_json_file parameter is the path to the JSON file describing the custom problem.
View the file my_problem_example.json to see an example of a custom problem. The JSON file must contain the following fields (a sketch is also shown after this list):
- name is the name of the problem.
- description is a description of the problem.
- (optional) public_tests with the following fields:
- input is a list of strings representing the input.
- output is a list of strings representing the output.
- (optional) private_tests, which follows the same structure as public_tests
- (optional) generated_tests, which follows the same structure as public_tests
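Here is a minimal sketch of such a file, based only on the fields listed above (the problem text and test values are invented placeholders, not taken from the dataset):

```
{
  "name": "sum_to_n",
  "description": "Given an integer n, print the sum 1 + 2 + ... + n.",
  "public_tests": {
    "input": ["3\n", "10\n"],
    "output": ["6\n", "55\n"]
  },
  "private_tests": {
    "input": ["100\n"],
    "output": ["5050\n"]
  }
}
```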
Technical questions and answers
The developers share answers to technical questions about this project:
How much time did you spend on "prompt engineering" versus "flow engineering"?
Using structured output almost completely eliminates the need for simple prompt engineering. We estimate that about 95% of the time was spent on higher-level design, reasoning, and injecting the right data into the right places, that is, on "flow engineering".
How do you know there hasn't been a data leak?
The CodeContests test set includes problems published after September 2021, while the GPT-4 model variant we used (gpt-4-0613) has a training data cutoff of September 2021, so there is no data leakage for GPT-4 on the test set. For other models such as DeepSeek, we cannot be sure. However, note that our main result is a comparison between a direct prompt and the AlphaCodium flow. Data leakage would benefit both approaches, so the relative improvement of the AlphaCodium flow remains valid.
Is this project only about certain programming languages?
No. The proposed flow is language independent. We generated the solutions in Python, but the flow can be applied to any language.
How did you deal with the context window?
We used models with a context window of 8,192 tokens and encountered no cases where it was insufficient. However, we clearly observed that as the context actually used grows (say, above 4,000 tokens), the model starts to "ignore" some of the information it contains. The trade-off is therefore obvious:
- Putting the results of previous steps into context can help the model generate better code.
- However, it can also cause the model to ignore certain details and nuances of the problem description.
Is this work "realistic" in terms of the number of LLM calls?
Compared to AlphaCode, we make four orders of magnitude (!) fewer calls (AlphaCodium makes 15-20 calls per solution). Still, we realize that for some applications even this may be too many, and further optimizations may be needed. We nevertheless believe that many of the ideas and principles learned in this work remain broadly applicable even when the number of calls is more limited.
Why iterate only on the generated code and not on the AI-generated tests?
For code problems in CodeContests, the tests are lists of input-output pairs. So you do not learn anything new by "fixing" a test: you just change its output to match whatever the generated code produces. Instead of fixing the tests, we preferred to always try to fix the code, using "test anchors". However, for other code generation tasks where the tests are more complex and contain executable code, iterating on the tests, in addition to iterating on the generated code, can be useful.
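As a rough illustration of the test-anchor idea, here is a minimal sketch of the iteration loop it implies (the function names iterate_with_test_anchors, fix_code, and run_tests are placeholders, not the actual AlphaCodium API):

```
def iterate_with_test_anchors(code, public_tests, ai_tests, fix_code, run_tests, max_rounds=5):
    """Repair code against AI-generated tests without regressing on trusted tests."""
    # Tests the code already passes (starting with the public tests) become "anchors".
    anchors = [t for t in public_tests if run_tests(code, [t])]

    for test in ai_tests:
        for _ in range(max_rounds):
            if run_tests(code, [test]):
                # A newly passed AI test becomes an anchor only while all
                # previous anchors keep passing.
                anchors.append(test)
                break
            candidate = fix_code(code, failing_test=test)  # ask the LLM for a fix
            # Accept the fix only if every anchor still passes, so a possibly
            # wrong AI-generated test can never break verified behavior.
            if run_tests(candidate, anchors):
                code = candidate
    return code
```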
Wider application
Although AlphaCodium presents results on the CodeContests dataset, the developers believe it has wider applicability.
First, they believe that the proposed AlphaCodium flow, with reasonable adaptations, can be used as a more general framework for other code generation tasks.
Second, many of the concepts, principles, and design tips learned in this work apply broadly, as they are, to general code generation tasks. For example:
- YAML structured output: ask the model to generate output in YAML format that conforms to a given Pydantic class (see the sketch after this list).
- Semantic reasoning via bullet-point analysis: bullet-point analysis encourages a deeper understanding of the problem and forces the model to divide the output into logical semantic sections, leading to improved results.
- LLMs do better when generating modular code: when we asked the model to split the generated code into small sub-functions with meaningful names and functionality, we observed better code, with fewer bugs and higher success rates in the iterative fixing stages.
- Soft decisions with double validation: in the double-validation process, an extra step is added where, given the generated output, the model is asked to generate the same output again and correct it if necessary.
- Leave room for exploration: since the model can be wrong, it is better to avoid irreversible decisions and to leave room for exploring and iterating on the code with different possible solutions.
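To illustrate the YAML structured-output idea above, here is a minimal Python sketch (it assumes pydantic and PyYAML are installed; the class, field names, and the model reply are invented for the example and are not the actual AlphaCodium schema):

```
from typing import List

import yaml
from pydantic import BaseModel

# Hypothetical output schema for this example (not the real AlphaCodium schema).
class ProblemReflection(BaseModel):
    self_reflection: str
    possible_solutions: List[str]

# The prompt would ask the model to answer with a YAML object equivalent to the
# Pydantic class above. llm_response stands in for the model's reply.
llm_response = """\
self_reflection: |
  The task asks for the largest sum of a contiguous subarray.
possible_solutions:
  - Kadane's algorithm in O(n).
  - Brute force over all subarrays (too slow for large n).
"""

# Parse the YAML reply and validate it against the schema.
reflection = ProblemReflection(**yaml.safe_load(llm_response))
print(reflection.possible_solutions[0])
```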
The above list is partial. See the article for more details. The code contained in this repository can be used as a reference to better understand the proposed concepts and apply them to other code generation tasks.
Example problem
Here is an example of a complete problem from the CodeContests dataset (test-set, problem 1), to demonstrate the complexity of the problems in the dataset and the challenges they pose to LLMs.
```
problem name: '1575_B. Building an Amusement Park'
problem description:
Mr. Chanek lives in a city represented as a plane. He wants to build an amusement park
in the shape of a circle of radius r. The circle must touch the origin (point (0, 0)).
There are n bird habitats that can be a photo spot for the tourists in the park.
The i-th bird habitat is at point p_i = (x_i, y_i).

Find the minimum radius r of a park with at least k bird habitats inside.

A point is considered to be inside the park if and only if the distance between p_i and
the center of the park is less than or equal to the radius of the park. Note that the
center and the radius of the park do not need to be integers.

In this problem, it is guaranteed that the given input always has a solution with r ≤ 2 ⋅ 10^5.

Input
The first line contains two integers n and k (1 ≤ n ≤ 10^5, 1 ≤ k ≤ n) the number of bird
habitats in the city and the number of bird habitats required to be inside the park.
The i-th of the next n lines contains two integers x_i and y_i (0 ≤ |x_i|, |y_i| ≤ 10^5)
the position of the i-th bird habitat.

Output
Output a single real number r denoting the minimum radius of a park with at least k bird
habitats inside. It is guaranteed that the given input always has a solution with r ≤ 2 ⋅ 10^5.
Your answer is considered correct if its absolute or relative error does not exceed 10^{-4}.
Formally, let your answer be a, and the jury's answer be b. Your answer is accepted
if and only if |a - b| / max(1, |b|) ≤ 10^{-4}.

Examples

Input
8 4
-3 1
-4 4
1 5
2 2
2 -2
-2 -4
-1 -1
-6 0

Output
3.1622776589

Input
1 1
0 0

Output
0.0000000000

Note
In the first example, Mr. Chanek can put the center of the park at (-3, -1) with
radius √10 ≈ 3.162. It can be proven this is the minimum r.
```
Source: Introducing AlphaCodium
And you?
Do you think AlphaCodium is credible or relevant?
What is your opinion on the subject?
See also: