
Commit 410ee85

feat: swe bench harness (#590)
# Motivation

Adds a SWE Bench harness to the codegen agent.

# Content

- Loads the SWE Bench dataset
- For each entry in the dataset, a Modal instance is created where an agent can run
- The output of each agent is stored and tested on Modal using `swebench`
- Documentation in the README

Contributions from:

- @victorxheng: #521

# Please check the following before marking your PR as ready for review

- [x] I have updated the documentation or added new documentation as needed

---------

Co-authored-by: jemeza-codegen <[email protected]>
1 parent f4a8327 commit 410ee85

File tree

15 files changed: +942 -13 lines


.gitignore

Lines changed: 6 additions & 0 deletions
@@ -65,3 +65,9 @@ graph-sitter-types/typings/**
 coverage.json
 tests/integration/verified_codemods/codemod_data/repo_commits.json
 .benchmarks/*
+
+# SWE Bench results
+results.*.json
+codegen-examples/examples/swebench_agent_run/results/*
+codegen-examples/examples/swebench_agent_run/predictions/*
+codegen-examples/examples/swebench_agent_run/logs/*
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
OPENAI_API_KEY= # Your OpenAI API key
ANTHROPIC_API_KEY= # Your Anthropic API key
LANGSMITH_API_KEY= # Your Langsmith API key
LANGCHAIN_TRACING_V2= # `true` for tracing, `false` for no tracing
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# INSTRUCTIONS

1. Create a `.env` file in the root directory and add your API keys.

1. cd into the `codegen-examples/examples/swebench_agent_run` directory.

1. Create a `.venv` with `uv venv` and activate it with `source .venv/bin/activate`.

1. Install the codegen dependencies with `uv add codegen`.

   - Note: If you'd like to install the dependencies in the global environment, you can use `uv pip install -e ../../../`. This will allow you to test modifications to the codegen codebase. You will need to rerun `uv pip install -e ../../../` each time you make changes to the codebase.

1. Ensure that you have a Modal account and profile set up. If you don't have one, you can create one at https://modal.com/.

1. Activate the appropriate Modal profile with `uv modal profile activate <profile_name>`.

1. Launch the Modal app with `uv run modal deploy --env=<env_name> entry_point.py`.

1. Run the evaluation with `python run_eval.py` and the desired options:

   ```bash
   $ python run_eval.py --help
   Usage: run_eval.py [OPTIONS]

   Options:
     --use-existing-preds  Use existing predictions instead of
                           generating new ones.
     --dataset [princeton-nlp/SWE-bench_Lite|princeton-nlp/SWE-bench|princeton-nlp/SWE-bench-verified]
                           The dataset to use.
     --length INTEGER      The number of examples to process.
     --instance-id TEXT    The instance ID of the example to process.
     --help                Show this message and exit.
   ```
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
from codegen.extensions.swebench.utils import SweBenchExample
from codegen.extensions.swebench.harness import run_agent_on_entry
import modal

image = (
    modal.Image.debian_slim(python_version="3.13")
    .apt_install("git")
    .pip_install("fastapi[standard]")
    .copy_local_dir("../../../", "/root/codegen", ignore=[".venv", "**/.venv", "tests", "**/tests"])
    .run_commands("pip install -e /root/codegen")
)

app = modal.App(name="swebench-agent-run", image=image, secrets=[modal.Secret.from_dotenv()])


@app.function(timeout=5 * 60)
async def run_agent_modal(entry: SweBenchExample):
    """Modal function to process a single example from the SWE-bench dataset."""
    return run_agent_on_entry(entry)
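
Once the app is deployed, the function can also be invoked directly from a local Python shell. A minimal sketch, not part of this commit, assuming `modal deploy entry_point.py` has already run; the instance ID below is a hypothetical placeholder, and the lookup mirrors the one in `run_eval.py` further down:

```python
# Sketch: invoke the deployed Modal function for one SWE-bench example.
# Assumes the "swebench-agent-run" app is already deployed; the instance ID is illustrative only.
import modal

from codegen.extensions.swebench.utils import SWEBenchDataset, get_swe_bench_example

run_agent_modal = modal.Function.lookup("swebench-agent-run", "run_agent_modal")

example = get_swe_bench_example("django__django-11099", dataset=SWEBenchDataset.LITE)
result = run_agent_modal.remote(example)
print(result)
```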
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
[project]
name = "swebench-agent-run"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12, <3.14"
dependencies = []
Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
import asyncio
import json
import traceback
from pathlib import Path
import modal
import click
from datetime import datetime
from codegen.extensions.swebench.utils import SWEBenchDataset, get_swe_bench_example, get_swe_bench_examples
from codegen.extensions.swebench.report import generate_report

PREDS_DNAME = Path(__file__).parent / "predictions"
LOG_DIR = Path(__file__).parent / "logs"

run_agent_modal = modal.Function.lookup("swebench-agent-run", "run_agent_modal")


async def process_batch(examples, batch_size=10):
    """Process a batch of examples concurrently.

    Args:
        examples: List of SweBenchExample objects to process
        batch_size: Number of examples to process concurrently.
            Default is 10, which provides good parallelization
            while staying well within Modal's limits.
    """
    results = []

    # Process examples in batches
    for i in range(0, len(examples), batch_size):
        batch = examples[i : i + batch_size]

        # Create tasks for this batch
        batch_tasks = [run_agent_modal.remote.aio(example) for example in batch]

        # Wait for all tasks in this batch to complete
        print(f"Processing batch {i // batch_size + 1}/{len(examples) // batch_size + 1} (examples {i + 1}-{min(i + batch_size, len(examples))})")

        try:
            batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True)

            # Store results
            for example, result in zip(batch, batch_results):
                error_info = None

                if isinstance(result, Exception):
                    error_type = type(result).__name__
                    error_info = {
                        "error_type": error_type,
                        "error_message": str(result),
                        "traceback": traceback.format_exception(type(result), result, result.__traceback__),
                    }

                    if isinstance(result, modal.exception.Error):
                        error_info["modal_error_code"] = getattr(result, "code", None)
                        error_info["modal_error_details"] = getattr(result, "details", None)

                    print(f"Error processing {example.instance_id}:")
                    print(f"Type: {error_type}")
                    print(f"Message: {str(result)}")
                    print("Traceback:")
                    print("".join(error_info["traceback"]))

                    results.append({"instance_id": example.instance_id, "status": "error", "error_info": error_info})
                else:
                    if result is None:
                        print(f"Warning: Null result for {example.instance_id}")
                        results.append({"instance_id": example.instance_id, "status": "error", "error_info": {"error_type": "NullResult", "error_message": "Process returned None"}})
                    else:
                        results.append(result)

        except Exception as e:
            print("Batch processing error:")
            print(f"Type: {type(e).__name__}")
            print(f"Message: {str(e)}")
            traceback.print_exc()

            # Mark all examples in the batch as failed
            for example in batch:
                results.append(
                    {
                        "instance_id": example.instance_id,
                        "status": "error",
                        "error_info": {"error_type": type(e).__name__, "error_message": str(e), "traceback": traceback.format_exc(), "batch_failure": True},
                    }
                )

    return results


async def run_eval(use_existing_preds, dataset, length, instance_id=None):
    dataset = SWEBenchDataset(dataset)
    if instance_id:
        examples = [get_swe_bench_example(instance_id, dataset=dataset)]
    else:
        examples = get_swe_bench_examples(dataset=dataset, length=length)

    try:
        if not use_existing_preds:
            print(f"Processing {len(examples)} examples...")

            # Create output directory if it doesn't exist
            PREDS_DNAME.mkdir(exist_ok=True)
            results_dir = PREDS_DNAME / "results"
            results_dir.mkdir(exist_ok=True)

            # Create a timestamp for this run
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

            # Process all examples in parallel batches
            results = await process_batch(examples)

            # Save individual results
            for result in results:
                if result and "instance_id" in result:
                    instance_id = result["instance_id"]
                    output_file = results_dir / f"{instance_id}.json"
                    with open(output_file, "w") as f:
                        json.dump(result, f, indent=4)

            # Save summary file
            summary_file = results_dir / f"summary_{timestamp}.json"
            summary = {
                "timestamp": timestamp,
                "total_examples": len(examples),
                "successful": len([r for r in results if r and "status" not in r]),
                "failed": len([r for r in results if r and "status" in r and r["status"] == "error"]),
                "error_types": {},
                "results": results,
            }

            # Collect error statistics
            for result in results:
                if result and "status" in result and result["status"] == "error":
                    error_type = result.get("error_info", {}).get("error_type", "Unknown")
                    summary["error_types"][error_type] = summary["error_types"].get(error_type, 0) + 1

            with open(summary_file, "w") as f:
                json.dump(summary, f, indent=4)

            print("\nProcessing complete!")
            print(f"Results saved to: {results_dir}")
            print(f"Summary saved to: {summary_file}")
            print(f"Successful: {summary['successful']}/{summary['total_examples']}")
            print(f"Failed: {summary['failed']}/{summary['total_examples']}")
            if summary["error_types"]:
                print("\nError type distribution:")
                for error_type, count in summary["error_types"].items():
                    print(f"  {error_type}: {count}")

        # Generate Report on Modal
        generate_report(PREDS_DNAME, LOG_DIR, dataset)
    except Exception:
        print("Fatal error in run_eval:")
        traceback.print_exc()
        raise


@click.command()
@click.option("--use-existing-preds", is_flag=True, help="Use existing predictions instead of generating new ones.")
@click.option("--dataset", help="The dataset to use.", type=click.Choice([dataset.value for dataset in SWEBenchDataset]), default=SWEBenchDataset.LITE.value)
@click.option("--length", help="The number of examples to process.", type=int, default=10)
@click.option("--instance-id", help="The instance ID of the example to process.")
def run_eval_command(use_existing_preds, dataset, length, instance_id):
    asyncio.run(run_eval(use_existing_preds, dataset, length, instance_id))


if __name__ == "__main__":
    run_eval_command()
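
After a run, each example's prediction lands in its own JSON file and the run-level summary sits beside them. A small sketch, assuming the directory layout and summary keys produced by the script above, for inspecting the output:

```python
# Sketch: inspect the per-instance results and the run summary written by run_eval.py.
import json
from pathlib import Path

results_dir = Path(__file__).parent / "predictions" / "results"

# One JSON file per instance; error entries carry a "status" key.
for result_file in sorted(results_dir.glob("*.json")):
    if result_file.name.startswith("summary_"):
        continue
    result = json.loads(result_file.read_text())
    print(result_file.stem, result.get("status", "ok"))

# The latest summary aggregates totals and error-type counts.
summary_file = max(results_dir.glob("summary_*.json"))
summary = json.loads(summary_file.read_text())
print(f"{summary['successful']}/{summary['total_examples']} succeeded, {summary['failed']} failed")
```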

codegen-examples/pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -31,6 +31,9 @@ dev-dependencies = [
     "deptry>=0.22.0",
 ]
 
+[tool.uv.workspace]
+members = ["examples/swebench_agent_run"]
+
 [tool.pre-commit-uv]
 requirements = ["strict-requirements"]

codegen-examples/uv.lock

Lines changed: 11 additions & 1 deletion
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ dependencies = [
     "modal>=0.73.45",
     "slack-sdk",
     "langchain-anthropic>=0.3.7",
+    "lox>=0.12.0",
 ]
 
 license = { text = "Apache-2.0" }
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
## Codegen Harness and Evaluator for the SWE Bench Development Tool

This folder contains a harness and evaluator for the SWE Bench leaderboard, enabling developers to test and evaluate their codegen models against SWE Bench.

It integrates directly into the Codegen agentic framework and can be built on top of.

### Setup

Remember to install all the dependencies for the environment.

### Usage

#### Edit agent.py, your codegen agent

This file contains the main logic for the agent.

The agent taps into tree-sitter via codegen. You can modify it by adding additional tools, extending its capabilities, prompts, and more.

It is invoked in the harness script.

#### Run harness.py to run the agent

This script will gather the correct dataset, run the agent, and save the results.

#### Run report.py to generate a report

This script will generate a report from the results. It loops through all the results and generates a report to evaluate each one. Currently, there is an error in the docker image.

There are currently example predictions in the `predictions/results` folder.
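
To illustrate the flow this README describes, here is a minimal local sketch, assuming only the `codegen.extensions.swebench` helpers already imported by `entry_point.py` and `run_eval.py` in this commit; it bypasses Modal entirely:

```python
# Sketch: run the agent on a single SWE-bench Lite example locally, without Modal.
# Only functions imported elsewhere in this commit are used; output handling is illustrative.
from codegen.extensions.swebench.harness import run_agent_on_entry
from codegen.extensions.swebench.utils import SWEBenchDataset, get_swe_bench_examples

# Fetch one example, mirroring the --length option in run_eval.py.
examples = get_swe_bench_examples(dataset=SWEBenchDataset.LITE, length=1)
prediction = run_agent_on_entry(examples[0])
print(prediction)
```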
