Swe bench treehacks codegen (#552) #2
Conversation
# Motivation

<!-- Why is this change necessary? -->

# Content

<!-- Please include a summary of the change -->

# Testing

<!-- How was the change tested? -->

# Please check the following before marking your PR as ready for review

- [ ] I have added tests for my changes
- [ ] I have updated the documentation or added new documentation as needed

---------

Co-authored-by: Victor Cheng <[email protected]>
Co-authored-by: tomcodgen <[email protected]>
Reviewer's Guide by Sourcery

This pull request introduces a comprehensive framework for evaluating code generation models on the SWE Bench dataset. It includes tools for running agents, evaluating results, and managing codebases, along with example prediction files and documentation.

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:
Hey @Zeeeepa - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider adding a script to automate the process of setting up the environment and running the tests.
- It would be helpful to include a guide on how to interpret the results and identify areas for improvement.
Here's what I looked at during the review
- 🟡 General issues: 4 issues found
- 🟢 Security: all looks good
- 🟡 Testing: 1 issue found
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
```python
cost = 0
winner = None

num_tries = 1
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Consider making the number of retries configurable.
Currently, num_tries is hard-coded to 1, which limits the agent to only one attempt. This value could be parameterized (or read from a configuration) to allow future experiments with multiple tries.
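A minimal sketch of one way to parameterize it, assuming a CLI entry point; the flag name and default are illustrative, not part of this PR:

```python
import argparse

# Hypothetical CLI flag for the retry count; defaults to the current
# hard-coded behavior of a single attempt.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--num-tries",
    type=int,
    default=1,
    help="Number of attempts the agent makes per SWE Bench instance",
)
args = parser.parse_args()
num_tries = args.num_tries
```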
print("press enter...") | ||
input() |
suggestion: Avoid interactive pauses in automated workflows.
The use of input() to pause execution may be problematic in non-interactive or automated environments. Consider parameterizing this behavior or including a flag to disable the prompt.
Suggested implementation:

```python
import os  # Ensure os is imported if not already

if os.getenv("NO_PROMPT", "0") == "0":
    print()
    input("Press enter to continue...")
```
This change uses the environment variable NO_PROMPT. In an automated workflow, set NO_PROMPT to a non-"0" value (or "1") to disable the pause.
Also, check if the file already imports the os module to avoid duplicate imports.
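A caller-side sketch of the same switch, assuming the NO_PROMPT convention above, for a driver script that wants to suppress the pause programmatically:

```python
import os

# Disable the interactive pause for this process and its children,
# e.g. when the harness is launched from a CI driver script.
os.environ["NO_PROMPT"] = "1"
```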
```python
any_need_evals = any("resolved" not in pred for pred in predictions.values())
any_need_evals = True
if any_need_evals:
    run_evals(FULL_DATASET_FNAME, str(log_dir), predictions_jsonl)
```
suggestion: Redundant assignment to 'any_need_evals'.
The code unconditionally sets any_need_evals to True immediately after computing its value from predictions. If this is intentional for debugging or forcing re-evaluation, adding a comment or a conditional flag might improve clarity.
```diff
  any_need_evals = any("resolved" not in pred for pred in predictions.values())
- any_need_evals = True
+ # If you need to force evaluations regardless of prediction state for debugging,
+ # uncomment the following line:
+ # any_need_evals = True
  if any_need_evals:
      run_evals(FULL_DATASET_FNAME, str(log_dir), predictions_jsonl)
```
```diff
@@ -0,0 +1,237 @@
+#!/usr/bin/env python
```
suggestion (testing): Improve test outcome validation
Relying solely on the presence of ">>>>> All Tests Passed" in the log might be brittle. Consider using a more robust method to determine test success, such as parsing test results or checking exit codes. This will make the tests more reliable.
Suggested implementation:

```python
print("\n".join(log_lines))

# Import regular expressions module if not already imported
import re

# Check if any log line indicates that all tests passed.
all_tests_passed = any(">>>>> All Tests Passed" in line for line in log_lines)
# Check if any log line contains failure or error indicators.
has_failures = any(re.search(r"FAILED|ERROR", line) for line in log_lines)
passed = all_tests_passed and not has_failures
```
If further robustness is needed, consider parsing structured test output or capturing and checking the exit code from the test runner. Adjust the regular expressions or parsing logic as necessary to match the exact format of your test logs.
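As a rough illustration of the exit-code approach: if the tests are launched via a subprocess, the return code can be trusted instead of a sentinel string. The command below is a placeholder, not the harness's actual invocation:

```python
import subprocess

# Run the test command and rely on its exit code rather than scanning
# log text for a sentinel string; stdout is still kept for reporting.
result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
log_lines = result.stdout.splitlines()
passed = result.returncode == 0
```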
```diff
@@ -0,0 +1,29 @@
+## Codegen Harness and Evaluator for SWE Bennch Development Tool
```
issue (typo): Typo: "Bennch" should be "Bench"
```diff
- ## Codegen Harness and Evaluator for SWE Bennch Development Tool
+ ## Codegen Harness and Evaluator for SWE Bench Development Tool
```
```python
if to.startswith("b/") and (
    "/test/" in to
    or "/tests/" in to
    or "/testing/" in to
    or "/test_" in to
    or "/tox.ini" in to
):
    is_tests = True
else:
    is_tests = False
```
issue (code-quality): We've found these issues:

- Replace if statement with if expression (`assign-if-exp`)
- Simplify boolean if expression (`boolean-if-exp-identity`)
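Applied to the snippet above, the suggested simplification collapses the if/else into a single boolean assignment:

```python
# Equivalent to the original branch: the condition itself is the value.
is_tests = to.startswith("b/") and (
    "/test/" in to
    or "/tests/" in to
    or "/testing/" in to
    or "/test_" in to
    or "/tox.ini" in to
)
```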
```python
def main_preds():
    dataset = get_dataset()
```
issue (code-quality): We've found these issues:

- Move assignment closer to its usage within a block (`move-assign-in-block`)
- Replace manual loop counter with call to enumerate (`convert-to-enumerate`)
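The loop body itself is not shown in this hunk, so here is a generic before/after for convert-to-enumerate, with hypothetical names standing in for the ones in main_preds():

```python
# Before: manual counter (names are illustrative, not from main_preds).
i = 0
for entry in dataset:
    handle(i, entry)
    i += 1

# After: enumerate supplies the counter.
for i, entry in enumerate(dataset):
    handle(i, entry)
```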
```python
def get_plausible(preds):
    return set(inst for inst, pred in preds.items() if is_plausible(pred))
```
suggestion (code-quality): Replace list(), dict() or set() with comprehension (`collection-builtin-to-comprehension`)
```diff
- return set(inst for inst, pred in preds.items() if is_plausible(pred))
+ return {inst for inst, pred in preds.items() if is_plausible(pred)}
```
```python
for attr in attrs:
    if not pred[attr]:
        return False
return True
```
suggestion (code-quality): We've found these issues:

- Use any() instead of for loop (`use-any`)
- Invert any/all to simplify comparisons (`invert-any-all`)
```diff
- for attr in attrs:
-     if not pred[attr]:
-         return False
- return True
+ return all(pred[attr] for attr in attrs)
```
```python
md_fname = pred_dname / res["dname"] / (inst + ".md")
assert md_fname.exists(), md_fname
new_md_fname = pred_dname / model_name_or_path / (inst + ".md")
```
issue (code-quality): Use f-string instead of string concatenation [×2] (`use-fstring-for-concatenation`)
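Applied to the lines above, the f-string form would presumably read:

```python
md_fname = pred_dname / res["dname"] / f"{inst}.md"
new_md_fname = pred_dname / model_name_or_path / f"{inst}.md"
```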
PR Code Suggestions ✨

Latest suggestions up to f2d4627
Previous suggestions up to commit f2d4627
Review by Korbit AI
Issues found:

- Non-portable Temporary Directory Path
- Print Statements Instead of Logger
- Missing Progress Tracking
- Missing context for SWEBenchEnvSetup class
- Missing function parameter documentation
- Unsafe Repository Name Extraction
- Mixed Logging with Business Logic
- Untyped Dictionary in Model
- Unbounded Memory Growth Risk
- Inefficient Repository Loading
Files scanned

| File Path | Reviewed |
| --- | --- |
| src/codegen/extensions/swe_bench/utils.py | ✅ |
| src/codegen/extensions/swebench/agent.py | ✅ |
| src/codegen/extensions/swe_bench/swe_bench_wrapper.py | ✅ |
| src/codegen/extensions/swebench/utils.py | ✅ |
| src/codegen/extensions/swebench/tests.py | ✅ |
| src/codegen/extensions/swebench/report.py | ✅ |
| src/codegen/extensions/swebench/harness.py | ✅ |
print(f"Cloning or pulling {repo_full_name}...") | ||
remote_operator = RemoteRepoOperator(repo_config=repo_config, bot_commit=False) | ||
print(f"Cloned or pulled {repo_full_name}.") | ||
|
||
# create the project config | ||
projects = [ProjectConfig(repo_operator=remote_operator, base_path=None, subdirectories=None)] | ||
|
||
# parse the codebase | ||
print("Parsing codebase...") | ||
codebase = Codebase(projects=projects) | ||
print("Codebase parsed.") |
Print Statements Instead of Logger 
What is the issue?
Using print statements instead of proper logging framework for operational status updates.
Why this matters
Print statements don't provide timestamp, log level, or structured context that would be valuable for debugging and monitoring in production environments.
Suggested change:

```python
import logging

logger = logging.getLogger(__name__)


def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    repo_name = repo_full_name.split("/")[-1]
    repo_config = RepoConfig(name=repo_name, full_name=repo_full_name, base_dir="/tmp/codegen")
    logger.info(f"Cloning or pulling repository {repo_full_name}")
    remote_operator = RemoteRepoOperator(repo_config=repo_config, bot_commit=False)
    logger.info(f"Successfully cloned/pulled repository {repo_full_name}")
    projects = [ProjectConfig(repo_operator=remote_operator, base_path=None, subdirectories=None)]
    logger.info("Starting codebase parsing")
    codebase = Codebase(projects=projects)
    logger.info("Successfully parsed codebase")
    return codebase
```
print("Parsing codebase...") | ||
codebase = Codebase(projects=projects) | ||
print("Codebase parsed.") |
Missing Progress Tracking 
What is the issue?
The codebase parsing operation lacks progress tracking for large repositories, which can be a time-consuming operation.
Why this matters
Without progress tracking, users have no visibility into the parsing status of large codebases, leading to uncertainty about whether the operation is progressing or stuck.
Suggested change:

Implement a progress callback or logging system to show parsing progress:

```python
logger.info(f"Starting codebase parsing for {repo_full_name}")
codebase = Codebase(projects=projects, progress_callback=report_progress)
logger.info(f"Completed parsing {repo_full_name}")
```
```python
class SWEBenchEnvSetup(BaseModel):
    split: SWEBenchSplit
    environment_setup_commit: str = NO_ENV_SETUP
```
Missing context for SWEBenchEnvSetup class 
What is the issue?
The class lacks documentation explaining its purpose and the significance of the environment_setup_commit default value.
Why this matters
Without understanding the purpose of environment setup and why NO_ENV_SETUP is used as default, developers might misuse this class or change its behavior incorrectly.
Suggested change:

```python
class SWEBenchEnvSetup(BaseModel):
    """Represents environment setup configuration for SWE-bench.

    The environment_setup_commit tracks the commit hash where environment setup
    was performed, defaulting to NO_ENV_SETUP when not applicable.
    """

    split: SWEBenchSplit
    environment_setup_commit: str = NO_ENV_SETUP
```
```python
entry: dict


def construct_codebase(repo_full_name: str) -> PyCodebaseType:
```
Missing function parameter documentation 
What is the issue?
The function signature has type hints but lacks a docstring explaining the purpose and expected format of repo_full_name.
Why this matters
Without clear documentation about the expected format of repo_full_name (e.g., 'owner/repo'), users might pass incorrectly formatted repository names.
Suggested change:

```python
def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    """Clones a repository and constructs a PyCodebase object.

    Args:
        repo_full_name: Repository identifier in 'owner/repo' format.

    Returns:
        A PyCodebase instance for the specified repository.
    """
```
```python
def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    repo_name = repo_full_name.split("/")[-1]
```
Unsafe Repository Name Extraction 
What is the issue?
The code assumes the repository name can be extracted by splitting on '/' and taking the last element, but this assumption might fail for repository names containing forward slashes.
Why this matters
If a repository name contains forward slashes, the extracted name will be incorrect, leading to potential issues in repository management and file operations.
Suggested change:

Use a more robust method to extract the repository name, such as:

```python
from urllib.parse import urlparse


def get_repo_name(repo_full_name: str) -> str:
    # Handle both URLs and organization/repo format
    path = urlparse(repo_full_name).path or repo_full_name
    return path.rstrip('/').split('/')[-1]
```
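Under this sketch, both a full URL and the plain owner/repo form resolve to the bare repository name:

```python
# Both inputs yield "fastapi" with the helper above.
print(get_repo_name("https://github.com/fastapi/fastapi"))  # fastapi
print(get_repo_name("fastapi/fastapi"))  # fastapi
```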
```python
def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    repo_name = repo_full_name.split("/")[-1]
    repo_config = RepoConfig(name=repo_name, full_name=repo_full_name, base_dir="/tmp/codegen")
```
Non-portable Temporary Directory Path 
What is the issue?
Hardcoding the temporary directory path may cause functionality issues on different operating systems or environments where /tmp is not available or writable.
Why this matters
The code may fail on Windows systems or in environments with restricted /tmp access, preventing the core functionality of cloning and managing repositories.
Suggested change:

Use a platform-independent temporary directory:

```python
import os
import tempfile

repo_config = RepoConfig(
    name=repo_name,
    full_name=repo_full_name,
    base_dir=os.path.join(tempfile.gettempdir(), 'codegen'),
)
```
```python
def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    repo_name = repo_full_name.split("/")[-1]
    repo_config = RepoConfig(name=repo_name, full_name=repo_full_name, base_dir="/tmp/codegen")

    # clone or pull the repo
    print(f"Cloning or pulling {repo_full_name}...")
    remote_operator = RemoteRepoOperator(repo_config=repo_config, bot_commit=False)
    print(f"Cloned or pulled {repo_full_name}.")

    # create the project config
    projects = [ProjectConfig(repo_operator=remote_operator, base_path=None, subdirectories=None)]

    # parse the codebase
    print("Parsing codebase...")
    codebase = Codebase(projects=projects)
    print("Codebase parsed.")
```
Mixed Logging with Business Logic 
What is the issue?
The function violates the Single Responsibility Principle by mixing logging/progress reporting with business logic.
Why this matters
Direct print statements make the function harder to test, reuse, and modify logging behavior. It also tightly couples progress reporting with codebase construction.
Suggested change:

Extract logging to a separate logging utility or use a proper logging framework:

```python
from logging import getLogger

logger = getLogger(__name__)


def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    logger.info(f"Cloning or pulling {repo_full_name}...")
    # ... rest of the code
```
```python
class SWEBenchEntry(BaseModel):
    split: SWEBenchSplit
    entry: dict
```
Untyped Dictionary in Model 
What is the issue?
The 'entry' field uses a generic dict type without specifying its structure, making the model too permissive and less type-safe.
Why this matters
Using an untyped dictionary reduces code clarity, makes refactoring risky, and bypasses Pydantic's validation capabilities.
Suggested change:

Create a specific model for the entry content:

```python
class EntryContent(BaseModel):
    # Define specific fields expected in the entry
    field1: str
    field2: int


class SWEBenchEntry(BaseModel):
    split: SWEBenchSplit
    entry: EntryContent
```
```python
)

# Create message history handler
message_history = InMemoryChatMessageHistory(messages=chat_history)
```
Unbounded Memory Growth Risk 
What is the issue?
Using InMemoryChatMessageHistory can lead to memory issues with large chat histories as all messages are stored in RAM.
Why this matters
For long-running sessions or large volumes of messages, memory usage will grow unbounded, potentially causing out-of-memory errors and degraded performance.
Suggested change:

Consider implementing a persistent storage solution or a size-limited message history cache. For example:

```python
class LimitedMemoryHistory(InMemoryChatMessageHistory):
    def __init__(self, max_messages=1000, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.max_messages = max_messages

    def add_message(self, message):
        super().add_message(message)
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]
```
```python
# Initialize codebase
codebase = Codebase.from_repo("fastapi/fastapi")
```
Inefficient Repository Loading 
What is the issue?
Loading an entire repository synchronously without any caching strategy can be slow and resource-intensive.
Why this matters
Each time the script runs, it will download and parse the entire repository, causing unnecessary network I/O and computation overhead.
Suggested change:

Implement repository caching and selective loading:

```python
import hashlib
from pathlib import Path


def get_cached_repo(repo_url, cache_dir=Path("./repo_cache")):
    repo_hash = hashlib.sha256(repo_url.encode()).hexdigest()
    cache_path = cache_dir / repo_hash
    if cache_path.exists():
        return Codebase.from_path(cache_path)
    codebase = Codebase.from_repo(repo_url)
    codebase.save(cache_path)
    return codebase
```
…1133) This PR updates the Slack integration documentation to address feedback from Slack marketplace reviewers and ensure compliance with their requirements.

## Changes Made

### ✅ Privacy Policy Link (Feedback #4)
- Added prominent link to https://www.codegen.com/privacy-policy in the Data Privacy and Security section

### ✅ AI Disclaimer (Feedback #5)
- Added comprehensive "AI Components and Usage" section explaining:
  - AI-powered functionality and capabilities
  - How AI processes data from Slack messages
  - AI limitations and recommendations for code review

### ✅ Pricing Information (Feedback #8)
- Added "Pricing and Plans" section with link to https://www.codegen.com/pricing
- Explains that Slack integration is available across all plan tiers

### ✅ Enhanced Permissions Documentation (Feedback #7)
- Restructured permissions section with detailed explanations
- Added specific scope clarifications:
  - `mpim:read` - For group DM functionality
  - `chat:write.customize` - For custom usernames/avatars when representing different contexts
  - `users:read.email` - For mapping Slack accounts to Codegen accounts for proper authentication
- Explained why each permission is necessary

### ✅ Privacy Enhancements (Feedback #2)
- Clarified that private channel names are anonymized as "Private channel" for non-members
- Enhanced privacy metadata handling explanation

## Slack Marketplace Feedback Addressed

This PR directly addresses the following feedback items from Slack reviewers:
- **#2**: Privacy model compliance - private channel name anonymization
- **#4**: Privacy policy link requirement
- **#5**: AI disclaimer requirement for AI-enabled apps
- **#7**: Scope usage clarification for `chat:write.customize` and `users:read.email`
- **#8**: Pricing information requirement

## Remaining Technical Issues

The following items require code changes (not documentation) and are outside the scope of this PR:
- **#1**: Missing `mpim:read` scope in OAuth URL (technical implementation)
- **#3**: OAuth state parameter uniqueness (technical implementation)
- **#6**: Group DM response issue related to missing `mpim:read` scope (technical implementation)

## Files Changed

- `docs/integrations/slack.mdx` - Updated with all compliance requirements

---

[💻 View my work](https://codegen.sh/agent/trace/35953) • [About Codegen](https://codegen.com)

---------

Co-authored-by: codegen-sh[bot] <131295404+codegen-sh[bot]@users.noreply.github.com>
User description

Motivation

Content

Testing

Please check the following before marking your PR as ready for review
PR Type
enhancement, tests, documentation
Description
- Add SWE Bench harness for agentic codegen evaluation (`harness.py` to run agents on SWE Bench dataset)
- Add evaluation and reporting utilities for SWE Bench (`report.py` to aggregate and analyze results)
- Provide utilities for dataset loading, predictions, and winner selection
- Add example agent and documentation for SWE Bench integration (`agent.py`) using Codegen tools and LangChain
- Add test harness for running and validating predictions (`tests.py` for patch filtering, test running, and docker image checks)
- Add example prediction results for demonstration
- Add `datasets` dependency to project requirements

Changes walkthrough 📝
6 files:

- Add main harness to run agent on SWE Bench
- Add example agent using Codegen and LangChain tools
- Add evaluation and reporting script for SWE Bench results
- Add dataset and prediction utilities for SWE Bench
- Add SWE Bench dataset wrapper for agent integration
- Add utility functions and schemas for SWE Bench wrapper

5 files:

- Add test harness for patch filtering and validation
- Add example predictions in JSONL format
- Add example result for Flask issue in JSON format
- Add example result for Requests issue in JSON format
- Add example result for Pytest issue in JSON format

1 file:

- Add documentation for SWE Bench harness and evaluator

1 file:

- Add datasets dependency for SWE Bench integration

1 file