Swe bench treehacks codegen (#552) #2
Conversation
# Motivation

<!-- Why is this change necessary? -->

# Content

<!-- Please include a summary of the change -->

# Testing

<!-- How was the change tested? -->

# Please check the following before marking your PR as ready for review

- [ ] I have added tests for my changes
- [ ] I have updated the documentation or added new documentation as needed

---------

Co-authored-by: Victor Cheng <[email protected]>
Co-authored-by: tomcodgen <[email protected]>
Reviewer's Guide by Sourcery

This pull request introduces a comprehensive framework for evaluating code generation models on the SWE Bench dataset. It includes tools for running agents, evaluating results, and managing codebases, along with example prediction files and documentation.

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:
Hey @Zeeeepa - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider adding a script to automate the process of setting up the environment and running the tests.
- It would be helpful to include a guide on how to interpret the results and identify areas for improvement.
Here's what I looked at during the review
- 🟡 General issues: 4 issues found
- 🟢 Security: all looks good
- 🟡 Testing: 1 issue found
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
```python
cost = 0
winner = None

num_tries = 1
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Consider making the number of retries configurable.
Currently, num_tries is hard-coded to 1, which limits the agent to only one attempt. This value could be parameterized (or read from a configuration) to allow future experiments with multiple tries.
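A minimal sketch of one way to parameterize it, assuming a CLI entry point; the flag name and default are illustrative, not part of this PR:

```python
import argparse

# Hypothetical CLI flag for the retry count; defaults to the current
# hard-coded behavior of a single attempt.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--num-tries",
    type=int,
    default=1,
    help="Number of attempts the agent makes per SWE Bench instance",
)
args = parser.parse_args()
num_tries = args.num_tries
```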
print("press enter...") | ||
input() |
suggestion: Avoid interactive pauses in automated workflows.
The use of input() to pause execution may be problematic in non-interactive or automated environments. Consider parameterizing this behavior or including a flag to disable the prompt.
Suggested implementation:

```python
import os  # Ensure os is imported if not already

if os.getenv("NO_PROMPT", "0") == "0":
    print()
    input("Press enter to continue...")
```
This change uses the environment variable NO_PROMPT. In an automated workflow, set NO_PROMPT to a non-"0" value (or "1") to disable the pause.
Also, check if the file already imports the os module to avoid duplicate imports.
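A caller-side sketch of the same switch, assuming the NO_PROMPT convention above, for a driver script that wants to suppress the pause programmatically:

```python
import os

# Disable the interactive pause for this process and its children,
# e.g. when the harness is launched from a CI driver script.
os.environ["NO_PROMPT"] = "1"
```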
```python
any_need_evals = any("resolved" not in pred for pred in predictions.values())
any_need_evals = True
if any_need_evals:
    run_evals(FULL_DATASET_FNAME, str(log_dir), predictions_jsonl)
```
suggestion: Redundant assignment to 'any_need_evals'.
The code unconditionally sets any_need_evals to True immediately after computing its value from predictions. If this is intentional for debugging or forcing re-evaluation, adding a comment or a conditional flag might improve clarity.
```diff
  any_need_evals = any("resolved" not in pred for pred in predictions.values())
- any_need_evals = True
+ # If you need to force evaluations regardless of prediction state for debugging,
+ # uncomment the following line:
+ # any_need_evals = True
  if any_need_evals:
      run_evals(FULL_DATASET_FNAME, str(log_dir), predictions_jsonl)
```
```diff
@@ -0,0 +1,237 @@
+#!/usr/bin/env python
```
suggestion (testing): Improve test outcome validation
Relying solely on the presence of ">>>>> All Tests Passed" in the log might be brittle. Consider using a more robust method to determine test success, such as parsing test results or checking exit codes. This will make the tests more reliable.
Suggested implementation:

```python
print("\n".join(log_lines))

# Import regular expressions module if not already imported
import re

# Check if any log line indicates that all tests passed.
all_tests_passed = any(">>>>> All Tests Passed" in line for line in log_lines)
# Check if any log line contains failure or error indicators.
has_failures = any(re.search(r"FAILED|ERROR", line) for line in log_lines)
passed = all_tests_passed and not has_failures
```
If further robustness is needed, consider parsing structured test output or capturing and checking the exit code from the test runner. Adjust the regular expressions or parsing logic as necessary to match the exact format of your test logs.
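As a rough illustration of the exit-code approach: if the tests are launched via a subprocess, the return code can be trusted instead of a sentinel string. The command below is a placeholder, not the harness's actual invocation:

```python
import subprocess

# Run the test command and rely on its exit code rather than scanning
# log text for a sentinel string; stdout is still kept for reporting.
result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
log_lines = result.stdout.splitlines()
passed = result.returncode == 0
```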
```diff
@@ -0,0 +1,29 @@
+## Codegen Harness and Evaluator for SWE Bennch Development Tool
```
issue (typo): Typo: "Bennch" should be "Bench"
```diff
- ## Codegen Harness and Evaluator for SWE Bennch Development Tool
+ ## Codegen Harness and Evaluator for SWE Bench Development Tool
```
```python
if to.startswith("b/") and (
    "/test/" in to
    or "/tests/" in to
    or "/testing/" in to
    or "/test_" in to
    or "/tox.ini" in to
):
    is_tests = True
else:
    is_tests = False
```
issue (code-quality): We've found these issues:

- Replace if statement with if expression (`assign-if-exp`)
- Simplify boolean if expression (`boolean-if-exp-identity`)
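Applied to the snippet above, the suggested simplification collapses the if/else into a single boolean assignment:

```python
# Equivalent to the original branch: the condition itself is the value.
is_tests = to.startswith("b/") and (
    "/test/" in to
    or "/tests/" in to
    or "/testing/" in to
    or "/test_" in to
    or "/tox.ini" in to
)
```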
```python
def main_preds():
    dataset = get_dataset()
```
issue (code-quality): We've found these issues:

- Move assignment closer to its usage within a block (`move-assign-in-block`)
- Replace manual loop counter with call to enumerate (`convert-to-enumerate`)
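The loop body itself is not shown in this hunk, so here is a generic before/after for convert-to-enumerate, with hypothetical names standing in for the ones in main_preds():

```python
# Before: manual counter (names are illustrative, not from main_preds).
i = 0
for entry in dataset:
    handle(i, entry)
    i += 1

# After: enumerate supplies the counter.
for i, entry in enumerate(dataset):
    handle(i, entry)
```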
```python
def get_plausible(preds):
    return set(inst for inst, pred in preds.items() if is_plausible(pred))
```
suggestion (code-quality): Replace list(), dict() or set() with comprehension (`collection-builtin-to-comprehension`)
```diff
- return set(inst for inst, pred in preds.items() if is_plausible(pred))
+ return {inst for inst, pred in preds.items() if is_plausible(pred)}
```
```python
for attr in attrs:
    if not pred[attr]:
        return False
return True
```
suggestion (code-quality): We've found these issues:

- Use any() instead of for loop (`use-any`)
- Invert any/all to simplify comparisons (`invert-any-all`)
```diff
- for attr in attrs:
-     if not pred[attr]:
-         return False
- return True
+ return all(pred[attr] for attr in attrs)
```
```python
md_fname = pred_dname / res["dname"] / (inst + ".md")
assert md_fname.exists(), md_fname
new_md_fname = pred_dname / model_name_or_path / (inst + ".md")
```
issue (code-quality): Use f-string instead of string concatenation [×2] (`use-fstring-for-concatenation`)
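Applied to the lines above, the f-string form would presumably read:

```python
md_fname = pred_dname / res["dname"] / f"{inst}.md"
new_md_fname = pred_dname / model_name_or_path / f"{inst}.md"
```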
PR Code Suggestions ✨

Latest suggestions up to f2d4627
Previous suggestions up to commit f2d4627
Review by Korbit AI
Issues found:

- Non-portable Temporary Directory Path
- Print Statements Instead of Logger
- Missing Progress Tracking
- Missing context for SWEBenchEnvSetup class
- Missing function parameter documentation
- Unsafe Repository Name Extraction
- Mixed Logging with Business Logic
- Untyped Dictionary in Model
- Unbounded Memory Growth Risk
- Inefficient Repository Loading
Files scanned

| File Path | Reviewed |
| --- | --- |
| src/codegen/extensions/swe_bench/utils.py | ✅ |
| src/codegen/extensions/swebench/agent.py | ✅ |
| src/codegen/extensions/swe_bench/swe_bench_wrapper.py | ✅ |
| src/codegen/extensions/swebench/utils.py | ✅ |
| src/codegen/extensions/swebench/tests.py | ✅ |
| src/codegen/extensions/swebench/report.py | ✅ |
| src/codegen/extensions/swebench/harness.py | ✅ |
print(f"Cloning or pulling {repo_full_name}...") | ||
remote_operator = RemoteRepoOperator(repo_config=repo_config, bot_commit=False) | ||
print(f"Cloned or pulled {repo_full_name}.") | ||
|
||
# create the project config | ||
projects = [ProjectConfig(repo_operator=remote_operator, base_path=None, subdirectories=None)] | ||
|
||
# parse the codebase | ||
print("Parsing codebase...") | ||
codebase = Codebase(projects=projects) | ||
print("Codebase parsed.") |
Print Statements Instead of Logger 
What is the issue?
Using print statements instead of proper logging framework for operational status updates.
Why this matters
Print statements don't provide timestamp, log level, or structured context that would be valuable for debugging and monitoring in production environments.
Suggested change:

```python
import logging

logger = logging.getLogger(__name__)


def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    repo_name = repo_full_name.split("/")[-1]
    repo_config = RepoConfig(name=repo_name, full_name=repo_full_name, base_dir="/tmp/codegen")
    logger.info(f"Cloning or pulling repository {repo_full_name}")
    remote_operator = RemoteRepoOperator(repo_config=repo_config, bot_commit=False)
    logger.info(f"Successfully cloned/pulled repository {repo_full_name}")
    projects = [ProjectConfig(repo_operator=remote_operator, base_path=None, subdirectories=None)]
    logger.info("Starting codebase parsing")
    codebase = Codebase(projects=projects)
    logger.info("Successfully parsed codebase")
    return codebase
```
print("Parsing codebase...") | ||
codebase = Codebase(projects=projects) | ||
print("Codebase parsed.") |
Missing Progress Tracking 
What is the issue?
The codebase parsing operation lacks progress tracking for large repositories, which can be a time-consuming operation.
Why this matters
Without progress tracking, users have no visibility into the parsing status of large codebases, leading to uncertainty about whether the operation is progressing or stuck.
Suggested change:

Implement a progress callback or logging system to show parsing progress:

```python
logger.info(f"Starting codebase parsing for {repo_full_name}")
codebase = Codebase(projects=projects, progress_callback=report_progress)
logger.info(f"Completed parsing {repo_full_name}")
```
```python
class SWEBenchEnvSetup(BaseModel):
    split: SWEBenchSplit
    environment_setup_commit: str = NO_ENV_SETUP
```
Missing context for SWEBenchEnvSetup class 
What is the issue?
The class lacks documentation explaining its purpose and the significance of the environment_setup_commit default value.
Why this matters
Without understanding the purpose of environment setup and why NO_ENV_SETUP is used as default, developers might misuse this class or change its behavior incorrectly.
Suggested change:

```python
class SWEBenchEnvSetup(BaseModel):
    """Represents environment setup configuration for SWE-bench.

    The environment_setup_commit tracks the commit hash where environment setup
    was performed, defaulting to NO_ENV_SETUP when not applicable.
    """

    split: SWEBenchSplit
    environment_setup_commit: str = NO_ENV_SETUP
```
```python
entry: dict


def construct_codebase(repo_full_name: str) -> PyCodebaseType:
```
Missing function parameter documentation 
What is the issue?
The function signature has type hints but lacks a docstring explaining the purpose and expected format of repo_full_name.
Why this matters
Without clear documentation about the expected format of repo_full_name (e.g., 'owner/repo'), users might pass incorrectly formatted repository names.
Suggested change:

```python
def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    """Clones a repository and constructs a PyCodebase object.

    Args:
        repo_full_name: Repository identifier in 'owner/repo' format.

    Returns:
        A PyCodebase instance for the specified repository.
    """
```
```python
def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    repo_name = repo_full_name.split("/")[-1]
```
Unsafe Repository Name Extraction 
What is the issue?
The code assumes the repository name can be extracted by splitting on '/' and taking the last element, but this assumption might fail for repository names containing forward slashes.
Why this matters
If a repository name contains forward slashes, the extracted name will be incorrect, leading to potential issues in repository management and file operations.
Suggested change:

Use a more robust method to extract the repository name, such as:

```python
from urllib.parse import urlparse


def get_repo_name(repo_full_name: str) -> str:
    # Handle both URLs and organization/repo format
    path = urlparse(repo_full_name).path or repo_full_name
    return path.rstrip('/').split('/')[-1]
```
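Under this sketch, both a full URL and the plain owner/repo form resolve to the bare repository name:

```python
# Both inputs yield "fastapi" with the helper above.
print(get_repo_name("https://github.com/fastapi/fastapi"))  # fastapi
print(get_repo_name("fastapi/fastapi"))  # fastapi
```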
```python
def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    repo_name = repo_full_name.split("/")[-1]
    repo_config = RepoConfig(name=repo_name, full_name=repo_full_name, base_dir="/tmp/codegen")
```
Non-portable Temporary Directory Path 
What is the issue?
Hardcoding the temporary directory path may cause functionality issues on different operating systems or environments where /tmp is not available or writable.
Why this matters
The code may fail on Windows systems or in environments with restricted /tmp access, preventing the core functionality of cloning and managing repositories.
Suggested change:

Use a platform-independent temporary directory:

```python
import os
import tempfile

repo_config = RepoConfig(
    name=repo_name,
    full_name=repo_full_name,
    base_dir=os.path.join(tempfile.gettempdir(), 'codegen'),
)
```
```python
def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    repo_name = repo_full_name.split("/")[-1]
    repo_config = RepoConfig(name=repo_name, full_name=repo_full_name, base_dir="/tmp/codegen")

    # clone or pull the repo
    print(f"Cloning or pulling {repo_full_name}...")
    remote_operator = RemoteRepoOperator(repo_config=repo_config, bot_commit=False)
    print(f"Cloned or pulled {repo_full_name}.")

    # create the project config
    projects = [ProjectConfig(repo_operator=remote_operator, base_path=None, subdirectories=None)]

    # parse the codebase
    print("Parsing codebase...")
    codebase = Codebase(projects=projects)
    print("Codebase parsed.")
```
Mixed Logging with Business Logic 
What is the issue?
The function violates the Single Responsibility Principle by mixing logging/progress reporting with business logic.
Why this matters
Direct print statements make the function harder to test, reuse, and modify logging behavior. It also tightly couples progress reporting with codebase construction.
Suggested change:

Extract logging to a separate logging utility or use a proper logging framework:

```python
from logging import getLogger

logger = getLogger(__name__)


def construct_codebase(repo_full_name: str) -> PyCodebaseType:
    logger.info(f"Cloning or pulling {repo_full_name}...")
    # ... rest of the code
```
```python
class SWEBenchEntry(BaseModel):
    split: SWEBenchSplit
    entry: dict
```
Untyped Dictionary in Model 
What is the issue?
The 'entry' field uses a generic dict type without specifying its structure, making the model too permissive and less type-safe.
Why this matters
Using an untyped dictionary reduces code clarity, makes refactoring risky, and bypasses Pydantic's validation capabilities.
Suggested change:

Create a specific model for the entry content:

```python
class EntryContent(BaseModel):
    # Define specific fields expected in the entry
    field1: str
    field2: int


class SWEBenchEntry(BaseModel):
    split: SWEBenchSplit
    entry: EntryContent
```
```python
)

# Create message history handler
message_history = InMemoryChatMessageHistory(messages=chat_history)
```
Unbounded Memory Growth Risk 
What is the issue?
Using InMemoryChatMessageHistory can lead to memory issues with large chat histories as all messages are stored in RAM.
Why this matters
For long-running sessions or large volumes of messages, memory usage will grow unbounded, potentially causing out-of-memory errors and degraded performance.
Suggested change:

Consider implementing a persistent storage solution or a size-limited message history cache. For example:

```python
class LimitedMemoryHistory(InMemoryChatMessageHistory):
    def __init__(self, max_messages=1000, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.max_messages = max_messages

    def add_message(self, message):
        super().add_message(message)
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]
```
```python
# Initialize codebase
codebase = Codebase.from_repo("fastapi/fastapi")
```
Inefficient Repository Loading 
What is the issue?
Loading an entire repository synchronously without any caching strategy can be slow and resource-intensive.
Why this matters
Each time the script runs, it will download and parse the entire repository, causing unnecessary network I/O and computation overhead.
Suggested change:

Implement repository caching and selective loading:

```python
import hashlib
from pathlib import Path


def get_cached_repo(repo_url, cache_dir=Path("./repo_cache")):
    repo_hash = hashlib.sha256(repo_url.encode()).hexdigest()
    cache_path = cache_dir / repo_hash
    if cache_path.exists():
        return Codebase.from_path(cache_path)
    codebase = Codebase.from_repo(repo_url)
    codebase.save(cache_path)
    return codebase
```
…1133) This PR updates the Slack integration documentation to address feedback from Slack marketplace reviewers and ensure compliance with their requirements.

## Changes Made

### ✅ Privacy Policy Link (Feedback #4)
- Added prominent link to https://www.codegen.com/privacy-policy in the Data Privacy and Security section

### ✅ AI Disclaimer (Feedback #5)
- Added comprehensive "AI Components and Usage" section explaining:
  - AI-powered functionality and capabilities
  - How AI processes data from Slack messages
  - AI limitations and recommendations for code review

### ✅ Pricing Information (Feedback #8)
- Added "Pricing and Plans" section with link to https://www.codegen.com/pricing
- Explains that Slack integration is available across all plan tiers

### ✅ Enhanced Permissions Documentation (Feedback #7)
- Restructured permissions section with detailed explanations
- Added specific scope clarifications:
  - `mpim:read` - For group DM functionality
  - `chat:write.customize` - For custom usernames/avatars when representing different contexts
  - `users:read.email` - For mapping Slack accounts to Codegen accounts for proper authentication
- Explained why each permission is necessary

### ✅ Privacy Enhancements (Feedback #2)
- Clarified that private channel names are anonymized as "Private channel" for non-members
- Enhanced privacy metadata handling explanation

## Slack Marketplace Feedback Addressed

This PR directly addresses the following feedback items from Slack reviewers:
- **#2**: Privacy model compliance - private channel name anonymization
- **#4**: Privacy policy link requirement
- **#5**: AI disclaimer requirement for AI-enabled apps
- **#7**: Scope usage clarification for `chat:write.customize` and `users:read.email`
- **#8**: Pricing information requirement

## Remaining Technical Issues

The following items require code changes (not documentation) and are outside the scope of this PR:
- **#1**: Missing `mpim:read` scope in OAuth URL (technical implementation)
- **#3**: OAuth state parameter uniqueness (technical implementation)
- **#6**: Group DM response issue related to missing `mpim:read` scope (technical implementation)

## Files Changed

- `docs/integrations/slack.mdx` - Updated with all compliance requirements

---

[💻 View my work](https://codegen.sh/agent/trace/35953) • [About Codegen](https://codegen.com)

---------

Co-authored-by: codegen-sh[bot] <131295404+codegen-sh[bot]@users.noreply.github.com>
User description

Motivation

Content

Testing

Please check the following before marking your PR as ready for review
PR Type
enhancement, tests, documentation
Description
- Add SWE Bench harness for agentic codegen evaluation (`harness.py` to run agents on SWE Bench dataset)
- Add evaluation and reporting utilities for SWE Bench (`report.py` to aggregate and analyze results)
- Provide utilities for dataset loading, predictions, and winner selection
- Add example agent and documentation for SWE Bench integration (`agent.py`) using Codegen tools and LangChain
- Add test harness for running and validating predictions (`tests.py` for patch filtering, test running, and docker image checks)
- Add example prediction results for demonstration
- Add `datasets` dependency to project requirements

Changes walkthrough 📝
6 files:

- Add main harness to run agent on SWE Bench
- Add example agent using Codegen and LangChain tools
- Add evaluation and reporting script for SWE Bench results
- Add dataset and prediction utilities for SWE Bench
- Add SWE Bench dataset wrapper for agent integration
- Add utility functions and schemas for SWE Bench wrapper

5 files:

- Add test harness for patch filtering and validation
- Add example predictions in JSONL format
- Add example result for Flask issue in JSON format
- Add example result for Requests issue in JSON format
- Add example result for Pytest issue in JSON format

1 file:

- Add documentation for SWE Bench harness and evaluator

1 file:

- Add datasets dependency for SWE Bench integration

1 file