Code and data for the ICSE 2025 paper "HumanEvo: An Evolving-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation"
In this paper, we identify two common flaws in existing evaluation approaches for repository-level code generation.
To provide LLMs with a more realistic evaluation scenario, we construct HumanEvo. The following is the construction pipeline of HumanEvo.
- Clone this repository.
- Run `conda env create -f environment.yml` to create a conda environment named `HumanEvo`.
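For example, a minimal sketch of the setup, assuming `conda` is on your `PATH` and the environment name declared in `environment.yml` is `HumanEvo`:

```bash
# Create the conda environment defined in environment.yml
conda env create -f environment.yml
# Activate it by the name declared in environment.yml
conda activate HumanEvo
```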
- HumanEvo_construct
  - collect
    - make_repo
      - call_make_repo.py : call make_repo.sh to make a mirror repo for a high-quality GitHub repository
      - make_repo.sh
    - build_dataset.py : build the initial dataset for validation
    - get_top_pypi.py : get high-quality Python repositories
    - print_pulls.py : crawl pull requests from the target repository
    - run_build_dataset.sh
    - utils.py
  - get_version
    - extract_web
    - get_version_java.py : get the version for each pull request (PR)
    - get_version_python.py : get the version for each PR
  - validation
    - constans.py
    - context_manager.py
    - engine_validation.py
    - run_validation.sh
Take the construction pipeline of HumanEvo-Python as an example (a hedged end-to-end sketch follows this list):
- Run `get_top_pypi.py` to collect high-quality repositories.
- Run `call_make_repo.py` to make mirror repos.
- Run `print_pulls.py` to crawl pull requests from the target repository.
- Run `run_build_dataset.sh` to handle the initial PRs.
- Run `get_version_python.py` to get a version number for each crawled PR.
- Run `run_validation.sh` to invoke `engine_validation.py` and validate the crawled PRs.
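The sketch below chains these steps as a shell script. The directory layout follows the tree above, but the scripts' exact command-line arguments (repository names, output paths) are omitted here and are assumptions for illustration; consult `run_build_dataset.sh` and `run_validation.sh` for the invocations actually used in this repository.

```bash
#!/usr/bin/env bash
# Hedged sketch of the HumanEvo-Python construction pipeline.
# Arguments are intentionally omitted; check each script's interface
# (or the provided shell wrappers) before running.
set -e

cd HumanEvo_construct/collect
python get_top_pypi.py               # 1. collect high-quality PyPI repositories
python make_repo/call_make_repo.py   # 2. make mirror repos via make_repo.sh
python print_pulls.py                # 3. crawl pull requests from the target repository
bash run_build_dataset.sh            # 4. build the initial dataset from the crawled PRs

cd ../get_version
python get_version_python.py         # 5. resolve a version number for each crawled PR

cd ../validation
bash run_validation.sh               # 6. invoke engine_validation.py to validate the PRs
```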
After doing all this, we can make sure that each PR we get is of high quality and is covered by the project's test framework.
The command to run the evaluation is in `eval/run.sh`: `cd eval` and run `bash run.sh`. Please remember to fill in all the needed paths before you run the source code.
Here is an example:
```bash
python run_eval.py \
    --instances_path "../HumanEvo/HumanEvo_Python.json" \
    --log_dir "./log" \
    --num_workers 1 \
    --path_conda "path/to/your/conda" \
    --testbed "./testbed" \
    --language "python" \
    --timeout 1800 \
    --verbose
```
Before the evaluation itself starts, it may take a while to clone all the target repositories and create a runtime environment for every single task instance; both steps may not complete in one round, so please be patient.
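If setup is interrupted, the natural recovery is simply to re-run the same command. A minimal sketch, under the assumption (not verified against `run_eval.py`) that existing clones under `./testbed` and already-created environments are detected and reused:

```bash
# Re-run the evaluation after an interrupted setup round.
# Assumption: run_eval.py reuses existing clones in ./testbed and
# previously created environments rather than recreating them.
cd eval
bash run.sh
```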