
Commit 18cb0c7: Update README.md (1 parent: 2ba0357)

1 file changed: +1 -158 lines

README.md

Lines changed: 1 addition & 158 deletions
@@ -1,158 +1 @@
# spec2repo

We set up a task where, given a specification, the goal is to produce an implementation of that specification.
Specifically, we are interested in converting library specifications into implementations (i.e., repositories).
Below we lay out the steps to create a spec2repo example and evaluate it using the SWE-bench framework.

First, install the required packages:
```
pip install -r requirements.txt
```

Please provide the following information for the list of repositories in a YAML file:
```
repos.yml
0: # just an index
  name: [repository_name] # in the form of {organization_name}/{library_name}
  commit: [commit_sha]
  tag: [version_tag]
  setup:
    - [command_1]
    - [command_2]
    - ...
```
There are two options to specify the version of the library:
you can either provide a specific commit or a specific tag, but not both at the same time.
Finally, include the commands that set up the library from a local repository.
For example, to create an example for ``msiemens/tinydb`` at version 4.8:
```
repos.yml
0:
  name: "msiemens/tinydb"
  commit: null
  tag: "v4.8.0"
  setup:
    - "python -m pip install --upgrade pip twine"
    - "pip install poetry"
    - "poetry install"
```
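The commit/tag exclusivity rule can be sketched as a small check applied to each loaded entry. This is a hypothetical helper for illustration (``validate_repo_entry`` is an assumed name, not part of this repo):

```python
def validate_repo_entry(entry: dict) -> None:
    # Exactly one of 'commit' / 'tag' must be set (hypothetical helper).
    has_commit = entry.get("commit") is not None
    has_tag = entry.get("tag") is not None
    if has_commit == has_tag:
        raise ValueError("Specify exactly one of 'commit' or 'tag'.")

entry = {
    "name": "msiemens/tinydb",
    "commit": None,
    "tag": "v4.8.0",
    "setup": ["python -m pip install --upgrade pip twine",
              "pip install poetry", "poetry install"],
}
validate_repo_entry(entry)  # passes: only 'tag' is set
```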

We are now ready to generate the dataset. Before that, add your GitHub token to the environment:
```
export GITHUB_TOKEN=[github_token]
```
Now run:
```
python create-data/build_dataset.py repos.yml --hf_name wentingzhao/spec2repo
```
where ``repos.yml`` is the file we specified above, and ``wentingzhao/spec2repo`` is where you want to upload the dataset on Hugging Face.
This command produces the base commit (with function bodies removed), the gold patch that passes all unit tests, and all test function names.
Note that this script will create a fork of the library under the ``spec2repo`` organization.
You can change the organization, but if you want to create a fork under ``spec2repo``, please contact Wenting Zhao to be added to the organization.
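To illustrate what "function bodies removed" means, here is a minimal sketch of how bodies could be blanked out with the standard-library ``ast`` module. This is an assumption for illustration only; the actual logic lives in ``create-data/build_dataset.py``:

```python
import ast

def strip_function_bodies(source: str) -> str:
    # Replace every function body with `pass`, keeping the signatures intact.
    # (Sketch only, not the actual build_dataset.py implementation.)
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Pass()]
    return ast.unparse(tree)  # requires Python 3.9+

src = "def add(a, b):\n    return a + b\n"
print(strip_function_bodies(src))  # body replaced by `pass`
```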

Now that the dataset has been generated, we move on to using SWE-bench to perform an evaluation.
First, follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
If you are setting up on Linux, we recommend the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.

To install SWE-bench:
```bash
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

Now, let's add a configuration file to build a Docker environment for the library in a YAML file:
```
configs/specs.yml
spec2repo/tinydb:
  "1.0":
    python: 3.11
    install: "python -m pip install --upgrade pip twine; pip install poetry; poetry install"
    test_cmd: "pytest"
```
To adapt this for your own library, leave the ``1.0`` unchanged, specify the Python version with ``python``, how to locally build the library with ``install``, and how to run tests with ``test_cmd``.

You also need to write your own function to process the test logs. Please add your function in ``configs/log_parsers.py``. The function should take in the log text and return a dictionary that maps each test function to its test status, such as passed or failed. After that, update the global variable ``ADD_MAP_REPO_TO_PARSER``.
```
configs/log_parsers.py
import re

# TestStatus is SWE-bench's test-status enum (PASSED/FAILED values)
def parse_log_tinydb(log: str) -> dict[str, str]:
    """
    Parser for test logs generated with the TinyDB framework

    Args:
        log (str): log content
    Returns:
        dict: test case to test status mapping
    """
    test_status_map = {}
    pattern = r"^(.*\/.*)::(.*)\s+\w+\s+\[\s*(\d+%)\]$"
    for line in log.split("\n"):
        line = line.strip()
        m = re.match(pattern, line)
        if m:
            line = line.split()
            test, value = line[:2]
            if value == "PASSED":
                test_status_map[test] = TestStatus.PASSED.value
            else:
                test_status_map[test] = TestStatus.FAILED.value
    return test_status_map

ADD_MAP_REPO_TO_PARSER = {
    "spec2repo/tinydb": parse_log_tinydb,
}
```
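As a quick sanity check, the parser can be exercised on a fabricated pytest-style log. The sketch below is self-contained: it uses a local ``TestStatus`` stand-in (the real enum comes from SWE-bench) and made-up log lines:

```python
import re
from enum import Enum

class TestStatus(Enum):  # stand-in for SWE-bench's TestStatus enum
    PASSED = "PASSED"
    FAILED = "FAILED"

def parse_log_tinydb(log: str) -> dict[str, str]:
    # Same matching logic as the parser above, condensed for the demo.
    test_status_map = {}
    pattern = r"^(.*\/.*)::(.*)\s+\w+\s+\[\s*(\d+%)\]$"
    for line in log.split("\n"):
        line = line.strip()
        if re.match(pattern, line):
            test, value = line.split()[:2]
            status = TestStatus.PASSED if value == "PASSED" else TestStatus.FAILED
            test_status_map[test] = status.value
    return test_status_map

log = """tests/test_tinydb.py::test_insert PASSED [ 50%]
tests/test_tinydb.py::test_remove FAILED [100%]"""
print(parse_log_tinydb(log))
```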

Finally, run the evaluation for the created example using the gold patch with the following script:
```
python run.py \
    --dataset_name wentingzhao/spec2repo \
    --split train \
    --max_workers 2 \
    --predictions_path 'gold' \
    --instance_ids spec2repo__tinydb-01 \
    --run_id validate-gold \
    --spec_config configs/specs.yml
```

## Baseline
### Baseline Input & Output

A simple baseline evaluation can be described as follows:
```python
def run_baseline(base_model, agent, prompt, context, target, edit_history):
    """Returns (test_results, error_message)."""
    pass
```

**Input**

`base_model`: base LLM, e.g. `gpt-4o`, `claude-3-5-sonnet-20240620`

`agent`: agent, e.g. `aider`, `opendevin`, `None`

`prompt`: the prompt/instruction given to `agent`/`base_model`

`context`: there are 3 types of context
- `context-type-1`: reference doc/pdf/website
- `context-type-2`: unit tests that the target will be tested with
- `context-type-3`: repo info
  - skeleton of the repo (filenames under each dir)
  - function stubs
  - function names in each file (granularity needs to be specified)

`target`: the target function or file for the agent or base model to complete

`edit_history`: all previous edits, each containing the previous implementation, the updated implementation, and the corresponding error message
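The inputs above can be grouped into one structure, sketched here as a dataclass. The field names mirror the list above, but the class itself is an assumption, not a fixed API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BaselineInput:
    # Hypothetical container for the baseline inputs described above.
    base_model: str            # e.g. "gpt-4o"
    agent: Optional[str]       # e.g. "aider", or None for the raw model
    prompt: str
    context: dict              # reference docs, unit tests, and/or repo info
    target: str                # function or file to complete
    edit_history: list = field(default_factory=list)

inp = BaselineInput("gpt-4o", "aider", "Implement the spec.", {}, "tinydb/database.py")
print(inp.base_model, inp.agent)  # gpt-4o aider
```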

**Output**

`test_results`: WIP

`error_message`: WIP

## Baseline Evaluation & Ablation

There are mainly three axes:
- different `base_model`
- different `agent`
- different `context`

The current priority is to run `gpt-4o` + `aider` with certain `context` to get a first baseline result.
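Sweeping the three axes amounts to a small Cartesian product of configurations (illustrative values only; the context labels reuse the types defined above):

```python
from itertools import product

# Hypothetical ablation grid over the three axes above.
base_models = ["gpt-4o", "claude-3-5-sonnet-20240620"]
agents = ["aider", "opendevin", None]
contexts = ["context-type-1", "context-type-2", "context-type-3"]

runs = list(product(base_models, agents, contexts))
print(len(runs))  # 18 ablation configurations
```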
# Commit0
