# spec2repo

We set up a task where, given a specification, the goal is to produce an implementation of the specification.
Specifically, we are interested in converting library specifications to implementations (i.e., repositories).
We lay out the steps to create a spec2repo example and perform an evaluation on the example using the SWE-bench framework.

First, to install the required packages:
```
pip install -r requirements.txt
```

Please provide the following information for the list of repositories in a YAML file:
```
repos.yml
0: # just an index
    name: [repository_name] # in the form of {organization_name}/{library_name}
    commit: [commit_sha]
    tag: [version_tag]
    setup:
        - [command_1]
        - [command_2]
        - ...
```
There are two options to specify the version of the library:
you can either provide a specific commit or a specific tag, but not both at the same time.
Finally, include the commands that set up the library from a local repository.
For example, to create an example for ``msiemens/tinydb`` at version 4.8.0:
```
repos.yml
0:
    name: "msiemens/tinydb"
    commit: null
    tag: "v4.8.0"
    setup:
        - "python -m pip install --upgrade pip twine"
        - "pip install poetry"
        - "poetry install"
```

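Since a commit and a tag must not be set together, a small sketch of the check can help when assembling ``repos.yml`` programmatically. This validator is illustrative only, operating on an already-parsed entry; it is not part of the spec2repo codebase.

```python
# Minimal sketch: validate that a parsed repos.yml entry sets exactly one
# of `commit` and `tag`. The helper name is ours, not part of spec2repo.
def validate_repo_entry(entry: dict) -> None:
    if not entry.get("name"):
        raise ValueError("missing repository name")
    has_commit = entry.get("commit") is not None
    has_tag = entry.get("tag") is not None
    if has_commit == has_tag:  # both set, or both null
        raise ValueError("specify exactly one of `commit` or `tag`")

entry = {
    "name": "msiemens/tinydb",
    "commit": None,
    "tag": "v4.8.0",
    "setup": ["pip install poetry", "poetry install"],
}
validate_repo_entry(entry)  # passes: tag is set, commit is null
```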
We are now ready to generate the dataset. Before that, add your GitHub token to your environment:
```
export GITHUB_TOKEN=[github_token]
```
Now run:
```
python create-data/build_dataset.py repos.yml --hf_name wentingzhao/spec2repo
```
where ``repos.yml`` is the file we specified above, and ``wentingzhao/spec2repo`` is where you want to upload the dataset on Hugging Face.
This command produces the base commit (with function bodies removed), the gold patch that passes all unit tests, and all test function names.
Note that this script will create a fork of the library under the ``spec2repo`` organization.
You can change the organization, but if you want to create the fork under ``spec2repo``, please contact Wenting Zhao to be added to the organization.

Now that the dataset has been generated, we move on to using SWE-bench to perform an evaluation.
First, follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
If you're setting up on Linux, we recommend following the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.
To install SWE-bench:
```bash
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
```

Now, let's add a YAML configuration file that builds a Docker environment for the library:
```
configs/specs.yml
spec2repo/tinydb:
    "1.0":
        python: 3.11
        install: "python -m pip install --upgrade pip twine; pip install poetry; poetry install"
        test_cmd: "pytest"
```
To adapt this for your own library, leave the ``1.0`` unchanged, specify the Python version with ``python``, how to locally build the library with ``install``, and how to run tests with ``test_cmd``.

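A quick way to catch mistakes in such a config is to check the required fields after parsing. The sketch below works on an already-parsed dictionary mirroring the example above; the validator and its name are our own, not part of SWE-bench.

```python
# Sketch: verify that a library's spec entry carries the fields the
# harness needs. Field names follow the configs/specs.yml example above.
REQUIRED_FIELDS = ("python", "install", "test_cmd")

def validate_spec(specs: dict, repo: str) -> dict:
    entry = specs[repo]["1.0"]  # the version key is left as "1.0"
    missing = [f for f in REQUIRED_FIELDS if f not in entry]
    if missing:
        raise KeyError(f"{repo}: missing fields {missing}")
    return entry

specs = {
    "spec2repo/tinydb": {
        "1.0": {
            "python": "3.11",
            "install": "python -m pip install --upgrade pip twine; "
                       "pip install poetry; poetry install",
            "test_cmd": "pytest",
        }
    }
}
entry = validate_spec(specs, "spec2repo/tinydb")
```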
You also need to write your own function to process the test logs. Please add your function in ``configs/log_parsers.py``. The function should take in the log text and return a dictionary that maps each test function to its test status, such as passed or failed. After that, update the global variable ``ADD_MAP_REPO_TO_PARSER``.
```
configs/log_parsers.py
import re

from swebench.harness.constants import TestStatus


def parse_log_tinydb(log: str) -> dict[str, str]:
    """
    Parser for test logs generated by the TinyDB test suite

    Args:
        log (str): log content
    Returns:
        dict: test case to test status mapping
    """
    test_status_map = {}
    pattern = r"^(.*\/.*)::(.*)\s+\w+\s+\[\s*(\d+%)\]$"
    for line in log.split("\n"):
        line = line.strip()
        if re.match(pattern, line):
            test, value = line.split()[:2]
            if value == "PASSED":
                test_status_map[test] = TestStatus.PASSED.value
            else:
                test_status_map[test] = TestStatus.FAILED.value
    return test_status_map


ADD_MAP_REPO_TO_PARSER = {
    "spec2repo/tinydb": parse_log_tinydb,
}
```

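It helps to sanity-check the parsing regex against a couple of pytest output lines before wiring the parser into the harness. The standalone sketch below duplicates the parser with a stubbed ``TestStatus`` (in SWE-bench the real enum comes from the harness constants) so it can run outside the harness.

```python
import re
from enum import Enum

# Stub of SWE-bench's TestStatus, only for this standalone check.
class TestStatus(Enum):
    PASSED = "PASSED"
    FAILED = "FAILED"

def parse_log_tinydb(log: str) -> dict[str, str]:
    """Map each pytest test case line in `log` to its status."""
    test_status_map = {}
    pattern = r"^(.*\/.*)::(.*)\s+\w+\s+\[\s*(\d+%)\]$"
    for line in log.split("\n"):
        line = line.strip()
        if re.match(pattern, line):
            test, value = line.split()[:2]
            status = TestStatus.PASSED if value == "PASSED" else TestStatus.FAILED
            test_status_map[test] = status.value
    return test_status_map

# Sample pytest verbose output; the unrelated line is ignored by the regex.
log = """tests/test_storages.py::test_json_kwargs PASSED          [ 37%]
tests/test_tinydb.py::test_insert FAILED                  [ 62%]
collected 2 items"""
results = parse_log_tinydb(log)
```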
Finally, run the evaluation for the created example using the gold patch with the following script:
```
python run.py \
    --dataset_name wentingzhao/spec2repo \
    --split train \
    --max_workers 2 \
    --predictions_path 'gold' \
    --instance_ids spec2repo__tinydb-01 \
    --run_id validate-gold \
    --spec_config configs/specs.yml
```

## Baseline
### Baseline Input & Output

A simple baseline evaluation can be described like this:
```python
def run_baseline(base_model, agent, prompt, context, target, edit_history):
    ...
    return test_results, error_message
```

**Input**

`base_model`: base LLM, e.g. `gpt-4o`, `claude-3-5-sonnet-20240620`

`agent`: agent, e.g. `aider`, `opendevin`, `None`

`prompt`: the prompt/instruction given to the `agent`/`base_model`

`context`: there are 3 types of context
- `context-type-1`: reference doc/PDF/website
- `context-type-2`: unit tests that the target will be tested with
- `context-type-3`: repo info
  - skeleton of the repo (filenames under each directory)
  - function stubs
  - function names in each file (granularity to be specified)

`target`: target function or file for the agent or base model to complete

`edit_history`: the entire edit history, where each entry contains the previous implementation, the updated implementation, and the corresponding error message

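One way to make the edit-history structure concrete is a small record type per entry, as sketched below. The class and field names are our assumption for illustration, not a fixed interface.

```python
from dataclasses import dataclass

# Hypothetical record for one edit-history entry, mirroring the three
# components described above.
@dataclass
class EditRecord:
    previous_impl: str   # implementation before the edit
    updated_impl: str    # implementation after the edit
    error_message: str   # error produced by the previous implementation

edit_history = [
    EditRecord(
        previous_impl="def insert(self, doc): ...",
        updated_impl="def insert(self, doc):\n    return self._update(doc)",
        error_message="TypeError: insert() missing 1 required argument",
    )
]
```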
**Output**

`test_results`: WIP

`error_message`: WIP

## Baseline Evaluation & Ablation

There are mainly 3 axes:
- different `base_model`
- different `agent`
- different `context`

The current priority is to run `gpt-4o` + `aider` with a certain `context` to get the first baseline result.