Commit 5bd6b74

Merge branch 'main' into aider

2 parents 69f5064 + 881d931

17 files changed: +511 −473 lines

.github/workflows/system.yml

Lines changed: 7 additions & 2 deletions

```diff
@@ -18,13 +18,18 @@ jobs:
         uses: docker/setup-buildx-action@v3
       - name: Install the project
         run: uv sync
-      - name: Clone
+      - name: Set up commit0
         run: uv run commit0 clone simpy
-      - name: Setup
+      - name: Build docker images
         run: uv run commit0 build simpy
       - name: Get tests
         run: uv run commit0 get-tests simpy
       - name: Test
         run: uv run commit0 test-reference simpy tests/test_event.py::test_succeed
       - name: Evaluate
         run: uv run commit0 evaluate-reference simpy
+      - name: Save
+        env:
+          GITHUB_TOKEN: ${{ secrets.MY_GITHUB_TOKEN }}
+        run: |
+          uv run commit0 save simpy test-save-commit0
```
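The workflow above exercises the commit0 CLI end to end. As a sketch only (the commands are taken verbatim from the workflow steps; actually executing them requires `uv`, `commit0`, and Docker to be installed), the same pipeline can be driven from Python:

```python
import subprocess

# The CI steps above as (step name, command) pairs, copied from the workflow.
PIPELINE = [
    ("Set up commit0", "uv run commit0 clone simpy"),
    ("Build docker images", "uv run commit0 build simpy"),
    ("Get tests", "uv run commit0 get-tests simpy"),
    ("Test", "uv run commit0 test-reference simpy tests/test_event.py::test_succeed"),
    ("Evaluate", "uv run commit0 evaluate-reference simpy"),
    ("Save", "uv run commit0 save simpy test-save-commit0"),
]

def run_pipeline(dry_run: bool = True) -> list:
    """Run each step in order; with dry_run=True, only collect the commands."""
    executed = []
    for name, cmd in PIPELINE:
        if not dry_run:
            subprocess.run(cmd.split(), check=True)  # stop on the first failing step
        executed.append(cmd)
    return executed
```

With `dry_run=False` this mirrors the CI run: each step must succeed before the next starts, matching the fail-fast behavior of workflow steps.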

README.md

Lines changed: 1 addition & 158 deletions

````diff
@@ -1,158 +1 @@
-# spec2repo
-
-We set up a task where, given a specification, the goal is to produce an implementation of that specification.
-Specifically, we are interested in converting library specifications to implementations (i.e., repositories).
-We lay out the steps to create a spec2repo example and perform an evaluation on the example using the SWE-bench framework.
-
-First, install the required packages:
-```
-pip install -r requirements.txt
-```
-
-Please provide the following information for the list of repositories in a YAML file:
-```
-repos.yml
-0: # just an index
-  name: [repository_name] # in the form of {organization_name}/{library_name}
-  commit: [commit_sha]
-  tag: [version_tag]
-  setup:
-    - [command_1]
-    - [command_2]
-    - ...
-```
-There are two options to specify the version of the library:
-you can either provide a specific commit or a specific tag. You cannot specify both at the same time.
-Finally, include the commands that set up the library from a local repository.
-For example, to create an example for ``msiemens/tinydb`` with version 4.8:
-```
-repos.yml
-0:
-  name: "msiemens/tinydb"
-  commit: null
-  tag: "v4.8.0"
-  setup:
-    - "python -m pip install --upgrade pip twine"
-    - "pip install poetry"
-    - "poetry install"
-```
-
-We are now ready to generate the dataset. Before that, add your GitHub token to the environment:
-```
-export GITHUB_TOKEN=[github_token]
-```
-Now run:
-```
-python create-data/build_dataset.py repos.json --hf_name wentingzhao/spec2repo
-```
-where ``repos.json`` is the file we specified above, and ``wentingzhao/spec2repo`` is where you want to upload the dataset on HF.
-This command produces the base commit (with function bodies removed), the gold patch that passes all unit tests, and all test function names.
-Note that this script will create a fork of the library. The fork will be created under the organization ``spec2repo``.
-You can change the organization to somewhere else, but if you want to create a fork under ``spec2repo``, please contact Wenting Zhao to be added to the organization.
-
-Now that the dataset has been generated, we move on to using SWE-bench to perform an evaluation.
-First, follow the instructions in the [Docker setup guide](https://docs.docker.com/engine/install/) to install Docker on your machine.
-If you're setting up on Linux, we recommend the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) as well.
-
-To install SWE-bench:
-```bash
-git clone https://github.com/princeton-nlp/SWE-bench.git
-cd SWE-bench
-pip install -e .
-```
-
-Now, let's add a configuration file to build a Docker environment for the library in a YAML file:
-```
-configs/specs.yml
-spec2repo/tinydb:
-  "1.0":
-    python: 3.11
-    install: "python -m pip install --upgrade pip twine; pip install poetry; poetry install"
-    test_cmd: "pytest"
-```
-To adapt this for your own library, leave ``1.0`` unchanged, specify the Python version with ``python``, how to locally build the library with ``install``, and how to run tests with ``test_cmd``.
-
-You also need to write your own function to process the test logs. Please add your function in ``configs/log_parsers.py``. The function should take in a log text file and return a dictionary that maps each test function to its test status, such as passed or failed. After that, update the global variable ``ADD_MAP_REPO_TO_PARSER``:
-```
-configs/log_parsers.py
-def parse_log_tinydb(log: str) -> dict[str, str]:
-    """
-    Parser for test logs generated with the TinyDB framework
-
-    Args:
-        log (str): log content
-    Returns:
-        dict: test case to test status mapping
-    """
-    test_status_map = {}
-    pattern = r"^(.*\/.*)::(.*)\s+\w+\s+\[\s*(\d+%)\]$"
-    for line in log.split("\n"):
-        line = line.strip()
-        m = re.match(pattern, line)
-        if m:
-            line = line.split()
-            test, value = line[:2]
-            if value == "PASSED":
-                test_status_map[test] = TestStatus.PASSED.value
-            else:
-                test_status_map[test] = TestStatus.FAILED.value
-    return test_status_map
-
-ADD_MAP_REPO_TO_PARSER = {
-    "spec2repo/tinydb": parse_log_tinydb,
-}
-```
-
-Finally, run the evaluation for the created example using the gold patch with the following script:
-```
-python run.py \
-    --dataset_name wentingzhao/spec2repo \
-    --split train \
-    --max_workers 2 \
-    --predictions_path 'gold' \
-    --instance_ids spec2repo__tinydb-01 \
-    --run_id validate-gold \
-    --spec_config configs/specs.yml
-```
-
-## Baseline
-### Baseline Input & Output
-
-A simple baseline evaluation can be described like this:
-```python
-def run_baseline(base_model, agent, prompt, context, target, error_history) -> test_results, error_message:
-    pass
-```
-
-**Input**
-
-`base_model`: base LLM, e.g. `gpt-4o`, `claude-3-5-sonnet-20240620`
-
-`agent`: agent, e.g. `aider`, `opendevin`, `None`
-
-`prompt`: the prompt/instruction given to `agent`/`base_model`
-
-`context`: there are 3 types of context
-- `context-type-1`: reference doc/pdf/website
-- `context-type-2`: unit tests that the target will be tested with
-- `context-type-3`: repo info
-  - skeleton of the repo (filenames under each dir)
-  - function stubs
-  - function names in each file (granularity needs to be specified)
-
-`target`: target function or file for the agent or base_model to complete
-`edit_history`: all edit histories; each entry contains the previous implementation, the updated implementation, and the corresponding error message
-
-**Output**
-
-`test_results`: WIP
-`error_message`: WIP
-
-## Baseline Evaluation & Ablation
-
-There are mainly 3 axes:
-- different `base_model`
-- different `agent`
-- different `context`
-
-The current priority is to run `gpt-4o` + `aider` with a certain `context` to get a first baseline result.
+# Commit0
````
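The log parser removed with the old README keys on pytest's progress lines. A self-contained variant of that idea (the regex is simplified, and plain status strings stand in for the repo's `TestStatus` enum — both are assumptions of this sketch):

```python
import re

# Match pytest lines such as "tests/test_db.py::test_insert PASSED [ 50%]"
PATTERN = r"^(.*\/.*)::(\S+)\s+(\w+)\s+\[\s*\d+%\]$"

def parse_pytest_log(log: str) -> dict:
    """Map each test id in a pytest log to 'PASSED' or 'FAILED'."""
    status = {}
    for line in log.splitlines():
        m = re.match(PATTERN, line.strip())
        if m:
            test = f"{m.group(1)}::{m.group(2)}"
            status[test] = "PASSED" if m.group(3) == "PASSED" else "FAILED"
    return status
```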

commit0/__main__.py

Lines changed: 22 additions & 3 deletions

```diff
@@ -3,6 +3,7 @@
 import commit0.harness.build
 import commit0.harness.setup
 import commit0.harness.evaluate
+import commit0.harness.save
 import copy
 import sys
 import os
@@ -20,7 +21,7 @@ def main() -> None:
     )
     # type check config values
     cs = ConfigStore.instance()
-    cs.store(name="user", node=Commit0Config)
+    cs.store(name="user", group="Commit0Config", node=Commit0Config)
     # have hydra to ignore all command-line arguments
     sys_argv = copy.deepcopy(sys.argv)
     sys.argv = [sys.argv[0]]
@@ -29,8 +30,14 @@ def main() -> None:
     # after hydra gets all configs, put command-line arguments back
     sys.argv = sys_argv
     # repo_split: split from command line has a higher priority than split in hydra
-    if command in ["clone", "build", "evaluate", "evaluate-reference"]:
-        if len(sys.argv) == 3:
+    if command in [
+        "clone",
+        "build",
+        "evaluate",
+        "evaluate-reference",
+        "save",
+    ]:
+        if len(sys.argv) >= 3:
             if sys.argv[2] not in SPLIT:
                 raise ValueError(
                     f"repo split must be from {', '.join(SPLIT.keys())}, but you provided {sys.argv[2]}"
@@ -52,6 +59,7 @@ def main() -> None:
             config.dataset_split,
             config.repo_split,
             config.num_workers,
+            config.backend,
         )
     elif command == "get-tests":
         repo = sys.argv[2]
@@ -85,6 +93,17 @@ def main() -> None:
             config.timeout,
             config.num_workers,
         )
+    elif command == "save":
+        organization = sys.argv[3]
+        commit0.harness.save.main(
+            config.dataset_name,
+            config.dataset_split,
+            config.repo_split,
+            config.base_dir,
+            organization,
+            config.branch,
+            config.github_token,
+        )


 if __name__ == "__main__":
```
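The dispatcher above temporarily hides the command-line arguments from Hydra and restores them afterwards for its own dispatch. A minimal standalone sketch of that save-and-restore pattern, with Hydra's parsing replaced by an arbitrary callable (`load_hydra_config` in the usage comment is hypothetical):

```python
import copy
import sys

def call_with_argv_hidden(fn):
    """Invoke fn() with sys.argv reduced to the program name, then restore argv."""
    saved_argv = copy.deepcopy(sys.argv)  # stash the real command-line arguments
    sys.argv = [sys.argv[0]]              # hide everything from fn (e.g. Hydra)
    try:
        return fn()
    finally:
        sys.argv = saved_argv             # put the arguments back for later dispatch

# usage:
# config = call_with_argv_hidden(load_hydra_config)
```

The `try`/`finally` guarantees argv is restored even if the parser raises, which the straight-line version in the diff does not.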

commit0/configs/base.yaml

Lines changed: 3 additions & 0 deletions

```diff
@@ -16,3 +16,6 @@ num_workers: 8
 backend: local
 branch: ai
 timeout: 1_800
+
+# save related
+github_token: null
```

commit0/configs/config_class.py

Lines changed: 4 additions & 0 deletions

```diff
@@ -1,4 +1,5 @@
 from dataclasses import dataclass
+from typing import Optional


 @dataclass
@@ -21,3 +22,6 @@ class Commit0Config:
     branch: str
     # timeout for running pytest
     timeout: int
+
+    # save related
+    github_token: Optional[str]
```
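`github_token` is typed `Optional[str]` so the `github_token: null` default in base.yaml round-trips as `None`. A trimmed sketch of this dataclass pattern, keeping only the fields visible in this hunk (the `= None` default is an addition for the sketch, not part of the diff):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Commit0ConfigSketch:
    # other config fields from the full class are omitted in this sketch
    branch: str
    # timeout for running pytest
    timeout: int
    # save related; stays None when the YAML supplies `null`
    github_token: Optional[str] = None
```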

commit0/configs/user.yaml

Lines changed: 2 additions & 0 deletions

```diff
@@ -1,3 +1,5 @@
 defaults:
   - base
   - _self_
+
+backend: local
```

commit0/harness/build.py

Lines changed: 9 additions & 4 deletions

```diff
@@ -4,9 +4,9 @@
 from datasets import load_dataset
 from typing import Iterator

+from commit0.harness.constants import RepoInstance, SPLIT
 from commit0.harness.docker_build import build_repo_images
 from commit0.harness.spec import make_spec
-from commit0.harness.constants import RepoInstance, SPLIT

 logging.basicConfig(
     level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
@@ -15,7 +15,11 @@


 def main(
-    dataset_name: str, dataset_split: str, repo_split: str, num_workers: int
+    dataset_name: str,
+    dataset_split: str,
+    repo_split: str,
+    num_workers: int,
+    backend: str,
 ) -> None:
     dataset: Iterator[RepoInstance] = load_dataset(dataset_name, split=dataset_split)  # type: ignore
     specs = []
@@ -26,8 +30,9 @@ def main(
         spec = make_spec(example)
         specs.append(spec)

-    client = docker.from_env()
-    build_repo_images(client, specs, num_workers)
+    if backend == "local":
+        client = docker.from_env()
+        build_repo_images(client, specs, num_workers)


 __all__ = []
```
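The new `backend` parameter turns the Docker build into a conditional: only the `local` backend touches Docker at all. A standalone sketch of the guard, with the Docker client and builder injected as callables so it runs without Docker (the function and parameter names are this sketch's, not the repo's):

```python
def maybe_build_images(specs, num_workers, backend, make_client, build_fn):
    """Build repo images only for the local backend; report whether a build ran.

    make_client stands in for docker.from_env, build_fn for build_repo_images.
    """
    if backend != "local":
        return False  # non-local backends skip the Docker build entirely
    client = make_client()
    build_fn(client, specs, num_workers)
    return True
```

Injecting the client factory keeps the guard testable: a non-local backend must never even construct a Docker client.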

commit0/harness/constants.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -11,6 +11,11 @@ class RepoInstance(TypedDict):
     test: Dict[str, str]


+class Files(TypedDict):
+    eval_script: Dict[str, Path]
+    patch: Dict[str, Path]
+
+
 # Constants - Evaluation Log Directories
 BASE_IMAGE_BUILD_DIR = Path("logs/build_images/base")
 REPO_IMAGE_BUILD_DIR = Path("logs/build_images/repo")
@@ -34,6 +39,7 @@ class RepoInstance(TypedDict):
     "get-tests",
     "evaluate",
     "evaluate-reference",
+    "save",
 ]
 # repo splits
 SPLIT_MINITORCH = ["minitorch"]
```
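The new `Files` TypedDict groups per-repo paths for eval scripts and patches. A self-contained sketch of its shape (the concrete paths below are illustrative, not taken from the repo):

```python
from pathlib import Path
from typing import Dict, TypedDict

class Files(TypedDict):
    eval_script: Dict[str, Path]
    patch: Dict[str, Path]

# hypothetical contents for one repo
files: Files = {
    "eval_script": {"simpy": Path("logs/run_evaluation/simpy/eval.sh")},
    "patch": {"simpy": Path("logs/run_evaluation/simpy/patch.diff")},
}
```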

commit0/harness/docker_utils.py

Lines changed: 0 additions & 43 deletions

```diff
@@ -140,49 +140,6 @@ def delete_file_from_container(container: Container, file_path: str) -> None:
         raise Exception(f"General Error: {str(e)}")


-def copy_ssh_pubkey_from_container(container: Container) -> None:
-    """Copy the SSH public key from a Docker container to the local authorized_keys file.
-
-    Args:
-    ----
-        container (Container): Docker container to copy the key from.
-
-    Raises:
-    ------
-        docker.errors.APIError: If there is an error calling the Docker API.
-        Exception: If the file reading or writing process fails.
-
-    """
-    try:
-        exit_code, output = container.exec_run("cat /root/.ssh/id_rsa.pub")
-        if exit_code != 0:
-            raise Exception(f"Error reading file: {output.decode('utf-8').strip()}")
-        public_key = output.decode("utf-8").strip()
-
-        local_authorized_keys_path = os.path.expanduser("~/.ssh/authorized_keys")
-        os.makedirs(os.path.dirname(local_authorized_keys_path), exist_ok=True)
-        if not os.path.exists(local_authorized_keys_path):
-            # Since the file does not exist, create it
-            open(local_authorized_keys_path, "a").close()
-            write = True
-        else:
-            with open(local_authorized_keys_path, "r") as authorized_keys_file:
-                content = authorized_keys_file.read()
-            if public_key not in content:
-                write = True
-            else:
-                write = False
-
-        if write:
-            with open(local_authorized_keys_path, "a") as authorized_keys_file:
-                authorized_keys_file.write(public_key + "\n")
-
-    except docker.errors.APIError as e:
-        raise docker.errors.APIError(f"Docker API Error: {str(e)}")
-    except Exception as e:
-        raise Exception(f"General Error: {str(e)}")
-
-
 def write_to_container(container: Container, data: str, dst: Path) -> None:
     """Write a string to a file in a docker container"""
     # echo with heredoc to file
```
