Rust: regenerate MaD files using DCA #19674
base: main
Conversation
Pull Request Overview
This PR updates the bulk model generator to use the DCA strategy for regenerating MaD files, switches configuration from JSON to YAML, and enhances parallelism and cleanup in the Python script.
- Migrate bulk generation config files from JSON to a terser YAML format.
- Refactor `bulk_generate_mad.py` to add a generic `run_in_parallel` helper for cloning and downloading in parallel, with cleanup of old artifacts.
- Regenerate all Rust QL test expected files based on the new DCA outputs.
Reviewed Changes
Copilot reviewed 70 out of 70 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `misc/scripts/models-as-data/bulk_generate_mad.py` | Add `run_in_parallel`, parallel DCA downloads, YAML parsing, cleanup logic |
| `rust/misc/bulk_generation_targets.yml` | New YAML config replacing JSON targets for Rust bulk generation |
| `cpp/bulk_generation_targets.yml` | New YAML config replacing JSON targets for C++ bulk generation |
| Various `.expected` files under `rust/ql/test` | Regeneration of QL test expectations to reflect new DCA outputs |
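For orientation, a minimal sketch of how such a YAML targets file might be consumed, assuming PyYAML and a hypothetical top-level `targets` key (the schema in the PR may differ):

```python
import yaml

# Hypothetical: load the bulk-generation config; field names are illustrative only.
with open("rust/misc/bulk_generation_targets.yml") as f:
    config = yaml.safe_load(f)

for target in config.get("targets", []):
    print(target["name"])
```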
Comments suppressed due to low confidence (1)
misc/scripts/models-as-data/bulk_generate_mad.py:115
- The generic type parameters `T` and `U` are used in the function signature but not defined; add `TypeVar` definitions such as `T = TypeVar('T')` and `U = TypeVar('U')` before their use.

```python
def run_in_parallel[T, U](
```
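For context, a hedged sketch of the two ways these type parameters can be declared: the inline form in the diff is PEP 695 syntax (Python 3.12+), while the explicit `TypeVar` form suggested by the comment works on older versions. The signatures and bodies below are illustrative placeholders, not the script's actual implementation.

```python
from typing import Callable, TypeVar

T = TypeVar("T")
U = TypeVar("U")

# Pre-3.12 style: the type variables are declared explicitly before use.
def run_in_parallel_legacy(func: Callable[[T], U], items: list[T]) -> list[U]:
    return [func(item) for item in items]  # placeholder body

# Python 3.12+ style (PEP 695): type parameters are declared inline in the
# signature, so no separate TypeVar definitions are needed.
def run_in_parallel_inline[A, B](func: Callable[[A], B], items: list[A]) -> list[B]:
    return [func(item) for item in items]  # placeholder body
```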
```python
project_dirs = [project_dirs_map[project["name"]] for project in projects]

dirs = run_in_parallel(
```
[nitpick] Exiting from within a utility function (via sys.exit in on_error handlers) can make the logic harder to test or reuse; consider returning errors and handling exit at the top level instead.
Suggested change:
```diff
-dirs = run_in_parallel(
+failed = run_in_parallel(
```
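As an illustration of the nitpick, a hedged sketch (hypothetical signature, not the script's actual helper) of collecting failures in the worker pool and leaving the decision to exit to the top-level caller:

```python
import concurrent.futures
from typing import Callable, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def run_in_parallel_collecting(worker: Callable[[T], U], items: list[T]) -> tuple[list[U], list[Exception]]:
    # Hypothetical variant: gather errors instead of calling sys.exit inside the helper.
    results: list[U] = []
    errors: list[Exception] = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(worker, item) for item in items]
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:
                errors.append(exc)
    return results, errors

# The caller decides what a failure means, e.g.:
#   results, errors = run_in_parallel_collecting(clone_project, projects)
#   if errors:
#       sys.exit(1)
```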
Looks great! Some comments / discussion, one test annotation needs fixing.
```python
    import yaml
except ImportError:
    print("ERROR: PyYAML is not installed. Please install it with 'pip install pyyaml'.")
    sys.exit(1)
```
I hit a similar problem with `requests` when I first ran this script, FWIW.
good point, I'll add that as well
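For illustration, a sketch of what a combined dependency check might look like, assuming `requests` is guarded the same way as PyYAML (this is not the actual patch):

```python
import sys

try:
    import requests
    import yaml
except ImportError as e:
    # e.name is the missing module's import name (e.g. "yaml" for PyYAML).
    print(f"ERROR: missing dependency '{e.name}'. Please install it with 'pip install pyyaml requests'.")
    sys.exit(1)
```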
```
@@ -341,6 +342,13 @@ def download_dca_databases(
            print(f"Skipping {pretty_name} as it is not in the list of projects")
            continue

        if pretty_name in artifact_map:
            print(f"Skipping previous database {artifact_map[pretty_name]['artifact_name']} for {pretty_name}")
```
Would it be worth choosing the best database of a particular name (e.g. the most recent), or are they likely to be very close and/or difficult to compare anyway?
AFAIK the DBs are ordered in `download.json` in the order they are extracted, so selecting the last one selects the variant (rather than the baseline), which I think is what one would want when, for example, using a nightly run. But let me double check.
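A minimal sketch of the "last entry wins" behaviour being discussed, using hypothetical field names (the real download listing layout may differ):

```python
def pick_databases(analyzed_databases: list[dict]) -> dict[str, dict]:
    # Hypothetical: when a project name repeats, the earlier (baseline) entry
    # is replaced by the later (variant) one from the download listing.
    artifact_map: dict[str, dict] = {}
    for analyzed_database in analyzed_databases:
        pretty_name = analyzed_database["pretty_name"]
        if pretty_name in artifact_map:
            print(f"Skipping previous database {artifact_map[pretty_name]['artifact_name']} for {pretty_name}")
        artifact_map[pretty_name] = analyzed_database
    return artifact_map
```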
```yaml
- name: rocket
- name: actix-web
- name: hyper
- name: clap
```
This is a nice simple list; once everything is merged and stable I'll add a bunch more targets to it.
one thing to keep in mind is that at the moment this list needs to be topologically ordered with respect to dependencies (so later additions should depend on earlier ones and not the other way around). Possibly worth a comment here, now that this is YAML.
also, just so you know, you can tweak what gets generated with any of
```yaml
with-sinks: false
with-sources: false
with-summaries: false
```
(all are true by default)
```
@@ -501,3 +524,5 @@ nodes
| main.rs:323:27:323:27 | v | semmle.label | v |
| main.rs:324:25:324:25 | v | semmle.label | v |
subpaths
testFailures
| main.rs:202:32:202:38 | realloc | Unexpected result: Alert=arg1 |
```
This result is marked `MISSING` in the source; all we need to do is remove the word `MISSING:` in `main.rs` line 528 and it's a win!

I don't see any other meaningful changes to results of tests. 🎉
```python
# And then we iterate over the contents of the extracted directory
# and extract the tar.gz files inside it
for entry in os.listdir(artifact_unzipped_location):
    artifact_tar_location = os.path.join(artifact_unzipped_location, entry)
    with tarfile.open(artifact_tar_location, "r:gz") as tar_ref:
        # And we just untar it to the same directory as the zip file
        tar_ref.extractall(artifact_unzipped_location)
    database_results[pretty_name] = os.path.join(
        artifact_unzipped_location, remove_extension(entry)
    )
# and extract the language tar.gz file inside it
```
We're not iterating any more though, right? We're just unzipping the one correct `.tar.gz`? Why did we iterate in the previous design?
I think it was just a way to take the only contained file without specifying its name, but the name is easy to specify, which is what I've done here.
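A minimal sketch of the non-iterating variant described here, with a hypothetical tarball name (the real artifact layout may differ):

```python
import os
import tarfile

def untar_language_database(artifact_unzipped_location: str, language: str) -> str:
    # Hypothetical: the unzip step already produced artifact_unzipped_location,
    # and the language tarball has a known name, so no directory listing is needed.
    artifact_tar_location = os.path.join(artifact_unzipped_location, f"{language}.tar.gz")
    with tarfile.open(artifact_tar_location, "r:gz") as tar_ref:
        # Untar next to the zip file, same as the previous design.
        tar_ref.extractall(artifact_unzipped_location)
    return os.path.join(artifact_unzipped_location, language)
```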
Also fix some minor things in `bulk_generate_mad.py`.
Looks really great! A few comments, and I think you need to run `black` again as there are a few formatting changes from that.
```python
artifact_map[pretty_name] = analyzed_database

def download_and_extract(item: tuple[str, dict]) -> str:
```
Using "extract" here could be a bit confusing as we already use "extract" with a different meaning in the context of QL. What about using "unzip" instead?
Suggested change:
```diff
-def download_and_extract(item: tuple[str, dict]) -> str:
+def download_and_unzip(item: tuple[str, dict]) -> str:
```
There are a few more "extract"s below that could also be "unzip".
```python
results = run_in_parallel(
    download_and_extract,
    list(artifact_map.items()),
```
`download_and_extract` doesn't use the name, so we could get away with

Suggested change:
```diff
-    list(artifact_map.items()),
+    list(artifact_map.values()),
```

and adjust `download_and_extract` accordingly.
it is theoretically used by the error-printing functions of `run_in_parallel`, but that can be tweaked
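To illustrate the trade-off, a hedged sketch with hypothetical signatures (not the script's actual code): keeping the name in each work item lets error reporting say which project failed, while passing only the values keeps the worker simpler.

```python
# Variant A (current shape): each work item carries the project name, so an
# on_error callback can report which project failed.
def download_and_extract(item: tuple[str, dict]) -> str:
    pretty_name, analyzed_database = item
    return f"/tmp/{pretty_name}"  # placeholder result

# Variant B (suggested shape): pass only the dict; a failure would have to be
# identified some other way, e.g. from a field inside the dict itself.
def download_and_unzip(analyzed_database: dict) -> str:
    return f"/tmp/{analyzed_database['pretty_name']}"  # placeholder result
```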
print("\n=== Finding projects ===") | ||
response = get_json_from_github( | ||
f"https://raw.githubusercontent.com/github/codeql-dca-main/data/{experiment_name}/reports/downloads.json", | ||
pat, | ||
) | ||
targets = response["targets"] | ||
project_map = {project["name"]: project for project in projects} | ||
artifact_map = {} |
There already is an `artifact_name` inside `download_and_extract`, which ends up shadowing this one. Could we rename one of them just to make this a bit clearer?
as they hold exactly the same values, taken the same way from the analyzed_database dict, I don't think that is confusing, so I'd rather leave this