Rust: regenerate MaD files using DCA #19674

Open · wants to merge 13 commits into main

Conversation

@redsun82 redsun82 (Contributor) commented Jun 5, 2025

  • rust autogenerated models now use the DCA strategy
  • models were regenerated from a recent DCA run
  • the bulk model generator got some changes:
    • the configuration files are now in YAML format, which is terser and more consistent with how we generally configure stuff
    • when running the DCA strategy, the generator now takes the last DB artifact for each project, which makes it possible to run it against comparison DCA runs
    • downloads from DCA are now run in parallel (up to a maximum of 8 workers), which scales much better with the number of sources; see the sketch after this list
    • the bulk generator cleans up extracted DB locations, which makes it rerunnable without any manual cleanup
    • the generator can now be run directly on POSIX, without needing an explicit python invocation
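
A minimal sketch of the parallel-download shape described above, using concurrent.futures; download_one is a hypothetical stand-in for the script's actual download step, and the 8-worker cap is the one stated above:

    from concurrent.futures import ThreadPoolExecutor

    MAX_WORKERS = 8  # cap on simultaneous downloads, as described above

    def download_one(source: str) -> str:
        # Hypothetical stand-in for the real DCA artifact download; returns a local path.
        return f"/tmp/artifacts/{source}"

    def download_all(sources: list[str]) -> list[str]:
        # Fetch each source on its own worker thread, at most MAX_WORKERS at a time.
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            return list(executor.map(download_one, sources))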

@github-actions github-actions bot added C++ Rust Pull requests that update Rust code labels Jun 5, 2025
@redsun82 redsun82 marked this pull request as ready for review June 5, 2025 10:54
@Copilot Copilot AI review requested due to automatic review settings June 5, 2025 10:54
@redsun82 redsun82 requested review from a team as code owners June 5, 2025 10:54
@Copilot Copilot AI left a comment

Pull Request Overview

This PR updates the bulk model generator to use the DCA strategy for regenerating MaD files, switches configuration from JSON to YAML, and enhances parallelism and cleanup in the Python script.

  • Migrate bulk generation config files from JSON to a terser YAML format.
  • Refactor bulk_generate_mad.py to add a generic run_in_parallel helper for cloning and downloading in parallel, with cleanup of old artifacts.
  • Regenerate all Rust QL test expected files based on the new DCA outputs.

Reviewed Changes

Copilot reviewed 70 out of 70 changed files in this pull request and generated 1 comment.

Reviewed files:
  • misc/scripts/models-as-data/bulk_generate_mad.py: add run_in_parallel, parallel DCA downloads, YAML parsing, cleanup logic
  • rust/misc/bulk_generation_targets.yml: new YAML config replacing JSON targets for Rust bulk generation
  • cpp/bulk_generation_targets.yml: new YAML config replacing JSON targets for C++ bulk generation
  • various .expected files under rust/ql/test: regeneration of QL test expectations to reflect new DCA outputs
Comments suppressed due to low confidence (1)

misc/scripts/models-as-data/bulk_generate_mad.py:115

  • The generic type parameters T and U are used in the function signature but not defined; add TypeVar definitions such as T = TypeVar('T') and U = TypeVar('U') before their use.
def run_in_parallel[T, U](
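
For context: the bracketed [T, U] in this signature is PEP 695 syntax (Python 3.12+), which declares the type parameters inline, so separate TypeVar definitions are only needed on older interpreters. A minimal sketch of both spellings, with an illustrative parameter list rather than the script's actual one:

    from collections.abc import Callable

    # Python 3.12+ (PEP 695): [T, U] declares the type parameters inline.
    def run_in_parallel[T, U](f: Callable[[T], U], items: list[T]) -> list[U]:
        return [f(item) for item in items]  # sequential stand-in for the parallel body

    # Pre-3.12 spelling, as the review comment suggests:
    from typing import TypeVar

    T = TypeVar("T")
    U = TypeVar("U")

    def run_in_parallel_legacy(f: Callable[[T], U], items: list[T]) -> list[U]:
        return [f(item) for item in items]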


project_dirs = [project_dirs_map[project["name"]] for project in projects]

dirs = run_in_parallel(
Copilot AI commented Jun 5, 2025

[nitpick] Exiting from within a utility function (via sys.exit in on_error handlers) can make the logic harder to test or reuse; consider returning errors and handling exit at the top level instead.

Suggested change
dirs = run_in_parallel(
failed = run_in_parallel(
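
As a sketch of the pattern the nitpick points at, the helper could collect failures and return them, letting the caller decide whether to exit. This assumes a ThreadPoolExecutor-based helper and is illustrative rather than the PR's actual implementation:

    from collections.abc import Callable
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_in_parallel[T, U](
        f: Callable[[T], U], items: list[T], max_workers: int = 8
    ) -> tuple[list[U], list[tuple[T, Exception]]]:
        results: list[U] = []
        failures: list[tuple[T, Exception]] = []
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(f, item): item for item in items}
            for future in as_completed(futures):
                try:
                    # Note: results arrive in completion order, not input order.
                    results.append(future.result())
                except Exception as e:
                    # Record the failure instead of exiting inside the helper.
                    failures.append((futures[future], e))
        return results, failures

    # The caller then decides whether failures are fatal, e.g.:
    #   dirs, failed = run_in_parallel(clone_project, projects)
    #   if failed:
    #       sys.exit(f"{len(failed)} operations failed")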


@geoffw0 geoffw0 (Contributor) left a comment

Looks great! Some comments / discussion, one test annotation needs fixing.

try:
    import yaml
except ImportError:
    print("ERROR: PyYAML is not installed. Please install it with 'pip install pyyaml'.")
    sys.exit(1)
Contributor:

I hit a similar problem with requests when I first ran this script, FWIW.

Contributor Author:

good point, I'll add that as well
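
A sketch of the analogous guard for requests, mirroring the PyYAML check above (not necessarily the final patch):

    import sys

    try:
        import requests
    except ImportError:
        print("ERROR: requests is not installed. Please install it with 'pip install requests'.")
        sys.exit(1)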

@@ -341,6 +342,13 @@ def download_dca_databases(
print(f"Skipping {pretty_name} as it is not in the list of projects")
continue

if pretty_name in artifact_map:
print(f"Skipping previous database {artifact_map[pretty_name]['artifact_name']} for {pretty_name}")
Contributor:

Would it be worth choosing the best database of a particular name (e.g. the most recent), or are they likely to be very close and/or difficult to compare anyway?

Contributor Author:

AFAIK the DBs are ordered in download.json by the order they are extracted, so selecting the last one picks the variant (rather than the baseline), which I think is what one would want when using, for example, a nightly run. But let me double check.
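
Illustratively, because entries are processed in download.json order, plain dict assignment makes the last artifact win; the exact shape of the entries is assumed here, with field names following the diff above:

    targets: list[dict] = []  # entries from downloads.json, in extraction order
    artifact_map: dict[str, dict] = {}
    for analyzed_database in targets:
        pretty_name = analyzed_database["pretty_name"]
        if pretty_name in artifact_map:
            print(f"Skipping previous database {artifact_map[pretty_name]['artifact_name']} for {pretty_name}")
        artifact_map[pretty_name] = analyzed_database  # later entries overwrite earlier ones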

- name: rocket
- name: actix-web
- name: hyper
- name: clap
Contributor:

This is a nice simple list, once everything is merged and stable I'll add a bunch more targets to it.

Contributor Author:

one thing to keep in mind is that at the moment this list needs to be topologically ordered with respect to dependencies (so later additions should depend on earlier ones, not the other way around). Possibly worth a comment here, now that this is YAML

Contributor Author:

also, just so you know, you can tweak what gets generated with any of

with-sinks: false
with-sources: false
with-summaries: false

(all are true by default)
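
For illustration, per-target flags like these could be read with true as the default along the following lines (the actual option handling in bulk_generate_mad.py may differ):

    def generation_kinds(target: dict) -> list[str]:
        # Each flag defaults to true when absent from the YAML entry.
        kinds = []
        if target.get("with-sources", True):
            kinds.append("source")
        if target.get("with-sinks", True):
            kinds.append("sink")
        if target.get("with-summaries", True):
            kinds.append("summary")
        return kinds

    # e.g. generation_kinds({"name": "clap", "with-summaries": False}) == ["source", "sink"]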

@@ -501,3 +524,5 @@ nodes
| main.rs:323:27:323:27 | v | semmle.label | v |
| main.rs:324:25:324:25 | v | semmle.label | v |
subpaths
testFailures
| main.rs:202:32:202:38 | realloc | Unexpected result: Alert=arg1 |
Contributor:

This result is marked MISSING in the source; all we need to do is remove the word MISSING: in main.rs line 528 and it's a win!

I don't see any other meaningful changes to results of tests. 🎉

Comment on lines 395 to +396
      # And then we iterate over the contents of the extracted directory
-     # and extract the tar.gz files inside it
+     # and extract the language tar.gz file inside it
      for entry in os.listdir(artifact_unzipped_location):
          artifact_tar_location = os.path.join(artifact_unzipped_location, entry)
          with tarfile.open(artifact_tar_location, "r:gz") as tar_ref:
              # And we just untar it to the same directory as the zip file
              tar_ref.extractall(artifact_unzipped_location)
          database_results[pretty_name] = os.path.join(
              artifact_unzipped_location, remove_extension(entry)
          )
Contributor:

We're not iterating any more though, right? We're just unzipping the one correct .tar.gz?

Why did we iterate in the previous design?

Contributor Author:

I think it was just a way to take the only contained file without specifying its name, but the name is easy to specify, which is what I've done here.
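
A sketch of the direct, non-iterating version; the archive name is assumed here to be "{language}.tar.gz", and remove_extension mirrors the script's helper:

    import os
    import tarfile

    def remove_extension(name: str) -> str:
        # Mirrors the script's helper: strip the trailing ".tar.gz".
        return name[: -len(".tar.gz")] if name.endswith(".tar.gz") else name

    def untar_language_db(artifact_unzipped_location: str, language: str) -> str:
        # Open the single known-named archive instead of scanning the directory.
        entry = f"{language}.tar.gz"
        artifact_tar_location = os.path.join(artifact_unzipped_location, entry)
        with tarfile.open(artifact_tar_location, "r:gz") as tar_ref:
            tar_ref.extractall(artifact_unzipped_location)
        return os.path.join(artifact_unzipped_location, remove_extension(entry))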

Also fix some minor things in `bulk_generate_mad.py`.
@redsun82 redsun82 requested a review from paldepind June 6, 2025 08:17
@paldepind paldepind (Contributor) left a comment

Looks really great!

A few comments, and I think you need to run black again, as there are a few formatting changes from it.


artifact_map[pretty_name] = analyzed_database

def download_and_extract(item: tuple[str, dict]) -> str:
Contributor:

Using "extract" here could be a bit confusing as we already use "extract" with a different meaning in the context of QL. What about using "unzip" instead?

Suggested change
def download_and_extract(item: tuple[str, dict]) -> str:
def download_and_unzip(item: tuple[str, dict]) -> str:

There are a few more uses of "extract" below that could also be "unzip".


results = run_in_parallel(
    download_and_extract,
    list(artifact_map.items()),
Contributor:

download_and_extract doesn't use the name, so we could get away with

Suggested change
list(artifact_map.items()),
list(artifact_map.values()),

and adjust download_and_extract accordingly.

Contributor Author:

it is theoretically used by the error printing functions of run_in_parallel, but that can be tweaked

print("\n=== Finding projects ===")
response = get_json_from_github(
f"https://raw.githubusercontent.com/github/codeql-dca-main/data/{experiment_name}/reports/downloads.json",
pat,
)
targets = response["targets"]
project_map = {project["name"]: project for project in projects}
artifact_map = {}
Contributor:

There is already an artifact_name inside download_and_extract which ends up shadowing this one. Could we rename one of them just to make this a bit clearer?

Contributor Author:

as they hold exactly the same values, taken the same way from the analyzed_database dict, I don't think that is confusing, so I'd rather leave this
