Rust: regenerate MaD files using DCA #19674

Open · wants to merge 13 commits into main

Conversation

@redsun82 redsun82 (Contributor) commented Jun 5, 2025

  • rust autogenerated models now use the DCA strategy
  • models were regenerated from a recent DCA run
  • the bulk model generator got some changes:
    • the configuration files are now in YAML format, which is terser and more consistent with how we generally configure stuff
    • when running the DCA strategy, the generator now takes the last DB artifact for each project, which makes it possible to run it against comparison DCA runs
    • downloads from DCA are now run in parallel (up to a maximum of 8 workers), which scales much better with the number of sources; see the sketch after this list
    • the bulk generator cleans up extracted DB locations, which makes it rerunnable without any manual cleanup
    • the generator can now be run directly on POSIX, without needing an explicit python invocation
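
A minimal sketch of the parallel-download shape described above, using concurrent.futures; download_one is a hypothetical stand-in for the script's actual download step, and the 8-worker cap is the one stated above:

    from concurrent.futures import ThreadPoolExecutor

    MAX_WORKERS = 8  # cap on simultaneous downloads, as described above

    def download_one(source: str) -> str:
        # Hypothetical stand-in for the real DCA artifact download; returns a local path.
        return f"/tmp/artifacts/{source}"

    def download_all(sources: list[str]) -> list[str]:
        # Fetch each source on its own worker thread, at most MAX_WORKERS at a time.
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            return list(executor.map(download_one, sources))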

@github-actions github-actions bot added C++ Rust Pull requests that update Rust code labels Jun 5, 2025
@redsun82 redsun82 marked this pull request as ready for review June 5, 2025 10:54
@Copilot Copilot AI review requested due to automatic review settings June 5, 2025 10:54
@redsun82 redsun82 requested review from a team as code owners June 5, 2025 10:54
@Copilot Copilot AI left a comment

Pull Request Overview

This PR updates the bulk model generator to use the DCA strategy for regenerating MaD files, switches configuration from JSON to YAML, and enhances parallelism and cleanup in the Python script.

  • Migrate bulk generation config files from JSON to a terser YAML format.
  • Refactor bulk_generate_mad.py to add a generic run_in_parallel helper for cloning and downloading in parallel, with cleanup of old artifacts.
  • Regenerate all Rust QL test expected files based on the new DCA outputs.

Reviewed Changes

Copilot reviewed 70 out of 70 changed files in this pull request and generated 1 comment.

Reviewed files:
  • misc/scripts/models-as-data/bulk_generate_mad.py: add run_in_parallel, parallel DCA downloads, YAML parsing, cleanup logic
  • rust/misc/bulk_generation_targets.yml: new YAML config replacing JSON targets for Rust bulk generation
  • cpp/bulk_generation_targets.yml: new YAML config replacing JSON targets for C++ bulk generation
  • various .expected files under rust/ql/test: regeneration of QL test expectations to reflect new DCA outputs
Comments suppressed due to low confidence (1)

misc/scripts/models-as-data/bulk_generate_mad.py:115

  • The generic type parameters T and U are used in the function signature but not defined; add TypeVar definitions such as T = TypeVar('T') and U = TypeVar('U') before their use.
def run_in_parallel[T, U](
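
For context: the bracketed [T, U] in this signature is PEP 695 syntax (Python 3.12+), which declares the type parameters inline, so separate TypeVar definitions are only needed on older interpreters. A minimal sketch of both spellings, with an illustrative parameter list rather than the script's actual one:

    from collections.abc import Callable

    # Python 3.12+ (PEP 695): [T, U] declares the type parameters inline.
    def run_in_parallel[T, U](f: Callable[[T], U], items: list[T]) -> list[U]:
        return [f(item) for item in items]  # sequential stand-in for the parallel body

    # Pre-3.12 spelling, as the review comment suggests:
    from typing import TypeVar

    T = TypeVar("T")
    U = TypeVar("U")

    def run_in_parallel_legacy(f: Callable[[T], U], items: list[T]) -> list[U]:
        return [f(item) for item in items]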


project_dirs = [project_dirs_map[project["name"]] for project in projects]

dirs = run_in_parallel(
Copilot AI commented Jun 5, 2025

[nitpick] Exiting from within a utility function (via sys.exit in on_error handlers) can make the logic harder to test or reuse; consider returning errors and handling exit at the top level instead.

Suggested change
dirs = run_in_parallel(
failed = run_in_parallel(
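
As a sketch of the pattern the nitpick points at, the helper could collect failures and return them, letting the caller decide whether to exit. This assumes a ThreadPoolExecutor-based helper and is illustrative rather than the PR's actual implementation:

    from collections.abc import Callable
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_in_parallel[T, U](
        f: Callable[[T], U], items: list[T], max_workers: int = 8
    ) -> tuple[list[U], list[tuple[T, Exception]]]:
        results: list[U] = []
        failures: list[tuple[T, Exception]] = []
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(f, item): item for item in items}
            for future in as_completed(futures):
                try:
                    # Note: results arrive in completion order, not input order.
                    results.append(future.result())
                except Exception as e:
                    # Record the failure instead of exiting inside the helper.
                    failures.append((futures[future], e))
        return results, failures

    # The caller then decides whether failures are fatal, e.g.:
    #   dirs, failed = run_in_parallel(clone_project, projects)
    #   if failed:
    #       sys.exit(f"{len(failed)} operations failed")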


@geoffw0 geoffw0 (Contributor) left a comment

Looks great! Some comments / discussion, one test annotation needs fixing.

try:
    import yaml
except ImportError:
    print("ERROR: PyYAML is not installed. Please install it with 'pip install pyyaml'.")
    sys.exit(1)
Contributor:

I hit a similar problem with requests when I first ran this script, FWIW.

Contributor Author:

good point, I'll add that as well
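
A sketch of the analogous guard for requests, mirroring the PyYAML check above (not necessarily the final patch):

    import sys

    try:
        import requests
    except ImportError:
        print("ERROR: requests is not installed. Please install it with 'pip install requests'.")
        sys.exit(1)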

@@ -341,6 +342,13 @@ def download_dca_databases(
print(f"Skipping {pretty_name} as it is not in the list of projects")
continue

if pretty_name in artifact_map:
print(f"Skipping previous database {artifact_map[pretty_name]['artifact_name']} for {pretty_name}")
Contributor:

Would it be worth choosing the best database of a particular name (e.g. the most recent), or are they likely to be very close and/or difficult to compare anyway?

Contributor Author:

AFAIK the DBs are ordered in download.json by the order they are extracted, so selecting the last one picks the variant (rather than the baseline), which I think is what one would want when using, for example, a nightly run. But let me double check.
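
Illustratively, because entries are processed in download.json order, plain dict assignment makes the last artifact win; the exact shape of the entries is assumed here, with field names following the diff above:

    targets: list[dict] = []  # entries from downloads.json, in extraction order
    artifact_map: dict[str, dict] = {}
    for analyzed_database in targets:
        pretty_name = analyzed_database["pretty_name"]
        if pretty_name in artifact_map:
            print(f"Skipping previous database {artifact_map[pretty_name]['artifact_name']} for {pretty_name}")
        artifact_map[pretty_name] = analyzed_database  # later entries overwrite earlier ones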

- name: rocket
- name: actix-web
- name: hyper
- name: clap
Contributor:

This is a nice simple list, once everything is merged and stable I'll add a bunch more targets to it.

Contributor Author:

one thing to keep in mind is that at the moment this list needs to be topologically ordered with respect to dependencies (so later additions should depend on earlier ones, not the other way around). Possibly worth a comment here, now that this is YAML

Contributor Author:

also, just so you know, you can tweak what gets generated with any of

with-sinks: false
with-sources: false
with-summaries: false

(all are true by default)
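
For illustration, per-target flags like these could be read with true as the default along the following lines (the actual option handling in bulk_generate_mad.py may differ):

    def generation_kinds(target: dict) -> list[str]:
        # Each flag defaults to true when absent from the YAML entry.
        kinds = []
        if target.get("with-sources", True):
            kinds.append("source")
        if target.get("with-sinks", True):
            kinds.append("sink")
        if target.get("with-summaries", True):
            kinds.append("summary")
        return kinds

    # e.g. generation_kinds({"name": "clap", "with-summaries": False}) == ["source", "sink"]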

@@ -501,3 +524,5 @@ nodes
| main.rs:323:27:323:27 | v | semmle.label | v |
| main.rs:324:25:324:25 | v | semmle.label | v |
subpaths
testFailures
| main.rs:202:32:202:38 | realloc | Unexpected result: Alert=arg1 |
Contributor:

This result is marked MISSING in the source; all we need to do is remove the word MISSING: in main.rs line 528 and it's a win!

I don't see any other meaningful changes to results of tests. 🎉

Comment on lines 395 to +396
      # And then we iterate over the contents of the extracted directory
-     # and extract the tar.gz files inside it
+     # and extract the language tar.gz file inside it
      for entry in os.listdir(artifact_unzipped_location):
          artifact_tar_location = os.path.join(artifact_unzipped_location, entry)
          with tarfile.open(artifact_tar_location, "r:gz") as tar_ref:
              # And we just untar it to the same directory as the zip file
              tar_ref.extractall(artifact_unzipped_location)
          database_results[pretty_name] = os.path.join(
              artifact_unzipped_location, remove_extension(entry)
          )
Contributor:

We're not iterating any more though, right? We're just unzipping the one correct .tar.gz?

Why did we iterate in the previous design?

Contributor Author:

I think it was just a way to take the only contained file without specifying its name, but the name is easy to specify, which is what I've done here.
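
A sketch of the direct, non-iterating version; the archive name is assumed here to be "{language}.tar.gz", and remove_extension mirrors the script's helper:

    import os
    import tarfile

    def remove_extension(name: str) -> str:
        # Mirrors the script's helper: strip the trailing ".tar.gz".
        return name[: -len(".tar.gz")] if name.endswith(".tar.gz") else name

    def untar_language_db(artifact_unzipped_location: str, language: str) -> str:
        # Open the single known-named archive instead of scanning the directory.
        entry = f"{language}.tar.gz"
        artifact_tar_location = os.path.join(artifact_unzipped_location, entry)
        with tarfile.open(artifact_tar_location, "r:gz") as tar_ref:
            tar_ref.extractall(artifact_unzipped_location)
        return os.path.join(artifact_unzipped_location, remove_extension(entry))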

Also fix some minor things in `bulk_generate_mad.py`.
@redsun82 redsun82 requested a review from paldepind June 6, 2025 08:17
@paldepind paldepind (Contributor) left a comment

Looks really great!

A few comments, and I think you need to run black again, as there are a few formatting changes from it.


artifact_map[pretty_name] = analyzed_database

def download_and_extract(item: tuple[str, dict]) -> str:
Contributor:

Using "extract" here could be a bit confusing as we already use "extract" with a different meaning in the context of QL. What about using "unzip" instead?

Suggested change
def download_and_extract(item: tuple[str, dict]) -> str:
def download_and_unzip(item: tuple[str, dict]) -> str:

There are a few more uses of "extract" below that could also be "unzip".


results = run_in_parallel(
    download_and_extract,
    list(artifact_map.items()),
Contributor:

download_and_extract doesn't use the name, so we could get away with

Suggested change
list(artifact_map.items()),
list(artifact_map.values()),

and adjust download_and_extract accordingly.

Contributor Author:

it is theoretically used by the error printing functions of run_in_parallel, but that can be tweaked

print("\n=== Finding projects ===")
response = get_json_from_github(
f"https://raw.githubusercontent.com/github/codeql-dca-main/data/{experiment_name}/reports/downloads.json",
pat,
)
targets = response["targets"]
project_map = {project["name"]: project for project in projects}
artifact_map = {}
Contributor:

There is already an artifact_name inside download_and_extract which ends up shadowing this one. Could we rename one of them just to make this a bit clearer?

Contributor Author:

as they hold exactly the same values, taken the same way from the analyzed_database dict, I don't think that is confusing, so I'd rather leave this
