build: parallelize the qstr build steps #3538

Merged
merged 3 commits into from
Oct 12, 2020

@jepler jepler commented Oct 11, 2020

Generating "qstr defs" involves running the C preprocessor on a selection of source files, extracting pattern matches from the output, and writing them into a single output file. A second sequential step then reads that file to produce individual "qstr" files. These sequential steps have a large effect on overall build time, especially for a non-LTO build on a powerful desktop computer.

The improvements came in two discrete steps; refer to the individual commit messages for a clearer view.

  • The first commit parallelizes the generation of qstr.i.last. It is the larger timesaver.
  • The second commit eliminates qstr.i.last, and pulls the steps of "makeqstrdefs.py split" into "genlast.py", also parallelizing the regular expression search. It is a smaller timesaver, but still important.

I tested by repeatedly building the feather stm32f405 on my 8 core / 16-thread Ryzen. Before, a build took about 16s. After, a build took about 9.5s. This is elapsed time; CPU time was about 1m12s before and 1m13s after.
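
As a rough check on these figures, the elapsed and CPU times can be turned into a wall-clock speedup and an average number of busy cores. This is only back-of-the-envelope arithmetic on the numbers quoted above; the variable names are illustrative.

```python
# Timings quoted above, in seconds (1m12s = 72s, 1m13s = 73s).
elapsed_before, elapsed_after = 16.0, 9.5
cpu_before, cpu_after = 72.0, 73.0

# Wall-clock speedup of the whole build.
speedup = elapsed_before / elapsed_after

# Average number of cores kept busy: CPU time divided by elapsed time.
busy_before = cpu_before / elapsed_before
busy_after = cpu_after / elapsed_after

print(f"speedup: {speedup:.2f}x")
print(f"average busy cores: {busy_before:.1f} -> {busy_after:.1f}")
```

The total CPU work is essentially unchanged, so the gain comes from spreading the same work over more cores rather than from doing less of it.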

@jepler jepler force-pushed the parallel-qstrlast branch from 2d6fe17 to dd04f4b Compare October 11, 2020 23:55
Rather than simply invoking gcc in preprocessor mode with a list of files, use
a Python script with the (python3) ThreadPoolExecutor to invoke the
preprocessor in parallel.

The amount of concurrency is the number of system CPUs, not the makefile "-j"
parallelism setting, because there is no simple way for a Python program to
cooperate correctly with make's idea of parallelism.

This reduces the build time of stm32f405 feather (a non-LTO build) from 16s to
12s on my 16-thread Ryzen machine.
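
The approach described in this commit message could look roughly like the sketch below. It is not the script from this PR: the function names are hypothetical, and the preprocessor command is passed in as a plain argument list (for example `["gcc", "-E"]` plus whatever flags the build uses).

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def preprocess_one(cpp_command, source):
    """Run the preprocessor command on one source file and return its stdout."""
    result = subprocess.run(
        cpp_command + [source],
        stdout=subprocess.PIPE,
        check=True,   # raise CalledProcessError if the preprocessor fails
        text=True,    # decode stdout to str
    )
    return result.stdout


def preprocess_all(cpp_command, sources, max_workers=None):
    """Preprocess every source concurrently; results keep the input order."""
    # Default to the number of system CPUs rather than make's -j setting.
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda src: preprocess_one(cpp_command, src), sources))
```

Threads are sufficient here because each worker spends its time waiting on a child process, so the interpreter lock is not the bottleneck. A call might look like `preprocess_all(["gcc", "-E"], ["main.c", "gc.c"])`, where the file names are placeholders.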
@jepler jepler force-pushed the parallel-qstrlast branch 2 times, most recently from 084aaa8 to 9380508 Compare October 12, 2020 01:48
This gets a further speedup of about 2.5s (12s -> 9.5s elapsed build time)
for stm32f405_feather.

For what are probably historical reasons, the qstr process involves
preprocessing a large number of source files into a single "qstr.i.last"
file, then reading this and splitting it into one "qstr" file for each
original source ("*.c") file.

By eliminating the step of writing qstr.i.last and parallelizing the
regular-expression matching, build speed is further improved.

Because the step to build QSTR_DEFS_COLLECTED does not access
qstr.i.last, the path is replaced with "-" in the Makefile.
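
A minimal sketch of the idea in this commit message, handling each source's preprocessed text independently instead of going through one combined file, is shown below. It assumes that qstr references appear as `MP_QSTR_<name>` identifiers and that each per-source output is a list of `Q(<name>)` lines; both the pattern and the function names are assumptions for illustration, not taken from this PR.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed pattern: identifiers of the form MP_QSTR_<name> in preprocessed text.
QSTR_RE = re.compile(r"MP_QSTR_([A-Za-z0-9_]+)")


def extract_qstrs(preprocessed_text):
    """Return the sorted, de-duplicated qstr names found in one file's output."""
    return sorted(set(QSTR_RE.findall(preprocessed_text)))


def qstr_lines_per_source(preprocessed_by_source, max_workers=None):
    """Map each source name to its own list of "Q(name)" lines.

    Each source is handled independently, so no combined intermediate file
    is needed and the searches can be submitted to a pool.
    """
    names = list(preprocessed_by_source)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        found = pool.map(extract_qstrs, (preprocessed_by_source[n] for n in names))
        return {n: [f"Q({q})" for q in qstrs] for n, qstrs in zip(names, found)}
```

In CPython, threads running pure-Python regex searches share the interpreter lock, so in a real pipeline most of the benefit would likely come from running the search in the same worker that waits on the preprocessor, overlapping it with other workers' subprocesses.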
@jepler jepler force-pushed the parallel-qstrlast branch from 9380508 to 479552c Compare October 12, 2020 02:18

@tannewt tannewt left a comment

This looks good to me!

In my ninja branch I had started splitting this processing at the build level so that this step would be done on changed files only. Obviously, that is a larger change than this one. :-)

@tannewt tannewt merged commit 9de9678 into adafruit:main Oct 12, 2020
@jepler jepler deleted the parallel-qstrlast branch November 3, 2021 21:10