Skip to content

CXX-3198 upgrade to ClangFormat 19 and replace clang_format.py with clang-format-all.sh #1299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Dec 20, 2024

Conversation

eramongodb
Copy link
Contributor

@eramongodb eramongodb commented Dec 9, 2024

Summary

Resolves CXX-3198. Verified by this patch (containing the proposed changes in #1298).

Replaces the convoluted clang_format.py script with a clang-format-all.sh script that mirrors the one being used by the C Driver.

Unlike the C Driver, this PR proposes using Astral UV to obtain the clang-format binary used by this script. This is the first use of Astral UV in the Evergreen config since its introduction in #1244. All the complexity of obtaining the clang-format binary and verifying its version is delegated to Astral UV. The script only needs to be executed using the command:

uv run --frozen etc/clang-format-all.sh

The current latest ClangFormat version provided by the clang-format package and used by this PR is ClangFormat 19.1.4 (see corresponding docs). A big upgrade from ClangFormat 7.0.1! 🎉

ClangFormat Options

The .clang-format file is updated with the latest configuration options using clang-format --dump-config --style=Google. The Google style is used as the base style for consistency with the former option BasedOnStyle: Google. Old settings were then imported as-is into the new config and the contents cleaned up (de-Googled).

Some minor tweaks were made to the config file to take advantage of new options available and improve formatting behavior to better match our existing coding style:

  • Macros were added to the configuration file to assist ClangFormat with understanding our code. This changed how lengthy types in export macros are formatted (largest % of diffs in this PR).
  • Set QualifierAlignment: Left to enforce consistent const style (const T). (Although personally I would much prefer using T const...)
  • Manual line break comments were removed to avoid interfering with ClangFormat behavior.
  • PenaltyBreakScopeResolution was increased to avoid breaking on namespace scope resolution operators.

The only required manual modifications to source code were disabling ClangFormat in the X macro headers (to avoid treating BSONCXX_ENUM like an attribute macro) and an incorrect fix applied to the ntop function (std::uint8_t const* src (original) -> std::const uint8_t* src (wrong) -> const std::uint8_t* src (correct)).

This PR focuses on the ClangFormat and EVG config upgrades. More major changes to the configuration options are deferred to subsequent PRs for separate and more targeted review.

@eramongodb eramongodb requested a review from kevinAlbs December 9, 2024 20:47
@eramongodb eramongodb self-assigned this Dec 9, 2024
@kevinAlbs kevinAlbs requested review from vector-of-bool and a user and removed request for kevinAlbs December 10, 2024 14:05
@ghost
Copy link

ghost commented Dec 10, 2024

Do we not need any of clang_format.py's features? It had some interesting support for integrating with git, but if nobody uses that I guess we don't need it. It also ran formatting in parallel, which we could do using xargs in the new script.

Do we have a security policy about which external https certificates should be trusted for arbitrary remote code execution?

@eramongodb
Copy link
Contributor Author

eramongodb commented Dec 10, 2024

Do we not need any of clang_format.py's features?

AFAIK only the format, lint, and "find and download the exact Clang binary version" features were being actively used. Git-awareness is primarily for the purpose of finding and filtering the files within the repository to be sorted; simply listing the directories/files to be sorted is a far simpler and straightforward approach (and does not lead to formatting issues with untracked/missing source files).

It also ran formatting in parallel, which we could do using xargs in the new script.

This could be done, but I've opted not to do this to avoid overlapping format warnings in EVG output. Preventing overlapping would require some type of job scheduler such as GNU Parallel which is not readily available on EVG distros. This script is primarily for use by the EVG config; local dev tooling and scripts may be used to reduce and select the files that need to be reformatted (e.g. uv sync --frozen -> point tools at .venv/bin/clang-format or source .venv/bin/activate + run clang-format).

Do we have a security policy about which external https certificates should be trusted for arbitrary remote code execution?

I believe this is handled by DEVPROD when configuring EVG distros and their installed/available system certificates. Nevertheless, filed DEVPROD-13413 requesting preinstallation of Astral UV in EVG distros so the curl+sh pattern can be removed in the future.

@eramongodb
Copy link
Contributor Author

eramongodb commented Dec 11, 2024

formatting in parallel

After further investigation, there doesn't appear to be a significant benefit to running ClangFormat in parallel vs. passing all source files to ClangFormat.

Using multitime for benchmarking (average across 5 runs) on my local 16 logical core machine:

Single command with all files:

$ multitime -n 5 bash -c 'uv run --frozen clang-format -i $@' "${source_files[@]}"
            Mean        Std.Dev.    Min         Median      Max
real        4.748       0.045       4.692       4.773       4.806
user        4.669       0.034       4.609       4.676       4.713
sys         0.077       0.030       0.030       0.087       0.119

Command per file with max parallelism:

$ multitime -n 5 bash -c 'printf "%s\n" $@ | xargs -P 0 -n 1 uv run --frozen clang-format -i' "${source_files[@]}"
            Mean        Std.Dev.    Min         Median      Max
real        6.775       0.186       6.587       6.680       7.013
user        74.767      2.124       72.298      74.140      78.416
sys         21.366      1.228       20.053      20.710      22.975

Command per file, no parallelism:

$ time bash -c 'printf "%s\n" $@ | xargs -n 1 uv run --frozen clang-format -i' "${source_files[@]}"
real    0m32.314s
user    0m22.622s
sys     0m7.847s

Unless there is a flaw I overlooked in my benchmarking method, this seems to suggests using xargs to run clang-format per file in parallel is actually slower than passing all files to a single clang-format command. 🤔 I do not know why this is the case. The single ClangFormat command doesn't appear to be using multiple threads (at least as far as I can observe according to my CPU monitors), but its performance is comparable (and even better than) manually-arranged parallel execution, so I cannot explain where the comparative performance advantage is coming from (possibly filesystem I/O throughput optimizations...?).

@kevinAlbs kevinAlbs requested review from adriandole and removed request for a user December 12, 2024 20:13
@ghost ghost self-requested a review December 16, 2024 19:14
@ghost
Copy link

ghost commented Dec 16, 2024

If I understand correctly my feedback is still desired here and I was just removed above because I was out sick; if I misunderstood, please disregard.

The extra tests and justifications around the simpler clang-format invocation sound good and I appreciate the clarification.

Re executing code over https, I personally have reservations about giving a 3rd party key infrastructure this much power, but if it's not against policy I can approve the change.

@eramongodb
Copy link
Contributor Author

eramongodb commented Dec 16, 2024

To alleviate concerns regarding the curl | sh pattern, this PR has been updated so a clone of the uv-installer.sh script for uv 0.5.9 (latest release) is now committed into the repository as .evergreen/scripts/uv-installer.sh to avoid repeatedly invoking the curl command and blindly running the resulting script per task execution.

Note

This means the uv binary version is pinned to 0.5.9.

AFAICT uv self update is effectively no different to the curl | sh pattern being avoided, so it is deliberately omitted from this PR.

Furthermore, the script is manually patched with the SHA256 checksum values for the 0.5.9 archives (manually obtained from the release assets) and unconditionally validated with each download by setting _checksum_style="sha256" (it is unclear why the install script does not already do this automatically).

An incorrect checksum is reported as follows (triggered locally by manually setting the last digit to 0):

$ .evergreen/scripts/uv-installer.sh
downloading uv 0.5.9 x86_64-unknown-linux-gnu
ERROR: checksum mismatch
            want: e9cca3fb618dbc056f770d3ac4d52af491b532e60c8b19b97b9ba24f42db2bc0
            got:  e9cca3fb618dbc056f770d3ac4d52af491b532e60c8b19b97b9ba24f42db2bc1

The absence of "no checksums to verify" (from the original patch) in the script output indicates the checksum is being validated.

Latest changes verified by this patch.

@ghost
Copy link

ghost commented Dec 16, 2024

Nice! I appreciate the much reduced opportunity for tampering, thanks :)

Copy link
Contributor

@vector-of-bool vector-of-bool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One remark.

Comment on lines 7 to 13
class InstallUV(Function):
name = 'install-uv'
commands = bash_exec(
command_type=EvgCommandType.SETUP,
working_dir='mongo-cxx-driver',
script='.evergreen/scripts/uv-installer.sh'
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be modified to install uv on a cache/scratch space specific to mongo-cxx-driver and the desired uv version?

Copy link
Contributor Author

@eramongodb eramongodb Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new set-cache-dir EVG function which defines a MONGO_CXX_DRIVER_CACHE_DIR EVG expansion. This expansion points to a dedicated cache directory for the C++ Driver (e.g. ~/.cache/mongo-cxx-driver or equivalent). uv is now installed into this cache directory in the uv-0.5.9 (versioned) subdirectory and also defines a UV_INSTALL_DIR EVG expansion for subsequent use in EVG scripts.

downloading uv 0.5.9 x86_64-unknown-linux-gnu
installing to /home/ubuntu/.cache/mongo-cxx-driver/uv-0.5.9
  uv
  uvx
everything's installed!

@eramongodb eramongodb merged commit f1ad6b6 into mongodb:master Dec 20, 2024
17 checks passed
@eramongodb eramongodb deleted the cxx-3198 branch December 20, 2024 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants