Allow ccache to reuse results across build directories #1522

Merged
merged 1 commit into from
Jan 24, 2024

eramongodb
Contributor

This PR proposes adding environment variables instructing ccache to allow reuse of compilation results across different build directories. Verified by this patch (the unexpected task failures seem unrelated to this PR).

This is motivated by the realization that Evergreen tasks are typically executed under a working directory of the form /data/mci/<hash>, where <hash> is used to avoid conflicts between tasks running on the given host. Unfortunately this is also preventing ccache from reusing compilation results as intended.

Per ccache documentation under "Compiling in different directories":

... if you compile the same code in different locations, you can’t share compilation results between the different build directories since you get cache misses because of the absolute build directory paths that are part of the hash.

The presence of <hash> in the absolute paths means that no execution of a task can reuse cached compilation results on the given host, not even a re-execution of the same task on the same host, as <hash> is apparently computed from a combination of task ID, execution number, and PID. A proper solution would probably also involve a remote storage backend (so that cached results can be reused across hosts as well), but I have not yet explored how to support such a setup. Instead, this PR applies the instructions given "to enable cache hits between different build directories":

  • If you build with -g (or similar) to add debug information to the object file, you must [...] set hash_dir = false.
  • If you use absolute paths anywhere on the command line [...] you must set base_dir to an absolute path to a “base directory”. Ccache will then rewrite absolute paths under that directory to relative before computing the hash.

This PR applies both suggestions using the environment variables CCACHE_BASEDIR and CCACHE_NOHASHDIR. These are only added to scripts that are expected to be executed on non-Windows-like distros (our Windows tasks don't appear to be using ccache anyway). The scope of the env vars is deliberately limited so that they (generally) apply only to our builds: they are set immediately before the CMake configure commands run on the C Driver (which also identify the directory to use as base_dir) and unset as necessary to avoid impacting unrelated builds.
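
A minimal sketch of the scoping described above, not the scripts changed by this PR. The helper name `run_with_ccache_reuse` and the paths in the usage comments are hypothetical. Where the PR sets and later unsets the variables, this sketch uses a subshell to get the same limited scope. Only `CCACHE_BASEDIR` and `CCACHE_NOHASHDIR` come from the ccache documentation.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_ccache_reuse is a hypothetical helper, not the PR's code.

# Run a command with ccache configured to share results across build directories.
# $1: directory to use as base_dir (here, the CMake source directory).
# Remaining arguments: the command to run (e.g. a CMake configure or build command).
run_with_ccache_reuse() {
  local base_dir="$1"
  shift
  (
    # Rewrite absolute paths under base_dir to relative paths before hashing.
    export CCACHE_BASEDIR="$base_dir"
    # Corresponds to hash_dir = false: leave the CWD out of the hash.
    export CCACHE_NOHASHDIR=1
    "$@"
  )
}

# Demonstration with a stand-in command; the variables exist only inside the call.
run_with_ccache_reuse "$PWD" sh -c 'echo "base_dir=$CCACHE_BASEDIR nohashdir=$CCACHE_NOHASHDIR"'

# Assumed real-world usage (paths are illustrative):
#   run_with_ccache_reuse "$src" cmake -S "$src" -B "$src/cmake-build"
#   run_with_ccache_reuse "$src" cmake --build "$src/cmake-build"
```

Because the exports happen inside a subshell, unrelated builds run later in the same script do not see either variable.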

I've elected to use the path to the CMake source directory (as identified by the CMake configure command) as base_dir, since (I believe) paths to source files (including header files and include directories) are primarily what impact the ccache hash, and these should be consistent regardless of the location of the source directory to maximize cache hits. Incidentally, for many tasks in the C Driver, this is equivalent to the binary directory (meaning they are in-source builds, which we should probably change to be out-of-source builds at some point...).
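
To illustrate why the source directory works as base_dir, here is a simplified sketch of the idea behind the path rewriting. It is not ccache's implementation: ccache's actual rewriting differs in detail, and the `relativize` helper, the hash values `aaaa`/`bbbb`, and the file paths are all hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: mimics the effect of base_dir, not ccache's real algorithm.

# Print $2 relative to base directory $1 when it lies under it; otherwise print it unchanged.
relativize() {
  case "$2" in
    "$1"/*) printf '%s\n' "${2#"$1"/}" ;;
    *)      printf '%s\n' "$2" ;;
  esac
}

# Two hypothetical Evergreen working directories for the same source tree.
a=$(relativize /data/mci/aaaa/mongoc /data/mci/aaaa/mongoc/src/libbson/src/bson/bson.c)
b=$(relativize /data/mci/bbbb/mongoc /data/mci/bbbb/mongoc/src/libbson/src/bson/bson.c)

echo "$a"
echo "$b"

# Once the differing prefix is gone, both runs present the same path to the hash.
if [ "$a" = "$b" ]; then
  echo "paths match"
fi
```

Paths outside the base directory (system headers, for example) are left absolute in this sketch, which is consistent with them being identical across tasks on the same host.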

I do not expect these changes to lead to problems with cache reuse on Evergreen hosts. The hash still includes many toolchain and configuration details which in aggregate are unlikely to lead to undesirable conflicts. Concerning base_dir, ccache warns:

It works OK in many cases, but there might be cases where things break. One known issue is that absolute paths are not reproduced in dependency files, which can mess up dependency detection in tools like Make and Ninja.

This is probably not a concern for our EVG tasks, which always(?) do a clean build: even if dependency detection is flawed, it should not be an issue so long as the required artifacts are still built. Similarly, concerning hash_dir:

The reason for including the CWD in the hash by default is to prevent a problem with the storage of the current working directory in the debug info of an object file, which can lead ccache to return a cached object file that has the working directory in the debug info set incorrectly. You can disable this option to get cache hits when compiling the same source code in different directories if you don’t mind that CWD in the debug info might be incorrect.

I do not think we will mind the <hash> in /data/mci/<hash> being different in debug info so long as the relative paths to actual source and binary files remain consistent and understandable, which should be the case given the other information that is still included in the ccache hash (preprocessor output, preprocessor and compiler options, input source file, etc.).

@eramongodb eramongodb requested a review from kevinAlbs January 24, 2024 18:27
@eramongodb eramongodb self-assigned this Jan 24, 2024