Skip to content

[llvm][cas] Add validate-if-needed to recover from invalid data #10581

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

benlangmuir
Copy link

Introduce a new validate-if-needed API to the UnifiedOnDiskCache and
llvm-cas tool that triggers out-of-process validation of the CAS once
for every machine boot, and optionally recovers from invalid data by
marking it for garbage collection. This fixes a hole in the CAS data
coherence when a power loss or similar failure causes the OS to not
flush all of the pages in the mmaped on-disk CAS files.

The intent is that clients such as the clang scanning daemon or a build
system should trigger this validation at least once before using the
CAS.

rdar://123542312

Introduce a new validate-if-needed API to the UnifiedOnDiskCache and
llvm-cas tool that triggers out-of-process validation of the CAS once
for every machine boot, and optionally recovers from invalid data by
marking it for garbage collection. This fixes a hole in the CAS data
coherence when a power loss or similar failure causes the OS to not
flush all of the pages in the mmaped on-disk CAS files.

The intent is that clients such as the clang scanning daemon or a build
system should trigger this validation at least once before using the
CAS.

rdar://123542312
Use the new validate-if-needed functionality to ensure the clang
scanning daemon's CAS data is valid.
@benlangmuir
Copy link
Author

@swift-ci please test llvm

@benlangmuir benlangmuir requested a review from akyrtzi April 29, 2025 20:06
Copy link

@cachemeifyoucan cachemeifyoucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the implementation is fine to me. Comments inline.

I think we need to add the plugin version (at least a placeholder) so the APIs can be settled.

sys::path::remove_filename(PathBuf);
sys::path::append(PathBuf, "lock");

int LockFD = -1;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is a good idea to lock before validate (so there is no need to acquire lock here). At least that prevents CAS being broken as the validation is run.

Also maybe the recovery should be done out of process if needed as well instead of always perform in-process?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe the recovery should be done out of process if needed as well instead of always perform in-process?

What is the benefit of recovering out of process?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is a good idea to lock before validate (so there is no need to acquire lock here).

I considered this, but there are two main advantages of only locking when we need to modify the CAS data:

  • It makes this implementation more lenient when working with tools that don't have validation yet or that don't call validation thoroughly, because it will only cause an error if there is actually invalid data. If we locked exclusive immediately then we either need to block indefinitely for other cas users or you get an error even if there is no invalid data.
  • It simplifies the locking because the out-of-process validation can just use the shared lock like normal without needing a new code path.

For

At least that prevents CAS being broken as the validation is run.

For once-per-boot validation to protect against missing flushed pages, it is outside the use-case to care about the CAS being made invalid by our own tooling or mid-way through operation. We could consider changing this down the road if we do period revalidation to catch bugs, but I prefer the current version especially as we are first rolling it out and likely to have tools that don't call it operating from the same CAS.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that since llvm-cas can recover, maybe just let it do the work. I am fine either way and we can make future adjustments if needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this offline already, but to make sure it's captured:
The reason we don't do recovery in the spawned sub process is that we also want recovery to work if validation crashes. If we can improve validation so it is safe enough to run in-process, that would be ideal in the future.

/// The data is already valid.
Valid,
/// The data was invalid, but was recovered.
RecoveredValid,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just call this Recovered?

@benlangmuir
Copy link
Author

I think we need to add the plugin version (at least a placeholder) so the APIs can be settled.

I want to do this as a follow-up. The API will look basically the same except with the plugin path and options, but it's totally separate in theory since it's a new top-level API not just a method you call on an existing CAS instance.

Copy link

@cachemeifyoucan cachemeifyoucan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The pluginCAS doesn't actually share the APIs here. We can do that as followup.

@@ -119,3 +119,12 @@ llvm::Error CASOptions::initCache() const {
std::tie(Cache.CAS, Cache.AC) = std::move(DBs);
return llvm::Error::success();
}

std::string CASOptions::getResolvedCASPath() const {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly prefer to have this in the tool if this function only has one call site. I think we want to sunset the auto name in general.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with getting rid of auto in the future, but I need this API to match CASOptions::initCache() since it's still in use. I changed it so both use this API to keep it in sync.

sys::path::remove_filename(PathBuf);
sys::path::append(PathBuf, "lock");

int LockFD = -1;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that since llvm-cas can recover, maybe just let it do the work. I am fine either way and we can make future adjustments if needed.

Documentation says it can either be ENOTEMPTY (like Darwin) or EEXIST.
Also print the error.
@benlangmuir
Copy link
Author

@swift-ci please test llvm

@benlangmuir benlangmuir merged commit c343be7 into swiftlang:next May 1, 2025
0 of 2 checks passed
@benlangmuir benlangmuir deleted the eng/blangmuir/validate-if-needed-123542312 branch May 1, 2025 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants