Fix data race in the umfIpcOpenedCacheDestroy function #1111

Merged
merged 3 commits into from
Feb 20, 2025

Conversation

vinser52
Contributor

@vinser52 vinser52 commented Feb 19, 2025

Description

This PR fixes a data race (found by Coverity) in the umfIpcOpenedCacheDestroy function.

Checklist

  • Code compiles without errors locally
  • All tests pass locally
  • CI workflows execute properly
  • New tests added, especially if they will fail without my changes

@vinser52 vinser52 requested a review from a team as a code owner February 19, 2025 14:26
@vinser52 vinser52 mentioned this pull request Feb 19, 2025
Comment on lines -476 to -479
#ifndef NDEBUG
check_if_tracker_is_empty(p->hTracker, p->pool);
#endif /* NDEBUG */

Contributor Author

@ldorau I had to delete this check because TSAN and Valgrind reported a lot of data races when several pools are destroyed concurrently. It looks like we never tested destroying pools concurrently.

Contributor

I insist on keeping it under an additional ifdef that is not defined in our CI (maybe UMF_DEBUG_MODE?). It can be very useful during debugging.
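
A minimal sketch of this suggestion, assuming a hypothetical opt-in macro named UMF_DEBUG_MODE (the comment only floats the name) and a stand-in FakeTracker type. Neither is taken from the libumf sources. The check is disabled unless the macro is defined at build time, so a CI build that never defines it would never run the check.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical opt-in switch: build with -DUMF_DEBUG_MODE to enable the check.
#ifdef UMF_DEBUG_MODE
constexpr bool kTrackerCheckEnabled = true;
#else
constexpr bool kTrackerCheckEnabled = false;
#endif

// Stand-in for the real memory tracker: it only counts live records.
struct FakeTracker {
    std::size_t liveRecords;
};

// Returns true when the check passes or when the check is disabled.
bool poolDestroyCheck(const FakeTracker &tracker) {
    if (!kTrackerCheckEnabled) {
        return true; // check disabled: always passes
    }
    return tracker.liveRecords == 0;
}
```

With the macro undefined, poolDestroyCheck always reports success, so the check costs nothing in regular builds but also cannot catch anything there.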

Contributor

but can't we just use a lock here?

Contributor Author

This means that we would need to introduce a lock in the other places where trackers are used. I am not sure it is a good idea to introduce a lock just for debugging purposes.

Contributor

why not, if this is the only place?
these checks could help us catch bugs very early

Contributor

Is this code wrong, or is this only a false positive?
If it is a false positive, then the issue is that when we took critnib from PMDK we did not import the implementation of the "VALGRIND_HG_DRD_DISABLE_CHECKING" macros, which critnib uses to suppress false positives.

Contributor Author

> Is this code wrong, or is this only a false positive? If it is a false positive, then the issue is that when we took critnib from PMDK we did not import the implementation of the "VALGRIND_HG_DRD_DISABLE_CHECKING" macros, which critnib uses to suppress false positives.

I am not 100% sure, but I think it is a real issue rather than a false positive. Btw, the same flow with critnib is used by the disjoint pool; see these issues for reference: #1114, #1115.

The test does the following:

  • Thread 1 destroys Pool A. As a result, check_if_tracker_is_empty is called and iterates over the records in the memory tracker (which is a critnib map).
  • Thread 2 destroys Pool B. The pool deallocates its cached memory blocks, and these blocks are removed from the tracker.
  • Thread 2 removes from the tracker only the records that correspond to Pool B, but Thread 1 touches all entries in the tracker. That is the problem: Thread 1 might read an entry that Thread 2 is in the process of deleting.
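
The interleaving described above can be modeled roughly as follows. This is only a sketch: Tracker, destroyPool, countRecordsOf, and runScenario are hypothetical names, a std::map stands in for the critnib map, and the mutex is an assumption added so that the example is well-defined. The discussed code has no such lock around the full iteration, which is where the reported race comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the shared memory tracker.
struct Tracker {
    std::mutex lock;            // assumption: not present in the discussed code
    std::map<int, int> records; // block id -> id of the owning pool
};

// What Thread 2 does: remove only the records that belong to `pool`.
void destroyPool(Tracker &t, int pool) {
    std::lock_guard<std::mutex> guard(t.lock);
    for (auto it = t.records.begin(); it != t.records.end();) {
        if (it->second == pool) {
            it = t.records.erase(it);
        } else {
            ++it;
        }
    }
}

// What the removed check does: walk every record, including those owned by
// other pools, and count the ones that still belong to `pool`.
std::size_t countRecordsOf(Tracker &t, int pool) {
    std::lock_guard<std::mutex> guard(t.lock);
    std::size_t count = 0;
    for (const auto &entry : t.records) {
        if (entry.second == pool) {
            ++count;
        }
    }
    return count;
}

// Runs both threads at once and returns what Thread 1 saw for Pool A (id 0).
std::size_t runScenario() {
    Tracker t;
    for (int block = 0; block < 100; ++block) {
        t.records[block] = 1; // all cached blocks belong to Pool B (id 1)
    }
    std::size_t leftoversA = 1;
    std::thread thread1([&] { leftoversA = countRecordsOf(t, 0); });
    std::thread thread2([&] { destroyPool(t, 1); });
    thread1.join();
    thread2.join();
    return leftoversA;
}
```

Without the lock_guard in both functions, the iteration in countRecordsOf could observe nodes that destroyPool is erasing, which mirrors the problem reported by TSAN and Valgrind.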

Contributor Author

> why not, if this is the only place?
> these checks could help us catch bugs very early

@bratpiorka I do not want to introduce locks in other places just to support this debugging functionality. I also do not think this debugging check is really needed, because we already check that the tracker is empty when the tracker itself is destroyed (when libumf is unloaded). The removed check was called when a pool is destroyed and verified that the tracker does not contain records from that pool. Even without the check at pool destruction, we will catch the same issue when the tracker itself is destroyed.
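
A rough sketch of this argument, under stated assumptions: Tracker and leftoverAtTeardown are hypothetical names, and the mutex stands in for whatever synchronization the tracker's own insert and remove paths already have. Pools are destroyed concurrently with no per-pool check, and the emptiness check runs once, after all threads have finished, which models the check at tracker destruction.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the memory tracker.
struct Tracker {
    std::mutex lock;                       // models the tracker's own write-side sync
    std::set<std::pair<int, int>> records; // (pool id, block id)

    void add(int pool, int block) {
        std::lock_guard<std::mutex> guard(lock);
        records.insert({pool, block});
    }
    void remove(int pool, int block) {
        std::lock_guard<std::mutex> guard(lock);
        records.erase({pool, block});
    }
};

// Destroys `numPools` pools concurrently without any per-pool check.
// Pool `leakyPool` (use -1 for none) forgets to release its block 0.
// Returns the number of records left when the tracker is torn down.
std::size_t leftoverAtTeardown(int numPools, int blocksPerPool, int leakyPool) {
    Tracker t;
    for (int pool = 0; pool < numPools; ++pool) {
        for (int block = 0; block < blocksPerPool; ++block) {
            t.add(pool, block);
        }
    }
    std::vector<std::thread> workers;
    for (int pool = 0; pool < numPools; ++pool) {
        workers.emplace_back([&t, pool, blocksPerPool, leakyPool] {
            int first = (pool == leakyPool) ? 1 : 0; // simulate a leaked block
            for (int block = first; block < blocksPerPool; ++block) {
                t.remove(pool, block);
            }
        });
    }
    for (auto &worker : workers) {
        worker.join();
    }
    // Single-threaded from here on, as at library unload, so walking or
    // counting all records cannot race with pool destruction.
    return t.records.size();
}
```

The leak is still detected, only later: the teardown check reports a leftover record, at the cost of no longer pointing directly at the pool that leaked it.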

Contributor

@vinser52 ok if we check the tracker at libumf unload then we could remove it from here

Contributor Author

> @vinser52 ok if we check the tracker at libumf unload then we could remove it from here

Done

@vinser52 vinser52 force-pushed the svinogra_tests branch 2 times, most recently from d7ca29e to 4ddf260 Compare February 20, 2025 10:28
constexpr size_t NUM_ALLOCS = 100;
constexpr size_t NUM_POOLS = 10;
void *ptrs[NUM_ALLOCS];
void *openedPtrs[NUM_POOLS][NUM_ALLOCS];

Contributor

nit: You mix camel case and snake case when naming variables.

Contributor Author

Done, now variables in the test use camel case and constants use capital letters.

@lukaszstolarczuk lukaszstolarczuk merged commit a9ff7a8 into oneapi-src:main Feb 20, 2025
79 checks passed

6 participants