Fix data race in the umfIpcOpenedCacheDestroy function #1111

Merged
merged 3 commits into from
Feb 20, 2025
3 changes: 3 additions & 0 deletions src/ipc_cache.c
@@ -144,6 +144,8 @@ umfIpcOpenedCacheCreate(ipc_opened_cache_eviction_cb_t eviction_cb) {

void umfIpcOpenedCacheDestroy(ipc_opened_cache_handle_t cache) {
    ipc_opened_cache_entry_t *entry, *tmp;

    utils_mutex_lock(&(cache->global->cache_lock));
    HASH_ITER(hh, cache->hash_table, entry, tmp) {
        DL_DELETE(cache->global->lru_list, entry);
        HASH_DEL(cache->hash_table, entry);
@@ -153,6 +155,7 @@ void umfIpcOpenedCacheDestroy(ipc_opened_cache_handle_t cache) {
        umf_ba_free(cache->global->cache_allocator, entry);
    }
    HASH_CLEAR(hh, cache->hash_table);
    utils_mutex_unlock(&(cache->global->cache_lock));

    umf_ba_global_free(cache);
}
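
The hunk above takes `cache_lock` around the teardown loop because each cache unlinks its entries from `cache->global->lru_list`, which is shared between caches. The following is a minimal C++ sketch of that pattern, not the UMF implementation: `GlobalState`, `LruCache`, `cacheInsert`, `cacheDestroy`, and `destroyConcurrently` are hypothetical names, and standard containers stand in for the uthash/utlist macros.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the state shared by every cache instance.
struct GlobalState {
    std::mutex cacheLock;   // plays the role of cache->global->cache_lock
    std::list<int> lruList; // plays the role of cache->global->lru_list
};

// Hypothetical per-pool cache: owns its table, shares the LRU list.
struct LruCache {
    GlobalState *global;
    std::unordered_map<int, std::list<int>::iterator> table;
};

void cacheInsert(LruCache &cache, int key) {
    std::lock_guard<std::mutex> guard(cache.global->cacheLock);
    cache.global->lruList.push_front(key);
    cache.table[key] = cache.global->lruList.begin();
}

// Teardown unlinks every entry from the shared list, so the shared lock
// is held for the whole loop.
void cacheDestroy(LruCache &cache) {
    std::lock_guard<std::mutex> guard(cache.global->cacheLock);
    for (auto &kv : cache.table) {
        cache.global->lruList.erase(kv.second);
    }
    cache.table.clear();
}

// Builds `numCaches` caches, destroys them on separate threads, and
// returns the number of entries left in the shared list.
std::size_t destroyConcurrently(std::size_t numCaches, int entriesPerCache) {
    GlobalState global;
    std::vector<LruCache> caches(numCaches, LruCache{&global, {}});
    int key = 0;
    for (auto &cache : caches) {
        for (int i = 0; i < entriesPerCache; ++i) {
            cacheInsert(cache, key++);
        }
    }
    std::vector<std::thread> threads;
    for (auto &cache : caches) {
        threads.emplace_back([&cache] { cacheDestroy(cache); });
    }
    for (auto &t : threads) {
        t.join();
    }
    return global.lruList.size();
}
```

Without the lock in `cacheDestroy`, two threads could modify the shared list at the same time, which is the kind of access TSAN and Valgrind flag.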
4 changes: 0 additions & 4 deletions src/provider/provider_tracking.c
@@ -473,10 +473,6 @@ static void trackingFinalize(void *provider) {

    critnib_delete(p->ipcCache);

#ifndef NDEBUG
    check_if_tracker_is_empty(p->hTracker, p->pool);
#endif /* NDEBUG */

Comment on lines -476 to -479
Contributor Author

@ldorau I had to delete this check because TSAN and Valgrind reported many data races when several pools are destroyed concurrently. It looks like we never tested concurrent pool destruction.

Contributor

I insist on keeping it under an additional ifdef that is not defined in our CI (maybe UMF_DEBUG_MODE?) - it can be very useful during debugging.

Contributor

but can't we just use a lock here?

Contributor Author

This means that we would need to introduce a lock in the other places where the tracker is used. I am not sure it is a good idea to introduce a lock for debugging purposes.

Contributor

why not, if this is the only place?
these checks could help us catch bugs very early

Contributor

Is this code wrong, or is this only a false positive?
If it is a false positive, then the issue is that when we took critnib from PMDK we did not import the implementation of the "VALGRIND_HG_DRD_DISABLE_CHECKING" macros, which critnib uses to suppress false positives.

Contributor Author

> Is this code wrong, or is this only a false positive? If it is a false positive, then the issue is that when we took critnib from PMDK we did not import the implementation of the "VALGRIND_HG_DRD_DISABLE_CHECKING" macros, which critnib uses to suppress false positives.

I am not 100% sure, but I think it is a real issue rather than a false positive. By the way, the same flow with critnib is used by the disjoint pool; see issues #1114 and #1115 for reference.

The test does the following:

  • Thread 1 destroys Pool A; as a result, check_if_tracker_is_empty is called and iterates over the records in the memory tracker (which is a critnib map).
  • Thread 2 destroys Pool B; the pool deallocates its cached memory blocks, and these blocks are removed from the tracker.
  • As a result, Thread 2 removes from the tracker only the records that correspond to Pool B, but Thread 1 touches all entries in the tracker. That is the problem: Thread 1 might read an entry that Thread 2 is in the process of deleting.
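
The access pattern in the list above can be sketched in C++ roughly as follows. This is an illustration under assumptions, not UMF code: `SharedTracker`, `countRecordsOfPool`, `removeRecordsOfPool`, and `runScenario` are hypothetical names, a `std::map` stands in for the critnib map, and the mutex exists only so that the sketch itself is free of undefined behavior. It is exactly the kind of lock the discussion is about.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <mutex>
#include <thread>

// Hypothetical tracker shared by all pools: address -> id of the owning pool.
struct SharedTracker {
    std::mutex lock; // assumption: present only so the sketch is race-free
    std::map<std::uintptr_t, int> records;
};

// Mirrors the debug check: it walks every record in the tracker,
// including records that belong to other pools.
std::size_t countRecordsOfPool(SharedTracker &tracker, int poolId) {
    std::lock_guard<std::mutex> guard(tracker.lock);
    std::size_t count = 0;
    for (const auto &record : tracker.records) {
        if (record.second == poolId) {
            ++count;
        }
    }
    return count;
}

// Mirrors pool teardown: removes only the records owned by `poolId`.
void removeRecordsOfPool(SharedTracker &tracker, int poolId) {
    std::lock_guard<std::mutex> guard(tracker.lock);
    for (auto it = tracker.records.begin(); it != tracker.records.end();) {
        it = (it->second == poolId) ? tracker.records.erase(it) : std::next(it);
    }
}

// Pool A (id 0) has no records left; thread 1 runs its check while
// thread 2 tears down pool B (id 1). Returns pool A's leftover count.
std::size_t runScenario() {
    SharedTracker tracker;
    for (std::uintptr_t addr = 0; addr < 100; ++addr) {
        tracker.records[addr] = 1; // every remaining record belongs to pool B
    }
    std::size_t leftoverA = 0;
    std::thread t1([&] { leftoverA = countRecordsOfPool(tracker, 0); });
    std::thread t2([&] { removeRecordsOfPool(tracker, 1); });
    t1.join();
    t2.join();
    return leftoverA;
}
```

If the two `lock_guard` lines were removed, thread 1 could dereference a map node while thread 2 erases it, which corresponds to the race described in the last bullet.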

Contributor Author

> why not, if this is the only place?
> these checks could help us catch bugs very early

@bratpiorka I do not want to introduce locks in other places just to support this debugging functionality. I also do not think this debugging check is really needed, because we already check that the tracker is empty when the tracker itself is destroyed (when libumf is unloaded). The removed check runs when a pool is destroyed and verifies that the tracker contains no records from that pool. Even without the check at pool destruction, we can still catch the same issue when the tracker itself is destroyed.
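
The argument above, that a single check at tracker teardown still catches leaked records, can be illustrated with a small C++ sketch. The names `LeakTracker` and `leakedAtTeardown` are hypothetical, and a `std::map` again stands in for the real tracker; the only point is that the check runs once, at a moment when no pool is being destroyed concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical tracker that reports leftover records when it is destroyed.
struct LeakTracker {
    std::map<std::uintptr_t, int> records; // address -> owning pool id
    std::size_t *leakedOut;                // where the teardown check reports

    explicit LeakTracker(std::size_t *out) : leakedOut(out) {}

    // Teardown check: any record still present was never freed.
    ~LeakTracker() { *leakedOut = records.size(); }
};

// Registers two blocks from two pools, frees one or both, and returns
// what the teardown check observed.
std::size_t leakedAtTeardown(bool forgetOneFree) {
    std::size_t leaked = 0;
    {
        LeakTracker tracker(&leaked);
        tracker.records[0x1000] = 0; // block owned by pool 0
        tracker.records[0x2000] = 1; // block owned by pool 1
        tracker.records.erase(0x1000);
        if (!forgetOneFree) {
            tracker.records.erase(0x2000);
        }
    } // tracker destroyed here, on a single thread
    return leaked;
}
```

The trade-off is that the leak is reported later and is not attributed to a specific pool at the moment that pool is destroyed.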

Contributor

@vinser52 ok if we check the tracker at libumf unload then we could remove it from here

Contributor Author

> @vinser52 ok if we check the tracker at libumf unload then we could remove it from here

Done

    umf_ba_global_free(provider);
}

69 changes: 69 additions & 0 deletions test/ipcFixtures.hpp
@@ -593,4 +593,73 @@ TEST_P(umfIpcTest, ConcurrentOpenCloseHandles) {
EXPECT_EQ(stat.openCount, stat.closeCount);
}

TEST_P(umfIpcTest, ConcurrentDestroyIpcHandlers) {
    constexpr size_t SIZE = 100;
    constexpr size_t NUM_ALLOCS = 100;
    constexpr size_t NUM_POOLS = 10;
    void *ptrs[NUM_ALLOCS];
    void *openedPtrs[NUM_POOLS][NUM_ALLOCS];
Contributor

nit: You use camel and snake case when naming variables.

Contributor Author

Done, now variables in the test use camel case and constants use capital letters.

    std::vector<umf::pool_unique_handle_t> consumerPools;
    umf::pool_unique_handle_t producerPool = makePool();
    ASSERT_NE(producerPool.get(), nullptr);

    for (size_t i = 0; i < NUM_POOLS; ++i) {
        consumerPools.push_back(makePool());
    }

    for (size_t i = 0; i < NUM_ALLOCS; ++i) {
        void *ptr = umfPoolMalloc(producerPool.get(), SIZE);
        ASSERT_NE(ptr, nullptr);
        ptrs[i] = ptr;
    }

    for (size_t i = 0; i < NUM_ALLOCS; ++i) {
        umf_ipc_handle_t ipcHandle = nullptr;
        size_t handleSize = 0;
        umf_result_t ret = umfGetIPCHandle(ptrs[i], &ipcHandle, &handleSize);
        ASSERT_EQ(ret, UMF_RESULT_SUCCESS);

        for (size_t poolId = 0; poolId < NUM_POOLS; poolId++) {
            void *ptr = nullptr;
            umf_ipc_handler_handle_t ipcHandler = nullptr;
            ret =
                umfPoolGetIPCHandler(consumerPools[poolId].get(), &ipcHandler);
            ASSERT_EQ(ret, UMF_RESULT_SUCCESS);
            ASSERT_NE(ipcHandler, nullptr);

            ret = umfOpenIPCHandle(ipcHandler, ipcHandle, &ptr);
            ASSERT_EQ(ret, UMF_RESULT_SUCCESS);
            openedPtrs[poolId][i] = ptr;
        }

        ret = umfPutIPCHandle(ipcHandle);
        ASSERT_EQ(ret, UMF_RESULT_SUCCESS);
    }

    for (size_t poolId = 0; poolId < NUM_POOLS; poolId++) {
        for (size_t i = 0; i < NUM_ALLOCS; ++i) {
            umf_result_t ret = umfCloseIPCHandle(openedPtrs[poolId][i]);
            EXPECT_EQ(ret, UMF_RESULT_SUCCESS);
        }
    }

    for (size_t i = 0; i < NUM_ALLOCS; ++i) {
        umf_result_t ret = umfFree(ptrs[i]);
        EXPECT_EQ(ret, UMF_RESULT_SUCCESS);
    }

    // Destroy pools in parallel to cause IPC cache cleanup in parallel.
    umf_test::syncthreads_barrier syncthreads(NUM_POOLS);
    auto poolDestroyFn = [&consumerPools, &syncthreads](size_t tid) {
        syncthreads();
        consumerPools[tid].reset(nullptr);
    };
    umf_test::parallel_exec(NUM_POOLS, poolDestroyFn);

    producerPool.reset(nullptr);

    EXPECT_EQ(stat.putCount, stat.getCount);
    EXPECT_EQ(stat.openCount, stat.closeCount);
}
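
The last part of the test lines all threads up on a barrier and then resets one pool handle per thread, so that the destructors run as close to simultaneously as possible. The sketch below shows one way such helpers could look. `SimpleBarrier`, `parallelExec`, and `destroyInParallel` are hypothetical names written for this illustration; they are not the `umf_test::syncthreads_barrier` and `umf_test::parallel_exec` implementations, and `std::unique_ptr<int>` stands in for a pool handle.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical single-use barrier: callers block until `count` threads arrive.
class SimpleBarrier {
  public:
    explicit SimpleBarrier(std::size_t count) : remaining_(count) {}

    void operator()() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (--remaining_ == 0) {
            cv_.notify_all(); // last arrival releases everyone
        } else {
            cv_.wait(lock, [this] { return remaining_ == 0; });
        }
    }

  private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t remaining_;
};

// Hypothetical helper: runs fn(tid) on `numThreads` threads and joins them.
void parallelExec(std::size_t numThreads,
                  const std::function<void(std::size_t)> &fn) {
    std::vector<std::thread> threads;
    for (std::size_t tid = 0; tid < numThreads; ++tid) {
        threads.emplace_back([&fn, tid] { fn(tid); });
    }
    for (auto &t : threads) {
        t.join();
    }
}

// Resets one handle per thread after all threads pass the barrier.
// Returns how many handles are still alive afterwards.
std::size_t destroyInParallel(std::size_t numPools) {
    std::vector<std::unique_ptr<int>> pools;
    for (std::size_t i = 0; i < numPools; ++i) {
        pools.push_back(std::make_unique<int>(static_cast<int>(i)));
    }
    SimpleBarrier barrier(numPools);
    parallelExec(numPools, [&pools, &barrier](std::size_t tid) {
        barrier();
        pools[tid].reset(nullptr); // each thread touches only its own slot
    });
    std::size_t alive = 0;
    for (const auto &pool : pools) {
        if (pool) {
            ++alive;
        }
    }
    return alive;
}
```

The barrier matters for the test's purpose: without it, thread start-up latency could serialize the destructors and hide the race the test is meant to provoke.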

#endif /* UMF_TEST_IPC_FIXTURES_HPP */
17 changes: 17 additions & 0 deletions test/supp/drd-umf_test-provider_devdax_memory_ipc.supp
@@ -6,3 +6,20 @@
fun:umfOpenIPCHandle
...
}

{
False-positive ConflictingAccess in jemalloc
drd:ConflictingAccess
fun:atomic_*
...
fun:je_*
...
}

{
False-positive ConflictingAccess in tbbmalloc
drd:ConflictingAccess
...
fun:tbb_pool_finalize
...
}
17 changes: 17 additions & 0 deletions test/supp/drd-umf_test-provider_file_memory_ipc.supp
@@ -14,3 +14,20 @@
fun:umfOpenIPCHandle
...
}

{
False-positive ConflictingAccess in jemalloc
drd:ConflictingAccess
fun:atomic_*
...
fun:je_*
...
}

{
False-positive ConflictingAccess in tbbmalloc
drd:ConflictingAccess
...
fun:tbb_pool_finalize
...
}
17 changes: 17 additions & 0 deletions test/supp/drd-umf_test-provider_os_memory.supp
@@ -6,3 +6,20 @@
fun:umfOpenIPCHandle
...
}

{
False-positive ConflictingAccess in jemalloc
drd:ConflictingAccess
fun:atomic_*
...
fun:je_*
...
}

{
False-positive ConflictingAccess in tbbmalloc
drd:ConflictingAccess
...
fun:tbb_pool_finalize
...
}
17 changes: 17 additions & 0 deletions test/supp/helgrind-umf_test-provider_devdax_memory_ipc.supp
@@ -6,3 +6,20 @@
fun:umfOpenIPCHandle
...
}

{
False-positive ConflictingAccess in jemalloc
Helgrind:Race
fun:atomic_*
...
fun:je_*
...
}

{
False-positive ConflictingAccess in tbbmalloc
Helgrind:Race
...
fun:tbb_pool_finalize
...
}
17 changes: 17 additions & 0 deletions test/supp/helgrind-umf_test-provider_file_memory_ipc.supp
@@ -23,3 +23,20 @@
fun:critnib_find
...
}

{
False-positive ConflictingAccess in jemalloc
Helgrind:Race
fun:atomic_*
...
fun:je_*
...
}

{
False-positive ConflictingAccess in tbbmalloc
Helgrind:Race
...
fun:tbb_pool_finalize
...
}
17 changes: 17 additions & 0 deletions test/supp/helgrind-umf_test-provider_os_memory.supp
@@ -6,3 +6,20 @@
fun:umfOpenIPCHandle
...
}

{
False-positive ConflictingAccess in jemalloc
Helgrind:Race
fun:atomic_*
...
fun:je_*
...
}

{
False-positive ConflictingAccess in tbbmalloc
Helgrind:Race
...
fun:tbb_pool_finalize
...
}