[mlir] Optimize ThreadLocalCache by removing atomic bottleneck (#93270)
The ThreadLocalCache implementation is used by the MLIRContext (among
other things) to try to manage thread contention in the StorageUniquers.
There is a fancy shared pointer/weak pointer setup that basically keeps
everything alive across threads at the right time, but a huge bottleneck
is the `weak_ptr::lock` call inside the `::get` method. This is because
the `lock` method has to hit the atomic refcount several times, and this
bottlenecks performance across many threads.
However, all this is doing is checking whether the storage is
initialized. We know that it cannot be an expired weak pointer because
the thread local cache object we're calling into owns the memory and is
still alive for the method call to be valid. Thus, we can store an
extra `Value *` inside the thread local cache for speedy retrieval if
the cache is already initialized for the thread, which is the common
case.
This also tightens the size of the critical section in the same method
by scoping the mutex more to just the mutation on `perInstanceState`.
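The fast-path idea can be sketched roughly as follows. This is a minimal, simplified illustration (hypothetical names like `ThreadLocalCacheSketch`, `CacheEntry`, and `perInstanceState` are assumptions, not the actual MLIR implementation, and the real class keeps a per-instance map rather than a single static thread-local entry): a raw non-owning `Value *` is cached per thread so the common already-initialized case avoids the atomic refcount traffic of `weak_ptr::lock`, and the mutex is scoped to just the mutation on the shared state.

```cpp
#include <memory>
#include <mutex>
#include <vector>

// Sketch of a thread-local cache with a raw-pointer fast path. Safe because
// the cache object owns the storage and outlives every call to get(), so a
// non-null cached pointer can never dangle mid-call.
template <typename ValueT>
class ThreadLocalCacheSketch {
public:
  ValueT *get() {
    // Fast path: the raw pointer is set only after initialization, so a
    // non-null value means we can skip weak_ptr::lock entirely.
    if (entry.value)
      return entry.value;

    // Slow path: create the per-thread instance. The critical section is
    // scoped to just the mutation on the shared per-instance state.
    std::shared_ptr<ValueT> instance = std::make_shared<ValueT>();
    {
      std::lock_guard<std::mutex> lock(instanceMutex);
      perInstanceState.push_back(instance); // keep the storage alive
    }
    entry.ref = instance;         // ownership bookkeeping (weak reference)
    entry.value = instance.get(); // speedy retrieval on subsequent calls
    return entry.value;
  }

private:
  struct CacheEntry {
    ValueT *value = nullptr;   // non-owning fast-path pointer
    std::weak_ptr<ValueT> ref; // original shared/weak pointer setup
  };
  // One entry per thread. A static thread_local member is a simplification:
  // it is shared across all caches of the same ValueT, which the real
  // implementation avoids with a per-instance map.
  static thread_local CacheEntry entry;

  std::mutex instanceMutex;
  std::vector<std::shared_ptr<ValueT>> perInstanceState;
};

template <typename ValueT>
thread_local typename ThreadLocalCacheSketch<ValueT>::CacheEntry
    ThreadLocalCacheSketch<ValueT>::entry;
```

After the first call initializes the entry, every later `get()` on that thread is a single non-atomic pointer load, which is what removes the contention shown in the profiles below.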
Before:
<img width="560" alt="image"
src="https://github.com/llvm/llvm-project/assets/15016832/f4ea3f32-6649-4c10-88c4-b7522031e8c9">
After:
<img width="344" alt="image"
src="https://github.com/llvm/llvm-project/assets/15016832/1216db25-3dc1-4b0f-be89-caeff622dd35">