add thread local cache for brgemm #350

Merged: merged 5 commits into main on Sep 23, 2024

Conversation

crazydemo

Tracking Issue #323

@crazydemo crazydemo linked an issue Sep 20, 2024 that may be closed by this pull request
```cpp
@@ -137,24 +150,38 @@ void dnnl_brgemm_tilerelease() {
void dnnl_brgemm_execute(int64_t kernel_idx, void *A, uint64_t A_offset,
                         void *B, uint64_t B_offset, void *C, uint64_t C_offset,
                         int num) {
  auto it = tl_cache.find(kernel_idx);
```
Contributor

It's better not to define tl_cache as a global static; define it here as a function static.

Author
Thanks for the advice, fixed.
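
For reference, a minimal sketch of the function-static pattern being suggested; it matches the get_tl_cache() accessor quoted later in this thread, and uses the brgemm_cache_info_t struct introduced by this PR:

```cpp
#include <cstdint>
#include <unordered_map>

// A function-local thread_local avoids the static initialization order
// issues of a global and is constructed lazily on first use per thread.
static std::unordered_map<int64_t, brgemm_cache_info_t> &get_tl_cache() {
  thread_local std::unordered_map<int64_t, brgemm_cache_info_t> tl_cache;
  return tl_cache;
}
```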

```cpp
  if (it != tl_cache.end()) {
    desc_ptr = &it->second.desc;
    kernel = it->second.kernel;
  } else {
    read_lock_guard_t g(g_brgemm_lock);
```
Contributor

Since it's thread local, do we still need this lock?

Author

When the target brgemm kernel is not found in the thread-local cache, we still need to lock the global cache to fetch the target brgemm.
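
A sketch of the two-level lookup being described; get_tl_cache() is the accessor from this PR, while g_cache is an assumed name standing in for the unified global table discussed further down:

```cpp
// Fast path: per-thread map, no locking. Slow path: copy the entry out of
// the global cache under the read lock; subsequent calls on this thread
// for the same kernel_idx never touch the lock again.
auto &tl_cache = get_tl_cache();
auto it = tl_cache.find(kernel_idx);
if (it == tl_cache.end()) {
  read_lock_guard_t g(g_brgemm_lock); // guards against concurrent dispatch
  it = tl_cache.emplace(kernel_idx, g_cache[kernel_idx]).first;
}
brgemm_cache_info_t &info = it->second; // shared_ptr members keep data alive
```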

```cpp
struct brgemm_cache_info_t {
  std::shared_ptr<brgemm_desc_t> desc;
  std::shared_ptr<brgemm_kernel_t> kernel;
  std::shared_ptr<char> palette;
```
Contributor

We could use shared_ptr<char[]> for palette. For desc and kernel we don't need to change anything, since they are not managed by smart pointers, and storing pointers to vector elements is dangerous as well.

Author
fixed.
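
For illustration, the difference between the two spellings (PALETTE_SIZE is taken from the dispatch code quoted below):

```cpp
#include <memory>

// shared_ptr<char> owning an array needs an explicit array deleter;
// without it, the destructor would call `delete` on `new[]` memory (UB).
std::shared_ptr<char> p1(new char[PALETTE_SIZE], std::default_delete<char[]>());

// Since C++17, shared_ptr<char[]> calls delete[] automatically:
std::shared_ptr<char[]> p2(new char[PALETTE_SIZE]);
```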

```cpp
  std::shared_ptr<char> palette;

  brgemm_cache_info_t() = default;
  brgemm_cache_info_t(brgemm_desc_t *d, brgemm_kernel_t *k, char *p)
```
Contributor

Ideally we need to change the unique_ptr in the global palette pool to shared_ptr as well, and pass the shared_ptr of the palette here for construction.

Author

fixed.
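
A hedged sketch of that ownership change, assuming the palette member has been switched to shared_ptr<char[]> as suggested above; the pool and helper names are hypothetical:

```cpp
#include <memory>
#include <vector>

// If the global pool holds shared_ptr instead of unique_ptr, a cache entry
// can share ownership of the palette rather than aliasing a raw pointer.
static std::vector<std::shared_ptr<char[]>> g_palette_pool; // was unique_ptr

brgemm_cache_info_t make_cache_entry(int64_t idx) {
  brgemm_cache_info_t info;
  info.palette = g_palette_pool[idx]; // ref-count bump; no raw pointer escapes
  return info;
}
```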

```cpp
struct brgemm_cache_info_t {
  std::shared_ptr<brgemm_desc_t> desc;
  std::shared_ptr<brgemm_kernel_t> kernel;
  std::shared_ptr<char> palette;
```
Contributor
@yifeizh2 yifeizh2 Sep 20, 2024


As we created brgemm_cache_info_t to store desc/kernel/palette together thread-locally, would it be better to also use brgemm_cache_info_t for global management?

Contributor

I think it's a good idea; we can unify the struct used in both the thread-local and global caches.

Author

Done; brgemm_cache_info_t is now used for both the thread-local and global caches.
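
Conceptually, the unified layout looks like this (g_cache is an assumed name for the global table):

```cpp
// One struct, two levels of storage: the global table is the source of
// truth (guarded by g_brgemm_lock); each thread keeps copies of the
// entries it has already used, filled lazily on first use of each kernel.
static std::vector<brgemm_cache_info_t> g_cache; // indexed by kernel_idx
thread_local std::unordered_map<int64_t, brgemm_cache_info_t> tl_cache;
```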

```cpp
    return;
  }
  palette_buffer = g_brgemm_palette[kernel_idx].get();
  info = {&g_brgemm_desc_list[kernel_idx], g_brgemm_kernel_list[kernel_idx],
```
Member

It is not safe to assign a raw pointer to the shared_ptr in the struct, which will release the pointer when the ref count reaches 0.

Author

fixed.
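
In isolation, the hazard and a safe alternative (g_brgemm_desc_list and kernel_idx are from the snippet above):

```cpp
// BUG pattern: the global list owns this object; an owning shared_ptr
// will delete it again when the ref count drops to zero (double free).
std::shared_ptr<brgemm_desc_t> bad(&g_brgemm_desc_list[kernel_idx]);

// If a non-owning handle is really needed, make that explicit with a
// no-op deleter (or, better, share ownership from a shared_ptr source):
std::shared_ptr<brgemm_desc_t> view(&g_brgemm_desc_list[kernel_idx],
                                    [](brgemm_desc_t *) {});
```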

```diff
@@ -93,33 +102,33 @@ int64_t dnnl_brgemm_dispatch(int64_t M, int64_t N, int64_t K, int64_t LDA,
   brgemm_desc_set_attr(&desc, dnnl_attrs);
 
   // TODO(haixin): Reuse identical palettes across kernels
-  char *palette_buffer = nullptr;
+  std::shared_ptr<char[]> palette_buffer(new char[PALETTE_SIZE],
+                                         std::default_delete<char[]>());
```
Contributor

We only need to allocate the palette buffer when desc.is_tmm is true.

Author

fixed.
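
A sketch of the suggested guard (desc.is_tmm and PALETTE_SIZE are from the surrounding dispatch code):

```cpp
// Only kernels that use tile registers (desc.is_tmm) need a tile-config
// palette, so the allocation can be skipped for everything else.
std::shared_ptr<char[]> palette_buffer;
if (desc.is_tmm) {
  palette_buffer.reset(new char[PALETTE_SIZE]);
}
```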

@crazydemo crazydemo merged commit a62e88e into main Sep 23, 2024
6 checks passed
@crazydemo crazydemo deleted the zhangyan/fix_perf branch September 23, 2024 07:13

```cpp
// TODO(haixin): use syscall to determine page size?
static constexpr size_t SCRATCH_SIZE = 2 * 4096;
// TODO(haixin): need to use custom thread management for scratch in the future?
static thread_local char scratch[SCRATCH_SIZE] = {0};

static std::unordered_map<int64_t, brgemm_cache_info_t> &get_tl_cache() {
  thread_local std::unordered_map<int64_t, brgemm_cache_info_t> tl_cache;
  return tl_cache;
}
```


Sorry I am late to the party. Can we use std::vector for better performance?

Contributor

The "key" here might not be contiguous?


Haixin originally used a vector to hold the kernels. I think he tried to make them contiguous. Need to double check that.

Contributor

In the global cache it's contiguous, but in the thread-local cache it might not be. I think we can still use a vector for the thread-local cache, with empty 'holes' inside the vector. Using unordered_map does bring some extra cost.

Author

The previous design was able to use a vector for access because there was only a single global cache storing the BRGEMM information. This PR introduces a new thread-local cache, and the indices in this cache may not necessarily align with those in the global cache.


I think it is still profitable to use a vector. It is contiguous in memory, and most of the time it should be dense (how common will it be for one thread to call brgemm A while another calls brgemm B?). Please note that std::unordered_map is slow and space-consuming: it stores a key-value pair for each entry, and the pairs live in linked bucket chains. That is at least 3 times the space of a vector.
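
For comparison, a sketch of the vector-with-holes alternative discussed above; g_cache and the lock names follow the earlier sketches, and the emptiness test assumes the kernel member is a shared_ptr:

```cpp
#include <vector>

// kernel_idx indexes the vector directly; indices this thread has not
// seen yet are default-constructed "holes" filled on first miss.
brgemm_cache_info_t &tl_lookup(int64_t kernel_idx) {
  thread_local std::vector<brgemm_cache_info_t> tl_cache;
  if (static_cast<size_t>(kernel_idx) >= tl_cache.size())
    tl_cache.resize(kernel_idx + 1); // grow, leaving empty holes behind
  brgemm_cache_info_t &e = tl_cache[kernel_idx];
  if (!e.kernel) { // hole: copy the entry from the global cache once
    read_lock_guard_t g(g_brgemm_lock);
    e = g_cache[kernel_idx];
  }
  return e;
}
```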

@crazydemo crazydemo restored the zhangyan/fix_perf branch September 23, 2024 07:18
Successfully merging this pull request may close these issues: Performance regression caused by read lock in brgemm.
7 participants