add thread local cache for brgemm #350

Merged: merged 5 commits into main on Sep 23, 2024

Conversation

crazydemo

Tracking Issue #323

@crazydemo crazydemo linked an issue Sep 20, 2024 that may be closed by this pull request
```cpp
@@ -137,24 +150,38 @@ void dnnl_brgemm_tilerelease() {
void dnnl_brgemm_execute(int64_t kernel_idx, void *A, uint64_t A_offset,
                         void *B, uint64_t B_offset, void *C, uint64_t C_offset,
                         int num) {
  auto it = tl_cache.find(kernel_idx);
```
Contributor

It's better not to define tl_cache as a global static; define it here as a function static.

Author
Thanks for the advice, fixed.
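
For reference, a minimal sketch of the function-static pattern being suggested; it matches the get_tl_cache() accessor quoted later in this thread, and uses the brgemm_cache_info_t struct introduced by this PR:

```cpp
#include <cstdint>
#include <unordered_map>

// A function-local thread_local avoids the static initialization order
// issues of a global and is constructed lazily on first use per thread.
static std::unordered_map<int64_t, brgemm_cache_info_t> &get_tl_cache() {
  thread_local std::unordered_map<int64_t, brgemm_cache_info_t> tl_cache;
  return tl_cache;
}
```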

```cpp
  if (it != tl_cache.end()) {
    desc_ptr = &it->second.desc;
    kernel = it->second.kernel;
  } else {
    read_lock_guard_t g(g_brgemm_lock);
```
Contributor

Since it's thread local, do we still need this lock?

Author

When the target brgemm kernel is not found in the thread-local cache, we still need to lock the global cache to fetch the target brgemm.
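
A sketch of the two-level lookup being described; get_tl_cache() is the accessor from this PR, while g_cache is an assumed name standing in for the unified global table discussed further down:

```cpp
// Fast path: per-thread map, no locking. Slow path: copy the entry out of
// the global cache under the read lock; subsequent calls on this thread
// for the same kernel_idx never touch the lock again.
auto &tl_cache = get_tl_cache();
auto it = tl_cache.find(kernel_idx);
if (it == tl_cache.end()) {
  read_lock_guard_t g(g_brgemm_lock); // guards against concurrent dispatch
  it = tl_cache.emplace(kernel_idx, g_cache[kernel_idx]).first;
}
brgemm_cache_info_t &info = it->second; // shared_ptr members keep data alive
```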

```cpp
struct brgemm_cache_info_t {
  std::shared_ptr<brgemm_desc_t> desc;
  std::shared_ptr<brgemm_kernel_t> kernel;
  std::shared_ptr<char> palette;
```
Contributor

We could use shared_ptr<char[]> for palette. For desc and kernel we don't need to change anything, since they are not managed by smart pointers, and storing pointers to vector elements is dangerous as well.

Author
fixed.
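
For illustration, the difference between the two spellings (PALETTE_SIZE is taken from the dispatch code quoted below):

```cpp
#include <memory>

// shared_ptr<char> owning an array needs an explicit array deleter;
// without it, the destructor would call `delete` on `new[]` memory (UB).
std::shared_ptr<char> p1(new char[PALETTE_SIZE], std::default_delete<char[]>());

// Since C++17, shared_ptr<char[]> calls delete[] automatically:
std::shared_ptr<char[]> p2(new char[PALETTE_SIZE]);
```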

```cpp
  std::shared_ptr<char> palette;

  brgemm_cache_info_t() = default;
  brgemm_cache_info_t(brgemm_desc_t *d, brgemm_kernel_t *k, char *p)
```
Contributor

Ideally we need to change the unique_ptr in the global palette pool to shared_ptr as well, and pass the shared_ptr of the palette here for construction.

Author

fixed.
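
A hedged sketch of that ownership change, assuming the palette member has been switched to shared_ptr<char[]> as suggested above; the pool and helper names are hypothetical:

```cpp
#include <memory>
#include <vector>

// If the global pool holds shared_ptr instead of unique_ptr, a cache entry
// can share ownership of the palette rather than aliasing a raw pointer.
static std::vector<std::shared_ptr<char[]>> g_palette_pool; // was unique_ptr

brgemm_cache_info_t make_cache_entry(int64_t idx) {
  brgemm_cache_info_t info;
  info.palette = g_palette_pool[idx]; // ref-count bump; no raw pointer escapes
  return info;
}
```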

```cpp
struct brgemm_cache_info_t {
  std::shared_ptr<brgemm_desc_t> desc;
  std::shared_ptr<brgemm_kernel_t> kernel;
  std::shared_ptr<char> palette;
```
Contributor
@yifeizh2 yifeizh2 Sep 20, 2024


As we created brgemm_cache_info_t to store desc/kernel/palette together thread-locally, would it be better to also use brgemm_cache_info_t for global management?

Contributor

I think it's a good idea; we can unify the struct used in both the thread-local and global caches.

Author

Done; brgemm_cache_info_t is now used for both the thread-local and global caches.
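
Conceptually, the unified layout looks like this (g_cache is an assumed name for the global table):

```cpp
// One struct, two levels of storage: the global table is the source of
// truth (guarded by g_brgemm_lock); each thread keeps copies of the
// entries it has already used, filled lazily on first use of each kernel.
static std::vector<brgemm_cache_info_t> g_cache; // indexed by kernel_idx
thread_local std::unordered_map<int64_t, brgemm_cache_info_t> tl_cache;
```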

```cpp
    return;
  }
  palette_buffer = g_brgemm_palette[kernel_idx].get();
  info = {&g_brgemm_desc_list[kernel_idx], g_brgemm_kernel_list[kernel_idx],
```
Member

It is not safe to assign a raw pointer to the shared_ptr in the struct, which will release the pointer when the ref count reaches 0.

Author

fixed.
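
In isolation, the hazard and a safe alternative (g_brgemm_desc_list and kernel_idx are from the snippet above):

```cpp
// BUG pattern: the global list owns this object; an owning shared_ptr
// will delete it again when the ref count drops to zero (double free).
std::shared_ptr<brgemm_desc_t> bad(&g_brgemm_desc_list[kernel_idx]);

// If a non-owning handle is really needed, make that explicit with a
// no-op deleter (or, better, share ownership from a shared_ptr source):
std::shared_ptr<brgemm_desc_t> view(&g_brgemm_desc_list[kernel_idx],
                                    [](brgemm_desc_t *) {});
```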

```diff
@@ -93,33 +102,33 @@ int64_t dnnl_brgemm_dispatch(int64_t M, int64_t N, int64_t K, int64_t LDA,
   brgemm_desc_set_attr(&desc, dnnl_attrs);
 
   // TODO(haixin): Reuse identical palettes across kernels
-  char *palette_buffer = nullptr;
+  std::shared_ptr<char[]> palette_buffer(new char[PALETTE_SIZE],
+                                         std::default_delete<char[]>());
```
Contributor

We only need to allocate the palette buffer when desc.is_tmm is true.

Author

fixed.
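
A sketch of the suggested guard (desc.is_tmm and PALETTE_SIZE are from the surrounding dispatch code):

```cpp
// Only kernels that use tile registers (desc.is_tmm) need a tile-config
// palette, so the allocation can be skipped for everything else.
std::shared_ptr<char[]> palette_buffer;
if (desc.is_tmm) {
  palette_buffer.reset(new char[PALETTE_SIZE]);
}
```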

@crazydemo crazydemo merged commit a62e88e into main Sep 23, 2024
6 checks passed
@crazydemo crazydemo deleted the zhangyan/fix_perf branch September 23, 2024 07:13

```cpp
// TODO(haixin): use syscall to determine page size?
static constexpr size_t SCRATCH_SIZE = 2 * 4096;
// TODO(haixin): need to use custom thread management for scratch in the future?
static thread_local char scratch[SCRATCH_SIZE] = {0};

static std::unordered_map<int64_t, brgemm_cache_info_t> &get_tl_cache() {
  thread_local std::unordered_map<int64_t, brgemm_cache_info_t> tl_cache;
  return tl_cache;
}
```


Sorry I am late to the party. Can we use std::vector for better performance?

Contributor

The "key" here might not be contiguous?


Haixin originally used a vector to hold the kernels. I think he tried to make them contiguous. Need to double check that.

Contributor

In the global cache it's contiguous, but in the thread-local cache it might not be. I think we can still use a vector for the thread-local cache, with empty 'holes' inside the vector. Using unordered_map does bring some extra cost.

Author

The previous design was able to use a vector for access because there was only a single global cache storing the BRGEMM information. This PR introduces a new thread-local cache, and the indices in this cache may not necessarily align with those in the global cache.


I think it is still profitable to use a vector. It is contiguous in memory, and most of the time it should be dense (how common will it be for one thread to call brgemm A while another calls brgemm B?). Please note that std::unordered_map is slow and space-consuming: it stores a key-value pair for each entry, and the pairs live in linked bucket chains. That is at least 3 times the space of a vector.
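
For comparison, a sketch of the vector-with-holes alternative discussed above; g_cache and the lock names follow the earlier sketches, and the emptiness test assumes the kernel member is a shared_ptr:

```cpp
#include <vector>

// kernel_idx indexes the vector directly; indices this thread has not
// seen yet are default-constructed "holes" filled on first miss.
brgemm_cache_info_t &tl_lookup(int64_t kernel_idx) {
  thread_local std::vector<brgemm_cache_info_t> tl_cache;
  if (static_cast<size_t>(kernel_idx) >= tl_cache.size())
    tl_cache.resize(kernel_idx + 1); // grow, leaving empty holes behind
  brgemm_cache_info_t &e = tl_cache[kernel_idx];
  if (!e.kernel) { // hole: copy the entry from the global cache once
    read_lock_guard_t g(g_brgemm_lock);
    e = g_cache[kernel_idx];
  }
  return e;
}
```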

@crazydemo crazydemo restored the zhangyan/fix_perf branch September 23, 2024 07:18
Successfully merging this pull request may close these issues: Performance regression caused by read lock in brgemm.
7 participants