Skip to content

[libspirv][ptx-nvidiacl] Change __clc__group_scratch size to 32 x i128 #18431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

wenju-he
Copy link
Contributor

@wenju-he wenju-he commented May 13, 2025

To align with the comment in the file that specifies 32 storage locations and 128 bits per warp.
Change file to opaque pointer mode.
Add more global variables for different sizes to resolve Reducing storage for small data types.

To align with the comment in the file that specifies 32 storage
locations and 128 bits per warp.
@wenju-he wenju-he requested a review from a team as a code owner May 13, 2025 02:06
@wenju-he wenju-he requested a review from ldrumm May 13, 2025 02:06
@wenju-he wenju-he temporarily deployed to WindowsCILock May 13, 2025 02:06 — with GitHub Actions Inactive
@wenju-he wenju-he temporarily deployed to WindowsCILock May 13, 2025 02:31 — with GitHub Actions Inactive
@wenju-he wenju-he temporarily deployed to WindowsCILock May 13, 2025 02:31 — with GitHub Actions Inactive
Copy link
Contributor

@frasercrmck frasercrmck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even still need all the @__clc__get_group_scratch_<type> overloads? With opaque pointers they're all equivalent.


define i8 addrspace(3)* @__clc__get_group_scratch_bool() nounwind alwaysinline {
entry:
%ptr = getelementptr inbounds [128 x i64], [128 x i64] addrspace(3)* @__clc__group_scratch, i64 0, i64 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could/should probably rewrite this to use opaque pointers while we're here. If we do, almost all of this can go away. You could just return ptr addrspace(3) @__clc__group_scratch for every overload, I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could/should probably rewrite this to use opaque pointers while we're here.

done

If we do, almost all of this can go away. You could just return ptr addrspace(3) @__clc__group_scratch for every overload, I think?

I added more global variables for different sizes. It should resolve the comment Reducing storage for small data types or increasing it for user-defined types will likely require an additional pass to track group algorithm usage on the top of the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I take it the scratch memory isn't mean to be shared between the different types? If so we couldn''t have separate globals in this way.

%cast = bitcast i64 addrspace(3)* %ptr to i8 addrspace(3)*
ret i8 addrspace(3)* %cast
}

define i8 addrspace(3)* @__clc__get_group_scratch_char() nounwind alwaysinline {
entry:
%ptr = getelementptr inbounds [128 x i64], [128 x i64] addrspace(3)* @__clc__group_scratch, i64 0, i64 0
%ptr = getelementptr inbounds [32 x i128], [32 x i128] addrspace(3)* @__clc__group_scratch, i64 0, i64 0
%cast = bitcast i64 addrspace(3)* %ptr to i8 addrspace(3)*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't switch to opaque pointers, this needs fixing. Having bitcast i64 is incorrect as %ptr would be i128 addrspace(3)*?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

%ptr = getelementptr inbounds [128 x i64], [128 x i64] addrspace(3)* @__clc__group_scratch, i64 0, i64 0
%cast = bitcast i64 addrspace(3)* %ptr to i8 addrspace(3)*
ret i8 addrspace(3)* %cast
%0 = getelementptr inbounds [32 x i1], ptr addrspace(3) @__clc__group_scratch_i1, i64 0, i64 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to simply ret ptr addrspace(3) @__clc_group_scratch_i1. I don't think we need these getelementptrs at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I'm not sure we even really need the _<TYPE> functions anymore. We could probably just have one unified scratch helper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to simply ret ptr addrspace(3) @__clc_group_scratch_i1. I don't think we need these getelementptrs at all.

done, thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I'm not sure we even really need the _<TYPE> functions anymore. We could probably just have one unified scratch helper.

Using multiple global variable with different sizes can solve the overestimation of local variable size issue, as mentioned in deleted comment by this PR. For instance, if a test is using char type, there is no need to use a local variable of size 32 x i128.

@wenju-he wenju-he temporarily deployed to WindowsCILock May 22, 2025 01:03 — with GitHub Actions Inactive
@wenju-he wenju-he requested a review from frasercrmck May 22, 2025 01:05
@wenju-he wenju-he temporarily deployed to WindowsCILock May 22, 2025 01:31 — with GitHub Actions Inactive
@wenju-he wenju-he temporarily deployed to WindowsCILock May 22, 2025 01:31 — with GitHub Actions Inactive
Copy link
Contributor

@frasercrmck frasercrmck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I suppose the tradeoff is that if an application uses multiple types then we'll end up using more memory than before. I don't have a good idea which to prioritise so I'm okay with this as you've proposed it.

We could probably combine these two files in a subsequent PR.

@wenju-he
Copy link
Contributor Author

Yeah, I suppose the tradeoff is that if an application uses multiple types then we'll end up using more memory than before.

I think that is a general issue that is unrelated to this PR, because any kernel that uses multiple local variables in multiple non-kernel functions would have the same issue. It would lead to overestimation of the total size of used local memory in a kernel. However, I think the backend should be able to optimize this case and accurately report the the total size of used local memory.

@wenju-he
Copy link
Contributor Author

@intel/llvm-gatekeepers please merge, thanks. Jenkins/Precommit fail is probably infrastructure issue.

@uditagarwal97 uditagarwal97 merged commit 89f6a39 into intel:sycl May 26, 2025
23 of 24 checks passed
@wenju-he wenju-he deleted the ptx-nvidiacl-collectives_helpers.ll branch May 27, 2025 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants