-
Notifications
You must be signed in to change notification settings - Fork 787
[libspirv][ptx-nvidiacl] Change __clc__group_scratch size to 32 x i128 #18431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[libspirv][ptx-nvidiacl] Change __clc__group_scratch size to 32 x i128 #18431
Conversation
To align with the comment in the file that specifies 32 storage locations and 128 bits per warp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even still need all the @__clc__get_group_scratch_<type>
overloads? With opaque pointers they're all equivalent.
|
||
define i8 addrspace(3)* @__clc__get_group_scratch_bool() nounwind alwaysinline { | ||
entry: | ||
%ptr = getelementptr inbounds [128 x i64], [128 x i64] addrspace(3)* @__clc__group_scratch, i64 0, i64 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could/should probably rewrite this to use opaque pointers while we're here. If we do, almost all of this can go away. You could just return ptr addrspace(3) @__clc__group_scratch
for every overload, I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could/should probably rewrite this to use opaque pointers while we're here.
done
If we do, almost all of this can go away. You could just
return ptr addrspace(3) @__clc__group_scratch
for every overload, I think?
I added more global variables for different sizes. It should resolve the comment Reducing storage for small data types or increasing it for user-defined types will likely require an additional pass to track group algorithm usage
on the top of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I take it the scratch memory isn't mean to be shared between the different types? If so we couldn''t have separate globals in this way.
%cast = bitcast i64 addrspace(3)* %ptr to i8 addrspace(3)* | ||
ret i8 addrspace(3)* %cast | ||
} | ||
|
||
define i8 addrspace(3)* @__clc__get_group_scratch_char() nounwind alwaysinline { | ||
entry: | ||
%ptr = getelementptr inbounds [128 x i64], [128 x i64] addrspace(3)* @__clc__group_scratch, i64 0, i64 0 | ||
%ptr = getelementptr inbounds [32 x i128], [32 x i128] addrspace(3)* @__clc__group_scratch, i64 0, i64 0 | ||
%cast = bitcast i64 addrspace(3)* %ptr to i8 addrspace(3)* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't switch to opaque pointers, this needs fixing. Having bitcast i64
is incorrect as %ptr
would be i128 addrspace(3)*
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
%ptr = getelementptr inbounds [128 x i64], [128 x i64] addrspace(3)* @__clc__group_scratch, i64 0, i64 0 | ||
%cast = bitcast i64 addrspace(3)* %ptr to i8 addrspace(3)* | ||
ret i8 addrspace(3)* %cast | ||
%0 = getelementptr inbounds [32 x i1], ptr addrspace(3) @__clc__group_scratch_i1, i64 0, i64 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent to simply ret ptr addrspace(3) @__clc_group_scratch_i1
. I don't think we need these getelementptr
s at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I'm not sure we even really need the _<TYPE>
functions anymore. We could probably just have one unified scratch helper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent to simply
ret ptr addrspace(3) @__clc_group_scratch_i1
. I don't think we need thesegetelementptr
s at all.
done, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I'm not sure we even really need the
_<TYPE>
functions anymore. We could probably just have one unified scratch helper.
Using multiple global variable with different sizes can solve the overestimation of local variable size issue, as mentioned in deleted comment by this PR. For instance, if a test is using char type, there is no need to use a local variable of size 32 x i128.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suppose the tradeoff is that if an application uses multiple types then we'll end up using more memory than before. I don't have a good idea which to prioritise so I'm okay with this as you've proposed it.
We could probably combine these two files in a subsequent PR.
I think that is a general issue that is unrelated to this PR, because any kernel that uses multiple local variables in multiple non-kernel functions would have the same issue. It would lead to overestimation of the total size of used local memory in a kernel. However, I think the backend should be able to optimize this case and accurately report the the total size of used local memory. |
@intel/llvm-gatekeepers please merge, thanks. |
To align with the comment in the file that specifies 32 storage locations and 128 bits per warp.
Change file to opaque pointer mode.
Add more global variables for different sizes to resolve
Reducing storage for small data types
.