[rocm6.4_internal_testing] [ROCm] Improvements for vectorized elementwise kernels (#143269) #1874

jerrymannil · 2025-01-31T19:52:02Z

Compute io_size as the minimum of the input and output sizes, rather than the sum of all sizes
- For example, for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
- But elems_per_thread = 8 works better on half dtypes for AMD GPUs
Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively
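
The arithmetic behind the two changes can be sketched in a few lines of Python. This is only an illustrative model, not the kernel code: the function names and the threshold table in `elems_per_thread` are assumptions chosen to reproduce the numbers quoted above (io_size 6 gives 4, half dtypes should get 8), and taking the minimum over each operand size is one possible reading of "minimum of size of input and output size". The only hard fact used is that a dwordx4 access moves 4 x 32-bit dwords, i.e. 16 bytes.

```python
DWORDX4_BYTES = 16  # 4 dwords x 4 bytes each


def io_size_sum(input_sizes, output_size):
    """Old behaviour as described: sum of all operand sizes, in bytes."""
    return sum(input_sizes) + output_size


def io_size_min(input_sizes, output_size):
    """New behaviour as described: smallest input/output size, in bytes.

    Assumption: the minimum is taken over each individual operand.
    """
    return min(min(input_sizes), output_size)


def elems_per_thread(io_size):
    """Hypothetical threshold table, NOT copied from the PyTorch source.

    It is chosen only so that io_size == 6 maps to 4 and io_size == 2
    maps to 8, matching the figures in the description.
    """
    if io_size == 1:
        return 16
    if io_size < 4:
        return 8
    return 4


def dwordx4_vector_width(dtype_bytes):
    """Number of elements that fill one 16-byte dwordx4 access."""
    return DWORDX4_BYTES // dtype_bytes


# torch.add() on float16/bfloat16: two 2-byte inputs, one 2-byte output.
half_inputs, half_output = [2, 2], 2
old = io_size_sum(half_inputs, half_output)
new = io_size_min(half_inputs, half_output)
print("old io_size:", old, "-> elems_per_thread:", elems_per_thread(old))
print("new io_size:", new, "-> elems_per_thread:", elems_per_thread(new))
print("vector width, 16-bit dtype:", dwordx4_vector_width(2))
print("vector width, 8-bit dtype:", dwordx4_vector_width(1))
```

Under these assumptions the summed io_size penalizes ops with several small operands, while the minimum keeps the per-thread element count tied to the narrowest dtype, which is also what determines how many elements fit in one 16-byte load.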

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

@akadutta

* Compute io_size as the minimum of the input and output sizes, rather than the sum of all sizes
* For example, for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
* But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <[email protected]>

@akadutta

… (#1874)

* Compute io_size as the minimum of the input and output sizes, rather than the sum of all sizes
* For example, for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
* But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <[email protected]>

(cherry picked from commit 4686828)


jerrymannil requested a review from pruthvistony January 31, 2025 19:52

pruthvistony approved these changes Jan 31, 2025

pruthvistony merged commit 4686828 into ROCm:rocm6.4_internal_testing Jan 31, 2025

BLOrange-AMD changed the title ~~[ROCm] Improvements for vectorized elementwise kernels (#143269)~~ [rocm6.4_internal_testing] [ROCm] Improvements for vectorized elementwise kernels (#143269) Feb 7, 2025
