Commit 044aae5

[ROCm] Improvements for vectorized elementwise kernels (pytorch#143269) (#1874)

* Compute io_size as the minimum of the input and output sizes, rather than the sum of all sizes
  * For example, for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16 respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <[email protected]>
(cherry picked from commit 4686828)
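The arithmetic in the commit message can be sketched in a few lines of Python. This is an illustration only, not the kernel code: the helper names `io_size_sum`, `io_size_min`, `elems_per_thread` and `vector_bytes` are hypothetical, the thresholds inside `elems_per_thread` are an assumption chosen to reproduce the numbers quoted above (io_size 6 gives 4 elements per thread), and "minimum of input and output size" is read here as the smallest per-element size across all operands.

```python
# Illustrative sketch only; names and thresholds are assumptions,
# not the actual PyTorch implementation.

def io_size_sum(input_sizes, output_size):
    """Old behaviour as described: sum the byte sizes of all operands."""
    return sum(input_sizes) + output_size

def io_size_min(input_sizes, output_size):
    """New behaviour as described: take the smallest operand byte size."""
    return min(min(input_sizes), output_size)

def elems_per_thread(io_size):
    """Assumed mapping from io_size to elements handled per thread."""
    if io_size == 1:
        return 16
    if io_size < 4:
        return 8
    return 4

def vector_bytes(vec_size, dtype_size):
    """Bytes moved by one vectorized load of vec_size elements."""
    return vec_size * dtype_size

HALF = 2  # bytes per float16 / bfloat16 element

# torch.add() on half dtypes: two inputs and one output, 2 bytes each.
old_io = io_size_sum([HALF, HALF], HALF)
new_io = io_size_min([HALF, HALF], HALF)

print("sum-based io_size:", old_io, "elems_per_thread:", elems_per_thread(old_io))
print("min-based io_size:", new_io, "elems_per_thread:", elems_per_thread(new_io))

# A dwordx4 load moves 4 x 32 bits = 16 bytes.
print("16-bit dtype, vector size 8:", vector_bytes(8, 2), "bytes")
print("8-bit dtype, vector size 16:", vector_bytes(16, 1), "bytes")
```

Under these assumptions the sum-based calculation lands in the 4-elements-per-thread bucket, while the min-based one lands in the 8-element bucket. The last two lines show why vector sizes of 8 and 16 match a dwordx4 load: both move 16 bytes per access.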

1 parent 76e12d5 commit 044aae5
