[ROCm] Improvements for vectorized elementwise kernels (pytorch#143269) (#1874)
* Compute io_size as the minimum of the input element size and the output
element size, rather than the sum of all sizes
* e.g., for torch.add() on half dtypes (bfloat16/float16), the old
calc_io_size() returns 6, causing elems_per_thread to be 4
* However, elems_per_thread = 8 works better for half dtypes on AMD GPUs
(see the first sketch below)
* Enable the *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by
using vector sizes of 8 and 16, respectively (see the second sketch below)
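
A minimal, hypothetical C++ sketch of the io_size heuristic described above. The names `calc_io_size_sum`, `calc_io_size_min`, and `elems_per_thread`, as well as the thresholds inside them, are illustrative assumptions chosen to reproduce the numbers quoted in this commit message, not the actual PyTorch implementation.

```cpp
// Hypothetical sketch (not the actual PyTorch source): compares the old
// "sum of all operand element sizes" heuristic with the new
// "min of input and output element sizes" heuristic for a binary op
// such as torch.add on half dtypes.
#include <algorithm>
#include <cstdio>
#include <vector>

// Old heuristic: io_size = sum of the element sizes of all inputs and outputs.
int calc_io_size_sum(const std::vector<int>& input_sizes, int output_size) {
  int total = output_size;
  for (int s : input_sizes) total += s;
  return total;
}

// New heuristic (as described in the bullets above):
// io_size = min(smallest input element size, output element size).
int calc_io_size_min(const std::vector<int>& input_sizes, int output_size) {
  int in_min = *std::min_element(input_sizes.begin(), input_sizes.end());
  return std::min(in_min, output_size);
}

// Assumed mapping from io_size to elems_per_thread: a smaller io_size
// allows more elements per thread.
int elems_per_thread(int io_size) {
  if (io_size == 1) return 16;
  if (io_size < 4) return 8;
  return 4;
}

int main() {
  // torch.add on float16/bfloat16: two 2-byte inputs, one 2-byte output.
  std::vector<int> inputs = {2, 2};
  int output = 2;

  int old_io = calc_io_size_sum(inputs, output);   // 6
  int new_io = calc_io_size_min(inputs, output);   // 2
  printf("old io_size=%d -> elems_per_thread=%d\n", old_io, elems_per_thread(old_io));  // 6 -> 4
  printf("new io_size=%d -> elems_per_thread=%d\n", new_io, elems_per_thread(new_io));  // 2 -> 8
  return 0;
}
```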
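
A second hypothetical sketch of the vector-size choice: each thread targets a 16-byte load, which the compiler can lower to a dwordx4 instruction (four 32-bit words) on AMD GPUs, so 2-byte dtypes get a vector size of 8 and 1-byte dtypes get 16. `kTargetLoadBytes` and `preferred_vector_size` are illustrative names, not PyTorch API.

```cpp
// Hypothetical sketch: pick a vector size so each thread issues one
// 16-byte load (dwordx4 = 4 x 32-bit words) regardless of element width.
#include <cstdio>

constexpr int kTargetLoadBytes = 16;  // bytes moved per vectorized load

template <typename scalar_t>
constexpr int preferred_vector_size() {
  // 2-byte dtypes (float16/bfloat16) -> 8 elements; 1-byte dtypes -> 16.
  return kTargetLoadBytes / static_cast<int>(sizeof(scalar_t));
}

int main() {
  // short and char stand in for 16-bit and 8-bit tensor dtypes here.
  printf("16-bit dtype vector size: %d\n", preferred_vector_size<short>());  // 8
  printf("8-bit dtype vector size:  %d\n", preferred_vector_size<char>());   // 16
  return 0;
}
```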
Co-authored-by: @akadutta
Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily,
https://github.com/pruthvistony
Co-authored-by: Pruthvi Madugundu <[email protected]>
(cherry picked from commit 4686828)