Improved performance of reduction kernel with atomics
1. Contig implementation kernel gets a dedicated name
(easier to spot in the output of onetrace)
2. Increase work-group multiple
3. Change the order in which work-groups tile the array
from 'along reduction axis moves fastest' to
'along iteration axis moves fastest'.
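This last change contributes to a significant performance improvement: in the timings below, repeated calls to `dpt.sum(x, axis=0)` drop from about 312 ms to about 41 ms wall time.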
```
================= Before change
In [1]: import dpctl.tensor as dpt
In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))
In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 309 ms, sys: 128 ms, total: 437 ms
Wall time: 473 ms
In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 132 ms, sys: 160 ms, total: 292 ms
Wall time: 316 ms
In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 104 ms, sys: 185 ms, total: 289 ms
Wall time: 312 ms
```
```
===== After change
In [1]: import dpctl.tensor as dpt
In [2]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))
In [3]: %time y = dpt.sum(x, axis=0)
CPU times: user 150 ms, sys: 32.9 ms, total: 183 ms
Wall time: 198 ms
In [4]: %time y = dpt.sum(x, axis=0)
CPU times: user 20 ms, sys: 22.7 ms, total: 42.7 ms
Wall time: 49.4 ms
In [5]: %time y = dpt.sum(x, axis=0)
CPU times: user 10.2 ms, sys: 28.9 ms, total: 39.1 ms
Wall time: 41.4 ms
In [6]: %time y = dpt.sum(x, axis=0)
CPU times: user 23 ms, sys: 18 ms, total: 41 ms
Wall time: 43.5 ms
```
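The tiling-order change in item 3 can be pictured with a small index-mapping sketch. This is not the kernel code from the commit: the function names and the 2-by-3 grid of tiles are hypothetical, chosen only to show how a flat work-group id could map to a pair of tile indices under each ordering.

```python
# Illustrative only: hypothetical helpers, not taken from the dpctl sources.

def tile_reduction_fastest(group_id, n_reduction_groups):
    """Old order: consecutive group ids step along the reduction axis first."""
    iter_tile, reduction_tile = divmod(group_id, n_reduction_groups)
    return iter_tile, reduction_tile


def tile_iteration_fastest(group_id, n_iter_groups):
    """New order: consecutive group ids step along the iteration axis first."""
    reduction_tile, iter_tile = divmod(group_id, n_iter_groups)
    return iter_tile, reduction_tile


# Assumed toy grid: 2 tiles along the iteration axis, 3 along the reduction axis.
n_iter_groups, n_reduction_groups = 2, 3
n_groups = n_iter_groups * n_reduction_groups

old_order = [tile_reduction_fastest(g, n_reduction_groups) for g in range(n_groups)]
new_order = [tile_iteration_fastest(g, n_iter_groups) for g in range(n_groups)]

print("reduction axis fastest:", old_order)
print("iteration axis fastest:", new_order)
```

Both orderings visit the same set of (iteration tile, reduction tile) pairs. Only the sequence differs: in the new order, neighbouring group ids land on different iteration tiles before advancing along the reduction axis.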