Boolean reduction performance improvements #1401

Merged 3 commits on Sep 18, 2023

@ndgrigorian (Collaborator) commented Sep 17, 2023

This PR changes the boolean reductions to align with #1364.

Namely, the traversal pattern of work groups in boolean reductions has been changed to be fastest over the iteration dimension, rather than the reduction dimension, and a specialized kernel for reductions over axis 0 in matrices has been added.
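The effect of the traversal order can be illustrated with a small model. This is only a sketch in plain Python, not the SYCL kernel from this PR: the helper names `offsets` and `all_axis0` are hypothetical, and the model assumes a C-contiguous matrix of shape `(n_red, n_iter)` reduced over axis 0, so element `(r, c)` sits at flat offset `r * n_iter + c`.

```python
def offsets(n_red, n_iter, iteration_fastest):
    """Flat memory offsets touched by consecutive work ids (toy model).

    Assumes a C-contiguous (n_red, n_iter) matrix reduced over axis 0,
    where element (r, c) lives at offset r * n_iter + c.
    """
    out = []
    for gid in range(n_red * n_iter):
        if iteration_fastest:
            # Column (iteration) index varies fastest between neighbours.
            r, c = divmod(gid, n_iter)
        else:
            # Row (reduction) index varies fastest between neighbours.
            c, r = divmod(gid, n_red)
        out.append(r * n_iter + c)
    return out


def all_axis0(matrix):
    """Reference semantics of an `all` reduction over axis 0 (list of rows)."""
    result = [True] * len(matrix[0])
    for row in matrix:                    # reduction axis in the outer loop
        for c, value in enumerate(row):   # adjacent columns, adjacent memory
            result[c] = result[c] and bool(value)
    return result


if __name__ == "__main__":
    print(offsets(3, 4, iteration_fastest=False)[:6])
    print(offsets(3, 4, iteration_fastest=True)[:6])
    print(all_axis0([[1.0, 0.5, 0.0], [2.0, 0.0, 0.0]]))
```

In this model, reduction-fastest ordering makes neighbouring ids jump by `n_iter` elements, while iteration-fastest ordering makes them read adjacent elements, which is the access pattern an axis-0 reduction over a row-major matrix favours.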

The original contiguous boolean reduction kernel has also been renamed to make the difference more apparent.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?

Similar to changes in sum, now traverses the iteration dimension the fastest
- Aligns with similar changes to sum
@github-actions

Array API standard conformance tests for dpctl=0.14.6dev5=py310ha25a700_4 ran successfully.
Passed: 916
Failed: 84
Skipped: 119

@ndgrigorian
Collaborator Author

Using the same example as in #1364, the performance benefits are clear:

Before:

In [5]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(\
   ...:                            dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [6]: %time y = dpt.all(x, axis=0)
CPU times: user 481 ms, sys: 349 ms, total: 830 ms
Wall time: 763 ms

In [7]: %time y = dpt.all(x, axis=0)
CPU times: user 232 ms, sys: 325 ms, total: 556 ms
Wall time: 601 ms

In [8]: %time y = dpt.all(x, axis=0)
CPU times: user 316 ms, sys: 235 ms, total: 551 ms
Wall time: 599 ms

In [9]: %time y = dpt.any(x, axis=0)
CPU times: user 454 ms, sys: 261 ms, total: 715 ms
Wall time: 774 ms

In [10]: %time y = dpt.any(x, axis=0)
CPU times: user 284 ms, sys: 308 ms, total: 592 ms
Wall time: 639 ms

In [11]: %time y = dpt.any(x, axis=0)
CPU times: user 280 ms, sys: 325 ms, total: 605 ms
Wall time: 654 ms

After:

In [3]: x = dpt.reshape(dpt.asarray(1, dtype="f4")/dpt.square(\
   ...:                            dpt.arange(1, 1282200*128 + 1, dtype="f4")), (1282200, 128))

In [4]: %time y = dpt.all(x, axis=0)
CPU times: user 210 ms, sys: 17.3 ms, total: 227 ms
Wall time: 198 ms

In [5]: %time y = dpt.all(x, axis=0)
CPU times: user 8.63 ms, sys: 36.6 ms, total: 45.3 ms
Wall time: 50.3 ms

In [6]: %time y = dpt.all(x, axis=0)
CPU times: user 15 ms, sys: 35.2 ms, total: 50.3 ms
Wall time: 51.1 ms

In [7]: %time y = dpt.any(x, axis=0)
CPU times: user 81 ms, sys: 19.2 ms, total: 100 ms
Wall time: 108 ms

In [8]: %time y = dpt.any(x, axis=0)
CPU times: user 18.6 ms, sys: 25.4 ms, total: 44 ms
Wall time: 46.6 ms

In [9]: %time y = dpt.any(x, axis=0)
CPU times: user 7.31 ms, sys: 35.7 ms, total: 43 ms
Wall time: 45.5 ms
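To put a number on the improvement, the wall times reported above can be compared directly. The sketch below is only arithmetic on those figures; treating the first call of each function as a warm-up run and excluding it is an assumption of this sketch, and the `speedup` helper is hypothetical.

```python
from statistics import mean

# Wall times in milliseconds, copied from the sessions above.
# Assumption: the first call of each function is a warm-up and is excluded.
before = {"all": [601, 599], "any": [639, 654]}
after = {"all": [50.3, 51.1], "any": [46.6, 45.5]}


def speedup(name):
    """Ratio of mean wall time before the change to mean wall time after."""
    return mean(before[name]) / mean(after[name])


if __name__ == "__main__":
    for name in ("all", "any"):
        print(f"dpt.{name}(x, axis=0): {speedup(name):.1f}x faster")
```

On these numbers the ratio comes out at roughly 12x for `dpt.all` and 14x for `dpt.any`; they come from a single machine and a handful of runs, so they are indicative rather than a rigorous benchmark.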

@oleksandr-pavlyk
Contributor

Please add "numpy<1.26" restriction to pip install commands in "generate-coverage" and "sycl-nightly" workflows. It looks good to go in, but I'd prefer a green CI.

@coveralls
Collaborator

Coverage: 85.774% (remained the same) when pulling 351232a on boolean-reduction-performance into 83fff33 on master.

@ndgrigorian
Collaborator Author

> Please add "numpy<1.26" restriction to pip install commands in "generate-coverage" and "sycl-nightly" workflows. It looks good to go in, but I'd prefer a green CI.

It's been added and fixes the CI. I'll look into properly solving the problem in a separate PR.

@github-actions

Array API standard conformance tests for dpctl=0.14.6dev5=py310ha25a700_5 ran successfully.
Passed: 916
Failed: 84
Skipped: 119

@ndgrigorian ndgrigorian merged commit b32fc71 into master Sep 18, 2023
@ndgrigorian ndgrigorian deleted the boolean-reduction-performance branch September 20, 2023 07:58