Speedup Python implementation of Blockwise #1391
Conversation
It seems to add a non-trivial 100ns
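The context for this comment is not preserved here, but it most likely refers to the per-call cost of `zip(..., strict=True)` that this PR drops from hot loops. A quick, illustrative way to measure that kind of nanosecond-scale difference (an assumption-laden sketch, not a benchmark from the PR):

```python
import timeit

a = list(range(10))
b = list(range(10))

n = 1_000_000
# Compare consuming a plain zip vs. zip(strict=True) in a tight loop (Python 3.10+).
t_plain = timeit.timeit(lambda: list(zip(a, b)), number=n)
t_strict = timeit.timeit(lambda: list(zip(a, b, strict=True)), number=n)

print(f"plain zip:  {t_plain / n * 1e9:.0f} ns per call")
print(f"strict zip: {t_strict / n * 1e9:.0f} ns per call")
```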
Pull Request Overview
This PR aims to improve the speed of the Python implementation of Blockwise by replacing calls to high-overhead NumPy helpers (e.g. np.vectorize, np.broadcast_to, np.ndindex) with lighter-weight counterparts (faster_broadcast_to and faster_ndindex).
- Replaces np.broadcast_to and np.ndindex with the faster_* equivalents in several random variable and tensor functions (a sketch of such helpers follows this list)
- Removes the strict argument from zip calls in hot loops and updates the inline comments accordingly
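For context, a minimal sketch of what such stripped-down helpers could look like. This is illustrative only: the helper names come from the PR, but the bodies below are assumptions that simply skip NumPy's per-call validation, not the PR's actual code.

```python
from itertools import product

import numpy as np


def faster_broadcast_to(x, shape):
    # Same result as np.broadcast_to for valid inputs, but skips its input
    # validation: dimensions being broadcast simply get stride 0.
    x = np.asarray(x)
    strides = [0] * (len(shape) - x.ndim) + [
        0 if dim == 1 else st for dim, st in zip(x.shape, x.strides)
    ]
    return np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)


def faster_ndindex(shape):
    # np.ndindex builds an nditer internally; a plain itertools.product over
    # ranges yields the same index tuples with less per-call overhead.
    return product(*map(range, shape))
```

Both return the same results as their NumPy counterparts for well-formed inputs; the saving comes purely from dropping error checking and object construction on every call.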
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pytensor/tensor/random/basic.py | Updated the Dirichlet and Multinomial random variable functions to use faster_broadcast_to/faster_ndindex instead of np.broadcast_to/np.ndindex |
| pytensor/tensor/elemwise.py | Removed the strict argument from zip calls and updated the inline comments accordingly |
| pytensor/tensor/blockwise.py | Replaced the underlying vectorize implementation in Blockwise with a custom one using the faster_* routines and added type hints for clarity |
Comments suppressed due to low confidence (2)
pytensor/tensor/elemwise.py:715
- [nitpick] Since the zip calls have been updated to remove the strict argument, confirm that all client code is compatible with the standard zip behavior and update any related comments for clarity.
for i, (variable, storage, nout) in enumerate(zip(variables, output_storage, node.outputs)):
pytensor/tensor/random/basic.py:980
- Ensure that the new 'faster_broadcast_to' and 'faster_ndindex' functions handle all edge cases (e.g. empty shapes, high-dimensional inputs) as robustly as their NumPy counterparts.
if size is not None:
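The suppressed comment above asks about edge cases. A few illustrative checks one might run (a hypothetical test sketch, assuming the helpers are meant to mirror NumPy semantics exactly; the function and its arguments are not part of the PR):

```python
import numpy as np


def check_matches_numpy(faster_broadcast_to, faster_ndindex):
    # Zero-sized and empty shapes should behave like the NumPy originals.
    x = np.zeros(())
    assert faster_broadcast_to(x, (0, 3)).shape == np.broadcast_to(x, (0, 3)).shape
    assert list(faster_ndindex(())) == list(np.ndindex(()))
    assert list(faster_ndindex((0, 2))) == list(np.ndindex((0, 2)))

    # High-dimensional broadcast should produce the same values as NumPy.
    y = np.arange(3)
    np.testing.assert_array_equal(
        faster_broadcast_to(y, (2, 1, 4, 3)), np.broadcast_to(y, (2, 1, 4, 3))
    )
```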
Codecov Report. Attention: Patch coverage is 95.34%.

❌ Your patch status has failed because the patch coverage (95.34%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff            @@
##              main    #1391    +/-   ##
=========================================
  Coverage    82.02%   82.03%
=========================================
  Files          207      207
  Lines        49294    49343      +49
  Branches      8746     8754       +8
=========================================
+ Hits         40433    40478      +45
- Misses        6695     6698       +3
- Partials      2166     2167       +1
pytensor/tensor/blockwise.py
Outdated
storage_map: StorageMapType,
compute_map: ComputeMapType,
no_recycling: list[Variable],
impl: str | None = None,"""
What's the story with this docstring?
AI shit that I didn't notice got auto-completed in
For reference, why do we care about these microseconds: the logp+dlogp of the 8-school model that nutpie uses as the example takes ~18 µs to evaluate on the C backend on my laptop, and accounts for ~0.7 seconds of the sampling time on the PyMC NUTS sampler over ~36k evals, out of a total sampling time of 1.5-2 seconds. Every 1 µs saved is ~5% of the logp+dlogp function and ~2% of the total sampling time. For small models / functions it would be nice not to create unnecessary overhead; speed is a feature for PyMC. For large models / functions (e.g., with large matmuls that put the eval on the order of ms) these hacks don't matter at all, of course.

Looking at the big picture, this PR doesn't just save a couple of microseconds. I forgot to post the results of the new benchmark, but a small Blockwise of Conv1d with 7 batch dimensions and a valid conv of 128:20 is now twice as fast:

Before:
After:
A function with a single conv (non-blockwised) of the same size is:
The overhead is still not negligible, but it is now half of what it was.
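For the record, the back-of-envelope arithmetic in the comment above works out (using only the figures quoted there; the 1.75 s total is an assumed midpoint of the quoted 1.5-2 s range):

```python
eval_time = 18e-6      # logp+dlogp evaluation on the C backend, ~18 µs
n_evals = 36_000       # ~36k evaluations during sampling
sampling_time = 1.75   # assumed midpoint of the quoted 1.5-2 s total

print(n_evals * eval_time)               # ≈ 0.65 s spent in logp+dlogp
print(1e-6 / eval_time)                  # 1 µs saved ≈ 5-6% of one evaluation
print(n_evals * 1e-6 / sampling_time)    # ≈ 2% of the total sampling time
```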
numpy.vectorize has a lot of overhead, which is problematic for Blockwise because we don't have a C implementation. This PR strips it down to the bare minimum we need, to reduce the Python overhead as much as possible.
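For intuition, here is a minimal sketch of what a stripped-down, gufunc-style loop can look like. This is a simplified illustration under assumptions, not the code in this PR: it handles only a single output, no excluded arguments, and inputs whose batch dimensions actually broadcast.

```python
from itertools import product

import numpy as np


def naive_blockwise_call(core_fn, inputs, core_ndims):
    """Apply core_fn over the batch (leading) dimensions of `inputs`.

    core_ndims[i] is the number of trailing "core" dimensions of inputs[i].
    """
    # Broadcast the batch dimensions of all inputs together.
    batch_shapes = [x.shape[: x.ndim - nd] for x, nd in zip(inputs, core_ndims)]
    batch_shape = np.broadcast_shapes(*batch_shapes)
    bcast = [
        np.broadcast_to(x, batch_shape + x.shape[x.ndim - nd :])
        for x, nd in zip(inputs, core_ndims)
    ]

    # Iterate over batch indices with itertools.product instead of np.ndindex,
    # calling the core function once per batch element and filling a
    # preallocated output.
    out = None
    for idx in product(*map(range, batch_shape)):
        res = np.asarray(core_fn(*(x[idx] for x in bcast)))
        if out is None:
            out = np.empty(batch_shape + res.shape, dtype=res.dtype)
        out[idx] = res
    return out
```

For example, a batched 1-d valid convolution like the one benchmarked above could be exercised with core_fn=lambda a, b: np.convolve(a, b, mode="valid") and core_ndims=(1, 1).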
📚 Documentation preview 📚: https://pytensor--1391.org.readthedocs.build/en/1391/