
Speedup Python implementation of Blockwise #1391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation


@ricardoV94 ricardoV94 commented May 7, 2025

numpy.vectorize has a lot of overhead which is problematic for Blockwise because we don't have a C implementation.

This PR strips the vectorization logic down to the bare minimum we need, to reduce Python overhead as much as possible.
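As a rough illustration of the approach (a minimal sketch with hypothetical names, not the PR's actual code): broadcast the batch dimensions once up front and loop over batch indices directly, rather than paying np.vectorize's per-call signature parsing and broadcasting:

```python
import numpy as np


# Minimal sketch (hypothetical helper, not the PR's actual code): do once
# what np.vectorize redoes on every call, then run a plain Python loop.
def naive_blockwise(core_func, inputs, batch_shape, core_ndims):
    # Broadcast each input's leading (batch) dims to the shared batch shape.
    # batch_shape is a tuple; core_ndims gives each input's core rank.
    bcast = [
        np.broadcast_to(x, batch_shape + x.shape[x.ndim - nd :])
        for x, nd in zip(inputs, core_ndims)
    ]
    out = None
    for idx in np.ndindex(*batch_shape):
        res = core_func(*(x[idx] for x in bcast))
        if out is None:
            # Allocate the output lazily, from the first core result.
            out = np.empty(batch_shape + res.shape, dtype=res.dtype)
        out[idx] = res
    return out
```

For the batched Conv1d case benchmarked further down, this would be called with batch_shape=(7,) and core_ndims=(1, 1).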


📚 Documentation preview 📚: https://pytensor--1391.org.readthedocs.build/en/1391/

@ricardoV94 ricardoV94 changed the title from Speedup python blockwise to Speedup Python implementation of Blockwise on May 7, 2025
@ricardoV94 ricardoV94 force-pushed the speedup_python_impl_of_blockwise branch from 91976ee to 086c717 on May 7, 2025 19:52
It seems to add a non-trivial 100ns
@ricardoV94 ricardoV94 force-pushed the speedup_python_impl_of_blockwise branch 2 times, most recently from 8ed3400 to 9292030 on May 8, 2025 14:48
@ricardoV94 ricardoV94 requested a review from Copilot May 8, 2025 14:59

@Copilot Copilot AI left a comment


Pull Request Overview

This PR aims to improve the speed of the Blockwise implementation by replacing high-overhead NumPy utilities (np.vectorize, np.broadcast_to, np.ndindex) with a stripped-down custom vectorize and the leaner faster_broadcast_to and faster_ndindex helpers.

  • Replaces np.broadcast_to and np.ndindex with their faster_* equivalents in several random variable and tensor functions (a sketch of such helpers follows this list)
  • Removes the strict argument from zip calls in hot loops and updates the inline comments accordingly
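A plausible sketch of those helpers, assuming the speedup comes from stripping NumPy's per-call setup and validation (the exact implementations live in pytensor/tensor/blockwise.py and may differ):

```python
from itertools import product

from numpy.lib.stride_tricks import _broadcast_to  # private NumPy helper


def faster_ndindex(shape):
    # Yields the same index tuples as np.ndindex, but via itertools.product
    # over plain ranges, skipping np.ndindex's nditer setup on every call.
    # Unlike np.ndindex, this expects a single shape sequence.
    return product(*map(range, shape))


def faster_broadcast_to(x, shape):
    # Assumption: skip the argument normalization the public np.broadcast_to
    # repeats on every call by going straight to the core routine.
    return _broadcast_to(x, shape, subok=False, readonly=False)
```

Relying on a private NumPy function is fragile across NumPy versions; that trade-off only makes sense in a hot inner loop like this one.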

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.

  • pytensor/tensor/random/basic.py: Updated the Dirichlet and Multinomial random variable functions to use faster_broadcast_to/faster_ndindex instead of np.broadcast_to/np.ndindex
  • pytensor/tensor/elemwise.py: Removed the strict argument from zip calls and updated the inline comments accordingly
  • pytensor/tensor/blockwise.py: Replaced the underlying vectorize implementation in Blockwise with a custom one using the faster_* routines, and added type hints for clarity
Comments suppressed due to low confidence (2)

pytensor/tensor/elemwise.py:715

  • [nitpick] Since the zip calls have been updated to remove the strict argument, confirm that all client code is compatible with the standard zip behavior and update any related comments for clarity.
for i, (variable, storage, nout) in enumerate(zip(variables, output_storage, node.outputs)):
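The commit message in the timeline above ("It seems to add a non-trivial 100ns") plausibly refers to this: strict=True makes every zip call do extra work, which adds up in hot loops. An illustrative way to measure the gap on your own machine:

```python
import timeit

a = b = list(range(4))
plain = timeit.timeit(lambda: list(zip(a, b)), number=1_000_000)
strict = timeit.timeit(lambda: list(zip(a, b, strict=True)), number=1_000_000)
# Total seconds for a million calls each; the per-call difference is the
# overhead that removing strict from the hot loops avoids.
print(f"plain: {plain:.3f}s, strict: {strict:.3f}s")
```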

pytensor/tensor/random/basic.py:980

  • Ensure that the new 'faster_broadcast_to' and 'faster_ndindex' functions handle all edge cases (e.g. empty shapes, high-dimensional inputs) as robustly as their NumPy counterparts.
if size is not None:
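Copilot's edge-case concern is easy to spot-check. Assuming faster_ndindex is the product-of-ranges sketch above (an assumption), it agrees with np.ndindex on empty and zero-length shapes:

```python
from itertools import product

import numpy as np


def faster_ndindex(shape):  # assumed implementation, as sketched above
    return product(*map(range, shape))


# Empty shape: exactly one scalar index, matching np.ndindex.
assert list(faster_ndindex(())) == list(np.ndindex()) == [()]
# Zero-length dimension: no indices at all, matching np.ndindex.
assert list(faster_ndindex((0, 2))) == list(np.ndindex(0, 2)) == []
```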


codecov bot commented May 8, 2025

Codecov Report

Attention: Patch coverage is 95.34884% with 6 lines in your changes missing coverage. Please review.

Project coverage is 82.03%. Comparing base (afe934b) to head (e5e090c).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
pytensor/tensor/blockwise.py 95.58% 2 Missing and 1 partial ⚠️
pytensor/link/c/op.py 75.00% 1 Missing and 1 partial ⚠️
pytensor/link/basic.py 66.66% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (95.34%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #1391   +/-   ##
=======================================
  Coverage   82.02%   82.03%           
=======================================
  Files         207      207           
  Lines       49294    49343   +49     
  Branches     8746     8754    +8     
=======================================
+ Hits        40433    40478   +45     
- Misses       6695     6698    +3     
- Partials     2166     2167    +1     
Files with missing lines Coverage Δ
pytensor/compile/builders.py 88.62% <100.00%> (-0.04%) ⬇️
pytensor/compile/function/types.py 80.71% <100.00%> (ø)
pytensor/graph/op.py 88.32% <100.00%> (+0.24%) ⬆️
pytensor/ifelse.py 52.27% <100.00%> (ø)
pytensor/link/c/basic.py 87.72% <100.00%> (ø)
pytensor/link/numba/dispatch/basic.py 80.38% <ø> (ø)
pytensor/link/numba/dispatch/cython_support.py 88.48% <100.00%> (ø)
pytensor/link/numba/dispatch/extra_ops.py 95.49% <100.00%> (ø)
pytensor/link/numba/dispatch/slinalg.py 69.76% <100.00%> (ø)
pytensor/link/numba/dispatch/subtensor.py 95.58% <100.00%> (ø)
... and 12 more

... and 1 file with indirect coverage changes


@ricardoV94 ricardoV94 requested a review from jessegrabowski May 8, 2025 16:24
storage_map: StorageMapType,
compute_map: ComputeMapType,
no_recycling: list[Variable],
impl: str | None = None,"""
Member

What's the story with this docstring?

Member Author

AI shit that I didn't notice got auto-completed in
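For context, the stray `"""` fused onto the `impl` line above is the auto-completed fragment in question. Judging by the visible parameters, which match Op.make_thunk, the signature presumably ends cleanly like this (class name inferred from the touched file pytensor/link/c/op.py; a sketch, not the exact diff):

```python
from __future__ import annotations  # defer evaluation of the annotations


class COp:  # assumed class; the type aliases come from pytensor's graph modules
    def make_thunk(
        self,
        node: Apply,
        storage_map: StorageMapType,
        compute_map: ComputeMapType,
        no_recycling: list[Variable],
        impl: str | None = None,
    ):
        ...
```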


ricardoV94 commented May 9, 2025

For reference, why we care about these microseconds: the logp+dlogp of the 8-schools model that nutpie uses as its example takes ~18µs to evaluate on the C backend on my laptop, and accounts for ~0.7 seconds of the sampling time on the PyMC NUTS sampler over ~36k evals, out of a total sampling time of 1.5-2 seconds. Every 1µs saved is therefore ~5% of the logp+dlogp function and ~2% of the total sampling time (36k evals × 1µs ≈ 36ms).

For small models/functions it would be nice not to create unnecessary overhead; speed is a feature for PyMC. For large models/functions (e.g., with large matmuls that put the eval on the order of ms), these hacks don't matter at all, of course.

Looking at the bigger picture, this PR doesn't just shave off a couple of µs. I forgot to post the results of the new benchmark: a small Blockwise of Conv1d with 7 batch dimensions and a valid convolution of 128:20 is now twice as fast:

Before:

----------------------------------------------------------- benchmark: 1 tests -----------------------------------------------------------
Name (time in us)                         Min         Max      Mean   StdDev    Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------
test_small_blockwise_performance     126.1360  1,166.1640  135.0608  19.6666  129.8830  9.5125   175;209        7.4041    3820           1
------------------------------------------------------------------------------------------------------------------------------------------

After:

-------------------------------------------------------- benchmark: 1 tests --------------------------------------------------------
Name (time in us)                        Min       Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------
test_small_blockwise_performance     61.1740  230.8430  64.8810  5.2530  63.5090  0.9820  698;1097       15.4128    7011           1
------------------------------------------------------------------------------------------------------------------------------------

A function with a single conv (non-blockwised) of the same size is:

------------------------------------------------------ benchmark: 1 tests ------------------------------------------------------
Name (time in us)                       Min      Max    Mean  StdDev  Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------
test_small_blockwise_performance     6.7920  45.0040  7.4829  1.5829  7.0940  0.1590  901;3342      133.6374   20606           1
--------------------------------------------------------------------------------------------------------------------------------

The overhead is still not negligible, but it is half what it was.
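For reference, a sketch of how such a pytest-benchmark test could look (hypothetical body; the actual test added in the PR may build the batched convolution differently):

```python
import numpy as np
import pytensor
import pytensor.tensor as pt
from pytensor.tensor.signal import convolve1d


def test_small_blockwise_performance(benchmark):
    # Batched valid Conv1d: 7 batch entries, signal length 128, kernel 20.
    x = pt.tensor("x", shape=(7, 128))
    y = pt.tensor("y", shape=(7, 20))
    out = convolve1d(x, y, mode="valid")
    fn = pytensor.function([x, y], out, trust_input=True)

    rng = np.random.default_rng(0)
    x_val = rng.normal(size=(7, 128))
    y_val = rng.normal(size=(7, 20))
    benchmark(fn, x_val, y_val)
```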

@ricardoV94 ricardoV94 force-pushed the speedup_python_impl_of_blockwise branch from 9292030 to b8164a9 on May 9, 2025 10:20
@ricardoV94 ricardoV94 requested a review from jessegrabowski May 9, 2025 10:25
@ricardoV94 ricardoV94 force-pushed the speedup_python_impl_of_blockwise branch from b8164a9 to e5e090c on May 9, 2025 13:12
@ricardoV94 ricardoV94 merged commit 5fa5c9b into pymc-devs:main May 9, 2025
72 of 73 checks passed