
Speedup Python implementation of Blockwise #1391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation


@ricardoV94 ricardoV94 commented May 7, 2025

numpy.vectorize has a lot of overhead which is problematic for Blockwise because we don't have a C implementation.

This PR strips the vectorization logic down to the bare minimum we need, to reduce Python overhead as much as possible.
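As a rough illustration of the approach (a minimal sketch with hypothetical names, not the PR's actual code): broadcast the batch dimensions once up front and loop over batch indices directly, rather than paying np.vectorize's per-call signature parsing and broadcasting:

```python
import numpy as np


# Minimal sketch (hypothetical helper, not the PR's actual code): do once
# what np.vectorize redoes on every call, then run a plain Python loop.
def naive_blockwise(core_func, inputs, batch_shape, core_ndims):
    # Broadcast each input's leading (batch) dims to the shared batch shape.
    # batch_shape is a tuple; core_ndims gives each input's core rank.
    bcast = [
        np.broadcast_to(x, batch_shape + x.shape[x.ndim - nd :])
        for x, nd in zip(inputs, core_ndims)
    ]
    out = None
    for idx in np.ndindex(*batch_shape):
        res = core_func(*(x[idx] for x in bcast))
        if out is None:
            # Allocate the output lazily, from the first core result.
            out = np.empty(batch_shape + res.shape, dtype=res.dtype)
        out[idx] = res
    return out
```

For the batched Conv1d case benchmarked further down, this would be called with batch_shape=(7,) and core_ndims=(1, 1).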


📚 Documentation preview 📚: https://pytensor--1391.org.readthedocs.build/en/1391/

@ricardoV94 ricardoV94 changed the title from Speedup python blockwise to Speedup Python implementation of Blockwise on May 7, 2025
@ricardoV94 ricardoV94 force-pushed the speedup_python_impl_of_blockwise branch from 91976ee to 086c717 on May 7, 2025 19:52
It seems to add a non-trivial 100ns
@ricardoV94 ricardoV94 force-pushed the speedup_python_impl_of_blockwise branch 2 times, most recently from 8ed3400 to 9292030 on May 8, 2025 14:48
@ricardoV94 ricardoV94 requested a review from Copilot May 8, 2025 14:59

@Copilot Copilot AI left a comment


Pull Request Overview

This PR aims to improve the speed of the Blockwise implementation by replacing high-overhead NumPy utilities (np.vectorize, np.broadcast_to, np.ndindex) with a stripped-down custom vectorize and the leaner faster_broadcast_to and faster_ndindex helpers.

  • Replaces np.broadcast_to and np.ndindex with their faster_* equivalents in several random variable and tensor functions (a sketch of such helpers follows this list)
  • Removes the strict argument from zip calls in hot loops and updates the inline comments accordingly
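A plausible sketch of those helpers, assuming the speedup comes from stripping NumPy's per-call setup and validation (the exact implementations live in pytensor/tensor/blockwise.py and may differ):

```python
from itertools import product

from numpy.lib.stride_tricks import _broadcast_to  # private NumPy helper


def faster_ndindex(shape):
    # Yields the same index tuples as np.ndindex, but via itertools.product
    # over plain ranges, skipping np.ndindex's nditer setup on every call.
    # Unlike np.ndindex, this expects a single shape sequence.
    return product(*map(range, shape))


def faster_broadcast_to(x, shape):
    # Assumption: skip the argument normalization the public np.broadcast_to
    # repeats on every call by going straight to the core routine.
    return _broadcast_to(x, shape, subok=False, readonly=False)
```

Relying on a private NumPy function is fragile across NumPy versions; that trade-off only makes sense in a hot inner loop like this one.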

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated no comments.

  • pytensor/tensor/random/basic.py: Updated the Dirichlet and Multinomial random variable functions to use faster_broadcast_to/faster_ndindex instead of np.broadcast_to/np.ndindex
  • pytensor/tensor/elemwise.py: Removed the strict argument from zip calls and updated the inline comments accordingly
  • pytensor/tensor/blockwise.py: Replaced the underlying vectorize implementation in Blockwise with a custom one using the faster_* routines, and added type hints for clarity
Comments suppressed due to low confidence (2)

pytensor/tensor/elemwise.py:715

  • [nitpick] Since the zip calls have been updated to remove the strict argument, confirm that all client code is compatible with the standard zip behavior and update any related comments for clarity.
for i, (variable, storage, nout) in enumerate(zip(variables, output_storage, node.outputs)):
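The commit message in the timeline above ("It seems to add a non-trivial 100ns") plausibly refers to this: strict=True makes every zip call do extra work, which adds up in hot loops. An illustrative way to measure the gap on your own machine:

```python
import timeit

a = b = list(range(4))
plain = timeit.timeit(lambda: list(zip(a, b)), number=1_000_000)
strict = timeit.timeit(lambda: list(zip(a, b, strict=True)), number=1_000_000)
# Total seconds for a million calls each; the per-call difference is the
# overhead that removing strict from the hot loops avoids.
print(f"plain: {plain:.3f}s, strict: {strict:.3f}s")
```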

pytensor/tensor/random/basic.py:980

  • Ensure that the new 'faster_broadcast_to' and 'faster_ndindex' functions handle all edge cases (e.g. empty shapes, high-dimensional inputs) as robustly as their NumPy counterparts.
if size is not None:
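Copilot's edge-case concern is easy to spot-check. Assuming faster_ndindex is the product-of-ranges sketch above (an assumption), it agrees with np.ndindex on empty and zero-length shapes:

```python
from itertools import product

import numpy as np


def faster_ndindex(shape):  # assumed implementation, as sketched above
    return product(*map(range, shape))


# Empty shape: exactly one scalar index, matching np.ndindex.
assert list(faster_ndindex(())) == list(np.ndindex()) == [()]
# Zero-length dimension: no indices at all, matching np.ndindex.
assert list(faster_ndindex((0, 2))) == list(np.ndindex(0, 2)) == []
```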


codecov bot commented May 8, 2025

Codecov Report

Attention: Patch coverage is 95.34884% with 6 lines in your changes missing coverage. Please review.

Project coverage is 82.03%. Comparing base (afe934b) to head (e5e090c).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
pytensor/tensor/blockwise.py 95.58% 2 Missing and 1 partial ⚠️
pytensor/link/c/op.py 75.00% 1 Missing and 1 partial ⚠️
pytensor/link/basic.py 66.66% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (95.34%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #1391   +/-   ##
=======================================
  Coverage   82.02%   82.03%           
=======================================
  Files         207      207           
  Lines       49294    49343   +49     
  Branches     8746     8754    +8     
=======================================
+ Hits        40433    40478   +45     
- Misses       6695     6698    +3     
- Partials     2166     2167    +1     
Files with missing lines Coverage Δ
pytensor/compile/builders.py 88.62% <100.00%> (-0.04%) ⬇️
pytensor/compile/function/types.py 80.71% <100.00%> (ø)
pytensor/graph/op.py 88.32% <100.00%> (+0.24%) ⬆️
pytensor/ifelse.py 52.27% <100.00%> (ø)
pytensor/link/c/basic.py 87.72% <100.00%> (ø)
pytensor/link/numba/dispatch/basic.py 80.38% <ø> (ø)
pytensor/link/numba/dispatch/cython_support.py 88.48% <100.00%> (ø)
pytensor/link/numba/dispatch/extra_ops.py 95.49% <100.00%> (ø)
pytensor/link/numba/dispatch/slinalg.py 69.76% <100.00%> (ø)
pytensor/link/numba/dispatch/subtensor.py 95.58% <100.00%> (ø)
... and 12 more

... and 1 file with indirect coverage changes


@ricardoV94 ricardoV94 requested a review from jessegrabowski May 8, 2025 16:24
storage_map: StorageMapType,
compute_map: ComputeMapType,
no_recycling: list[Variable],
impl: str | None = None,"""
Member

What's the story with this docstring?

Member Author

AI shit that I didn't notice got auto-completed in
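For context, the stray `"""` fused onto the `impl` line above is the auto-completed fragment in question. Judging by the visible parameters, which match Op.make_thunk, the signature presumably ends cleanly like this (class name inferred from the touched file pytensor/link/c/op.py; a sketch, not the exact diff):

```python
from __future__ import annotations  # defer evaluation of the annotations


class COp:  # assumed class; the type aliases come from pytensor's graph modules
    def make_thunk(
        self,
        node: Apply,
        storage_map: StorageMapType,
        compute_map: ComputeMapType,
        no_recycling: list[Variable],
        impl: str | None = None,
    ):
        ...
```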


ricardoV94 commented May 9, 2025

For reference, why we care about these microseconds: the logp+dlogp of the 8-schools model that nutpie uses as its example takes ~18µs to evaluate on the C backend on my laptop, and accounts for ~0.7 seconds of the sampling time on the PyMC NUTS sampler over ~36k evals, out of a total sampling time of 1.5-2 seconds. Every 1µs saved is therefore ~5% of the logp+dlogp function and ~2% of the total sampling time (36k evals × 1µs ≈ 36ms).

For small models/functions it would be nice not to create unnecessary overhead; speed is a feature for PyMC. For large models/functions (e.g., with large matmuls that put the eval on the order of ms), these hacks don't matter at all, of course.

Looking at the bigger picture, this PR doesn't just shave off a couple of µs. I forgot to post the results of the new benchmark: a small Blockwise of Conv1d with 7 batch dimensions and a valid convolution of 128:20 is now twice as fast:

Before:

----------------------------------------------------------- benchmark: 1 tests -----------------------------------------------------------
Name (time in us)                         Min         Max      Mean   StdDev    Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------
test_small_blockwise_performance     126.1360  1,166.1640  135.0608  19.6666  129.8830  9.5125   175;209        7.4041    3820           1
------------------------------------------------------------------------------------------------------------------------------------------

After:

-------------------------------------------------------- benchmark: 1 tests --------------------------------------------------------
Name (time in us)                        Min       Max     Mean  StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------
test_small_blockwise_performance     61.1740  230.8430  64.8810  5.2530  63.5090  0.9820  698;1097       15.4128    7011           1
------------------------------------------------------------------------------------------------------------------------------------

A function with a single conv (non-blockwised) of the same size is:

------------------------------------------------------ benchmark: 1 tests ------------------------------------------------------
Name (time in us)                       Min      Max    Mean  StdDev  Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------
test_small_blockwise_performance     6.7920  45.0040  7.4829  1.5829  7.0940  0.1590  901;3342      133.6374   20606           1
--------------------------------------------------------------------------------------------------------------------------------

The overhead is still not negligible, but it is half what it was.
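For reference, a sketch of how such a pytest-benchmark test could look (hypothetical body; the actual test added in the PR may build the batched convolution differently):

```python
import numpy as np
import pytensor
import pytensor.tensor as pt
from pytensor.tensor.signal import convolve1d


def test_small_blockwise_performance(benchmark):
    # Batched valid Conv1d: 7 batch entries, signal length 128, kernel 20.
    x = pt.tensor("x", shape=(7, 128))
    y = pt.tensor("y", shape=(7, 20))
    out = convolve1d(x, y, mode="valid")
    fn = pytensor.function([x, y], out, trust_input=True)

    rng = np.random.default_rng(0)
    x_val = rng.normal(size=(7, 128))
    y_val = rng.normal(size=(7, 20))
    benchmark(fn, x_val, y_val)
```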

@ricardoV94 ricardoV94 force-pushed the speedup_python_impl_of_blockwise branch from 9292030 to b8164a9 on May 9, 2025 10:20
@ricardoV94 ricardoV94 requested a review from jessegrabowski May 9, 2025 10:25
@ricardoV94 ricardoV94 force-pushed the speedup_python_impl_of_blockwise branch from b8164a9 to e5e090c on May 9, 2025 13:12
@ricardoV94 ricardoV94 merged commit 5fa5c9b into pymc-devs:main May 9, 2025
72 of 73 checks passed