Implement symbolic minimize and root Ops #1182


Merged
merged 22 commits into from
Jun 10, 2025

Conversation

jessegrabowski
Member

@jessegrabowski jessegrabowski commented Jan 31, 2025

Description

Implement scipy optimization routines, with implicit gradients. This PR should add:

  • optimize.minimize
  • optimize.root
  • optimize.scalar_minimize
  • optimize.scalar_root

It would also be nice to have rewrites to transform e.g. root to scalar_root when we know that there is only one input.

The implementation @ricardoV94 and I cooked up (ok ok it was mostly him) uses the graph to implicitly define the inputs to the objective function. For example:

    import pytensor.tensor as pt
    from pytensor.tensor.optimize import minimize
    x = pt.scalar("x")
    a = pt.scalar("a")
    c = pt.scalar("c")

    b = a * 2
    b.name = "b"
    out = (x - b * c) ** 2

    minimized_x, success = minimize(out, x, debug=False)

We optimize out with respect to x, so x becomes the control variable. By graph inspection we find that out also depends on a and c, so the generated graph includes them as parameters. In scipy lingo, we end up with:

minimize(fun=out, x0=x, args=(a, c))
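For intuition, here is a hedged numeric sketch of what that amounts to in plain scipy (the values for a and c are hypothetical; only `scipy.optimize.minimize` is assumed):

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical numeric values for the symbolic parameters a and c
a_val, c_val = 1.5, 3.0

def objective(x, a, c):
    b = a * 2
    return float((x[0] - b * c) ** 2)

# x0 plays the role of the symbolic x: it is only the starting point
res = minimize(objective, x0=np.array([0.0]), args=(a_val, c_val))
# the analytic minimum is at x = 2 * a * c = 9.0
```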

We get the following graph. The inner graph includes the gradients of the cost function by default, which is automatically used by scipy.

MinimizeOp.0 [id A]
 ├─ x [id B]
 └─ Mul [id C]
    ├─ 2.0 [id D]
    ├─ a [id E]
    └─ c [id F]

Inner graphs:

MinimizeOp [id A]
 ← Pow [id G]
    ├─ Sub [id H]
    │  ├─ x [id I]
    │  └─ <Scalar(float64, shape=())> [id J]
    └─ 2 [id K]
 ← Mul [id L]
    ├─ Mul [id M]
    │  ├─ Second [id N]
    │  │  ├─ Pow [id G]
    │  │  │  └─ ···
    │  │  └─ 1.0 [id O]
    │  └─ 2 [id K]
    └─ Pow [id P]
       ├─ Sub [id H]
       │  └─ ···
       └─ Sub [id Q]
          ├─ 2 [id K]
          └─ DimShuffle{order=[]} [id R]
             └─ 1 [id S]

We can also ask for the gradients of the minimizing value with respect to the parameters:

x_grad, a_grad, c_grad = pt.grad(minimized_x, [x, a, c])

# x_grad.dprint()
0.0 [id A]

# a_grad.dprint()
Mul [id A]
 ├─ 2.0 [id B]
 └─ c [id C]

# c_grad.dprint()
Mul [id A]
 ├─ 2.0 [id B]
 └─ a [id C]
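These symbolic gradients can be sanity-checked by finite differences on the closed-form problem: x* = 2ac, so we expect dx*/da = 2c and dx*/dc = 2a, and the starting point drops out entirely. A sketch using the same hypothetical values as above (only `scipy.optimize.minimize_scalar` is assumed):

```python
from scipy.optimize import minimize_scalar

# x* minimizes (x - 2*a*c)**2, so x*(a, c) = 2*a*c; the dprints above
# predict dx*/da = 2*c, dx*/dc = 2*a, and dx*/dx0 = 0
def x_star(a, c):
    return minimize_scalar(lambda x: (x - 2 * a * c) ** 2).x

a, c, eps = 1.5, 3.0, 1e-2
da = (x_star(a + eps, c) - x_star(a - eps, c)) / (2 * eps)
dc = (x_star(a, c + eps) - x_star(a, c - eps)) / (2 * eps)
```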

Related Issue

Checklist

Type of change

  • New feature / enhancement
  • Bug fix
  • Documentation
  • Maintenance
  • Other (please specify):

📚 Documentation preview 📚: https://pytensor--1182.org.readthedocs.build/en/1182/

@ricardoV94
Member

We could swap the scipy method to powell if pt.grad raises?


# TODO: Does clone replace do what we want? It might need a merge optimization pass afterwards
replace = dict(zip(self.fgraph.inputs, (x_star, *args), strict=True))
grad_f_wrt_x_star, *grad_f_wrt_args = clone_replace(
Member

This should avoid my TODO concern above

Suggested change
grad_f_wrt_x_star, *grad_f_wrt_args = clone_replace(
grad_f_wrt_x_star, *grad_f_wrt_args = graph_replace(

@twiecki
Member

twiecki commented Jan 31, 2025

Is there a path to later add other optimizers, e.g. jax or pytensor? Could then optimize on the GPU. Lasagne has a bunch of optimizers implemented in theano: https://lasagne.readthedocs.io/en/latest/modules/updates.html

@ricardoV94
Member

ricardoV94 commented Jan 31, 2025

Is there a path to later add other optimizers, e.g. jax or pytensor? Could then optimize on the GPU. Lasagne has a bunch of optimizers implemented in theano: https://lasagne.readthedocs.io/en/latest/modules/updates.html

Yes, the MinimizeOp can be dispatched to JaxOpt on the jax backend. On the Python/C backend we can also replace the MinimizeOp with any equivalent optimizer. As @jessegrabowski mentioned, we may even analyze the graph to decide what to use, such as scalar_root when that's adequate.

@jessegrabowski
Member Author

Is there a path to later add other optimizers, e.g. jax or pytensor? Could then optimize on the GPU. Lasagne has a bunch of optimizers implemented in theano: https://lasagne.readthedocs.io/en/latest/modules/updates.html

The Lasagne optimizers are for SGD in minibatch settings, so it's slightly different from what I have in mind here. This functionality would be useful in cases where you want to solve a sub-problem and then use the result in a downstream computation. For example, I think @theorashid wanted to use this for INLA to integrate out nuisance parameters via optimization before running MCMC on the remaining parameters of interest.

Another use case would be an agent-based model where we assume agents behave optimally. For example, we could try to estimate investor risk aversion parameters, assuming some utility function. The market prices would be the result of portfolio optimization subject to risk aversion (estimated), expected return vector, and market covariance matrix. Or use it in an RL-type scheme where agents have to solve a Bellman equation to get (an approximation to) their value function. I'm looking forward to cooking up some example models using this.

@jessegrabowski
Member Author

We could swap the scipy method to powell if pt.grad raises?

We can definitely change the method via rewrites. There are several gradient-free (or approximate gradient) options in that respect. Optimizers can be fussy though, so I'm a bit hesitant to take this type of configuration out of the hands of the user.

@ricardoV94
Member

Optimizers can be fussy though, so I'm a bit hesitant to take this type of configuration out of the hands of the user.

Let users choose but try to provide the best default?

@cetagostini

Carlos is interested in this 👀

@aphc14

aphc14 commented Feb 16, 2025

Would it be possible to insert a callback to the optimiser to store the position and gradient history?
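For reference, scipy's numeric interface already supports this through the `callback` argument, so exposing it on the Op seems feasible; whether gradients get recorded is up to the callback itself. A minimal sketch (the Rosenbrock objective here is just an example, not part of the PR):

```python
import numpy as np
from scipy.optimize import minimize

history = []

def rosen(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def store(xk):
    # scipy calls this once per iteration with the current point;
    # gradients would have to be recomputed here if they are wanted too
    history.append(np.copy(xk))

res = minimize(rosen, x0=np.array([-1.0, 1.0]), callback=store)
```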

Member

@ricardoV94 ricardoV94 left a comment

I left some comments for ScalarOptimization but they all apply to the General Optimize Op.


def perform(self, node, inputs, outputs):
f = self.fn_wrapped
x0, *args = inputs
Member

x0 isn't passed as the initial point?

Member Author

yes, the input to the Op is always taken to be the initial value

Member

You were not using it in the perform of this op

If the input `x` is the same as the last input, return the cached result. Otherwise update the cache with the
new input and result.
"""
cache_hit = np.all(x == self.last_x)
Member

@ricardoV94 ricardoV94 Jun 8, 2025

Why put this in a separate variable? You end up doing more checks, since you check self.last_x is None again. It's also wasteful here: if self.last_x is None, you end up doing an elementwise comparison between the entries of x and None and then reducing it with all.

Also tiny optimization, but doing (x == self.last_x).all() is slightly faster, as it goes directly to the C method, instead of the numpy wrapper np.all that allows dispatching and all that.
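A sketch of what the suggested check could look like (the wrapper class and names here are hypothetical, loosely following the snippet under review):

```python
import numpy as np

class LastCallCache:
    """Hypothetical sketch: memoize the most recent input/result pair."""

    def __init__(self, fn):
        self.fn = fn
        self.last_x = None
        self.last_result = None

    def __call__(self, x):
        x = np.asarray(x)
        # guard the None case explicitly, then use the ndarray method
        # directly (slightly faster than np.all, per the review comment);
        # the shape check avoids broadcasting surprises in the comparison
        if (
            self.last_x is not None
            and x.shape == self.last_x.shape
            and (x == self.last_x).all()
        ):
            return self.last_result
        self.last_x = x
        self.last_result = self.fn(x)
        return self.last_result
```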

Member Author

I was trying to cover the case where x is a scalar

Member

Should still be a numpy 0d array

Member Author

Are you sure? When I run it in a debugger I get numpy.float64 for x in the scalar case

Member

@ricardoV94 ricardoV94 Jun 8, 2025

Did you use ps.float64() as input? pt.scalar(shape=()) is the PyTensor type for a 0d array; the former is an np.float64.

If you get a float64 and didn't expect to, there's a bug somewhere. Also, you can assert that the minimize graph inputs are TensorVariables (and probably assert the x0/output is 0d in the init method).

Member Author

Oh I figured it out, it's a dumb scipy thing.

Comment on lines 247 to 251
args = [
arg
for arg in graph_inputs([objective], [x])
if (arg is not x and not isinstance(arg, Constant))
]
Member

Why not truncated_graph_inputs? This will put the whole graph up to the roots in the inner function. Example:

a = pt.scalar("a")
x = pt.scalar("x")
objective = x * pt.exp(a)

We don't need to compute exp(a) in every iteration of the inner function...
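The performance point can be illustrated outside pytensor with a call counter: an inner function built from the graph roots recomputes exp(a) on every evaluation, while one built from truncated inputs receives exp(a) precomputed. A pure-Python sketch of the idea, not the actual API:

```python
import math

calls = {"exp": 0}

def counted_exp(a):
    calls["exp"] += 1
    return math.exp(a)

a_val = 2.0

# graph_inputs-style inner function: its parameters are the graph roots,
# so exp(a) is re-evaluated on every call
def inner_from_roots(x, a):
    return x * counted_exp(a)

for x in (0.1, 0.2, 0.3):
    inner_from_roots(x, a_val)
roots_calls = calls["exp"]  # one exp evaluation per iteration

# truncated_graph_inputs-style inner function: exp(a) is hoisted out and
# passed in already computed
exp_a = counted_exp(a_val)

def inner_truncated(x, exp_a):
    return x * exp_a

for x in (0.1, 0.2, 0.3):
    inner_truncated(x, exp_a)
extra_calls = calls["exp"] - roots_calls - 1  # no exp calls in this loop
```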

Member Author

truncated_graph_inputs was returning ExpandDims on constants. I guess I just have to filter those?

Member

@ricardoV94 ricardoV94 Jun 8, 2025

What's the problem with an ExpandDims? It just means the graph is not as optimal as it could be, but then we should have a graph rewrite that pushes constants into the inner graph, like we do with scan (those expand dims will be constant folded later).

Your solution is more expensive

Member Author

I compute the jacobian with respect to all the args, so having 5 extra expand_dim inputs makes things a lot slower. I made the change though.

Member

Why slower, just because it wasn't constant folded into the inner function? expand_dims and the gradient is otherwise as cheap as it gets


implicit_f = grad(inner_fx, inner_x)

df_dx = atleast_2d(concatenate(jacobian(implicit_f, [inner_x]), axis=-1))
Member

What are you concatenating? Also, you can get the jacobian of all terms in one call and do what you need with them later.

Doesn't matter much, but there's a bit less graph traversal for shared nodes (maybe?). Not confident at all.
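For context, these jacobians feed the implicit function theorem: at the optimum the inner gradient vanishes, so differentiating grad_f(x*(θ), θ) = 0 gives dx*/dθ = -(∂²f/∂x²)⁻¹ ∂²f/∂x∂θ. A scalar sanity check with a hypothetical objective:

```python
# f(x, theta) = (x - theta**2)**2 is minimized at x* = theta**2
theta = 1.7

# second derivatives of f evaluated at the optimum x* = theta**2
d2f_dx2 = 2.0                 # d²f/dx² of (x - θ²)²
d2f_dxdtheta = -4.0 * theta   # d/dθ of df/dx = 2(x - θ²) is -4θ

# implicit function theorem: dx*/dθ = -(d²f/dx²)⁻¹ d²f/dxdθ
dxstar_dtheta = -d2f_dxdtheta / d2f_dx2

# this agrees with differentiating x* = θ² directly: dx*/dθ = 2θ
```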

Member

@ricardoV94 ricardoV94 left a comment

Looking good, we need that rewrite to shove constants into the inner graph as well?

@jessegrabowski jessegrabowski force-pushed the pytensor-optimize branch 2 times, most recently from f801c96 to 48c2c83 Compare June 9, 2025 10:12
@jessegrabowski jessegrabowski marked this pull request as ready for review June 9, 2025 11:10
@jessegrabowski
Member Author

I implemented all the things I said I would, so I'm taking this off draft.

The optimize.py file has gotten pretty unreadable; what do you think about making it a sub-module with root.py and minimize.py? I could imagine some more functionality being requested that would essentially recycle the same implicit gradient code (fixed point iteration, for example).

We might also consider offering the scan-based pure pytensor one (maybe with an L_op override to use the implicit formula instead of backpropagating through the scan steps). That would have the advantage of working in other backends right out of the gate, plus it could give users fine control over e.g. the convergence check function.

@jessegrabowski
Member Author

jessegrabowski commented Jun 9, 2025

Also, the failing sparse tests don't look like my fault. I needed to update my local main and rebase... nope, it's still failing.

@ricardoV94
Member

This looks great! Left some comments about asserting the fgraph variables are of the expected type; otherwise this looks good. Do we want to follow up with JAX dispatch? I guess there's no off-the-shelf numba stuff we can plug in? Maybe some optional third-party library, like we do with tensorflow-probability for some of the JAX dispatches?


codecov bot commented Jun 10, 2025

Codecov Report

Attention: Patch coverage is 90.81967% with 28 lines in your changes missing coverage. Please review.

Project coverage is 82.17%. Comparing base (d10f245) to head (bfa63f6).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
pytensor/tensor/optimize.py 90.81% 17 Missing and 11 partials ⚠️

❌ Your patch status has failed because the patch coverage (90.81%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1182      +/-   ##
==========================================
+ Coverage   82.12%   82.17%   +0.05%     
==========================================
  Files         211      212       +1     
  Lines       49757    50062     +305     
  Branches     8819     8840      +21     
==========================================
+ Hits        40862    41139     +277     
- Misses       6715     6732      +17     
- Partials     2180     2191      +11     
Files with missing lines Coverage Δ
pytensor/tensor/optimize.py 90.81% <90.81%> (ø)

Member

@ricardoV94 ricardoV94 left a comment

Don't forget to squash ;)

@ricardoV94
Member

Maybe more informative PR title as well?

@jessegrabowski jessegrabowski changed the title Add pytensor.tensor.optimize Implement symbolic minimize and root Ops Jun 10, 2025
@jessegrabowski jessegrabowski merged commit 646a734 into pymc-devs:main Jun 10, 2025
72 of 73 checks passed
@jessegrabowski jessegrabowski deleted the pytensor-optimize branch June 10, 2025 14:23