Implement symbolic minimize and root Ops #1182

Conversation
We could swap the scipy method to powell if pt.grad raises?
pytensor/tensor/optimize.py (Outdated)

    # TODO: Does clone replace do what we want? It might need a merge optimization pass afterwards
    replace = dict(zip(self.fgraph.inputs, (x_star, *args), strict=True))
    grad_f_wrt_x_star, *grad_f_wrt_args = clone_replace(
This should avoid my TODO concern above
-    grad_f_wrt_x_star, *grad_f_wrt_args = clone_replace(
+    grad_f_wrt_x_star, *grad_f_wrt_args = graph_replace(
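For reference, a minimal sketch of the graph_replace substitution pattern, using illustrative variables rather than the PR's actual code:

```python
import pytensor.tensor as pt
from pytensor.graph.replace import graph_replace

x = pt.scalar("x")
a = pt.scalar("a")
y = (x + a) ** 2

x_star = pt.scalar("x_star")

# Substitute x -> x_star in the existing graph; per the comment above, this
# avoids the merge-rewrite concern that clone_replace raised in the TODO.
[y_at_x_star] = graph_replace([y], {x: x_star})
```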
Is there a path to later add other optimizers, e.g. jax or pytensor? Could then optimize on the GPU. Lasagne has a bunch of optimizers implemented in theano: https://lasagne.readthedocs.io/en/latest/modules/updates.html
Yes, the MinimizeOp can be dispatched to JaxOpt on the jax backend. On the Python/C backend we can also replace the MinimizeOp with any equivalent optimizer. As @jessegrabowski mentioned, we may even analyze the graph to decide what to use, such as scalar_root when that's adequate.
The Lasagne optimizers are for SGD in minibatch settings, so it's slightly different from what I have in mind here. This functionality would be useful in cases where you want to solve a sub-problem and then use the result in a downstream computation. For example, I think @theorashid wanted to use this for INLA to integrate out nuisance parameters via optimization before running MCMC on the remaining parameters of interest.

Another use case would be an agent-based model where we assume agents behave optimally. For example, we could try to estimate investor risk aversion parameters, assuming some utility function. The market prices would be the result of portfolio optimization subject to risk aversion (estimated), expected return vector, and market covariance matrix. Or use it in an RL-type scheme where agents have to solve a Bellman equation to get (an approximation to) their value function.

I'm looking forward to cooking up some example models using this.
We can definitely change the method via rewrites. There are several gradient-free (or approximate gradient) options in that respect. Optimizers can be fussy though, so I'm a bit hesitant to take this type of configuration out of the hands of the user.
Let users choose but try to provide the best default?
Carlos is interested in this 👀
Would it be possible to insert a callback to the optimiser to store the position and gradient history?
I left some comments for ScalarOptimization but they all apply to the General Optimize Op.
    def perform(self, node, inputs, outputs):
        f = self.fn_wrapped
        x0, *args = inputs
x0 isn't passed as the initial point?
yes, the input to the Op is always taken to be the initial value
You were not using it in the perform of this op
pytensor/tensor/optimize.py (Outdated)

    If the input `x` is the same as the last input, return the cached result. Otherwise update the cache with the
    new input and result.
    """
    cache_hit = np.all(x == self.last_x)
Why put this in a separate variable? You end up doing more checks, since you check for self.last_x is None again. It's also wasteful here if self.last_x is None: you end up doing an elementwise comparison between the entries of x and None and then reducing it with all. Also a tiny optimization, but doing (x == self.last_x).all() is slightly faster, as it goes directly to the C method instead of the numpy wrapper np.all, which allows dispatching and all that.
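A minimal sketch of the suggested check; the class and attribute names here are illustrative, not the PR's actual wrapper:

```python
import numpy as np


class CachedFn:
    """Illustrative cache around an expensive function of x."""

    def __init__(self, fn):
        self.fn = fn
        self.last_x = None
        self.last_result = None

    def __call__(self, x):
        # Check for None once, then call the ndarray method directly
        if self.last_x is not None and (x == self.last_x).all():
            return self.last_result
        self.last_x = np.asarray(x).copy()
        self.last_result = self.fn(x)
        return self.last_result
```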
I was trying to cover the case where x is a scalar
Should still be a numpy 0d array
Are you sure? When I run it in a debugger I get numpy.float64 for x in the scalar case.
Did you use ps.float64() as input? pt.scalar(shape=()) is the PyTensor type for a 0d array; the former is a np.float64. If you get a float64 and didn't expect to, there's a bug somewhere. Also, you can assert that the minimize graph inputs are TensorVariables (and probably assert that the x0/output is 0d in the init method).
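A sketch of the kind of validation being suggested; the helper name, placement, and messages are assumptions:

```python
from pytensor.tensor.variable import TensorVariable


def check_x0(x0):
    # Hypothetical guard for the scalar Op's __init__ / constructor helper
    if not isinstance(x0, TensorVariable):
        raise TypeError(f"x0 must be a TensorVariable, got {type(x0)}")
    if x0.ndim != 0:
        raise ValueError("the scalar optimizer expects a 0d (scalar) x0")
```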
Oh I figured it out, it's a dumb scipy thing.
pytensor/tensor/optimize.py
Outdated
args = [ | ||
arg | ||
for arg in graph_inputs([objective], [x]) | ||
if (arg is not x and not isinstance(arg, Constant)) | ||
] |
Why not truncated_graph_inputs? This will put the whole graph up to the roots in the inner function. Example:

    a = pt.scalar("a")
    x = pt.scalar("x")
    objective = x * pt.exp(a)

We don't need to compute exp(a) in every iteration of the inner function...
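A small sketch contrasting the two helpers on that example, assuming they behave as described in the comment above:

```python
import pytensor.tensor as pt
from pytensor.graph.basic import graph_inputs, truncated_graph_inputs

a = pt.scalar("a")
x = pt.scalar("x")
objective = x * pt.exp(a)

# graph_inputs walks back to the root variables, so `a` itself becomes an
# input of the inner function and exp(a) gets recomputed on every evaluation
roots = [v for v in graph_inputs([objective]) if v is not x]

# truncated_graph_inputs should stop at the closest variables that do not
# depend on x, so exp(a) can be passed in as a precomputed parameter instead
params = [v for v in truncated_graph_inputs([objective], [x]) if v is not x]
```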
truncated_graph_inputs was returning ExpandDims on constants. I guess I just have to filter those?
What's the problem with an expand_dims? It just means the graph is not as optimal as it could be, but that means we should have a graph rewrite that pushes constants into the inner graph like we do with scan (those expand_dims will be constant folded later). Your solution is more expensive.
I compute the jacobian with respect to all the args, so having 5 extra expand_dim inputs makes things a lot slower. I made the change though.
Why slower? Just because it wasn't constant folded into the inner function? expand_dims and its gradient are otherwise as cheap as it gets.
pytensor/tensor/optimize.py (Outdated)

    implicit_f = grad(inner_fx, inner_x)

    df_dx = atleast_2d(concatenate(jacobian(implicit_f, [inner_x]), axis=-1))
What are you concatenating? Also, you can get the jacobian of all terms in one call and do what you need with them later. Doesn't matter much, but there's a bit less graph traversal for shared nodes (maybe?). Not confident at all.
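A sketch of the one-call jacobian variant being suggested, on an illustrative inner graph rather than the PR's actual one:

```python
import pytensor.tensor as pt
from pytensor.gradient import grad, jacobian

# Illustrative inner problem: minimize f(x, a) = sum((x - a)**2) over x
inner_x = pt.vector("x")
inner_a = pt.vector("a")
inner_fx = ((inner_x - inner_a) ** 2).sum()

# The gradient of the objective is the implicit (root) function
implicit_f = grad(inner_fx, inner_x)

# One jacobian call over all wrt terms; each block can then be used as needed
df_dx, df_da = jacobian(implicit_f, [inner_x, inner_a])
```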
Looking good, we need that rewrite to shove constants into the inner graph as well?
I implemented all the things I said I would, so I'm taking this off draft. We might also consider offering the scan-based pure pytensor version (maybe with an L_op override to use the implicit formula instead of backpropping through the scan steps). That would have the advantage of working in other backends right out of the gate, plus it could give users fine control over e.g. the convergence check function.
This looks great! Left some comments about asserting the fgraph variables are of the expected type; otherwise this looks good. Do we want to follow up with JAX dispatch? I guess there's no off-the-shelf numba stuff we can plug in? Maybe some optional third-party library like we do with tensorflow-probability for some of the JAX dispatches?
Codecov Report

❌ Your patch status has failed because the patch coverage (90.81%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Coverage Diff:

    @@            Coverage Diff             @@
    ##             main    #1182      +/-   ##
    ==========================================
    + Coverage   82.12%   82.17%   +0.05%
    ==========================================
      Files         211      212       +1
      Lines       49757    50062     +305
      Branches     8819     8840      +21
    ==========================================
    + Hits        40862    41139     +277
    - Misses       6715     6732      +17
    - Partials     2180     2191      +11
Don't forget to squash ;)
Maybe a more informative PR title as well?
pytensor.tensor.optimize: minimize and root Ops
Description
Implement scipy optimization routines, with implicit gradients. This PR should add:
- optimize.minimize
- optimize.root
- optimize.scalar_minimize
- optimize.scalar_root
It would also be nice to have rewrites to transform e.g. root to scalar_root when we know that there is only one input.

The implementation @ricardoV94 and I cooked up (ok ok it was mostly him) uses the graph to implicitly define the inputs to the objective function. For example:
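A minimal sketch of what such an example might look like; the variable names, the exact minimize signature, and its return values are assumptions for illustration:

```python
import pytensor.tensor as pt
from pytensor.tensor.optimize import minimize

x = pt.scalar("x")
a = pt.scalar("a")
c = pt.scalar("c")

# Objective that depends on the control variable x and on parameters a, c
out = (x - a) ** 2 + c

# x serves as both the control variable and the initial point; a and c are
# discovered by graph inspection and become parameters of the inner function
x_star, success = minimize(out, x)
```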
We optimize `out` with respect to `x`, so `x` becomes the control variable. By graph inspection we find that `out` also depends on `a` and `c`, so the generated graph includes them as parameters. In scipy lingo, we end up with something like the call sketched below. We get the following graph. The inner graph includes the gradients of the cost function by default, which is automatically used by scipy.
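Roughly what that corresponds to on the scipy side, as a standalone sketch with stand-in Python callables rather than the compiled inner function:

```python
import numpy as np
from scipy.optimize import minimize as scipy_minimize


# Stand-ins for the compiled objective and its gradient for out = (x - a)**2 + c,
# with a and c passed through `args` instead of being baked into the function
def fun(x, a, c):
    return (x[0] - a) ** 2 + c


def jac(x, a, c):
    return np.array([2.0 * (x[0] - a)])


a_val, c_val = 3.0, 1.0
res = scipy_minimize(fun, x0=np.array([0.0]), args=(a_val, c_val), jac=jac)
print(res.x)  # approximately [3.0], the minimizing x*
```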
We can also ask for the gradients of the maximum value with respect to parameters:
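A sketch of that, under the same assumptions as the hypothetical example above (in particular that the returned solution is a symbolic scalar that pt.grad can differentiate through):

```python
import pytensor.tensor as pt
from pytensor.tensor.optimize import minimize

x, a, c = pt.scalar("x"), pt.scalar("a"), pt.scalar("c")
out = (x - a) ** 2 + c
x_star, success = minimize(out, x)

# Gradients of the solution with respect to the parameters; per the PR these
# come from the implicit gradients of the Op, not from unrolling the solver
dxstar_da, dxstar_dc = pt.grad(x_star, [a, c])
```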
Related Issue
Checklist
Type of change
📚 Documentation preview 📚: https://pytensor--1182.org.readthedocs.build/en/1182/