Why does DifferentiableOptimizer detach parameters when track_higher_grads = False? #102
As a workaround, I think you can use a `grad_callback` that detaches the gradients when calling `diffopt.step`, e.g. `grad_callback=lambda grads: [g.detach() for g in grads]`.
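A minimal sketch of that workaround, assuming a standard `higher.innerloop_ctx` setup (the model, data, and hyperparameters below are placeholders, not from this thread):

```python
import torch
import higher

# Placeholder model and data for illustration.
model = torch.nn.Linear(4, 1)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

with higher.innerloop_ctx(model, inner_opt, copy_initial_weights=False,
                          track_higher_grads=True) as (fmodel, diffopt):
    inner_loss = ((fmodel(x) - y) ** 2).mean()
    # Detach the *gradients* (not the params): the update p' = p - lr * g.detach()
    # keeps an identity path back to the initial weights but drops the
    # second-order terms, i.e. first-order MAML.
    diffopt.step(inner_loss, grad_callback=lambda grads: [g.detach() for g in grads])
    outer_loss = ((fmodel(x) - y) ** 2).mean()
    outer_loss.backward()  # meta-gradients now reach model.parameters()
```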
@eric-mitchell Is the right workaround to set `track_higher_grads = True` but without your `grads_callback` trick? Let me check empirically, with deterministic code, so if I run it again it should print the same grad-norm value.
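For reference, this is the sort of seeding that makes such comparisons repeatable (a sketch, not the actual setup from this thread):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    # Fix all relevant RNGs so grad norms are comparable across runs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
```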
Close enough! 🙂 Now let's change the seed (from 0 to 42, 142, 1142); the grad-norm value should change:
Now returning the seed to zero:
Close enough again! 🙂 Now, if Eric's trick works (passing a `grads_callback`), the gradient value should change, since it would now be using FO with no higher-order info. I will change my code in steps.
Running it again, I get (confirming the determinism of the code):
This confirms that the combination does something different (i.e. his `grads_callback` changes the behaviour). Now, what if I use Eric's callback but set `track_higher_grads=False`:
This raises an error. So setting `track_higher_grads = False` seems to always be wrong. This makes me feel your solution at least changes the behaviour, though I don't know why it works or why higher's original code doesn't. My self-contained reproducible script:
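The original script isn't preserved in this extract; a hypothetical stand-in that exercises the same comparison could look like this (all names and hyperparameters are assumptions):

```python
import torch
import higher

def meta_grad_norm(track_higher_grads: bool, use_callback: bool, seed: int = 0) -> float:
    """One inner step, then the norm of the meta-gradient w.r.t. the initial weights."""
    torch.manual_seed(seed)
    model = torch.nn.Linear(4, 1)
    inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(8, 4), torch.randn(8, 1)

    with higher.innerloop_ctx(model, inner_opt, copy_initial_weights=False,
                              track_higher_grads=track_higher_grads) as (fmodel, diffopt):
        inner_loss = ((fmodel(x) - y) ** 2).mean()
        cb = (lambda grads: [g.detach() for g in grads]) if use_callback else None
        diffopt.step(inner_loss, grad_callback=cb)
        outer_loss = ((fmodel(x) - y) ** 2).mean()

    # With track_higher_grads=False this raises, because the detached params
    # cut the graph back to model.parameters() -- matching the error above.
    grads = torch.autograd.grad(outer_loss, tuple(model.parameters()))
    return torch.cat([g.flatten() for g in grads]).norm().item()
```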
Now I will check how fast the code runs by reading the tqdm output. If it's truly doing FO (and not using higher-order grads), there should be a speed-up. I'm running this on my M1 laptop. The combination for the following run is `track_higher_grads = True` with `diffopt.step(inner_loss, grad_callback=lambda grads: [g.detach() for g in grads])`, so this should be FO (the faster one) and should finish sooner than the next run, which uses higher-order grads/Hessians:
Now with `track_higher_grads = True` and `diffopt.step(inner_loss)`, i.e. with higher-order grads (Hessian):
Since it takes much longer, I conclude this run indeed uses Hessians and is NOT FO MAML. I assume the difference would be even more noticeable if the network were larger (due to the roughly quadratic size of the Hessian).
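A rough way to time the two settings, reusing the hypothetical `meta_grad_norm` helper sketched above:

```python
import time

for use_callback in (True, False):
    start = time.perf_counter()
    for _ in range(100):
        meta_grad_norm(track_higher_grads=True, use_callback=use_callback)
    elapsed = time.perf_counter() - start
    # The callback (first-order) runs should finish faster: no second-order
    # graph is built through the inner update.
    print(f"grad_callback={use_callback}: {elapsed:.2f}s")
```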
@eric-mitchell Hi Eric! Do you mind briefly explaining why your solution works? I must admit it's strange, given that the code already seemed to do a detach, and I would have expected the `requires_grad_()` not to do anything (but perhaps it clearly does). Thank you for your time!
More qualitative sanity checks. First-order MAML in my real script:
Now with non-FO MAML:
FO takes ~6 days while the higher-order one takes ~13, so it's likely correct!
The solution is simple: they apply the detach to the params p, not to the gradients g, which of course cuts the whole graph back to the initial weights!
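The distinction in a tiny, self-contained example (illustrative only, not the library's code):

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
loss = (p ** 2).sum()
(g,) = torch.autograd.grad(loss, p, create_graph=True)
lr = 0.1

# Detach the *params* (what higher does when track_higher_grads=False):
# the result is a fresh leaf with no path back to p at all.
p_new_bad = (p - lr * g).detach().requires_grad_()
print(p_new_bad.grad_fn)  # None -- the meta-gradient path is cut entirely

# Detach the *gradients* (the grad_callback trick): the update keeps an
# identity path back to p, which is exactly first-order MAML.
p_new_fo = p - lr * g.detach()
print(torch.autograd.grad(p_new_fo.sum(), p))  # (tensor([1.]),)
```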
Hi! Thank you for this awesome library, it helps me a lot.
I am not sure whether I'm missing something, but I'm confused about why DifferentiableOptimizer detaches parameters when `track_higher_grads = False`. The relevant block is higher/higher/optim.py, lines 251 to 257 at 1e20cf9.
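That block isn't quoted in this extract; paraphrased from memory of that revision (not an exact quote), the end of `DifferentiableOptimizer.step` looks roughly like this:

```python
# End of DifferentiableOptimizer.step (paraphrased, not an exact quote):
new_params = params[:]
for group, mapping in zip(self.param_groups, self._group_to_param_list):
    for p, index in zip(group['params'], mapping):
        if self._track_higher_grads:
            new_params[index] = p
        else:
            # Detaching p severs the graph back to the initial weights,
            # even when copy_initial_weights=False.
            new_params[index] = p.detach().requires_grad_()
```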
This detach cuts the gradient path back to the original model parameters, even though `copy_initial_weights=False`. When we set `copy_initial_weights=False`, we want gradients to flow back to the original model parameters, but line 257 cuts off the gradient flow.

In my use case, I want to implement something like FOMAML; here is a simplified version of my code:
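The snippet itself isn't preserved in this extract; a hypothetical stand-in with the same shape (placeholder model, data, and hyperparameters) would be:

```python
import torch
import higher

model = torch.nn.Linear(4, 1)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

with higher.innerloop_ctx(model, inner_opt, copy_initial_weights=False,
                          track_higher_grads=False) as (fmodel, diffopt):
    diffopt.step(((fmodel(x) - y) ** 2).mean())  # one inner step
    outer_loss = ((fmodel(x) - y) ** 2).mean()

meta_opt.zero_grad()
outer_loss.backward()
# With stock higher, .grad on model.parameters() stays None here, because
# the detached params cut the graph -- the behaviour this issue reports.
meta_opt.step()
```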
The gradients were not propagated back to the original parameters. My code works well after I edited the code of higher to:
I know this problem can be solved by manually mapping the gradients, but I just wonder why detaching the parameters is necessary here. And thank you for your nice work again!