Generalized Proximal Gradient Fix to match Proximal Gradient #143
Conversation
…n algorithm from Flicker & Rigaut 2005
…ProximalGradient when there's only 1 Proximal Operator g_i
Tagging @olivierleblanc, who worked on this code, upon @mrava87's request.
Hi @shnaqvi, thank you for tagging me. If you have some time, it would be great to harmonize the code and the docs.
@shnaqvi good stuff, and great that @olivierleblanc agrees :) I agree about trying to harmonize the two codes since we are fixing them, and since the library is still v0.x we can make breaking changes without worrying too much - the fewer the better, but if this improves consistency I am up for it. @shnaqvi given the merge conflict, maybe it is best we finalize the code here and then you either resolve it locally or just make a new PR where you start from the current version of pyproximal/optimization/primal.py and inject your changes. One last thing: since we are checking the consistency between the two algorithms for a case where one should converge to the other, it may be good to add a test with a simple example that shows this is indeed the case :)
@olivierleblanc but I honestly do not see any η - can you point us to where you see it? Make sure you check the latest version of the doc on RTD, as I made changes to the code some time ago (the reason for the conflict, I guess...) where I tried to clean up a bit both the code and doc of GeneralizedProximalGradient
@olivierleblanc, another question about a scaling factor we have in the parameter passed to the prox operator of the prior: do we need to have the epsg factor there at all?
@mrava87, sorry for the merge conflict. I thought I had pulled the latest changes from upstream.
…dProximalGradient, 2) Removed a redundant scaling factor in the parameter passed to the prox function call in the proximal step of the GeneralizedProximalGradient.
@mrava87, I've added an example script verifying that the two methods produce equivalent results. I've also removed a scaling factor in the parameter passed to the proximal function call in the GeneralizedProximalGradient method. Kindly verify. Thanks!
@shnaqvi I see, no problem, we can Squash and merge later so all intermediate commits go away. The code changes look good. I suggest we wait for @olivierleblanc to answer your question as well as my question on the inconsistency between code and doc (which I do not see where it is exactly). For the notebook, maybe I wasn't clear when I asked for a test. We usually do not want to add notebooks to the library, and, more importantly, tests should run on very simple and fast examples. I created one using a basic L2+L1 problem and compared the two solvers. So far the test only passes up to 2 decimals, and I noticed the same in your notebook: the results, although similar, differ in the small decimals (also clear from the logs of the two solvers if you look at how f and g evolve). Do you have an intuition for why they do not give exactly the same results?
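A minimal sketch of such a test, assuming the solver signatures in pyproximal.optimization.primal (argument names are from memory and may differ across versions):

```python
import numpy as np
import pylops
from pyproximal import L1, L2
from pyproximal.optimization.primal import (
    GeneralizedProximalGradient,
    ProximalGradient,
)

np.random.seed(0)

# Small L2 + L1 problem: min_x 0.5 ||Ax - b||^2 + sigma ||x||_1
n, m = 20, 10
A = np.random.randn(n, m)
Aop = pylops.MatrixMult(A)
b = A @ np.random.randn(m)

l2 = L2(Op=Aop, b=b)
l1 = L1(sigma=1e-1)
tau = 0.95 / np.linalg.norm(A, 2) ** 2  # step size below 1/L

x0 = np.zeros(m)
xpg = ProximalGradient(l2, l1, x0, tau=tau, niter=200)
xgpg = GeneralizedProximalGradient([l2], [l1], x0, tau=tau, niter=200)

# With a single g_i the two solvers should coincide
np.testing.assert_array_almost_equal(xpg, xgpg, decimal=2)
```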
@olivierleblanc, I'm not able to figure out where the discrepancy is arising from. I feel there is some extraneous arithmetic logic in the proximal step of the GeneralizedProximalGradient, where the accumulation-style z-update could
be replaced by a direct assignment of the prox output (sketched below).
I don't see why we only add the difference between the proximal output and the original point to the last iteration's value. But regardless, I was not able to get rid of that small discrepancy you are seeing with this simplification either.
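The two snippets in this comment did not survive extraction; here is a minimal sketch of the contrast being drawn, with hypothetical names rather than the verbatim primal.py lines:

```python
def gfb_z_update(z_i, x, grad, tau, prox):
    """Standard generalized forward-backward update of one auxiliary
    variable: z_i moves by the increment prox(...) - x, so it keeps its
    own memory across iterations. `prox` is (v, tau) -> prox_{tau*g}(v)."""
    ztmp = 2 * x - z_i - tau * grad
    return z_i + prox(ztmp, tau) - x

def direct_z_update(z_i, x, grad, tau, prox):
    """The simplification questioned above: overwrite z_i with the prox
    output instead of accumulating the difference."""
    return prox(2 * x - z_i - tau * grad, tau)
```

When z_i = x, as always holds for a single operator with unit weight and η = 1, the two updates coincide - consistent with the simplification not removing the residual discrepancy.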
To check @shnaqvi's proposition, I went back to a note I wrote last year, which I'm attaching.
Reading my notes again, I suspect this is where the discrepancy comes from. Don't hesitate to ask if you have further questions about the implementation I proposed at that time.
Oh I see, I didn't realize that η was what you wrote - I somehow thought it was related to the ε you can put on the g_i, but apparently in your notes the strength of each term is controlled directly by ω_i, which does not appear in the objective function. I am happy to go back to this definition so we make sure it is consistent with your reference (which we will add to the Notes - I wasn't very happy that I had forgotten at the time to add a reference so everyone would know where this algorithm comes from). @shnaqvi, are you happy to take the lead on this? I think after seeing these notes the discrepancy between GenProxGrad and ProxGrad will go away if they are implemented following these equations (effectively, Olivier nicely shows how eq. 3 goes back to our case when fixing η = 1). I also wonder if we could add η to ProxGrad as well to make it more general, which I suspect for the case of n = 1 would still be compatible with accelerations?
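For reference, here is a sketch of the iteration being discussed, with notation reconstructed from the comments above (the exact symbols in Olivier's note are an assumption on my part):

$$z_i^{k+1} = z_i^k + \eta\left(\operatorname{prox}_{\tau g_i/\omega_i}\!\big(2x^k - z_i^k - \tau\nabla f(x^k)\big) - x^k\right), \qquad x^{k+1} = \sum_{i=1}^{n} \omega_i\, z_i^{k+1}, \qquad \sum_{i=1}^{n} \omega_i = 1$$

With n = 1, ω_1 = 1, η = 1, and z^0 = x^0, this collapses to x^{k+1} = prox_{τg}(x^k - τ∇f(x^k)), i.e., exactly the ProximalGradient step.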
@shnaqvi just checking you saw the message above - please let us know if you have time and interest to continue with this?
@olivierleblanc, in your note I went through the reference (pdf here), where you refer to their Algorithm 10.3, but I didn't quite follow where they deal with the composite minimization problem with multiple priors. I did find another reference from Yao-Liang Yu, 2013 that does solve problem 1 in your write-up; see their problem 1. Here, the author proposes Algorithm 1, showing that the proximal map of the average of all the priors is best approximated by the average of the proximal maps of the individual priors, given in their equation 2 (see this). I think Yao-Liang's paper directly addresses our problem of handling multiple priors, as sketched below.
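If it helps the comparison, here is a toy sketch of the proximal-average idea, an illustration of Yu's equation 2 as paraphrased above (not pyproximal API; the helper name and prox callables are hypothetical):

```python
import numpy as np

def prox_average(x, tau, proxs, weights):
    """Approximate prox of the weighted average of several priors by the
    weighted average of their individual proximal maps (in the spirit of
    Yu, 2013). Illustrative sketch only, not pyproximal's API.
    Each element of `proxs` is a callable (v, tau) -> prox_{tau*g_i}(v)."""
    return sum(w * prox(x, tau) for w, prox in zip(weights, proxs))

# Tiny demo with two simple priors: an l1 norm (soft-thresholding)
# and the indicator of the non-negative orthant (projection)
soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
nonneg = lambda v, tau: np.maximum(v, 0.0)

x = np.array([-1.5, 0.2, 3.0])
print(prox_average(x, 0.5, [soft, nonneg], [0.5, 0.5]))
```

Note there is a single state vector x here and no auxiliary z_i, which is the structural difference raised in the next comment.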
@shnaqvi, I'll let @olivierleblanc reply with his point of view; however, for me the two algorithms are slightly different. You are right that they both try to solve the same kind of objective function, but that does not mean they are the same. This can be seen simply from the fact that the generalized forward-backward scheme keeps one auxiliary variable z_i per regularizer, whereas the proximal-average scheme updates a single variable through the averaged proximal maps.
So, I think we should stick to finalizing the code following Olivier's notes, removing the redundant scaling. We can then add a new algorithm for the proximal-average approach as a separate solver.
Hi guys, let me answer both of your comments below:
Hi @shnaqvi. I agree with you that in my short note I don't refer to the origin of the generalized forward-backward algorithm. I rather take the algorithm as given, and I make the analogy of each intermediate proximal step with the reference (pdf here) I shared.
I haven't read this in detail, but from what is written in Algorithm 1 it seems that it is indeed almost solving our problem of interest. I don't agree with @mrava87's distinction based on the multiple auxiliary variables, though. First, notice that without further assumption on the g_i's, both algorithms can represent the same problem.
To conclude, I indeed think it is better to allow for a flexible choice of the different weights and parameters. You can set η = 1 and uniform weights to recover the previous behavior. Tell me if I missed something in your comments.
I am not sure I agree with your comment 'First, notice that without further assumption on the g_i's, both algorithms can represent the same problem', as one algorithm requires scaled g functions with scalings that sum to 1 and the other does not make any requirement of that kind... so what I simply did is allow A1 to have scalings as well, knowing that I can also write prox_{w·g} if I know prox_g. Sorry, the g1 -> g2 was just a typo; I fixed it now :) And yes, I have now added subscripts to clarify the iterations. So, in the end, do you agree that the algorithms are different?
Actually, you wrote in your note what I wanted to say: nothing prevents you from compensating the scalings inside the individual proximal operators. I agree that they are different.
We haven't heard from @shnaqvi in a while, so I guess he may have been busy with other tasks. I wanted to get this done before we all forget. I followed the PDF shared by @olivierleblanc and included the weights and the relaxation parameter η in GeneralizedProximalGradient. I will also open a GitHub issue so we remember about the other solver, Proximal Average Proximal Gradient.
I did not run the code myself, but I read the last commit and it looks OK!
Thanks @olivierleblanc for looking it over :) I'll give @shnaqvi a couple of days to reply; otherwise I'll move on with merging the PR.
Fix to GeneralizedProximalGradient so that its logic matches that of ProximalGradient when there is only one proximal operator g_i. Currently the epsg scaling factor is off-placed, so lines 398 and 399 of pyproximal/optimization/primal.py need to change; a sketch of the intended change follows.
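The before/after snippets for lines 398-399 did not survive extraction; below is a hypothetical reconstruction of the kind of change described (variable names assumed, and the "before" form is one plausible reading of "off-placed"). The essential point is that for a single g_i the step must reduce to ProximalGradient's proxg.prox(x - tau * grad, epsg * tau):

```python
import numpy as np

def gpg_proximal_step(x, z, grad, tau, epsg, proxgs):
    """One proximal step of a generalized proximal gradient iteration
    with uniform weights (hypothetical names, illustrative only)."""
    for i, proxg in enumerate(proxgs):
        ztmp = 2 * x - z[i] - tau * grad
        # off-placed (one plausible reading): epsg applied outside the prox,
        #   z[i] += epsg * (proxg.prox(ztmp, tau) - x)
        # fixed: epsg folded into the prox argument, matching ProximalGradient
        z[i] += proxg.prox(ztmp, epsg * tau) - x
    return np.mean(z, axis=0)
```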