
Generalized Proximal Gradient Fix to match Proximal Gradient #143

Merged: 8 commits into PyLops:dev on Nov 29, 2023

Conversation

shnaqvi commented Oct 10, 2023

A fix to GeneralizedProximalGradient so that its logic matches that of ProximalGradient when there is only one proximal operator g_i. Currently the epsg scaling factor is misplaced, so lines 398 and 399 should be changed from this:

tmp[:] = proxg.prox(tmp, tau * len(proxgs))
zs[i] += epsg * (tmp - y)

to this:

tmp[:] = proxg.prox(tmp, epsg * tau * len(proxgs))
zs[i] += (tmp - y)
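As a sanity check, here is a minimal NumPy sketch (soft-thresholding standing in for proxg.prox; not the pyproximal implementation) showing that, with the proposed placement, a single-term GeneralizedProximalGradient step matches the standard ProximalGradient step:

```python
import numpy as np

def soft(v, t):
    # hypothetical stand-in for proxg.prox: prox of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
grad = rng.standard_normal(5)
tau, epsg = 0.1, 1.0

# standard ProximalGradient step
x_pg = soft(x - tau * grad, tau)

# GeneralizedProximalGradient step with the corrected scaling,
# single prox term: len(proxgs) == 1
y = x.copy()
z = x.copy()
tmp = (2 * y - z) - tau * grad        # z starts equal to x, so tmp = x - tau * grad
tmp = soft(tmp, epsg * tau * 1)       # epsg * tau * len(proxgs)
z = z + (tmp - y)
x_gpg = z                             # single term, weight 1/len(proxgs) == 1

assert np.allclose(x_pg, x_gpg)
```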

Salman Naqvi added 2 commits August 17, 2023 18:46
…ProximalGradient when there's only 1 Proximal Operator g_i
shnaqvi commented Oct 10, 2023

Tagging @olivierleblanc, who worked on this code, upon @mrava87 's request.

olivierleblanc commented Oct 10, 2023

Hi @shnaqvi,

Thank you for tagging me.
I agree with your comment. But there seems to be a confusing mismatch between the variable names in the code and in the docs. I guess the $\epsilon$ variable in the code plays the role of the $\eta$ scaling parameter in the docs.

If you have some time, it would be great to harmonize the code and the docs.

mrava87 commented Oct 10, 2023

@shnaqvi good stuff, and great that @olivierleblanc agrees :)

I agree about trying to harmonize the two codes since we are fixing them, and since the library is still v0.x we can make breaking changes without worrying too much; the fewer the better, but if this improves consistency I am up for it.

@shnaqvi given the merge conflict maybe it is best we finalize the code here and then you either resolve it locally or just make a new PR where you start from the current version of pyproximal/optimization/primal.py and inject your changes.

One last thing: since we are checking consistency between the two algorithms for a case where one should reduce to the other, it would be good to add a test with a simple example that shows this is indeed the case :)

mrava87 commented Oct 10, 2023

@olivierleblanc but I honestly do not see any η; can you point us to where you see it? Make sure you check the latest version of the docs on RTD, as I made changes to the code some time ago (the reason for the conflict, I guess...) when I tried to clean up a bit both the code and docs of GeneralizedProximalGradient.

shnaqvi commented Oct 11, 2023

@olivierleblanc, another question about the scaling factor in the parameter passed to the prox operator of the prior: do we need len(proxgs) below? I don't see the effect of removing it. Thanks

tmp[:] = proxg.prox(tmp, epsg * tau * len(proxgs))

@mrava87, sorry for the merge conflict. I thought I had pulled the latest changes from dev before branching off to make this change. I've kept everything from your file and made just two changes: one to move epsg, and the other to print tau during backtracking in ProximalGradient, as we had discussed before.

Salman Naqvi added 4 commits October 11, 2023 16:38
…dProximalGradient, 2) Removed a redundant scaling factor in the parameter passed to the prox function call in the proximal step in the GeneralizedProximalGradient.
shnaqvi commented Oct 12, 2023

@mrava87, I've added an example script verifying that the two methods produce equivalent results. I've also removed a scaling factor in the parameter passed to the proximal function call in the GeneralizedProximalGradient method. Kindly verify. Thanks!

mrava87 commented Oct 12, 2023

@shnaqvi I see, no problem; we can squash and merge later so all intermediate commits go away.

For the code changes, they look good. I suggest we wait for @olivierleblanc to answer your question as well as my question on the inconsistency in code and doc (which I do not see where is exactly).

For the notebook, maybe I wasn't clear when I asked for a test. We usually do not want to add notebooks to the library, and even more importantly, tests should run on very simple and fast examples. I created one using a basic L2+L1 problem and compared the two solvers. So far I can only pass the test up to 2 decimals, and I noticed the same in your notebook: the results, although similar, differ in the small decimals (also clear from the logs of the two solvers if you look at how f and g evolve). Do you have an intuition for why they do not give exactly the same results?
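For illustration, here is a self-contained NumPy sketch of such a test on a basic L2+L1 problem (soft-thresholding as the L1 prox; not the pyproximal API, and without acceleration). In this simplified form the two iterations coincide to machine precision, which suggests the small decimal differences come from features not modeled here:

```python
import numpy as np

def soft(v, t):
    # prox of t * ||.||_1 (stand-in for an L1 proximal operator)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(42)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 0.1
tau = 1.0 / np.linalg.norm(A, 2) ** 2  # step size <= 1/L with L = ||A||_2^2

def grad(x):
    return A.T @ (A @ x - b)

niter = 300

# ProximalGradient (no acceleration)
x_pg = np.zeros(10)
for _ in range(niter):
    x_pg = soft(x_pg - tau * grad(x_pg), lam * tau)

# GeneralizedProximalGradient with a single g (epsg = 1)
x_gpg = np.zeros(10)
z = x_gpg.copy()
for _ in range(niter):
    g = grad(x_gpg)
    tmp = soft((2 * x_gpg - z) - tau * g, lam * tau)  # epsg * tau * len(proxgs)
    z = z + (tmp - x_gpg)
    x_gpg = z.copy()

assert np.allclose(x_pg, x_gpg)
```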

shnaqvi commented Oct 12, 2023

@olivierleblanc, I'm not able to figure out where the discrepancy is arising from. I feel there is some extraneous arithmetic in the proximal step of GeneralizedProximalGradient that could potentially be simplified, if @olivierleblanc agrees. In particular, can this:

    y = x.copy()
    zs = [x.copy() for _ in range(len(proxgs))]

    for iiter in range(niter):
        ...
        for i, proxg in enumerate(proxgs):
            tmp = (2 * y - zs[i]) - tau * grad
            tmp[:] = proxg.prox(tmp, tau)
            zs[i] += (tmp - y)
            x += zs[i] / len(proxgs)

be replaced by:

        for i, proxg in enumerate(proxgs):
            zs[i] += proxg.prox(zs[i] - tau * grad, tau)
            x += zs[i] / len(proxgs)

I don't see why we only add the difference between the proximal output and y to the last iteration's value. But regardless, I was not able to get rid of the small discrepancy you are seeing with this simplification either.

@olivierleblanc

> @olivierleblanc but I honestly do not see any η, can you point us where you see it - make sure you check the latest version of the doc on RTD as I did changes to the codes some time ago (reason for the conflict I guess....) where I tried to clean up a bit both the code and doc of GeneralizedProximalGradient
generalized_FB.pdf

To check the proposition of @shnaqvi, I was reading a note I wrote last year that I'm attaching.

> @olivierleblanc, another question about the scaling factor in the parameter passed to the prox operator of the prior: do we need len(proxgs) below? I don't see the effect of removing it. Thanks
>
> tmp[:] = proxg.prox(tmp, epsg * tau * len(proxgs))

Reading my notes again, I suspect this len(proxgs) might be related to the relative weights that one can set for each non-differentiable function.

Don't hesitate if you have further questions about the implementation I proposed at that time.

mrava87 commented Oct 15, 2023

Oh I see, I didn't realize that \eta was what you wrote; I somehow thought it was related to the \eps you can put on g_i, but apparently in your notes the strength of each term is controlled directly by omega_i, which does not appear in the objective function. I am happy to go back to this definition so we make sure it is consistent with your reference (which we will add to the Notes; I wasn't very happy I had forgotten at the time to add a reference so everyone would know where this algorithm comes from).

@shnaqvi, are you happy to take the lead on this? After seeing these notes, I think the discrepancy between GenProxGrad and ProxGrad will go away if they are implemented following these equations (effectively, Olivier nicely shows how eq. 3 reduces to our case when fixing eta=1). I also wonder if we could add eta to ProxGrad to make it more general, which I suspect for the case of n=1 would still be compatible with accelerations?

mrava87 commented Oct 19, 2023

@shnaqvi just checking you saw the message above; please let us know if you have the time and interest to continue on this?

shnaqvi commented Oct 23, 2023

@olivierleblanc, in your note I went through the reference (pdf here), where you refer to their Algorithm 10.3, but I didn't quite follow where they deal with the composite minimization problem of f_1 being a sum of multiple priors. Algorithm 10.3 solves problem 10.15 in that paper, but that's not the same as problem 1 in your writeup. Would you agree?

I did find another reference, Yao-Liang Yu, 2013, that does solve problem 1 in your writeup (see their problem 1). There the author proposes Algorithm 1, proving that the proximal map of the average of all the priors is best approximated by the average of the proximal maps of the individual priors, given in their equation 2 (see this).

I think Yao-Liang's paper directly addresses our problem of the Generalized Accelerated Proximal Gradient, and our current implementation pretty much follows their algorithm with the changes I proposed earlier. Kindly share your thoughts?
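For concreteness, a minimal sketch of the proximal-average idea (assuming uniform weights and no acceleration; soft-thresholding as an illustrative prox, not pyproximal code):

```python
import numpy as np

def soft(v, t):
    # prox of t * ||.||_1 (illustrative prior)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pa_prox_grad(grad, proxs, x0, tau, niter=300):
    """Proximal-average proximal gradient sketch (Yu-style, uniform
    weights, no acceleration): take a gradient step, then average the
    individual proximal maps instead of computing the prox of the sum."""
    x = x0.copy()
    for _ in range(niter):
        v = x - tau * grad(x)
        x = sum(prox(v, tau) for prox in proxs) / len(proxs)
    return x
```

With a single prox term this reduces exactly to the standard proximal gradient iteration.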

mrava87 commented Oct 23, 2023

@shnaqvi, I'll let @olivierleblanc reply with his point of view; however, for me the two algorithms are slightly different. You are right that they both try to solve the same kind of objective function, but that does not mean they are the same algorithm. This can be seen simply from the fact that:

  • the one Olivier calls Generalized Proximal Gradient has N z_i vectors. I am also not sure about "as in Algorithm 3.2 in [1]": I do not see this algorithm in the book chapter; the one that I think matches what is implemented in GeneralizedProximalGradient is Algorithm 10.27.
  • the one you found has only one z vector

So, I think we should stick to finalizing the code following Olivier's notes, removing epsg (or perhaps using it for what in the algorithm are the omega_i, instead of having it fixed to 1/len(proxgs)) and adding eta (which I think we can also easily add to ProximalGradient with a default of 1).

We can then add a new algorithm, PAProximalGradient, that implements the Proximal Average Proximal Gradient algorithm in the paper you refer to :)

olivierleblanc commented Oct 26, 2023

Hi guys,

Let me answer both of your comments below:

> @olivierleblanc, in your note, I went through the reference (pdf here), where you're referring to their Algorithm 10.3, but I didn't quite follow where they are dealing with the composite minimization problem of f_1 being a sum of multiple priors. 10.3 solves problem 10.15 in that paper but that's not the same as problem 1 in your writeup. Would you agree?

Hi @shnaqvi. I agree with you on the fact that in my short note, I don't refer to the origin of the generalized Forward-Backward algorithm. I'm rather providing the algorithm as granted, and I'm making the analogy of each intermediate proximal step with the reference (pdf here) I shared.

> I did find another reference from Yao-Liang Yu, 2013 that does solve the problem 1 in your writeup, see their problem 1. Here, the author proposes Algorithm 1, proving that the proximal map of the average of all the priors is best approximated by the average of the proximal maps of the individual priors, given in equation 2 (see this).
>
> I think this Yao-Liang's paper is directly addressing our problem of the Generalized Accelerated Proximal Gradient and our current implementation pretty much follows their algorithm with the changes I proposed earlier. Kindly share your thoughts?

> @shnaqvi, I'll let @olivierleblanc reply with his point of view, however for me the two algorithms are slightly different. You are right that they both try to solve the same kind of objective function but that does not mean that they are the same. This can be simply seen by the fact that:
>
> • the one Olivier calls Generalized Proximal Gradient has N z_i vectors; the one that I think does what is implemented in GeneralizedProximalGradient is Algorithm 10.27.
> • the one you found has only one z vector

I haven't read this in detail. But from what is written in Algorithm 1, it seems that it is indeed almost solving our problem of interest. I don't agree with @mrava87's distinction about the multiple $z_i$ vectors, because in my notes they are all summed up at the end too. To me, the difference between this and my note is that Eq. (2) in my note allows for $\eta \neq 1$. If we fix $\eta=1$, then I think the algorithms perform the same computations (up to the acceleration made explicit in Algorithm 1).

> So, I think we should stick to finalizing the code following Olivier notes, removing epsg or perhaps using it for what in the algorithm are omega_i (instead of having it fixed to 1/len(proxgs)) and adding eta (which I think we can also easily add to ProximalGradient with default to 1).
>
> We can then add a new algorithm PAProximalGradient that implements the Proximal Average Proximal Gradient algorithm in the paper you refer to :)

To conclude, I indeed think it is better to allow a flexible choice of the different weights and parameters. You can set $\omega_i = 1/\mathrm{len(proxgs)}$ and $\eta=1$ by default.
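As a rough sketch of what that flexible parameterization could look like (pure NumPy; hypothetical function name gen_fb; the tau / omega_i prox scaling follows the usual generalized forward-backward formulation and should be checked against the attached note before adoption):

```python
import numpy as np

def soft(v, t):
    # prox of t * ||.||_1 (illustrative prior)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def gen_fb(grad, proxs, omegas, x0, tau, eta=1.0, niter=300):
    """Generalized forward-backward sketch with per-term weights omega_i
    (summing to 1) and relaxation eta. Each z_i is relaxed toward the
    prox of its own term, then x is the omega-weighted sum of the z_i."""
    x = x0.copy()
    zs = [x0.copy() for _ in proxs]
    for _ in range(niter):
        g = grad(x)
        for i, prox in enumerate(proxs):
            zs[i] = zs[i] + eta * (prox(2 * x - zs[i] - tau * g, tau / omegas[i]) - x)
        x = sum(w * z for w, z in zip(omegas, zs))
    return x
```

With two identical halves of an L1 prior, $\omega_i = 1/2$ and $\eta = 1$, this reduces to the standard proximal gradient iteration.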

Tell me if I missed something in your comments.

mrava87 commented Oct 26, 2023

Alright, let me attach a screenshot with a small derivation I did to see whether the two algorithms are indeed the same. As you will see, I do not think this is the case; perhaps I am not seeing a choice of omega (in your algorithm) and alpha (in the other one) that would make the two identical, but from what I wrote I think they are not.

Note that having the same objective function to minimize does NOT mean that the algorithm itself is the same; the trajectory of the solution may differ even though the minimizer of the objective function is the same.

(screenshot attached: Screen Shot 2023-10-27 at 12 00 27 AM)

olivierleblanc commented Oct 26, 2023

> Alright, let me attach a screenshot with a small derivation that I did to see if indeed the two algorithms are the same. As you will see I do not think this is the case; perhaps I am not seeing a choice of omega (in your algorithm) and alpha (in the other one) that would make the two be identical, but from what I write I think they are not.
>
> Note that having the same objective function to minimize does NOT mean that the algorithm itself is the same - aka the trajectory of the solution may differ even though the minimizer of the objective function is the same
>
> (screenshot: Screen Shot 2023-10-26 at 10 52 09 PM)

Alright, thank you for that note. First, notice that without further assumptions on the $g_i$'s, both algorithms can represent the same problem.

Second, notice that you forgot to write $\mathrm{prox}_{\gamma g_2}$ in your iterations and wrote $\mathrm{prox}_{\gamma g_1}$ everywhere.
Also, as you noted textually, it might be appropriate to add superscripts $k$ and $k+1$ to discriminate between two iterates in the computations.

Finally, you will notice that in A1, you get:

$$ x^{k+1} = \frac{1}{2} \mathrm{prox}_{\gamma g_1} (z_2^{k+1} - \gamma \nabla f(x^k)) + \frac{1}{2} \mathrm{prox}_{\gamma g_2} (z_1^{k+1} - \gamma \nabla f(x^k)). $$

To conclude, I agree with your conclusion that if the $z_i$'s paths differ from that of $x$, then the algorithms differ.

mrava87 commented Oct 26, 2023

I am not sure I agree with your comment "First, notice that without further assumption on the gi's, both algorithms can represent the same problem", as one algorithm requires scaled g functions with scalings that sum to 1 and the other makes no such requirement... so what I simply did was allow A1 to also have scalings, knowing that I can write prox_wg if I know prox_g.

Sorry, the g1 -> g2 was just a typo; I fixed it now :) And yes, I have now added superscripts to clarify the iterations.

So, in the end, do you agree that the algorithms are different?

@olivierleblanc

Actually, you wrote in your note what I wanted to say: nothing prevents you from compensating for the $\alpha_i$'s within the definition of the $g_i$'s, so you might solve the same problem with both algorithms.

I agree that they are different.

mrava87 commented Nov 26, 2023

We haven't heard from @shnaqvi in a while, so I guess he may have been busy with other tasks.

I wanted to get this done before we all forget. I followed the PDF shared by @olivierleblanc and included eta in both solvers. I believe the code now reflects the notes, and if you agree I am going to merge this PR.

I will also open a GitHub issue as a reminder about the other solver, Proximal Average Proximal Gradient.

@olivierleblanc

> We haven't heard in a while from @shnaqvi, so I guess he may have been busy with other tasks.
>
> I wanted to get this done before we all forget. I followed the PDF shared by @olivierleblanc and included eta in both solvers. I believe the code does now reflect the notes and if you agree I am going to merge this PR.
>
> I will also make a Github issue to remember about the other solver, Proximal Average Proximal Gradient.

I did not run the code myself, but I read the last commit and it looks OK!
Thank you for finishing this PR.

mrava87 commented Nov 27, 2023

Thanks @olivierleblanc for looking it over :) I'll give @shnaqvi a couple of days to reply; otherwise I'll move on with merging the PR.

@mrava87 mrava87 merged commit 6fb31e3 into PyLops:dev Nov 29, 2023
12 checks passed