Embedded Performance vs Non-Embedded #603
-
Hi. I’ve found that the embedded program solves my problem on the order of 2x faster than using OSQP directly. I’m solving repeated portfolio optimization problems with n=250 instruments, with an added gross position constraint and a transaction cost penalty. I have two questions:
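For context, a problem of the sort described might look like the following sketch (the symbols here are illustrative assumptions, since the actual formulation isn't shown: \(\mu\) expected returns, \(\Sigma\) covariance, \(x_0\) current holdings, \(\lambda\) the transaction cost weight, \(g\) the gross limit):

```math
\begin{aligned}
\min_{x \in \mathbb{R}^{250}} \quad & \tfrac{1}{2}\, x^\top \Sigma x \;-\; \mu^\top x \;+\; \lambda \lVert x - x_0 \rVert_1 \\
\text{s.t.} \quad & \mathbf{1}^\top x = 1, \\
& \lVert x \rVert_1 \le g \quad \text{(gross position constraint)}
\end{aligned}
```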
Replies: 3 comments 4 replies
-
The only major difference should be that the "embedded" version you get by running codegen does some of the matrix factorisation work up front. This is quite a lot of the overall solve cost in many cases.

If you are solving the same sort of problem repeatedly, e.g. by just updating the linear cost and constraint terms, then you should expect the solve time after the first solve to be about the same, since at that point the required matrix factorisations are available in either the embedded or non-embedded implementation. This assumes that you are using the problem data update functions to do so. If you instead rebuild the problem with a fresh solver object at every solve, then you pay the factorisation cost every time.

It's also possible that you have the embedded version configured to not dynamically update the `rho` parameter.

I don't think you can expect a bigger performance gap at larger scales, because I think the effect you are seeing isn't explainable by the above.
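To illustrate why a fresh solver object per solve is expensive, here is a minimal sketch (plain NumPy, not the solver's actual internals) of the factor-once, solve-many pattern: the factorisation is the costly step, and repeated solves that only change the right-hand side can reuse it.

```python
import numpy as np

# Illustrative sketch: factor a KKT-style matrix once, then reuse the
# factors across repeated solves where only the right-hand side
# (linear cost / constraint bounds) changes.
rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
K = M @ M.T + n * np.eye(n)      # symmetric positive definite

L = np.linalg.cholesky(K)        # the expensive step, done once

def solve_with_factors(L, b):
    # Forward then backward substitution using the cached factors
    # (a real solver would use dedicated triangular solves here).
    y = np.linalg.solve(L, b)
    return np.linalg.solve(L.T, y)

for _ in range(3):               # repeated "solves" reuse L
    b = rng.standard_normal(n)
    x = solve_with_factors(L, b)
    assert np.allclose(K @ x, b)
```

Rebuilding the solver each time corresponds to recomputing `L` on every solve, which is what dominates the timing difference described above.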
-
Whether or not `rho` is updated, and how often, is controlled by the solver settings. You will see in the link above, though, that for the embedded version "the frequency does not rely on any timing". What this means is that in the non-embedded case we update `rho` based on timing measurements. For comparison purposes, I think you need to run both versions with the same settings.
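One way to get a like-for-like comparison is to pin the rho-update rule to a fixed iteration interval in the non-embedded run as well, mimicking the embedded default. A sketch (the setting names are OSQP's, but the specific values here are illustrative assumptions):

```python
# Illustrative settings for the non-embedded interface; forcing a fixed
# update interval mimics the embedded default, which cannot use timing.
comparison_settings = {
    "adaptive_rho": True,
    "adaptive_rho_interval": 25,  # fixed-iteration rule, as in codegen defaults
    "verbose": True,              # print the settings table to compare both runs
}

# With the osqp Python package these would be passed at setup, e.g.:
#   solver = osqp.OSQP()
#   solver.setup(P, q, A, l, u, **comparison_settings)
```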
-
Regarding "the frequency does not rely on any timing": the solver works by choosing a single value `rho` and then computing a one-time matrix factorisation in the first iteration, for a matrix whose entries are partly determined by `rho`. Subsequent iterations then require only a forward-backward solve using those same factors. This means that the first iteration -- i.e. the one that computes the factors -- is much slower than subsequent ones.

This generally works well, but we sometimes find that our initial choice of `rho` was not ideal. We therefore have the option to make a new choice and refactor. The downside of doing so is that we must compute a new factorisation; the upside is that we may converge in fewer iterations.

We must then have some rule about when to compute such a refactorisation. In the non-embedded case we time the initial factorisation, and if the total iteration time exceeds some fraction of that initial factorisation time then we assume it will be beneficial to refactor. This is the `adaptive_rho_fraction` setting.

We can also trigger a refactor (or at least check to see whether it would be helpful) after a fixed number of iterations instead, and this is what is done in the embedded codegen default options. The reason is that in this case we pre-compute the initial factorisation during codegen, and so can't rely on the factorisation timing, since the compiled code might be running on a different machine than the one that generated it. This is the `adaptive_rho_interval` setting.

I think, but am not entirely certain, that these are the only parameters that differ in the embedded codegen code. If you enable verbose output you should be able to see all of the settings in either case.

As far as other settings go: it depends on the problem structure, overall size, and how the solver is behaving on your problems in the first place.
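The two refactorisation triggers described above can be sketched as simple predicates (variable names and the default values are illustrative, not the solver's actual internals):

```python
def should_refactor_timed(iter_time_total, factor_time,
                          adaptive_rho_fraction=0.4):
    # Non-embedded rule: refactor once cumulative iteration time exceeds
    # a fraction of the measured initial factorisation time.
    return iter_time_total > adaptive_rho_fraction * factor_time

def should_refactor_fixed(iteration, adaptive_rho_interval=25):
    # Embedded/codegen rule: check every fixed number of iterations,
    # since a factorisation time measured on the codegen machine is
    # meaningless on the deployment target.
    return iteration > 0 and iteration % adaptive_rho_interval == 0
```

The timed rule adapts to how expensive the factorisation actually is on the running hardware; the fixed rule trades that adaptivity for portability.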