Eko from theory #1601

Merged
merged 4 commits into evolfit_w_eko from eko_from_theory on Sep 21, 2022

Conversation

@andreab1997 (Contributor)

We want to allow evolven3fit to look inside the relevant theory and, if possible, load the eko needed for the evolution from there. This should actually be the default behaviour.
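A minimal sketch of the lookup described above, assuming a hypothetical on-disk layout (a theory_<id> folder containing an eko.tar); the paths and the helper name find_theory_eko are placeholders, not the actual evolven3fit interface.

```python
# Hypothetical sketch: look for an eko shipped inside the theory folder and
# fall back to the previous behaviour otherwise. Paths and the helper name
# are placeholders, not the real evolven3fit API.
from pathlib import Path
from typing import Optional


def find_theory_eko(theories_root: Path, theory_id: int) -> Optional[Path]:
    """Return the eko stored with the theory, if it exists."""
    candidate = theories_root / f"theory_{theory_id}" / "eko.tar"  # assumed location
    return candidate if candidate.exists() else None


eko_path = find_theory_eko(Path("theories"), 200)
if eko_path is None:
    # no eko inside the theory: keep the old behaviour (compute or download it)
    print("No eko found inside the theory, falling back to the previous behaviour")
```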

@andreab1997 changed the base branch from master to evolfit_w_eko on September 16, 2022 09:09
andreab1997 and others added 2 commits September 16, 2022 11:12
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
@scarlehoff added the run-fit-bot label (Starts fit bot from a PR) on Sep 16, 2022
@scarlehoff (Member)

@felixhekhorn any idea how to get the memory consumption of the convolution down? I guess in this case it is ok even if it is a bit slower.

@Zaharid (Contributor) commented Sep 16, 2022

I get the impression a lot of the things here should go in loader.py and be generally accessible from vp as providers.

@felixhekhorn (Contributor)

> @felixhekhorn any idea how to get the memory consumption of the convolution down? I guess in this case it is ok even if it is a bit slower.

This exact problem should be addressed here: NNPDF/eko#105, which allows loading a single evolution operator (for a fixed final scale) at a time, instead of loading all of them right at the beginning (as is done at the moment)

(Unfortunately @alecandido is quite busy at the moment so there is not much progress - and we want to delay this until after the final acceptance of the eko paper)

@scarlehoff (Member) commented Sep 16, 2022

> I get the impression a lot of the things here should go in loader.py and be generally accessible from vp as providers.

We can later on have a validphys provider that provides the eko from the theory, but let's first focus on having a usable implementation. This is not less integrated in vp than any of the previous evolution scripts.

> This exact problem should be addressed here: NNPDF/eko#105, which allows loading a single evolution operator (for a fixed final scale) at a time, instead of loading all of them right at the beginning (as is done at the moment)

I'm confused, how would this address the problem? Here we only have one initial scale but many final scales, so it is the opposite situation.

(Are there not any other tricks that we can play? I believe this might only be going beyond the limit of 7 GB by a bit, so any reduction would help)

@felixhekhorn (Contributor)

> I'm confused, how would this address the problem? Here we only have one initial scale but many final scales, so it is the opposite situation.

I believe it is what you want: say you have two final scales 10 and 100

  • at the moment eko will load both dim-4 objects, keep them in memory, do the algebra on the first and then on the second, and only then unload - effectively a dim-5 object is in memory all the time
  • while with the PR you first load one dim-4 object (say 10), do the algebra, unload the operator, then load the second dim-4 object, do the algebra, and unload the operator - so there is only a single dim-4 operator in memory at a time (see the sketch below)
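A toy contrast between the two strategies just described, assuming a made-up layout with one .npy file per final scale and a (flavour, x) PDF array; the file layout and the einsum contraction are illustrative only, not the real eko storage format or API.

```python
# Toy sketch: keep-everything-in-memory vs load-one-operator-at-a-time.
from pathlib import Path

import numpy as np


def evolve_eager(opdir: Path, pdf_init: np.ndarray) -> dict:
    """Current behaviour: load every dim-4 operator up front, then do the algebra."""
    ops = {float(f.stem): np.load(f) for f in opdir.glob("*.npy")}  # all operators in memory
    return {q2: np.einsum("aibj,bj->ai", op, pdf_init) for q2, op in ops.items()}


def evolve_lazy(opdir: Path, pdf_init: np.ndarray) -> dict:
    """NNPDF/eko#105-style behaviour: one dim-4 operator in memory at a time."""
    evolved = {}
    for f in sorted(opdir.glob("*.npy")):
        op = np.load(f)                                    # load a single operator
        evolved[float(f.stem)] = np.einsum("aibj,bj->ai", op, pdf_init)
        del op                                             # drop it before loading the next one
    return evolved
```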

@scarlehoff (Member)

> while with the PR you first load one dim-4 object (say 10), do the algebra, unload the operator, then load the second dim-4 object, do the algebra, and unload the operator - so there is only a single dim-4 operator in memory at a time

Then for the evolution from 1.65 to 1e5 with 50 steps in Q, do we have all 50 operators in memory at any given time?

@felixhekhorn (Contributor)

> while with the PR you first load one dim-4 object (say 10), do the algebra, unload the operator, then load the second dim-4 object, do the algebra, and unload the operator - so there is only a single dim-4 operator in memory at a time

> Then for the evolution from 1.65 to 1e5 with 50 steps in Q, do we have all 50 operators in memory at any given time?

in current master, yes - after NNPDF/eko#105 becomes available, no

@Zaharid (Contributor) commented Sep 16, 2022

> We can later on have a validphys provider that provides the eko from the theory, but let's first focus on having a usable implementation. This is not less integrated in vp than any of the previous evolution scripts.

I guess the difference is that here we are already using things like the loader, albeit in a cumbersome way. So since it is being worked on, we might as well do it more nicely.

@scarlehoff (Member)

Implementing things inside validphys and using things from validphys are two different things.

In the particular case you mention (the loader, for the theory) it is not even clear how it should be used (see the chicken-and-egg problem in the other PR), since it would try to download a theory that hasn't been built yet!

@scarlehoff (Member)

@andreab1997 @felixhekhorn @alecandido I've noticed it does go beyond 7 GB (getting all the way to 13 GB on my computer), but only for a few moments, and then it quickly goes back to 5 GB, where it stays.

Do you guys have any idea why this peak could be produced? Maybe there is a way to remove that peak easily.
Otherwise I'd say we have to put this on hold until that is fixed so it can be used.

@alecandido (Member)

With yadism I knew the cause: there was a point at which the grid was dumped to PineAPPL. But since the operation is performed by PineAPPLpy, the cross sections are first copied in memory into the new data structure and then dumped.
So there is a point at which they are doubled in memory, but right after the dump the Grid is dropped from memory, so it is just a spike.

I guess something similar might happen for EKO, most likely with YAML. It will be solved by the usual NNPDF/eko#105
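A toy illustration of the "copy, then dump" spike described above: the serialised (and compressed) copies of an operator coexist in memory with the original array until the write finishes. The file name and the lz4.frame usage are illustrative, not the exact eko code path.

```python
# Toy version of the transient doubling during a dump: extra copies of the
# operator live in memory only until the file is written.
import io

import lz4.frame
import numpy as np

op = np.ones((14, 50, 14, 50))              # a dim-4 operator, ~4 MB of float64
buf = io.BytesIO()
np.save(buf, op)                            # serialised copy, roughly another 4 MB
blob = lz4.frame.compress(buf.getvalue())   # compressed copy, alive only briefly
with open("operator.npy.lz4", "wb") as fd:
    fd.write(blob)
del buf, blob                               # after the dump, only `op` remains
```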

@scarlehoff (Member) commented Sep 19, 2022

> I guess something similar might happen for EKO, most likely with YAML. It will be solved by the usual NNPDF/eko#105

Ok, as soon as the computation of the operators for all theories finishes I'll try installing that branch to check whether the spike disappears. That branch is only waiting for the paper publication, right?

@alecandido (Member) commented Sep 19, 2022

> Ok, as soon as the computation of the operators for all theories finishes I'll try installing that branch to check whether the spike disappears. That branch is only waiting for the paper publication, right?

Nope, it is definitely more complicated.
That branch is only the first step: it introduces the new data structure, massively changing everything, but it doesn't use it. So the problem would be exactly the same.

(The new data structure manages an EKO partially on disk, i.e. it can offload operators as soon as they are computed. But the computation part still uses the huge dictionary with all the operators in memory...)

@alecandido (Member) commented Sep 19, 2022

You can track full progress in NNPDF/eko#138

However, the moment we merge that one, we will immediately write the new "runner". Most likely it will not take too long, since most of the operations will be exactly the same; we just need to save each computed piece right after computation :)

@scarlehoff (Member)

What do you mean by runner?
As far as I can see the spike is contained here: https://github.com/N3PDF/eko/blob/01efee1d27d215e895dd055debff8905af3d548c/src/eko/output.py#L379

I wonder whether there's a way of rewriting this function so that it is a bit less heavy in memory. If not, we just wait for NNPDF/eko#138

@felixhekhorn (Contributor)

> As far as I can see the spike is contained here: https://github.com/N3PDF/eko/blob/01efee1d27d215e895dd055debff8905af3d548c/src/eko/output.py#L379

exactly - that's the very function we're trying to replace; to be more precise, we want to avoid the loop for fp in innerdir.glob("*.npy.lz4"):

> I wonder whether there's a way of rewriting this function so that it is a bit less heavy in memory. If not, we just wait for N3PDF/eko#138

I'm afraid we need the full PR (and as you can see that one is highly non-trivial)

> What do you mean by runner?

It's the place where the actual computation happens (meaning where everything is glued together). However, I guess in your case the concern is more about what is currently called evolve_pdfs (which at the moment calls load_tar) - is that correct? Are you worried about computing the operator or computing the PDF with the operator?

@alecandido (Member)

You are right, the spike is there because it is (most likely) doubling the representation. But the idea is that you'd never worry about the doubling if the object you want to save is small enough in the first place.

The current "runner" (the object running the full computation and storage) first computes all the operators and then saves them all at once.
The goal would be to replace this two-step process, breaking the computation into pieces, i.e. "compute->store->compute->store->compute->..." (sketched below).
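A schematic sketch of the two runners under discussion; compute_operator is a stub standing in for the actual (expensive) eko computation, and the per-scale .npy files are only a stand-in for the real storage format.

```python
# Schematic contrast: compute-everything-then-store vs compute->store interleaved.
from pathlib import Path

import numpy as np


def compute_operator(q2: float) -> np.ndarray:
    """Placeholder for the actual operator computation at a final scale q2."""
    return np.zeros((14, 50, 14, 50))


def run_all_then_store(q2grid, outdir: Path) -> None:
    """Current runner: compute all operators, keep them all, dump at the end."""
    operators = {q2: compute_operator(q2) for q2 in q2grid}  # full EKO in memory here
    for q2, op in operators.items():
        np.save(outdir / f"{q2}.npy", op)


def run_compute_then_store(q2grid, outdir: Path) -> None:
    """Proposed runner: compute -> store -> compute -> store -> ..."""
    for q2 in q2grid:
        op = compute_operator(q2)             # only one operator alive at a time
        np.save(outdir / f"{q2}.npy", op)     # offload to disk immediately
        del op
```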

@scarlehoff (Member)

> It's the place where the actual computation happens (meaning where everything is glued together). However, I guess in your case the concern is more about what is currently called evolve_pdfs (which at the moment calls load_tar) - is that correct? Are you worried about computing the operator or computing the PDF with the operator?

No, the spike seems to happen - at least on montblanc - during the direct call to load_tar

eko_op = output.Output.load_tar(theory_eko_path)
and it is finished after that.

@alecandido (Member) commented Sep 19, 2022

Ok, sorry. Then it's just the complementary problem.

But, actually, if you call load_tar a net increase in memory usage is expected: you are loading an object into memory...
Moreover, streams used for loading the operators should be dropped immediately by the garbage collector.

Can you try to explicitly erase the stream? (just manually modify the package in your environment)
Or just try to profile the function (see the sketch below)...
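A minimal, standard-library-only way to profile the load_tar call quoted above; the theory_eko_path value here is a placeholder. tracemalloc reports the peak allocation seen during the call (numpy data allocations are reported to it as well).

```python
# Minimal memory profiling of the load_tar call, using only the standard library.
import tracemalloc

from eko import output

theory_eko_path = "theory_200/eko.tar"  # placeholder: use the actual path from the fit
tracemalloc.start()
eko_op = output.Output.load_tar(theory_eko_path)
current, peak = tracemalloc.get_traced_memory()
print(f"after load_tar: {current / 1e9:.2f} GB held, {peak / 1e9:.2f} GB peak")
tracemalloc.stop()
```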

@scarlehoff (Member) commented Sep 19, 2022

> Moreover, streams used for loading the operators should be dropped immediately by the garbage collector.

They are, once the function is finished. It is really during the function lifetime that the spike occurs.

I'll play around with it once all ekos are done.

Of course, it might well be a giant red herring since I don't know what actually killed the CI, but I think there's a good chance it was this.

@alecandido (Member)

> They are, once the function is finished. It is really during the function lifetime that the spike occurs.

Nope, I mean they should be dropped during function execution: there is a single variable for all the streams, so before loading a new one the former should go out of scope, and thus be dropped.
Then, the exact moment at which it gets dropped is up to the Python gc, but I'd be really surprised if it only did that at the end of the function... (a toy check is sketched below)
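A toy check of that claim (generic CPython behaviour, not eko-specific): rebinding a single name inside a loop releases the previous object right away, so the peak stays around one or two arrays' worth rather than the whole set.

```python
# Toy check: rebinding one name in a loop drops the previous object at once
# under CPython reference counting; the peak stays near one or two arrays.
import tracemalloc

import numpy as np

tracemalloc.start()
for _ in range(5):
    stream = np.ones((1000, 1000))   # ~8 MB; the previous array is freed on rebind
_, peak = tracemalloc.get_traced_memory()
print(f"peak: {peak / 1e6:.1f} MB")  # roughly 8-16 MB, not 5 x 8 MB
tracemalloc.stop()
```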

@scarlehoff removed the run-fit-bot label (Starts fit bot from a PR) on Sep 21, 2022
@scarlehoff (Member)

While this cannot be used as the default evolution code for the PDF, I think it would still be good to have it,

so I would merge this into the other branch (because it works) and then try to 1. keep the other branch as close to master as possible and 2. maybe even change the name to evolven3fit_eko so that if someone wants to use it they can...

@andreab1997 (Contributor, Author)

> While this cannot be used as the default evolution code for the PDF, I think it would still be good to have it,
>
> so I would merge this into the other branch (because it works) and then try to 1. keep the other branch as close to master as possible and 2. maybe even change the name to evolven3fit_eko so that if someone wants to use it they can...

Ok I am merging this. Thank you!

@andreab1997 merged commit 68cbe3f into evolfit_w_eko on Sep 21, 2022
@andreab1997 deleted the eko_from_theory branch on September 21, 2022 13:41