Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output metadata #162

Closed
wants to merge 29 commits into from
Closed

Output metadata #162

wants to merge 29 commits into from

Conversation

andreab1997
Copy link
Contributor

We want to have a file called metadata.yaml in which we can store the relevant metadata for the operators as the targetgrid, the inputgrid, the targetpids and the inputpids.

Moreover we want to move the operator_card and the theory_card inside a folder log in which we will also store all the logs of the functions acting on the operator, in order to ensure reproducibility. @felixhekhorn @alecandido

@felixhekhorn felixhekhorn added enhancement New feature or request output Output format and management labels Nov 10, 2022
@@ -175,5 +186,9 @@ def to_evol(eko: EKO, source: bool = True, target: bool = False):
# assign pids
if source:
eko.rotations._inputpids = br.evol_basis_pids
# update metadata
eko.update_metadata({"rotations": {"inputpids": inputpids}})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is in contradiction with the line above ... actually the problem is more complicated: because for to_evol we were putting the saved inputpids back to a list - instead the inputpids for for flavor_reshape is a matrix as it should. I think we can retain that behaviour but let's think for a moment ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I believe that in any case rotations.inputpids and rotations.targetpids should be matrices, not only for to_evol. The question is, do we want to store the matrix itself inside the file metadata or do we want to store the vector and then internally change the object (for example rotations.targetpids)? In the second case we can add something in __post__init__: Since we have both the vectors rotations.pids and rotations.targetpids as readed from metadata we can just compute the matrix and assign it to rotations.targetgrid in place of the vector. In this way, we will only store vectors, which I believe are more clear to read for an user, but then internally we have the matrices we need. Do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed yesterday, we want the matrix: it is uniform, also for non-standard bases, and always reliable.
The less special cases we have, the easier will be the code, and maintainability will improve.

ops = np.einsum("ajbk,bd->ajdk", ops, inv_inputpids)
errs = np.einsum("ajbk,bd->ajdk", errs, inv_inputpids)
else:
# update metadata
eko.update_metadata({"rotations": {"inputpids": inputpids}})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the call should move down to were the write is happening (else we get out of sync) .. note that we're casting to list (see comment below )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made sense: we want operations to be as atomic as possible (in the ACID sense)

@andreab1997
Copy link
Contributor Author

Some of the tests we are not passing now are failing because the metadata.yaml file is wrongly dumped. However, it is dumped in the exact same way as the operator_card.yaml and indeed even the latter is not working (if you take an eko and do eko.operator_card() it will fail). The problem seems to be related to the fact that yaml.dump() cannot dump numpy arrays. Were you aware of this problem? @felixhekhorn @alecandido

@alecandido
Copy link
Member

Some of the tests we are not passing now are failing because the metadata.yaml file is wrongly dumped. However, it is dumped in the exact same way as the operator_card.yaml and indeed even the latter is not working (if you take an eko and do eko.operator_card() it will fail). The problem seems to be related to the fact that yaml.dump() cannot dump numpy arrays. Were you aware of this problem? @felixhekhorn @alecandido

Problem is not dumping, but loading: if you dump them as Numpy arrays, it will (since YAML can dump custom objects). But then when you load the "custom object" is not matching any longer (because of the scope and stuffs like this), thus it will fail.

In any case, we want basic YAML, not custom objects, so you want to convert any array in a list of lists of lists...

if isinstance(value, np.ndarray):
value = value.tolist()

@andreab1997
Copy link
Contributor Author

Some of the tests we are not passing now are failing because the metadata.yaml file is wrongly dumped. However, it is dumped in the exact same way as the operator_card.yaml and indeed even the latter is not working (if you take an eko and do eko.operator_card() it will fail). The problem seems to be related to the fact that yaml.dump() cannot dump numpy arrays. Were you aware of this problem? @felixhekhorn @alecandido

Problem is not dumping, but loading: if you dump them as Numpy arrays, it will (since YAML can dump custom objects). But then when you load the "custom object" is not matching any longer (because of the scope and stuffs like this), thus it will fail.

In any case, we want basic YAML, not custom objects, so you want to convert any array in a list of lists of lists...

if isinstance(value, np.ndarray):
value = value.tolist()

Yes indeed, this is what I am doing :)

@alecandido
Copy link
Member

alecandido commented Nov 15, 2022

Yes indeed, this is what I am doing :)

If you use .raw it should automatically do it, exactly for that piece of code I quoted.

Make sure you are using .raw, and in case you are already doing, try to debug why it is still not working.

@andreab1997
Copy link
Contributor Author

The solution that I pushed is very bad but I needed something working in order to test the rest. I will push a better solution soon. However, it seems that there is no way using the tarfile python package to add or modify a single file inside a tar file. So, in order to do that for metadata, I believe we need to write a function that creates another tar file with the same name and with the updated metadata file. Do you have better solutions?

@@ -11,7 +11,7 @@ def test_apply(self, fake_legacy, fake_pdf):
pdf_grid = apply.apply_pdf(o, fake_pdf)
assert len(pdf_grid) == len(fake_card["Q2grid"])
pdfs = pdf_grid[q2_out]["pdfs"]
assert list(pdfs.keys()) == o.rotations.targetpids
assert list(pdfs.keys()) == list(o.rotations.targetpids)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, you can also assert they are the same despite the iterables even without converting both to lists:

Suggested change
assert list(pdfs.keys()) == list(o.rotations.targetpids)
assert all(x == y for x, y in zip(pdfs.keys(), o.rotations.targetpids))

No special advantage here (you save some memory and time, but it is definitely negligible here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it back: it has the disadvantage of not checking all the elements in the longest iterable, in case the length is different, so you'd need to split into a separate check.

Suggested change
assert list(pdfs.keys()) == list(o.rotations.targetpids)
assert len(pdfs.keys()) == len(o.rotations.targetpids))
assert all(x == y for x, y in zip(pdfs.keys(), o.rotations.targetpids))

Either we write a function for this, to consistently use, or it is only more cumbersome (for a negligible advantage).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once more (I believe I told you somewhere else), .keys() is not required when you iterate on it, and list() is iterating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes i know but for the moment I wanted to keep the changes to the tests as minimal as possible. I will drop .keys now. :)

@andreab1997
Copy link
Contributor Author

andreab1997 commented Nov 16, 2022

So now the tests are working again. This does not mean that we can merge unfortunately. I still need to do a couple of things:

  • update the metadata file after any rotation
  • Read the metadata file in order to assign the correct value to eko.rotations when an eko is loaded.
  • Construct the operator_card class in order to use its method raw before dumping
  • change name for inputpids to something which hints to a matrix

If you believe there is something else, please add here.

@codecov-commenter
Copy link

codecov-commenter commented Nov 29, 2022

Codecov Report

❗ No coverage uploaded for pull request base (Fix_struct_detached@b154418). Click here to learn what that means.
The diff coverage is n/a.

Additional details and impacted files

Impacted file tree graph

@@                  Coverage Diff                   @@
##             Fix_struct_detached     #162   +/-   ##
======================================================
  Coverage                       ?   99.93%           
======================================================
  Files                          ?       96           
  Lines                          ?     4515           
  Branches                       ?        0           
======================================================
  Hits                           ?     4512           
  Misses                         ?        3           
  Partials                       ?        0           
Flag Coverage Δ
unittests 99.93% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

@andreab1997
Copy link
Contributor Author

@alecandido @felixhekhorn This should now be roughly what we want. I did not tick the second point of the todo list ("Read the metadata file in order to assign the correct value to eko.rotations when an eko is loaded.") yet because I am still checking if everything is correct but I believe it is already working. Then I will only need to change the name of "inputpids" to something more telling but that should be trivial. Let me know if you have suggestions.

PS: At the end (thanks to @alecandido's help) I managed to update the metadata file without the other PR (#152)

theory,
op,
nt,
no,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know they mean new_theory_card and new_operator_card, but it is a bit cryptic.
Please, favor more telling names. Trade-offs are best:

  • nt - too short, ambiguous
  • new_theory_card_evolve_single_member - redundant, it is already in the scope of evolve_single_member
  • ✔️ new_theory_card - just perfect :)

The minimal amount of information that saves you an explicit comment is the optimal choice, according to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually copied the name that was already there. Should I change also the others nt? (and similar for no)

Copy link
Member

@alecandido alecandido Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not top priority, but I believe we should constantly improve also these details, they make it simpler for new developers and ourselves to maintain.

If you can improve something is already better than nothing, as I said it is not high-priority, don't waste too much time. But at least change this occurrence, such that the new code has better and better style :)

@@ -61,6 +61,8 @@ def xgrid_reshape(
)
target_rot = b.get_interpolation(targetgrid.raw)
eko.rotations._targetgrid = targetgrid
# update metadata
eko.update_metadata({"rotations": {"_targetgrid": targetgrid}})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you need the underscore?

I know the attribute is that one, the other being a property, but it'd look nicer without, for manual inspection.
At the end of the day, the initial underscore discriminate between the virtual property, and the actual representation, but when we serialize there is only the representation and nothing more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that at the end the metadata file should be read and used to construct a Rotations object. In order to do that it expects _targetgrid and similar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was proposing to change what Rotations expects, but this we can discuss together, it's easier :)

@@ -91,7 +92,7 @@ def raw(self):
elif isinstance(value, float):
value = float(value)
elif isinstance(value, interpolation.XGrid):
value = value.dump()
value = value.dump()["grid"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, this is weird: the serialization of XGrid should be fully managed by the class itself.

So, we should always save what .dump() returns, and XGrid itself should be able to be loaded from the whole return value of a .dump().
If this is not happening, we should fix on XGrid side, not here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed, the point is that dump() return a dictionary like {'grid':[...], 'log':[...]} where the actual grid is inside grid. For the tests, sometimes the log part is used and sometimes it is not (and this is causing errors). So I was not sure on how to fix

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure that the log part is always used, or ignored, on the consumer side

try:
return f"o{operators_card['hash'][:6]}_t{theory_card['hash'][:6]}.tar"
except KeyError:
return "o000000_t000000.tar"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, where does this hard-coded string come from?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe re-raising a more explicit error would be more appropriate...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that in the new format we don't have hash anymore. So, even in this case, we need to choose how to fix. This was a temporary solution just to have tests working but of course we need to make a decision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashes are needed for ekomark, but inside ekobox there should be no need for them, nor we should expose them to the user.

Let's keep them for the benchmarks, where they are actually used, and completely drop this function here.

@@ -130,7 +130,9 @@ class TestEKO:
def _default_cards(self):
t = tc.generate(0, 1.0)
o = oc.generate([10.0])
return compatibility.update(t, o)
nt, no = compatibility.update(t, o)
no["rotations"]["pids"] = no["rotations"]["targetpids"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we put this in compatibility.update?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I don't know, I believe it is something very specific to this test. Most of the case we do not need this because [rotations][pids] will be already there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then maybe upgrade default cards, such that we don't need it also here :)

In the end, default cards are for the users (as everything else in ekobox), we don't want to generate incompatible ones :)

Co-authored-by: Alessandro Candido <candido.ale@gmail.com>
@felixhekhorn
Copy link
Contributor

Benchmarks are broken:

$ poe lha
Poe => python benchmarks/lha_paper_bench.py
 ──────────────────────────────────────── 
  Theories: 1 OCards: 1 PDFs: 1 ext: LHA  
 ──────────────────────────────────────── 
Computing for theory=fb3f558, ocard=7f27498 and pdf=ToyLH ...
Traceback (most recent call last):
  File "/home/felix/Physik/N3PDF/EKO/eko/benchmarks/lha_paper_bench.py", line 226, in <module>
    obj.benchmark_plain(0)
  File "/home/felix/Physik/N3PDF/EKO/eko/benchmarks/lha_paper_bench.py", line 123, in benchmark_plain
    self.run_lha(self.plain_theory(pto))
  File "/home/felix/Physik/N3PDF/EKO/eko/benchmarks/lha_paper_bench.py", line 109, in run_lha
    self.run(
  File "/home/felix/.cache/pypoetry/virtualenvs/eko-KkPVjVhh-py3.10/lib/python3.10/site-packages/banana/benchmark/runner.py", line 416, in run
    self.run_config(session, t, o, pdf_name, use_replicas)
  File "/home/felix/.cache/pypoetry/virtualenvs/eko-KkPVjVhh-py3.10/lib/python3.10/site-packages/banana/benchmark/runner.py", line 304, in run_config
    me = self.run_me(t, o, pdf)
  File "/home/felix/Physik/N3PDF/EKO/eko/src/ekomark/benchmark/runner.py", line 119, in run_me
    out = eko.run_dglap(theory, ocard)
  File "/home/felix/Physik/N3PDF/EKO/eko/src/eko/__init__.py", line 29, in run_dglap
    r = runner.Runner(theory_card, operators_card)
  File "/home/felix/Physik/N3PDF/EKO/eko/src/eko/runner.py", line 50, in __init__
    new_theory, new_operators = compatibility.update(theory_card, operators_card)
  File "/home/felix/Physik/N3PDF/EKO/eko/src/eko/compatibility.py", line 57, in update
    new_operators["rotations"]["pids"] = operators["pids"]
KeyError: 'pids'

@felixhekhorn
Copy link
Contributor

you also need to dump version informations (i.e. both program and data) to the metadata (else we get the usual problems) (here the additional scope we added already becomes handy) . This is also relevant for a converter (e.g. the one in #171).

@alecandido
Copy link
Member

you also need to dump version informations (i.e. both program and data) to the metadata

Don't mind: this I'm doing in #172

@andreab1997 andreab1997 mentioned this pull request Dec 7, 2022
@alecandido alecandido mentioned this pull request Dec 7, 2022
20 tasks
@alecandido alecandido closed this Dec 14, 2022
@felixhekhorn felixhekhorn deleted the output_metadata branch January 5, 2023 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request output Output format and management
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants