Output metadata #162

andreab1997 · 2022-11-10T16:23:32Z

We want to have a file called metadata.yaml in which we can store the relevant metadata for the operators, such as the targetgrid, the inputgrid, the targetpids and the inputpids.

Moreover, we want to move the operator_card and the theory_card into a folder named log, in which we will also store the logs of all functions acting on the operator, in order to ensure reproducibility. @felixhekhorn @alecandido

felixhekhorn · 2022-11-14T17:31:33Z

src/eko/output/manipulate.py

@@ -175,5 +186,9 @@ def to_evol(eko: EKO, source: bool = True, target: bool = False):
    # assign pids
    if source:
        eko.rotations._inputpids = br.evol_basis_pids
+        # update metadata
+        eko.update_metadata({"rotations": {"inputpids": inputpids}})


This is in contradiction with the line above... Actually, the problem is more complicated: for to_evol we were turning the saved inputpids back into a list, whereas the inputpids for flavor_reshape is a matrix, as it should be. I think we can retain that behaviour, but let's think for a moment...

At this point I believe that rotations.inputpids and rotations.targetpids should be matrices in any case, not only for to_evol. The question is: do we want to store the matrix itself inside the metadata file, or do we want to store the vector and then change the object internally (for example rotations.targetpids)? In the second case we can add something in __post_init__: since we have both vectors, rotations.pids and rotations.targetpids, as read from the metadata, we can just compute the matrix and assign it to rotations.targetpids in place of the vector. In this way we only store vectors, which I believe are clearer to read for a user, while internally we have the matrices we need. Do you agree?

As we discussed yesterday, we want the matrix: it is uniform, also for non-standard bases, and always reliable.
The fewer special cases we have, the simpler the code will be, and maintainability will improve.
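
For illustration only, the "vector to matrix" step discussed above could be sketched as follows. This is a minimal pure-Python sketch under assumptions: the function name pids_to_matrix and the three-flavor toy basis are hypothetical and are not taken from eko.

```python
def pids_to_matrix(targetpids, pids):
    """Return one row per target PID, expressed in the basis given by ``pids``."""
    column = {pid: j for j, pid in enumerate(pids)}
    matrix = []
    for pid in targetpids:
        row = [0.0] * len(pids)
        row[column[pid]] = 1.0  # raises KeyError if pid is not part of the basis
        matrix.append(row)
    return matrix


# Toy basis (gluon, down, up), purely illustrative.
pids = [21, 1, 2]
matrix = pids_to_matrix([2, 21], pids)
```

Storing the matrix directly, as agreed in the comment above, avoids this reconstruction step and also covers bases whose elements are linear combinations of flavors.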

felixhekhorn · 2022-11-14T17:33:42Z

src/eko/output/manipulate.py

            ops = np.einsum("ajbk,bd->ajdk", ops, inv_inputpids)
            errs = np.einsum("ajbk,bd->ajdk", errs, inv_inputpids)
        else:
+            # update metadata
+            eko.update_metadata({"rotations": {"inputpids": inputpids}})


The call should move down to where the write is happening (otherwise we get out of sync). Note that we're casting to list (see comment below).

This made sense: we want operations to be as atomic as possible (in the ACID sense)

andreab1997 · 2022-11-15T10:31:41Z

Some of the tests are currently failing because the metadata.yaml file is dumped incorrectly. However, it is dumped in exactly the same way as operator_card.yaml, and indeed even the latter is not working (if you take an eko and call eko.operator_card(), it fails). The problem seems to be that yaml.dump() cannot dump numpy arrays. Were you aware of this problem? @felixhekhorn @alecandido

alecandido · 2022-11-15T10:58:57Z

The problem is not dumping, but loading: if you dump them as NumPy arrays, it works (since YAML can dump custom objects). But when you load, the "custom object" no longer matches (because of scoping and similar issues), so it fails.

In any case, we want basic YAML, not custom objects, so you want to convert any array into a list of lists of lists...

eko/src/eko/output/struct.py

Lines 89 to 90 in dcb40cb

    
    if isinstance(value, np.ndarray):
        value = value.tolist()
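
A minimal sketch of the same idea applied recursively, so that nested cards end up containing only builtin types. It is an illustration under assumptions: to_plain is a hypothetical helper, array.array stands in for a NumPy array (both provide .tolist()), and json is used only as a stand-in for a plain-data serializer.

```python
import json
from array import array


def to_plain(value):
    """Recursively convert array-likes and containers to builtin types."""
    if hasattr(value, "tolist"):  # numpy.ndarray and array.array both provide this
        return value.tolist()
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


# Hypothetical card content, only to exercise the helper.
card = {"rotations": {"inputpids": array("i", [21, 1, 2])}, "Q2grid": (10.0,)}
plain = to_plain(card)
text = json.dumps(plain)  # succeeds because only builtin types are left
```

After the conversion, loading the text back yields an object equal to the one that was dumped, which is the property the comment above asks for.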

andreab1997 · 2022-11-15T11:07:41Z

Yes indeed, this is what I am doing :)

alecandido · 2022-11-15T11:33:53Z

If you use .raw, it should do this automatically, thanks to exactly the piece of code I quoted.

Make sure you are using .raw and, if you already are, try to debug why it is still not working.

andreab1997 · 2022-11-15T13:05:31Z

The solution that I pushed is very bad, but I needed something working in order to test the rest. I will push a better solution soon. However, it seems that the tarfile Python package offers no way to add or modify a single file inside a tar file. So, in order to do that for the metadata, I believe we need to write a function that creates another tar file with the same name and the updated metadata file. Do you have better solutions?
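
A minimal sketch of the "write a new tar file with the same name" approach described above, using only the standard library. The helper name replace_member is hypothetical and this is not the solution that was eventually pushed. Note that tarfile's append mode ("a") can add an entry to an uncompressed archive, but it does not remove the stale one, which is why the archive is rewritten here.

```python
import io
import os
import tarfile
import tempfile


def replace_member(tar_path, member_name, new_content: bytes):
    """Rewrite ``tar_path`` so that ``member_name`` holds ``new_content``."""
    directory = os.path.dirname(os.path.abspath(tar_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tar", dir=directory)
    os.close(fd)
    try:
        with tarfile.open(tar_path, "r") as src, tarfile.open(tmp_path, "w") as dst:
            for member in src.getmembers():
                if member.name == member_name:
                    continue  # skip the stale copy
                fileobj = src.extractfile(member) if member.isfile() else None
                dst.addfile(member, fileobj)
            info = tarfile.TarInfo(name=member_name)
            info.size = len(new_content)
            dst.addfile(info, io.BytesIO(new_content))
        os.replace(tmp_path, tar_path)  # swap in the new archive in one step
    except BaseException:
        os.remove(tmp_path)
        raise
```

Writing to a temporary file in the same directory and swapping it in at the end means a failure halfway through leaves the original archive untouched.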

alecandido · 2022-11-16T11:14:21Z

tests/ekomark/test_apply.py

@@ -11,7 +11,7 @@ def test_apply(self, fake_legacy, fake_pdf):
        pdf_grid = apply.apply_pdf(o, fake_pdf)
        assert len(pdf_grid) == len(fake_card["Q2grid"])
        pdfs = pdf_grid[q2_out]["pdfs"]
-        assert list(pdfs.keys()) == o.rotations.targetpids
+        assert list(pdfs.keys()) == list(o.rotations.targetpids)


Just in case: you can also assert that they are the same, regardless of the iterable types, without converting both to lists:

Suggested change

-        assert list(pdfs.keys()) == list(o.rotations.targetpids)
+        assert all(x == y for x, y in zip(pdfs.keys(), o.rotations.targetpids))

No special advantage here (you save some memory and time, but it is definitely negligible here).

I take it back: it has the disadvantage of not checking all the elements in the longest iterable, in case the length is different, so you'd need to split into a separate check.

Suggested change

-        assert list(pdfs.keys()) == list(o.rotations.targetpids)
+        assert len(pdfs.keys()) == len(o.rotations.targetpids)
+        assert all(x == y for x, y in zip(pdfs.keys(), o.rotations.targetpids))

Either we write a function for this and use it consistently, or it is only more cumbersome (for a negligible advantage).

Once more (I believe I told you somewhere else), .keys() is not required when you iterate on it, and list() is iterating.

Yes, I know, but for the moment I wanted to keep the changes to the tests as minimal as possible. I will drop .keys now. :)
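
The "write a function for this" option mentioned in this thread could look like the sketch below. The name same_elements is hypothetical; the helper checks length and order in one pass by padding the shorter iterable with a sentinel. On Python 3.10+, zip(..., strict=True) is an alternative, but it raises ValueError on a length mismatch instead of returning False.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel that never equals a real element


def same_elements(left, right):
    """True if both iterables yield equal elements, in the same order and number."""
    return all(
        x == y for x, y in zip_longest(left, right, fillvalue=_MISSING)
    )


# Iterating a dict yields its keys, so .keys() is not needed.
pdfs = {21: "g", 1: "d"}
ok = same_elements(pdfs, [21, 1])
too_long = same_elements(pdfs, [21, 1, 2])
```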

andreab1997 · 2022-11-16T11:23:06Z

So now the tests are working again. Unfortunately, this does not mean that we can merge. I still need to do a couple of things:

- Update the metadata file after any rotation.
- Read the metadata file in order to assign the correct value to eko.rotations when an eko is loaded.
- Construct the operator_card class in order to use its raw method before dumping.
- Rename inputpids to something that hints at a matrix.

If you believe there is something else, please add here.

codecov-commenter · 2022-11-29T15:37:18Z

Codecov Report

❗ No coverage uploaded for pull request base (Fix_struct_detached@b154418).
The diff coverage is n/a.

Additional details and impacted files

@@                  Coverage Diff                   @@
##             Fix_struct_detached     #162   +/-   ##
======================================================
  Coverage                       ?   99.93%           
======================================================
  Files                          ?       96           
  Lines                          ?     4515           
  Branches                       ?        0           
======================================================
  Hits                           ?     4512           
  Misses                         ?        3           
  Partials                       ?        0

Flag        Coverage Δ
unittests   99.93% <0.00%> (?)

andreab1997 · 2022-11-29T15:59:06Z

@alecandido @felixhekhorn This should now be roughly what we want. I did not tick the second point of the todo list ("Read the metadata file in order to assign the correct value to eko.rotations when an eko is loaded.") yet because I am still checking if everything is correct but I believe it is already working. Then I will only need to change the name of "inputpids" to something more telling but that should be trivial. Let me know if you have suggestions.

PS: In the end (thanks to @alecandido's help) I managed to update the metadata file without the other PR (#152).

alecandido · 2022-12-01T08:07:19Z

benchmarks/ekobox/benchmark_evol_pdf.py

-            theory,
-            op,
+            nt,
+            no,


I know they mean new_theory_card and new_operator_card, but it is a bit cryptic.
Please, favor more telling names. Trade-offs are best:

❌ nt - too short, ambiguous

❌ new_theory_card_evolve_single_member - redundant, it is already in the scope of evolve_single_member

✔️ new_theory_card - just perfect :)

The minimal amount of information that saves you an explicit comment is the optimal choice, in my opinion.

I actually copied the name that was already there. Should I also change the other occurrences of nt (and similarly for no)?

Not top priority, but I believe we should constantly improve these details too: they make maintenance simpler for new developers and for ourselves.

Improving something is already better than nothing; as I said, it is not high priority, so don't spend too much time on it. But at least change this occurrence, so that the new code has better and better style :)

alecandido · 2022-12-01T08:11:18Z

src/eko/output/manipulate.py

@@ -61,6 +61,8 @@ def xgrid_reshape(
        )
        target_rot = b.get_interpolation(targetgrid.raw)
        eko.rotations._targetgrid = targetgrid
+        # update metadata
+        eko.update_metadata({"rotations": {"_targetgrid": targetgrid}})


Are you sure you need the underscore?

I know the attribute is that one, the other being a property, but it would look nicer without it, for manual inspection.
At the end of the day, the initial underscore discriminates between the virtual property and the actual representation, but when we serialize there is only the representation and nothing more.

The point is that, in the end, the metadata file should be read and used to construct a Rotations object. In order to do that, it expects _targetgrid and similar.

Indeed, I was proposing to change what Rotations expects, but this we can discuss together, it's easier :)
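
One way the proposal above could work, sketched under assumptions: RotationsSketch and its two methods are hypothetical stand-ins (not the eko Rotations class) showing how underscore-prefixed attributes can be written and read back with clean keys.

```python
from dataclasses import dataclass, fields


@dataclass
class RotationsSketch:  # hypothetical stand-in, not the eko class
    _targetgrid: list
    _inputgrid: list

    @property
    def targetgrid(self):
        """Public view on the private attribute."""
        return self._targetgrid

    def to_public_dict(self):
        """Serialize with the leading underscore stripped from field names."""
        return {f.name.lstrip("_"): getattr(self, f.name) for f in fields(self)}

    @classmethod
    def from_public_dict(cls, data):
        """Rebuild the object from underscore-free keys."""
        return cls(**{f"_{key}": value for key, value in data.items()})


rot = RotationsSketch([0.1, 1.0], [0.5, 1.0])
public = rot.to_public_dict()
```

The serialized file then shows targetgrid rather than _targetgrid, while the class keeps its internal naming.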

alecandido · 2022-12-01T08:13:22Z

src/eko/output/struct.py

@@ -91,7 +92,7 @@ def raw(self):
            elif isinstance(value, float):
                value = float(value)
            elif isinstance(value, interpolation.XGrid):
-                value = value.dump()
+                value = value.dump()["grid"]


Uhm, this is weird: the serialization of XGrid should be fully managed by the class itself.

So, we should always save what .dump() returns, and XGrid itself should be able to be loaded from the whole return value of a .dump().
If this is not happening, we should fix it on the XGrid side, not here.

Yes indeed, the point is that dump() returns a dictionary like {'grid': [...], 'log': [...]}, where the actual grid is under the grid key. In the tests, the log part is sometimes used and sometimes not (and this is causing errors), so I was not sure how to fix it.

We should make sure that the log part is consistently either used or ignored on the consumer side.
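
The round-trip property requested in this thread can be sketched as follows. GridSketch is a hypothetical stand-in and not the eko XGrid; the log entry is treated as an opaque value, since its exact type is not stated here.

```python
class GridSketch:
    """Hypothetical stand-in for an XGrid-like class; not the eko implementation."""

    def __init__(self, grid, log=None):
        self.grid = list(grid)
        self.log = log

    def dump(self):
        """Return the full serializable representation."""
        return {"grid": self.grid, "log": self.log}

    @classmethod
    def load(cls, data):
        # Accept exactly what dump() produced, so callers never pick keys apart.
        return cls(grid=data["grid"], log=data["log"])

    def __eq__(self, other):
        return isinstance(other, GridSketch) and self.dump() == other.dump()


original = GridSketch([0.1, 0.5, 1.0], log=True)
restored = GridSketch.load(original.dump())
```

With this contract, code such as raw() can store the whole return value of dump() and never needs to index into it.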

alecandido · 2022-12-01T08:20:45Z

src/ekobox/evol_pdf.py

+    try:
+        return f"o{operators_card['hash'][:6]}_t{theory_card['hash'][:6]}.tar"
+    except KeyError:
+        return "o000000_t000000.tar"


Wait, where does this hard-coded string come from?

Maybe re-raising a more explicit error would be more appropriate...

The problem is that in the new format we don't have the hash anymore. So, in this case too, we need to choose how to fix it. This was a temporary solution just to get the tests working, but of course we need to make a decision.

Hashes are needed for ekomark, but inside ekobox there should be no need for them, nor should we expose them to the user.

Let's keep them for the benchmarks, where they are actually used, and completely drop this function here.
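
For reference, the "re-raise a more explicit error" alternative raised earlier in this thread could look like the sketch below. It is illustrative only, since the conclusion here is to drop the function from ekobox; the name eko_identifier and the error message are hypothetical.

```python
def eko_identifier(operators_card, theory_card):
    """Build a file name from the card hashes, failing loudly if one is missing."""
    try:
        return f"o{operators_card['hash'][:6]}_t{theory_card['hash'][:6]}.tar"
    except KeyError as err:
        # Chain the original KeyError so the missing key stays visible.
        raise ValueError(
            "both cards need a 'hash' entry to build the output file name"
        ) from err
```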

alecandido · 2022-12-01T08:22:55Z

tests/eko/test_output_struct.py

@@ -130,7 +130,9 @@ class TestEKO:
    def _default_cards(self):
        t = tc.generate(0, 1.0)
        o = oc.generate([10.0])
-        return compatibility.update(t, o)
+        nt, no = compatibility.update(t, o)
+        no["rotations"]["pids"] = no["rotations"]["targetpids"]


Shouldn't we put this in compatibility.update?

This I don't know; I believe it is something very specific to this test. In most cases we do not need this because [rotations][pids] will already be there.

Ok, then maybe upgrade the default cards, so that we don't need it here either :)

In the end, default cards are for the users (as everything else in ekobox), we don't want to generate incompatible ones :)

felixhekhorn · 2022-12-06T11:55:47Z

Benchmarks are broken:

$ poe lha
Poe => python benchmarks/lha_paper_bench.py
 ──────────────────────────────────────── 
  Theories: 1 OCards: 1 PDFs: 1 ext: LHA  
 ──────────────────────────────────────── 
Computing for theory=fb3f558, ocard=7f27498 and pdf=ToyLH ...
Traceback (most recent call last):
  File "/home/felix/Physik/N3PDF/EKO/eko/benchmarks/lha_paper_bench.py", line 226, in <module>
    obj.benchmark_plain(0)
  File "/home/felix/Physik/N3PDF/EKO/eko/benchmarks/lha_paper_bench.py", line 123, in benchmark_plain
    self.run_lha(self.plain_theory(pto))
  File "/home/felix/Physik/N3PDF/EKO/eko/benchmarks/lha_paper_bench.py", line 109, in run_lha
    self.run(
  File "/home/felix/.cache/pypoetry/virtualenvs/eko-KkPVjVhh-py3.10/lib/python3.10/site-packages/banana/benchmark/runner.py", line 416, in run
    self.run_config(session, t, o, pdf_name, use_replicas)
  File "/home/felix/.cache/pypoetry/virtualenvs/eko-KkPVjVhh-py3.10/lib/python3.10/site-packages/banana/benchmark/runner.py", line 304, in run_config
    me = self.run_me(t, o, pdf)
  File "/home/felix/Physik/N3PDF/EKO/eko/src/ekomark/benchmark/runner.py", line 119, in run_me
    out = eko.run_dglap(theory, ocard)
  File "/home/felix/Physik/N3PDF/EKO/eko/src/eko/__init__.py", line 29, in run_dglap
    r = runner.Runner(theory_card, operators_card)
  File "/home/felix/Physik/N3PDF/EKO/eko/src/eko/runner.py", line 50, in __init__
    new_theory, new_operators = compatibility.update(theory_card, operators_card)
  File "/home/felix/Physik/N3PDF/EKO/eko/src/eko/compatibility.py", line 57, in update
    new_operators["rotations"]["pids"] = operators["pids"]
KeyError: 'pids'
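
The traceback shows the update step assuming a top-level pids key. A tolerant variant could be sketched as below; this is a hypothetical illustration (move_pids is not an eko function, and whether the top-level key should be removed is an assumption), not the fix that was adopted.

```python
import copy


def move_pids(operators):
    """Return a copy of the card with ``pids`` stored under ``rotations``."""
    new_operators = copy.deepcopy(operators)
    rotations = new_operators.setdefault("rotations", {})
    if "pids" in new_operators:
        # Legacy layout: the key sits at the top level.
        rotations["pids"] = new_operators.pop("pids")
    elif "pids" not in rotations:
        raise KeyError("'pids' found neither at top level nor under 'rotations'")
    return new_operators
```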

felixhekhorn · 2022-12-06T13:55:48Z

You also need to dump version information (i.e. both program and data) to the metadata, otherwise we get the usual problems (here the additional scope we added already comes in handy). This is also relevant for a converter (e.g. the one in #171).

alecandido · 2022-12-06T15:42:21Z

you also need to dump version information (i.e. both program and data) to the metadata

Never mind: I'm doing this in #172

First implementation of metadata file

1e227c4

felixhekhorn added the enhancement (New feature or request) and output (Output format and management) labels Nov 10, 2022

andreab1997 added 10 commits November 14, 2022 14:04

Fix write_text

7172b48

Add method to access metadata

cf7d4c2

Add metadata to detached

fc43644

Add metadata argument

960b681

Fix names of objects inside metadata dict

5925d95

Drop if in detached

c43ff92

Add rotations key to metadata and implement function to update metadata

99ee17f

Update metadata after manipulation

a85fd19

Fix stream.len

230b79d

Fix stream.tell()

27d2f05

felixhekhorn reviewed Nov 14, 2022

andreab1997 added 3 commits November 15, 2022 10:41

Add docs

61c0abe

Cast to numpy array

d6a922b

Move casting to np array in detached

dcb40cb

First draft of solution for metadata file

9a43a41

andreab1997 added 6 commits November 15, 2022 14:17

Use __post_init__ even when rotations is called

76b66c9

Fix test_output and remove dump of update_metadata

d1449e9

Update pids after to_evol

fa7eacf

Fix test_output_struct

b47561e

Move update_metadata after

050addd

Store matrices for pids and fix test

c354d56

andreab1997 added 4 commits November 16, 2022 11:11

Fix test_runner

d8386b6

Fix ekobox and tests

b65dc37

Explicitly cast pids to list in op card generation

fbf9290

Fix test_apply

f6dbc2f

alecandido reviewed Nov 16, 2022

This was referenced Nov 17, 2022

Wrong predictions obtained for CHORUS datasets #164

Closed

Use temporary working directory for output #152

Closed

andreab1997 added 2 commits November 29, 2022 16:11

Create operator_card class

174b3f7

Fix benchmark of ekobox

34ddcde

Update metadata file after reshape

bfc6c32

andreab1997 requested review from felixhekhorn and alecandido November 29, 2022 15:59

alecandido requested changes Dec 1, 2022

Update src/eko/output/struct.py

f01fc19

Co-authored-by: Alessandro Candido <candido.ale@gmail.com>

andreab1997 mentioned this pull request Dec 7, 2022

Fix struct.detached #161

Closed

alecandido mentioned this pull request Dec 7, 2022

Use tempdir during execution #172

Merged

alecandido closed this Dec 14, 2022

alecandido added a commit that referenced this pull request Dec 22, 2022

Check metadata after reshaping operator, see #162

7583496

felixhekhorn deleted the output_metadata branch January 5, 2023 11:19
