Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dump FK Table comparison #8

Closed
felixhekhorn opened this issue Oct 20, 2021 · 19 comments · Fixed by #12
Closed

Dump FK Table comparison #8

felixhekhorn opened this issue Oct 20, 2021 · 19 comments · Fixed by #12

Comments

@felixhekhorn
Copy link
Contributor

  • rename metadata results to results_grid
  • add metadata results_fk with comparison w/ against w/o evolution

will need NNPDF/pineappl#74 to read metadata

@cschwan
Copy link
Contributor

cschwan commented Oct 20, 2021

* rename metadata `results` to  `results_grid`

What's the idea here?

@felixhekhorn
Copy link
Contributor Author

we can also just leave it - but then there will be results and results_fk so it's a bit inconsistent (since now there are two results ...)

@alecandido
Copy link
Member

What's the idea here?

The point is that in rr we store the comparison at the grid level, that is the main and only comparison there, but when you come to pineko the new one is the FkTable.

For sure we don't want to change rr because there will pineappl at some point, since rr is upstream, and independent of what happens afterwards. Then it makes sense to change it afterwards.

@cschwan
Copy link
Contributor

cschwan commented Oct 20, 2021

we can also just leave it - but then there will be results and results_fk so it's a bit inconsistent (since now there are two results ...)

I'd argue for that, because results are the true results (remember there are two entries MC and PineAPPL), and results_fk are the results with errors from the evolution. It also makes sense when you compare a grid at the process scale and an FK table; both results should be the same. And in the FK table there should be both, of course.

Moreover, results_fk should show a different table:

  • scale variations, which should be shown in results don't make any sense in results_fk
  • instead you should compare pre- and post-evolution results and just show the difference in per mille.

So the table should look as follows:

------------------------------------
    Pineko       PineAPPL   central
                             1/1000
------------------------------------
 7.218832e-01  7.218831e-01   0.0000
 7.142003e-01  7.142003e-01   0.0000
 7.928799e-01  7.928799e-01   0.0000
 7.850020e-01  7.850020e-01   0.0000
 7.728221e-01  7.728221e-01   0.0000
 8.654099e-01  8.654099e-01   0.0000

@alecandido
Copy link
Member

Fine for me, I was thinking that in some sense results is the main focus, and it is a bit weird for the FkTable. Nevertheless, we can keep also as it is, it's not going to hurt anything.

@felixhekhorn
Copy link
Contributor Author

So the table should look as follows:

and in fact, it does (quasi):

$ python -mpineko compare data/mydis.pineappl data/myfktable.mydis.pineappl.lz4  CT14llo_NF4
LHAPDF 6.3.0 loading /home/felix/local/share/LHAPDF/CT14llo_NF4/CT14llo_NF4_0000.dat
CT14llo_NF4 PDF set, member #0, version 1; LHAPDF ID = 13207
       O0 left    O0 right       O1 left      O1 right   PineAPPL    FkTable  percent_error
0    50.000000   50.000000  2.000000e-07  2.000000e-07  12.826575  12.826106      -0.003654
1    50.000000   50.000000  6.984209e-07  6.984209e-07   9.550602   9.549435      -0.012216
2    50.000000   50.000000  2.438943e-06  2.438943e-06   7.047279   7.046536      -0.010548
3    50.000000   50.000000  8.516807e-06  8.516807e-06   5.152585   5.152081      -0.009781
4    50.000000   50.000000  2.973850e-05  2.973850e-05   3.728779   3.728443      -0.008991
5    50.000000   50.000000  1.038117e-04  1.038117e-04   2.665475   2.665247      -0.008558
6    50.000000   50.000000  3.620545e-04  3.620545e-04   1.874623   1.874467      -0.008317
7    50.000000   50.000000  1.258680e-03  1.258680e-03   1.290291   1.290190      -0.007889
8    50.000000   50.000000  4.328501e-03  4.328501e-03   0.875369   0.875307      -0.007070
9    50.000000   50.000000  1.437507e-02  1.437507e-02   0.615639   0.615608      -0.005093
10   50.000000   50.000000  4.341492e-02  4.341492e-02   0.463775   0.463782       0.001509
11   50.000000   50.000000  1.091438e-01  1.091438e-01   0.361635   0.361633      -0.000531
12   50.000000   50.000000  2.195041e-01  2.195041e-01   0.288607   0.288606      -0.000214
13   50.000000   50.000000  3.668753e-01  3.668753e-01   0.195669   0.195669       0.000099
14   50.000000   50.000000  5.397572e-01  5.397572e-01   0.089400   0.089399      -0.001851
15   50.000000   50.000000  7.295868e-01  7.295868e-01   0.018904   0.018904       0.003388
16   50.000000   50.000000  9.309441e-01  9.309441e-01   0.000235   0.000236       0.515855
17   46.538363   46.538363  1.091438e-01  1.091438e-01   0.361349   0.361348      -0.000401
18   85.950595   85.950595  1.091438e-01  1.091438e-01   0.363406   0.363405      -0.000486
19  158.740105  158.740105  1.091438e-01  1.091438e-01   0.364716   0.364714      -0.000363

Note that the bin informations are also present here - something I also wish for results because otherwise I can't compare to the data from @scarlehoff who does apply cuts, and thus, e.g., presents only numbers for bin 6 onwards ...

@cschwan
Copy link
Contributor

cschwan commented Oct 25, 2021

This looks good! However, I'd suggest to change a few details, namely

  • to print the differences in per mille instead of in per cent
  • instead of O0 left and O0 right you could print
    • the actual observable names that are in the metadata,
    • and print the observable names for both columns and leave out left and right

@cschwan
Copy link
Contributor

cschwan commented Oct 25, 2021

@alecandido @felixhekhorn another completely unrelated question that also came up during the PC: in DIS the left and right bin boundaries are the same, but I wonder is this really the case also in the data? Don't we have bins in this case?

@alecandido
Copy link
Member

  • the actual observable names that are in the metadata,

unfortunately to do this from python we still need to merge NNPDF/pineappl#74 (we do not have access in the python interface to metadata, but we are going to have)

@alecandido @felixhekhorn another completely unrelated question that also came up during the PC: in DIS the left and right bin boundaries are the same, but I wonder is this really the case also in the data? Don't we have bins in this case?

This I don't know, we should check commondata, with which I'm not very familiar... (yet)

@cschwan
Copy link
Contributor

cschwan commented Oct 25, 2021

  • the actual observable names that are in the metadata,

unfortunately to do this from python we still need to merge N3PDF/pineappl#74 (we do not have access in the python interface to metadata, but we are going to have)

I see, I'll prioritize this Issue now.

@felixhekhorn
Copy link
Contributor Author

in DIS the left and right bin boundaries are the same, but I wonder is this really the case also in the data? Don't we have bins in this case?

I don't think so - at least for the HERA fully inclusive stuff: https://www.hepdata.net/record/ins1377206?version=1&table=Table%202 (I didn't check the paper fully though ... and I know in practice this is rather required)

@cschwan
Copy link
Contributor

cschwan commented Dec 28, 2021

Given what we've discussed last PC, I think we want the following information in the metadata that's not yet in there:

  • MC uncertainty
  • total error (difference between MC result and FK table result using the same PDF). This might require a standardized form for the results metadata and a new results_pdfset entry

@alecandido
Copy link
Member

  • total error (difference between MC result and FK table result using the same PDF). This might require a standardized form for the results metadata and a new results_pdfset entry

If we're going to have a new standardized form results entry, I would make it in whatever standard format (even csv, with whatever sep) in order to make as easy as possible to machine parse it.
Now they are more optimized for human reading, but in case we need to read them frequently I would make a printer for it. In case it is more generally needed, I'd include the printer in pineappl_cli.

@cschwan
Copy link
Contributor

cschwan commented Dec 29, 2021

@alecandido agreed, but I think we can have it both machine and human readable while only deviating a little from what we already do:

  • table headers get prefixed with "# " (comment character)
  • content left as is

The resulting table should be easily readable with numpy.loadtxt.

@alecandido
Copy link
Member

@felixhekhorn if you can, try to address this as well during your pineko transition.

Since you're moving comparison from fkutils (if it were not already here), now the information should be available locally, so it's just a matter of dumping it in the metadata during convolution.

felixhekhorn referenced this issue in NNPDF/pineappl Mar 9, 2022
@felixhekhorn
Copy link
Contributor Author

This looks good! However, I'd suggest to change a few details, namely

* to print the differences in per mille instead of in per cent

* instead of `O0 left` and `O0 right` you could print
  
  * the actual observable names that are in the metadata,
  * and print the observable names for both columns and leave out `left` and `right`

@cschwan here is what I currently have:

$ pineappl info --get results_fk data/fktables/213/LHCB_DY_13TEV_DIELECTRON.pineappl.lz4 
    yll left  yll right    PineAPPL     FkTable  permille_error
0      2.000      2.125   10.337989   10.330937       -0.682085
1      2.125      2.250   30.353310   30.334170       -0.630583
2      2.250      2.375   49.379146   49.355890       -0.470961
3      2.375      2.500   66.899600   66.879018       -0.307656
4      2.500      2.625   82.655479   82.635160       -0.245826
5      2.625      2.750   96.121822   96.097904       -0.248829
6      2.750      2.875  107.081167  107.053141       -0.261728
7      2.875      3.000  115.135111  115.104099       -0.269357
8      3.000      3.125  120.077337  120.044468       -0.273733
9      3.125      3.250  121.560013  121.526957       -0.271931
10     3.250      3.375  113.520790  113.489930       -0.271838
11     3.375      3.500   96.972783   96.946347       -0.272608
12     3.500      3.625   79.585250   79.563615       -0.271837
13     3.625      3.750   62.180007   62.162765       -0.277289
14     3.750      3.875   45.516016   45.502772       -0.290960
15     3.875      4.000   30.629132   30.619719       -0.307349
16     4.000      4.250   13.755959   13.751858       -0.298148
$ pineappl info --get results_fk_pdfset data/fktables/213/LHCB_DY_13TEV_DIELECTRON.pineappl.lz4 
NNPDF40_lo_as_01180

I can not drop "left/right" since the columns in data frames have to have unique names - alternatives are "min/max" (3 letters) or "l/r" (cut to 1 letter)

@felixhekhorn
Copy link
Contributor Author

@alecandido agreed, but I think we can have it both machine and human readable while only deviating a little from what we already do:

* table headers get prefixed with "# " (comment character)

* content left as is

The resulting table should be easily readable with numpy.loadtxt.

adding "#" is not necessary, I think, as np.loadtxt explicitly exposes a skiprows argument

@alecandido
Copy link
Member

Just to tell you: if your table is not so small, always consider using pd.read_csv, since it might be much faster than np.loadtxt (a couple of years ago it was, and I didn't expect anything to have changed in the meanwhile).

@cschwan
Copy link
Contributor

cschwan commented Mar 9, 2022

adding "#" is not necessary, I think, as np.loadtxt explicitly exposes a skiprows argument

It's not necessary, but it would be nice to use the functions out-of-the-box; also note that we might consider changing the header at some point (the number of lines, add some other piece of information), or that we'd like to support other libraries. To have some flexibility I'd therefore strongly suggest to use a comment character. This should be mirrored by the results metadata field, of course.

To implement it you could set the header of the left-most column to # and left-align it.

@felixhekhorn felixhekhorn linked a pull request Mar 24, 2022 that will close this issue
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants