
Output format revamped #76

Closed
alecandido opened this issue Oct 7, 2021 · 5 comments · Fixed by #77
Labels
enhancement New feature or request refactor Refactor code

Comments

@alecandido
Member

alecandido commented Oct 7, 2021

In the eko presentation, the topic of the output format came up again; it was already discussed in #60.

The request was to have a more standard format and, at the same time, to split the metadata from the actual data (@Zaharid).

We would like to accomplish the first one (yaml was chosen precisely to have a broadly supported format), and we don't dislike the second.
Nevertheless, when you combine these requests with our strict requirements, i.e.:

  • we want to store a multidimensional array (rank 4 or rank 5)
  • we want to store it in as minimal a way as possible

you end up with a particularly restrictive range of options.

The proposal was to use some broadly supported format like Apache Parquet, which is very common in the big data community.
These and other database-inherited formats are not suitable for our task, since they are optimized for tabular, and so intrinsically two-dimensional, data (moreover, some of the key points of Parquet are being appendable, readable in chunks, and columnar, and we don't benefit from any of them).

The available formats for multidimensional data that are broadly supported by the community (especially in science) are:

  • NetCDF, which is a general format and has an especially good Python library for managing the in-memory counterpart (i.e. xarray, closely connected to numpy and inspired by pandas)
  • HDF5, on which the former is based, with its own Python API

The first one is more specialized and generally preferable, but we don't need it either: it supports many features we would not use, while our goal is just to store a bare array of floats.

That's why our proposal is simply to use the .npy format from the numpy library, and to compress it ourselves (using lz4, as is done for pineappl grids); it has a very simple Python API (i.e. the numpy.save function and its partner numpy.load).

There is also a C++ implementation of the API (or rather, a couple of them), each consisting of a very small codebase.
Many languages can interface directly with Python (like Julia), and some explicitly support numpy with their own libraries (like ndarray), so we would go with the numpy solution: it is very flexible and, at the same time, the minimal thing required.

@alecandido
Member Author

The full proposal is then to make the output:

  • a .tar archive of a folder containing
    • a .yaml file with the metadata
    • the rank-5 tensor as a .npy.lz4 file
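A minimal sketch of writing such an archive with the standard library's tarfile (the member names `metadata.yaml` and `operators.npy` are hypothetical, PyYAML is assumed for the metadata, and the lz4 step is omitted for brevity):

```python
import io
import tarfile

import numpy as np
import yaml  # third-party: PyYAML


def write_output(path, metadata, tensor):
    """Pack a yaml metadata file and the .npy-serialized tensor into one tar."""
    with tarfile.open(path, "w") as tar:
        # metadata: small, human-readable, cheap to parse on its own
        meta = yaml.safe_dump(metadata).encode()
        info = tarfile.TarInfo("metadata.yaml")
        info.size = len(meta)
        tar.addfile(info, io.BytesIO(meta))

        # bulk data: raw .npy bytes, kept out of the yaml entirely
        buf = io.BytesIO()
        np.save(buf, tensor)
        data = buf.getvalue()
        info = tarfile.TarInfo("operators.npy")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
```

Keeping the two members separate means a reader can inspect the metadata without touching (or decompressing) the tensor at all.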

@Zaharid

Zaharid commented Oct 7, 2021

I went over these messages and tend to agree that .npy looks like a good match. Reading its description, it fits our use case and has the bindings one might need to use it directly. I like the idea of a tar with a yaml header plus the grid as npy.

@felixhekhorn felixhekhorn added enhancement New feature or request refactor Refactor code labels Oct 7, 2021
@Zaharid

Zaharid commented Oct 8, 2021

Out of curiosity, how big are these files typically?

@alecandido
Member Author

It really depends on how many Q2 values you ask for, and which interpolation grids you use (for input and output; there is a third one, used internally, that does not affect the output shape and is there to increase accuracy even when the input and output grids are coarse; by default they are all the same, but maybe there won't be a default at all).

I tested with the current version: for a single Q2, with the same 49-point xgrid for input and output, you get:

  • 908k compressed
  • 11M uncompressed

The uncompressed size is essentially (some fixed block size) × input_xgrid × output_xgrid × #Q2; the compressed size might be less trivial (even more so considering that we'll compress the whole rank-5 tensor, i.e. all Q2 together).
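For reference, the uncompressed .npy size is just a small header plus the raw, contiguous buffer, so it can be estimated directly from the tensor shape (the shapes below are only for illustration):

```python
import numpy as np


def npy_size_bytes(shape, dtype=np.float64):
    """Estimate an uncompressed .npy file size: a small header
    (typically 128 bytes) plus the raw data buffer."""
    return 128 + int(np.prod(shape)) * np.dtype(dtype).itemsize
```

For instance, the per-Q2 cost scales with the product of the two xgrid lengths, so the total grows linearly in the number of Q2 points.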

@alecandido
Member Author

Incidentally, we sped up the IO operations quite a lot @Zaharid @felixhekhorn.

Most likely this is because the yaml parser is quite slow (and maybe the same holds for the writer), so we gain a lot by stripping the bulk data out of the .yaml.
