
Output format revamped #76

Closed
alecandido opened this issue Oct 7, 2021 · 5 comments · Fixed by #77
Labels
enhancement New feature or request refactor Refactor code

Comments

@alecandido
Member

alecandido commented Oct 7, 2021

In the eko presentation, the topic of the output format came up again; it was already discussed in #60.

The request was to have a more standard format and, at the same time, to split the metadata from the actual data (@Zaharid).

We would like to accomplish the first one (yaml was chosen precisely to have a broadly supported format), and we don't dislike the second.
Nevertheless, when you combine these requests with our strict requirements, i.e.:

  • we want to store a multidimensional array (rank 4 or rank 5)
  • we want to store it in as minimal a way as possible

you end up with a particularly restrictive range of options.

The proposal was to use some broadly supported format like Apache Parquet, which is very common in the big data community.
These and other database-inherited formats are not suitable for our task, since they are optimized for tabular, and so intrinsically two-dimensional, data (moreover, some of the key points of Parquet are being appendable, readable in chunks, and columnar, and we don't benefit from any of them).

The available formats for multidimensional data that are broadly supported by the community (especially in science) are:

  • NetCDF, which is a general format and has an especially good Python library for managing the in-memory counterpart (i.e. xarray, closely connected to numpy and inspired by pandas)
  • HDF5, on which the former is based, with its own Python API

The first one is more specialized and generally preferable, but we don't need it either: it supports many features we would not use, while our goal is just to store a bare array of floats.

That's why our proposal is simply to use the .npy format from the numpy library, and to compress it ourselves (using lz4, as is done for pineappl grids); it has a very simple Python API (i.e. the numpy.save function and its partner numpy.load).

There is also a C++ implementation of the API (or rather, a couple of them), each consisting of a very small codebase.
Many languages can interface directly with Python (like Julia), and some explicitly support numpy with their own libraries (like ndarray), so we would go with the numpy solution: it is very flexible and, at the same time, the minimal thing required.

@alecandido
Member Author

The full proposal is then to make the output:

  • a .tar archive of a folder containing
    • a .yaml file with the metadata
    • the rank-5 tensor as a .npy.lz4 file
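A minimal sketch of writing such an archive with the standard library's tarfile (the member names `metadata.yaml` and `operators.npy` are hypothetical, PyYAML is assumed for the metadata, and the lz4 step is omitted for brevity):

```python
import io
import tarfile

import numpy as np
import yaml  # third-party: PyYAML


def write_output(path, metadata, tensor):
    """Pack a yaml metadata file and the .npy-serialized tensor into one tar."""
    with tarfile.open(path, "w") as tar:
        # metadata: small, human-readable, cheap to parse on its own
        meta = yaml.safe_dump(metadata).encode()
        info = tarfile.TarInfo("metadata.yaml")
        info.size = len(meta)
        tar.addfile(info, io.BytesIO(meta))

        # bulk data: raw .npy bytes, kept out of the yaml entirely
        buf = io.BytesIO()
        np.save(buf, tensor)
        data = buf.getvalue()
        info = tarfile.TarInfo("operators.npy")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
```

Keeping the two members separate means a reader can inspect the metadata without touching (or decompressing) the tensor at all.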

@Zaharid

Zaharid commented Oct 7, 2021

I went over these messages and tend to agree that .npy looks like a good match. Reading its description, it fits our use case and has the bindings one might need to use it directly. I like the idea of a tar with a yaml header plus the grid as npy.

@felixhekhorn felixhekhorn added enhancement New feature or request refactor Refactor code labels Oct 7, 2021
@Zaharid

Zaharid commented Oct 8, 2021

Out of curiosity, how big are these files typically?

@alecandido
Member Author

It really depends on how many Q2 values you ask for, and which interpolation grids you use (for input and output; there is a third one, used internally, that does not affect the output shape and is there to increase accuracy even when the input and output grids are coarse; by default they are all the same, but maybe there won't be a default at all).

I tested with the current version: for a single Q2, with the same 49-point xgrid for input and output, you get:

  • 908k compressed
  • 11M uncompressed

The uncompressed size is essentially (some fixed block size) × input_xgrid × output_xgrid × #Q2; the compressed size might be less trivial (even more so considering that we'll compress the whole rank-5 tensor, i.e. all Q2 together).
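For reference, the uncompressed .npy size is just a small header plus the raw, contiguous buffer, so it can be estimated directly from the tensor shape (the shapes below are only for illustration):

```python
import numpy as np


def npy_size_bytes(shape, dtype=np.float64):
    """Estimate an uncompressed .npy file size: a small header
    (typically 128 bytes) plus the raw data buffer."""
    return 128 + int(np.prod(shape)) * np.dtype(dtype).itemsize
```

For instance, the per-Q2 cost scales with the product of the two xgrid lengths, so the total grows linearly in the number of Q2 points.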

@alecandido
Member Author

Incidentally, we sped up the IO operations quite a lot @Zaharid @felixhekhorn.

Most likely this is because the yaml parser is quite slow (and maybe the same holds for the writer), so we gain a lot by stripping the bulk data out of the .yaml.
