Output format revamped #76
Comments
The full proposal is then to make the output:
I went over these messages and tend to agree that .npy looks like a good match. From its description it fits our use case, and it has the bindings one might need to use it directly. I like the idea of a tar with a yaml header plus the grid as npy.
Out of curiosity, how big are these files typically?
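A minimal sketch of the "tar with a yaml header plus the grid as npy" idea, not the layout actually adopted by the project. The member names `metadata.yaml` and `operator.npy`, the function names, and the header keys used in the usage note are all hypothetical. The header is passed in as already formatted YAML text, so the sketch needs only numpy and the standard library; a real implementation would presumably produce it with a YAML library.

```python
import io
import tarfile

import numpy as np


def write_archive(path, header_text, grid):
    """Store a YAML header and an .npy payload side by side in one tar file."""
    npy_buffer = io.BytesIO()
    np.save(npy_buffer, grid)  # serialize the array in .npy format
    members = {
        "metadata.yaml": header_text.encode("utf-8"),  # hypothetical member name
        "operator.npy": npy_buffer.getvalue(),  # hypothetical member name
    }
    with tarfile.open(path, "w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # tar needs the size before the data
            tar.addfile(info, io.BytesIO(payload))


def read_archive(path):
    """Return the header text and the array stored by write_archive."""
    with tarfile.open(path, "r") as tar:
        header_text = tar.extractfile("metadata.yaml").read().decode("utf-8")
        grid = np.load(io.BytesIO(tar.extractfile("operator.npy").read()))
    return header_text, grid
```

For example, `write_archive("out.tar", "q2: 100.0\n", np.zeros((2, 3)))` followed by `read_archive("out.tar")` should give back the same header text and array. Keeping the metadata as a separate member means it can be inspected without loading the numerical payload.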
It really depends on how many Q2 values you ask for and on which interpolation grids you use on input and output. (A third grid is used internally; it does not affect the output shape, and it serves to increase accuracy even when the input and output grids are coarse. By default all three are the same, but there may end up being no default at all.) I tested with the current version, and for a single Q2, with the same xgrid of 49 points in input and output, you get:
For the uncompressed is essentially
Incidentally, we sped up the IO operations quite a lot @Zaharid @felixhekhorn. Most likely since the
During the eko presentation the topic of the output format came up again; it was already discussed in #60.
The request was to have a more standard format, and at the same time to split the metadata from the actual data (@Zaharid).
We would like to accomplish the first one (the choice of yaml was to have a broadly supported format), and we don't dislike the second.
Nevertheless, when you combine these requests with our strict requirement, i.e.:
it ends up in a particularly restrictive range of options.
The proposal was to use a broadly supported format like Apache Parquet, which is very common in the big data community.
This and the other formats inherited from databases are not suitable for our task, since they are optimized for tabular data and are therefore intrinsically two-dimensional. Moreover, some of the key features of Parquet are being appendable, readable in chunks, and columnar, and we do not benefit from any of them.
The available formats for multidimensional data that are broadly supported by the community (especially in science) are:
The first one is more specialized and preferable in general, but we don't need it either, because it supports a great many features, while our goal is just to store a bare array of floats.
That's why our proposal is just to use the `.npy` format from the numpy library, which has a very simple API in Python (i.e. the `numpy.save` function and its partner `numpy.load`), and to compress it ourselves (using lz4, as is done for pineappl grids). There are also implementations of an API in C++ (a couple of them, in fact), each consisting of a very small codebase.
Many languages can interface directly with Python (like Julia), and some explicitly support numpy through their own libraries (like `ndarray`), so we would favor the numpy solution, since it is very flexible and at the same time the minimal thing required.
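The `.npy` plus compression round trip proposed here can be sketched as follows. This is only an illustration under assumptions: the function names `dump_compressed` and `load_compressed` are hypothetical, and lz4 is taken from the third-party `lz4` Python package (`lz4.frame`). If that package is not installed, the sketch falls back to `gzip` from the standard library purely so that it still runs; gzip is not the codec proposed in this issue.

```python
import io

import numpy as np

try:
    import lz4.frame  # third-party package, assumed to be installed

    compress, decompress = lz4.frame.compress, lz4.frame.decompress
except ImportError:
    # Fallback so the sketch runs without lz4; NOT the codec proposed above.
    import gzip

    compress, decompress = gzip.compress, gzip.decompress


def dump_compressed(array):
    """Serialize an array to .npy bytes in memory and compress them."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    return compress(buffer.getvalue())


def load_compressed(blob):
    """Decompress the bytes and rebuild the array with numpy.load."""
    return np.load(io.BytesIO(decompress(blob)))
```

Since `.npy` stores the shape and dtype in its own small header, a multidimensional array of floats comes back unchanged from `load_compressed(dump_compressed(array))` without any extra bookkeeping.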