Support for the Feather format #343

Open
evelinag opened this issue Mar 29, 2016 · 5 comments

Comments

@evelinag

Feather is a recently introduced fast binary format for storing data frames. It is language-agnostic and can currently be used to load data frames into R and Python. It would be great to have support for this format in Deedle as well, to allow exchanging data with R and Python code.

For more information see: blog.rstudio.org/2016/03/29/feather
Feather source code: github.com/wesm/feather

@adamklein
Contributor

Great idea!

@buybackoff
Contributor

It would be cool to reuse your FlatBuffer project to automatically map .NET's primitive types and structs (and maybe POCOs) to the Arrow format, since it uses FlatBuffers as well, and then to keep .NET<->Arrow as a reusable module and build Feather on top of it. Feather is too specific to data frames, while the Arrow format could be used for chunk/block storage. I have been investigating for a while how to adapt it for the Spreads library, and I am very interested in a .NET port. Which do you think would be easier or more feasible: a C interface with P/Invoke, or a native rewrite in F#/C#?

@pkese
Contributor

pkese commented Feb 11, 2019

@buybackoff
Do you have any idea how https://github.com/kevin-montrose/FeatherDotNet would fit into Deedle's internals?

@buybackoff
Contributor

buybackoff commented Feb 11, 2019

@pkese I'm not the one to talk about Deedle internals, but

My current take on it is that the physical binary layout doesn't matter much; there is no silver bullet, but I'm biased. I'm doing well with SQLite and LMDB, storing data blocks as just shuffled+compressed blobs. SQLite is damn fast: SSD write speed is the limit when writing moderate-size chunks. LMDB is much faster for reads. Zstd compression often makes IO faster, because the savings on data size and read/write time are bigger than the CPU time spent on (de)compression.
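To make "shuffled+compressed blobs" a bit more concrete, here is a minimal Python sketch. It is not the code from any of the libraries mentioned in this thread: `zlib` from the standard library stands in for Zstd, the block is assumed to hold little-endian 64-bit floats, and the names `shuffle`, `unshuffle`, `pack_block` and `unpack_block` are hypothetical.

```python
import struct
import zlib


def shuffle(raw: bytes, width: int) -> bytes:
    """Byte-shuffle: emit byte 0 of every value, then byte 1, and so on."""
    if len(raw) % width:
        raise ValueError("buffer length must be a multiple of the value width")
    return b"".join(raw[i::width] for i in range(width))


def unshuffle(shuffled: bytes, width: int) -> bytes:
    """Inverse of shuffle: interleave the byte planes back into values."""
    count = len(shuffled) // width
    out = bytearray(len(shuffled))
    for i in range(width):
        out[i::width] = shuffled[i * count:(i + 1) * count]
    return bytes(out)


def pack_block(values: list[float]) -> bytes:
    """Serialize floats as little-endian doubles, shuffle, then compress."""
    raw = struct.pack(f"<{len(values)}d", *values)
    # Assumption: zlib is used here only because it ships with Python;
    # the comment above talks about Zstd.
    return zlib.compress(shuffle(raw, 8))


def unpack_block(blob: bytes) -> list[float]:
    """Decompress, unshuffle, and decode the doubles."""
    raw = unshuffle(zlib.decompress(blob), 8)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))
```

The shuffle step groups bytes of the same significance together, which tends to help a general-purpose compressor on slowly varying numeric data; how much it helps depends on the data and the codec.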

In the end it is just blobs with headers laid out sequentially, with some indexing. Anything will do many orders of magnitude better than CSV/JSON. Arrow is more like well-specified common sense than something unique. What is unique is that the very large Apache ecosystem has agreed on that standard.
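As an illustration of "blobs with headers laid out sequentially with some indexing", a small Python sketch follows. The 8-byte header (block id plus payload length) and the in-memory offset index are assumptions made for this example only; this is not Arrow's, Feather's, or any other real format's layout.

```python
import io
import struct

# Hypothetical header: unsigned 32-bit block id, unsigned 32-bit payload length.
HEADER = struct.Struct("<II")


def write_blocks(stream, blocks):
    """Append each (block_id, payload) pair and return {block_id: offset}."""
    index = {}
    for block_id, payload in blocks:
        index[block_id] = stream.tell()
        stream.write(HEADER.pack(block_id, len(payload)))
        stream.write(payload)
    return index


def read_block(stream, index, block_id):
    """Seek to the indexed offset, check the header, and return the payload."""
    stream.seek(index[block_id])
    stored_id, length = HEADER.unpack(stream.read(HEADER.size))
    if stored_id != block_id:
        raise ValueError("index points at the wrong block")
    return stream.read(length)


buf = io.BytesIO()
idx = write_blocks(buf, [(1, b"abc"), (7, b"hello")])
payload = read_block(buf, idx, 7)  # reads only the second block
```

With an index like this, a reader can jump straight to one block instead of parsing everything before it, which is the main thing a text format such as CSV cannot offer.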

Please sign up for announcements here if you are interested in very fast persistence for real-time data streams, series, matrices and frames. I have it partially working in a private repo and hope to release it soon for the general use case. I will implement ML.NET's IDataView rather than Arrow on top of my very simple physical layout, which conceptually resembles Arrow a lot.

@buybackoff
Contributor

Relevant issue in ML.NET: dotnet/machinelearning#1860

ML.NET already has a Parquet loader: https://github.com/dotnet/machinelearning/tree/master/src/Microsoft.ML.Parquet

And now we have Feather, Arrow and Parquet, and have to learn how they differ, or whether they are just names/implementations of the same thing... And then comes IDataView, which promises to standardize all the standards. The xkcd link above is so relevant here :)


4 participants