Welcome! #1
Hello! I want to get back here to lay down some thoughts, but I thought it would be interesting as well to collect the spurious pieces I have seen floating around in my corner of Rust about dataframes:
|
Hi! I have https://github.com/nevi-me/rust-dataframe in addition to what @LukeMathWalker mentioned. |
Another interesting conversation concerning DataFrames: rust-ndarray/ndarray#539 |
Other food for thought on columnar storage: https://www.reddit.com/r/rust/comments/afo4ln/exploring_columnoriented_data_in_rust_with_frunk/ |
@LukeMathWalker thanks for continuing to add references. I'd like to start putting together a document with these sources and perhaps some commentary, like an annotated bibliography. |
Might be interesting to see Go's approach to this: https://github.com/go-gota/gota |
Hi everyone! I just wanted to mention my crate that I've been working on lately: https://github.com/jblondin/agnes. I guess I should be on reddit more, since it's pretty similar to this (and I originally based the structure on frunk's HLists):
It's still early code, and I've kinda been working on it in a one-person echo chamber (never a great idea -- my cats are decent debuggers but horrible at calling out bad design decisions), but I think it has some potential. It is typesafe (columns are referred to by unit-like marker structs which are associated with that column's data type), avoids copies as much as possible, and has basic join, print, iteration, and serialization functionality. I wrote a user guide here. I probably need to write up a design document as well. I'm planning on most likely replacing the lowest-level data storage with ndarray for ease of interoperability (especially if ndarray is going to eventually interop with Apache Arrow, as Luca mentions here). Let me know if there's anything I can do to help this initiative -- I'd love to see a stable dataframe library in Rust! |
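To make the "unit-like marker structs" idea concrete, here is a minimal sketch of how a marker type can tie a column name to its element type. This is not the agnes API: the `Label` trait, the `Frame` type, and the `Country`/`Population` markers are all hypothetical, and the sketch swaps the HList-based structure mentioned above for a type-erased `HashMap` to stay short. As a result the element type is checked at compile time, but whether a column is present is only checked at run time.

```rust
use std::any::Any;
use std::collections::HashMap;

/// A column label. A unit-like marker struct implements this to declare
/// the column's name and element type. (Hypothetical trait, not from agnes.)
trait Label {
    type Data: 'static;
    const NAME: &'static str;
}

// Hypothetical column markers: they carry no data, only type information.
struct Country;
struct Population;

impl Label for Country {
    type Data = String;
    const NAME: &'static str = "country";
}

impl Label for Population {
    type Data = u64;
    const NAME: &'static str = "population";
}

/// Minimal frame: columns are stored type-erased and recovered through
/// the marker type, so callers never name the element type by hand.
#[derive(Default)]
struct Frame {
    columns: HashMap<&'static str, Box<dyn Any>>,
}

impl Frame {
    /// Adds a column; `data` must have the element type the marker declares.
    fn with<L: Label>(mut self, data: Vec<L::Data>) -> Self {
        self.columns.insert(L::NAME, Box::new(data));
        self
    }

    /// Borrows a column without copying; `None` if it was never added.
    fn column<L: Label>(&self) -> Option<&[L::Data]> {
        self.columns
            .get(L::NAME)
            .and_then(|c| c.downcast_ref::<Vec<L::Data>>())
            .map(|v| v.as_slice())
    }
}

/// Builds a small two-column frame for illustration.
fn example_frame() -> Frame {
    Frame::default()
        .with::<Country>(vec!["NZ".to_string(), "IS".to_string()])
        .with::<Population>(vec![5_100_000, 380_000])
}
```

With this shape, `example_frame().column::<Population>()` yields a borrowed `&[u64]`, and passing a `Vec<String>` to `with::<Population>` is rejected by the compiler.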
Just my opinion...
I believe we should build a data frame library as a 'front end' to Apache Arrow. This library would serve the purpose of data access and "data wrangling" and could provide a way to zero copy convert to Arrow is seeing adoption from a range of projects and adopting this underlying infrastructure would allow us to take advantage of the Arrow ecosystem. I'm a committer to the Rust Arrow implementation along with a few others and we would welcome the input regarding requirements of higher level libraries. There are others focusing on lower level details in Arrow, there is already a query execution engine called The key thing to gain consensus on is which project is the dataframe library. The Rust community is smaller and I think we all need to focus on one data frame project and drive it forward. This probably requires someone to step forward and volunteer to drive such a project forward. I don't need such a library badly enough to do this, but I would contribute to such a project if it existed. |
I see your point and can agree with this -- using Rust as a data science language will require a lot of interoperability and Apache Arrow is the best way forward for this that I've seen. |
I agree with @paddyhoran, using Arrow also benefits us with not having to worry about a lot of IO. I created https://github.com/nevi-me/rust-dataframe with the intention of bikeshedding a dataframe that relies on Arrow for both in-memory data, as well as some computation. Although I also think that if/when ndarray supports Arrow, it would make for a great UDF interface where one needs multi-dimensional data, and we could use ndarray's stats functionality in dataframes built in Rust. The other effort I've been trying, though time is a huge constraint as I have a hectic work schedule + studying, is creating Arrow interfaces to SQL DBs in Rust. I've got a simple PostgreSQL one working, but haven't had time to put it on GH. |
I think that interoperability should be a core principle of whatever we decide to invest in: it's unreasonable to expect anyone to work in a Rust-only environment for domains such as Machine Learning or Data Engineering. On the other hand, though, I'd like to build an API that feels native and first-class in Rust.
|
Hi @LukeMathWalker, I'll answer the question that you've asked @paddyhoran. Arrow is very usable, although we might make minor/breaking changes to the parts of the library that we're still working on (we don't support some data types that the CPP and other implementations support, and some might require some refactoring). We have:
The foundational part which one would use to rely on Arrow is sound and relatively stable. |
One possible concern with the Arrow implementation (please correct me if I'm wrong @nevi-me @paddyhoran) is that it seems to currently require the nightly toolchain. Specifically, a dependency on I personally don't see this as a huge problem as eventually these things will be stabilized and we're just starting this project, but I thought I'd point it out. |
Yes, I suppose we could hide One thing I'm personally unsure of is what will happen after 0.15 is released in a few months, because the release after that might be 1.0.0. |
I am not too worried by using the nightly toolchain to leverage What does this versioning strategy imply @nevi-me? Do we risk having breaking changes without a bump in the major version number? |
It would likely be a private dependency. The IPC part of the format is versioned, so when reading Arrow data from, say, an external system, that system would declare its version. So that helps with avoiding breakages. If a library that uses Arrow doesn't stay far behind the latest version, small changes would theoretically be easy to handle. One significant consideration, though, is that if publishing a crate that depends on Arrow, we'd likely have to either move at Arrow's cadence (we're aiming for a release every 2 months going forward), or fork it like what DataFusion did before it was donated to Arrow. |
Hi all. I have created a library similar to @jblondin's, here: https://github.com/jesskfullwood/frames. It does maps, joins, groupby, filter all in a typesafe manner, and it allows arbitrary field types (e.g. you can have enums and structs in your columns). But while it is functional, it is much less polished and I somewhat gave up on it when I decided I couldn't get the ergonomics that I wanted (something as intuitive as R I absolutely am looking for something to use in production, at work we have an unmanageably complex series of |
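As a rough illustration of typesafe map, filter, and groupby over columns with arbitrary element types, the following sketch uses a generic column wrapper with an enum as the grouping key. It is not the frames API: `Column`, `Species`, and `group_sum` are hypothetical names chosen for this example, and joins are left out.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// A user-defined type stored directly in a column (hypothetical example).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Species {
    Cat,
    Dog,
}

/// A column is a thin wrapper over a vector of any element type.
#[derive(Debug, Clone, PartialEq)]
struct Column<T>(Vec<T>);

impl<T: Clone> Column<T> {
    /// Applies `f` to every element; the output type is inferred from `f`.
    fn map<U>(&self, f: impl Fn(&T) -> U) -> Column<U> {
        Column(self.0.iter().map(f).collect())
    }

    /// Keeps only the elements for which `keep` returns true.
    fn filter(&self, keep: impl Fn(&T) -> bool) -> Column<T> {
        Column(self.0.iter().filter(|x| keep(*x)).cloned().collect())
    }
}

/// Groups `values` by the matching entry in `keys` and sums each group.
/// Rows are paired by position; extra rows in the longer column are ignored.
fn group_sum<K: Eq + Hash + Clone>(
    keys: &Column<K>,
    values: &Column<f64>,
) -> HashMap<K, f64> {
    let mut out = HashMap::new();
    for (k, v) in keys.0.iter().zip(values.0.iter()) {
        *out.entry(k.clone()).or_insert(0.0) += *v;
    }
    out
}
```

Because the key type is a generic parameter, grouping by an enum, a struct, or a string uses the same function, and a closure with the wrong argument type fails to compile rather than failing at run time.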
Hi! I’m excited to begin discussion of strategies for implementing a dataframe.
I imagine this repo as the main archive of discussions, with perhaps a discord channel for real-time chat.
I think discussion can be in the issues for now. We could do something more formal eventually, whether a wiki or md files, if we want to crystallize some directions.
Some topics I’m interested in:
user api (type checking, ergonomics)
backend (performance, integration with other data engines)
use cases (pain points from other systems, examples of current production systems to switch to Rust)
prior art (discussion of design decisions from other data engineering/scientific computing libraries)
WIP (post updates about your current attempt, design decisions, etc.)