Proposal to implement ADBC driver for Apache Cassandra #2245
Comments
I'm hoping to finish it this week, but there's a work-in-progress tutorial on how to get started building a driver in C++ using nanoarrow/the framework here: #2186. Arrow C++ presents a packaging problem (e.g., difficult/impossible to make an R driver wrapper, Python wrapper would require pinning a version of pyarrow until we sort out how to put two different Arrow C++ versions in the same process), which is probably why Matt recommended nanoarrow.
It's a bit subjective, but all our existing drivers lean on the most arrowish SDK available for the driver (e.g., Postgres has libpq for C, so we implemented that in C++; Snowflake and BigQuery have Arrow integrations in their Go connectors, so we wrote those in Go). I have no idea what Cassandra provides, but if it had a fairly complete Go or Rust client already and nothing for C++ that might be a good reason to implement it in those languages. The fact that you know C++ and you're motivated counts for a lot, though!
We have some
The Postgres driver has an example of writing tests for this without a live connection to the database (the "copy" tests). No pressure to do it exactly like that but I found it useful to accelerate the process of adding full type support there.
Where to put it is a good thing to think about... ideally we'd (maybe just speaking for me here) like for ADBC connectors to live with the project instead of with us to spread out the maintenance load (e.g., like DuckDB), but there is also not a straightforward way to implement the validation suite outside this repository (or if there is, nobody has tried it yet!). Probably the easiest place to start is as a PR into apache/arrow-adbc and move it when we sort out those details. Feel free to ping me early and often as you get started (probably everybody else is game too, but I'll let them volunteer themselves 🙂 ). All of this is helpful for us too since we've all had build setups for ADBC since the beginning and forget the issues encountered by those new to the project.
🚀 If I had to suggest a place to start it would be to get a "hello world" example running where you can open and close a connection to the database. Then you could perhaps follow that up with implementing the statement's ExecuteQuery for a case of a single type (int32 or string maybe). All just suggestions (do your thing!)
Thanks Dewey for the detailed guide! I'd say so long as Cassandra in C++ doesn't require gRPC, and you can use nanoarrow (instead of Arrow C++ for the aforementioned reasons), then C++ would be great.
We briefly did this when I tried to get Flight SQL into pyarrow instead of having a separate wheel. I think it wasn't complicated; we'd probably have to fix up the CMake definitions again, though.
@paleolimbot @lidavidm Thank you so much for the detailed guidance and insights! My apologies for not being able to respond sooner. I was pretty busy for the past couple of weeks but my schedule has freed up a bit now :)
Perfect timing! I'll be sure to check it out and share any feedback on the tutorial.
All of that makes sense, thanks for clarifying. I think we've had trouble using Arrow C++ in Python extensions for similar reasons, so I'm totally onboard with using nanoarrow.
These are good points! From some quick searching around, these are the main drivers I found in each language:
A couple of other random observations/thoughts:
The choice does seem like a bit of a toss-up. Like you said, I could be of more help if we pursued this in C++, so I'm still leaning towards that approach. To address David's point about dependency management, the C/C++ driver lists the following dependencies: Certainly no gRPC, but please let me know if any of these seem problematic.
Sounds good to me! It should be doable, Cassandra does publish official Docker images: https://hub.docker.com/_/cassandra.
I've been wondering how to approach this as well, so I'll definitely take a look at that. Appreciate the pointer.
I agree, it would be nice if the ADBC driver lived under the Cassandra project. This was also discussed in the most recent Arrow community call. The main hurdle seems to be that Cassandra is generally used for more OLTP-style workloads, so it may not be clear to them how Arrow fits into the picture or why an ADBC driver is necessary. I think starting in this repo to prove out the idea and then approaching the Cassandra community would be a viable strategy.
Thank you! I'll definitely reach out as necessary and plan to use draft PRs to get early feedback (if that's ok).
Thanks for looking into all this! The one thing that might be questionable is a hard dependency on OpenSSL 1.x, but I suppose that migration is still ongoing and I don't think anything should pose dependency conflicts AFAIK.
This shouldn't be a blocker in any way, shape, or form, but those aren't dependencies I can wrap into an R package that goes on CRAN (I can probably wrap it as an R package that doesn't go on CRAN, though, and none of our Go-based drivers go on CRAN either).
A couple of years ago @0x26res created an Arrow-based Python driver for Cassandra and gave a great talk about it. That might be something you could build on or at least take a look at.
@ianmcook Thank you for sharing this! I agree that decoding the raw bytes received from the database is likely to provide a noticeable performance improvement. @0x26res It's promising that you have a working example of this, and it's definitely worth building on top of. I dug around the C/C++ Cassandra driver and couldn't find an obvious way of getting those raw bytes, however, so we might need to ask the driver maintainers if they'd be open to exposing such functionality.
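To make the "decode the raw bytes" idea concrete, here is a minimal sketch of what decoding one `int` column straight into a column buffer might look like. It assumes the row content arrives row-major, with every cell encoded as a 4-byte big-endian length followed by that many bytes (a negative length meaning null), which is how the CQL native protocol describes result values. Everything named here (`DecodedInt32Column`, `ReadBigEndianInt32`, `DecodeInt32Column`) is hypothetical and not part of the Cassandra C/C++ driver or ADBC; frame headers, result metadata, and all other types are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical output: values plus one validity flag per row.
// (Real Arrow arrays use a packed validity bitmap instead.)
struct DecodedInt32Column {
  std::vector<int32_t> values;
  std::vector<bool> validity;
};

// Reads a 4-byte big-endian signed integer starting at `pos`.
int32_t ReadBigEndianInt32(const std::vector<uint8_t>& buf, size_t pos) {
  if (buf.size() < 4 || pos > buf.size() - 4) {
    throw std::out_of_range("truncated buffer");
  }
  uint32_t v = (uint32_t{buf[pos]} << 24) | (uint32_t{buf[pos + 1]} << 16) |
               (uint32_t{buf[pos + 2]} << 8) | uint32_t{buf[pos + 3]};
  return static_cast<int32_t>(v);
}

// Walks a row-major buffer of length-prefixed cells and extracts column
// `target`, which is assumed to hold CQL `int` values. Other columns are
// skipped without being interpreted.
DecodedInt32Column DecodeInt32Column(const std::vector<uint8_t>& buf,
                                     size_t row_count, size_t column_count,
                                     size_t target) {
  DecodedInt32Column out;
  size_t pos = 0;
  for (size_t r = 0; r < row_count; ++r) {
    for (size_t c = 0; c < column_count; ++c) {
      int32_t len = ReadBigEndianInt32(buf, pos);
      pos += 4;
      if (c == target) {
        if (len < 0) {  // negative length encodes null
          out.values.push_back(0);
          out.validity.push_back(false);
        } else {
          if (len != 4) throw std::invalid_argument("cell is not a 4-byte int");
          out.values.push_back(ReadBigEndianInt32(buf, pos));
          out.validity.push_back(true);
        }
      }
      if (len > 0) {  // skip over the cell payload
        if (static_cast<size_t>(len) > buf.size() - pos) {
          throw std::out_of_range("truncated cell");
        }
        pos += static_cast<size_t>(len);
      }
    }
  }
  assert(out.values.size() == row_count);
  return out;
}
```

The point of the sketch is that no intermediate per-row objects are created: each value goes from the wire buffer directly into the column. Whether the C/C++ driver can hand over such a buffer is exactly the open question above.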
What feature or improvement would you like to see?
There isn't an existing ADBC driver for Cassandra as far as I can tell, and it would be great to have one! I'm interested in starting this effort as I have experience getting Arrow data to/from Cassandra, and have a little experience working on an ADBC driver for a different database (comdb2). I met @zeroshade at Community Over Code recently, who inspired me to start the discussion around creating a Cassandra driver :)
Some initial thoughts:
Choice of language
I'm personally most familiar with the Cassandra C/C++ driver as well as Arrow C++. However, if there's good reason to implement the driver in a different language, I'm open to that and happy to get up to speed.
Matt explained that it would be better to use nanoarrow rather than Arrow C++ as the latter is a heavy dependency and can complicate building/deploying drivers. Using nanoarrow sounds like a good idea to me.
Implementation considerations
Cassandra currently does not offer any native mechanism for fetching/ingesting data in Arrow format, so we would likely have to implement row ↔ column transposition on the client side (in the driver).
The Cassandra Query Language (CQL) can be thought of as an extremely limited subset of SQL. This StackOverflow answer is a good overview of the general limitations. I figure this shouldn't matter as far as implementing an ADBC driver is concerned, but thought it was worth mentioning in case I'm wrong.
Matt also mentioned that there is now an ADBC driver framework. I don't see any reason not to use this. If we find any gaps in the framework while implementing the driver, I'm happy to help fill them in.
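As a rough illustration of the client-side row → column transposition mentioned above, here is a minimal sketch using only the C++17 standard library. The `Row` shape (`id int`, `name text`) and every type and function name are hypothetical; a real driver would read values through the Cassandra client and append into nanoarrow buffers, and would use a packed validity bitmap rather than `std::vector<bool>`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical row shape for a table (id int, name text); both nullable.
struct Row {
  std::optional<int32_t> id;
  std::optional<std::string> name;
};

// Arrow-style fixed-width column: one value slot and one validity flag per row.
struct Int32Column {
  std::vector<int32_t> values;
  std::vector<bool> validity;
};

// Arrow-style string column: row i spans data[offsets[i], offsets[i + 1]).
struct StringColumn {
  std::vector<int32_t> offsets{0};
  std::string data;
  std::vector<bool> validity;
};

struct Batch {
  Int32Column id;
  StringColumn name;
};

// Transposes row-oriented results into per-column buffers.
Batch TransposeRows(const std::vector<Row>& rows) {
  Batch batch;
  batch.id.values.reserve(rows.size());
  for (const Row& row : rows) {
    // Null slots still occupy a value slot; validity marks them as null.
    batch.id.values.push_back(row.id.value_or(0));
    batch.id.validity.push_back(row.id.has_value());

    // Null strings contribute no bytes, so the offset simply repeats.
    if (row.name.has_value()) batch.name.data += *row.name;
    batch.name.offsets.push_back(static_cast<int32_t>(batch.name.data.size()));
    batch.name.validity.push_back(row.name.has_value());
  }
  assert(batch.name.offsets.size() == rows.size() + 1);
  return batch;
}
```

Ingestion would run the same loop in the opposite direction, reading slot `i` of each column to build the bound parameters for row `i`.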
First step(s)
Matt mentioned that, before implementing anything, it would be good to stand up a Cassandra node/cluster in CI so that others can also play around with and contribute to the driver.
I suppose the next step would be to configure the build system to pull in the necessary dependencies (like the Cassandra C/C++ driver).
... Start implementing the driver along with integration tests?
I'd love to hear any other considerations for implementing this ADBC driver and/or recommendations on getting started!