Support for RFC 86: Column-oriented read API for vector layers #280
Adding references to other implementations:
@kylebarron, do you know if this column-oriented API is available in GDAL for all formats, or just for column-oriented formats?
It's available for all formats. From this part of the RFC
bors bot added a commit that referenced this issue on Feb 8, 2023
367: Implement support for RFC 86: Column-oriented read API for vector layers r=lnicola a=kylebarron

- [x] I agree to follow the project's [code of conduct](https://github.com/georust/gdal/blob/master/CODE_OF_CONDUCT.md).
- [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.

---

### Description

This is a pretty low-level/advanced function, but is very useful for performance when reading (and maybe in the future writing) from OGR into columnar memory.

This function operates on an `ArrowArrayStream` struct that needs to be passed in. Most of the time, users will be using a helper library for this, like [`arrow-rs`](https://github.com/apache/arrow-rs) or [`arrow2`](https://github.com/jorgecarleitao/arrow2). The nice part about this API is that this crate does _not_ need to declare those as dependencies.

The [OGR guide](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface) is very helpful reading. Would love someone to double-check this PR in context of this paragraph:

> There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing). Furthermore, unless otherwise specified by a particular driver implementation, only one ArrowArrayStream can be active at a time on a given layer (that is the last active one must be explicitly released before a next one is asked). Changing filter state, ignored columns, modifying the schema or using ResetReading()/GetNextFeature() while using a ArrowArrayStream is strongly discouraged and may lead to unexpected results. As a rule of thumb, no OGRLayer methods that affect the state of a layer should be called on a layer, while an ArrowArrayStream on it is active.

### Change list

- Copy in `arrow_bridge.h` with the Arrow C Data Interface headers.
- Add `arrow_bridge.h` to the bindgen script so that `gdal_3.6.rs` includes a definition for `ArrowArrayStream`. I re-ran this locally; I'm not sure why there's such a big diff. Maybe I need to run this from `3.6.0` instead of `3.6.2`?
- Implement `read_arrow_stream`
- Add example of reading arrow data to [`arrow2`](https://docs.rs/arrow2)

### Todo

- Pass in options to `OGR_L_GetArrowStream`? According to the guide:

  > The `papszOptions` that may be provided is a NULL terminated list of key=value strings, that may be driver specific.

  So maybe we should have an `options: Option<Vec<(String, String)>>` argument? Pyogrio [uses this](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1090-L1091) to turn off generating an `fid` for every row.
- Have an option to skip reading some columns. Pyogrio does this with [calls to](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1081-L1088) `OGR_L_SetIgnoredFields`.

### References

- [OGR Guide for using the C Data Interface](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface)

Closes #280

Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
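
For the `options: Option<Vec<(String, String)>>` idea in the Todo list, here is a minimal sketch of how such pairs could be turned into the NULL-terminated list of `key=value` strings that `papszOptions` expects. It uses only the standard library. `CslOptions` and its methods are hypothetical names for illustration, not an API of this crate, and the sketch assumes GDAL only reads from the list during the call.

```rust
use std::ffi::{CStr, CString, NulError};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical helper: owns the `key=value` C strings plus a
/// NULL-terminated pointer array over them.
struct CslOptions {
    // Keeps the string buffers alive for as long as `ptrs` is in use.
    _owned: Vec<CString>,
    ptrs: Vec<*mut c_char>,
}

impl CslOptions {
    /// Builds the list; fails if any key or value contains an interior NUL byte.
    fn new(options: &[(String, String)]) -> Result<Self, NulError> {
        let owned = options
            .iter()
            .map(|(k, v)| CString::new(format!("{}={}", k, v)))
            .collect::<Result<Vec<CString>, NulError>>()?;
        let mut ptrs: Vec<*mut c_char> = owned
            .iter()
            .map(|s| s.as_ptr() as *mut c_char)
            .collect();
        // The C side finds the end of the list through this NULL entry.
        ptrs.push(ptr::null_mut());
        Ok(Self { _owned: owned, ptrs })
    }

    /// Number of options, excluding the NULL terminator.
    fn len(&self) -> usize {
        self.ptrs.len() - 1
    }

    /// Pointer in the `char **` shape; valid only while `self` is alive.
    fn as_mut_ptr(&mut self) -> *mut *mut c_char {
        self.ptrs.as_mut_ptr()
    }

    /// Walks the array the way C code would, stopping at the NULL entry.
    fn to_strings(&self) -> Vec<String> {
        self.ptrs
            .iter()
            .take_while(|p| !p.is_null())
            .map(|&p| unsafe { CStr::from_ptr(p) }.to_string_lossy().into_owned())
            .collect()
    }
}
```

With something like this, `CslOptions::new(&[("INCLUDE_FID".to_string(), "NO".to_string())])` would yield a list whose `as_mut_ptr()` result could be handed to the FFI call, provided the `CslOptions` value outlives that call.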
GDAL 3.6 added support for a column-oriented API in RFC 86. This is a feature request to add an API for this in the Rust bindings.
For higher-level bindings to GDAL, such as from Python, this API is a big performance improvement as it moves the row-to-columnar conversion loop into C. I don't know how Rust-C bindings work well enough to know if this would also improve performance compared to a Rust loop. But regardless, for a Rust application that would like to use Arrow memory, it would be most ergonomic to reuse the GDAL implementation.
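
To make the "row-to-columnar conversion loop" concrete, here is a rough sketch of what that loop looks like when written by hand in Rust. The `Row` and `Columns` types are hypothetical stand-ins, not types from GDAL or this crate. Real Arrow arrays use validity bitmaps and offset buffers rather than `Vec<Option<_>>` and `Vec<String>`, so this only illustrates the shape of the work that the GDAL implementation would take over.

```rust
/// Hypothetical row-oriented record, standing in for one feature
/// read through a feature-by-feature API.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    fid: i64,
    name: String,
    population: Option<f64>,
}

/// Hypothetical columnar batch: one contiguous vector per field.
#[derive(Debug, Default, PartialEq)]
struct Columns {
    fid: Vec<i64>,
    name: Vec<String>,
    population: Vec<Option<f64>>,
}

/// The conversion loop: visit each row once and append every field
/// to its own column.
fn rows_to_columns<I: IntoIterator<Item = Row>>(rows: I) -> Columns {
    let mut cols = Columns::default();
    for row in rows {
        cols.fid.push(row.fid);
        cols.name.push(row.name);
        cols.population.push(row.population);
    }
    cols
}
```

A columnar read API hands back data already laid out per field, so a loop like this (and the per-feature allocation behind it) would not need to run on the Rust side at all.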