Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
367: Implement support for RFC 86: Column-oriented read API for vector layers r=lnicola a=kylebarron - [x] I agree to follow the project's [code of conduct](https://github.com/georust/gdal/blob/master/CODE_OF_CONDUCT.md). - [x] I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users. --- ### Description This is a pretty low-level/advanced function, but is very useful for performance when reading (and maybe in the future writing) from OGR into columnar memory. This function operates on an `ArrowArrayStream` struct that needs to be passed in. Most of the time, users will be using a helper library for this, like [`arrow-rs`](https://github.com/apache/arrow-rs) or [`arrow2`](https://github.com/jorgecarleitao/arrow2). The nice part about this API is that this crate does _not_ need to declare those as dependencies. The [OGR guide](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface) is very helpful reading. Would love someone to double-check this PR in context of this paragraph: > There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing). Furthermore, unless otherwise specified by a particular driver implementation, only one ArrowArrayStream can be active at a time on a given layer (that is the last active one must be explicitly released before a next one is asked). Changing filter state, ignored columns, modifying the schema or using ResetReading()/GetNextFeature() while using a ArrowArrayStream is strongly discouraged and may lead to unexpected results. As a rule of thumb, no OGRLayer methods that affect the state of a layer should be called on a layer, while an ArrowArrayStream on it is active. ### Change list - Copy in `arrow_bridge.h` with the Arrow C Data Interface headers. - Add `arrow_bridge.h` to the bindgen script so that `gdal_3.6.rs` includes a definition for `ArrowArrayStream`. I re-ran this locally; I'm not sure why there's such a big diff. Maybe I need to run this from `3.6.0` instead of `3.6.2`? - Implement `read_arrow_stream` - Add example of reading arrow data to [`arrow2`](https://docs.rs/arrow2) ### Todo - Pass in options to `OGR_L_GetArrowStream`? According to the guide: > The `papszOptions` that may be provided is a NULL terminated list of key=value strings, that may be driver specific. So maybe we should have an `options: Option<Vec<(String, String)>>` argument? Pyogrio [uses this](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1090-L1091) to turn off generating an `fid` for every row. - Have an option to skip reading some columns. Pyogrio does this with [calls to](https://github.com/geopandas/pyogrio/blob/a0b658509f191dece282d6b198099505e9510349/pyogrio/_io.pyx#L1081-L1088) `OGR_L_SetIgnoredFields`. ### References - [OGR Guide for using the C Data Interface](https://gdal.org/tutorials/vector_api_tut.html#reading-from-ogr-using-the-arrow-c-stream-data-interface) Closes #280 Co-authored-by: Kyle Barron <kylebarron2@gmail.com>
- Loading branch information