ENH: use new columnar GetArrowStream if GDAL>=3.6 and pyarrow available #155
Conversation
pyogrio/_io.pyx (outdated)
```cython
IF CTE_GDAL_VERSION >= (3, 6, 0):
    if not OGR_L_GetArrowStream(ogr_layer, &stream, NULL):
        raise RuntimeError("Failed to open ArrowArrayStream from Layer")
ELSE:
    raise RuntimeError("Need GDAL>=3.6 for Arrow support")

stream_ptr = <uintptr_t> &stream

import pyarrow as pa
table = pa.RecordBatchStreamReader._import_from_c(stream_ptr).read_all()
```
This part is actually the main piece that is different from the existing `ogr_read` function. So it might be cleaner to integrate it into that one (to avoid too much duplication of code) by passing the `use_arrow` keyword through to `ogr_read`.

On the other hand, that makes the return value a bit messy (unless we change it into a nested tuple of length 2, `(meta, data)`, where `data` can be `(fid_data, geometries, field_data)` in the standard case and `table` in the arrow case). And the arrow code path does not yet handle all keywords, so it might also be cleaner as a separate function.
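Purely as an illustration of that nested shape (the names mirror the discussion above; this is not the PR's actual code):

```python
def ogr_read_sketch(use_arrow, meta, table, fid_data, geometries, field_data):
    # Hypothetical unified return shape: always (meta, data), where the
    # type of data depends on the code path.
    if use_arrow:
        data = table  # a pyarrow.Table
    else:
        data = (fid_data, geometries, field_data)  # existing numpy-based payload
    return meta, data
```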
Thanks for working on this @jorisvandenbossche! The potential of a 2x speedup is exciting!

A bit of code duplication is OK for now, especially since the return signature of the Cython function is different and the supported parameters differ a bit.

Question: if we detect GDAL >= 3.6 and `pyarrow`, and the parameters are compatible, would we want to automatically opt in to using arrow for I/O with GDAL to / from GeoDataFrames, instead of basing it only on a keyword? Though this would add complexity around code paths based on parameters...

Perhaps generalize `use_arrow` to something that allows `"auto"`, `"numpy"`, `"arrow"` values, where `"auto"` (the default) is based on what is available, falling back to `"numpy"` if the arrow path is not available. A sketch of that resolution logic follows below.
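A minimal sketch of how that three-valued keyword could resolve (the helper name `_resolve_use_arrow` and the `gdal_version` tuple are assumptions for illustration, not pyogrio's API):

```python
def _resolve_use_arrow(use_arrow, gdal_version):
    # gdal_version is assumed to be a (major, minor, patch) tuple
    if use_arrow not in ("auto", "numpy", "arrow"):
        raise ValueError(f"invalid use_arrow value: {use_arrow!r}")

    try:
        import pyarrow  # noqa: F401
        has_pyarrow = True
    except ImportError:
        has_pyarrow = False

    arrow_available = gdal_version >= (3, 6, 0) and has_pyarrow

    if use_arrow == "arrow" and not arrow_available:
        raise RuntimeError("use_arrow='arrow' requires GDAL>=3.6 and pyarrow")
    if use_arrow == "auto":
        return "arrow" if arrow_available else "numpy"
    return use_arrow
```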
Do we also want to consider adding these as raw I/O functions as well, e.g., `read_arrow` / `write_arrow`?

> this always reads the geometry column

We should probably investigate this more. Some folks use pyogrio to get at non-geometry tables in ESRI FileGeodatabases, among others. It would be nice to use arrow to get those, if possible.
You can use `read_dataframe('zip://pyogrio/tests/fixtures/test_fgdb.gdb.zip', layer='basetable')` for testing tabular-only data.
Eventually we should indeed do that, I think, but maybe not to start with? Generalizing the keyword seems good for that. But maybe we then need another name?
Good idea, I think adding a …
Yes, so that actually works (just tested, and it reads the table nicely without a geometry column). The main issue for this case is how we reliably detect that there is no geometry column (one possible check is sketched below). But the bigger issue is that if your layer has a geometry, it is always included in the resulting arrow table, which means we might not be able to support the …
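For the detection part, a minimal sketch of one possible check via the layer's feature definition, assuming the standard OGR C API (this is not the PR's code):

```cython
# Hedged sketch: detect a geometry-less layer from its feature definition.
cdef OGRFeatureDefnH ogr_featuredef = OGR_L_GetLayerDefn(ogr_layer)
if OGR_FD_GetGeomFieldCount(ogr_featuredef) == 0:
    # no geometry field at all, so the arrow table will be purely tabular
    ...
```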
Ah, it seems that …
Something I hadn't thought about before is the lifetime of the GDAL Dataset/Layer while pyarrow still has data backed by buffers owned by GDAL. Quoting the GDAL docs for `OGR_L_GetArrowStream` about this:

> …
So according to that, we should only destroy the GDAL dataset after we have fully consumed the stream (the data can still be owned by GDAL, though; we are still allowed to call the release callback). Simplified, this PR currently implements the following:

```cython
def read(path, layer):
    cdef OGRDataSourceH ogr_dataset = NULL
    cdef OGRLayerH ogr_layer = NULL
    cdef ArrowArrayStream stream

    ogr_dataset = GDALOpenEx(path)
    try:
        ogr_layer = GDALDatasetGetLayer(ogr_dataset, layer)
        OGR_L_GetArrowStream(ogr_layer, &stream, options)
        table = pa.RecordBatchReader._import_from_c(<uintptr_t> &stream).read_all()
    finally:
        GDALClose(ogr_dataset)
    return table
```

So because I am currently creating a …

Returning a materialized Table (from …
This is fresh off the presses, so no guarantees that it's a sound design, but when drafting r-spatial/sf#2036 I made a little wrapper stream that keeps the dataset alive until … (a Python-level sketch of the same idea is below).
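A minimal Python-level sketch of that idea, with a hypothetical `close_dataset` callback standing in for `GDALClose` (illustrative, not the sf or pyogrio implementation):

```python
class DatasetBackedReader:
    """Yield record batches from `reader`, closing the backing dataset only
    once the stream is exhausted (or when explicitly closed)."""

    def __init__(self, reader, close_dataset):
        self._reader = reader                # e.g. a pyarrow.RecordBatchReader
        self._close_dataset = close_dataset  # hypothetical GDALClose wrapper
        self._closed = False

    def close(self):
        if not self._closed:
            self._closed = True
            self._close_dataset()

    def __iter__(self):
        try:
            for batch in self._reader:
                yield batch
        finally:
            # safe to close only after the stream has been fully consumed
            self.close()
```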
Just something I noticed in the documentation that you may or may not be aware of (and that we should probably fix upstream, since it is not consistent with general ArrowSchema/Array usage):

> …
@paleolimbot see my comment above: #155 (comment)
Oh sorry, I missed that comment with the exact quote 🤦. I think the stream lifecycle being contained within the layer lifecycle is fine, easy to work around, and consistent with what ADBC does with the lifecycle of its "statement". But the Array lifecycle is problematic, because the assumptions around not copying the array data run deep, and it's pretty much impossible to guarantee that the ArrowArray object will not be accessed after the dataset is closed. (Right?)
But I think that is OK? It was my assumption that you can access the ArrowArray data after the dataset is closed, as long as that ArrowArray already exists (so you need to consume the stream, so that all ArrowArray structs are created, before the dataset is closed). Although now rereading the quote, it does indeed say that the ArrowArray (not just the stream) shouldn't be "used" after the dataset is closed. It is not super clear to me what "should no longer be used (except for potentially being released)" exactly means, though, as I would assume that as long as you don't call the release callback, the actual buffers pointed to by the ArrowArray are guaranteed to still be alive. But let's maybe open an issue on the GDAL side to clarify this.
This is an experiment to use the Arrow Stream API added to GDAL in the upcoming 3.6 (RFC 86: Column-oriented read API for vector layers).

For one specific benchmark using the nz-buildings-outline.gpkg file, I noticed another 2x speed-up compared to base pyogrio (see slide).
This does not support all features, or behaves a bit differently in some cases. Some observations:
- `skip_features` or `max_features` is not supported (this would need some change on the GDAL side, for example to set a batch size and then only get one batch)
- ~~this always reads all columns at the moment, but that should be solvable by explicitly ignoring fields with `OGR_L_SetIgnoredFields`, as mentioned in ENH: avoid I/O when columns don't need to be returned #153 (something we should do anyway)~~ -> selecting columns is supported (see the sketch after this list)
- ~~this always reads the geometry column (not sure if that is something you can turn off with existing GDAL APIs)~~ -> not reading the geometry column (or reading a layer without geometry) is supported
- ~~this always returns an "FID" column, although the name doesn't seem to be fixed (my quick observation is that for file formats that don't have this natively, it comes as an "OGC_FID" column, while for geopackage it reads as "fid", at least for a gpkg file written by geopandas)~~ -> by default this now also doesn't read an FID column (to be consistent with the non-arrow code path), and it follows the `return_fids` keyword to include it as a column or not
- the name of the geometry column: `OGR_L_GetGeometryColumn` doesn't always seem to know this (maybe some formats don't require a name (or support a custom name), in which case this method also doesn't return anything?). But for now I am making it "geometry" in all cases anyway, to be consistent with the normal read path.
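A minimal sketch of the column-selection idea via ignored fields, assuming the standard GDAL C API (`CSLAddString`, `OGR_L_SetIgnoredFields`, `CSLDestroy`); illustrative only, not necessarily how this PR wires it up:

```cython
# Hedged sketch: tell OGR to skip unrequested fields before opening the
# Arrow stream, so those columns are never materialized.
cdef char **fields_to_ignore = CSLAddString(NULL, "unwanted_field")
# the special name "OGR_GEOMETRY" also skips the geometry column
fields_to_ignore = CSLAddString(fields_to_ignore, "OGR_GEOMETRY")
try:
    if OGR_L_SetIgnoredFields(ogr_layer, <const char**>fields_to_ignore) != OGRERR_NONE:
        raise RuntimeError("Failed to set ignored fields on layer")
finally:
    CSLDestroy(fields_to_ignore)
```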