ENH: Add support for skip_features, max_features for read_arrow #282
Conversation
I think this PR adds support for this in GDAL 3.8: OSGeo/gdal#8306

Specifically one commit (OSGeo/gdal@248cf60): "OGRLayer::GetArrowStream(): do not issue ResetReading() at beginning of iteration, but at end instead, so SetNextByIndex() can be honoured".

+1

Forget what I said, we already do that through the
```python
elif skip_features > 0:
    table = reader.read_all().slice(skip_features).combine_chunks()
```
Side thought: we should have a better way to release the sliced memory, without having to copy the whole table (and with "we", I mean pyarrow should enable this).
Right now, if you have a million rows and skip the first 10,000, only the first batch is actually sliced, yet we unnecessarily copy all the other data.
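As a rough illustration (a minimal sketch with made-up data, not code from this PR), slicing a pyarrow Table is zero-copy, so the sliced result keeps the skipped rows' buffers alive until `combine_chunks()` copies the remaining data into fresh buffers:

```python
import pyarrow as pa

# Hypothetical stand-in for reader.read_all(): a table built from many
# record batches, as the Arrow stream would produce.
batches = [
    pa.record_batch([pa.array(range(i, i + 65_536))], names=["value"])
    for i in range(0, 1_048_576, 65_536)
]
table = pa.Table.from_batches(batches)

# slice() is zero-copy: only the first batch is actually trimmed, and the
# sliced table still references every original buffer (including the
# 10,000 skipped rows).
sliced = table.slice(10_000)

# combine_chunks() copies the remaining rows into a single chunk; only
# after that copy can the original buffers be released.
copied = sliced.combine_chunks()
```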
Latest commit passes
Progress toward feature parity between Arrow and non-Arrow read interfaces.
This adds support for `skip_features` and `max_features` to `read_arrow`, which enables these to be passed through via `read_dataframe(path, use_arrow=True, skip_features=10, max_features=2)` so that the behavior is the same with and without `use_arrow`.

I added a note to the introduction to describe the overhead involved:

> If `use_arrow` is `True`, `skip_features` and `max_features` will incur additional overhead because all features up to the next batch size above `max_features` (or the size of the data layer) will be read prior to slicing out the requested range of features. If `max_features` is less than the maximum Arrow batch size (65,536 features), only `max_features` will be read. All features up to `skip_features` are read from the data source and later discarded because the Arrow interface does not support randomly seeking a starting feature.

This overhead is relative to reading via Arrow; based on my limited tests so far, it is still generally a lot faster to use Arrow even with these parameters than without Arrow.
This also drops a validation requirement that `skip_features` be less than the number of features available to read (originally we raised a `ValueError`). Since calling `.slice` on a pyarrow Table with a value larger than the size of the original table happily returns an empty table, it made sense to take this approach throughout: if you ask for more features than are available, you get back empty arrays / pyarrow Tables / (Geo)DataFrames.
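A small usage sketch of that behavior (hypothetical path and feature count, not from the test suite): skipping past the end of the layer now returns an empty result instead of raising.

```python
from pyogrio import read_dataframe

# Suppose the layer has 1,000 features; skipping past the end no longer
# raises ValueError but returns an empty (Geo)DataFrame with the expected
# columns.
df = read_dataframe("data.gpkg", use_arrow=True, skip_features=5_000)
assert len(df) == 0
```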