Add support for fids filter with use_arrow=True #304
Conversation
Thanks for working on this @theroggy.

I wonder if we should always set `fid_as_index` to `True` if `fids` is used, and direct users to inspect the index values for the order of records returned from the data source. Then they can more easily sort that as needed, and it is transparent which record corresponds to which fid.
You can also update the docstring for `fids` to denote that the order of records returned is not guaranteed to follow the order of the `fids`. If the order of the fids or the fid of each record matters, `fid_as_index` should solve that; otherwise they may just want a subset of features and the fid of each doesn't really matter. I.e., we're testing for set membership in the list of fids, rather than plucking out records in exactly that order (e.g., pandas `take()`).
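As an illustration of that suggestion, a minimal sketch of what this looks like from the user side (the file name and fid values are placeholders):

```python
import pyogrio

fids = [10, 5, 42]

# With fid_as_index=True the FID of each row is kept as the dataframe index,
# so it stays transparent which record corresponds to which fid, whatever
# order the driver returns the records in.
df = pyogrio.read_dataframe("example.gpkg", fids=fids, fid_as_index=True)

# If the original order of the requested fids matters, reorder explicitly:
df = df.loc[fids]
```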
I added some info to the docstring about the order of the rows returned + how sorting can be applied.
I think we should do some benchmarking here to see how good this approach actually is (I did some for the non-arrow use case, originally when working on the support for reading by FIDs; I will try to dig them up). Because if this is slower, then I am not sure we should provide this feature (if a user wants this, they can always do the filtering themselves with an equivalent `where` clause).
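For reference, a sketch of that manual `where` alternative (the path is a placeholder, and the actual FID column name differs per driver, e.g. `fid` for GeoPackage):

```python
import pyogrio

fids = [10, 5, 42]

# Build the attribute filter by hand; "FID" is the OGRSQL special field,
# but database-like drivers use their own SQL dialect and column name.
where = f"FID IN ({', '.join(str(fid) for fid in fids)})"
df = pyogrio.read_dataframe("example.shp", where=where, use_arrow=True)
```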
I did some quick tests... and as is often the case, it depends on the drivers involved. I think there are generally three types of data sources available through GDAL:

- text-based formats (e.g. GeoJSON)
- binary file-based formats (e.g. Shapefile)
- database-based formats (e.g. GeoPackage)
For each individual driver there will still be significant differences depending on the (lack of) optimizations implemented and the inherent efficiency of the format, but I think testing these three will give a reasonable idea of what to expect in general.
For the shapefile performance, it seems that the […]

I used the files on this page for my tests: https://landbouwcijfers.vlaanderen.be/open-geodata-landbouwgebruikspercelen
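As a rough illustration of the kind of comparison this involves (not the original test script; the file name and fid selection are placeholders):

```python
import time
import pyogrio

path = "landbouwgebruikspercelen.gpkg"  # placeholder file name
fids = list(range(0, 100_000, 10))

# Time the fids filter against reading the full file, with and without arrow.
for use_arrow in (False, True):
    start = time.perf_counter()
    subset = pyogrio.read_dataframe(path, fids=fids, use_arrow=use_arrow)
    print(f"fids, use_arrow={use_arrow}: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    full = pyogrio.read_dataframe(path, use_arrow=use_arrow)
    print(f"full read, use_arrow={use_arrow}: {time.perf_counter() - start:.2f}s")
```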
It is a bit more fuss than just using that query... and feature-parity-wise this is also not ideal, so I personally would provide the feature regardless of possible performance differences, and document the circumstances in which it is slower, as we have done in other cases.
I think providing the functionality - since it allows parity between arrow and non-arrow modes - and documenting (briefly) that performance may vary widely by driver seems like a reasonable path forward.
That the performance is driver dependent apparently was already mentioned, but I added that the value of […]
Sorry for the slow follow-up here on the promised benchmarks: the previous time we briefly discussed this was at #19 (comment) (and the two comments below that). It's quite similar to what Pieter shows above, performance being very driver dependent. The main summary of those older comments (and from testing shp, gpkg, geojson and fgb): […]
I think a lot also depends on the use case of the user specifying `fids`.

Of course, we can indeed document this clearly, and leave the choice to the users to use this keyword or not (given that we provide a good error message when it fails).
As a comparison, for Shapefile it can actually be faster to read the whole file and filter it after reading (and with the arrow reader this can be done chunk by chunk, so memory use can still stay low), compared to using a where filter (but this is still a lot slower than the non-arrow fid-based reader). I re-ran some of the older benchmarks (+ added the arrow read+filter test): https://nbviewer.org/gist/jorisvandenbossche/f47549ec33edc234ac17b05a0bcaea69
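A sketch of that chunk-by-chunk read-and-filter approach with the Arrow stream (assuming `pyogrio.open_arrow` yields `(meta, reader)` and that the FID column is exposed under `meta["fid_column"]`; the path and fid values are placeholders):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyogrio

wanted = pa.array([10, 5, 42])

# Stream record batches and keep only the wanted fids; memory use stays
# bounded by the batch size instead of by the full file.
filtered = []
with pyogrio.open_arrow("example.shp") as (meta, reader):
    fid_idx = reader.schema.get_field_index(meta["fid_column"])
    for batch in reader:
        mask = pc.is_in(batch.column(fid_idx), value_set=wanted)
        filtered.append(batch.filter(mask))

table = pa.Table.from_batches(filtered)
```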
For text files it is, at least in my test, ~ the same. But performance and large text-based GIS files are a contradiction in terms anyway, so not super interesting :-).
I know Oracle has a fixed limit of max. 1000 elements in an IN clause. I had a look at the limit in SQLite, but it seemed like that limit is beyond any sensible number. I didn't think of OGRSQL, but apparently the limit there is in the same league as the one imposed by Oracle.
This is indeed the use case I was thinking about.
Yes, for OGRSQL-dependent data sources the practical limit seems to lie between 4000 and 5000 fids. There are several ways this could be stretched (clustering the fids and using BETWEEN clauses or UNION ALLs, reading in batches, inserting them in a temp file and joining on it, ...), but I don't think it is relevant at this point to implement such things. If anyone asks for it with a sensible use case, further optimizations are possible and could be considered...

Do I understand correctly that you use the […]
I personally think more users will benefit from the feature parity than there will be users that run into the limits imposed on the `IN` clause.
Same as above: if there are use cases where such an optimization would be useful/needed, this could indeed also be a way to implement an optimization for this in pyogrio.
If I recall correctly, some of where this came up in dask-geopandas was the idea of using feature bounds (via `read_bounds`) to give each chunk a specific spatial extent, and then reading each chunk by its fids.
According to the notebook, this approach was ~10x faster than using the SQL `IN` filter.
Because in theory the SQL fid filtering could be fast, I checked on the feasibility of speeding it up in GDAL a few days ago, and apparently it wasn't too hard because it has already been implemented: OSGeo/gdal#8590. So a loop that reads the fids per e.g. 1000 using the where filter could be a clean solution...
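That loop could look roughly like this (a hypothetical helper, not part of this PR; the "FID" field name and the batch size are assumptions):

```python
import pandas as pd
import pyogrio

def read_by_fids_batched(path, fids, batch_size=1000, **kwargs):
    # Read the fids in batches through a where filter, so each IN clause
    # stays well under the limits of the SQL dialect involved.
    parts = []
    for i in range(0, len(fids), batch_size):
        batch = fids[i:i + batch_size]
        where = f"FID IN ({','.join(str(fid) for fid in batch)})"
        parts.append(pyogrio.read_dataframe(path, where=where, **kwargs))
    return pd.concat(parts, ignore_index=True)
```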
Thanks for the GDAL issue! ;)
Indeed, dask-geopandas was the reason that I was looking into this FID-based reading. I don't know if anyone is using it in practice, but in theory this allows processing a big shapefile in parallel while ensuring each chunk has a specific spatial extent (in practice it's probably better to first convert the file to something like parquet, and only then do the spatial repartitioning). And in practice, at the moment I have also only implemented the consecutive chunks with skip_features/max_features in the dask-geopandas reader.

For this PR: I am a bit skeptical about needing this (I don't think we necessarily need exact keyword equivalency with or without arrow; we can also better document that you can use the `where` keyword for this). […]
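For context, the consecutive-chunks approach mentioned here can be sketched as follows (the path and chunk size are placeholders):

```python
import pyogrio

path = "big_file.shp"  # placeholder
chunk_size = 100_000

# Total feature count, so the file can be split into consecutive chunks
# that can each be read (and processed) independently.
n_features = pyogrio.read_info(path)["features"]

chunks = [
    pyogrio.read_dataframe(path, skip_features=offset, max_features=chunk_size)
    for offset in range(0, n_features, chunk_size)
]
```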
Suggestion: since we're not yet in agreement on the API, and there are ongoing changes in GDAL that may impact some of our direction here, let's defer this from the 0.7 release.
New performance test with GDAL 3.8, which includes the optimization mentioned above (OSGeo/gdal#8590):
The maximum number of FIDs in the IN clause is 4997 for "OGRSQL". For GPKG there is no explicit limit. I did some tests around this, and the limit is in the number of elements, not in the number of digits or the total length of the where filter.
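The test script isn't reproduced here, but a probe for such a limit could look roughly like this (a hypothetical helper; it assumes an over-long IN clause surfaces as an exception):

```python
import pyogrio

def find_in_clause_limit(path, low=4000, high=6000):
    # Grow the fid list until the generated "FID IN (...)" filter is
    # rejected by the SQL dialect; return the last size that still worked.
    last_ok = None
    for n in range(low, high):
        try:
            pyogrio.read_dataframe(path, fids=list(range(n)), use_arrow=True)
            last_ok = n
        except Exception:
            break
    return last_ok
```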
Thanks for the updates and your patience here @theroggy.

Given the performance implications and changes on the GDAL side, I think we should limit support for this in `ogr_open_arrow` to GDAL >= 3.8.0 and raise a runtime exception otherwise.
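A guard along those lines could be sketched as (assuming the GDAL version is available as the `pyogrio.__gdal_version__` tuple):

```python
import pyogrio

# Refuse the fids filter in arrow mode on GDAL versions that lack the
# faster fid-filtering code path, instead of silently being slow.
if pyogrio.__gdal_version__ < (3, 8, 0):
    raise RuntimeError(
        "'fids' with use_arrow=True requires GDAL >= 3.8.0"
    )
```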
@brendan-ward Or a warning, so we avoid breaking without explicit need?
I guess a warning would be fine, though there is the risk of the user not seeing it and then getting frustrated that things are so slow. But as it is, we are still a ways out from making `use_arrow=True` the default.
Yes, it varies a bit from situation to situation whether an error being thrown is more frustrating than things being slow... but in general I'm personally more stressed by things not working at all than by slowness. I added a warning. I excluded GeoPackage and GeoJSON from this, as my tests above showed that performance is fine for those.
Thanks @theroggy!
Add support for the `fids` filter keyword. Resolves #301.
Note 1: there is another test that can be enabled for use_arrow=True, but it depends on the `force_2d` support PR (#300) being merged.

Note 2: when reading without `use_arrow=True`, for the testcase we use, GDAL seems to return the data in the order the fids were supplied. When using `use_arrow=True`, the rows are returned in ascending order of the fids, or rather (at least for db-oriented data sources) probably in the order they are written on disk. The order rows are returned in is typically not guaranteed unless you explicitly ask to order by, at least for DB-like sources, so in my opinion it is always a risk to depend on that. However, if the idea is that users rely on the order data is returned in when using the `fids` filter, or if having the order guaranteed is useful in this case, sorting could be added at the geopandas level in pyogrio.