-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: read features based on array of FIDs #19
Conversation
9955949
to
ba804ce
Compare
I did some performance checks: https://nbviewer.org/gist/jorisvandenbossche/c57fd2693a4442a539797ebe46df3a61
This is of course with only one specific dataset, I don't know how generalizable the observations are. For GeoPackage, it will in any case depend on the complexity of your dataset whether the overhead of getting by FID is important or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @jorisvandenbossche ! Some of these results are a bit surprising, I wouldn't have expected reading by FID in sequential order to be that much slower.
The split of get_features_by_fid
vs get_features
seems OK - there isn't that much overlap, and they have different parameters, so it seems like cleaner logic the way you have it now. And it's internal API, so we can always change later. 😄
I think with shapefiles, FID corresponds directly to a feature's position in the file. That is, it isn't a separate field, so GDAL just needs to be able to read features based on offsets. In GPKG, fid is a separate field and also a primary key. But I would have thought the primary key would be pretty fast to work with.
pyogrio/_io.pyx
Outdated
@@ -757,7 +859,8 @@ def ogr_read_info(str path, object layer=None, object encoding=None, **kwargs): | |||
'encoding': encoding, | |||
'fields': get_fields(ogr_layer, encoding)[:,2], # return only names | |||
'geometry_type': get_geometry_type(ogr_layer), | |||
'features': OGR_L_GetFeatureCount(ogr_layer, 1) | |||
'features': OGR_L_GetFeatureCount(ogr_layer, 1), | |||
'random_read': OGR_L_TestCapability(ogr_layer, OLCRandomRead), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For testing performance of bbox filters by OGR driver, you might want to add OLCFastSpatialFilter (presumably this is true for GPKG and Shapefile, likely not true for GeoJSON). We may want to consider how we list data source capabilities for the driver in general (though outside this PR); perhaps a nested dictionary under meta
:
meta = {
...,
capabilities: {
"random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead),
"fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter),
...
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I mostly added this temporarily to check the value of this. It actually returns True for all three of Shapefile, GPKG and GeoJSON (while they show quite different performance impact), so not sure how useful this is to check (I would be curious which format does not support the random read)
But I agree that if we want to keep this, the nested "capabilities" field as you indicate seems a good idea.
pyogrio/_io.pyx
Outdated
@@ -35,7 +35,7 @@ log = logging.getLogger(__name__) | |||
|
|||
### Constants | |||
cdef const char * STRINGSASUTF8 = "StringsAsUTF8" | |||
|
|||
cdef const char * OLCRandomRead = "RandomRead" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding this here, add this to the bottom of the cdef extern from "ogr_api.h":
block in _ogr.pxd
cdef extern from "ogr_api.h":
...
const char* OLCRandomRead
This imports the constant defined in GDAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The StringsAsUTF8
on the line is also such a constant, so shall I move that one as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please!
I haven't had a chance to dig into the implementation in GDAL, but from the profiling results, it looks like the Another way to approach this would be to use indexes as offsets from the beginning of a given file rather than FID. For serial FID formats (e.g., shapefile), this seems like it should function the same; for formats where FID is not necessarily serial (GPKG?), this may be a function of how well that driver supports fast seeking to the next feature (see OLCFastSetNextByIndex); I think GPK does not support fast seeking. No idea how the performance would compare between using the slower seeking method vs get by FID. |
I see a bit of sqlite interaction in
I am not fully understanding this. Wouldn't this only allow reading a consecutive set of rows? (like give me rows 10 to 20) |
I meant that we let the user specify indices (offsets) as an array just like FIDs, e.g., [1,10,46, 92, ...] Then we use a combination of |
Ah, I see! Yes, that sounds interesting to try out. |
I don't see a custom |
Tried out the indices approach, and the first observations:
This is only a very quick check, and the trade-offs might depend on the actual file format. But at least considering Shapefile and GeoPackage, reading by index doesn't help (for Shapefile it's the same, but this was already fast using FIDs anyway, and for GeoPackage it's very slow). Also, overall, I think the FID approach will probably fine even though it gives a slowdown for some formats like GPKG, assuming that 1) we will typically read a relatively small subset at a time (eg if you have more than 10 partitions, you already are below the 10% for one read, and also for GeoPackage, reading 10% of the data using FIDs is still a bit faster than reading all data sequentially) and 2) with real world data (and not the very simple point dataset), the actual parsing of the data is also more important than the query by ID (and so the potential slowdown from reading by FID is relatively smaller compared to the overall time it takes to read the data) The combined bbox + filter on FIDs might still be worth experimenting with for GeoPackage. I pushed the code I used to test the indices approach (will remove it again in a later commit to clean-up the code in this PR), and I also updated the notebook with the new timings: https://nbviewer.org/gist/jorisvandenbossche/c57fd2693a4442a539797ebe46df3a61 |
Thanks for the additional testing and follow up! It's a bummer that GPKG is going to be slow until you get below a certain threshold, but like you say, other factors would mitigate the impact from that (reading small percent and having more time parsing geometries). Adding a brief warning to the docstring may be a good idea. I haven't had a chance to test, but I wonder how using the |
I did a very quick test:
So given that it's only slightly faster for GPKG (to remember, the above speedup is "best case" as this is a very simple dataset of points, once you need to parse more data, the difference will get smaller), and it has its limitations in the size of the query, it doesn't seem worth it to have special case code depending on the driver to vary the method to query FIDs. |
Hmm, actually, it seems this ValueError with larger So for GeoPackage, I can further increase the size of the subset selection, and then using the where clause starts to get more significantly faster: for selecting 20,000 rows (20% of the size of the points dataset) takes around ~300ms with FIDs and ~100ms with a where clause. So after all, it might be useful to consider special casing GPKG (although it will still depend on how many other columns are selected / how comptes the geometries are). But, we will need the FID selection for other formats anyway. So for this PR, I propose that I clean-up the implementation for FID selection and add tests and docs etc, and then we can still further consider this topic in a follow-up (the special casing for GPKG could also live in a level up, i.e. leave this to the user or downstream package (eg dask-geopandas could also do this)) |
OK, I cleaned up the PR and added some basic tests and docstring. Should be ready for a code review. |
if ogr_feature == NULL: | ||
raise ValueError("Failed to read FID {}".format(fid)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GDAL actually puts an informative message to stderr in this case (saying feature id .. is out of bounds). Is there a way to capture this and use that for the error message? (in this case it doesn't change that much though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use exc_wrap_pointer
in a try / except block:
try:
ogr_feature = exc_wrap_pointer(OGR_L_GetFeature(ogr_layer, fid))
except CPLE_BaseError as exc:
raise DriverIOError(str(exc))
when then raises an exception:
pyogrio.errors.DriverIOError: Attempt to read shape with feature id (-1) out of available range.
If you want, you could also create a new error class for this specifically (in errors.py
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you want to wrap the GDAL error as suggested here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche !
This is pretty close, just needs another test and likely a few minor docstring updates.
if ogr_feature == NULL: | ||
raise ValueError("Failed to read FID {}".format(fid)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use exc_wrap_pointer
in a try / except block:
try:
ogr_feature = exc_wrap_pointer(OGR_L_GetFeature(ogr_layer, fid))
except CPLE_BaseError as exc:
raise DriverIOError(str(exc))
when then raises an exception:
pyogrio.errors.DriverIOError: Attempt to read shape with feature id (-1) out of available range.
If you want, you could also create a new error class for this specifically (in errors.py
)
pyogrio/_io.pyx
Outdated
if fids is not None: | ||
if where is not None or bbox is not None or skip_features or max_features: | ||
raise ValueError( | ||
"cannot set both 'fids' and one of 'where', 'bbox', " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"cannot set both 'fids' and one of 'where', 'bbox', " | |
"cannot set both 'fids' and any of 'where', 'bbox', " |
pyogrio/_io.pyx
Outdated
"'skip_features' or 'max_features'" | ||
) | ||
fids = np.asarray(fids, dtype=np.intc) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: move the substantive work here (e.g., applying where filter, etc) down to the else
block on 755, so that this conditional test up top is just for validating parameters.
pyogrio/geopandas.py
Outdated
fids : array-like, optional (default: None) | ||
Array of integer feature id (FID) values to select. Cannot be combined | ||
with other keywords to select a subset (`skip_features`, `max_features`, | ||
`where` or `bbox`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`where` or `bbox`). | |
`where` or `bbox`). Note that the starting index is driver specific. |
(just a nudge to users to remember that GPKG starts at 1, others at 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it is also worth mentioning here that the performance of reading large numbers of features using FIDs is driver specific, and may be much worse than reading the entire dataframe and selecting out features of interest afterward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expanded the note a bit, please take a look (I kept the performance note short for now, because actually starting to explain this might get a bit long ..).
"capabilities": { | ||
"random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead), | ||
"fast_set_next_by_index": OGR_L_TestCapability(ogr_layer, OLCFastSetNextByIndex), | ||
"fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to keep this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems potentially useful, at least for benchmarking. I'd keep it for now.
Thanks for the review! Pushed some updates |
Also, are we OK with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche for the updates. This looks good. Up to you if you want to wrap the GDAL error for reading a FID out of bounds or leave as is (we can come back to that later).
I think the fids
keyword is just fine, since FID seems to be a common term when referring to specific records within a GIS file.
Can you please add an entry to CHANGES.md
for this PR?
if ogr_feature == NULL: | ||
raise ValueError("Failed to read FID {}".format(fid)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you want to wrap the GDAL error as suggested here?
"capabilities": { | ||
"random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead), | ||
"fast_set_next_by_index": OGR_L_TestCapability(ogr_layer, OLCFastSetNextByIndex), | ||
"fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems potentially useful, at least for benchmarking. I'd keep it for now.
I was indeed thinking to leave that for another PR, as there are some other occurences of that in the same file as well (and didn't want to start mixing things too much) |
This is experimenting with reading a subset based on an array of FIDs. Very much a draft (eg still needs tests, docs, etc), but good enough for running some benchmarks (will post them after a re-run).
The diff is not super clear: I splitted
get_features
to separate the processing of the geometry and fields (so this can be reused inget_features_by_fid
). If you look at only the first commit, this is clearer. In the end, I am not sure if this was the best approach, as in theory it could also be combined inget_features
and have a if/else switch at the moment to get the OGRFeature (switching betweenOGR_L_GetNextFeature
andOGR_L_GetFeature
), since the rest ofget_features
andget_features_by_fid
have a lot of overlap.