ENH: read features based on array of FIDs #19

jorisvandenbossche · 2021-10-16T12:20:26Z

This is experimenting with reading a subset based on an array of FIDs. Very much a draft (eg still needs tests, docs, etc), but good enough for running some benchmarks (will post them after a re-run).

The diff is not super clear: I splitted get_features to separate the processing of the geometry and fields (so this can be reused in get_features_by_fid). If you look at only the first commit, this is clearer. In the end, I am not sure if this was the best approach, as in theory it could also be combined in get_features and have a if/else switch at the moment to get the OGRFeature (switching between OGR_L_GetNextFeature and OGR_L_GetFeature), since the rest of get_features and get_features_by_fid have a lot of overlap.

jorisvandenbossche · 2021-10-16T13:05:41Z

I did some performance checks: https://nbviewer.org/gist/jorisvandenbossche/c57fd2693a4442a539797ebe46df3a61

I expected that the performance of reading by FID would depend on the exact file format being used, and that is also the case (however I hadn't expect GeoPackage to perform so bad). Those first set of benchmarks are with a simple Point dataset (N=100,000) with a single attribute, so the time to actually parse the feature is minimal and the overhead of using FIDs or not is more pronounced:
- For Shapefile, reading all sequentially or all by FID has almost the same speed (reading all with shuffled FIDs is a bit slower)
- For GeoPackage, there is a quite large difference. Sequentially it's on par with Shapefile, but reading by FID is a lot slower (almost 10x). I comfirmed with py-spy that all the additional time is coming from OGR_L_GetFeature(ogr_layer, fid) (there seems to be quite a bit of sqlite3 overhead).
- Reading GeoJSON is slower anyway, and reading by FID further slows it down a bit.
I compared reading with a bbox vs reading with an array of FIDs that result in the same subset of geometries. In this specific case, for Shapefile it's faster to read by FID, but for GeoPackage is actually faster to read by bbox.
- The reason I also did this benchmark was to see if this would be beneficial for GeoPackage. Since that is slower with FIDs, it might be an option to explore if we could read GeoPackage first by bbox but then additional check if the FID is in the list of FIDs. In the context of reading a subset of a file to form a partition of a dask-geopandas GeoDataFrame, it should be possible to also have the bbox in addition to the array of FIDs (eg if the FIDs were calculated based on hilbert distance based on the bounds, it's easy to calculate the total bounds for each partition).
The above is with a very simple file. In practice when reading a file with more complex geometries / more attributes, the time spent in parsing those will become more important than the overhead of accessing the feature by FID. So as a test to read the first features of the tl_2019_us_zcta510 dataset in GeoPackage format, reading by FID is now only a bit slower (16% slower instead of almost 10x slower).

This is of course with only one specific dataset, I don't know how generalizable the observations are. For GeoPackage, it will in any case depend on the complexity of your dataset whether the overhead of getting by FID is important or not.

brendan-ward

Thanks for working on this @jorisvandenbossche ! Some of these results are a bit surprising, I wouldn't have expected reading by FID in sequential order to be that much slower.

The split of get_features_by_fid vs get_features seems OK - there isn't that much overlap, and they have different parameters, so it seems like cleaner logic the way you have it now. And it's internal API, so we can always change later. 😄

I think with shapefiles, FID corresponds directly to a feature's position in the file. That is, it isn't a separate field, so GDAL just needs to be able to read features based on offsets. In GPKG, fid is a separate field and also a primary key. But I would have thought the primary key would be pretty fast to work with.

brendan-ward · 2021-10-19T00:33:59Z

pyogrio/_io.pyx

@@ -757,7 +859,8 @@ def ogr_read_info(str path, object layer=None, object encoding=None, **kwargs):
        'encoding': encoding,
        'fields': get_fields(ogr_layer, encoding)[:,2], # return only names
        'geometry_type': get_geometry_type(ogr_layer),
-        'features': OGR_L_GetFeatureCount(ogr_layer, 1)
+        'features': OGR_L_GetFeatureCount(ogr_layer, 1),
+        'random_read': OGR_L_TestCapability(ogr_layer, OLCRandomRead),


For testing performance of bbox filters by OGR driver, you might want to add OLCFastSpatialFilter (presumably this is true for GPKG and Shapefile, likely not true for GeoJSON). We may want to consider how we list data source capabilities for the driver in general (though outside this PR); perhaps a nested dictionary under meta:

meta = { ..., capabilities: { "random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead), "fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter), ... } }

Yeah, I mostly added this temporarily to check the value of this. It actually returns True for all three of Shapefile, GPKG and GeoJSON (while they show quite different performance impact), so not sure how useful this is to check (I would be curious which format does not support the random read)

But I agree that if we want to keep this, the nested "capabilities" field as you indicate seems a good idea.

brendan-ward · 2021-10-19T02:26:31Z

pyogrio/_io.pyx

@@ -35,7 +35,7 @@ log = logging.getLogger(__name__)

 ### Constants
 cdef const char * STRINGSASUTF8 = "StringsAsUTF8"
-
+cdef const char * OLCRandomRead = "RandomRead"


Instead of adding this here, add this to the bottom of the cdef extern from "ogr_api.h": block in _ogr.pxd

cdef extern from "ogr_api.h": ... const char* OLCRandomRead

This imports the constant defined in GDAL.

The StringsAsUTF8 on the line is also such a constant, so shall I move that one as well?

Yes please!

brendan-ward · 2021-10-19T16:07:50Z

I haven't had a chance to dig into the implementation in GDAL, but from the profiling results, it looks like the OGR_L_GetFeature function must step into a very different code path with GPKG for reading from the SQLite database than OGR_L_GetNextFeature. The latter doesn't seem to hit libsqlite3 much at all (can't even see it in the profile), whereas like you say it is the major time sink for the former.

Another way to approach this would be to use indexes as offsets from the beginning of a given file rather than FID. For serial FID formats (e.g., shapefile), this seems like it should function the same; for formats where FID is not necessarily serial (GPKG?), this may be a function of how well that driver supports fast seeking to the next feature (see OLCFastSetNextByIndex); I think GPK does not support fast seeking. No idea how the performance would compare between using the slower seeking method vs get by FID.

jorisvandenbossche · 2021-10-19T19:40:32Z

it looks like the OGR_L_GetFeature function must step into a very different code path with GPKG for reading from the SQLite database than OGR_L_GetNextFeature. The latter doesn't seem to hit libsqlite3 much at all (can't even see it in the profile), whereas like you say it is the major time sink for the former.

I see a bit of sqlite interaction in GetNextFeature, i.e. using sqlite_step (which I can also see in the source code), but this takes much less time (~5%).
On the other hand GetFeature seems to be using a different interface (OGRGeoPackageTableLayer::GetFeature), and which seems to do an actual SQL query with a WHERE clause for each call. I suppose the sqlite_step in GetNextFeature is also running some query, but much more optimized (maybe a single query that returns the features in order and in each call parses the next row? not fully sure based on a quick read of the source code)

Another way to approach this would be to use indexes as offsets from the beginning of a given file rather than FID.

I am not fully understanding this. Wouldn't this only allow reading a consecutive set of rows? (like give me rows 10 to 20)
(and this is already supported with the skip_features/max_features)

brendan-ward · 2021-10-19T19:59:50Z

Another way to approach this would be to use indexes as offsets from the beginning of a given file rather than FID.

I meant that we let the user specify indices (offsets) as an array just like FIDs, e.g., [1,10,46, 92, ...]

Then we use a combination of OGR_L_GetNextFeature and OGR_L_SetNextByIndex to increment to the next index before reading it. These would all be based on offsets from the beginning of the file, so a bit less general than using FIDs.

jorisvandenbossche · 2021-10-19T20:02:46Z

Ah, I see! Yes, that sounds interesting to try out.

jorisvandenbossche · 2021-10-19T20:28:04Z

I don't see a custom SetNextByIndex implementation in the geopackage driver, so if that means it falls back to the default implementation (https://github.com/OSGeo/gdal/blob/b2904b62620f32ecc5d39cc61d87623b5ccc0849/gdal/ogr/ogrsf_frmts/generic/ogrlayer.cpp#L499-L518, which basically calls NextFeature until the index is reached), then it might not necessarily give much advantage (at least if only reading a small subset). But to test out to be sure.

jorisvandenbossche · 2021-10-19T20:54:34Z

Tried out the indices approach, and the first observations:

For Shapefile, reading by FIDs or by indices gives the same performance (Shapefile also indicates that it has the OLCFastSetNextByIndex capability).
For GeoPackage, using the SetNextByIndex / GetNextFeature combo is very slow. Because GPKG does not have the OLCFastSetNextByIndex capability, it uses the base SetNextByIndex implementation which resets the reading and calls GetNextFeature until the requested index on every call (so which means we are rereading a part of the dataset for every of the indices we want to read ..)
For example, to only read a subset of 100 of the indices (randomly sampled throughout the dataset, so not the first 100), which is something that takes 7ms for Shapefile and 10ms for GeoPackage using FIDs, using indices this takes more than a second for GeoPackage.
For GeoJSON this is even slower, even though it indicates it has the FastSetNextByIndex capabilitity.

This is only a very quick check, and the trade-offs might depend on the actual file format. But at least considering Shapefile and GeoPackage, reading by index doesn't help (for Shapefile it's the same, but this was already fast using FIDs anyway, and for GeoPackage it's very slow).

Also, overall, I think the FID approach will probably fine even though it gives a slowdown for some formats like GPKG, assuming that 1) we will typically read a relatively small subset at a time (eg if you have more than 10 partitions, you already are below the 10% for one read, and also for GeoPackage, reading 10% of the data using FIDs is still a bit faster than reading all data sequentially) and 2) with real world data (and not the very simple point dataset), the actual parsing of the data is also more important than the query by ID (and so the potential slowdown from reading by FID is relatively smaller compared to the overall time it takes to read the data)

The combined bbox + filter on FIDs might still be worth experimenting with for GeoPackage.

I pushed the code I used to test the indices approach (will remove it again in a later commit to clean-up the code in this PR), and I also updated the notebook with the new timings: https://nbviewer.org/gist/jorisvandenbossche/c57fd2693a4442a539797ebe46df3a61

brendan-ward · 2021-10-20T15:15:34Z

Thanks for the additional testing and follow up!

It's a bummer that GPKG is going to be slow until you get below a certain threshold, but like you say, other factors would mitigate the impact from that (reading small percent and having more time parsing geometries). Adding a brief warning to the docstring may be a good idea.

I haven't had a chance to test, but I wonder how using the where parameter to trigger an SQL query would vary for GPKG. E.g., where="FID in (1,10,42, 96)" compared to getting individual features by FID, though I can imagine this gets unwieldy or may overflow limits for 100,000 FIDs.

jorisvandenbossche · 2021-10-29T07:46:04Z

I haven't had a chance to test, but I wonder how using the where parameter to trigger an SQL query would vary for GPKG. E.g., where="FID in (1,10,42, 96)" compared to getting individual features by FID, though I can imagine this gets unwieldy or may overflow limits for 100,000 FIDs.

I did a very quick test:

it's a lot slower for Shapefile, but a bit faster for GeoPackage (which confirms the guess that GPKG can handle this well)
however, the where clause seems to be limited in size. If I further increase the number of indices from 1000 to 5000 in the example below, you get a ValueError: Invalid SQL query for layer 'b'test_points'' (and there is also a log message about "SQL Expression Parsing Error: memory exhausted")

So given that it's only slightly faster for GPKG (to remember, the above speedup is "best case" as this is a very simple dataset of points, once you need to parse more data, the difference will get smaller), and it has its limitations in the size of the query, it doesn't seem worth it to have special case code depending on the driver to vary the method to query FIDs.

jorisvandenbossche · 2021-10-29T07:56:57Z

Hmm, actually, it seems this ValueError with larger where clause only happens for Shapefile or GeoJSON, but not for GPKG.

So for GeoPackage, I can further increase the size of the subset selection, and then using the where clause starts to get more significantly faster: for selecting 20,000 rows (20% of the size of the points dataset) takes around ~300ms with FIDs and ~100ms with a where clause.
(and for reference, for Shapefile with FIDs this is ~35ms)

So after all, it might be useful to consider special casing GPKG (although it will still depend on how many other columns are selected / how comptes the geometries are). But, we will need the FID selection for other formats anyway. So for this PR, I propose that I clean-up the implementation for FID selection and add tests and docs etc, and then we can still further consider this topic in a follow-up (the special casing for GPKG could also live in a level up, i.e. leave this to the user or downstream package (eg dask-geopandas could also do this))

jorisvandenbossche · 2021-10-29T08:46:29Z

OK, I cleaned up the PR and added some basic tests and docstring. Should be ready for a code review.

jorisvandenbossche · 2021-10-29T09:07:36Z

pyogrio/_io.pyx

+        if ogr_feature == NULL:
+            raise ValueError("Failed to read FID {}".format(fid))


GDAL actually puts an informative message to stderr in this case (saying feature id .. is out of bounds). Is there a way to capture this and use that for the error message? (in this case it doesn't change that much though)

You can use exc_wrap_pointer in a try / except block:

try: ogr_feature = exc_wrap_pointer(OGR_L_GetFeature(ogr_layer, fid)) except CPLE_BaseError as exc: raise DriverIOError(str(exc))

when then raises an exception:

pyogrio.errors.DriverIOError: Attempt to read shape with feature id (-1) out of available range.

If you want, you could also create a new error class for this specifically (in errors.py)

Did you want to wrap the GDAL error as suggested here?

brendan-ward

Thanks @jorisvandenbossche !

This is pretty close, just needs another test and likely a few minor docstring updates.

brendan-ward · 2021-10-30T18:04:41Z

pyogrio/_io.pyx

+        if ogr_feature == NULL:
+            raise ValueError("Failed to read FID {}".format(fid))


You can use exc_wrap_pointer in a try / except block:

try: ogr_feature = exc_wrap_pointer(OGR_L_GetFeature(ogr_layer, fid)) except CPLE_BaseError as exc: raise DriverIOError(str(exc))

when then raises an exception:

pyogrio.errors.DriverIOError: Attempt to read shape with feature id (-1) out of available range.

If you want, you could also create a new error class for this specifically (in errors.py)

brendan-ward · 2021-10-30T18:05:58Z

pyogrio/_io.pyx

+    if fids is not None:
+        if where is not None or bbox is not None or skip_features or max_features:
+            raise ValueError(
+                "cannot set both 'fids' and one of 'where', 'bbox', "


Suggested change

"cannot set both 'fids' and one of 'where', 'bbox', "

"cannot set both 'fids' and any of 'where', 'bbox', "

brendan-ward · 2021-10-30T18:09:05Z

pyogrio/_io.pyx

+                "'skip_features' or 'max_features'"
+            )
+        fids = np.asarray(fids, dtype=np.intc)
+    else:


Suggestion: move the substantive work here (e.g., applying where filter, etc) down to the else block on 755, so that this conditional test up top is just for validating parameters.

brendan-ward · 2021-10-30T18:10:43Z

pyogrio/geopandas.py

+    fids : array-like, optional (default: None)
+        Array of integer feature id (FID) values to select. Cannot be combined
+        with other keywords to select a subset (`skip_features`, `max_features`,
+        `where` or `bbox`).


Suggested change

`where` or `bbox`).

`where` or `bbox`). Note that the starting index is driver specific.

(just a nudge to users to remember that GPKG starts at 1, others at 0)

Not sure if it is also worth mentioning here that the performance of reading large numbers of features using FIDs is driver specific, and may be much worse than reading the entire dataframe and selecting out features of interest afterward.

I expanded the note a bit, please take a look (I kept the performance note short for now, because actually starting to explain this might get a bit long ..).

pyogrio/tests/test_raw_io.py

jorisvandenbossche · 2021-11-05T10:50:21Z

pyogrio/_io.pyx

+        "capabilities": {
+            "random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead),
+            "fast_set_next_by_index": OGR_L_TestCapability(ogr_layer, OLCFastSetNextByIndex),
+            "fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter),


Do we want to keep this?

This seems potentially useful, at least for benchmarking. I'd keep it for now.

jorisvandenbossche · 2021-11-05T10:51:13Z

Thanks for the review! Pushed some updates

jorisvandenbossche · 2021-11-05T11:02:55Z

Also, are we OK with the fids keyword naming? (it's not super clear, but I also don't have a better idea at the moment :))

brendan-ward

Thanks @jorisvandenbossche for the updates. This looks good. Up to you if you want to wrap the GDAL error for reading a FID out of bounds or leave as is (we can come back to that later).

I think the fids keyword is just fine, since FID seems to be a common term when referring to specific records within a GIS file.

Can you please add an entry to CHANGES.md for this PR?

brendan-ward · 2021-11-10T12:43:03Z

pyogrio/_io.pyx

+        if ogr_feature == NULL:
+            raise ValueError("Failed to read FID {}".format(fid))


Did you want to wrap the GDAL error as suggested here?

brendan-ward · 2021-11-10T12:43:56Z

pyogrio/_io.pyx

+        "capabilities": {
+            "random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead),
+            "fast_set_next_by_index": OGR_L_TestCapability(ogr_layer, OLCFastSetNextByIndex),
+            "fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter),


This seems potentially useful, at least for benchmarking. I'd keep it for now.

jorisvandenbossche · 2021-11-10T13:03:56Z

Up to you if you want to wrap the GDAL error for reading a FID out of bounds or leave as is (we can come back to that later).

I was indeed thinking to leave that for another PR, as there are some other occurences of that in the same file as well (and didn't want to start mixing things too much)

jorisvandenbossche added 2 commits October 16, 2021 14:23

split get_features

fa5688a

add get_features_by_fid

ba804ce

jorisvandenbossche force-pushed the read-fids branch from 9955949 to ba804ce Compare October 16, 2021 12:26

brendan-ward reviewed Oct 19, 2021

View reviewed changes

test with reading indices (SetNextByIndex + GetNextFeature)

3efb36e

jorisvandenbossche added 3 commits October 29, 2021 10:14

clean-up - remove indices approach

4a4d076

add tests

69f7395

add docstring

5507f64

jorisvandenbossche marked this pull request as ready for review October 29, 2021 08:38

small cleanup

b574b3a

jorisvandenbossche commented Oct 29, 2021

View reviewed changes

brendan-ward requested changes Oct 30, 2021

View reviewed changes

address review

395a9d4

jorisvandenbossche commented Nov 5, 2021

View reviewed changes

brendan-ward approved these changes Nov 10, 2021

View reviewed changes

add changelog

1043529

brendan-ward mentioned this pull request Nov 10, 2021

ENH: wrap GDAL errors #24

Closed

brendan-ward merged commit 804669d into geopandas:master Nov 12, 2021

brendan-ward mentioned this pull request Nov 12, 2021

Add ability to read features based on array of FIDs #13

Closed

jorisvandenbossche deleted the read-fids branch November 12, 2021 21:46

jorisvandenbossche mentioned this pull request Jan 17, 2022

ENH: add read_file function for reading GIS files (using pyogrio) geopandas/dask-geopandas#123

Merged

3 tasks

jorisvandenbossche mentioned this pull request Oct 13, 2023

Add support for fids filter with use_arrow=True #304

Merged

		if ogr_feature == NULL:
		raise ValueError("Failed to read FID {}".format(fid))

	"cannot set both 'fids' and one of 'where', 'bbox', "
	"cannot set both 'fids' and any of 'where', 'bbox', "

	`where` or `bbox`).
	`where` or `bbox`). Note that the starting index is driver specific.

ENH: read features based on array of FIDs #19

ENH: read features based on array of FIDs #19

Conversation

jorisvandenbossche commented Oct 16, 2021 • edited Loading

jorisvandenbossche commented Oct 16, 2021

brendan-ward left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brendan-ward commented Oct 19, 2021

jorisvandenbossche commented Oct 19, 2021

brendan-ward commented Oct 19, 2021

jorisvandenbossche commented Oct 19, 2021

jorisvandenbossche commented Oct 19, 2021

jorisvandenbossche commented Oct 19, 2021 • edited Loading

brendan-ward commented Oct 20, 2021

jorisvandenbossche commented Oct 29, 2021 • edited Loading

jorisvandenbossche commented Oct 29, 2021 • edited Loading

jorisvandenbossche commented Oct 29, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brendan-ward left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Nov 5, 2021

jorisvandenbossche commented Nov 5, 2021

brendan-ward left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Nov 10, 2021

jorisvandenbossche commented Oct 16, 2021 •

edited

Loading

jorisvandenbossche commented Oct 19, 2021 •

edited

Loading

jorisvandenbossche commented Oct 29, 2021 •

edited

Loading

jorisvandenbossche commented Oct 29, 2021 •

edited

Loading