Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: read features based on array of FIDs #19

Merged
merged 9 commits into from
Nov 12, 2021

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 16, 2021

This is experimenting with reading a subset based on an array of FIDs. Very much a draft (eg still needs tests, docs, etc), but good enough for running some benchmarks (will post them after a re-run).

The diff is not super clear: I splitted get_features to separate the processing of the geometry and fields (so this can be reused in get_features_by_fid). If you look at only the first commit, this is clearer. In the end, I am not sure if this was the best approach, as in theory it could also be combined in get_features and have a if/else switch at the moment to get the OGRFeature (switching between OGR_L_GetNextFeature and OGR_L_GetFeature), since the rest of get_features and get_features_by_fid have a lot of overlap.

@jorisvandenbossche
Copy link
Member Author

I did some performance checks: https://nbviewer.org/gist/jorisvandenbossche/c57fd2693a4442a539797ebe46df3a61

  • I expected that the performance of reading by FID would depend on the exact file format being used, and that is also the case (however I hadn't expect GeoPackage to perform so bad). Those first set of benchmarks are with a simple Point dataset (N=100,000) with a single attribute, so the time to actually parse the feature is minimal and the overhead of using FIDs or not is more pronounced:
    • For Shapefile, reading all sequentially or all by FID has almost the same speed (reading all with shuffled FIDs is a bit slower)
    • For GeoPackage, there is a quite large difference. Sequentially it's on par with Shapefile, but reading by FID is a lot slower (almost 10x). I comfirmed with py-spy that all the additional time is coming from OGR_L_GetFeature(ogr_layer, fid) (there seems to be quite a bit of sqlite3 overhead).
    • Reading GeoJSON is slower anyway, and reading by FID further slows it down a bit.
  • I compared reading with a bbox vs reading with an array of FIDs that result in the same subset of geometries. In this specific case, for Shapefile it's faster to read by FID, but for GeoPackage is actually faster to read by bbox.
    • The reason I also did this benchmark was to see if this would be beneficial for GeoPackage. Since that is slower with FIDs, it might be an option to explore if we could read GeoPackage first by bbox but then additional check if the FID is in the list of FIDs. In the context of reading a subset of a file to form a partition of a dask-geopandas GeoDataFrame, it should be possible to also have the bbox in addition to the array of FIDs (eg if the FIDs were calculated based on hilbert distance based on the bounds, it's easy to calculate the total bounds for each partition).
  • The above is with a very simple file. In practice when reading a file with more complex geometries / more attributes, the time spent in parsing those will become more important than the overhead of accessing the feature by FID. So as a test to read the first features of the tl_2019_us_zcta510 dataset in GeoPackage format, reading by FID is now only a bit slower (16% slower instead of almost 10x slower).

This is of course with only one specific dataset, I don't know how generalizable the observations are. For GeoPackage, it will in any case depend on the complexity of your dataset whether the overhead of getting by FID is important or not.

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @jorisvandenbossche ! Some of these results are a bit surprising, I wouldn't have expected reading by FID in sequential order to be that much slower.

The split of get_features_by_fid vs get_features seems OK - there isn't that much overlap, and they have different parameters, so it seems like cleaner logic the way you have it now. And it's internal API, so we can always change later. 😄

I think with shapefiles, FID corresponds directly to a feature's position in the file. That is, it isn't a separate field, so GDAL just needs to be able to read features based on offsets. In GPKG, fid is a separate field and also a primary key. But I would have thought the primary key would be pretty fast to work with.

pyogrio/_io.pyx Outdated
@@ -757,7 +859,8 @@ def ogr_read_info(str path, object layer=None, object encoding=None, **kwargs):
'encoding': encoding,
'fields': get_fields(ogr_layer, encoding)[:,2], # return only names
'geometry_type': get_geometry_type(ogr_layer),
'features': OGR_L_GetFeatureCount(ogr_layer, 1)
'features': OGR_L_GetFeatureCount(ogr_layer, 1),
'random_read': OGR_L_TestCapability(ogr_layer, OLCRandomRead),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing performance of bbox filters by OGR driver, you might want to add OLCFastSpatialFilter (presumably this is true for GPKG and Shapefile, likely not true for GeoJSON). We may want to consider how we list data source capabilities for the driver in general (though outside this PR); perhaps a nested dictionary under meta:

meta = {
    ...,
    capabilities: {
        "random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead),
        "fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter),
       ...
    }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I mostly added this temporarily to check the value of this. It actually returns True for all three of Shapefile, GPKG and GeoJSON (while they show quite different performance impact), so not sure how useful this is to check (I would be curious which format does not support the random read)

But I agree that if we want to keep this, the nested "capabilities" field as you indicate seems a good idea.

pyogrio/_io.pyx Outdated
@@ -35,7 +35,7 @@ log = logging.getLogger(__name__)

### Constants
cdef const char * STRINGSASUTF8 = "StringsAsUTF8"

cdef const char * OLCRandomRead = "RandomRead"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding this here, add this to the bottom of the cdef extern from "ogr_api.h": block in _ogr.pxd

cdef extern from "ogr_api.h":
    ...
    const char*     OLCRandomRead

This imports the constant defined in GDAL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StringsAsUTF8 on the line is also such a constant, so shall I move that one as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please!

@brendan-ward
Copy link
Member

I haven't had a chance to dig into the implementation in GDAL, but from the profiling results, it looks like the OGR_L_GetFeature function must step into a very different code path with GPKG for reading from the SQLite database than OGR_L_GetNextFeature. The latter doesn't seem to hit libsqlite3 much at all (can't even see it in the profile), whereas like you say it is the major time sink for the former.

Another way to approach this would be to use indexes as offsets from the beginning of a given file rather than FID. For serial FID formats (e.g., shapefile), this seems like it should function the same; for formats where FID is not necessarily serial (GPKG?), this may be a function of how well that driver supports fast seeking to the next feature (see OLCFastSetNextByIndex); I think GPK does not support fast seeking. No idea how the performance would compare between using the slower seeking method vs get by FID.

@jorisvandenbossche
Copy link
Member Author

it looks like the OGR_L_GetFeature function must step into a very different code path with GPKG for reading from the SQLite database than OGR_L_GetNextFeature. The latter doesn't seem to hit libsqlite3 much at all (can't even see it in the profile), whereas like you say it is the major time sink for the former.

I see a bit of sqlite interaction in GetNextFeature, i.e. using sqlite_step (which I can also see in the source code), but this takes much less time (~5%).
On the other hand GetFeature seems to be using a different interface (OGRGeoPackageTableLayer::GetFeature), and which seems to do an actual SQL query with a WHERE clause for each call. I suppose the sqlite_step in GetNextFeature is also running some query, but much more optimized (maybe a single query that returns the features in order and in each call parses the next row? not fully sure based on a quick read of the source code)

Another way to approach this would be to use indexes as offsets from the beginning of a given file rather than FID.

I am not fully understanding this. Wouldn't this only allow reading a consecutive set of rows? (like give me rows 10 to 20)
(and this is already supported with the skip_features/max_features)

@brendan-ward
Copy link
Member

Another way to approach this would be to use indexes as offsets from the beginning of a given file rather than FID.

I meant that we let the user specify indices (offsets) as an array just like FIDs, e.g., [1,10,46, 92, ...]

Then we use a combination of OGR_L_GetNextFeature and OGR_L_SetNextByIndex to increment to the next index before reading it. These would all be based on offsets from the beginning of the file, so a bit less general than using FIDs.

@jorisvandenbossche
Copy link
Member Author

Ah, I see! Yes, that sounds interesting to try out.

@jorisvandenbossche
Copy link
Member Author

I don't see a custom SetNextByIndex implementation in the geopackage driver, so if that means it falls back to the default implementation (https://github.com/OSGeo/gdal/blob/b2904b62620f32ecc5d39cc61d87623b5ccc0849/gdal/ogr/ogrsf_frmts/generic/ogrlayer.cpp#L499-L518, which basically calls NextFeature until the index is reached), then it might not necessarily give much advantage (at least if only reading a small subset). But to test out to be sure.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Oct 19, 2021

Tried out the indices approach, and the first observations:

  • For Shapefile, reading by FIDs or by indices gives the same performance (Shapefile also indicates that it has the OLCFastSetNextByIndex capability).
  • For GeoPackage, using the SetNextByIndex / GetNextFeature combo is very slow. Because GPKG does not have the OLCFastSetNextByIndex capability, it uses the base SetNextByIndex implementation which resets the reading and calls GetNextFeature until the requested index on every call (so which means we are rereading a part of the dataset for every of the indices we want to read ..)
    For example, to only read a subset of 100 of the indices (randomly sampled throughout the dataset, so not the first 100), which is something that takes 7ms for Shapefile and 10ms for GeoPackage using FIDs, using indices this takes more than a second for GeoPackage.
  • For GeoJSON this is even slower, even though it indicates it has the FastSetNextByIndex capabilitity.

This is only a very quick check, and the trade-offs might depend on the actual file format. But at least considering Shapefile and GeoPackage, reading by index doesn't help (for Shapefile it's the same, but this was already fast using FIDs anyway, and for GeoPackage it's very slow).

Also, overall, I think the FID approach will probably fine even though it gives a slowdown for some formats like GPKG, assuming that 1) we will typically read a relatively small subset at a time (eg if you have more than 10 partitions, you already are below the 10% for one read, and also for GeoPackage, reading 10% of the data using FIDs is still a bit faster than reading all data sequentially) and 2) with real world data (and not the very simple point dataset), the actual parsing of the data is also more important than the query by ID (and so the potential slowdown from reading by FID is relatively smaller compared to the overall time it takes to read the data)

The combined bbox + filter on FIDs might still be worth experimenting with for GeoPackage.

I pushed the code I used to test the indices approach (will remove it again in a later commit to clean-up the code in this PR), and I also updated the notebook with the new timings: https://nbviewer.org/gist/jorisvandenbossche/c57fd2693a4442a539797ebe46df3a61

@brendan-ward
Copy link
Member

Thanks for the additional testing and follow up!

It's a bummer that GPKG is going to be slow until you get below a certain threshold, but like you say, other factors would mitigate the impact from that (reading small percent and having more time parsing geometries). Adding a brief warning to the docstring may be a good idea.

I haven't had a chance to test, but I wonder how using the where parameter to trigger an SQL query would vary for GPKG. E.g., where="FID in (1,10,42, 96)" compared to getting individual features by FID, though I can imagine this gets unwieldy or may overflow limits for 100,000 FIDs.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Oct 29, 2021

I haven't had a chance to test, but I wonder how using the where parameter to trigger an SQL query would vary for GPKG. E.g., where="FID in (1,10,42, 96)" compared to getting individual features by FID, though I can imagine this gets unwieldy or may overflow limits for 100,000 FIDs.

I did a very quick test:

  1. it's a lot slower for Shapefile, but a bit faster for GeoPackage (which confirms the guess that GPKG can handle this well)
  2. however, the where clause seems to be limited in size. If I further increase the number of indices from 1000 to 5000 in the example below, you get a ValueError: Invalid SQL query for layer 'b'test_points'' (and there is also a log message about "SQL Expression Parsing Error: memory exhausted")

image

So given that it's only slightly faster for GPKG (to remember, the above speedup is "best case" as this is a very simple dataset of points, once you need to parse more data, the difference will get smaller), and it has its limitations in the size of the query, it doesn't seem worth it to have special case code depending on the driver to vary the method to query FIDs.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Oct 29, 2021

Hmm, actually, it seems this ValueError with larger where clause only happens for Shapefile or GeoJSON, but not for GPKG.

So for GeoPackage, I can further increase the size of the subset selection, and then using the where clause starts to get more significantly faster: for selecting 20,000 rows (20% of the size of the points dataset) takes around ~300ms with FIDs and ~100ms with a where clause.
(and for reference, for Shapefile with FIDs this is ~35ms)

So after all, it might be useful to consider special casing GPKG (although it will still depend on how many other columns are selected / how comptes the geometries are). But, we will need the FID selection for other formats anyway. So for this PR, I propose that I clean-up the implementation for FID selection and add tests and docs etc, and then we can still further consider this topic in a follow-up (the special casing for GPKG could also live in a level up, i.e. leave this to the user or downstream package (eg dask-geopandas could also do this))

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review October 29, 2021 08:38
@jorisvandenbossche
Copy link
Member Author

OK, I cleaned up the PR and added some basic tests and docstring. Should be ready for a code review.

Comment on lines +611 to +612
if ogr_feature == NULL:
raise ValueError("Failed to read FID {}".format(fid))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GDAL actually puts an informative message to stderr in this case (saying feature id .. is out of bounds). Is there a way to capture this and use that for the error message? (in this case it doesn't change that much though)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use exc_wrap_pointer in a try / except block:

        try:
            ogr_feature = exc_wrap_pointer(OGR_L_GetFeature(ogr_layer, fid))

        except CPLE_BaseError as exc:
            raise DriverIOError(str(exc))

when then raises an exception:

pyogrio.errors.DriverIOError: Attempt to read shape with feature id (-1) out of available range.

If you want, you could also create a new error class for this specifically (in errors.py)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to wrap the GDAL error as suggested here?

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorisvandenbossche !

This is pretty close, just needs another test and likely a few minor docstring updates.

Comment on lines +611 to +612
if ogr_feature == NULL:
raise ValueError("Failed to read FID {}".format(fid))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use exc_wrap_pointer in a try / except block:

        try:
            ogr_feature = exc_wrap_pointer(OGR_L_GetFeature(ogr_layer, fid))

        except CPLE_BaseError as exc:
            raise DriverIOError(str(exc))

when then raises an exception:

pyogrio.errors.DriverIOError: Attempt to read shape with feature id (-1) out of available range.

If you want, you could also create a new error class for this specifically (in errors.py)

pyogrio/_io.pyx Outdated
if fids is not None:
if where is not None or bbox is not None or skip_features or max_features:
raise ValueError(
"cannot set both 'fids' and one of 'where', 'bbox', "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"cannot set both 'fids' and one of 'where', 'bbox', "
"cannot set both 'fids' and any of 'where', 'bbox', "

pyogrio/_io.pyx Outdated
"'skip_features' or 'max_features'"
)
fids = np.asarray(fids, dtype=np.intc)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: move the substantive work here (e.g., applying where filter, etc) down to the else block on 755, so that this conditional test up top is just for validating parameters.

fids : array-like, optional (default: None)
Array of integer feature id (FID) values to select. Cannot be combined
with other keywords to select a subset (`skip_features`, `max_features`,
`where` or `bbox`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`where` or `bbox`).
`where` or `bbox`). Note that the starting index is driver specific.

(just a nudge to users to remember that GPKG starts at 1, others at 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it is also worth mentioning here that the performance of reading large numbers of features using FIDs is driver specific, and may be much worse than reading the entire dataframe and selecting out features of interest afterward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expanded the note a bit, please take a look (I kept the performance note short for now, because actually starting to explain this might get a bit long ..).

pyogrio/tests/test_raw_io.py Show resolved Hide resolved
"capabilities": {
"random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead),
"fast_set_next_by_index": OGR_L_TestCapability(ogr_layer, OLCFastSetNextByIndex),
"fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems potentially useful, at least for benchmarking. I'd keep it for now.

@jorisvandenbossche
Copy link
Member Author

Thanks for the review! Pushed some updates

@jorisvandenbossche
Copy link
Member Author

Also, are we OK with the fids keyword naming? (it's not super clear, but I also don't have a better idea at the moment :))

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorisvandenbossche for the updates. This looks good. Up to you if you want to wrap the GDAL error for reading a FID out of bounds or leave as is (we can come back to that later).

I think the fids keyword is just fine, since FID seems to be a common term when referring to specific records within a GIS file.

Can you please add an entry to CHANGES.md for this PR?

Comment on lines +611 to +612
if ogr_feature == NULL:
raise ValueError("Failed to read FID {}".format(fid))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to wrap the GDAL error as suggested here?

"capabilities": {
"random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead),
"fast_set_next_by_index": OGR_L_TestCapability(ogr_layer, OLCFastSetNextByIndex),
"fast_spatial_filter": OGR_L_TestCapability(ogr_layer, OLCFastSpatialFilter),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems potentially useful, at least for benchmarking. I'd keep it for now.

@jorisvandenbossche
Copy link
Member Author

Up to you if you want to wrap the GDAL error for reading a FID out of bounds or leave as is (we can come back to that later).

I was indeed thinking to leave that for another PR, as there are some other occurences of that in the same file as well (and didn't want to start mixing things too much)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants