skip_features increases in time dramatically #255

Comments
Thanks for reporting @aw-west-defra; nearly 2 hours is crazy slow. Is this zipped GPKG local on your machine or is it on the network? (If the latter, that brings in more moving parts.)

It looks like GPKG supports random reads but does not support fast setting of the next index. We're using OGR_L_SetNextByIndex internally when you pass skip_features. The capabilities listed via read_info:

from pyogrio import read_info
read_info('/tmp/test.gpkg')['capabilities']
# {'random_read': 1, 'fast_set_next_by_index': 0, 'fast_spatial_filter': 1}

I am able to reproduce (on MacOS 12.6.5, M1, Python 3.10, GDAL 3.6.4) some of the slowdown with a test dataset of LineStrings that has 2.1M features, but not to the extreme extent as you. Reading a zipped GPKG is much slower than unzipped:

from timeit import timeit
timeit("tmp = read_dataframe('/tmp/test.gpkg', skip_features=n-100)", number=1, globals=globals())
# 3.27s
timeit("tmp = read_dataframe('/tmp/test.gpkg.zip', skip_features=n-100)", number=1, globals=globals())
# 101.58s

I also found that using a where filter on fid is much faster:

timeit(f"tmp = read_dataframe('/tmp/test.gpkg', where='fid BETWEEN {n-100} AND {n}')", number=1, globals=globals())
# 0.029s
timeit(f"tmp = read_dataframe('/tmp/test.gpkg.zip', where='fid BETWEEN {n-100} AND {n}')", number=1, globals=globals())
# 4.47s

If you know that your FIDs are in incremental order, you can also pass an iterable of FIDs to read instead of using a where filter:

timeit(f"tmp = read_dataframe('/tmp/test.gpkg', fids=range({n-100}, n))", number=1, globals=globals())
# 0.029s
timeit(f"tmp = read_dataframe('/tmp/test.gpkg.zip', fids=range({n-100}, n))", number=1, globals=globals())
# 4.65s

In this case, it seems expected that this will be slow to read via skip_features.

If your FIDs are not sequential, you can read them first and then slice into this, but I'm finding this to be very slow too because of zip overhead (it seems at least as bad as unzipping the full file first and then reading it):

from pyogrio.raw import read
fids = read('/tmp/test.gpkg.zip', read_geometry=False, columns=[], return_fids=True)[1]

It is pretty quick if the GPKG is unzipped first.

I see that we don't have strong warnings in the docs about the performance implications of skip_features.
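(To spell out the "read FIDs first, then slice" idea above, here is a minimal sketch; it is not from the original comment. The path and the offsets 1_000_000 / 100 are illustrative, and it assumes read_dataframe accepts an array-like of FIDs, as it does for the range() example above.)

from pyogrio import read_dataframe
from pyogrio.raw import read

# Get all FIDs without reading geometry or attribute columns.
fids = read('/tmp/test.gpkg', read_geometry=False, columns=[], return_fids=True)[1]

# Positional slice: take the 100 features after the first 1,000,000,
# regardless of whether the FID values themselves are sequential.
wanted = fids[1_000_000:1_000_100]

# Read only those features by FID.
df = read_dataframe('/tmp/test.gpkg', fids=wanted)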
My data is on a Databricks filesystem (DBFS); this hasn't been a slowdown for other tasks. Docs seem like a great solution, and this can be closed whenever you want.
I think with GPKG, you can set your own FIDs, so it is possible to write them out of order or at least as a non-incremental series (i.e., gaps, not starting at a consistent value, etc.). It depends on how those files were originally written. Normally they are incremental and consistent, but it would be worth a check (see the sketch below). Note that some other drivers don't let you write the FID, so FIDs there are always incremental and predictable.

I'm surprised at your finding that zipped is faster for some tasks; are those tasks that involve reading the whole file? I'd expect the tradeoff to be between the overhead of unzipping the data (and the potential penalty of having to unzip more of the data in order to do a random read) and file transfer speed: i.e., if transfer speed is slow, you wouldn't notice the zip overhead, but if transfer speed is fast (e.g., local files) I'd expect the zip overhead to be noticeable. I've never worked with Databricks so I can't speak to that specifically. But either way, go with what works best in your case, having done the tests to separate the ideal solution from the clearly non-optimal one (1 hour 40 minutes, yikes!)
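(A minimal sketch of such a check, not from the original thread; it reuses the read() pattern shown earlier and assumes numpy is available and 'data.gpkg' is a placeholder path.)

import numpy as np
from pyogrio.raw import read

# Read only the FIDs (no geometry, no attribute columns).
fids = read('data.gpkg', read_geometry=False, columns=[], return_fids=True)[1]

# FIDs are "incremental and consistent" if they form one consecutive run.
sequential = np.array_equal(fids, np.arange(fids[0], fids[0] + len(fids)))
print(sequential)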
For completeness' sake: all of the above is still correct, but skip_features will be a bit faster in GDAL >= 3.8 because of an optimisation implemented there. Nonetheless, filtering on fid will still be much faster and remains the recommended approach, where possible, for database-oriented drivers: OSGeo/gdal#8306
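(If it helps, a small sketch, my addition rather than part of the thread, for checking which GDAL version pyogrio was built against so you know whether the >= 3.8 optimisation applies; treat the attribute name as an assumption about pyogrio's public API.)

import pyogrio

# Tuple such as (3, 8, 0); the skip_features optimisation applies for GDAL >= 3.8.
print(pyogrio.__gdal_version__)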
I am attempting to read a few rows of a dataset; however, skip_features takes a very long time when trying to read later features. I have no recommendation, but would like to bring this to your attention.
Example
It took 1h to read:
pyogrio.read_dataframe(filepath, skip_features=1_000_000, max_features=100)
But an equivalent where filter took only 20s:
pyogrio.read_dataframe(filepath, where='fid BETWEEN {} AND {}'.format(1_000_000, 1_000_100))
Table of timeit results
[Table comparing skip_features / max_features timings against where='fid BETWEEN {} AND {}' timings]
Notebook Example
(Apologies, I should've used an example dataset. My example is a zipped GPKG with mixed geometry types and 1,539,825 features.)