
ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data #271

Merged (9 commits, Aug 22, 2023)
7 changes: 7 additions & 0 deletions CHANGES.md
@@ -6,11 +6,18 @@

- Support writing dataframes without geometry column (#267)

- Calculate feature count by iterating over features if GDAL returns an
unknown count for a data layer (e.g., OSM driver); this may have significant
performance impacts for some data sources that would otherwise return an
unknown count (count is used in `read_info`, `read`, `read_dataframe`) (#271).

### Bug fixes

- Fix int32 overflow when reading int64 columns (#260)
- Fix `fid_as_index=True` doesn't set fid as index using `read_dataframe` with
`use_arrow=True` (#265)
- Fix errors reading OSM data due to invalid feature count and incorrect
reading of OSM layers beyond the first layer (#271)

## 0.6.0 (2023-04-27)

42 changes: 42 additions & 0 deletions docs/source/known_issues.md
@@ -65,3 +65,45 @@ convert the datetimes to string.

Timezone information is ignored at the moment, both when reading and when writing
datetime columns.

## Support for OpenStreetMap (OSM) data

OpenStreetMap data do not natively support calculating the feature count by data
layer due to the internal data structures. To work around this, Pyogrio iterates
over all features once to calculate the feature count, which is used to allocate
the arrays that hold the geometries and attributes read from the data layer, and
then iterates over all features again to populate those arrays. Further, data
within the file are not structured at the top level to support fast reading by
layer, which means that reading one layer may require reading all records
within the data source, not just those belonging to that layer. This is
inefficient and slow, and is exacerbated when attempting to read from
remotely-hosted data sources rather than local files.
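The two-pass strategy described above can be sketched in plain Python. This is a simplified illustration only, not pyogrio's actual Cython implementation; `FakeLayer`, `read_two_pass`, and their method names are invented for the example to mimic forward-only OGR layer iteration:

```python
# Simplified sketch of the two-pass read used when the driver cannot report
# a feature count up front (e.g., OSM): iterate once to count, allocate
# fixed-size output storage, then iterate again to fill it.
class FakeLayer:
    """Stand-in for an OGR layer that only supports forward iteration."""

    def __init__(self, records):
        self._records = records

    def reset_reading(self):
        # mirrors OGR_L_ResetReading: restart from the first feature
        self._iter = iter(self._records)

    def next_feature(self):
        # mirrors OGR_L_GetNextFeature: returns None when exhausted
        return next(self._iter, None)


def read_two_pass(layer):
    # pass 1: count the features
    layer.reset_reading()
    count = 0
    while layer.next_feature() is not None:
        count += 1

    # allocate output storage now that the size is known
    geometries = [None] * count

    # pass 2: populate the preallocated storage
    layer.reset_reading()
    for i in range(count):
        geometries[i] = layer.next_feature()["geometry"]
    return geometries


geoms = read_two_pass(
    FakeLayer([{"geometry": "POINT (0 0)"}, {"geometry": "POINT (1 1)"}])
)
print(geoms)  # ['POINT (0 0)', 'POINT (1 1)']
```

The cost is visible here: the data source is scanned twice, which is why an unknown count is expensive for large or remote OSM files.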

When reading a large file, GDAL may emit an error instructing you to enable
interleaved reading mode, which you can do in one of two ways:

1. Set a config option used for all operations:

   ```python
   from pyogrio import set_gdal_config_options

   set_gdal_config_options({"OGR_INTERLEAVED_READING": True})
   ```

2. Set a dataset open option:

   ```python
   from pyogrio import read_dataframe

   df = read_dataframe(path, INTERLEAVED_READING=True)
   ```

We recommend the following to sidestep performance issues:

**Member** commented on lines +103 to +104:

You could also mention `use_arrow` as an option. Based on my one test this is still not fast, but half as slow as without `use_arrow=True`. Or is it still too experimental?

**Member Author** replied:

I'm seeing different results for Alabama (https://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf) reading the lines layer:

```
local, with arrow: 4.49s
local, without arrow: 3.46s
remote, with arrow: 23.55s
remote, without arrow: 20.26s
```

Times are greater for the multipolygons layer and follow a similar trend locally, but not for remote (likely because of much more network I/O; the server is also remote enough from me that there could be huge variability in network I/O):

```
local, with arrow: 6.82s
local, without arrow: 5.50s
remote, with arrow: 33.85s
remote, without arrow: 39.06s
```

whereas for points, the first and fastest layer:

```
local, with arrow: 1.12s
local, without arrow: 0.68s
remote, with arrow: 40.36s
remote, without arrow: 20.09s
```

I had expected the arrow option to be somewhat faster, but that doesn't seem to be the case for me.

**@theroggy** (Member) commented on Aug 18, 2023:

Hmm... my timings are indeed different. They are consistently faster for use_arrow, and most of the time twice as fast.

Then again, the timings are also spectacularly longer for me for local files, most likely the difference between Windows and Linux... because CI tests also run a lot slower on Windows vs Linux :-(... To conclude, arrow seems less dramatically slow on Windows ;-).

```python
from datetime import datetime
import itertools
import pyogrio

url_de_bw = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
path_de_bw = "C:/temp/baden-wuerttemberg-latest.osm.pbf"

url_us_al = "https://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf"
path_us_al = "C:/temp/alabama-latest.osm.pbf"

urls = [path_de_bw, path_us_al]
layers = ["multipolygons", "points"]

for url, layer, use_arrow in itertools.product(urls, layers, [True, False]):
    start = datetime.now()
    pgdf = pyogrio.read_dataframe(
        url, use_arrow=use_arrow, layer=layer, encoding="utf8"
    )
    print(f"url={url}, layer={layer}, use_arrow={use_arrow}: {datetime.now()-start}")
```

Results:

```
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:01:46.250570
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:03:40.812138
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:18.412509
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:24.491693
url=C:/temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:16.943871
url=C:/temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:30.859146
url=C:/temp/alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:07.560858
url=C:/temp/alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:08.725899
```

**@theroggy** (Member) commented on Aug 18, 2023:

PS: I had to specify `encoding="utf8"` to be able to run it with `use_arrow=False`, probably because it is German data with quite a few accented characters. Based on the following post and others, it seems OSM data is always in "UTF-8":
https://help.openstreetmap.org/questions/5308/character-encoding

So I think it might be better to default the encoding to "UTF-8" for .osm and .pbf files?

**Member Author** replied:

Added a fix to always set UTF-8.

It looks like using the interleaved reading option has an impact on performance:

With default (no interleaved reading):

```
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:01:12.460372
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:03:11.110123
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:08.757978
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:15.286818
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:06.517222
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:15.456438
url=alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:01.148906
url=alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:02.335114
```

With `pyogrio.set_gdal_config_options({"OGR_INTERLEAVED_READING": True})`:

```
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:52.878035
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:01:05.556191
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:06.217804
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:04.641645
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:06.525850
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:05.703674
url=alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:00.852143
url=alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:00.724238
```

**Member** replied:

Hmm... interleaved reading is consistently quite a bit faster... and often `use_arrow=False` becomes faster than `use_arrow=True` because of it. Kind of weird.

**Member** replied:

On Windows, a similar pattern:

```
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=True: 0:01:49.743053
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=False: 0:01:51.280849
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=True: 0:01:28.155260
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=False: 0:03:56.258161
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True, interleaved=True: 0:00:20.615849
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True, interleaved=False: 0:00:15.530900
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False, interleaved=True: 0:00:10.094106
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False, interleaved=False: 0:00:28.546472
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=True: 0:00:19.117897
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=False: 0:00:19.194775
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=True: 0:00:17.296830
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=False: 0:00:35.047402
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=True, interleaved=True: 0:00:07.781301
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=True, interleaved=False: 0:00:07.374248
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=False, interleaved=True: 0:00:06.801308
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=False, interleaved=False: 0:00:09.139016
```

**@theroggy** (Member) commented on Aug 19, 2023:

I wonder if there is a reason why interleaved reading isn't the default... maybe it is also useful to mention that it can improve performance?

**Member** replied:

I am seeing the same behaviour. It's indeed a bit strange that with interleaved reading turned on, reading with arrow is slower than without. Might be worth reporting to GDAL.

- download remote OSM data sources to local files before attempting
to read
- the `use_arrow=True` option may speed up reading from OSM files
- if possible, use a different tool such as `ogr2ogr` to translate the OSM
data source into a more performant format for reading by layer, such as GPKG
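As a concrete sketch of the last recommendation, the translation step might look like the following. The file and layer names are examples only, and the helper merely assembles the `ogr2ogr` argument list (which you could pass to `subprocess.run`) rather than executing anything:

```python
# Build the ogr2ogr invocation that converts one OSM layer to GPKG: the
# "translate once, read many times" workflow recommended above.
def osm_to_gpkg_cmd(osm_path, gpkg_path, layer):
    """Return the ogr2ogr argument list for translating one OSM layer."""
    return [
        "ogr2ogr",
        "-f", "GPKG",   # output driver: GeoPackage
        gpkg_path,      # destination file
        osm_path,       # source OSM PBF
        layer,          # e.g. "lines" or "multipolygons"
    ]


cmd = osm_to_gpkg_cmd("alabama-latest.osm.pbf", "alabama.gpkg", "lines")
print(" ".join(cmd))  # ogr2ogr -f GPKG alabama.gpkg alabama-latest.osm.pbf lines
```

After translating, subsequent reads of the GPKG copy avoid the repeated full scans of the OSM source described above.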
91 changes: 78 additions & 13 deletions pyogrio/_io.pyx
@@ -195,6 +195,17 @@ cdef OGRLayerH get_ogr_layer(GDALDatasetH ogr_dataset, layer) except NULL:
except CPLE_BaseError as exc:
raise DataLayerError(str(exc))

    # if the driver is OSM, we need to execute SQL to set the layer to read in
    # order to read it properly
**Member** commented on lines +198 to +199:

Is this something to report to GDAL? Or is it expected we need to set this differently?

**Member Author** replied:

ogr2ogr does this (which is how I figured out how to do this), so I think it is fair to say that GDAL knows about this and that it is intentional it is left for the user to call instead of being embedded in the driver. It is something we could suggest be added to the documentation for the driver, however.

    if get_driver(ogr_dataset) == "OSM":
        # Note: this returns NULL and does not need to be freed via
        # GDALDatasetReleaseResultSet()
        layer_name = get_string(OGR_L_GetName(ogr_layer))
        sql_b = f"SET interest_layers = {layer_name}".encode('utf-8')
        sql_c = sql_b

        GDALDatasetExecuteSQL(ogr_dataset, sql_c, NULL, NULL)

return ogr_layer


@@ -309,6 +320,62 @@ cdef get_driver(OGRDataSourceH ogr_dataset):
return driver


cdef get_feature_count(OGRLayerH ogr_layer):
    """Get the feature count of a layer.

    If GDAL returns an unknown count (-1), this iterates over every feature
    to calculate the count.

    Parameters
    ----------
    ogr_layer : pointer to open OGR layer

    Returns
    -------
    int
        count of features
    """

    cdef OGRFeatureH ogr_feature = NULL
    cdef int feature_count = OGR_L_GetFeatureCount(ogr_layer, 1)

    # if GDAL refuses to give us the feature count, we have to loop over all
    # features ourselves and get the count. This can happen for some drivers
    # (e.g., OSM) or if a where clause is invalid but not rejected as error
    if feature_count == -1:
        # make sure layer is read from beginning
        OGR_L_ResetReading(ogr_layer)

        feature_count = 0
        while True:
            try:
                ogr_feature = exc_wrap_pointer(OGR_L_GetNextFeature(ogr_layer))
                feature_count += 1

            except NullPointerError:
                # No more rows available, so stop reading
                break

            # driver may raise other errors, e.g., for OSM if node ids are not
            # increasing, the default config option OSM_USE_CUSTOM_INDEXING=YES
            # causes errors iterating over features
            except CPLE_BaseError as exc:
                # if an invalid where clause is used for a GPKG file, it is not
                # caught as an error until attempting to iterate over features;
                # catch it here
                if "failed to prepare SQL" in str(exc):
                    raise ValueError(f"Invalid SQL query: {str(exc)}") from None

                raise DataLayerError(
                    f"Could not iterate over features: {str(exc)}"
                ) from None

            finally:
                if ogr_feature != NULL:
                    OGR_F_Destroy(ogr_feature)
                    ogr_feature = NULL

    return feature_count


cdef set_metadata(GDALMajorObjectH obj, object metadata):
"""Set metadata on a dataset or layer

@@ -371,13 +438,19 @@ cdef detect_encoding(OGRDataSourceH ogr_dataset, OGRLayerH ogr_layer):
-------
str or None
"""

    if OGR_L_TestCapability(ogr_layer, OLCStringsAsUTF8):
        return 'UTF-8'

    driver = get_driver(ogr_dataset)
    if driver == 'ESRI Shapefile':
        return 'ISO-8859-1'

    if driver == "OSM":
        # always set OSM data to UTF-8
        # per https://help.openstreetmap.org/questions/2172/what-encoding-does-openstreetmap-use
        return "UTF-8"

    return None


@@ -486,11 +559,10 @@ cdef apply_where_filter(OGRLayerH ogr_layer, str where):
if err != OGRERR_NONE:
try:
exc_check()
name = OGR_L_GetName(ogr_layer)
except CPLE_BaseError as exc:
raise ValueError(str(exc))

raise ValueError(f"Invalid SQL query for layer '{name}': '{where}'")
raise ValueError(f"Invalid SQL query for layer '{OGR_L_GetName(ogr_layer)}': '{where}'")


cdef apply_spatial_filter(OGRLayerH ogr_layer, bbox):
@@ -526,11 +598,10 @@ cdef validate_feature_range(OGRLayerH ogr_layer, int skip_features=0, int max_fe
skip_features : number of features to skip from beginning of available range
max_features : maximum number of features to read from available range
"""
feature_count = OGR_L_GetFeatureCount(ogr_layer, 1)
feature_count = get_feature_count(ogr_layer)
num_features = max_features

if feature_count <= 0:
# the count comes back as -1 if the where clause above is invalid but not rejected as error
if feature_count == 0:
name = OGR_L_GetName(ogr_layer)
warnings.warn(f"Layer '{name}' does not have any features to read")
return 0, 0
@@ -736,9 +807,6 @@ cdef get_features(
break

except CPLE_BaseError as exc:
if "failed to prepare SQL" in str(exc):
raise ValueError(f"Invalid SQL query") from exc

raise FeatureError(str(exc))

if i >= num_features:
@@ -886,10 +954,7 @@ cdef get_bounds(
break

except CPLE_BaseError as exc:
if "failed to prepare SQL" in str(exc):
raise ValueError(f"Invalid SQL query") from exc
else:
raise FeatureError(str(exc))
raise FeatureError(str(exc))

if i >= num_features:
raise FeatureError(
@@ -1327,7 +1392,7 @@ def ogr_read_info(
'fields': fields[:,2], # return only names
'dtypes': fields[:,3],
'geometry_type': get_geometry_type(ogr_layer),
'features': OGR_L_GetFeatureCount(ogr_layer, 1),
'features': get_feature_count(ogr_layer),
'driver': get_driver(ogr_dataset),
"capabilities": {
"random_read": OGR_L_TestCapability(ogr_layer, OLCRandomRead),
4 changes: 4 additions & 0 deletions pyogrio/tests/fixtures/README.md
@@ -83,3 +83,7 @@ This was extracted from https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrogra
```bash
ogr2ogr test_mixed_surface.gpkg NHDPLUS_H_0308_HU4_GDB.gdb NHDWaterbody -where '"NHDPlusID" = 15000300070477' -select "NHDPlusID"
```

### OSM PBF test

This was downloaded from https://github.com/openstreetmap/OSM-binary/blob/master/resources/sample.pbf
Binary file added pyogrio/tests/fixtures/sample.osm.pbf
19 changes: 18 additions & 1 deletion pyogrio/tests/test_core.py
@@ -12,7 +12,7 @@
get_gdal_config_option,
get_gdal_data_path,
)
from pyogrio.errors import DataSourceError
from pyogrio.errors import DataSourceError, DataLayerError

from pyogrio._env import GDALEnv

@@ -266,6 +266,23 @@ def test_read_info_invalid_dataset_kwargs(naturalearth_lowres):
read_info(naturalearth_lowres, INVALID="YES")


def test_read_info_force_feature_count_exception(data_dir):
    with pytest.raises(DataLayerError, match="Could not iterate over features"):
        read_info(data_dir / "sample.osm.pbf", layer="lines")


def test_read_info_force_feature_count(data_dir):
    # the sample OSM file has non-increasing node IDs which causes the default
    # custom indexing to raise an exception iterating over features
    meta = read_info(data_dir / "sample.osm.pbf", USE_CUSTOM_INDEXING=False)
    assert meta["features"] == 8

    meta = read_info(
        data_dir / "sample.osm.pbf", layer="lines", USE_CUSTOM_INDEXING=False
    )
    assert meta["features"] == 36


@pytest.mark.parametrize(
"name,value,expected",
[