FIX: correctly use GDAL auto-decoding of shapefiles when encoding is set (#384)
brendan-ward authored Apr 26, 2024
1 parent 154466a commit db58724
Showing 13 changed files with 585 additions and 41 deletions.
4 changes: 4 additions & 0 deletions CHANGES.md
@@ -34,6 +34,10 @@
FlatGeobuf / GPKG drivers (#335, #387); this has been fixed in GDAL >= 3.8.3.
- Properly ignore fields not listed in `columns` parameter when reading from
the data source not using the Arrow API (#391).
- Properly handle decoding of ESRI Shapefiles with user-provided `encoding`
option for `read`, `read_dataframe`, and `open_arrow`, and correctly encode
Shapefile field names and text values to the user-provided `encoding` for
`write` and `write_dataframe` (#384).

### Packaging

18 changes: 11 additions & 7 deletions docs/source/introduction.md
@@ -89,6 +89,10 @@ first layer.
}
```

NOTE: pyogrio will report `UTF-8` if either the native encoding is likely to be
`UTF-8` or GDAL can automatically convert from the detected native encoding to
`UTF-8`.
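
A minimal sketch of checking the reported encoding (assuming a hypothetical local `data.shp`; the value reported depends on the data source):

```python
from pyogrio import read_info

# shapefiles with a valid .cpg file (or a detectable code page) can be
# auto-decoded by GDAL and are therefore reported as UTF-8; otherwise the
# assumed native encoding is reported instead
info = read_info("data.shp")
print(info["encoding"])  # e.g. "UTF-8"
```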

To read from a layer using name or index (the following are equivalent):

```python
@@ -468,21 +472,21 @@ You can also read from a URL with this syntax:

GDAL only supports datetimes at a millisecond resolution. Reading data will thus
give at most millisecond resolution (`datetime64[ms]` data type). With pandas 2.0
`pyogrio.read_dataframe()` will return datetime data as `datetime64[ms]`
correspondingly. For previous versions of pandas, `datetime64[ns]` is used as
ms precision was not supported. When writing, only precision up to
ms is retained.

Not all file formats have dedicated support for storing datetime data; ESRI
Shapefile, for example, does not. For such formats, or if you require precision
finer than milliseconds, a workaround is to convert the datetimes to strings.
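
For example, a minimal sketch (assuming a GeoDataFrame `df` with a datetime
column `ts`):

```python
from pyogrio import write_dataframe

# ESRI Shapefile has no dedicated datetime field type; serializing to
# ISO-8601 strings preserves full precision (df and its "ts" column are
# assumed to exist)
df["ts"] = df["ts"].dt.strftime("%Y-%m-%dT%H:%M:%S.%f")
write_dataframe(df, "data.shp")
```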

Timezone information is preserved where possible; however, GDAL only represents
time zones as UTC offsets, whilst pandas uses IANA time zones (via `pytz` or
`zoneinfo`). This means that dataframes with columns containing multiple offsets
(e.g. when switching from standard time to summer time) will be written correctly,
but when read via `pyogrio.read_dataframe()` will be returned as a UTC datetime
column, as there is no way to reconstruct the original timezone from the individual
offsets present.

## Dataset and layer creation options
16 changes: 12 additions & 4 deletions docs/source/known_issues.md
@@ -41,10 +41,18 @@ geometries from the data layer.

## Character encoding

Pyogrio supports reading / writing data layers with a defined encoding. However,
DataFrames do not currently allow arbitrary metadata, which means that we are
currently unable to store encoding information for a data source. Text fields
are read into Python UTF-8 strings.
Pyogrio supports reading / writing data layers with a defined encoding. Where
possible, if the `encoding` option is not specified, GDAL will attempt to
automatically decode from the native encoding to `UTF-8`, and pyogrio will then
report that the encoding is `UTF-8` instead of the native encoding. For
[ESRI Shapefiles](https://gdal.org/drivers/vector/shapefile.html#encoding),
GDAL will use the associated `.cpg` file or a code page specified in the `.dbf`
file to infer the native encoding, but may incorrectly assume the native encoding
is `ISO-8859-1`, leading to miscoding errors. Most other drivers are assumed to
use `UTF-8`, but it is possible (in theory) to specify the `encoding` parameter
manually to force conversions to use the specified encoding value.

Field names and values are read into Python `UTF-8` strings.
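
A minimal sketch of overriding the native encoding (assuming a hypothetical
`data.shp` whose `.dbf` is encoded as Windows-1252 but has no `.cpg` file):

```python
from pyogrio import read_dataframe, write_dataframe

# without a .cpg file GDAL may assume ISO-8859-1; specifying the correct
# native encoding lets text fields decode to proper UTF-8 strings
df = read_dataframe("data.shp", encoding="cp1252")

# on write, the encoding is recorded in the .cpg file and GDAL re-encodes
# field names and text values from UTF-8 to that encoding
write_dataframe(df, "out.shp", encoding="cp1252")
```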

## No validation of geometry or field types

10 changes: 7 additions & 3 deletions pyogrio/_err.pyx
@@ -153,9 +153,13 @@ cdef inline object exc_check():
else:
# Reformat messages.
msg_b = err_msg
msg = msg_b.decode('utf-8')
msg = msg.replace("`", "'")
msg = msg.replace("\n", " ")

try:
msg = msg_b.decode('utf-8')
msg = msg.replace("`", "'")
msg = msg.replace("\n", " ")
except UnicodeDecodeError as exc:
msg = f"Could not decode error message to UTF-8. Raw error: {msg_b}"

if err_type == 3:
CPLErrorReset()
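
The decode-with-fallback pattern above, as a plain-Python sketch (illustrative
only; the byte string stands in for an error message GDAL might return):

```python
raw = b"Erreur: le champ \xe9choue"  # hypothetical Latin-1 bytes from GDAL

try:
    msg = raw.decode("utf-8").replace("`", "'").replace("\n", " ")
except UnicodeDecodeError:
    # decoding failed, so fall back to the raw bytes representation
    msg = f"Could not decode error message to UTF-8. Raw error: {raw}"
```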
177 changes: 155 additions & 22 deletions pyogrio/_io.pyx
@@ -138,6 +138,40 @@ cdef char** dict_to_options(object values):
return options


cdef const char* override_threadlocal_config_option(str key, str value):
"""Set the CPLSetThreadLocalConfigOption for key=value
Parameters
----------
key : str
value : str
Returns
-------
const char*
value previously set for key, so that it can be later restored. Caller
is responsible for freeing this via CPLFree() if not NULL.
"""

key_b = key.encode("UTF-8")
cdef const char* key_c = key_b

value_b = value.encode("UTF-8")
cdef const char* value_c = value_b


cdef const char *prev_value = CPLGetThreadLocalConfigOption(key_c, NULL)
if prev_value != NULL:
# strings returned from config options may be replaced via
# CPLSetConfigOption() below; GDAL instructs us to save a copy
# in a new string
prev_value = CPLStrdup(prev_value)

CPLSetThreadLocalConfigOption(key_c, value_c)

return prev_value
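
The same save/set/restore pattern, sketched with the GDAL Python bindings
(illustrative; `gdal.SetConfigOption` is the process-global analogue of the
thread-local C functions used here):

```python
from osgeo import gdal

# save the previous value (None if unset) so it can be restored later
prev = gdal.GetConfigOption("SHAPE_ENCODING")
gdal.SetConfigOption("SHAPE_ENCODING", "CP1252")
try:
    ...  # open and read the data source while the option is in effect
finally:
    # restore the prior value; passing None clears the option
    gdal.SetConfigOption("SHAPE_ENCODING", prev)
```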


cdef void* ogr_open(const char* path_c, int mode, char** options) except NULL:
cdef void* ogr_dataset = NULL

@@ -490,13 +524,27 @@ cdef detect_encoding(OGRDataSourceH ogr_dataset, OGRLayerH ogr_layer):
# read without recoding. Hence, it is up to you to supply the data in the
# appropriate encoding. More info:
# https://gdal.org/development/rfc/rfc23_ogr_unicode.html#oftstring-oftstringlist-fields
# NOTE: for shapefiles, this always returns False for the layer returned
# when executing SQL, even when it supports UTF-8 (patched below);
# this may be fixed by https://github.com/OSGeo/gdal/pull/9649 (GDAL >=3.9.0?)
return "UTF-8"

driver = get_driver(ogr_dataset)
if driver == "ESRI Shapefile":
# Typically, OGR_L_TestCapability returns True for OLCStringsAsUTF8 for ESRI
# Shapefile so this won't be reached. However, for old versions of GDAL and/or
# if libraries are missing, this fallback could be needed.
# OGR_L_TestCapability returns True for OLCStringsAsUTF8 (above) for
# shapefiles when a .cpg file is present with a valid encoding, or GDAL
# auto-detects the encoding from the code page of the .dbf file, or
# SHAPE_ENCODING config option is set, or ENCODING layer creation option
# is specified (shapefiles only). Otherwise, we can only assume that
# shapefiles are in their default encoding of ISO-8859-1 (which may be
# incorrect and must be overridden by user-provided encoding)

# Always use the first layer to test capabilities until detection for
# SQL results from shapefiles are fixed (above)
# This block should only be used for unfixed versions of GDAL (<3.9.0?)
if OGR_L_TestCapability(GDALDatasetGetLayer(ogr_dataset, 0), OLCStringsAsUTF8):
return "UTF-8"

return "ISO-8859-1"

if driver == "OSM":
@@ -1115,6 +1163,8 @@ def ogr_read(
cdef OGRLayerH ogr_layer = NULL
cdef int feature_count = 0
cdef double xmin, ymin, xmax, ymax
cdef const char *prev_shape_encoding = NULL
cdef bint override_shape_encoding = False

path_b = path.encode('utf-8')
path_c = path_b
@@ -1146,6 +1196,15 @@
raise ValueError("'max_features' must be >= 0")

try:
if encoding:
# for shapefiles, SHAPE_ENCODING must be set before opening the file
# to prevent automatic decoding to UTF-8 by GDAL, so we save previous
# SHAPE_ENCODING so that it can be restored later
# (we do this for all data sources where encoding is set because
# we don't know the driver until after it is opened, which is too late)
override_shape_encoding = True
prev_shape_encoding = override_threadlocal_config_option("SHAPE_ENCODING", encoding)

dataset_options = dict_to_options(dataset_kwargs)
ogr_dataset = ogr_open(path_c, 0, dataset_options)
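
At the GDAL level, the ordering matters because the shapefile driver reads
`SHAPE_ENCODING` when the dataset is opened; a sketch with the GDAL Python
bindings (illustrative, hypothetical path):

```python
from osgeo import gdal, ogr

# must be set before Open(); changing it afterwards does not affect an
# already-open dataset
gdal.SetConfigOption("SHAPE_ENCODING", "CP1252")
ds = ogr.Open("data.shp")  # text fields are recoded from CP1252 to UTF-8
```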

@@ -1160,7 +1219,18 @@

# Encoding is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)
if encoding:
if get_driver(ogr_dataset) == "ESRI Shapefile":
# NOTE: SHAPE_ENCODING is a configuration option whereas ENCODING is the dataset open option
if "ENCODING" in dataset_kwargs:
raise ValueError('cannot provide both encoding parameter and "ENCODING" option; use encoding parameter to specify correct encoding for data source')

# Because SHAPE_ENCODING is set above, GDAL will automatically
# decode shapefiles to UTF-8; ignore any encoding set by user
encoding = "UTF-8"

else:
encoding = detect_encoding(ogr_dataset, ogr_layer)

fields = get_fields(ogr_layer, encoding)
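
From the caller's perspective, a sketch of the new guard (assuming keyword
arguments such as `ENCODING` are passed through as dataset open options):

```python
from pyogrio import read_dataframe

# the encoding parameter and the ENCODING open option control the same
# behavior for shapefiles, so combining them is rejected
read_dataframe("data.shp", encoding="cp1252", ENCODING="CP1252")
# ValueError: cannot provide both encoding parameter and "ENCODING" option; ...
```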

@@ -1254,6 +1324,13 @@ def ogr_read(
GDALClose(ogr_dataset)
ogr_dataset = NULL

# reset SHAPE_ENCODING config parameter if temporarily set above
if override_shape_encoding:
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", prev_shape_encoding)

if prev_shape_encoding != NULL:
CPLFree(<void*>prev_shape_encoding)

return (
meta,
fid_data,
@@ -1322,6 +1399,8 @@ def ogr_open_arrow(
cdef char **fields_c = NULL
cdef const char *field_c = NULL
cdef char **options = NULL
cdef const char *prev_shape_encoding = NULL
cdef bint override_shape_encoding = False
cdef ArrowArrayStream* stream
cdef ArrowSchema schema

@@ -1369,6 +1448,10 @@

reader = None
try:
if encoding:
override_shape_encoding = True
prev_shape_encoding = override_threadlocal_config_option("SHAPE_ENCODING", encoding)

dataset_options = dict_to_options(dataset_kwargs)
ogr_dataset = ogr_open(path_c, 0, dataset_options)

@@ -1383,7 +1466,18 @@

# Encoding is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)
if encoding:
if get_driver(ogr_dataset) == "ESRI Shapefile":
if "ENCODING" in dataset_kwargs:
raise ValueError('cannot provide both encoding parameter and "ENCODING" option; use encoding parameter to specify correct encoding for data source')

encoding = "UTF-8"

elif encoding.replace('-','').upper() != 'UTF8':
raise ValueError("non-UTF-8 encoding is not supported for Arrow; use the non-Arrow interface instead")

else:
encoding = detect_encoding(ogr_dataset, ogr_layer)

fields = get_fields(ogr_layer, encoding, use_arrow=True)
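
A sketch of the Arrow-path restriction (hypothetical GeoPackage; shapefiles
are exempt because GDAL recodes them to UTF-8 at read time):

```python
from pyogrio.raw import open_arrow

# non-shapefile sources are not recoded by the Arrow stream, so only
# UTF-8 is accepted here
with open_arrow("data.gpkg", encoding="cp1252") as (meta, reader):
    pass
# ValueError: non-UTF-8 encoding is not supported for Arrow; ...
```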

@@ -1543,6 +1637,13 @@ def ogr_open_arrow(
GDALClose(ogr_dataset)
ogr_dataset = NULL

# reset SHAPE_ENCODING config parameter if temporarily set above
if override_shape_encoding:
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", prev_shape_encoding)

if prev_shape_encoding != NULL:
CPLFree(<void*>prev_shape_encoding)


def ogr_read_bounds(
str path,
@@ -1612,22 +1713,29 @@ def ogr_read_info(
cdef char **dataset_options = NULL
cdef OGRDataSourceH ogr_dataset = NULL
cdef OGRLayerH ogr_layer = NULL
cdef const char *prev_shape_encoding = NULL
cdef bint override_shape_encoding = False

path_b = path.encode('utf-8')
path_c = path_b


try:
if encoding:
override_shape_encoding = True
prev_shape_encoding = override_threadlocal_config_option("SHAPE_ENCODING", encoding)

dataset_options = dict_to_options(dataset_kwargs)
ogr_dataset = ogr_open(path_c, 0, dataset_options)

if layer is None:
layer = get_default_layer(ogr_dataset)
ogr_layer = get_ogr_layer(ogr_dataset, layer)

# Encoding is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)
if encoding and get_driver(ogr_dataset) == "ESRI Shapefile":
encoding = "UTF-8"
else:
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)

fields = get_fields(ogr_layer, encoding)

@@ -1663,6 +1771,13 @@
GDALClose(ogr_dataset)
ogr_dataset = NULL

# reset SHAPE_ENCODING config parameter if temporarily set above
if override_shape_encoding:
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", prev_shape_encoding)

if prev_shape_encoding != NULL:
CPLFree(<void*>prev_shape_encoding)

return meta


@@ -1956,22 +2071,26 @@ cdef create_ogr_dataset_layer(
dataset_options = NULL
raise exc

# Setup layer creation options
# Setup other layer creation options
for k, v in layer_kwargs.items():
k = k.encode('UTF-8')
v = v.encode('UTF-8')
layer_options = CSLAddNameValue(layer_options, <const char *>k, <const char *>v)

if driver == 'ESRI Shapefile':
# Fiona only sets encoding for shapefiles; other drivers do not support
# encoding as an option.
if encoding is None:
encoding = "UTF-8"
# ENCODING option must be set for shapefiles to properly write *.cpg
# file containing the encoding; this is not a supported option for
# other drivers. This is done after setting general options above
# to override ENCODING if passed by the user as a layer option.
if encoding and "ENCODING" in layer_kwargs:
raise ValueError('cannot provide both encoding parameter and "ENCODING" layer creation option; use the encoding parameter')

# always write to UTF-8 if encoding is not set
encoding = encoding or "UTF-8"
encoding_b = encoding.upper().encode('UTF-8')
encoding_c = encoding_b
layer_options = CSLSetNameValue(layer_options, "ENCODING", encoding_c)

# Setup other layer creation options
for k, v in layer_kwargs.items():
k = k.encode('UTF-8')
v = v.encode('UTF-8')
layer_options = CSLAddNameValue(layer_options, <const char *>k, <const char *>v)

### Get geometry type
# TODO: this is brittle for 3D / ZM / M types
@@ -2085,10 +2204,17 @@ def ogr_write(
&ogr_dataset, &ogr_layer,
)

# Now the dataset and layer have been created, we can properly determine the
# encoding. It is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)
if driver == 'ESRI Shapefile':
# force encoding for remaining operations to be in UTF-8 (even if user
# provides an encoding) because GDAL will automatically convert those to
# the target encoding because ENCODING is set as a layer creation option
encoding = "UTF-8"

else:
# Now the dataset and layer have been created, we can properly determine the
# encoding. It is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)

### Create the fields
field_types = None
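
On the write side, a sketch of the resulting behavior (assuming a GeoDataFrame
`df`): the user encoding becomes the shapefile `ENCODING` layer creation
option, while field names and values are handed to GDAL as UTF-8 for it to
re-encode:

```python
from pyogrio import write_dataframe

# writes out.cpg containing the declared encoding; GDAL re-encodes field
# names and text values from UTF-8 to CP1252 in the .dbf (df is assumed)
write_dataframe(df, "out.shp", encoding="cp1252")
```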
@@ -2330,10 +2456,17 @@ def ogr_write_arrow(
&ogr_dataset, &ogr_layer,
)

# only shapefile supports non-UTF encoding because ENCODING option is set
# during dataset creation and GDAL auto-translates from UTF-8 values to that
# encoding
if encoding and encoding.replace('-','').upper() != 'UTF8' and driver != 'ESRI Shapefile':
raise ValueError("non-UTF-8 encoding is not supported for Arrow; use the non-Arrow interface instead")

if geometry_name:
opts = {"GEOMETRY_NAME": geometry_name}
else:
opts = {}

options = dict_to_options(opts)

try:
Expand Down