FIX: correctly use GDAL auto-decoding of shapefiles when encoding is set #384

Merged · 25 commits · Apr 26, 2024
Changes from 7 commits
2bfc862
Use SHAPE_ENCODING to disable auto-decoding of shapefiles
brendan-ward Apr 5, 2024
2b38f94
Correctly handle I/O for non-UTF-8 shapefiles
brendan-ward Apr 5, 2024
e738f7b
Use thread local config, add more tests
brendan-ward Apr 6, 2024
3362dae
Fix detection of encoding for shapefile layers via SQL
brendan-ward Apr 6, 2024
b991038
Expand encoding test cases
brendan-ward Apr 6, 2024
ecb56f6
Use ANSI code pages for alternative encodings (not DOS) and slightly …
brendan-ward Apr 6, 2024
994be50
Improve docs a little more
brendan-ward Apr 6, 2024
901320f
consolidate operations per PR feedback
brendan-ward Apr 8, 2024
4313fca
verify SHAPE_ENCODING global option is retained
brendan-ward Apr 8, 2024
5c4657d
Merge branch 'main' into issue380
brendan-ward Apr 8, 2024
ef602d5
Cleanup duplicate override to UTF-8
brendan-ward Apr 8, 2024
0e7e932
Merge branch 'main' into issue380
brendan-ward Apr 17, 2024
146967d
Merge branch 'main' into issue380
brendan-ward Apr 17, 2024
2bdf4dc
prevent combining encoding parameter and ENCODING open / creation option
brendan-ward Apr 17, 2024
b8cc902
Fix missing pyarrow constraint for test
brendan-ward Apr 17, 2024
e7db258
Try to verify that platform default encoding is used for CSV by default
brendan-ward Apr 17, 2024
eaecb18
split CSV platform encoding test out to verify it executes on Windows
brendan-ward Apr 17, 2024
a623501
Fix bug in skip of CSV encoding test
brendan-ward Apr 17, 2024
d615fd7
Merge branch 'main' into issue380
brendan-ward Apr 25, 2024
d2b43f3
cleanup, add arrow I/O tests
brendan-ward Apr 25, 2024
815620a
Fix missing test annotation
brendan-ward Apr 26, 2024
6eca265
Don't fail in error handler if message cannot be decoded to UTF-8
brendan-ward Apr 26, 2024
34bacc3
Fix bug in attempted fix
brendan-ward Apr 26, 2024
e088311
Fix failing test for GDAL >= 3.9
brendan-ward Apr 26, 2024
58d1d3f
Fix other failing FlatGeobuff tests
brendan-ward Apr 26, 2024
4 changes: 3 additions & 1 deletion CHANGES.md
@@ -9,7 +9,6 @@
including the CRS, when using GDAL 3.8 or higher (#366).
- Warn when reading from a multilayer file without specifying a layer (#362).


### Bug fixes

- Fix error in `write_dataframe` if input has a date column and
@@ -19,6 +18,9 @@
- Raise exception in `read_arrow` or `read_dataframe(..., use_arrow=True)` if
a boolean column is detected due to error in GDAL reading boolean values (#335)
this has been fixed in GDAL >= 3.8.3.
- Properly handle decoding of ESRI Shapefiles with user-provided `encoding`
option for `read`, `read_dataframe`, and `open_arrow`, and correctly encode
Shapefile field names and text values to the user-provided `encoding` (#384).
Member: What does this last sentence mean exactly? Is that for writing?

Member Author: Yes, that was for writing. Will make more explicit.


### Packaging

18 changes: 11 additions & 7 deletions docs/source/introduction.md
@@ -89,6 +89,10 @@ first layer.
}
```

NOTE: pyogrio will report `UTF-8` if either the native encoding is likely to be
`UTF-8` or GDAL can automatically convert from the detected native encoding to
`UTF-8`.
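What this auto-conversion amounts to can be illustrated without GDAL at all (a pure-Python sketch; the sample text and code page are made up for illustration):

```python
# Text stored on disk in a legacy single-byte code page (here CP1252).
raw = "Café".encode("cp1252")

# GDAL's auto-decoding is conceptually: decode from the detected native
# encoding, then hand the caller UTF-8 text. Pyogrio then reports the
# encoding as UTF-8, not the native CP1252.
decoded = raw.decode("cp1252")
utf8_bytes = decoded.encode("utf-8")

print(decoded)      # Café
print(utf8_bytes)   # b'Caf\xc3\xa9'
```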

To read from a layer using name or index (the following are equivalent):

```python
@@ -468,21 +472,21 @@ You can also read from a URL with this syntax:

GDAL only supports datetimes at a millisecond resolution. Reading data will thus
give at most millisecond resolution (`datetime64[ms]` data type). With pandas 2.0
`pyogrio.read_dataframe()` will return datetime data as `datetime64[ms]`
correspondingly. For previous versions of pandas, `datetime64[ns]` is used as
ms precision was not supported. When writing, only precision up to
ms is retained.

Not all file formats have dedicated support to store datetime data, like ESRI
Shapefile. For such formats, or if you require precision > ms, a workaround is to
convert the datetimes to string.
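A sketch of that workaround using only the standard library (the timestamp is made up; in practice you would apply this to the relevant dataframe column before writing):

```python
from datetime import datetime

# A timestamp with microsecond precision, which would be truncated to
# millisecond precision if written as a datetime field.
ts = datetime(2024, 4, 26, 12, 30, 15, 123456)

# Store it as a string field instead; the full precision survives.
as_text = ts.isoformat()
print(as_text)  # 2024-04-26T12:30:15.123456

# Parse it back after reading.
restored = datetime.fromisoformat(as_text)
```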

Timezone information is preserved where possible, however GDAL only represents
time zones as UTC offsets, whilst pandas uses IANA time zones (via `pytz` or
`zoneinfo`). This means that dataframes with columns containing multiple offsets
(e.g. when switching from standard time to summer time) will be written correctly,
but when read via `pyogrio.read_dataframe()` will be returned as a UTC datetime
column, as there is no way to reconstruct the original timezone from the individual
offsets present.
offsets present.
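The effect can be reproduced with the standard library alone: two timestamps recorded with different UTC offsets (standard vs. summer time, here picked arbitrarily as UTC+1 and UTC+2) normalize to plain UTC, after which nothing identifies the IANA zone that produced them:

```python
from datetime import datetime, timedelta, timezone

# Winter timestamp at UTC+1, summer timestamp at UTC+2 (e.g. Central
# European Time and its daylight-saving counterpart).
winter = datetime(2024, 1, 15, 12, 0, tzinfo=timezone(timedelta(hours=1)))
summer = datetime(2024, 7, 15, 12, 0, tzinfo=timezone(timedelta(hours=2)))

# Converting to UTC keeps the instants but discards which zone produced
# the offsets -- all that remains is a fixed +00:00 offset.
winter_utc = winter.astimezone(timezone.utc)
summer_utc = summer.astimezone(timezone.utc)
print(winter_utc.isoformat())  # 2024-01-15T11:00:00+00:00
print(summer_utc.isoformat())  # 2024-07-15T10:00:00+00:00
```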

## Dataset and layer creation options
16 changes: 12 additions & 4 deletions docs/source/known_issues.md
@@ -41,10 +41,18 @@ geometries from the data layer.

## Character encoding

Pyogrio supports reading / writing data layers with a defined encoding. However,
DataFrames do not currently allow arbitrary metadata, which means that we are
currently unable to store encoding information for a data source. Text fields
are read into Python UTF-8 strings.
Pyogrio supports reading / writing data layers with a defined encoding. Where
possible and the `encoding` option is not specified, GDAL will attempt to
automatically decode from the native encoding to `UTF-8`, and pyogrio will report
that the encoding is `UTF-8` in that case instead of the native encoding. For
[ESRI Shapefiles](https://gdal.org/drivers/vector/shapefile.html#encoding),
GDAL will use the associated `.cpg` file or a code page specified in the `.dbf`
file to infer the native encoding, but may incorrectly assume the native encoding
is `ISO-8859-1`, leading to miscoding errors. Most other drivers are assumed to
be in `UTF-8`, but it is possible (in theory) to specify the `encoding` parameter
manually to force conversions to use the specified encoding value.

Field names and values are read into Python `UTF-8` strings.
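The miscoding risk mentioned above is easy to reproduce without GDAL: CP1252 and ISO-8859-1 agree on most bytes but differ in the 0x80–0x9F range, so a CP1252 file read under an assumed ISO-8859-1 encoding silently yields wrong characters (a pure-Python sketch):

```python
# The euro sign is byte 0x80 in CP1252 but an unprintable control
# character in ISO-8859-1.
raw = "€50".encode("cp1252")

correct = raw.decode("cp1252")       # what an explicit encoding gives you
miscoded = raw.decode("iso-8859-1")  # what the wrong assumption gives you

print(correct)         # €50
print(repr(miscoded))  # '\x8050'
```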

## No validation of geometry or field types

168 changes: 156 additions & 12 deletions pyogrio/_io.pyx
@@ -486,6 +486,9 @@ cdef detect_encoding(OGRDataSourceH ogr_dataset, OGRLayerH ogr_layer):
# read without recoding. Hence, it is up to you to supply the data in the
# appropriate encoding. More info:
# https://gdal.org/development/rfc/rfc23_ogr_unicode.html#oftstring-oftstringlist-fields
    # NOTE: this always returns False for the layer returned by executing SQL,
    # even when they indeed support UTF-8; the layer underlying the SQL must
    # be passed instead
return "UTF-8"

driver = get_driver(ogr_dataset)
@@ -1109,8 +1112,12 @@ def ogr_read(
cdef char **fields_c = NULL
cdef OGRDataSourceH ogr_dataset = NULL
cdef OGRLayerH ogr_layer = NULL
cdef OGRLayerH base_layer = NULL
cdef int feature_count = 0
cdef double xmin, ymin, xmax, ymax
cdef const char* prev_shape_encoding_c = NULL
cdef char* prev_shape_encoding = NULL
cdef bint override_shape_encoding = False

path_b = path.encode('utf-8')
path_c = path_b
@@ -1142,9 +1149,35 @@
raise ValueError("'max_features' must be >= 0")

try:
if encoding:
override_shape_encoding = True

# for shapefiles, SHAPE_ENCODING must be set before opening the file
# to prevent automatic decoding to UTF-8 by GDAL, so we save previous
# SHAPE_ENCODING so that it can be restored later
# (we do this for all data sources where encoding is set because
# we don't know the driver until after it is opened, which is too late)
prev_shape_encoding_c = CPLGetThreadLocalConfigOption("SHAPE_ENCODING", NULL)
if prev_shape_encoding_c != NULL:
# strings returned from config options may be replaced via
# CPLSetConfigOption() below; GDAL instructs us to save a copy
# in a new string
prev_shape_encoding = CPLStrdup(prev_shape_encoding_c)

# set encoding used automatically by GDAL
encoding_b = encoding.encode('UTF-8')
encoding_c = encoding_b
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", <const char*>encoding_c)

dataset_options = dict_to_options(dataset_kwargs)
ogr_dataset = ogr_open(path_c, 0, dataset_options)

if encoding and get_driver(ogr_dataset) == "ESRI Shapefile":
# Because SHAPE_ENCODING is set above, GDAL will automatically decode
# to UTF-8
encoding = "UTF-8"

if sql is None:
if layer is None:
layer = get_default_layer(ogr_dataset)
@@ -1156,7 +1189,22 @@

# Encoding is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)
if encoding:
if get_driver(ogr_dataset) == "ESRI Shapefile":
# Because SHAPE_ENCODING is set above, GDAL will automatically decode
# to UTF-8; ignore any encoding set by user
encoding = "UTF-8"

else:
if sql is not None and get_driver(ogr_dataset) == "ESRI Shapefile":
# in order to properly test certain capabilities, we need to have the
# underlying layer referenced by the SQL query
base_layer = GDALDatasetGetLayer(ogr_dataset, 0)

else:
base_layer = ogr_layer

encoding = detect_encoding(ogr_dataset, base_layer)

fields = get_fields(ogr_layer, encoding)

@@ -1249,6 +1297,13 @@
GDALClose(ogr_dataset)
ogr_dataset = NULL

# reset SHAPE_ENCODING config parameter if temporarily set above
if override_shape_encoding:
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", prev_shape_encoding)

if prev_shape_encoding != NULL:
CPLFree(prev_shape_encoding)

return (
meta,
fid_data,
Expand Down Expand Up @@ -1282,9 +1337,13 @@ def ogr_open_arrow(
cdef const char *where_c = NULL
cdef OGRDataSourceH ogr_dataset = NULL
cdef OGRLayerH ogr_layer = NULL
cdef OGRLayerH base_layer = NULL
cdef char **fields_c = NULL
cdef const char *field_c = NULL
cdef char **options = NULL
cdef const char* prev_shape_encoding_c = NULL
cdef char* prev_shape_encoding = NULL
cdef bint override_shape_encoding = False
cdef ArrowArrayStream stream
cdef ArrowSchema schema

@@ -1328,6 +1387,26 @@

reader = None
try:
if encoding:
override_shape_encoding = True

# for shapefiles, SHAPE_ENCODING must be set before opening the file
# to prevent automatic decoding to UTF-8 by GDAL, so we save previous
# SHAPE_ENCODING so that it can be restored later
# (we do this for all data sources where encoding is set because
# we don't know the driver until after it is opened, which is too late)
prev_shape_encoding_c = CPLGetThreadLocalConfigOption("SHAPE_ENCODING", NULL)
if prev_shape_encoding_c != NULL:
# strings returned from config options may be replaced via
# CPLSetConfigOption() below; GDAL instructs us to save a copy
# in a new string
prev_shape_encoding = CPLStrdup(prev_shape_encoding_c)

# set encoding used automatically by GDAL
encoding_b = encoding.encode('UTF-8')
encoding_c = encoding_b
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", <const char*>encoding_c)

dataset_options = dict_to_options(dataset_kwargs)
ogr_dataset = ogr_open(path_c, 0, dataset_options)

@@ -1342,7 +1421,22 @@

# Encoding is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)
if encoding:
if get_driver(ogr_dataset) == "ESRI Shapefile":
# Because SHAPE_ENCODING is set above, GDAL will automatically decode
# to UTF-8; ignore any encoding set by user
encoding = "UTF-8"

else:
if sql is not None and get_driver(ogr_dataset) == "ESRI Shapefile":
# in order to properly test certain capabilities, we need to have the
# underlying layer referenced by the SQL query
base_layer = GDALDatasetGetLayer(ogr_dataset, 0)

else:
base_layer = ogr_layer

encoding = detect_encoding(ogr_dataset, base_layer)

fields = get_fields(ogr_layer, encoding, use_arrow=True)

@@ -1450,6 +1544,13 @@
GDALClose(ogr_dataset)
ogr_dataset = NULL

# reset SHAPE_ENCODING config parameter if temporarily set above
if override_shape_encoding:
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", prev_shape_encoding)

if prev_shape_encoding != NULL:
CPLFree(<void*>prev_shape_encoding)

def ogr_read_bounds(
str path,
object layer=None,
@@ -1518,12 +1619,34 @@ def ogr_read_info(
cdef char **dataset_options = NULL
cdef OGRDataSourceH ogr_dataset = NULL
cdef OGRLayerH ogr_layer = NULL
cdef const char* prev_shape_encoding = NULL
cdef bint override_shape_encoding = False

path_b = path.encode('utf-8')
path_c = path_b


try:
if encoding:
override_shape_encoding = True

# for shapefiles, SHAPE_ENCODING must be set before opening the file
# to prevent automatic decoding to UTF-8 by GDAL, so we save previous
# SHAPE_ENCODING so that it can be restored later
# (we do this for all data sources where encoding is set because
# we don't know the driver until after it is opened, which is too late)
prev_shape_encoding_c = CPLGetThreadLocalConfigOption("SHAPE_ENCODING", NULL)
if prev_shape_encoding_c != NULL:
# strings returned from config options may be replaced via
# CPLSetConfigOption() below; GDAL instructs us to save a copy
# in a new string
prev_shape_encoding = CPLStrdup(prev_shape_encoding_c)

# set encoding used automatically by GDAL
encoding_b = encoding.encode('UTF-8')
encoding_c = encoding_b
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", <const char*>encoding_c)


dataset_options = dict_to_options(dataset_kwargs)
ogr_dataset = ogr_open(path_c, 0, dataset_options)

@@ -1533,7 +1656,13 @@

# Encoding is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)
if encoding and get_driver(ogr_dataset) == "ESRI Shapefile":
# Because SHAPE_ENCODING is set above, GDAL will automatically decode
# to UTF-8
encoding = "UTF-8"

else:
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)

fields = get_fields(ogr_layer, encoding)

@@ -1566,6 +1695,13 @@
GDALClose(ogr_dataset)
ogr_dataset = NULL

# reset SHAPE_ENCODING config parameter if temporarily set above
if override_shape_encoding:
CPLSetThreadLocalConfigOption("SHAPE_ENCODING", prev_shape_encoding)

if prev_shape_encoding != NULL:
CPLFree(<void*>prev_shape_encoding)

return meta


@@ -1600,7 +1736,7 @@ cdef str get_default_layer(OGRDataSourceH ogr_dataset):
-------
str
the name of the default layer to be read.

"""
layers = get_layer_names(ogr_dataset)
first_layer_name = layers[0][0]
@@ -1632,7 +1768,7 @@ cdef get_layer_names(OGRDataSourceH ogr_dataset):
-------
ndarray(n)
array of layer names

"""
cdef OGRLayerH ogr_layer = NULL

@@ -1862,8 +1998,9 @@ cdef create_ogr_dataset_layer(
# Setup layer creation options

if driver == 'ESRI Shapefile':
# Fiona only sets encoding for shapefiles; other drivers do not support
# encoding as an option.
# ENCODING option must be set for shapefiles to properly write *.cpg
# file containing the encoding; this is not a supported option for
# other drivers
if encoding is None:
encoding = "UTF-8"
encoding_b = encoding.upper().encode('UTF-8')
@@ -1988,10 +2125,17 @@ def ogr_write(
&ogr_dataset, &ogr_layer,
)

# Now the dataset and layer have been created, we can properly determine the
# encoding. It is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)

if driver == 'ESRI Shapefile':
# force encoding for remaining operations to be in UTF-8 because
# GDAL will automatically convert those to the target encoding
encoding = "UTF-8"

else:
# Now the dataset and layer have been created, we can properly determine the
# encoding. It is derived from the user, from the dataset capabilities / type,
# or from the system locale
encoding = encoding or detect_encoding(ogr_dataset, ogr_layer)

### Create the fields
field_types = None
3 changes: 3 additions & 0 deletions pyogrio/_ogr.pxd
@@ -12,6 +12,9 @@ cdef extern from "cpl_conv.h":
const char* CPLFindFile(const char *pszClass, const char *filename)
const char* CPLGetConfigOption(const char* key, const char* value)
void CPLSetConfigOption(const char* key, const char* value)
const char* CPLGetThreadLocalConfigOption(const char* key, const char* value)
void CPLSetThreadLocalConfigOption(const char* key, const char* value)
char* CPLStrdup(const char* string)


cdef extern from "cpl_error.h" nogil:
4 changes: 4 additions & 0 deletions pyogrio/core.py
@@ -217,6 +217,10 @@ def read_info(
driver or if the data source is nonspatial. You can force it to be calculated using
the ``force_total_bounds`` parameter.

``encoding`` will be ``UTF-8`` if either the native encoding is likely to be
``UTF-8`` or GDAL can automatically convert from the detected native encoding
to ``UTF-8``.

Parameters
----------
path : str or pathlib.Path