recognize _Encoding attribute for char and string arrays #665

jswhit · 2017-05-17T13:39:06Z

Check for the _Encoding attribute on NC_STRING variables and fall back to 'utf-8' when it is absent. 'utf-8' is used everywhere else, and the 'default_encoding' global module variable is no longer used. The getncattr method now takes an optional kwarg 'encoding' (default 'utf-8') so the encoding of attributes can be specified if desired.

If _Encoding is specified for an NC_CHAR ('S1') variable, the chartostring utility function is used when reading the data to convert the array of characters to an array of strings with one less dimension (the last dimension is interpreted as the length of each string). When writing the data, stringtochar is used to convert a numpy array of fixed-length strings to an array of characters with one more dimension. chartostring and stringtochar now also have an 'encoding' kwarg.
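To make the shape change concrete, here is a minimal pure-Python sketch of the two directions of the conversion. It is an illustration only, not the code in this PR: `chars_to_strings` and `strings_to_chars` are hypothetical names, nested lists stand in for numpy arrays, and the NUL padding of short strings is an assumption made for the example.

```python
def chars_to_strings(rows, encoding="utf-8"):
    """Collapse the last dimension: each row of single bytes becomes one string."""
    # Assumption: short strings are right-padded with NUL bytes, which we strip.
    return [b"".join(row).rstrip(b"\x00").decode(encoding) for row in rows]


def strings_to_chars(strings, width, encoding="utf-8"):
    """Add a dimension: each string becomes `width` single bytes, NUL-padded."""
    rows = []
    for s in strings:
        data = s.encode(encoding)
        if len(data) > width:
            raise ValueError(f"{s!r} needs {len(data)} bytes, limit is {width}")
        padded = data.ljust(width, b"\x00")
        # Slicing with i:i + 1 keeps each element a length-1 bytes object.
        rows.append([padded[i:i + 1] for i in range(width)])
    return rows


chars = strings_to_chars(["foo", "né"], width=4)  # 2 rows of 4 bytes each
strings = chars_to_strings(chars)                 # 2 strings, one dimension fewer
```

Note that the width counts encoded bytes rather than characters, so "né" occupies three of the four slots under 'utf-8'.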

The _Encoding attribute convention is being discussed in Unidata/netcdf-c#402.
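The fallback rule in the description (use _Encoding if the variable has it, otherwise 'utf-8') can be sketched as follows. This is a hedged illustration: `variable_encoding` is a hypothetical helper and a plain dict stands in for a variable's attributes. It is not how netCDF4 stores attributes internally.

```python
def variable_encoding(attrs, default="utf-8"):
    """Return the variable's _Encoding attribute, or the default if unset."""
    return attrs.get("_Encoding", default)


raw = "café".encode("latin-1")  # bytes as they might be stored in a file

# With _Encoding present, the stored bytes decode as intended.
with_attr = raw.decode(variable_encoding({"_Encoding": "latin-1"}))

# Without it, the 'utf-8' fallback applies; errors='replace' substitutes
# U+FFFD for the undecodable byte instead of raising.
without_attr = raw.decode(variable_encoding({}), "replace")
```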

jswhit · 2017-05-18T03:31:26Z

@shoyer, I'm wondering how this change would impact xarray - especially the auto-conversion of char arrays to string arrays with the last dimension collapsed. This would only happen if the _Encoding attribute is set - but even so, would you want a way to disable this extra 'magic' for xarray?

shoyer · 2017-05-18T15:28:03Z

@jswhit thanks for the heads up. Yes, I think this implementation as-is would break xarray, where we do our own char -> string array conversion.

There are two ways to fix this:

1. Add some way to turn this off.
2. Make the netCDF4.Variable objects have a consistent interface with the converted arrays. That means adjusting ndim, shape and dtype along with array values.

I like this second option better.

jswhit · 2017-05-18T16:40:12Z

The second option would be nice, but quite difficult since ndim, shape and dtype are assumed internally to represent the variable as stored in the file.

How about adding a set_auto_chartostring Dataset and Variable method? The default could be False if we want to be conservative.

dopplershift · 2017-05-18T17:19:00Z

That would fit the existing API of the library, where any interpretation of attributes is configurable...

jswhit · 2017-05-18T17:29:35Z

I went ahead and added a set_auto_chartostring method - the default is True for now, just like the auto mask and scale attributes.
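A rough sketch of how such a toggle behaves, and of the shape mismatch raised earlier in the thread. `CharVar` is a hypothetical stand-in written for this illustration and is not the netCDF4.Variable class; only the method name set_auto_chartostring and the default of True come from the discussion above.

```python
class CharVar:
    """Hypothetical 2-D character variable; not netCDF4's implementation."""

    def __init__(self, rows, attrs=None):
        self._rows = rows                # rows of single-byte values, as stored
        self.attrs = dict(attrs or {})
        self.auto_chartostring = True    # on by default, as in the comment above

    @property
    def shape(self):
        # Always reports the stored shape, whether or not conversion is on.
        return (len(self._rows), len(self._rows[0]) if self._rows else 0)

    def set_auto_chartostring(self, flag):
        self.auto_chartostring = bool(flag)

    def read(self):
        encoding = self.attrs.get("_Encoding")
        # Conversion happens only when the toggle is on AND _Encoding is set.
        if self.auto_chartostring and encoding is not None:
            return [b"".join(r).rstrip(b"\x00").decode(encoding)
                    for r in self._rows]
        return [list(r) for r in self._rows]


var = CharVar([[b"a", b"b", b"\x00"], [b"c", b"d", b"e"]],
              {"_Encoding": "ascii"})
converted = var.read()            # strings, last dimension collapsed
var.set_auto_chartostring(False)
raw = var.read()                  # untouched character rows
```

Here `var.shape` stays (2, 3) even while `read()` returns two strings, which is the inconsistency a caller doing its own char-to-string conversion would need to switch off.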

shoyer · 2017-05-18T18:23:16Z

I'm okay with methods for this.

But going forward, this is probably a case for separate low level and high level interfaces, even if only the high level interface is exposed publicly. h5py uses this approach and it works quite well.

jswhit · 2017-05-19T15:32:04Z

OK, merging now. @shoyer, good idea about the low level interface. I'll create a separate ticket for that.

jswhit added 13 commits May 16, 2017 10:03

get rid of 'default_encoding' and 'unicode_error' module variables.

bd91f93

Use 'utf-8' and 'replace' for everything except NC_STRING variable data. For NC_STRING variable data, look for _Encoding variable attribute, otherwise use 'utf-8'.

add encoding kwarg to getncattr

7a8779c

make sure all vlen string vars use _Encoding

659addf

use chartostring if _Encoding specified for a NC_CHAR variable

913dd95

use stringtochar to write array of strings with _Encoding is set for character variables.

ae1beed

fix bug in previous commit

46d9e41

use errors='surrogateescape' in chartostring and stringtochar, force 'U' dtype in chartostring.

11ee7f9

regenerate C source

52d8f7c

don't use 'surrogateescape' since it's not available in python 2.7

6f4b66c

update

2ffb32f

fix failing tests

371b34b

get nc_open_mem from netcdf_mem.h

047ef12

regenerate C source

0e75433

jswhit mentioned this pull request May 17, 2017

Conventions for string and character array encoding Unidata/netcdf-c#402

Open

jswhit added 12 commits May 17, 2017 10:18

add test for _Encoding with char arrays

74b0353

use ascii not utf-8

fb49b45

add some different slices to test

7f6204a

add new test for unicode attributes

083342e

set NO_NET=1

779ac4a

set NO_NET=1 in run_test.py

7748778

update

ac4157a

make sure python strings are cast to a numpy string array when writing to a char variable with _Encoding set.

da9fc72

update

048d590

test writing a python string, whether _Encoding is ignored when an character array (type='S1') is given

42e2643

perform conversion for bytes too

6d76a26

fix failing test in python 3

8078e60

jswhit added 2 commits May 18, 2017 06:30

don't convert to stringarr if last dimension of slice doesn't match last dim of char variable.

a009936

fix corner case when slice is not along last dimension (don't return a string array)

eb9ff0e

add set_auto_chartostring Dataset and Variable method.

0d9ade9

update docstrings

6d50a22

jswhit mentioned this pull request May 19, 2017

Dataset class should support encoding parameter to override global attribute #654

Closed

jswhit merged commit f8e55d8 into master May 19, 2017

jswhit deleted the encoding branch May 19, 2017 15:32

This was referenced Oct 21, 2017

fix to_netcdf append bug (GH1215) pydata/xarray#1609

Merged

Unicode strings unexpectedly transformed to byte strings upon open_dataset pydata/xarray#1638

Closed
