Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recognize _Encoding attribute for char and string arrays #665

Merged
merged 29 commits into from
May 19, 2017
Merged

Conversation

jswhit
Copy link
Collaborator

@jswhit jswhit commented May 17, 2017

Add check for _Encoding attribute for NC_STRING variables, otherwise use 'utf-8'. 'utf-8' is used everywhere else, 'default_encoding' global module variable is no longer used. getncattr method now takes optional kwarg 'encoding' (default 'utf-8') so encoding of attributes can be specified if desired. If _Encoding is specified for an NC_CHAR ('S1') variable,the chartostring utility function is used to convert the array of characters to an array of strings with one less dimension (the last dimension is interpreted as the length of each string) when reading the data. When writing the data, stringtochar is used to convert a numpy array of fixed length strings to an array of characters with one more dimension. chartostring and stringtochar now also have an 'encoding' kwarg.

The _Encoding attribute convention is being discussed in Unidata/netcdf-c#402.

@jswhit
Copy link
Collaborator Author

jswhit commented May 18, 2017

@shoyer, I'm wondering how this change would impact xarray - especially the auto-conversion of char arrays to string arrays with the last dimension collapsed. This would only happen if the _Encoding attribute is set - but even so, would you want a way to disable this extra 'magic' for xarray?

@shoyer
Copy link
Contributor

shoyer commented May 18, 2017

@jswhit thanks for the heads up. Yes, I think this implementation as-is would break xarray, where we do our own char -> string array conversion.

There are two ways to fix this:

  1. Add some way to turn this off.
  2. Make the netCDF4.Variable objects have a consistent interface with the converted arrays. That means adjusting ndim, shape and dtype along with array values.

I like this second option better.

@jswhit
Copy link
Collaborator Author

jswhit commented May 18, 2017

The second option would be nice, but quite difficult since ndim, shape and dtype are assumed internally to represent the variable as stored in the file.

How about adding a set_auto_chartostring Dataset andVariable method? The default could be False if we want to be conservative.

@dopplershift
Copy link
Member

That would fit the existing API of the library, where any interpretation of attributes is configurable...

@jswhit
Copy link
Collaborator Author

jswhit commented May 18, 2017

I went ahead and added a set_auto_chartostring method - the default is True for now, just like the auto mask and scale attributes.

@shoyer
Copy link
Contributor

shoyer commented May 18, 2017

I'm okay with methods for this.

But going forward, this is probably a case for separate low level and high level interfaces, even if only the high level interface is exposed publicly. h5py uses this approach and it works quite well.

@jswhit
Copy link
Collaborator Author

jswhit commented May 19, 2017

OK, merging now. @shoyer, good idea about the low level interface. I'll create a separate ticket for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants