Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roundtrip unicode strings even when written as character arrays #1648

Merged
merged 10 commits into from
Oct 27, 2017

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Oct 23, 2017

Unicode strings (str on Python 3) are now round-tripped successfully even when written as character arrays (e.g., as netCDF3 files or when using engine='scipy'). This is controlled by the _Encoding attribute convention, which is also understood directly by the netCDF4-Python interface.

This PR also resolves some long-standing technical debt in the test suite related to the hacky use of decode_bytes in assert_allclose (recently encountered by @jhamman in #1609). Once we're sure that we don't need it anymore, I'd like to deprecate and eventually remove the decode_bytes option.

Note that there are still a few unresolved issues with regards to serializing missing values in strings, so I've intentionally held off on documenting the handling of _FillValue for now. I'd like to resolve those separately after discussion in #1647, but ideally this could make it in for the v0.10 release.

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer - looks pretty good. I had a few small comments. I should say, I tend to gloss over pretty quickly when looking at Python's string encoding stuff so hopefully someone else can review some of the nitty-gritty logic/tests.

doc/io.rst Outdated

xarray can write unicode strings to netCDF files in two ways:

- As variable lengths strings. This is only supported on netCDF4 (HDF5) files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use the nc4/hdf lingo, perhaps "variable length arrays" is clearer. Also, typo in "lengths".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

netCDF4-python refers to "VLEN strings" in a section entitled "Variable length (VLEN) types": http://unidata.github.io/netcdf4-python/#section11

and h5py talks about "variable-length UTF-8":
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8

These both sound like variable length strings to me.

(I fixed the typo.)

'(https://github.com/Unidata/netcdf4-python/issues/730). '
"Either remove '_FillValue' from encoding on variable %r "
"or set {'dtype': 'S1'} in encoding to use the fixed width "
'NC_CHAR type.' % name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is raised in the h5netcdf backend. It may be confusing to get an error that says, "netCDF4 doesn't support what you're trying to do." If you think it is important to call out netCDF4 here, perhaps lump both backends together, i.e. "the h5netcdf/netCDF4 backends do not yet support..."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed to refer to an h5netcdf issue instead.

@@ -1389,6 +1408,10 @@ def roundtrip(self, data, save_kwargs={}, open_kwargs={},
allow_cleanup_failure=False):
yield data.chunk()

def test_roundtrip_string_encoded_characters(self):
# Override method in DatasetIOTestCases - not applicable to dask
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use pytest.skip() here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a skipped test, so much as a test that really shouldn't exist at all :). We slightly abuse the notion of a "backend" for dask here.

with self.roundtrip(original) as actual:
self.assertDatasetIdentical(expected, actual)
except NotImplementedError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use pytest.skip() here and the pass statement above.

Copy link
Member Author

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned up my _FillValue tests to explicit check for errors -- that seems like a better approach for now, rather than marking them as expected failures or skipping them.

doc/io.rst Outdated

xarray can write unicode strings to netCDF files in two ways:

- As variable lengths strings. This is only supported on netCDF4 (HDF5) files.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

netCDF4-python refers to "VLEN strings" in a section entitled "Variable length (VLEN) types": http://unidata.github.io/netcdf4-python/#section11

and h5py talks about "variable-length UTF-8":
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8

These both sound like variable length strings to me.

(I fixed the typo.)

'(https://github.com/Unidata/netcdf4-python/issues/730). '
"Either remove '_FillValue' from encoding on variable %r "
"or set {'dtype': 'S1'} in encoding to use the fixed width "
'NC_CHAR type.' % name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed to refer to an h5netcdf issue instead.

@@ -1389,6 +1408,10 @@ def roundtrip(self, data, save_kwargs={}, open_kwargs={},
allow_cleanup_failure=False):
yield data.chunk()

def test_roundtrip_string_encoded_characters(self):
# Override method in DatasetIOTestCases - not applicable to dask
pass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a skipped test, so much as a test that really shouldn't exist at all :). We slightly abuse the notion of a "backend" for dask here.

@shoyer
Copy link
Member Author

shoyer commented Oct 24, 2017

I should say, I tend to gloss over pretty quickly when looking at Python's string encoding stuff so hopefully someone else can review some of the nitty-gritty logic/tests.

@jhamman This reminds me of myself three years ago when I wrote these original messy tests! The mostly but not entirely compatible semantics of Python 3, NumPy, netCDF and HDF5 makes this very hard to get right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unicode strings unexpectedly transformed to byte strings upon open_dataset
2 participants