Switch NC_CHAR on netCDF4 to use ASCII #316

shoyer · 2016-09-02T15:58:03Z

Fixes #298

The new test, a minor modification of tst_strings2.c to use NC_CHAR and some non-ASCII characters, passes on Travis-CI. Let me know if you need anything else.

You'll notice that I didn't modify the reading code at all -- older versions of netCDF4 already read fixed width ASCII as NC_CHAR. Thus this should present no backwards compatibility concerns.

WardF · 2016-09-02T16:03:24Z

Thanks very much; this issue wasn't being ignored, but we have a bit of a laundry list right now I'm trying to work through and so this patch is very much appreciated! I will take a look at it once the travis-ci tests finish and merge it asap.

shoyer · 2016-09-02T16:10:33Z

Apologies -- I need to double check that the netCDF license is OK (since it's not a standard open source license) before I submit this patch.

WardF · 2016-09-02T16:14:59Z

It's based on a modified BSD 3-clause license; it predates me so I'm not sure where it originated but it is designed to be open/available for use without restriction. Let us know if that presents an issue, I'd be curious to hear about it. Thanks!

DennisHeimbigner · 2016-09-02T18:51:01Z

Just looked at the code and it looks ok to me. Only question is
if it breaks some existing tests.

shoyer · 2016-09-02T19:03:13Z

OK, I got a good ahead on the license (whew!).

It does look like my new test didn't pass on Travis after all. I will have to take a more careful look here.

ocefpaf · 2016-09-02T19:13:35Z

It's based on a modified BSD 3-clause license;

Apologies for hijacking this PR and I do not want to start a discussion here but I am curious if the license is closer to BSD 3-Clause or MIT. It does like MIT to me:

http://www.unidata.ucar.edu/software/netcdf/copyright.html

There is also this text in https://github.com/Unidata/netcdf-c/blob/master/docs/software.md#java-interface-java_interface

It is freely available and the source code is released under the (MIT-style) netCDF C library license.

@WardF any change of adopting a well known license instead? Or at least making it a little bit clearer?

DennisHeimbigner · 2016-09-02T19:16:53Z

We have to be careful with licensing changes.
Is there a specific issue you see with the current one
that is preventing you from being ok with it?
=Dennis Heimbigner
Unidata

ocefpaf · 2016-09-02T19:21:24Z

We have to be careful with licensing changes.

Indeed. Do not change on my account as thew current license is OK for everything I do. However...

Is there a specific issue you see with the current one that is preventing you from being ok with it?

Only the fact that we need to hunt it down, read, and compare with well known licenses and then deal with the fuzziness of that comparison like: I say (and see) MIT while @WardF said BSD 3-Clause.

PS: If you want to discuss this further we can open another issue. I don't want to pollute @shoyer PR here. (Sorry for the comment above.)

DennisHeimbigner · 2016-09-02T19:23:04Z

No, lets close this. We have some people working on
licensing issues and I will pass on your concerns.
=Dennis Heimbigner

On 9/2/2016 1:21 PM, Filipe wrote:

We have to be careful with licensing changes.
Indeed. Do not change on my account as thew current license is OK for
everything I do. However...
Is there a specific issue you see with the current one that is
preventing you from being ok with it?
Only the fact that we need to hunt it down, read, and compare with well
known licenses and then deal with the fuzziness of that comparison like:
I say (and see) MIT while @WardF https://github.com/WardF said BSD
3-Clause.

PS: If you want to discuss this further we can open another issue. I
don't want to pollute @shoyer https://github.com/shoyer PR here.
(Sorry for the comment above.)

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#316 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AA3P28KiZYaF53jgAEGDl-V3_fgBDS94ks5qmHc0gaJpZM4Jz2To.

shoyer · 2016-09-02T19:23:34Z

Every time I submit or use code from a project with a non-standard license at work, I need to get a lawyer to read it. I think Wikipedia summarizes the issue well: https://en.wikipedia.org/wiki/License_proliferation

ocefpaf · 2016-09-02T19:26:16Z

No, lets close this.

Thanks!

We have some people working on licensing issues and I will pass on your concerns.

BTW, if there will be a change, BSD 3-clause is the best choice IMO.

shoyer · 2017-03-03T02:50:54Z

nc_test4/tst_strings3.c

+ if (att_type != NC_CHAR || att_len != ATT_LEN) ERR;
+ if (nc_get_att(ncid, NC_GLOBAL, ATT_NAME, data_in)) ERR;
+ for (i = 0; i < att_len; i++)
+ if (strcmp(data_in[i], data[i])) ERR;


The test failed on this line. Is there an easy way to get the error message to print the strings that failed the comparison?

DennisHeimbigner · 2017-03-08T02:56:21Z

There is a potential problem down the road. Currently, the netcdf-c guide (quide.dox) says:

Typically these contain 7-bit ASCII characters, but the
character encoding is application specific...
Applications writing string data using the char data type are
encouraged to add the special variable attribute "_Encoding" with a
value that the netCDF libraries recognize. Currently those valid
values are "UTF-8" or "ASCII", case insensitive.

[Note BTW that specifying UTF-8 for char dat is meaningless, the
documentation probably needs to be changed]

The HDF documentation
(https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_char_encoding.htm) says:

ASCII and UTF-8 Unicode are the only currently supported character encodings.
Extended ASCII encodings (for example, ISO 8859) are not supported.
This encoding policy is not enforced by the HDF5 Library.

The problem I see is this. It is possible for a user to set the _Encoding
to e.g. ISO-8859-1. If, in the future, HDF5 starts to enforce its encoding,
then ISO-8859-1 characters (being 8-bit) could get refused or have
the top bit turned off to enforce 7-bit ascii. Practically, this Pull request
forces only 7-bit ascii characters to be used. It would have been nice if
HDF supported a generic 8-bit character set.

That is not necessarily a bad thing, but we should be aware that we are
potentially disallowing any character encoding except ASCII (for characters)
in the future.

shoyer · 2017-03-10T07:13:17Z

OK, so I feel a little silly here. The test I added wasn't passing before, but I think that was a spurious, because I made my new test file from simply copying tst_strings2.c and adjusting the characters, and my test with non-ASCII characters passes when I simply modify tst_strings2.c directly instead.

So I think this now satisfies @DennisHeimbigner's suggested test of:

make type H5T_CSET_ASCII

store non-ascii characters (i.e. above 127)

read back and see if hdf5 masked off the top bit.

Do you have any other concerns for this patch? I'm happy to squash the git history into a single commit as desired.

That is not necessarily a bad thing, but we should be aware that we are
potentially disallowing any character encoding except ASCII (for characters)
in the future.

I agree, but I believe this is a concern even with the current choice (where NetCDF writes NC_CHAR as UTF8), because not every byte corresponds to a valid UTF-8 character (e.g., consider \xFF):
https://en.wikipedia.org/wiki/UTF-8#Description

There might even be more of a risk of HDF5 turning on validation for UTF-8 than ASCII given that UTF-8 is more complex, though this is all rather speculative without asking the HDF5 developers.

DennisHeimbigner · 2017-03-10T19:04:18Z

I am good with those changes.. If the number of commits is not llarge then
there is probably no need to squash, but I iwll leave that to Ward.

Fixes GH298

WardF · 2017-03-21T18:49:25Z

Going back through and playing catchup while I have the time; reviewing this.

shoyer · 2017-03-21T21:52:32Z

Thanks @DennisHeimbigner and @WardF !

WardF · 2017-03-21T21:53:04Z

Not at all, sorry for the long delay, thanks for your patience!

Upstream changes: ## 4.6.1 - March 15, 2018 * [Bug Fix] Corrected an issue which could result in a dap4 failure. See [Github #888](Unidata/netcdf-c#888) for more information. * [Bug Fix][Enhancement] Allow `nccopy` to control output filter suppresion. See [Github #894](Unidata/netcdf-c#894) for more information. * [Enhancement] Reverted some new behaviors that, while in line with the netCDF specification, broke existing workflows. See [Github #843](Unidata/netcdf-c#843) for more information. * [Bug Fix] Improved support for CRT builds with Visual Studio, improves zlib detection in hdf5 library. See [Github #853](Unidata/netcdf-c#853) for more information. * [Enhancement][Internal] Moved HDF4 into a distinct dispatch layer. See [Github #849](Unidata/netcdf-c#849) for more information. ## 4.6.0 - January 24, 2018 * [Enhancement] Full support for using HDF5 dynamic filters, both for reading and writing. See the file docs/filters.md. * [Enhancement] Added an option to enable strict null-byte padding for headers; this padding was specified in the spec but was not enforced. Enabling this option will allow you to check your files, as it will return an E_NULLPAD error. It is possible for these files to have been written by older versions of libnetcdf. There is no effective problem caused by this lack of null padding, so enabling these options is informational only. The options for `configure` and `cmake` are `--enable-strict-null-byte-header-padding` and `-DENABLE_STRICT_NULL_BYTE_HEADER_PADDING`, respectively. See [Github #657](Unidata/netcdf-c#657) for more information. * [Enhancement] Reverted behavior/handling of out-of-range attribute values to pre-4.5.0 default. See [Github #512](Unidata/netcdf-c#512) for more information. * [Bug] Fixed error in tst_parallel2.c. See [Github #545](Unidata/netcdf-c#545) for more information. * [Bug] Fixed handling of corrupt files + proper offset handling for hdf5 files. See [Github #552](Unidata/netcdf-c#552) for more information. * [Bug] Corrected a memory overflow in `tst_h_dimscales`, see [Github #511](Unidata/netcdf-c#511), [Github #505](Unidata/netcdf-c#505), [Github #363](Unidata/netcdf-c#363) and [Github #244](Unidata/netcdf-c#244) for more information. ## 4.5.0 - October 20, 2017 * Corrected an issue which could potential result in a hang while using parallel file I/O. See [Github #449](Unidata/netcdf-c#449) for more information. * Addressed an issue with `ncdump` not properly handling dates on a 366 day calendar. See [GitHub #359](Unidata/netcdf-c#359) for more information. ### 4.5.0-rc3 - September 29, 2017 * [Update] Due to ongoing issues, native CDF5 support has been disabled by **default**. You can use the options mentioned below (`--enable-cdf5` or `-DENABLE_CDF5=TRUE` for `configure` or `cmake`, respectively). Just be aware that for the time being, Reading/Writing CDF5 files on 32-bit platforms may result in unexpected behavior when using extremely large variables. For 32-bit platforms it is best to continue using `NC_FORMAT_64BIT_OFFSET`. * [Bug] Corrected an issue where older versions of curl might fail. See [GitHub #487](Unidata/netcdf-c#487) for more information. * [Enhancement] Added options to enable/disable `CDF5` support at configure time for autotools and cmake-based builds. The options are `--enable/disable-cdf5` and `ENABLE_CDF5`, respectively. See [Github #484](Unidata/netcdf-c#484) for more information. * [Bug Fix] Corrected an issue when subsetting a netcdf3 file via `nccopy -v/-V`. See [Github #425](Unidata/netcdf-c#425) and [Github #463](Unidata/netcdf-c#463) for more information. * [Bug Fix] Corrected `--has-dap` and `--has-dap4` output for cmake-based builds. See [GitHub #473](Unidata/netcdf-c#473) for more information. * [Bug Fix] Corrected an issue where `NC_64BIT_DATA` files were being read incorrectly by ncdump, despite the data having been written correctly. See [GitHub #457](Unidata/netcdf-c#457) for more information. * [Bug Fix] Corrected a potential stack buffer overflow. See [GitHub #450](Unidata/netcdf-c#450) for more information. ### 4.5.0-rc2 - August 7, 2017 * [Bug Fix] Addressed an issue with how cmake was implementing large file support on 32-bit systems. See [GitHub #385](Unidata/netcdf-c#385) for more information. * [Bug Fix] Addressed an issue where ncgen would not respect keyword case. See [GitHub #310](Unidata/netcdf-c#310) for more information. ### 4.5.0-rc1 - June 5, 2017 * [Enhancement] DAP4 is now included. Since dap2 is the default for urls, dap4 must be specified by (1) using "dap4:" as the url protocol, or (2) appending "#protocol=dap4" to the end of the url, or (3) appending "#dap4" to the end of the url Note that dap4 is enabled by default but remote-testing is disbled until the testserver situation is resolved. * [Enhancement] The remote testing server can now be specified with the `--with-testserver` option to ./configure. * [Enhancement] Modified netCDF4 to use ASCII for NC_CHAR. See [Github Pull request #316](Unidata/netcdf-c#316) for more information. * [Bug Fix] Corrected an error with how dimsizes might be read. See [Github #410](Unidata/netcdf-c#410) for more information. * [Bug Fix] Corrected an issue where 'make check' would fail if 'make' or 'make all' had not run first. See [Github #339](Unidata/netcdf-c#339) for more information. * [Bug Fix] Corrected an issue on Windows with Large file tests. See [Github #385](Unidata/netcdf-c#385]) for more information. * [Bug Fix] Corrected an issue with diskless file access, see [Pull Request #400](Unidata/netcdf-c#400) and [Pull Request #403](Unidata/netcdf-c#403) for more information. * [Upgrade] The bash based test scripts have been upgraded to use a common test_common.sh include file that isolates build specific information. * [Upgrade] The bash based test scripts have been upgraded to use a common test_common.sh include file that isolates build specific information. * [Refactor] the oc2 library is no longer independent of the main netcdf-c library. For example, it now uses ncuri, nclist, and ncbytes instead of its homegrown equivalents. * [Bug Fix] `NC_EGLOBAL` is now properly returned when attempting to set a global `_FillValue` attribute. See [GitHub #388](Unidata/netcdf-c#388) and [GitHub #389](Unidata/netcdf-c#389) for more information. * [Bug Fix] Corrected an issue where data loss would occur when `_FillValue` was mistakenly allowed to be redefined. See [Github #390](Unidata/netcdf-c#390), [GitHub #387](Unidata/netcdf-c#387) for more information. * [Upgrade][Bug] Corrected an issue regarding how "orphaned" DAS attributes were handled. See [GitHub #376](Unidata/netcdf-c#376) for more information. * [Upgrade] Update utf8proc.[ch] to use the version now maintained by the Julia Language project (https://github.com/JuliaLang/utf8proc/blob/master/LICENSE.md). * [Bug] Addressed conversion problem with Windows sscanf. This primarily affected some OPeNDAP URLs on Windows. See [GitHub #365](Unidata/netcdf-c#365) and [GitHub #366](Unidata/netcdf-c#366) for more information. * [Enhancement] Added support for HDF5 collective metadata operations when available. Patch submitted by Greg Sjaardema, see [Pull request #335](Unidata/netcdf-c#335) for more information. * [Bug] Addressed a potential type punning issue. See [GitHub #351](Unidata/netcdf-c#351) for more information. * [Bug] Addressed an issue where netCDF wouldn't build on Windows systems using MSVC 2012. See [GitHub #304](Unidata/netcdf-c#304) for more information. * [Bug] Fixed an issue related to potential type punning, see [GitHub #344](Unidata/netcdf-c#344) for more information. * [Enhancement] Incorporated an enhancement provided by Greg Sjaardema, which may improve read/write times for some complex files. Basically, linked lists were replaced in some locations where it was safe to use an array/table. See [Pull request #328](Unidata/netcdf-c#328) for more information.

shoyer closed this Sep 2, 2016

shoyer reopened this Sep 2, 2016

shoyer commented Mar 3, 2017

View reviewed changes

Switch NC_CHAR on netCDF4 to use ASCII

4dd8e38

Fixes GH298

shoyer force-pushed the NC_CHAR_to_ASCII branch from 2f42d14 to 4dd8e38 Compare March 14, 2017 03:12

WardF self-requested a review March 21, 2017 21:01

WardF added a commit that referenced this pull request Mar 21, 2017

Updated release notes in support of #316

d0d1a69

WardF added the status/under review label Mar 21, 2017

WardF added this to the 4.4.2 milestone Mar 21, 2017

WardF merged commit 4dd8e38 into Unidata:master Mar 21, 2017

cwardgar mentioned this pull request Mar 26, 2017

ArrayChar should be backed by byte[] and honor the "_Encoding" attribute Unidata/thredds#788

Open

This was referenced Apr 7, 2017

netCDF attributes loaded as ndarrays of bytes h5netcdf/h5netcdf#27

Closed

Read unicode strings in attributes h5py/h5py#585

Closed

LocutusOfBorg mentioned this pull request Sep 8, 2017

broken with h5py 2.7.1 h5netcdf/h5netcdf#31

Closed

shoyer deleted the NC_CHAR_to_ASCII branch January 11, 2018 01:45

shoyer mentioned this pull request Jan 11, 2018

Change of file encoding when opening pydata/xarray#1818

Closed

jwpeterson mentioned this pull request Feb 14, 2019

Upgrade bundled netCDF to version 4.6.2 libMesh/libmesh#2035

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch NC_CHAR on netCDF4 to use ASCII #316

Switch NC_CHAR on netCDF4 to use ASCII #316

shoyer commented Sep 2, 2016 •

edited

Loading

WardF commented Sep 2, 2016

shoyer commented Sep 2, 2016

WardF commented Sep 2, 2016

DennisHeimbigner commented Sep 2, 2016

shoyer commented Sep 2, 2016

ocefpaf commented Sep 2, 2016

DennisHeimbigner commented Sep 2, 2016

ocefpaf commented Sep 2, 2016

DennisHeimbigner commented Sep 2, 2016

shoyer commented Sep 2, 2016

ocefpaf commented Sep 2, 2016

shoyer Mar 3, 2017

DennisHeimbigner commented Mar 8, 2017

shoyer commented Mar 10, 2017

DennisHeimbigner commented Mar 10, 2017

WardF commented Mar 21, 2017

shoyer commented Mar 21, 2017

WardF commented Mar 21, 2017

Switch NC_CHAR on netCDF4 to use ASCII #316

Switch NC_CHAR on netCDF4 to use ASCII #316

Conversation

shoyer commented Sep 2, 2016 • edited Loading

WardF commented Sep 2, 2016

shoyer commented Sep 2, 2016

WardF commented Sep 2, 2016

DennisHeimbigner commented Sep 2, 2016

shoyer commented Sep 2, 2016

ocefpaf commented Sep 2, 2016

DennisHeimbigner commented Sep 2, 2016

ocefpaf commented Sep 2, 2016

DennisHeimbigner commented Sep 2, 2016

shoyer commented Sep 2, 2016

ocefpaf commented Sep 2, 2016

shoyer Mar 3, 2017

Choose a reason for hiding this comment

DennisHeimbigner commented Mar 8, 2017

shoyer commented Mar 10, 2017

DennisHeimbigner commented Mar 10, 2017

WardF commented Mar 21, 2017

shoyer commented Mar 21, 2017

WardF commented Mar 21, 2017

shoyer commented Sep 2, 2016 •

edited

Loading