Conventions for string and character array encoding #402
Comments
Should an _Encoding attribute on a 'char' typed variable be restricted to a 7- or 8-bit encoding? |
As @DennisHeimbigner mentions here Unidata/netcdf4-python#654 (comment), this proposal deals only with char or String typed variables, not char or String typed attributes. |
Why wouldn't _Encoding apply to attributes as well as variable data? |
Because netcdf does not support attributes for attributes. We would
need to come up with some kind of convention for this: a second
attribute that could be interpreted as applying to the string/char
attribute. Alternate is to define a global encoding for all attributes.
=Dennis Heimbigner
Unidata
…On 5/3/2017 8:01 AM, Jeff Whitaker wrote:
Why wouldn't |_Encoding| apply to attributes as well as variable data?
|
@DennisHeimbigner, I'm guessing you can answer @ethanrd's question:
|
If an _Encoding is specified, then that technically determines 7 vs 8 bit. E.g. Ascii is 7 bit, but |
I suggested always indicating the encoding/charset with the same attribute ( So, as an alternate to the above proposal, I'll restate Bob's proposal here with a change or two given the target is the NUG rather than CF:
Reviewing Bob's original proposal brought up a number of questions on how the netCDF-4 and HDF5 libraries handle string encoding (if they enforce the encoding or not, etc.). I'm still digging and will report back when I get somewhere. Also, there was some question in the CF discussion on whether an explicit indicator was needed to differentiate between whether a |
@lesserwhirls, do you have any thoughts here? @ethanrd are you still looking at this, or can we propose the above changes to NUG? |
I do not understand the need for the _CharSet attribute. The type of the variable (char vs String) |
@DennisHeimbigner, the problem is that while netcdf4 has char or string, netcdf3 has only char. So we don't know whether the netcdf3 char holds a string or an array of 8 bit characters. |
@DennisHeimbigner, this is an alternative proposal. _CharSet would be for char variables when they are to be interpreted as individual chars. _Encoding would be for String variables (e.g., in nc4) and char variables in nc3 which should be interpreted as Strings. A further advantage is that only one attribute is needed per variable, not two. Think of it from a software reader's point of view:
|
I think the term "mandatory" is being misused here since a default is defined. |
One other question. If we had an attribute to indicate that a char array |
@BobSimons Given the backward compatibility issues, I'm not sure the NUG should specify how character arrays are interpreted when the proposed attributes are not used. At least not at the level of a MUST. |
The proposed default behavior is to assume that a netcdf3 char array is a string. |
With the original proposal, an nc3 file might have:
With the alternative proposal, that nc3 file would have:
because _Encoding now says two things (this var is a String var and the encoding is ...) |
@BobSimons I don't think the default for NC3 can be UTF-8 because there are existing NC3 files w/o |
I think what @BobSimons means is that the convention will be to assume string and UTF-8 for char arrays without any attributes because, as you say, it's ambiguous, and software will have to do something! |
And I leave it to everyone else to say what the default should be. There
are advantages and disadvantages to every choice.
ISO-8859-1 probably makes sense from a safe, backward-looking sense.
UTF-8 would be nice in a forward-looking sense.
--
Sincerely,
Bob Simons
IT Specialist
Environmental Research Division
NOAA Southwest Fisheries Science Center
99 Pacific St., Suite 255A (New!)
Monterey, CA 93940 (New!)
Phone: (831)333-9878 (New!)
Fax: (831)648-8440
Email: bob.simons@noaa.gov
The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><
|
If utf-8 is an option, why are we restricting the rest of the list to iso-8859-1/15? |
Under either proposal, for char variables that will be interpreted as
individual characters (which will be stored as individual bytes in .nc
files), UTF-8 isn't and can't be an option because most UTF-8 characters
are represented as more than one byte.
Under either proposal, for char variables that will be interpreted as
Strings, UTF-8 is a valid option.
That said, why just 2 other options?
It can't be open-ended because then all software which tries to read a .nc
file is responsible for being ready to read every possible encoding.
There's a question of what are the "correct" or at least valid names --
different systems seem to use slightly different names. Different computer
languages support different options. So there needs to be a defined list of
acceptable options. Right now, that list is short.
ISO-8859-1 is nice because it is the same as the first 256 characters of
Unicode. So it is the closest to what netcdf library has been doing when
writing just the low byte of a Unicode character. ISO-8859-1 has been
widely used.
ISO-8859-15 is nice because it is the modern version of ISO-8859-1.
ISO-8859-15 has been fairly widely used.
Support for options other than UTF-8 is a way of dealing with legacy files.
There are millions (billions?) of .nc files that aren't going to be
re-written, so it would be nice if there were a way to specify the encoding
if it is known. If it is known, it could be specified by adding an, e.g.,
_Encoding attribute with NCO or on-the-fly with NCML without having to
write a program to read the file and write the file out with the attribute
specifying the encoding.
I personally am open to allowing other options if the need arises, but I
don't know which other options are needed. If others are added, we need to
agree on the specific names.
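
The byte-count argument above can be checked with a short Python sketch. It is only an illustration, not part of either proposal: the sample characters and the helper name `encoded_length` are chosen here for demonstration. It shows that UTF-8 needs more than one byte for non-ASCII characters, and that ISO-8859-1 byte values coincide with the first 256 Unicode code points.

```python
# Illustrative only: how many bytes does one character need in each encoding?
samples = ["A", "é", "€", "Я"]
encodings = ["ISO-8859-1", "ISO-8859-15", "UTF-8"]


def encoded_length(ch, encoding):
    """Return the number of bytes needed for ch, or None if ch is not representable."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


# table[character][encoding] -> byte count (or None)
table = {ch: {enc: encoded_length(ch, enc) for enc in encodings} for ch in samples}

# ISO-8859-1 maps byte value n directly to Unicode code point n.
latin1_text = bytes(range(256)).decode("ISO-8859-1")
latin1_matches_unicode = all(ord(c) == i for i, c in enumerate(latin1_text))

for ch, row in table.items():
    print(ch, row)
print("ISO-8859-1 equals the first 256 code points:", latin1_matches_unicode)
```

A one-byte-per-element char variable can therefore hold any single ISO-8859-1 or ISO-8859-15 character, but only the ASCII subset of UTF-8.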
|
Well, the problem with ISO-8859-1 (aka latin-1) and ISO-8859-15 (aka latin-9) is that they're distinctly focused on Western European languages. We should at a minimum look at something like koi8-r and cp1251 to encompass Eastern European/Cyrillic characters. You should also be able to declare ascii itself to indicate that you only intend to use the lowest 7 bits. |
Every charset has a different focus.
I suggested ISO-8859-1 and -15 because I know they have been widely used.
If you know of files that use koi8-r and cp1251, then let's add them to the
list of acceptable charsets/encodings.
I don't like the idea of allowing an ASCII (7-bit) option because the data
is 8-bits. A reader has to be ready to deal with 8-bit data. (Or we could
say that ASCII is a valid option but if the file has a character using the
8th bit, the file is invalid. I suspect we would get a lot of invalid files
from non-ASCII apostrophes and hyphens that the file authors aren't even
aware of.) I also don't see the need for an ASCII option because, if the
author really believes the characters are all ASCII, then ISO-8859-1 can be
specified (since the first 128 chars are the same).
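
As a rough illustration of this point (a sketch with made-up sample data, not a proposed validation rule): 7-bit data reads the same whether it is labelled ASCII or ISO-8859-1, while a single byte using the 8th bit makes a strict ASCII reader fail. The helper name `fits_in_ascii` is hypothetical.

```python
def fits_in_ascii(raw: bytes) -> bool:
    """True when no byte in raw uses the 8th bit."""
    return all(b < 0x80 for b in raw)


plain = b"temperature"
accented = "caf\u00e9".encode("ISO-8859-1")  # b"caf\xe9", uses the 8th bit

# Pure 7-bit data decodes to the same text under either label.
same_text = plain.decode("ascii") == plain.decode("ISO-8859-1")

# Data using the 8th bit is rejected by a strict ASCII decoder ...
try:
    accented.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ... but is still valid ISO-8859-1.
latin1_text = accented.decode("ISO-8859-1")

print(fits_in_ascii(plain), fits_in_ascii(accented), same_text, ascii_ok, latin1_text)
```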
|
for multidimensional character arrays that are to be interpreted as strings, is there a standard way to interpret the dimensions? Should the last dimension be interpreted as the length of the strings? If so, is there a convention for naming that dimension? |
"Yes" for your first two questions: CF 1.6 (and previous) section 2.2
http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_data_types
says that it is always the last dimension that holds the number of characters:
"NetCDF does not support a character string type, so these must be
represented as character arrays. In this document, a one dimensional array
of character data is simply referred to as a "string". An n-dimensional
array of strings must be implemented as a character array of dimension
(n,max_string_length), with the last (most rapidly varying) dimension
declared large enough to contain the longest string in the array. All the
strings in a given array are therefore defined to be equal in length. For
example, an array of strings containing the names of the months would be
dimensioned (12,9) in order to accommodate "September", the month with the
longest name."
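
The month-name layout described in that passage can be sketched in plain Python. This is a toy model using nested lists rather than a real netCDF variable, and the NUL padding is an assumption, since the quoted text does not say how short strings are padded. The helper names `to_char_row` and `collapse_last_dimension` are hypothetical.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MAX_LEN = max(len(name) for name in MONTHS)  # length of "September"


def to_char_row(name, width, pad=b"\x00"):
    """Split one string into `width` single-byte elements, padding on the right."""
    raw = name.encode("ascii").ljust(width, pad)
    return [raw[i:i + 1] for i in range(width)]


def collapse_last_dimension(rows, encoding="ascii", pad=b"\x00"):
    """Join the last (most rapidly varying) dimension into one string per row."""
    return [b"".join(row).rstrip(pad).decode(encoding) for row in rows]


# A (12, 9) array of single characters, as in the CF example.
char_array = [to_char_row(name, MAX_LEN) for name in MONTHS]
shape = (len(char_array), len(char_array[0]))

strings = collapse_last_dimension(char_array)
print(shape, strings[:3])
```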
"No" for your third question:
As far as I know, there is no standard for how that dimension should be
named.
CF section 2.3 says
"This convention does not standardize any variable or dimension names. "
|
@dopplershift , are you satisfied with the explanation @BobSimons provided? |
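I added automatic detection of the _Encoding attribute in netcdf4-python (Unidata/netcdf4-python#665). For string variables, if _Encoding is set it is used to encode the strings into bytes when writing to the file, and to decode the bytes into strings when reading from the file. If _Encoding is not specified, utf-8 is used (which was the previous behavior). When reading data from character variables _Encoding is used to convert the character array to an array of fixed length strings, assuming the last dimension is the length of the strings. When writing data to character variables, _Encoding is used to encode the string arrays into bytes, creating an array of individual characters with one more dimension. For character variables, if _Encoding is not set, an array of characters is returned. |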
This seems to be significantly different from the original proposal or the
alternate proposal.
"When writing data to character variables, _Encoding is used to encode the
string arrays into bytes, creating an array of individual characters with
one more dimension. For character variables, if _Encoding is not set, an
array of characters is returned."
I'm confused. Since netcdf4 has separate char and String data types, why
are you adding a dimension when writing chars to a char variable?
Is this your way of allowing chars in a char variable to be encoded with
UTF-8 (and thus perhaps take up multiple bytes / char)? That would expand
the usage of chars significantly.
And when reading a char variable from an nc4 file, won't an array of chars
always be returned? (Or, again, is this your way of expanding the usage of
chars to include UTF-8 encoding?) And can't _Encoding be used to indicate
the charset of the returned characters (e.g., ISO-8859-1)?
---
This usage seems oriented to just reading and writing netcdf-4 files. It
doesn't solve the problem of how to interpret a char variable in a netcdf-3
file (as strings? as separate chars?). One of the complaints in the CF
discussion was: someone writing code to read a file shouldn't have to know
whether they are reading an nc3 file or an nc4 file in order to know how to
interpret the data. It would be nice to have a system that works with nc3
and nc4 files.
On Wed, May 17, 2017 at 8:49 AM, Jeff Whitaker ***@***.***> wrote:
I added automatic detection of the _Encoding attribute in netcdf4-python
(Unidata/netcdf4-python#665). For string
variables, if _Encoding is set it is used to encode the strings into
bytes when writing to the file, and to decode the bytes into strings when
reading from the file. If _Encoding is not specified, utf-8 is used
(which was the previous behavior). When reading data from character
variables _Encoding is used to convert the character array to an array of
fixed length strings, assuming the last dimension is the length of the
strings. When writing data to character variables, _Encoding is used to
encode the string arrays into bytes, creating an array of individual
characters with one more dimension. For character variables, if _Encoding
is not set, an array of characters is returned.
|
@rsignell-usgs @BobSimons |
What I wish was the case was this:
|
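Also, with respect to using the rightmost dim to encode (fixed-length) strings: this is purely an external convention and is certainly not part of the netcdf spec. It raises a question: who actually makes use of this convention? I know only one place: the conversion of DAP2 string typed vars into netcdf-3 character typed variables. Is it used anywhere else? |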
--- External?
I've always been confused about the relationship of netcdf and CF so I
don't know if you consider CF external, but using the rightmost dim to
encode strings in char variables is part of the CF specification (section
2.2).
--- Where is this relevant?
Doesn't netcdf-java always use the rightmost dim when you use
NetcdfFileWriter.addStringVariable() and
NetcdfFileWriter.writeStringData() when writing an nc3 file?
And doesn't it use the rightmost dim when you use
NetcdfFile.read(), readData(), and readSection()?
(When reading nc3 files, do those/how do those distinguish
char variables that should be read as individual chars from
char variables that should be read as Strings?)
Doesn't netcdf-c do the same?
Some other software (e.g., some of mine) also uses the rightmost dimension
system explicitly in places that were written before (or before my
awareness of) writeStringData().
On Wed, May 17, 2017 at 11:26 AM, DennisHeimbigner ***@***.*** > wrote:
Also, with respect to using the rightmost dim to encode (fixed-length)
strings: this is purely
an external convention and is certainly not part of the netcdf spec. It
raises a question?
Who actually makes use of this convention? I know only one place: the
conversion of DAP2
string typed vars into netcdf-3 character typed variables. Is it used
anywhere else?
|
WRT jswhit's proposal above.
|
At this point, there seems to be agreement about strings: _Encoding specifies So we can focus on the character type as an eight bit value. I am not concerned here 1 _Encoding applies to individual 8-bit characters but the only legal _Encodings are |
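@BobSimons, regarding your comment that the python implementation deviates from your original proposal... In the situation when a user tries to write an array of python fixed length strings to a character variable with _Encoding set, the python interface will convert that array of fixed length strings to an array of single characters (bytes) with one more dimension (equal to the length of the fixed length strings, and the rightmost dimension of the character variable) then write that array of characters to the file. I thought this was in the spirit of the CF convention - and this is what a user would have to do manually to write the strings to the character variable. One could certainly argue that this is too much 'magic' though. The same happens in reverse when data is read from a char variable with _Encoding set. |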
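
The manual conversion described in this comment can be sketched in plain Python. This is not the netcdf4-python implementation, only a toy model under stated assumptions: the helper names `strings_to_chars` and `chars_to_strings` are hypothetical, short strings are NUL-padded, and `nchar` counts encoded bytes rather than characters.

```python
def strings_to_chars(strings, nchar, encoding="utf-8", pad=b"\x00"):
    """Encode each string and split it into nchar single-byte elements.

    The result has one more dimension than the input: (len(strings), nchar).
    """
    rows = []
    for text in strings:
        raw = text.encode(encoding)
        if len(raw) > nchar:
            raise ValueError(f"{text!r} needs {len(raw)} bytes, but nchar is {nchar}")
        raw = raw.ljust(nchar, pad)
        rows.append([raw[i:i + 1] for i in range(nchar)])
    return rows


def chars_to_strings(rows, encoding="utf-8", pad=b"\x00"):
    """Reverse step: collapse the rightmost dimension back into strings."""
    return [b"".join(row).rstrip(pad).decode(encoding) for row in rows]


names = ["foobar", "Zürich"]
chars = strings_to_chars(names, nchar=12)
roundtrip = chars_to_strings(chars)
print(len(chars), len(chars[0]), roundtrip)
```

Note that with a multi-byte encoding such as UTF-8 the rightmost dimension must be sized in bytes: "Zürich" has six characters but occupies seven elements.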
@DennisHeimbigner, regarding your question "Is there any situation in which a python
The answer is yes, if you are writing a single string into a character array like this:
>>> v
<type 'netCDF4._netCDF4.Variable'>
|S1 strings(n1, n2, nchar)
_Encoding: ascii
unlimited dimensions: n1
current shape = (0, 10, 12)
filling on, default _FillValue of used
>>> v[0,0,:] = 'foobar'
The string
|
Your approach is internally consistent -- if someone writes files with your
system and reads them with your system, all is well. But there are other nc
files created by other software, which I think don't mesh with your
approach.
I don't know if your system is for netcdf-4 only, but if netcdf-3 files are
included, the problem is: there are nc3 files with char variables where the
chars are meant to be read as individual chars without collapsing the
rightmost dimension. The Argo program has 100's of 1000's (millions?) of
these files. They have variables like
char POSITION_QC(N_PROF=254);
where there is one QC character per profile. (Yes, there's a more CF-way to
do this now, but they started doing this many years ago.)
I think it is a reasonable reading of the CF convention (section 2.2) to
say that these are legit char variables, not to be interpreted as Strings
(by collapsing the rightmost dimension).
A goal of this proposal is to make it simple for a software reader to read
a file (including an Argo file) and know quickly and easily if a given char
variable in an nc3 file is meant to be interpreted
as individual chars (not collapsing the rightmost dimension)
or as Strings (by collapsing the rightmost dimension).
With nc4 files that is trivial because there are explicit char and String
data types. The problem is with disambiguating char variables in nc3 files.
Stated another way, it is a goal that netcdf-java library's
NetcdfFile.read() should be able to know quickly and easily whether it
should return
an ArrayChar (by not collapsing the rightmost dimension)
or an ArrayString (by collapsing the rightmost dimension)
(and also be able to properly deal with the charset/encoding of the stored
characters).
…On Wed, May 17, 2017 at 3:40 PM, Jeff Whitaker ***@***.***> wrote:
@BobSimons, regarding your comment that
the python implementation deviates from your original proposal...
In the situation when a user tries to write an array of python fixed
length strings to a character variable with _Encoding set, the python
interface will convert that array of fixed length strings to an array of
single characters (bytes) with one more dimension (equal to the length of
the fixed length strings, and the rightmost dimension of the character
variable) then write that array of characters to the file.
I thought this was in the spirit of the CF convention - and this is what a
user would have to do manually to write the strings to the character
variable. One could certainly argue that this is too much 'magic' though.
The same happens in reverse when data is read from a char variable with
_Encoding set.
|
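For nc3 or nc4 files, if _Encoding is not set the individual chars will be returned by the python interface without collapsing the rightmost dimension. I thought from your proposal that if _Encoding was set, then the client should interpret the char array as strings. Did I misread that? |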
Ah. Thank you. I misunderstood.
…On Wed, May 17, 2017 at 4:46 PM, Jeff Whitaker ***@***.***> wrote:
for nc3 or nc4 files, if _Encoding is not set the individual chars will
be returned by the python interface without collapsing the rightmost
dimension. I thought from your proposal that if _Encoding was set, then
the client should interpret the char array as strings. Did I misread that?
|
@BobSimons, would @jswhit's approach with NetCDF-Python work for you in ERDDAP to disambiguate string and char array handling in NetCDF3 and NetCDF4? Seems like it does, right? |
Sorry. I'm on vacation for the next 2 weeks and not available to evaluate
this.
I was confused by his original email. So I don't think I understand his
proposal. I stand by my proposal.
|
Okay, I'll discuss with you when you get back from vacation. |
As discussed here Unidata/netcdf4-python#654 (comment), there is a need for conventions to specify the encoding of strings and character arrays in netcdf. There is also a need to specify whether char arrays in NetCDF3 contain strings or character arrays.
@BobSimons addressed these issues in an enhancement to CF conventions that would specify charset for NetCDF3 and _Encoding for NetCDF4, and the Unidata gang (@DennisHeimbigner, @WardF, @ethanrd and @cwardgar) agreed with the concept, but suggested this be handled in the NUG and we came up with this slightly different proposal that would still accomplish Bob's goals of making it easy for software to figure out what is stuffed in those char or string arrays!
Proposal:
_CharType variable attribute with allowed values ['STRING', 'CHAR_ARRAY'] to specify if a char array variable should be interpreted as a string or as an array of individual characters. If _CharType is missing, default is 'STRING'.
_Encoding variable attribute with allowed values ['ISO-8859-1', 'ISO-8859-15', 'UTF-8'] to specify the encoding. If _Encoding is missing for _CharType='STRING', default is 'UTF-8'. If _Encoding is missing for _CharType='CHAR_ARRAY', default is 'ISO-8859-15'.
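
From a software reader's point of view, the defaults in this proposal could be applied roughly as follows. This is a minimal sketch, not code from any netCDF library: the function names `resolve_char_attrs` and `interpret_char_variable` are hypothetical, the variable is modelled as a 2-D nested list of single-byte values, and stripping NUL padding is an assumption.

```python
ALLOWED_CHAR_TYPES = ("STRING", "CHAR_ARRAY")
ALLOWED_ENCODINGS = ("ISO-8859-1", "ISO-8859-15", "UTF-8")


def resolve_char_attrs(attrs):
    """Apply the proposal's defaults to a variable's attribute dictionary."""
    char_type = attrs.get("_CharType", "STRING")
    if char_type not in ALLOWED_CHAR_TYPES:
        raise ValueError(f"unsupported _CharType: {char_type!r}")
    default_encoding = "UTF-8" if char_type == "STRING" else "ISO-8859-15"
    encoding = attrs.get("_Encoding", default_encoding)
    if encoding not in ALLOWED_ENCODINGS:
        raise ValueError(f"unsupported _Encoding: {encoding!r}")
    return char_type, encoding


def interpret_char_variable(rows, attrs, pad=b"\x00"):
    """Return strings (rightmost dimension collapsed) or individual characters."""
    char_type, encoding = resolve_char_attrs(attrs)
    if char_type == "STRING":
        return [b"".join(row).rstrip(pad).decode(encoding) for row in rows]
    # CHAR_ARRAY: keep the shape and decode each byte on its own.
    return [[cell.decode(encoding) for cell in row] for row in rows]


# No attributes: treated as UTF-8 strings.
print(interpret_char_variable([[b"a", b"b", b"\x00"]], {}))
# Explicit CHAR_ARRAY: each byte is one ISO-8859-15 character.
print(interpret_char_variable([[b"1", b"\xa4"]], {"_CharType": "CHAR_ARRAY"}))
```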