Add support for attributes of type string #141

Closed · JimBiardCics opened this issue Jul 23, 2018 · 148 comments · Fixed by #556

@JimBiardCics (Contributor) commented Jul 23, 2018

Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of string type instead of char type. It seems that people often assume that string is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of string. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

  1. A string attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
  2. A string attribute (and a string variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three-byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type string.
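
For illustration, here is a minimal Python sketch (added here, not part of the original comment; the attribute value is made up) of what that BOM looks like at the byte level and how a reader might strip it:

import codecs

# The UTF-8 BOM is the three bytes EF BB BF, which decode to the single
# code point U+FEFF at the front of the string.
raw = codecs.BOM_UTF8 + b"degrees_north"   # bytes as a writer like IDL might produce
text = raw.decode("utf-8")                 # '\ufeffdegrees_north'
if text.startswith("\ufeff"):              # strip the BOM if present
    text = text[1:]
print(text)                                # degrees_north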

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.

To finalize the change to support string type attributes, we need to decide:

  1. Do we explicitly forbid string array attributes?
  2. Do we place any restrictions on the content of string attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

Allow string attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc) may use any UTF-8 character.

Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)

@Dave-Allured (Contributor) commented Jul 23, 2018

I am generally in support of this string attributes proposal, including UTF-8 characters. However, for CF controlled attributes, I recommend an explicit preference for type char rather than string. This is for compatibility with large amounts of existing user code that accesses critical attributes directly and would need to be reworked for type string.

I suggest not including a constraint for scalar strings, simply because it seems redundant. I think that existing CF language strongly implies single strings in most cases of CF defined attributes.

@ghost commented Jul 24, 2018

How different is reading values from a string attribute compared to a string variable? If some software supports string variables shouldn't it support string attributes as well? If the CF is going to recommend char datatype for string-valued attributes, shouldn't the same be done for string-valued variables?

Prefixing the bytes of a UTF-8 encoded string with the BOM sequence is an odd practice. Although it is permitted, as far as I know it is not recommended.

Since what gets stored is always the bytes of one string in some encoding, always assuming UTF-8 should take care of the ASCII character set, too. This could cause issues if someone used other one-byte encodings (e.g. the ISO 8859 family), but I don't see how such cases could be easily resolved.

Storing Unicode strings using the string datatype makes more sense, since the number of bytes for such strings in UTF-8 encoding is variable.
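
A small sketch of that point (added for illustration): the UTF-8 byte count depends on the content, so a fixed-length char dimension cannot be sized from the character count alone.

# Each string below has a different bytes-per-code-point ratio in UTF-8.
for s in ("Axel", "Flöte", "日本語"):
    print(f"{s!r}: {len(s)} code points, {len(s.encode('utf-8'))} bytes")
# 'Axel': 4 code points, 4 bytes
# 'Flöte': 5 code points, 6 bytes
# '日本語': 3 code points, 9 bytes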

@JimBiardCics (Contributor Author):

This issue and issue #139 are intertwined. There may be overlapping discussion in both.

@JimBiardCics (Contributor Author):

@ajelenak-thg So I did some digging. I wrote a file with IDL and read it with C. There are no BOM prefixes. I guess some languages (such as Python) make assumptions one way or another about string attributes and variables, but it appears that it's all pretty straightforward.

@JimBiardCics (Contributor Author):

@ajelenak-thg I agree that we should state that char attributes and variables should contain only ASCII characters.

@JimBiardCics (Contributor Author):

@Dave-Allured When you say "CF-controlled attributes", are you referring to any values they may have, or to values that are from controlled vocabularies?
It is true that applications written in C or FORTRAN will require code changes to handle string because the API and what is returned for string attributes and variables is different from that for char attributes and variables.
Would a warning about avoiding string for maximum compatibility be sufficient?

@Dave-Allured (Contributor) commented Jul 24, 2018

@JimBiardCics, by "CF-controlled attributes", I mean CF-defined attributes within "the controlled vocabulary of CF" as you described above. By implication I am referring to any values they may have, including but not limited to values from controlled vocabularies.

A warning about avoiding data type string is notification. An explicit preference is advocacy. I believe the compatibility issue is important enough that CF should adopt the explicit preference for type char for key attributes.

@Dave-Allured (Contributor) commented Jul 24, 2018

The restriction that char attributes and variables should contain only ASCII characters is not warranted. The netCDF-C library is agnostic about the character set of data stored within char attributes and char variables. UTF-8 and other character sets are easily embedded within strings stored as char data.

Therefore I suggest no mention of a character set restriction, outside of the CF controlled vocabulary. Alternatively you could establish the default interpretation of string data (both char and string data types) as the ASCII/UTF-8 conflation.

@DocOtak (Member) commented Jul 24, 2018

Hi all, I wasn't quite able to form this into coherent paragraphs, so here are some things to keep in mind re: UTF-8 vs other encodings:

  • UTF-8 is backwards compatible with ASCII if the following are true: no byte order mark, and all code points are between U+0000 and U+007F.
  • UTF-8 is not backwards compatible with Latin-1 (ISO 8859-1), because code points above U+007F need two bytes to represent in UTF-8 (see the sketch below).
  • There are multiple ways of representing the same grapheme; the netCDF classic format required UTF-8 to be in Normalization Form Canonical Composition (NFC).
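
A sketch illustrating the second bullet (added here, not DocOtak's code):

s = "é"                        # U+00E9
print(s.encode("latin-1"))     # b'\xe9'      -- one byte in Latin-1
print(s.encode("utf-8"))       # b'\xc3\xa9'  -- two bytes in UTF-8
try:
    b"\xe9".decode("utf-8")    # Latin-1 bytes above 0x7F are not valid UTF-8
except UnicodeDecodeError as err:
    print(err)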

My personal recommendation is that the only encoding for text in CF netCDF be UTF-8 in NFC with no byte order mark. For attributes where there is a desire to restrict what is allowed (through controlled vocabulary or other limitations), the restriction should be specified using Unicode code points, e.g. "only printing characters between U+0000 and U+007F are allowed in controlled attributes".

Text which is in controlled vocabulary attributes should continue to be char arrays. Freeform attributes (mostly those in 2.6.2. Description of file contents), could probably be either string or char arrays.

@Dave-Allured (Contributor):

@DocOtak, you said "the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC)". This restriction is only for netCDF named objects, i.e. the names of dimensions, variables, and attributes. There is no such restriction for data stored within variables or attributes.

@DocOtak (Member) commented Jul 24, 2018

@Dave-Allured yes, I reread the section; object names do appear to be what it is restricting. Should there be some consideration of specifying a normalization for the purposes of data in CF netCDF?

Text encoding probably deserves its own section in the CF document, perhaps under data types. The topic of text encoding can be very foreign to someone who thinks that "plain text" is a thing that exists in computing.

@Dave-Allured (Contributor):

@DocOtak, for general text data, I think UTF-8 normalization is more of a best practice than a necessity for CF purposes. Therefore I suggest that CF remain silent about that, but include it if you feel strongly. Normalization becomes important for efficient string matching, which is why netCDF object names are restricted.

@DocOtak (Member) commented Jul 24, 2018

@Dave-Allured I don't know enough about the consequences of requiring a specific normalization. There is some interesting information on the Unicode website about normalization, which suggests that over 99% of Unicode text on the web is already in NFC. Also interesting is that combining NFC-normalized strings may not result in a new string that is normalized. It is also stated in the FAQ that "Programs should always compare canonical-equivalent Unicode strings as equal", so it's probably not an issue as long as the controlled vocabulary attributes have values with code points in the U+0000 to U+007F range (control chars excluded).
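
The concatenation caveat is easy to demonstrate (a sketch added here; is_normalized requires Python 3.8+):

import unicodedata

a, b = "e", "\u0301"          # each string is NFC on its own (U+0301 is a combining acute)
joined = a + b                # renders as "é" but is no longer in NFC
print(unicodedata.is_normalized("NFC", joined))           # False
print(unicodedata.normalize("NFC", joined) == "\u00e9")   # True: composed to é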

@hrajagers:

@Dave-Allured and @DocOtak,

  1. Most of the character/string attributes in the CF conventions contain a concatenation of sub-strings selected from a standardized vocabulary, variable names, and some numbers and separator symbols. It seems that for those attributes the discussion about the encoding is not so relevant, as these sub-strings contain only a very basic set of characters (assuming that variable names are not allowed to contain extended characters). Even for flag_meanings the CF conventions state "Each word or phrase should consist of characters from the alphanumeric set and the following five: '_', '-', '.', '+', '@'." If the alphanumeric set doesn't include extended characters, this again doesn't create any problems for encoding. The only attributes that might contain extended characters (and thus be influenced by this encoding choice) are attributes like long_name, institution, title, history, ... However, CF inherits most of them from the NetCDF User Guide, which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here? In short, I'm not sure the encoding is important for string/character attributes at this moment.

  2. I initially raised the encoding topic in the related issue Add support for variables of type string #139 because we want our model users to use local names for observation points and they will end up in label variables. In that context I would like to make sure that what I store is properly described.

@JimBiardCics (Contributor Author):

@hrajagers Thanks for the pointer to NUG Appendix A. It's interesting to see in that text that character array, character string, and string are used somewhat interchangeably. I'm curious to know if the NUG authors looked at this section in light of allowing string type.

@ghost commented Jul 25, 2018

I think we are making good progress on this. I checked the Appendix A table of CF attributes and I think the following attributes can be allowed to hold string values as well as char:

  • comment
  • external_variables
  • _FillValue
  • flag_meanings
  • flag_values
  • history
  • institution
  • long_name
  • references
  • source
  • title

All the other attributes should hold char values to maximize backward compatibility.

@JimBiardCics (Contributor Author):

@ajelenak-thg Are you suggesting the other attributes must always be of type char, or that they should only contain the ASCII subset of characters?

@ghost commented Jul 25, 2018

Based on the expressed concern so far for backward compatibility I suggested the former: always be of type char. Leave the character set and encoding unspecified since the values of those attributes are controlled by the convention.

@ghost commented Jul 25, 2018

On the string encoding issue, CF data can currently be stored in two file formats: netCDF Classic and HDF5. String encoding information cannot be directly stored in the netCDF Classic format, and the spec defines a special variable attribute _Encoding for that in future implementations. The values of this attribute are not specified, so anything could be used.

In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings. This effectively limits what could be allowed values of the (future) _Encoding attribute for maximal data interoperability between the two file formats.
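
As an illustration of that intrinsic encoding (a sketch using h5py, which isn't otherwise discussed in this thread; the file and attribute names are made up):

import h5py

with h5py.File("enc-demo.h5", "w") as f:
    # The HDF5 string datatype records its character set in the file;
    # h5py exposes the ASCII/UTF-8 choice via string_dtype().
    dt = h5py.string_dtype(encoding="utf-8")    # stored as H5T_CSET_UTF8
    f.attrs.create("title", "Flöte", dtype=dt)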

@Dave-Allured (Contributor):

@hrajagers said: However CF inherits most of them [attributes] from the NetCDF User Guide which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here?

Yes, NUG Appendix A literally allows only char type attributes. My sense is that proponents believe that string type is compatible with the intent of the NUG, and also strings have enough advantages to warrant departure from the NUG.

Personally I think string type attributes are fine within collaborations where everyone is ready for any needed code upgrades. For exchanged and published data, char type CF attributes should be preferred explicitly by CF.

@Dave-Allured (Contributor):

@ajelenak-thg said: In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings.

Actually the ASCII/UTF-8 restriction is not enforced by the HDF5 library. This is used intentionally by netcdf developers to support arbitrary character sets in netcdf-4 data type char, both attributes and variables. See netcdf issue 298. Therefore, data type char remains fully interoperable between netcdf-3 and netcdf-4 formats.

For example, this netcdf-4 file contains a char attribute and a char variable in an alternate character set. You will need an app or console window enabled for ISO-8859-1 to properly view the ncdump of this file.

@JonathanGregory (Contributor):

Dear Jim

Thanks for addressing these issues. In fact you've raised two issues: the use of strings, and the encoding. These can be decided separately, can't they?

On strings, I agree with your proposal and subsequent comments by others that we should allow string, but we should recommend the continued use of char, giving as the reason that char will maximise the usability of the data, because of the existence of software that isn't expecting string. Recommend means that the cf-checker will give a warning if string is used. However, it's not an error, and a given project could decide to use string.

For the attributes whose contents are standardised by CF e.g. coordinates, if string is used we should require a scalar string. This is because software will not expect arrays of strings. These attributes are often critical and so it's essential they can be interpreted. For CF attributes whose contents aren't standardised e.g. comment, is there a strong use-case for allowing arrays of strings?

I recall that at the meeting in Reading the point was made that arrays would be natural for flag_values and flag_meanings. I agree that the argument is stronger in that case because the words in those two attributes correspond one-to-one. Still, it would break existing software to permit it. Is there a strong need for arrays?

Best wishes

Jonathan

@JimBiardCics (Contributor Author):

@JonathanGregory I agree with you. I think it would be fine to leave string array attributes out of the running for now. I also prefer the recommendation route.

Regarding the encoding, it seems to me that we could avoid a lot of complexity for now by making a simple requirement that all CF-defined terms and whitespace delimiters in string-valued attributes or variables be composed of characters from the ASCII character set. It wouldn't matter if people used Latin-1 (ISO-8859-1) or UTF-8 for any free text or free-text portions of contents, because they both contain the ASCII character set as a subset. The parts that software would be looking for would be parseable.
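
A sketch of why that works (added here; the attribute value is invented): the controlled terms and delimiters occupy exactly the same bytes in Latin-1 and UTF-8, so a parser that only touches those bytes never needs to know which encoding the free text used.

flag_meanings = "good_data suspect_données"    # free text contains a non-ASCII é
for enc in ("latin-1", "utf-8"):
    raw = flag_meanings.encode(enc)
    words = raw.split(b" ")     # the ASCII space is byte 0x20 in both encodings
    print(enc, words[0])        # b'good_data' either way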

@ethanrd (Member) commented Jul 26, 2018

@JonathanGregory Another use-case (that I think came up during the Reading meeting) had the history attribute as a string array so that each element could contain the description of an individual processing step. I think easier machine readability was mentioned as a motivation.

@JonathanGregory (Contributor):

Regarding the encoding, I agree that for attribute contents which are standardised by CF it is fine to restrict ourselves to ASCII, in both char and string. For these attributes, we prescribe the possible values (they have controlled vocabulary) and so we don't need to make a rule in the convention about it for the sake of the users of the convention. If we put it in the convention, it would be as guidance for future authors of the convention. I don't have a view about whether we should do this. It would be worth noting to users that whitespace, which often appears in a "blank-separated list of words", should be ASCII space. I agree that UTF-8 is fine for contents which aren't standardised.

Regarding arrays of strings, I realise I wasn't thinking clearly yesterday, sorry. As we've agreed, string attributes will not be expected by much existing software. Hence software has to be rewritten to support the use of strings in any case, and support for arrays of strings could be added at the same time, if it's really valuable. I don't see the particular value for the use of string arrays for comment - do other people? For flag_meanings, the argument was that it would allow a meaning to be a string which contained spaces (instead of being joined up with underscores, as is presently necessary); that is, it would be an enhancement to functionality.

Happy weekend - Jonathan

@JonathanGregory (Contributor):

I meant to write, I don't see the particular value for the use of string arrays for history, which Ethan reminded us of. Why would this be more machine-readable?

@JimBiardCics (Contributor Author):

@JonathanGregory The use of an array of strings for history would simplify denoting where each entry begins and ends as entries are added, appending or prepending a new array element each time rather than the myriad different ways people do it now. This would actually be a good thing to standardize.

I think we can just not mention string array attributes right now. The multi-valued attributes (other than history, perhaps) pretty much all specify how they are to be formed and delimited.

@cf-metadata-list commented Jul 27, 2018 via email

@kenkehoe commented Jul 27, 2018 via email

@JonathanGregory (Contributor):

Dear @ChrisBarker-NOAA and @DocOtak

Thanks for your further comments. I didn't know that in C you have to do the encoding yourself. I have modified the first two paragraphs accordingly. The third is unchanged. The text now in PR #556 is below. Is it OK?

Cheers

Jonathan

A text string can be stored either in a variable-length string or in a fixed-length char array. In both cases, text strings must be represented in Unicode and encoded according to UTF-8. A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal 00-7F). Any Unicode composite characters must be NFC-normalized.

Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used. However, if the text string is stored in a char variable, the encoding might be recorded by the _Encoding attribute, although this is not a CF or NUG convention.

An n-dimensional array of strings may be implemented as a variable or an attribute of type string with n dimensions (only n=1 is allowed for an attribute) or as a variable of type char with n+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable. For example, a char variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name. The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled. A string variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length. The CDL example below shows one variable of each type.
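
The CDL example mentioned at the end is not reproduced in this thread. As a stand-in, here is a hedged netCDF4-python sketch of the same two layouts (file and variable names are illustrative, not from the PR):

import netCDF4
import numpy as np

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

with netCDF4.Dataset("months.nc", "w") as ds:
    ds.createDimension("month", 12)
    ds.createDimension("strlen", 9)    # len("September"), the longest name
    # char layout: dimensioned (12, 9), shorter names padded to fill each row
    cvar = ds.createVariable("month_name_char", "S1", ("month", "strlen"))
    cvar[:] = netCDF4.stringtochar(np.array(months, dtype="S9"))
    # string layout: dimensioned (12), each element holds one whole string
    svar = ds.createVariable("month_name_str", str, ("month",))
    svar[:] = np.array(months, dtype=object)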

@DocOtak (Member) commented Oct 23, 2024

Do we know what netCDF4-python does? I think my last remaining concern is that, without good defaults in the software folks are using, no one will actually implement normalization in their "beyond ascii" unicode strings.

@ChrisBarker-NOAA (Contributor):

netCDF4 Python does not do the NFC normalization -- but I was going to put in a request for that :-)

@ChrisBarker-NOAA (Contributor):

The PR looks good to me now -- thanks!

However, the conformance section should be updated as well, specifying UTF-8 for string and char.

@ethanrd (Member) commented Oct 25, 2024

This all looks really good! Thanks.
My only comment is about the last sentence in the first paragraph of changed text:

Any Unicode composite characters must be NFC-normalized.

I think this should be described in terms of normalized text rather than characters. First, because the Unicode standard discusses and defines normalization [1] in terms of Unicode text (or "Unicode coded character sequences"). Second, it would then only introduce one new Unicode concept, normalization, rather than two, normalization and composite characters.

Perhaps instead some text about normalization could be added to the second sentence in that paragraph. Something like:

In both cases, text strings must be Unicode text that is encoded according to UTF-8 and in NFC normalization form.

[1] See section 3.11 "Normalization Forms" in chapter 3 (PDF) of the Unicode Standard (v15.0).

@ChrisBarker-NOAA (Contributor):

I think this should be described in terms of normalized text rather than characters.

+1

This is in keeping with the drum I've been beating -- we should talk about Unicode in Unicode terms that scientists[*] will understand -- or at least be able to figure out what to do.

e.g., if the text says "text must be NFC normalized" folks can google:

"how do I NFC normalize a string Python"

(substitute language of choice here) -- and you get the answer as the top AI hit.

Critical is that the user doesn't need to try to figure out if they are using any combining characters, they should simply normalize everything.

[*] - actually any non-specialist in Unicode -- most people that write code, even professional developers, don't know that Unicode has "composite characters", or that there can be more than one way to express what seems like one thing.

# (setup inferred from the outputs below -- these two definitions were
# not shown in the original comment)

In [31]: combined = 'Fl\u00f6te'     # with precomposed ö (U+00F6)

In [32]: separate = 'Flo\u0308te'    # with o + combining diaeresis (U+0308)

In [33]: combined
Out[33]: 'Flöte'

In [34]: separate
Out[34]: 'Flöte'

# Those sure look the same
# but:

In [35]: combined == separate
Out[35]: False

# WTF?

# If you look at the code point values

In [36]: [ord(c) for c in combined]
Out[36]: [70, 108, 246, 116, 101]

In [37]: [ord(c) for c in separate]
Out[37]: [70, 108, 111, 776, 116, 101]

# hmm, there is an extra code point in there.

# to understand this / work with this:

In [39]: import unicodedata

In [40]: normed = unicodedata.normalize('NFC', separate)

In [41]: normed
Out[41]: 'Flöte'

In [42]: normed == combined
Out[42]: True

Ahh -- that makes sense now.

@JonathanGregory (Contributor):

Dear @ethanrd and @ChrisBarker-NOAA

Thanks for your comments. Below is a new version of the text in PR #556 for section 2.2. Is it OK now?

Best wishes

Jonathan

Conventions document

A text string can be stored either in a variable-length string or in a fixed-length char array. In both cases, text strings must be represented in Unicode Normalization Form C (NFC, section 3.11 and Annex 15 of the Unicode standard) and encoded according to UTF-8. A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal 00-7F).

Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used. However, if the text string is stored in a char variable, the encoding might be recorded by the _Encoding attribute, although this is not a CF or NUG convention.

An n-dimensional array of strings may be implemented as a variable or an attribute of type string with n dimensions (only n=1 is allowed for an attribute) or as a variable of type char with n+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable. For example, a char variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name. The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled. A string variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length.

Conformance document

Requirements:

  • Any text stored in a CF attribute or variable must be represented in Unicode Normalization Form C and encoded in UTF-8.

  • If a text-valued attribute is stored in a variable-length string, it must have a scalar value.
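
A minimal sketch of how a checker might test the two requirements above (the function is hypothetical, not the cf-checker's actual code; is_normalized requires Python 3.8+):

import unicodedata

def check_cf_text(value):
    """Raise if attribute/variable text is not NFC-normalized UTF-8."""
    if isinstance(value, bytes):
        value = value.decode("utf-8")    # UnicodeDecodeError => not UTF-8
    if not unicodedata.is_normalized("NFC", value):
        raise ValueError("text is not in Normalization Form C")

check_cf_text(b"Fl\xc3\xb6te")       # precomposed ö as UTF-8 bytes: passes
try:
    check_cf_text("Flo\u0308te")     # o + combining diaeresis: not NFC
except ValueError as err:
    print(err)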

@ChrisBarker-NOAA (Contributor):

Looks great -- I made minor, not critical comments in the PR -- thanks for the massive effort!

@ethanrd (Member) commented Oct 28, 2024

Hi Jonathan @JonathanGregory - Thanks again for all the work on this! It looks really good. Just one new comment/question. (Sorry!)

I'm wondering about the use of the word "must" rather than "should". I think "must" makes sense for attribute/variable names (for UTF-8 and NFC normalization) but it seems less clear when it comes to the values of variables and even attributes (unless they are CF defined attributes maybe).

@ChrisBarker-NOAA (Contributor) commented Oct 28, 2024

I think "must" makes sense for attribute/variable names (for UTF-8 and NFC normalization) -- and that's defined by the NUG, so absolutely.

less clear when it comes to the values of variables and even attributes (unless they are CF defined attributes maybe).

Attributes:

I think it's not always clear exactly what is and isn't a CF defined attribute -- more to the point, it's harder to make it clear to users what the rules are in that case.

And from the perspective of writing code to read/write attributes, it's an unholy mess if there are different encodings for different attributes -- and there is no way to define a different one if you want. I'm not sure it's even possible in, e.g. the netCDF4-python lib (I'd have to check on that).

Variables:

In this case, there is the unsanctioned precedent for an _Encoding attribute, so it's possible for software to handle arbitrary encodings. And the netCDF4-python lib already does. So it's a bit fuzzy there.

However -- it really is so much easier for everyone if we use UTF-8 everywhere -- there are entire web sites and manifestos about it. And since UTF-8 is already baked into netCDF for variable names, there is literally no software that can read/write netCDF without being able to handle UTF-8 -- so why should anyone feel the need for another encoding?

If they really, really want another encoding, then they can still do that, and hopefully specify an _Encoding attribute -- and it's OK if the file is then not CF compliant.

And if there is a really, really, really compelling use case, they can store the otherwise-encoded text as a byte array.

I vote for MUST -- if we want to relax that later into a SHOULD, we can, but not the other way around.

@ChrisBarker-NOAA (Contributor):

Hmm, on further thought -- MUST for UTF-8; SHOULD would be OK for NFC normalization of anything but variable names.

@ethanrd (Member) commented Oct 28, 2024

I agree on MUST utf-8 for attribute values. Too messy otherwise.

I've heard arguments for using other encodings (utf-16 in particular) for some languages/situations. So I'm hesitant around MUST for variable data. But I don't really understand where that argument applies or how widely utf-16 is used, in comparison to utf-8. (I agree utf-8 is probably the most widely supported encoding. Definitely in the netCDF space.)

That's the extent of my hesitancy so I'm good either way, really.

@ChrisBarker-NOAA (Contributor):

how widely utf-16 is used

UTF-16 is widely used in:

  • Java
  • Windows
  • .NET

Stored as a "wide char" data type -- i.e. 2 bytes per char. This was because MS and Java got ahead of the ball in the early Unicode days -- it was initially thought that all of Unicode could fit in 16 bytes (65536) total characters, so go to a two byte char, and everything else stays the same -- simple! But it turned out that all of Unicode couldn't fit in two bytes, so the whole thing, uncompressed in any way, takes 4 bytes (it's not all used, but a 3 byte data type isn't really a thing).

Anyway -- that was way too much background, but the point is that UTF-16 is widely used internally, but it's not so widely used in data-exchanging applications -- I think even MS has pretty much given up on it for, e.g., the MS Office XML formats.

For data interchange, UTF-8 is now almost universal.

Also, the char array and string type in netCDF are essentially a char* in C -- you can cram a two-byte encoding into them, but it's likely to break a lot of software (e.g. null-terminated strings in C -- there are a lot of null bytes in UTF-16).
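
A quick illustration of the null-byte point (a sketch added here):

print("Axel".encode("utf-16-le"))   # b'A\x00x\x00e\x00l\x00' -- NULs everywhere
print("Axel".encode("utf-8"))       # b'Axel' -- no embedded NULs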

The more likely, and reasonable, encodings that folks might want to use are not Unicode, but rather 1-byte encodings like Latin-1 -- or Shift-JIS or ... those have been used in plain old char* (such as the Python 2 string type) for ages -- I'm sure there's a lot of data out there in those encodings. But a good fraction of it is mojibaked, too :-(

@ethanrd (Member) commented Oct 28, 2024

I was wondering more about usage based on text language rather than general implementation. The argument I remember had to do with text in some languages being smaller in UTF-16 than in UTF-8, because most characters in those languages are two bytes when encoded in UTF-16 but three or four bytes when encoded in UTF-8.

But this is really getting into the details. I'm good with either option. And maybe we can continue this conversation over beers sometime.

@ChrisBarker-NOAA (Contributor):

Sure -- Unicode requires beer!

@DocOtak (Member) commented Oct 28, 2024

Required watching pre (or during) beers (youtube link): Characters, Symbols and the Unicode Miracle - Computerphile

@ChrisBarker-NOAA (Contributor):

I just learned something new today:

The netcdf-c lib (or maybe the HDF lib underneath, I don't know) NFC-normalizes variable names for you.

So if you write a non-normalized string in -- it will normalize it for you -- and when read back out, you will get a different string.

What impact does this have on this conversation? Maybe not much, although:

  • Maybe we should mention that the lib does it for you (or may, depending on the lib used -- fortran?)
  • I think it is important that attributes be normalized as well -- at least the CF ones that might refer to names. That's actually even more acutely important, as a user could pass in an attribute in non-normalized form, and it then won't match the variable name, even if they also passed the same value as the variable name.

Tested with Python netCDF4 (I checked, the Python wrapper is not doing the normalization)

Here's some sample code, if you're interested

import netCDF4
import unicodedata

normal_name = "composed\u00E9"
non_normal_name = "separate\u0065\u0301"

# Create a netCDF file with two variables.
with netCDF4.Dataset("nfc-norm.nc", 'w') as ds:
    dim = ds.createDimension("a_dim", 10)
    var1 = ds.createVariable(normal_name, float, ("a_dim",))
    var2 = ds.createVariable(non_normal_name, float, ("a_dim",))
    var1[:] = range(10)
    var2[:] = range(10)

# Read it back in, and see if the variable names were normalized
with netCDF4.Dataset("nfc-norm.nc", 'r') as ds:
    # get the vars from their original names
    try:
        norm = ds[normal_name]
        print(f"{normal_name} worked")
    except IndexError:
        print(f"{normal_name} didn't work")

    try:
        non_norm = ds[non_normal_name]
        print(f"{non_normal_name} worked")
    except IndexError:
        print(f"{non_normal_name} didn't work")
        non_norm = ds[unicodedata.normalize('NFC', non_normal_name)]
        print(f"But it  did once normalized!")

    for name in ds.variables.keys():
        assert unicodedata.is_normalized('NFC', name)
    print("All variable names are normalized")

running it, I get:

In [54]: run nfc_norm.py
composedé worked
separateé didn't work
But it  did once normalized!
All variable names are normalized

@JonathanGregory (Contributor):

Dear Ethan, Chris, Barna

I think it's better to require text stored in attributes and variables to be NFC-normalized and UTF-8, because

  • Some CF attributes contain the names of variables. NetCDF requires UTF-8 for names. To allow an exact match in bytes, any attribute which contains the names of variables must be UTF-8 as well. We don't have to make the rule apply to all attributes, but it's simpler to do so.

  • NFC normalization is likely to avoid confusion about characters looking the same but actually not being the same. I think that's consistent with principle 7, "The conventions should minimise the possibility for mistakes by data-writers and data-readers."

NFC and UTF-8 sounds complicated and offputting, but I hope that most users will be reassured by the statement, "A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their NFC UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal 00-7F)." I've inserted "NFC" in there in the PR (#556). It's not relevant for these codes, but it must be correct, and it may avoid concern.

I don't think it would be appropriate to have text in the CF convention about which netCDF interfaces automatically produce NFC UTF-8 Unicode. However, we could put that information in a page on the CF website, if someone has time to assemble it, and cite that page in the conventions document. It could go in the page about software that works with CF, for example.

As we've discussed, this change may break our principle 9: "Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions". Unfortunately, there's no reliable way to interpret non-ASCII text in existing data, so the best we can do is minimise future problems.

In the PR (#556), I've rephrased the second requirement more simply as, "Any attribute of variable-length string type must be a scalar (not an array)". Elsewhere, we're discussing relaxing this requirement but that's another matter.

Do you, or anyone else, have any more comments?

Cheers

Jonathan

@ChrisBarker-NOAA (Contributor):

Some CF attributes contain the names of variables. NetCDF requires UTF-8 for names. To allow an exact match in bytes, any attribute which contains the names of variables must be UTF-8 as well.

and NFC normalized

NFC normalization is likely to avoid confusion about characters looking the same but actually not being the same. I think that's consistent with principle 7, "The conventions should minimise the possibility for mistakes by data-writers and data-readers."

yes, but the comparison point is far more critical.

So we're in agreement.

I don't think it would be appropriate to have text in the CF convention about which netCDF interfaces automatically produce NFC UTF-8 Unicode

I agree that we don't usually talk about the tools that way (and don't want to be responsible for keeping up to date on which tools do what) -- but maybe a note along the lines of "libraries that write netcdf may automatically do the normalization for you"

@ChrisBarker-NOAA (Contributor):

Another note, just for interest:

The netcdf-c lib does not normalize attribute values to NFC, only variable and dimension (untested) names.

This could lead to errors -- if someone uses the same value for the var name and in an attribute, and it isn't normalized then the two will end up out of sync :-(

I'm going to suggest to the Python lib that normalization be applied to attributes, and we could suggest the same thing to the C lib, but I suspect that the C folks won't go for it -- it's pretty hands-off when it doesn't need something internally.

@ethanrd (Member) commented Oct 29, 2024

I like your changes Jonathan @JonathanGregory.

And a question and a comment:

Why is the _Encoding attribute restricted to text stored as a char array? I was just reviewing this netCDF-C discussion (issue #402) from 2017 and it seems to be applied to char or string.

I think it would be good to explain, briefly, that NFC is required to ensure that two versions of the same string match, because Unicode can support multiple ways to represent the same string. Though perhaps that should be a follow-on discussion. And probably it should be in an Appendix.

@ChrisBarker-NOAA (Contributor):

Why is the _Encoding attribute restricted to text stored as a char array? I was just reviewing this netCDF-C discussion (Unidata/netcdf-c#402) from 2017 and it seems to be applied to char or string.

I kinda agree -- at the binary level the ONLY difference between a char array and a string is that a char array has a pre-defined length.

However, from that issue, I see this:

The netCDF char type contains uninterpreted characters, one character per byte. Typically these contain 7-bit ASCII characters, but the character encoding is application specific. For this reason, applications writing data using the enhanced data model are encouraged to use the netCDF-4 string data type in preference to the char data type. Applications writing string data using the char data type are encouraged to add the special variable attribute "_Encoding" with a value that the netCDF libraries recognize. Currently those valid values are "UTF-8" or "ASCII", case insensitive.

So that does refer to the char type, somehow assuming that netCDF-4 string data type didn't have an issue, even though I don't think use of UTF-8 was defined at that point.

however:

Currently those valid values are "UTF-8" or "ASCII", case insensitive.

Which is silly, because ASCII IS UTF-8. Though I suppose some folks might want to know, without decoding, that it's only ASCII.

as the _Encoding attribute is not being introduced to CF -- I don't know that it matters.

@JonathanGregory (Contributor):

Dear @ChrisBarker-NOAA and @ethanrd

Thanks for your comments and discussion. I haven't added anything more about whether software might do the normalisation for you. I do agree that would be helpful if we have something specific to say. As I mentioned before, I think it would be valuable if we had text in the page about CF-aware software about which languages or libraries automatically produce NFC-normalised UTF-8 text. If that were there, we could link it from the convention. I hope you'll agree with not mentioning it at the moment.

If so, and if no-one else has concerns about the present proposal, we can accept it three weeks after I last invited comments, on 29th. That'll be 19th November. It will be good to conclude this issue, which is the oldest one presently open!

Best wishes

Jonathan

@ChrisBarker-NOAA (Contributor):

I hope you'll agree with not mentioning it at the moment.

agreed, yes.

if we had text in the page about CF-aware software about which languages or libraries automatically produce NFC-normalised UTF-8 text

well, we haven't bothered to link the netcdf-C lib there yet -- and in a way, the normalization is conforming to netcdf spec, not CF per se so ??

But in any case, a whole other thing, if we decide to do it.

@ethanrd (Member) commented Nov 6, 2024

I agree.

I expect that general-purpose Unicode-aware libraries/tools don't produce normalized text automatically. Libraries will provide methods for normalizing strings. Tools built on these libraries should use them when comparing strings, but otherwise would not do normalization in normal operations. (That's my understanding anyway.)

I believe the netCDF-C library applies NFC normalization (and checks for the restricted characters) in two places: 1) when a variable (and attribute, group, etc.) is created; and 2) when a variable is searched for by name, e.g., nc_inq_varid(). (@DennisHeimbigner @WardF - Do I have this right?) I expect the netCDF-Java library does this as well. I think the NUG discusses how a string gets written to a netCDF dataset. I don't think it mentions that this must be applied when searching for a variable, but it probably should.

@JonathanGregory (Contributor):

Three weeks have passed with no further concerns expressed. Therefore we accept the change, with thanks to all who've contributed, especially @DocOtak, @ChrisBarker-NOAA and @ethanrd since we resuscitated the issue this year, and @JimBiardCics who initiated it six years ago. This is the oldest currently open issue, so I'm very happy to close it now by merging #556.

@JonathanGregory JonathanGregory added change agreed Issue accepted for inclusion in the next version and closed and removed CF1.12? We might conclude this issue in time for CF1.12 labels Nov 23, 2024
@JonathanGregory JonathanGregory added this to the 1.12 milestone Nov 28, 2024