Add support for attributes of type string #141

Closed · JimBiardCics opened this issue Jul 23, 2018 · 148 comments · Fixed by #556

@JimBiardCics (Contributor) commented Jul 23, 2018

Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of string type instead of char type. It seems that people often assume that string is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of string. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

  1. A string attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
  2. A string attribute (and a string variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three-byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type string.
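
For illustration, here is a minimal Python sketch (added here, not part of the original comment; the attribute value is made up) of what that BOM looks like at the byte level and how a reader might strip it:

import codecs

# The UTF-8 BOM is the three bytes EF BB BF, which decode to the single
# code point U+FEFF at the front of the string.
raw = codecs.BOM_UTF8 + b"degrees_north"   # bytes as a writer like IDL might produce
text = raw.decode("utf-8")                 # '\ufeffdegrees_north'
if text.startswith("\ufeff"):              # strip the BOM if present
    text = text[1:]
print(text)                                # degrees_north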

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.

To finalize the change to support string type attributes, we need to decide:

  1. Do we explicitly forbid string array attributes?
  2. Do we place any restrictions on the content of string attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

Allow string attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc) may use any UTF-8 character.

Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)

@Dave-Allured (Contributor) commented Jul 23, 2018

I am generally in support of this string attributes proposal, including UTF-8 characters. However, for CF controlled attributes, I recommend an explicit preference for type char rather than string. This is for compatibility with large amounts of existing user code that accesses critical attributes directly and would need to be reworked for type string.

I suggest not including a constraint for scalar strings, simply because it seems redundant. I think that existing CF language strongly implies single strings in most cases of CF defined attributes.

@ghost commented Jul 24, 2018

How different is reading values from a string attribute compared to a string variable? If some software supports string variables shouldn't it support string attributes as well? If the CF is going to recommend char datatype for string-valued attributes, shouldn't the same be done for string-valued variables?

Prefixing the bytes of a UTF-8 encoded string with the BOM sequence is an odd practice. Although it is permitted, as far as I know it is not recommended.

Since what gets stored is always the bytes of one string in some encoding, always assuming UTF-8 should take care of the ASCII character set, too. This could cause issues if someone used other one-byte encodings (e.g. the ISO 8859 family), but I don't see how such cases could be easily resolved.

Storing Unicode strings using the string datatype makes more sense, since the number of bytes for such strings in UTF-8 encoding is variable.
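
A small sketch of that point (added for illustration): the UTF-8 byte count depends on the content, so a fixed-length char dimension cannot be sized from the character count alone.

# Each string below has a different bytes-per-code-point ratio in UTF-8.
for s in ("Axel", "Flöte", "日本語"):
    print(f"{s!r}: {len(s)} code points, {len(s.encode('utf-8'))} bytes")
# 'Axel': 4 code points, 4 bytes
# 'Flöte': 5 code points, 6 bytes
# '日本語': 3 code points, 9 bytes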

@JimBiardCics (Contributor Author):

This issue and issue #139 are intertwined. There may be overlapping discussion in both.

@JimBiardCics (Contributor Author):

@ajelenak-thg So I did some digging. I wrote a file with IDL and read it with C. There are no BOM prefixes. I guess some languages (such as Python) make assumptions one way or another about string attributes and variables, but it appears that it's all pretty straightforward.

@JimBiardCics (Contributor Author):

@ajelenak-thg I agree that we should state that char attributes and variables should contain only ASCII characters.

@JimBiardCics (Contributor Author):

@Dave-Allured When you say "CF-controlled attributes", are you referring to any values they may have, or to values that are from controlled vocabularies?
It is true that applications written in C or FORTRAN will require code changes to handle string because the API and what is returned for string attributes and variables is different from that for char attributes and variables.
Would a warning about avoiding string for maximum compatibility be sufficient?

@Dave-Allured (Contributor) commented Jul 24, 2018

@JimBiardCics, by "CF-controlled attributes", I mean CF-defined attributes within "the controlled vocabulary of CF" as you described above. By implication I am referring to any values they may have, including but not limited to values from controlled vocabularies.

A warning about avoiding data type string is notification. An explicit preference is advocacy. I believe the compatibility issue is important enough that CF should adopt the explicit preference for type char for key attributes.

@Dave-Allured (Contributor) commented Jul 24, 2018

The restriction that char attributes and variables should contain only ASCII characters is not warranted. The netCDF-C library is agnostic about the character set of data stored within char attributes and char variables. UTF-8 and other character sets are easily embedded within strings stored as char data.

Therefore I suggest no mention of a character set restriction, outside of the CF controlled vocabulary. Alternatively you could establish the default interpretation of string data (both char and string data types) as the ASCII/UTF-8 conflation.

@DocOtak (Member) commented Jul 24, 2018

Hi all, I wasn't quite able to form this into coherent paragraphs, so here are some things to keep in mind re: UTF-8 vs other encodings:

  • UTF-8 is backwards compatible with ASCII if the following are true: no byte order mark, and all code points are between U+0000 and U+007F.
  • UTF-8 is not backwards compatible with Latin-1 (ISO 8859-1), because code points above U+007F need two bytes to represent in UTF-8 (see the sketch below).
  • There are multiple ways of representing the same grapheme; the netCDF classic format required UTF-8 to be in Normalization Form Canonical Composition (NFC).
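
A sketch illustrating the second bullet (added here, not DocOtak's code):

s = "é"                        # U+00E9
print(s.encode("latin-1"))     # b'\xe9'      -- one byte in Latin-1
print(s.encode("utf-8"))       # b'\xc3\xa9'  -- two bytes in UTF-8
try:
    b"\xe9".decode("utf-8")    # Latin-1 bytes above 0x7F are not valid UTF-8
except UnicodeDecodeError as err:
    print(err)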

My personal recommendation is that the only encoding for text in CF netCDF be UTF-8 in NFC with no byte order mark. For attributes where there is a desire to restrict what is allowed (through controlled vocabulary or other limitations), the restriction should be specified using Unicode code points, e.g. "only printing characters between U+0000 and U+007F are allowed in controlled attributes".

Text which is in controlled vocabulary attributes should continue to be char arrays. Freeform attributes (mostly those in 2.6.2. Description of file contents), could probably be either string or char arrays.

@Dave-Allured (Contributor):

@DocOtak, you said "the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC)". This restriction is only for netCDF named objects, i.e. the names of dimensions, variables, and attributes. There is no such restriction for data stored within variables or attributes.

@DocOtak (Member) commented Jul 24, 2018

@Dave-Allured yes, I reread the section; object names do appear to be what it is restricting. Should there be some consideration of specifying a normalization for the purposes of data in CF netCDF?

Text encoding probably deserves its own section in the CF document, perhaps under data types. The topic of text encoding can be very foreign to someone who thinks that "plain text" is a thing that exists in computing.

@Dave-Allured (Contributor):

@DocOtak, for general text data, I think UTF-8 normalization is more of a best practice than a necessity for CF purposes. Therefore I suggest that CF remain silent about that, but include it if you feel strongly. Normalization becomes important for efficient string matching, which is why netCDF object names are restricted.

@DocOtak (Member) commented Jul 24, 2018

@Dave-Allured I don't know enough about the consequences of requiring a specific normalization. There is some interesting information on the Unicode website about normalization, which suggests that over 99% of Unicode text on the web is already in NFC. Also interesting is that combining NFC-normalized strings may not result in a new string that is normalized. It is also stated in the FAQ that "Programs should always compare canonical-equivalent Unicode strings as equal", so it's probably not an issue as long as the controlled vocabulary attributes have values with code points in the U+0000 to U+007F range (control chars excluded).
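
The concatenation caveat is easy to demonstrate (a sketch added here; is_normalized requires Python 3.8+):

import unicodedata

a, b = "e", "\u0301"          # each string is NFC on its own (U+0301 is a combining acute)
joined = a + b                # renders as "é" but is no longer in NFC
print(unicodedata.is_normalized("NFC", joined))           # False
print(unicodedata.normalize("NFC", joined) == "\u00e9")   # True: composed to é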

@hrajagers:

@Dave-Allured and @DocOtak,

  1. Most of the character/string attributes in the CF conventions contain a concatenation of sub-strings selected from a standardized vocabulary, variable names, and some numbers and separator symbols. It seems that for those attributes the discussion about the encoding is not so relevant, as these sub-strings contain only a very basic set of characters (assuming that variable names are not allowed to contain extended characters). Even for flag_meanings the CF conventions state "Each word or phrase should consist of characters from the alphanumeric set and the following five: '_', '-', '.', '+', '@'." If the alphanumeric set doesn't include extended characters, this again doesn't create any problems for encoding. The only attributes that might contain extended characters (and thus be influenced by this encoding choice) are attributes like long_name, institution, title, history, ... However, CF inherits most of them from the NetCDF User Guide, which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here? In short, I'm not sure the encoding is important for string/character attributes at this moment.

  2. I initially raised the encoding topic in the related issue Add support for variables of type string #139 because we want our model users to use local names for observation points and they will end up in label variables. In that context I would like to make sure that what I store is properly described.

@JimBiardCics (Contributor Author):

@hrajagers Thanks for the pointer to NUG Appendix A. It's interesting to see in that text that character array, character string, and string are used somewhat interchangeably. I'm curious to know if the NUG authors looked at this section in light of allowing string type.

@ghost commented Jul 25, 2018

I think we are making good progress on this. I checked the Appendix A table of CF attributes and I think the following attributes can be allowed to hold string values as well as char:

  • comment
  • external_variables
  • _FillValue
  • flag_meanings
  • flag_values
  • history
  • institution
  • long_name
  • references
  • source
  • title

All the other attributes should hold char values to maximize backward compatibility.

@JimBiardCics (Contributor Author):

@ajelenak-thg Are you suggesting the other attributes must always be of type char, or that they should only contain the ASCII subset of characters?

@ghost commented Jul 25, 2018

Based on the expressed concern so far for backward compatibility I suggested the former: always be of type char. Leave the character set and encoding unspecified since the values of those attributes are controlled by the convention.

@ghost commented Jul 25, 2018

On the string encoding issue, CF data can currently be stored in two file formats: netCDF Classic and HDF5. String encoding information cannot be directly stored in the netCDF Classic format, and the spec defines a special variable attribute _Encoding for that in future implementations. The values of this attribute are not specified, so anything could be used.

In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings. This effectively limits what could be allowed values of the (future) _Encoding attribute for maximal data interoperability between the two file formats.
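
As an illustration of that intrinsic encoding (a sketch using h5py, which isn't otherwise discussed in this thread; the file and attribute names are made up):

import h5py

with h5py.File("enc-demo.h5", "w") as f:
    # The HDF5 string datatype records its character set in the file;
    # h5py exposes the ASCII/UTF-8 choice via string_dtype().
    dt = h5py.string_dtype(encoding="utf-8")    # stored as H5T_CSET_UTF8
    f.attrs.create("title", "Flöte", dtype=dt)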

@Dave-Allured (Contributor):

@hrajagers said: However CF inherits most of them [attributes] from the NetCDF User Guide which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here?

Yes, NUG Appendix A literally allows only char type attributes. My sense is that proponents believe that string type is compatible with the intent of the NUG, and also strings have enough advantages to warrant departure from the NUG.

Personally I think string type attributes are fine within collaborations where everyone is ready for any needed code upgrades. For exchanged and published data, char type CF attributes should be preferred explicitly by CF.

@Dave-Allured (Contributor):

@ajelenak-thg said: In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings.

Actually the ASCII/UTF-8 restriction is not enforced by the HDF5 library. This is used intentionally by netcdf developers to support arbitrary character sets in netcdf-4 data type char, both attributes and variables. See netcdf issue 298. Therefore, data type char remains fully interoperable between netcdf-3 and netcdf-4 formats.

For example, this netcdf-4 file contains a char attribute and a char variable in an alternate character set. You will need an app or console window enabled for ISO-8859-1 to properly view the ncdump of this file.

@JonathanGregory (Contributor):

Dear Jim

Thanks for addressing these issues. In fact you've raised two issues: the use of strings, and the encoding. These can be decided separately, can't they?

On strings, I agree with your proposal and subsequent comments by others that we should allow string, but we should recommend the continued use of char, giving as the reason that char will maximise the usability of the data, because of the existence of software that isn't expecting string. Recommend means that the cf-checker will give a warning if string is used. However, it's not an error, and a given project could decide to use string.

For the attributes whose contents are standardised by CF e.g. coordinates, if string is used we should require a scalar string. This is because software will not expect arrays of strings. These attributes are often critical and so it's essential they can be interpreted. For CF attributes whose contents aren't standardised e.g. comment, is there a strong use-case for allowing arrays of strings?

I recall that at the meeting in Reading the point was made that arrays would be natural for flag_values and flag_meanings. I agree that the argument is stronger in that case because the words in those two attributes correspond one-to-one. Still, it would break existing software to permit it. Is there a strong need for arrays?

Best wishes

Jonathan

@JimBiardCics (Contributor Author):

@JonathanGregory I agree with you. I think it would be fine to leave string array attributes out of the running for now. I also prefer the recommendation route.

Regarding the encoding, it seems to me that we could avoid a lot of complexity for now by making a simple requirement that all CF-defined terms and whitespace delimiters in string-valued attributes or variables be composed of characters from the ASCII character set. It wouldn't matter if people used Latin-1 (ISO-8859-1) or UTF-8 for any free text or free-text portions of contents, because they both contain the ASCII character set as a subset. The parts that software would be looking for would be parseable.
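
A sketch of why that works (added here; the attribute value is invented): the controlled terms and delimiters occupy exactly the same bytes in Latin-1 and UTF-8, so a parser that only touches those bytes never needs to know which encoding the free text used.

flag_meanings = "good_data suspect_données"    # free text contains a non-ASCII é
for enc in ("latin-1", "utf-8"):
    raw = flag_meanings.encode(enc)
    words = raw.split(b" ")     # the ASCII space is byte 0x20 in both encodings
    print(enc, words[0])        # b'good_data' either way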

@ethanrd (Member) commented Jul 26, 2018

@JonathanGregory Another use-case (that I think came up during the Reading meeting) had the history attribute as a string array so that each element could contain the description of an individual processing step. I think easier machine readability was mentioned as a motivation.

@JonathanGregory (Contributor):

Regarding the encoding, I agree that for attribute contents which are standardised by CF it is fine to restrict ourselves to ASCII, in both char and string. For these attributes, we prescribe the possible values (they have controlled vocabulary) and so we don't need to make a rule in the convention about it for the sake of the users of the convention. If we put it in the convention, it would be as guidance for future authors of the convention. I don't have a view about whether we should do this. It would be worth noting to users that whitespace, which often appears in a "blank-separated list of words", should be ASCII space. I agree that UTF-8 is fine for contents which aren't standardised.

Regarding arrays of strings, I realise I wasn't thinking clearly yesterday, sorry. As we've agreed, string attributes will not be expected by much existing software. Hence software has to be rewritten to support the use of strings in any case, and support for arrays of strings could be added at the same time, if it's really valuable. I don't see the particular value for the use of string arrays for comment - do other people? For flag_meanings, the argument was that it would allow a meaning to be a string which contained spaces (instead of being joined up with underscores, as is presently necessary); that is, it would be an enhancement to functionality.

Happy weekend - Jonathan

@JonathanGregory (Contributor):

I meant to write, I don't see the particular value for the use of string arrays for history, which Ethan reminded us of. Why would this be more machine-readable?

@JimBiardCics (Contributor Author):

@JonathanGregory The use of an array of strings for history would simplify denoting where each entry begins and ends as entries are added, appending or prepending a new array element each time rather than the myriad different ways people do it now. This would actually be a good thing to standardize.

I think we can just not mention string array attributes right now. The multi-valued attributes (other than history, perhaps) pretty much all specify how they are to be formed and delimited.

@cf-metadata-list commented Jul 27, 2018 via email

@kenkehoe commented Jul 27, 2018 via email

@JonathanGregory (Contributor):

Dear @ChrisBarker-NOAA and @DocOtak

Thanks for your further comments. I didn't know that in C you have to do the encoding yourself. I have modified the first two paragraphs accordingly. The third is unchanged. The text now in PR #556 is below. Is it OK?

Cheers

Jonathan

A text string can be stored either in a variable-length string or in a fixed-length char array. In both cases, text strings must be represented in Unicode and encoded according to UTF-8. A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal 00-7F). Any Unicode composite characters must be NFC-normalized.

Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used. However, if the text string is stored in a char variable, the encoding might be recorded by the _Encoding attribute, although this is not a CF or NUG convention.

An n-dimensional array of strings may be implemented as a variable or an attribute of type string with n dimensions (only n=1 is allowed for an attribute) or as a variable of type char with n+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable. For example, a char variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name. The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled. A string variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length. The CDL example below shows one variable of each type.
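
The CDL example mentioned at the end is not reproduced in this thread. As a stand-in, here is a hedged netCDF4-python sketch of the same two layouts (file and variable names are illustrative, not from the PR):

import netCDF4
import numpy as np

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

with netCDF4.Dataset("months.nc", "w") as ds:
    ds.createDimension("month", 12)
    ds.createDimension("strlen", 9)    # len("September"), the longest name
    # char layout: dimensioned (12, 9), shorter names padded to fill each row
    cvar = ds.createVariable("month_name_char", "S1", ("month", "strlen"))
    cvar[:] = netCDF4.stringtochar(np.array(months, dtype="S9"))
    # string layout: dimensioned (12), each element holds one whole string
    svar = ds.createVariable("month_name_str", str, ("month",))
    svar[:] = np.array(months, dtype=object)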

@DocOtak (Member) commented Oct 23, 2024

Do we know what netCDF4-python does? I think my last remaining concern is that, without good defaults in the software folks are using, no one will actually implement normalization in their "beyond ascii" unicode strings.

@ChrisBarker-NOAA (Contributor):

netCDF4 Python does not do the NFC normalization -- but I was going to put in a request for that :-)

@ChrisBarker-NOAA (Contributor):

The PR looks good to me now -- thanks!

However, the conformance section should be updated as well, specifying UTF-8 for string and char.

@ethanrd (Member) commented Oct 25, 2024

This all looks really good! Thanks.
My only comment is about the last sentence in the first paragraph of changed text:

Any Unicode composite characters must be NFC-normalized.

I think this should be described in terms of normalized text rather than characters. First, because the Unicode standard discusses and defines normalization [1] in terms of Unicode text (or "Unicode coded character sequences"). Second, it would then only introduce one new Unicode concept, normalization, rather than two, normalization and composite characters.

Perhaps instead some text about normalization could be added to the second sentence in that paragraph. Something like:

In both cases, text strings must be Unicode text that is encoded according to UTF-8 and in NFC normalization form.

[1] See section 3.11 "Normalization Forms" in chapter 3 (PDF) of the Unicode Standard (v15.0).

@ChrisBarker-NOAA (Contributor):

I think this should be described in terms of normalized text rather than characters.

+1

This is in keeping with the drum I've been beating -- we should talk about Unicode in Unicode terms that scientists[*] will understand -- or at least be able to figure out what to do.

e.g., if the text says "text must be NFC normalized" folks can google:

"how do I NFC normalize a string Python"

(substitute language of choice here) -- and you get the answer as the top AI hit.

Critical is that the user doesn't need to try to figure out if they are using any combining characters, they should simply normalize everything.

[*] - actually any non-specialist in Unicode -- most people that write code, even professional developers, don't know that Unicode has "composite characters", or that there can be more than one way to express what seems like one thing.

# (setup inferred from the outputs below -- these two definitions were
# not shown in the original comment)

In [31]: combined = 'Fl\u00f6te'     # with precomposed ö (U+00F6)

In [32]: separate = 'Flo\u0308te'    # with o + combining diaeresis (U+0308)

In [33]: combined
Out[33]: 'Flöte'

In [34]: separate
Out[34]: 'Flöte'

# Those sure look the same
# but:

In [35]: combined == separate
Out[35]: False

# WTF?

# If you look at the code point values

In [36]: [ord(c) for c in combined]
Out[36]: [70, 108, 246, 116, 101]

In [37]: [ord(c) for c in separate]
Out[37]: [70, 108, 111, 776, 116, 101]

# hmm, there is an extra code point in there.

# to understand this / work with this:

In [39]: import unicodedata

In [40]: normed = unicodedata.normalize('NFC', separate)

In [41]: normed
Out[41]: 'Flöte'

In [42]: normed == combined
Out[42]: True

Ahh -- that makes sense now.

@JonathanGregory (Contributor):

Dear @ethanrd and @ChrisBarker-NOAA

Thanks for your comments. Below is a new version of the text in PR #556 for section 2.2. Is it OK now?

Best wishes

Jonathan

Conventions document

A text string can be stored either in a variable-length string or in a fixed-length char array. In both cases, text strings must be represented in Unicode Normalization Form C (NFC, section 3.11 and Annex 15 of the Unicode standard) and encoded according to UTF-8. A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal 00-7F).

Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used. However, if the text string is stored in a char variable, the encoding might be recorded by the _Encoding attribute, although this is not a CF or NUG convention.

An n-dimensional array of strings may be implemented as a variable or an attribute of type string with n dimensions (only n=1 is allowed for an attribute) or as a variable of type char with n+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable. For example, a char variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name. The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled. A string variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length.

Conformance document

Requirements:

  • Any text stored in a CF attribute or variable must be represented in Unicode Normalization Form C and encoded in UTF-8.

  • If a text-valued attribute is stored in a variable-length string, it must have a scalar value.
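
A minimal sketch of how a checker might test the two requirements above (the function is hypothetical, not the cf-checker's actual code; is_normalized requires Python 3.8+):

import unicodedata

def check_cf_text(value):
    """Raise if attribute/variable text is not NFC-normalized UTF-8."""
    if isinstance(value, bytes):
        value = value.decode("utf-8")    # UnicodeDecodeError => not UTF-8
    if not unicodedata.is_normalized("NFC", value):
        raise ValueError("text is not in Normalization Form C")

check_cf_text(b"Fl\xc3\xb6te")       # precomposed ö as UTF-8 bytes: passes
try:
    check_cf_text("Flo\u0308te")     # o + combining diaeresis: not NFC
except ValueError as err:
    print(err)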

@ChrisBarker-NOAA (Contributor):

Looks great -- I made minor, not critical comments in the PR -- thanks for the massive effort!

@ethanrd (Member) commented Oct 28, 2024

Hi Jonathan @JonathanGregory - Thanks again for all the work on this! It looks really good. Just one new comment/question. (Sorry!)

I'm wondering about the use of the word "must" rather than "should". I think "must" makes sense for attribute/variable names (for UTF-8 and NFC normalization) but it seems less clear when it comes to the values of variables and even attributes (unless they are CF defined attributes maybe).

@ChrisBarker-NOAA (Contributor) commented Oct 28, 2024

I think "must" makes sense for attribute/variable names (for UTF-8 and NFC normalization) -- and that's defined by the NUG, so absolutely.

less clear when it comes to the values of variables and even attributes (unless they are CF defined attributes maybe).

Attributes:

I think it's not always clear exactly what is and isn't a CF defined attribute -- more to the point, it's harder to make it clear to users what the rules are in that case.

And from the perspective of writing code to read/write attributes, it's an unholy mess if there are different encodings for different attributes -- and there is no way to define a different one if you want. I'm not sure it's even possible in, e.g. the netCDF4-python lib (I'd have to check on that).

Variables:

In this case, there is the unsanctioned precedent for an _Encoding attribute, so it's possible for software to handle arbitrary encodings. And the netCDF4-python lib already does. So it's a bit fuzzy there.

However -- it really is so much easier for everyone if we use UTF-8 everywhere -- there are entire web sites and manifestos about it. And since UTF-8 is already baked into netCDF for variable names, there is literally no software that can read/write netCDF without being able to handle UTF-8 -- so why should anyone feel the need for another encoding?

If they really, really want another encoding, then they can still do that, and hopefully specify an _Encoding attribute -- and it's OK if the file is then not CF compliant.

And if there is a really, really, really compelling use case, they can store the otherwise-encoded text as a byte array.

I vote for MUST -- if we want to relax that later into a SHOULD, we can, but not the other way around.

@ChrisBarker-NOAA (Contributor):

Hmm, on further thought -- MUST for UTF-8; SHOULD would be OK for NFC normalization of anything but variable names.

@ethanrd (Member) commented Oct 28, 2024

I agree on MUST utf-8 for attribute values. Too messy otherwise.

I've heard arguments for using other encodings (utf-16 in particular) for some languages/situations. So I'm hesitant around MUST for variable data. But I don't really understand where that argument applies or how widely utf-16 is used, in comparison to utf-8. (I agree utf-8 is probably the most widely supported encoding. Definitely in the netCDF space.)

That's the extent of my hesitancy so I'm good either way, really.

@ChrisBarker-NOAA (Contributor):

how widely utf-16 is used

UTF-16 is widely used in:

  • Java
  • Windows
  • .NET

Stored as a "wide char" data type -- i.e. 2 bytes per char. This was because MS and Java got ahead of the ball in the early Unicode days -- it was initially thought that all of Unicode could fit in 16 bytes (65536) total characters, so go to a two byte char, and everything else stays the same -- simple! But it turned out that all of Unicode couldn't fit in two bytes, so the whole thing, uncompressed in any way, takes 4 bytes (it's not all used, but a 3 byte data type isn't really a thing).

Anyway -- that was way too much background, but the point is that UTF-16 is widely used internally, but it's not so widely used in data-exchanging applications -- I think even MS has pretty much given up on it for, e.g., the MS Office XML formats.

For data interchange, UTF-8 is now almost universal.

Also, the char array and string type in netCDF are essentially a char* in C -- you can cram a two-byte encoding into them, but it's likely to break a lot of software (e.g. null-terminated strings in C -- there are a lot of null bytes in UTF-16).
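
A quick illustration of the null-byte point (a sketch added here):

print("Axel".encode("utf-16-le"))   # b'A\x00x\x00e\x00l\x00' -- NULs everywhere
print("Axel".encode("utf-8"))       # b'Axel' -- no embedded NULs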

The more likely, and reasonable, encodings that folks might want to use are not Unicode, but rather 1-byte encodings like Latin-1 -- or Shift-JIS or ... those have been used in plain old char* (such as the Python 2 string type) for ages -- I'm sure there's a lot of data out there in those encodings. But a good fraction of it is mojibaked, too :-(

@ethanrd (Member) commented Oct 28, 2024

I was wondering more about usage based on text language rather than general implementation. The argument I remember had to do with text in some languages being smaller in UTF-16 than in UTF-8, because most characters in those languages are two bytes when encoded in UTF-16 but three or four bytes when encoded in UTF-8.

But this is really getting into the details. I'm good with either option. And maybe we can continue this conversation over beers sometime.

@ChrisBarker-NOAA (Contributor):

Sure -- Unicode requires beer!

@DocOtak (Member) commented Oct 28, 2024

Required watching pre (or during) beers (youtube link): Characters, Symbols and the Unicode Miracle - Computerphile

@ChrisBarker-NOAA (Contributor):

I just learned something new today:

The netcdf-c lib (or maybe the HDF lib underneath, I don't know) NFC-normalizes variable names for you.

So if you write a non-normalized string in -- it will normalize it for you -- and when read back out, you will get a different string.

What impact does this have on this conversation? Maybe not much, although:

  • Maybe we should mention that the lib does it for you (or may, depending on the lib used -- fortran?)
  • I think it is important that attributes be normalized as well -- at least the CF ones that might refer to names. That's actually even more acutely important, as a user could pass in an attribute in non-normalized form, and it then won't match the variable name, even if they also passed the same value as the variable name.

Tested with Python netCDF4 (I checked, the Python wrapper is not doing the normalization)

Here's some sample code, if you're interested

import netCDF4
import unicodedata

normal_name = "composed\u00E9"
non_normal_name = "separate\u0065\u0301"

# Create a netCDF file with two variables.
with netCDF4.Dataset("nfc-norm.nc", 'w') as ds:
    dim = ds.createDimension("a_dim", 10)
    var1 = ds.createVariable(normal_name, float, ("a_dim",))
    var2 = ds.createVariable(non_normal_name, float, ("a_dim",))
    var1[:] = range(10)
    var2[:] = range(10)

# Read it back in, and see if the variable names were normalized
with netCDF4.Dataset("nfc-norm.nc", 'r') as ds:
    # get the vars from their original names
    try:
        norm = ds[normal_name]
        print(f"{normal_name} worked")
    except IndexError:
        print(f"{normal_name} didn't work")

    try:
        non_norm = ds[non_normal_name]
        print(f"{non_normal_name} worked")
    except IndexError:
        print(f"{non_normal_name} didn't work")
        non_norm = ds[unicodedata.normalize('NFC', non_normal_name)]
        print(f"But it  did once normalized!")

    for name in ds.variables.keys():
        assert unicodedata.is_normalized('NFC', name)
    print("All variable names are normalized")

running it, I get:

In [54]: run nfc_norm.py
composedé worked
separateé didn't work
But it  did once normalized!
All variable names are normalized

@JonathanGregory (Contributor):

Dear Ethan, Chris, Barna

I think it's better to require text stored in attributes and variables to be NFC-normalized and UTF-8, because

  • Some CF attributes contain the names of variables. NetCDF requires UTF-8 for names. To allow an exact match in bytes, any attribute which contains the names of variables must be UTF-8 as well. We don't have to make the rule apply to all attributes, but it's simpler to do so.

  • NFC normalization is likely to avoid confusion about characters looking the same but actually not being the same. I think that's consistent with principle 7, "The conventions should minimise the possibility for mistakes by data-writers and data-readers."

NFC and UTF-8 sounds complicated and offputting, but I hope that most users will be reassured by the statement, "A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their NFC UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal 00-7F)." I've inserted "NFC" in there in the PR (#556). It's not relevant for these codes, but it must be correct, and it may avoid concern.

I don't think it would be appropriate to have text in the CF convention about which netCDF interfaces automatically produce NFC UTF-8 Unicode. However, we could put that information in a page on the CF website, if someone has time to assemble it, and cite that page in the conventions document. It could go in the page about software that works with CF, for example.

As we've discussed, this change may break our principle 9: "Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions". Unfortunately, there's no reliable way to interpret non-ASCII text in existing data, so the best we can do is minimise future problems.

In the PR (#556), I've rephrased the second requirement more simply as, "Any attribute of variable-length string type must be a scalar (not an array)". Elsewhere, we're discussing relaxing this requirement but that's another matter.

Do you, or anyone else, have any more comments?

Cheers

Jonathan

@ChrisBarker-NOAA (Contributor):

Some CF attributes contain the names of variables. NetCDF requires UTF-8 for names. To allow an exact match in bytes, any attribute which contains the names of variables must be UTF-8 as well.

and NFC normalized

NFC normalization is likely to avoid confusion about characters looking the same but actually not being the same. I think that's consistent with principle 7, "The conventions should minimise the possibility for mistakes by data-writers and data-readers."

yes, but the comparison point is far more critical.

So we're in agreement.

I don't think it would be appropriate to have text in the CF convention about which netCDF interfaces automatically produce NFC UTF-8 Unicode

I agree that we don't usually talk about the tools that way (and don't want to be responsible for keeping up to date on which tools do what) -- but maybe a note along the lines of "libraries that write netcdf may automatically do the normalization for you"

@ChrisBarker-NOAA (Contributor):

Another note, just for interest:

The netcdf-c lib does not normalize attribute values to NFC, only variable and dimension (untested) names.

This could lead to errors -- if someone uses the same value for the var name and in an attribute, and it isn't normalized then the two will end up out of sync :-(

I'm going to suggest to the Python lib that normalization be applied to attributes, and we could suggest the same thing to the C lib, but I suspect that the C folks won't go for it -- it's pretty hands-off when it doesn't need something internally.

@ethanrd (Member) commented Oct 29, 2024

I like your changes Jonathan @JonathanGregory.

And a question and a comment:

Why is the _Encoding attribute restricted to text stored as a char array? I was just reviewing this netCDF-C discussion (issue #402) from 2017 and it seems to be applied to char or string.

I think it would be good to explain, briefly, that NFC is required to ensure that two versions of the same string match, because Unicode can support multiple ways to represent the same string. Though perhaps that should be a follow-on discussion. And probably it should be in an Appendix.

@ChrisBarker-NOAA (Contributor):

Why is the _Encoding attribute restricted to text stored as a char array? I was just reviewing this netCDF-C discussion (Unidata/netcdf-c#402) from 2017 and it seems to be applied to char or string.

I kinda agree -- at the binary level the ONLY difference between a char array and a string is that a char array has a pre-defined length.

However, from that issue, I see this:

The netCDF char type contains uninterpreted characters, one character per byte. Typically these contain 7-bit ASCII characters, but the character encoding is application specific. For this reason, applications writing data using the enhanced data model are encouraged to use the netCDF-4 string data type in preference to the char data type. Applications writing string data using the char data type are encouraged to add the special variable attribute "_Encoding" with a value that the netCDF libraries recognize. Currently those valid values are "UTF-8" or "ASCII", case insensitive.

So that does refer to the char type, somehow assuming that netCDF-4 string data type didn't have an issue, even though I don't think use of UTF-8 was defined at that point.

however:

Currently those valid values are "UTF-8" or "ASCII", case insensitive.

Which is silly, because ASCII IS UTF-8. Though I suppose some folks might want to know, without decoding, that it's only ASCII.

as the _Encoding attribute is not being introduced to CF -- I don't know that it matters.

@JonathanGregory (Contributor):

Dear @ChrisBarker-NOAA and @ethanrd

Thanks for your comments and discussion. I haven't added anything more about whether software might do the normalisation for you. I do agree that would be helpful if we have something specific to say. As I mentioned before, I think it would be valuable if we had text in the page about CF-aware software about which languages or libraries automatically produce NFC-normalised UTF-8 text. If that were there, we could link it from the convention. I hope you'll agree with not mentioning it at the moment.

If so, and if no-one else has concerns about the present proposal, we can accept it three weeks after I last invited comments, on 29th. That'll be 19th November. It will be good to conclude this issue, which is the oldest one presently open!

Best wishes

Jonathan

@ChrisBarker-NOAA (Contributor):

I hope you'll agree with not mentioning it at the moment.

agreed, yes.

if we had text in the page about CF-aware software about which languages or libraries automatically produce NFC-normalised UTF-8 text

well, we haven't bothered to link the netcdf-C lib there yet -- and in a way, the normalization is conforming to netcdf spec, not CF per se so ??

But in any case, a whole other thing, if we decide to do it.

@ethanrd (Member) commented Nov 6, 2024

I agree.

I expect that general-purpose Unicode-aware libraries/tools don't produce normalized text automatically. Libraries will provide methods for normalizing strings. Tools built on these libraries should use them when comparing strings, but otherwise would not do normalization in normal operations. (That's my understanding anyway.)

I believe the netCDF-C library applies NFC normalization (and checks for the restricted characters) in two places: 1) when a variable (and attribute, group, etc.) is created; and 2) when a variable is searched for by name, e.g., nc_inq_varid(). (@DennisHeimbigner @WardF - Do I have this right?) I expect the netCDF-Java library does this as well. I think the NUG discusses how a string gets written to a netCDF dataset. I don't think it mentions that this must be applied when searching for a variable, but it probably should.

@JonathanGregory (Contributor):

Three weeks have passed with no further concerns expressed. Therefore we accept the change, with thanks to all who've contributed, especially @DocOtak, @ChrisBarker-NOAA and @ethanrd since we resuscitated the issue this year, and @JimBiardCics who initiated it six years ago. This is the oldest currently open issue, so I'm very happy to close it now by merging #556.

@JonathanGregory JonathanGregory added change agreed Issue accepted for inclusion in the next version and closed and removed CF1.12? We might conclude this issue in time for CF1.12 labels Nov 23, 2024
@JonathanGregory JonathanGregory added this to the 1.12 milestone Nov 28, 2024