Add support for attributes of type string #141
I am generally in support of this. I suggest not including a constraint for scalar strings, simply because it seems redundant. I think that existing CF language strongly implies single strings in most cases of CF-defined attributes.
How different is reading values from a … Prefixing the bytes of a UTF-8 encoded string with the BOM sequence is an odd practice. Although it is permitted, afaik, it is not recommended. Since what gets stored are always the bytes of one string in some encoding, assuming UTF-8 always should take care of the ASCII character set, too. This could cause issues if someone used other one-byte encodings (e.g. the ISO 8859 family), but I don't see how such cases could be easily resolved. Storing Unicode strings using the …
This issue and issue #139 are intertwined. There may be overlapping discussion in both.
@ajelenak-thg So I did some digging. I wrote a file with IDL and read it with C. There are no BOM prefixes. I guess some languages (such as Python) make assumptions one way or another about string attributes and variables, but it appears that it's all pretty straightforward.
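A minimal sketch of such a BOM check, assuming the netCDF4-python library; the file name `example.nc` and attribute name `comment` are illustrative, not taken from the test above:

```python
# Check whether a netCDF text attribute begins with the UTF-8 BOM (EF BB BF).
from netCDF4 import Dataset

UTF8_BOM = b"\xef\xbb\xbf"

with Dataset("example.nc") as nc:          # hypothetical file
    value = nc.getncattr("comment")        # assumed to be a text attribute
    raw = value.encode("utf-8")            # the bytes as stored in the file
    print("has BOM prefix:", raw.startswith(UTF8_BOM))
    # equivalently, at the str level a BOM shows up as a leading U+FEFF:
    print("leading U+FEFF:", value.startswith("\ufeff"))
```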
@ajelenak-thg I agree that we should state that char attributes and variables should contain only ASCII characters.
@Dave-Allured When you say "CF-controlled attributes", are you referring to any values they may have, or to values that are from controlled vocabularies?
@JimBiardCics, by "CF-controlled attributes", I mean CF-defined attributes within "the controlled vocabulary of CF" as you described above. By implication I am referring to any values they may have, including but not limited to values from controlled vocabularies. A warning about avoiding data type …
The restriction that … Therefore I suggest no mention of a character set restriction, outside of the CF controlled vocabulary. Alternatively you could establish the default interpretation of string data (both …
Hi all, I wasn't quite able to form this into coherent paragraphs, so here are some things to keep in mind re: UTF8 vs other encodings: …
My personal recommendation is that the only encoding for text in CF netCDF be UTF8 in NFC with no byte order mark. For attributes where there is a desire to restrict what is allowed (through controlled vocabulary or other limitations), the restriction should be specified using Unicode code points, e.g. "only printing characters between U+0000 and U+007F are allowed in controlled attributes". Text which is in controlled vocabulary attributes should continue to be char arrays. Freeform attributes (mostly those in 2.6.2. Description of file contents) could probably be either string or char arrays.
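A sketch of what such a code-point restriction could look like in practice (plain Python, standard library only; the example values are illustrative):

```python
# Accept only printing ASCII code points (U+0020..U+007E), i.e. the
# restriction proposed above with control characters excluded.
def is_controlled_vocab_safe(text: str) -> bool:
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)

print(is_controlled_vocab_safe("sea_water_temperature"))  # True
print(is_controlled_vocab_safe("temp\u00e9rature"))       # False: é is U+00E9
```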
@DocOtak, you said "the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC)". This restriction is only for netCDF named objects, i.e. the names of dimensions, variables, and attributes. There is no such restriction for data stored within variables or attributes.
@Dave-Allured yes, I reread the section; object names do appear to be what it is restricting. Should there be some consideration of specifying a normalization for the purposes of data in CF netCDF? Text encoding probably deserves its own section in the CF document, perhaps under data types. The topic of text encoding can be very foreign to someone who thinks that "plain text" is a thing that exists in computing.
@DocOtak, for general text data, I think UTF-8 normalization is more of a best practice than a necessity for CF purposes. Therefore I suggest that CF remain silent about that, but include it if you feel strongly. Normalization becomes important for efficient string matching, which is why netCDF object names are restricted.
@Dave-Allured I don't know enough about the consequences of requiring a specific normalization. There is some interesting information on the Unicode website about normalization, which suggests that over 99% of Unicode text on the web is already in NFC. Also interesting is that combining NFC-normalized strings may not result in a new string that is normalized. It is also stated in the FAQ that "Programs should always compare canonical-equivalent Unicode strings as equal", so it's probably not an issue as long as the controlled vocabulary attributes have values with code points in the U+0000 to U+007F range (control chars excluded).
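The canonical-equivalence point can be seen with the standard library's unicodedata module (a small illustration, not from the thread itself):

```python
import unicodedata

composed   = "caf\u00e9"    # "café" with é as one precomposed code point (NFC)
decomposed = "cafe\u0301"   # "café" as e + U+0301 combining acute accent (NFD)

print(composed == decomposed)                         # False: raw code points differ
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))       # True: canonically equivalent
print(unicodedata.is_normalized("NFC", decomposed))   # False (Python 3.8+)
```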
@Dave-Allured and @DocOtak, …
@hrajagers Thanks for the pointer to NUG Appendix A. It's interesting to see in that text that character array, character string, and string are used somewhat interchangeably. I'm curious to know if the NUG authors looked at this section in light of allowing …
I think we are making good progress on this. I checked the Appendix A table of CF attributes and I think the following attributes can be allowed to hold … All the other attributes should hold …
@ajelenak-thg Are you suggesting the other attributes must always be of type …
Based on the expressed concern so far for backward compatibility I suggested the former: always be of type …
On the string encoding issue, CF data can currently be stored in two file formats: NetCDF Classic, and HDF5. String encoding information cannot be directly stored in the NetCDF Classic format, and the spec defines a special variable attribute `_Encoding` … In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both …
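A sketch of how a reader could honour such an encoding attribute, assuming netCDF4-python; the file and variable names are illustrative, and netCDF4-python may already perform a similar decode itself:

```python
from netCDF4 import Dataset

with Dataset("example.nc") as nc:           # hypothetical file
    var = nc.variables["station_name"]      # hypothetical char-type variable
    var.set_auto_chartostring(False)        # work with the raw bytes
    # fall back to UTF-8, which covers pure ASCII as a subset
    encoding = getattr(var, "_Encoding", "utf-8")
    print(var[:].tobytes().decode(encoding))
```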
Yes, NUG Appendix A literally allows only … Personally I think …
Actually the ASCII/UTF-8 restriction is not enforced by the HDF5 library. This is used intentionally by netcdf developers to support arbitrary character sets in the netcdf-4 data type … For example, this netcdf-4 file contains a …
Dear Jim

Thanks for addressing these issues. In fact you've raised two issues: the use of strings, and the encoding. These can be decided separately, can't they?

On strings, I agree with your proposal and subsequent comments by others that we should allow … For the attributes whose contents are standardised by CF, e.g. … I recall that at the meeting in Reading the point was made that arrays would be natural for …

Best wishes

Jonathan
@JonathanGregory I agree with you. I think it would be fine to leave …

Regarding the encoding, it seems to me that we could avoid a lot of complexity for now by making a simple requirement that all CF-defined terms and whitespace delimiters in string-valued attributes or variables be composed of characters from the ASCII character set. It wouldn't matter if people used Latin-1 (ISO-8859-1) or UTF-8 for any free text or free-text portions of contents, because they both contain the ASCII character set as a subset. The parts that software would be looking for would be parseable.
@JonathanGregory Another use-case (that I think came up during the Reading meeting) had the …
Regarding the encoding, I agree that for attribute contents which are standardised by CF it is fine to restrict ourselves to ASCII, in both …

Regarding arrays of strings, I realise I wasn't thinking clearly yesterday, sorry. As we've agreed, …

Happy weekend - Jonathan
I meant to write, I don't see the particular value for the use of …
@JonathanGregory The use of an array of strings for history would simplify denoting where each entry begins and ends as entries are added, appending or prepending a new array element each time rather than the myriad different ways people do it now. This would actually be a good thing to standardize. I think we can just not mention string array attributes right now. The multi-valued attributes (other than history, perhaps) pretty much all specify how they are to be formed and delimited.
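A sketch of the append-an-element idea, assuming netCDF4-python and a netCDF-4 file (string attributes need the netCDF-4 format); the file name and history entry are illustrative:

```python
from datetime import datetime, timezone
from netCDF4 import Dataset

with Dataset("example.nc", "a") as nc:      # hypothetical file
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    entry = f"{stamp}: regridded to 1x1 degree"
    old = nc.getncattr("history") if "history" in nc.ncattrs() else []
    if isinstance(old, str):                # tolerate a legacy scalar history
        old = [old]
    # setncattr_string forces an attribute of netCDF string type
    nc.setncattr_string("history", list(old) + [entry])
```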
> I think we can just not mention string array attributes right now.

Do we currently allow arrays of CHAR (i.e. 2D arrays) for attributes? According to the netcdf docs:

> The current version treats all attributes as vectors; scalar values are treated as single-element vectors.

Which makes me think no, that's not possible. I think allowing the string type should not change what's allowable.

BTW, I suspect some client software (e.g. py_netCDF4) treats char and string the same ....

-CHB
Let me throw a big wrench into this argument about not allowing string arrays.

1. I would prefer a consistent decision and standard about the use of char vs. string, so a user does not need to know where to use a char array, a scalar string, or a string array.
2. Use of string arrays with flag_meanings (not sure it would be needed with flag_values?) will solve many problems for my program to actually merge our standards with CF. Currently with char arrays we need to connect all words for a single flag with underscores for space delimiting. Many of our variable names and attribute names contain underscores, so when the flag description is parsed and changed to be more human readable, all the attribute and variable names are not preserved. Automated tools can no longer replace attribute or variable names with the attribute or variable value. We do this a lot. We also have lengthy descriptions for our flag_meanings. I would prefer to use flag_masks, flag_values and flag_meanings, as that general method is better than the one we currently employ. (See the sketch after this message.)
3. I do see the benefit of storing history as string arrays. Without checking date stamps I can see how many times the file has been modified by checking the list length. It also removes any ambiguity about separators in the history attribute, which differs from the CF standard of space separation and is often institution defined. The current definition of the history attribute is "List of the applications that have modified the original data." In the Python world the use of "list" is different from the intended definition.
4. I'm starting to get a lot of more complicated data that are multidimensional but do not share the same units. We would need to work with udunits, but Cf/Radial is proposing a new standard for complex data, which often have different units for different indexes in a second dimension. If we allowed string arrays in units we could store complex data or other data structures more native to the intended use, since udunits interprets space characters as multiplication, not as a delimiter.
5. missing_value and _FillValue currently allow one value. Allowing string arrays for string-type data to have multiple fill values would open the door to numeric data also having multiple fill values defined; I'm sure there are many data sets that use multiple fill values but do not define them correctly in the data file.
6. valid_range could be used with the string data type.
7. The Conventions attribute could group multiple indicators within the same class of conventions, for example ["CF-1.7", "Cf/Radial instrument_parameters radar_parameters", "ARM-1.3"].
8. and on and on ....

I'm not suggesting the use of all these use cases, but this relatively small change can go a long way to improve the standard and future use of the data.

OK, I've made my case. I'll be quiet now.

Ken
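As a concrete illustration of point 2 above, here is a sketch contrasting the two forms, assuming netCDF4-python and a netCDF-4 file; the variable name and flag meanings are made up:

```python
from netCDF4 import Dataset

with Dataset("example.nc", "a") as nc:      # hypothetical file
    qc = nc.variables["qc_temperature"]     # hypothetical flag variable
    # Today: one space-delimited string, so spaces inside a meaning
    # must become underscores.
    qc.flag_meanings = "value_ok sensor_failure outside_valid_range"
    # Proposed: one array element per meaning, spaces allowed inside each.
    qc.setncattr_string("flag_meanings",
                        ["value ok", "sensor failure", "outside valid range"])
```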
Dear @ChrisBarker-NOAA and @DocOtak

Thanks for your further comments. I didn't know that in C you have to do the encoding yourself. I have modified the first two paragraphs accordingly. The third is unchanged. The text now in PR #556 is below. Is it OK? Cheers, Jonathan

A text string can be stored either in a variable-length `string` … Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used. However, if the text string is stored in a … An n-dimensional array of strings may be implemented as a variable or an attribute of type …
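For concreteness, a sketch of the two storage forms described in that text, assuming netCDF4-python; all file, dimension, and variable names are illustrative:

```python
import numpy as np
from netCDF4 import Dataset

text = "Déjà vu"                       # non-ASCII, already NFC-normalized
raw = text.encode("utf-8")             # the bytes a char variable would store

with Dataset("example.nc", "w") as nc:
    nc.createDimension("n", 1)
    nc.createDimension("nchar", len(raw))
    svar = nc.createVariable("note_string", str, ("n",))     # variable-length string
    cvar = nc.createVariable("note_char", "S1", ("nchar",))  # fixed-length char array
    svar[0] = text
    cvar[:] = np.frombuffer(raw, dtype="S1")
```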
Do we know what netCDF4-python does? I think my last remaining concern is that, without good defaults in the software folks are using, no one will actually implement normalization in their "beyond ascii" unicode strings.
netCDF4 Python does not do the NFC normalization -- but I was going to put in a request for that :-)
The PR looks good to me now -- thanks! However, the conformance section should be updated as well, specifying UTF-8 for strings and char.
This all looks really good! Thanks.
I think this should be described in terms of normalized text rather than characters. First, because the Unicode standard discusses and defines normalization [1] in terms of Unicode text (or "Unicode coded character sequences"). Second, it would then only introduce one new Unicode concept, normalization, rather than two, normalization and composite characters. Perhaps instead some text about normalization could be added to the second sentence in that paragraph. Something like: …
[1] See section 3.11 "Normalization Forms" in chapter 3 (PDF) of the Unicode Standard (v15.0).
+1 This is in keeping with the drum I've been beating -- we should talk about Unicode in Unicode terms that scientists[*] will understand -- or at least be able to figure out what to do. E.g., if the text says "text must be NFC normalized", folks can google "how do I NFC normalize a string in Python" (substitute language of choice here) -- you get the answer as the top hit. Critical is that the user doesn't need to try to figure out if they are using any combining characters; they should simply normalize everything.

[*] Actually any non-specialist in Unicode -- most people that write code, even professional developers, don't know that Unicode has "composite characters", or that there can be more than one way to express what seems like one thing.
Dear @ethanrd and @ChrisBarker-NOAA

Thanks for your comments. Below is a new version of the text in PR #556 for section 2.2. Is it OK now? Best wishes, Jonathan

Conventions document

A text string can be stored either in a variable-length `string` … Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used. However, if the text string is stored in a … An n-dimensional array of strings may be implemented as a variable or an attribute of type …

Conformance document

Requirements: …
Looks great -- I made minor, not critical, comments in the PR -- thanks for the massive effort!
Hi Jonathan @JonathanGregory - Thanks again for all the work on this! It looks really good. Just one new comment/question. (Sorry!) I'm wondering about the use of the word "must" rather than "should". I think "must" makes sense for attribute/variable names (for UTF-8 and NFC normalization) but it seems less clear when it comes to the values of variables and even attributes (unless they are CF defined attributes maybe).
Attributes: I think it's not always clear exactly what is and isn't a CF-defined attribute -- more to the point, it's harder to make it clear to users what the rules are in that case. And from the perspective of writing code to read/write attributes, it's an unholy mess if there are different encodings for different attributes -- and there is no way to define a different one if you want. I'm not sure it's even possible in, e.g., the netCDF4-python lib (I'd have to check on that).

Variables: In this case, there is the unsanctioned precedent for an `_Encoding` attribute … However -- it really is so much easier for everyone if we use UTF-8 everywhere -- there are entire web sites and … If they really, really want another encoding, then they can still do that, and hopefully specify a … And if there is a really, really, really compelling use case, they can store the otherwise-encoded text as a byte array.

I vote for MUST -- if we want to relax that later into a SHOULD, we can, but not the other way around.
Hmm, on further thought -- MUST for utf-8; SHOULD would be OK for NFC normalization of anything but variable names.
I agree on MUST utf-8 for attribute values. Too messy otherwise. I've heard arguments for using other encodings (utf-16 in particular) for some languages/situations. So I'm hesitant around MUST for variable data. But I don't really understand where that argument applies or how widely utf-16 is used, in comparison to utf-8. (I agree utf-8 is probably the most widely supported encoding. Definitely in the netCDF space.) That's the extent of my hesitancy so I'm good either way, really.
UTF-16 is widely used in, e.g., Java, stored as a "wide char" data type -- i.e. 2 bytes per char. This was because MS and Java got ahead of the ball in the early Unicode days -- it was initially thought that all of Unicode could fit in 16 bits (65,536 total characters), so go to a two-byte char, and everything else stays the same -- simple! But it turned out that all of Unicode couldn't fit in two bytes, so the whole thing, uncompressed in any way, takes 4 bytes (it's not all used, but a 3-byte data type isn't really a thing).

Anyway -- that was way too much background, but the point is that UTF-16 is widely used internally a lot, but it's not so widely used in data-exchanging applications -- I think even MS has pretty much given up on it for, e.g., MS Office XML formats. For data interchange, UTF-8 is now almost universal.

Also, the char array and string type in netcdf is essentially a char* in C -- you can cram a two-byte encoding into it, but it's likely to break a lot of software (e.g. null-terminated strings in C -- there are a lot of null bytes in utf-16).

The more likely, and reasonable, encodings that folks might want to use are not Unicode, but rather 1-byte encodings, like latin-1 -- or shift-jis or ... Those have been used in plain old char* (such as the Python 2 string type) for ages -- I'm sure there's a lot of data out there in those encodings. But a good fraction of it is Mojibaked, too :-(
I was wondering more about usage based on text language rather than general implementation. The argument I remember had to do with text in some languages being smaller in UTF-16 than in UTF-8, because most characters in those languages are two bytes when encoded in UTF-16 but 3 or 4 bytes when encoded in UTF-8. But this is really getting into the details. I'm good with either option. And maybe we can continue this conversation over beers sometime.
Sure -- Unicode requires beer!
Required watching pre (or during) beers (youtube link): Characters, Symbols and the Unicode Miracle - Computerphile
I just learned something new today: the netcdf-c lib (or maybe the HDF lib underneath, I don't know) NFC-normalizes variable names for you. So if you write a non-normalized string in -- it will normalize it for you -- and when read back out, you will get a different string. What impact does this have on this conversation? Maybe not much, although:
Tested with Python netCDF4 (I checked, the Python wrapper is not doing the normalization). Here's some sample code, if you're interested:
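A sketch along the lines described, assuming netCDF4-python; the file name is illustrative:

```python
import unicodedata
from netCDF4 import Dataset

name_nfd = "cafe\u0301"      # "café" spelled in decomposed (NFD) form
assert not unicodedata.is_normalized("NFC", name_nfd)

with Dataset("normtest.nc", "w") as nc:
    nc.createDimension("x", 1)
    nc.createVariable(name_nfd, "f4", ("x",))

with Dataset("normtest.nc") as nc:   # reopen and inspect the stored name
    stored = list(nc.variables)[0]
    print("round-trips unchanged?", stored == name_nfd)
    print("is the NFC form?", stored == unicodedata.normalize("NFC", name_nfd))
```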
Running it, the name read back is the NFC-normalized form rather than the string that was written.
Dear Ethan, Chris, Barna

I think it's better to require text stored in attributes and variables to be NFC-normalized and UTF-8, because: …
NFC and UTF-8 sounds complicated and offputting, but I hope that most users will be reassured by the statement, "A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their NFC UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal `00`-`7F`)."

I don't think it would be appropriate to have text in the CF convention about which netCDF interfaces automatically produce NFC UTF-8 Unicode. However, we could put that information in a page on the CF website, if someone has time to assemble it, and cite that page in the conventions document. It could go in the page about software that works with CF, for example.

As we've discussed, this change may break our principle 9: "Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions". Unfortunately, there's no reliable way to interpret non-ASCII text in existing data, so the best we can do is minimise future problems.

In the PR (#556), I've rephrased the second requirement more simply as, "Any attribute of variable-length string type must be a scalar (not an array)". Elsewhere, we're discussing relaxing this requirement but that's another matter.

Do you, or anyone else, have any more comments? Cheers, Jonathan
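The quoted reassurance can be checked in a couple of lines (standard-library Python only; the sample string is illustrative):

```python
import unicodedata

text = "air_temperature"
assert all(ord(ch) < 0x80 for ch in text)            # pure ASCII
assert unicodedata.is_normalized("NFC", text)        # ASCII text is already NFC
assert text.encode("utf-8") == text.encode("ascii")  # identical bytes either way
```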
yes, but the comparison point is far more critical. So we're in agreement.
I agree that we don't usually talk about the tools that way (and don't want to be responsible for keeping up to date on which tools do what) -- but maybe a note along the lines of "libraries that write netCDF may automatically do the normalization for you".
Another note, just for interest: the netcdf-c lib does not normalize attribute values to NFC, only variable and dimension (untested) names. This could lead to errors -- if someone uses the same value for the var name and in an attribute, and it isn't normalized, then the two will end up out of sync :-(

I'm going to suggest to the Python lib that normalization be applied to attributes, and we could suggest the same thing to the C lib, but I suspect that the C folks won't go for it -- it's pretty hands-off when it doesn't need something internally.
I like your changes Jonathan @JonathanGregory. And a question and a comment:

Why is the … ?

I think it would be good to explain, briefly, that NFC is required to ensure that two versions of the same string match, because Unicode can support multiple ways to represent the same string. Though perhaps that should be a follow-on discussion. And probably it should be in an Appendix.
I kinda agree -- at the binary level the ONLY difference between a char array and a string is that a char array has a pre-defined length. However, from that issue, I see this:
So that does refer to the char type, somehow assuming that the netCDF-4 string data type didn't have an issue, even though I don't think use of UTF-8 was defined at that point. However:
Which is silly, because ASCII IS UTF-8. Though I suppose some folks might want to know, without decoding, that it's only ASCII. As the _Encoding attribute is not being introduced to CF, I don't know that it matters.
Dear @ChrisBarker-NOAA and @ethanrd

Thanks for your comments and discussion. I haven't added anything more about whether software might do the normalisation for you. I do agree that would be helpful if we have something specific to say. As I mentioned before, I think it would be valuable if we had text in the page about CF-aware software about which languages or libraries automatically produce NFC-normalised UTF-8 text. If that were there, we could link it from the convention. I hope you'll agree with not mentioning it at the moment.

If so, and if no-one else has concerns about the present proposal, we can accept it three weeks after I last invited comments, on 29th. That'll be 19th November. It will be good to conclude this issue, which is the oldest one presently open!

Best wishes

Jonathan
agreed, yes.
Well, we haven't bothered to link the netcdf-C lib there yet -- and in a way, the normalization is conforming to the netCDF spec, not CF per se, so ?? But in any case, a whole other thing, if we decide to do it.
I agree. I expect that general-purpose Unicode-aware libraries/tools don't produce normalized text automatically. Libraries will provide methods for normalizing strings. Tools built on these libraries should use them when comparing strings, but otherwise they would not do normalization in normal operations. (That's my understanding anyway.)

I believe the netCDF-C library applies NFC normalization (and checks for the restricted characters) in two places: 1) when a variable (and attribute, group, etc.) is created; and 2) when a variable is searched for by name, e.g., …
Three weeks have passed with no further concerns expressed. Therefore we accept the change, with thanks to all who've contributed, especially @DocOtak, @ChrisBarker-NOAA and @ethanrd since we resuscitated the issue this year, and @JimBiardCics who initiated it six years ago. This is the oldest currently open issue, so I'm very happy to close it now by merging #556.
Attributes with a type of `string` are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of the `string` type instead of the `char` type. It seems that people often assume that `string` is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of `string`. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

1. A `string` attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
2. A `string` attribute (and a `string` variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three-byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type `string`.

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.

To finalize the change to support `string` type attributes, we need to decide:

1. Do we allow arrays of strings, or only scalar strings?
2. Do we allow UTF-8 in `string` attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

Allow `string` attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc.) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc.) may use any UTF-8 character.
Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)