Creating a blacklisting certain characters from variable and attribute names #323

larsbarring · 2024-05-31T09:55:25Z

larsbarring
May 31, 2024
Collaborator

Topic for discussion

In #237 it was suggested to substantially relax restrictions on which characters are allowed in variable and attribute names. The conversation is still ongoing and sprinkled in various comments there are examples of characters that should not be allowed, either because they have special meaning in the context of CF or netCDF as such, or otherwise identified as causing problems.

I suggest that we amend the text in section 2.3 to list which character and character ranges CF explicitly disallows, i.e. creating a blacklist. While it may not be possible to identify all characters that should be in such a list (it may even evolve over time) I think that it is helpful to identify those characters that we now know belong to such a list.

So, far I believe the following have been identified from the standard ASCII character set: <space>, control characters (decimal 0 ... 31, 127), /, :, \. This blacklist should probably be expanded to also include Unicode control and whitespace ~~and underscore~~ characters.

I addition, double underscores __ have special meaning in relation to OGC netCDF-LD, specifically for prefixes, and should be mentioned as reserved for that purpose to not create interoperability clashes.

davidhassell · 2024-05-31T11:19:13Z

davidhassell
May 31, 2024
Maintainer

This blacklist should probably be expanded to also include Unicode controls, whitespaces and underscore characters.

Surely we don't want to disallow underscores!

3 replies

larsbarring May 31, 2024
Collaborator Author

Of course not!!!! My mistake (now fixed)

efisher008 Jun 13, 2024
Maintainer

Are (en) dashes/hyphens - supported in variable and attribute names? Should this character be included in the list? Apparently em dashes — are not standard ASCII characters, so probably that does not need to be specified if names are ASCII-only by default.

sethmcg Jun 13, 2024
Collaborator

In Issue #477 we decided to allow ASCII period and ASCII hyphen in attribute names only.

So either there will need to be two lists, or the list will need to be structured to allow for differences in different contexts.

ChrisBarker-NOAA · 2024-06-13T18:39:08Z

ChrisBarker-NOAA
Jun 13, 2024
Collaborator

Hmm -- I like this idea. But first i think we should make clear what the (long term) goal is:

Unicode is very complex, with a lot of subtleties -- There are efforts to manage that with normalization (https://www.unicode.org/reports/tr15/), and categorization of code points. (General Category. Partition of the characters into major classes such as letters, punctuation, and symbols, and further subclasses for each of the major classes.) Etc.

So I think we have essentially three options:

Stick with ASCII -- and maybe add some extras (Latin1?) - this is not great -- really doesn't allow real internationalization -- I think there's general consensus not to do that.
Use the Unicode categorization to restrict allowable characters -- there are a manageable number of such categories (30-ish).
Allow any Unicode code point, except for a defined blacklist (that's what this discussion is about)

I think the whole point of this discussion is that we don't want to do (1) anymore.

for (2) -- it seems appealing, but there's a lot of complexity, e.g. (from the Unicode spec)

Similarly, characters whose General_Category identifies them primarily as a symbol or as a
mathematical symbol may function in other contexts as punctuation or even paired punctuation. The most obvious such case is for U+003C “<” less-than sign and U+003E “>”
greater-than sign. These are given the General_Category gc = Sm because their primary
identity is as mathematical relational signs. However, as is obvious from HTML and XML,
they also serve ubiquitously as paired bracket punctuation characters in many formal syntaxes.

So it can get messing. Nevertheless, there is precedent -- for instance, Python has the following rules:

https://docs.python.org/3/reference/lexical_analysis.html#identifiers

A bit messy, but do-able.

However, there are still a number of complications -- one is NFKC normalization, and another is that Python treats some different Unicode characters as equivalent (e.g. Blackboard Bold "B" U+1D539 is the same as capital B U+0042) -- but only in context where the normalization is done (e.g. processing source code, but not when meta-programming, like setattr()) (sorry can't find a reference at the moment).

Frankly, it's a bit of a mess if people really do use the broad range of allowable characters.

That being said, I think that the CF problem is easier than Python, as CF isn't providing normalization -- only enforcement.

I'm inclined (at the moment -- I haven't thought it through too carefully) to go with (3) -- allow any Unicode code point except a given blacklist. Note that I say Code Point, not character, as some characters can be represented by different code points (e.g. accented characters) If we simply do "Code Point", then there is no issue of normalization, or anything else.

(hmm, option 3(b) -- any code point, but a particular normalization?)

Though maybe that's too much a wild west?

0 replies

sethmcg · 2024-06-14T16:24:27Z

sethmcg
Jun 14, 2024
Collaborator

I'd like to course-correct the discussion a bit, if I may. This is not a proposal to expand the list of allowed characters in a wide-reaching way. That's what #237 is about, and a number of folks (including me and Lars) concluded that it would be unwise; there are a lot of security and interoperability concerns that make it important to consider any expansions of the list carefully and cautiously before adding them.

I believe what Lars is proposing is that we add an explicit, stand-alone listing of the sets of banned and allowed characters, rather than only having them defined implicitly in the text of section 3.2. I can see the value in that, but I think we shouldn't frame it as a list of banned characters, because that implies that anything not on the list is allowed, and as discussed in #237, there are important reasons that the default answer for whether a character is allowed should be "no". I think we should have an explicit list of allowed characters, with an accompanying list (maybe an extra column) of clarifications to cover the known disallowed characters that Lars suggests. So maybe something like:

Allowed	Clarification
`-`	`-` is the ASCII hyphen-minus, ASCII 45 / Unicode U+002D. Other dash characters (unicode en-dash, em-dash, minus sign, soft hyphen, non-breaking hyphen, etc.) are not allowed. This character is only allowed in attribute names, not variable names.

0 replies

ChrisBarker-NOAA · 2024-06-14T18:00:45Z

ChrisBarker-NOAA
Jun 14, 2024
Collaborator

@sethmcg: sorry about that -- I think it was me that expanded the conversation.

However, the reason I did that is that I don't see how we can talk about a blacklist without the context of what's allowed, so I was trying to get at that.

However, I think maybe I get it now -- this proposal for a "blacklist" is more internal to clearly define the rules now, and to guide any potential expansion in the future -- e.g.: whatever we do we won't allow THESE charactors :-)

I see the point of that, so carry on :-)

To that point:

"-" is the ASCII hyphen-minus, ASCII 45 / Unicode U+002D. Other dash characters (unicode en-dash, em-dash, minus sign, soft hyphen, non-breaking hyphen, etc.) are not allowed. This character is only allowed in attribute names, not variable names.

I find this odd to say -- are ANY other non-ascii charactors -- any number of other symbols, punctuation, etc allowed?

I think I get the point here, but it's a odd phrasing.

I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash.

In fact, I've had that issue in a totally different context, where something was copy and pasted from an application that had (helpfully) auto-changed an ascii dash to an endash.

So I don't see this as a blacklist so much as a "be cautious of these" list -- at least in that example.

Which I do think is good to document.

The real blacklist are the ones that will break other aspects of CF / netcdf (e.g. have special meaning in CDL)

-CHB

0 replies

sethmcg · 2024-06-14T18:57:45Z

sethmcg
Jun 14, 2024
Collaborator

@ChrisBarker-NOAA

I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash.
Yes, precisely. I think that would be a good addition to the conventions, and my impression is that that's what Lars is proposing, though I may be wrong.

I hadn't thought about compiling the list of characters that we definitely don't want to add for various technical reasons, just to have a consolidated reference for what they are and why they're banned. I agree that that would be a very useful thing to have, but I'm not sure about adding it to CF proper. I worry that people would see it and think of it as the complete list of all disallowed characters, and that everything else is allowed. Maybe we want to have that list, but make it an adjunct document of some kind, like the Guidelines for Constructing Standard Names? Or put it in an appendix?

0 replies

larsbarring · 2024-06-17T08:30:44Z

larsbarring
Jun 17, 2024
Collaborator Author

Wow, I was away from this issue for a few days while there have been a lot of activity and good points. When opening this discussion I had in mind was a rather modest extension to section 2.3, where the relevant part reads

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _. ... ... ... ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only.

Essentially this allows, as a recommendation, the US-ASCII (or their Unicode counterpart) letters and digits and underscore, as well as period and hyphen for attribute names. All other characters are implicitly not recommended (or "should not"), but not explicitly excluded or forbidden. What I had in mind was to marginally reduce this huge list of not recommended characters by explicitly disallowing the few characters that we already now know will create problems.

So far I am aware of the following, all within the US-ASCII character set, control characters (decimal 0 ... 31, 127), (space), /, \, : (I do not remember in what context the : surfaced, so maybe I am mistaken).

Based on this, my simplistic suggestion is to immediately after the text cited above add a sentence, something like

... ... ... ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only. The following ASCII characters must not be used: control characters (decimal 0 -31, 127), (space), /, \ and :.

In this minimal way we avoid all complications in relation to Unicode, and focus on those few we all agree, I think, cannot be used. All other punctuation (whether ASCII or Unicode), Unicode control and what not, remains as is, which basically means to be sorted out in the future.

0 replies

larsbarring · 2024-10-04T15:23:32Z

larsbarring
Oct 4, 2024
Collaborator Author

I have now explored this in some more detail using a python script to insert various unicode characters into the variable name in a small .cdl file and then use ncgen to generate a .nc file. In the same script I used NCO/ncrename trying to change the same character of a variable name in a working nc-file to all other characters in the list, and then use ncdump to create a cdl file. Thus it is not a full round-trip because the NCO step. I focussed on ASCII (decimal 0 - 127), ISO/IEC 8859-1 (decimal 0 - 255) and control (C1), as well as Unicode whitespace (WS) groups (all according to Wikipedia). Here is the result:

Code point	Decimal	Character "group"	ncgen / nco+ncdump
U+0000	0	ASCII "`nul`" Unicode control C0	NOT/NOT
U+0001 - U+0008	1 - 8	ASCII/Unicode control C0	NOT/OK
U+0009 - U+0010	9 - 10	ASCII/Unicode control C0	NOT/NOT
U+0011 - U+001F	11 - 31	ASCII/Unicode control C0	NOT/OK
U+0020	32	ASCII/ISO/IEC 8859-1 (space)	NOT/NOT
U+0021	33	ASCII/ISO/IEC 8859-1 `!`	NOT/OK
U+0022	34	ASCII/ISO/IEC 8859-1 `"`	NOT/NOT
U+0023 - U+0025	35 -37	ASCII/ISO/IEC 8859-1 `#` `$` `%`	NOT/OK
U+0026 - U+0029	38 - 41	ASCII/ISO/IEC 8859-1 `&` `'` `(` `)`	NOT/NOT
U+002A	42	ASCII/ISO/IEC 8859-1 *``**	NOT/OK
U+002B	43	ASCII/ISO/IEC 8859-1 `+`	OK/OK
U+002C	44	ASCII/ISO/IEC 8859-1 `,`	NOT/OK
U+002D - U+002E	45 - 46	ASCII/ISO/IEC 8859-1 `-` `.`	OK/OK
U+002F	47	ASCII/ISO/IEC 8859-1 `/`	NOT/OK
U+0030 - U+0039	48 - 57	ASCII/ISO/IEC 8859-1 digits	OK/OK
U+003A	58	ASCII/ISO/IEC 8859-1 `:`	NOT/OK
U+003B	59	ASCII/ISO/IEC 8859-1 `;`	NOT/NOT
U+003C - U+003F	60 - 63	ASCII/ISO/IEC 8859-1 `<` `=` `>` `?`	NOT/OK
U+0040	64	ASCII/ISO/IEC 8859-1 `@`	OK/OK
U+0041 - U+005A	65 - 90	ASCII/ISO/IEC 8859-1 `A` - `Z`	OK/OK
U+005B - U+005E	91 - 94	ASCII/ISO/IEC 8859-1 `[` `\` `]` `^`	NOT/OK
U+005F	95	ASCII/ISO/IEC 8859-1 `_`	OK/OK
U+0060	96	ASCII/ISO/IEC 8859-1 `	NOT/NOT
U+0061 - U+007A	97 - 122	ASCII/ISO/IEC 8859-1 `a` - `z`	OK/OK
U+007B	123	ASCII/ISO/IEC 8859-1 `{`	NOT/OK
U+007C	124	ASCII/ISO/IEC 8859-1 `\|`	NOT/NOT
u+007D - U+007E	125 - 126	ASCII/ISO/IEC 8859-1 `}` `~`	NOT/OK
U+007F	127	ASCII "`del`" Unicode control C0	NOT/OK
U+0080 - U+009F	128 - 159	Unicode control C1	OK/OK but screen printouts misbehave, some pretty badly (!)
U+00A0	160	ISO/IEC 8859-1, Unicode WS	OK/OK
U+00A1 - U+00FF	161 - 255	ISO/IEC 8859-1	OK/OK
U+1680 U+2000 - U+202A U+2028, U+2029 U+202F, U+205F U+3000	5760 8192 - 8202 8232, 8233 8239, 8287 12288	Unicode WS whitespace	OK/OK but screen printouts look strange

In doing this I used the most recent released version of the netCDF library tools (netcdf library version 4.9.2 of Jun 6 2024 10:57:38).

With respect to ASCII, I think that this is a pretty strong indication of which characters (groups) should not be not accepted in variable and attribute names.

And, yes, I do think that it better to be explicit about this and expressly rule out those characters we know are likely to cause problems because the CF conventions are all about data exchange and interoperability.

I think that it would be good to get such a statement into CF-1.12, what do you think ?

ping @sethmcg @ChrisBarker-NOAA @JonathanGregory @ethanrd @Dave-Allured @DocOtak @davidhassell

0 replies

JonathanGregory · 2024-10-04T17:43:14Z

JonathanGregory
Oct 4, 2024
Maintainer

Dear @larsbarring et al.

Thanks for your thorough investigation, Lars, and thanks everyone for the discussion. The text which Lars quoted above is not the working version. Following conventions issue #237, section 2.3 now reads

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _. Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows almost all Unicode characters encoded as multibyte UTF-8 characters (NUG Appendix B). The netCDF interface also allows leading underscores in names, but the NUG states that this is reserved for system use. ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only.

which is consistent with the conformance document. That is, as Lars says, we recommend against a lot of characters. All characters except letters, digits, underscores and (for attributes only) ASCII 2D . and 2E - are recommended not to be used.

In the discussion of conventions #237 we agreed that all characters are allowed, despite the recommendation (which is not a requirement) not to use the majority of them. Lars commented that the CF conventions "essentially provide a whitelist of explicitly allowed characters. All other characters are not recommended (or recommended against) but not explicitly disallowed. But throughout this conversation there have been several remarks that some characters should indeed be explicitly disallowed. This could easily be done by ... creating a blacklist." That's what this discussion is about, if I understand correctly.

The last sentence of the working text as above is unsatisfactory, despite #237, because it says . and - are "allowed". Those two characters are certainly "allowed", because all characters are allowed. What it means is that we aren't recommending against them. We should fix that.

The working text is also unsatisfactory because it implies that the NUG prohibits some characters ("it allows almost all Unicode characters ...") but it doesn't say which ones are not allowed. NUG Appendix B says that names should match the regular expression

([a-zA-Z0-9_]|{MUTF8})([^\\x00-\\x1F/\\x7F-\\xFF]|{MUTF8})*
MUTF8        = <multibyte UTF-8 encoded, NFC-normalized Unicode character>

I suppose we should understand the regular expression to begin with ^ and end with $ i.e. it's the complete name. Do you agree?

Since ASCII is a subset of UTF-8, I think that by "multibyte UTF-8 encoded", the NUG must mean a Unicode character which is encoded in more than one byte by UTF-8. That is, MUTF8 doesn't include one-byte characters, among them the ASCII characters 00-7F. Do you agree?

If that's correct, the NUG does not allow / (which is in the middle of the second [...] expression), or the one-byte characters 00-1F (ASCII control characters) and 7F-FF. Lars's experiments agree that ncgen does not work with the control characters, / and 7F, but apparently it does work with 80-FF.

I think we should explicitly state that we prohibit 00-1F, / and 7F-FF, if I'm correct that NUG doesn't allow them anyway. The CF text is currently vague, but it says CF is "more restrictive" than the netCDF interface, for which it cites the NUG (although Lars's experiment shows that the netCDF interface is more forgiving than the NUG). This implies that CF upholds NUG restrictions, as you'd expect.

Also, the the CF working text is inconsistent with the NUG in saying "It is recommended that variable, dimension, attribute and group names begin with a letter". This is not merely a recommendation, because the NUG says that names must begin with a letter, digit, underscore or multi-byte UTF-8 character. We should fix this. Our text currently implies it's OK to start a name with a punctuation mark, for instance, which the NUG prohibits.

Lars's experiment shows that ncgen doesn't allow space (20) or any of the puncuation marks 21-2F except + - and ., nor any of the symbols 3A-3F, 5B-5E, 60, 7B-7F. To put it positively, ncgen allows only letters, digits, _ + - . and @. Thus, ncgen is more restrictive than the NUG.

I think it would be reasonable for CF to prohibit all those characters which ncgen doesn't support, and which therefore could not be used in CDL. That would be a backward-incompatible change, which we don't normally make, if in fact any existing data uses any of those characters in netCDF names. Given Lars's experience, however, it seems unlikely anyone would have used them, despite NUG allowing them.

We've decided to allow . and - in attribute names, but not other names. What about + and @, which are the only two one-byte characters so far not considered. NUG and ncgen allow them, so I think CF should continue to allow them. At the moment, we recommend that they should not be used, since they aren't in our current whitelist.

Best wishes

Jonathan

0 replies

DocOtak · 2024-10-05T00:49:33Z

DocOtak
Oct 5, 2024
Maintainer

Looking over this and the long original question. Is it worth separating variables into two categories: variables meant to be interpreted in a CF way, and variables that are not? I'm of the opinion that variable names basically don't matter and that all of the actual information is going to be inside the attribute values.

I would propose that for variables that are intended to be interpreted as CF variables, we are very restrictive: ASCII letters a-zA-Z, numbers 0-9, and the low line _. But other variables that are not meant to be interpreted by CF readers be unrestricted by CF. Perhaps noting that the use of chars outside the range allowed for CF variables can still cause problems in some software even if you don't intend those variable to be read.

I think that adding . and - was a little quick/premature/misstep as it originated in a proposal that is in progress.

8 replies

sethmcg Oct 8, 2024
Collaborator

I'm a little confused. I'm not saying anything about what you should or shouldn't do, I'm saying that, by definition, if there's file content that doesn't follow the CF standards, that content is not CF-compliant. And if someone is generating non-compliant data, again, by definition, they're not following the standard. Why would we put something in the standard about what people should do when they're not doing what we say they should do?

If you have a reason to put non-compliant data into an otherwise compliant file, that's fine. It's perfectly valid to choose to deviate from the standard because you have some other concern that trumps it. But I figure if somebody has an overriding reason to ignore a section of the standard, they're going to ignore it regardless of what it says. So I don't get why we'd want to say, basically, "even if you're not following the standard, we recommend that you follow the standard."

If you're suggesting that we add some commentary about why this particular part of the standard is the way it is, I'm all for it. But saying "we recommend adhering to these restrictions" just seems tautologically redundant to me. In my mind, that's what the CF standard is: a set of restrictions that we recommend adhering to because it maximizes data reusability and makes it less likely that software will fail to read the data.

ChrisBarker-NOAA Oct 8, 2024
Collaborator

@DocOtak: I'm curious as to why you tin kit important to allow a wider range of characters?

If a user is paying attention to conventions when preparing the file, it doesn't seem like that big a deal to just make sure all your names conform.

I know what you mean by " not meant to be interpreted by generic software" -- but even if a variable is a special use-specific variable, many (most?) folks are still going to want to read it with "generic" software like xarray, or plotting software, or ... and having who knows what in a variable names can get tricky.

It sure seems safer to hvae one set of rules.

DocOtak Oct 8, 2024
Maintainer

I should have included advice for what generic software should do, that is, ignore the variables and attributes it doesn't know how to deal with but interpret the ones it can handle. I remember Matlab being pretty good at reading as much as it can. We should strive a bit to meet data producers where they are. I'd like to be able to welcome them to CF with minimal friction and, once they are in, apply the pressure to make their data more interoperable. In other words, incremental adoption of CF within a single data file should be supported by CF.

Perhaps this is my age, but UTF-8 has been a thing for basically my entire computer life. UTF 8 has been required to be supported by internet protocols since 1998 (26 years). So I often have a feeling of "wow that's antiquated" whenever ASCII gets mentioned. But, people have done foolish things with variable names that we get to deal with: they eval them a symbols in their favorite programing language, we (CF) use spaces to separate variable names (aside, I'd like to see a comeback of the File, Group, Record, and Unit Separators in "CSV" files), popular data servers like ERDDAP have their own restrictions on charsets for format translation reasons. Perhaps the most foolish thing is to assume that plain text exists on a computer and they aren't just byte sequences.

When it comes to input validation, having a blacklist of things tends to be the wrong way to go. Lars even notes that this might be an effort that would be ongoing and we may never identify all the things that need to be on the blacklist. So I'd propose the rules be as follows:

CF Variables MUST be named with a-zA-Z0-9 or "_" (whitelist) and MUST NOT start with "_" and SHOULD NOT start with 0-9.
- Starting with "_" is reserved by NUG
- Starting with 0-9 prevents the variable from being used as a symbol in most programing languages
Variables with any other char are not valid CF variables and SHOULD be ignored by generic software to allow incremental adoption of CF
Only valid CF Variables SHOULD be considered when evaluating the validity of the entire dataset/file

sethmcg Oct 15, 2024
Collaborator

@DocOtak - I appreciate the spirit of these rules, but I don't think the wording works. As it stands, it can be read as doing away with validity, since anything invalid should be ignored when evaluating validity. I think they could work if you talked about the validity of the name rather than the validity of the variable, though.

But I'm still a bit confused about the issue of incremental adoption. I think there's some part of this that I'm just not getting, and that I'm not fully understanding your concern. Do you have an example of one of these files that has parts that can't be made compliant?

DocOtak Oct 16, 2024
Maintainer

@sethmcg I think some context that might be missing is that my office manages a historic observational dataset that has had its own bespoke data formats since the early 90s so we struggled a bit with an all at once adoption. Of the concerns with CF specifically, was if extra attributes were allowed at all. We couldn't, at the time at least, find anything that said extra attributes of any kind were ok.

So I'm coming from a position of wanting an explicit answer to the question "can I have extra variables in my file?" I want the answer to be "yes, and you can do whatever you need to do with those extra variables without any restriction". Technically, that means I rules such as the ones I've proposed above need to exist.

larsbarring · 2024-10-08T15:03:57Z

larsbarring
Oct 8, 2024
Collaborator Author

A couple of further comments to my analysis and to the subsequent comments/responses:

Now looking more carefully at NUG Appendix B that @JonathanGregory refers to, I realize that I did not escape the "special2" characters (in that sense my test was naive)

// special2 chars are recently permitted in
// names (and require escaping in CDL).
// Note: '/' is not permitted.
special2 = ' ' | '!' | '"' | '#' | '$' | '%' | '&' | ''' |
'(' | ')' | '*' | ',' | ':' | ';' | '<' | '=' |
'>' | '?' | '[' | '\' | ']' | '^' | '`' | '{' |
'|' | '}' | '~'

Consequently, they should be regarded as "OK/OK" (though I have not tested).
ASCII Control (x00 - x1F) characters are already excluded by NUG, as is the forward slash /
Of the "special2" characters the space have specific meaning in CF, namely as a delimiter. This has been noted in several previous comments and strongly argued against allowing it (here, here, here). And @Dave-Allured suggested a "limited restriction":

Variable names included in blank-separated lists such as ancillary_variables or coordinates must not include the ASCII space character.

But given the other comments I would argue that it is safer to blacklist the space character altogether. I can imagine that someone think that it is a good idea to disseminate files having a variable (or global attribute) named GCM name and then someone else at a later stage wants to do some ensemble analysis thus trying to put this variable or attribute in a coordinate attribute to create an ensemble dimension.
Regarding the backslash \, NUG allows it (needs escaping), but there were comments against (here and here here, but without specific reason given. But as the backslash is commonly used as escape character (as per NUG) it would be easy to end up in complicated and error-prone combinations.
Moreover, @sethmcg commented

In addition to the aforementioned problem it causes with ancillary variables, it's not uncommon in my experience for a lot of netcdf processing to happen on the command line (rather than neatly encapsulated within the confines of a general-purpose library) by piping the output of ncdump -h through various commands, in particular cut and grep. Whitespace would play havoc with those kinds of workflows, which in my opinion makes it an absolute showstopper.

This use case also makes me very leery about the prospect of allowing any character that is a special character in the shell or a regular expression. Let's not set up a situation that demands lots of quoting.

Further, this suggests to me that if the list of allowed characters is to be expanded, it should be via a whitelist approach rather than a blacklist approach; i.e., the default should be that characters are disallowed unless they have been carefully vetted.

which attracted thumbs-up from @jesusff, myself, @lesserwhirls, @taylor13, @czender, and explicit support from @JonathanGregory. Disallowing (i.e. blacklisting) all characters not explicitly in the whitelist, as Seth suggests, goes against the conclusion when that issue cf-convention#237 was closed. Instead they belong to what might be called a "greylist", i.e. "allowed but not recommended" (or recommended against).

However, allowing but not recommending all characters not explicitly disallowed by NUG is problematic for the following reasons:

space MUST be explicitly disallowed to not break CF as explained above.
In the ASCII range (20 - 7F (decimal 32 - 127)) all characters belonging to the "special2" (listed above) category of NUG should be explicitly disallowed. This follows the suggestion by Seth and supported by many (as cited above). What then remains is (as per NUG):

// special1 chars have traditionally been
// permitted in netCDF names.
special1 = '_'|'.'|'@'|'+'|'-'
Moreover, I strongly argue that ASCII del (7F (decimal 127)) and all Unicode control C1 characters (80 - 9F (decimal 128 - 159)) should be explicitly disallowed. Typically, these are non-printing characters that are not visible in a ncdump printout. Exceptions are 84 (decimal 132) that creates a line-feed, 9A- 9B (decimal 154 -155) that deletes the subsequent character, and particularly problematic, 9D - 9F (decimal 157 - 159) that make the python script silently stop (i.e. it did not even crash) when trying to print, and the same goes for ncdump -h.
Finally, I would equally strongly argue that all Unicode WS whitespace characters should be explicitly disallowed. They are printed as one or more space that are not easy or even possible to distinguish from the ASCII space without special measures. I do not think that we want to end up in a situation where users receiving a file may have to resort to octal/hex dumps of the output from ncdump to actually understand the content (this comment is equally valid for the previous point!).

I suggest that these four points should form the basis for creating a "blacklist" of characters that CF explicitly disallows despite that they are allowed by NUG. In principle this is a breaking change of what we previously agreed on in cf-conventions/#237, which still belong to the current draft version, and in practice the suggested list of characters to blacklist are typically not the ones one would expect to prime targets for users to include in new files.

3 replies

larsbarring Oct 8, 2024
Collaborator Author

PS, I have no intention to check the remaining almost 137 thousand Unicode characters, graphemes, codepoints or whatever they are called :-)

sethmcg Oct 8, 2024
Collaborator

So Lars, to sum up, you are recommending that we explicitly ban the following characters from use in variable and attribute names because they are known to cause technical problems:

all Unicode whitespace characters (including ordinary space)
all of the NUG "special2" characters ( !"#$%&'()*,:;<=>?[]^`{|}~)
ASCII del and all Unicode control C1 characters

Yes?

I am 100% in favor, and support this recommendation wholeheartedly.

larsbarring Oct 8, 2024
Collaborator Author

Thanks Seth for the excellent and succinct summary! Yes, this is exactly what I suggest, and in the longish comment provide concrete reasons for why these character (groups) are singled out.

larsbarring · 2024-10-08T15:38:01Z

larsbarring
Oct 8, 2024
Collaborator Author

On a partly different aspect, @JonathanGregory commented

The last sentence of the working text as above is unsatisfactory, despite #237, because it says . and - are "allowed". Those two characters are certainly "allowed", because all characters are allowed. What it means is that we aren't recommending against them. We should fix that.

I fully agree. The question is how to fix it. @DocOtak noted

I think that adding . and - was a little quick/premature/misstep as it originated in a proposal that is in progress.

which refers to this issue, and in particular this comment. Before we fix this particular sentence I think we should get some input regardingtheir views. I will shortly make comment over there .

0 replies

ChrisBarker-NOAA · 2024-10-08T19:37:40Z

ChrisBarker-NOAA
Oct 8, 2024
Collaborator

all Unicode whitespace characters (including ordinary space)

all of the NUG "special2" characters ( !"#$%&'()*,:;<=>?[]^`{|}~)

ASCII del and all Unicode control C1 characters

I think the NUG special characters are the "blacklist" -- the rest could (should) be defined in terms of the Unicode categories:

https://www.compart.com/en/unicode/category

e.g. No control characters. (ASCII DEL is a control character)

And maybe it's time now to specify which categories are allowed / not allowed?

1 reply

larsbarring Oct 8, 2024
Collaborator Author

Yes, I do agree that we should be more specific compared to the outcome of cf-convention/#237.

JonathanGregory · 2024-10-08T21:59:07Z

JonathanGregory
Oct 8, 2024
Maintainer

Dear all

My understanding of NUG Appendix B (as in my previous post) is that NUG prohibits ASCII control characters (00-1F) and slash / (as @larsbarring says), ASCII DEL (7F), and all one-byte codes 80-FF, not just the Unicode C1 control characters. If NUG prohibits them, so should we. Have I misunderstood it?
On the other hand, we now understand that ncgen as well as NUG permit the "special2" characters !"#$%&'()*,:;<=>?[\]^`{|}~ (including backslash \) and ASCII space (20). Lars and Seth are in favour of prohibiting these in CF (blacklist), whereas currently we allow but recommend against them (greylist). That would be a backward-incompatible change, in our usual generous sense that data written up to CF 1.11 could use them in names, but they would give an error from a checker following CF 1.12. I certainly agree it would be a bad idea to use any of these characters in a name, since it is very likely to cause problems. Is that a sufficient reason for prohibiting them in CF? If the majority think so, I'd agree to it.
Lars proposes that other whitespace Unicode characters should be prohibited, for the same good reasons. Again, I would agree if the majority are happy. Such characters are Unicode character class "Space Separator", key Zs. They include the "Ogham space mark"  , which isn't invisible!
NUG permits the "special1" characters _.@+-. _ is already on our whitelist. We have recently put .- on the whitelist for attribute names. (We could reverse that decision easily, since it's not in a released version of CF yet.) .- for variable and dimension names are on the greylist. @+ are on the greylist for all names.
All other ASCII characters are on the whitelist (letters and digits), and would remain so.
All Unicode code points requiring more than one byte in UTF-8 are on the greylist, and we would leave them there.

Best wishes

Jonathan

2 replies

larsbarring Oct 8, 2024
Collaborator Author

Regarding your first point, I am not expert enough to fully parse the regular expression you refer to. While a regex is appropriate in the that section of the NUG, I think (drawing on myself) that it is relevant for CF to express the character blacklist in plain language.

JonathanGregory Oct 10, 2024
Maintainer

I certainly agree we should describe the characters in plain text! I'm not sure either what the NUG means, especially because regular expressions have several dialects, and different languages protect special characters in various ways. The RE is [^\\x00-\\x1F/\\x7F-\\xFF]. Within [ ] you can specify individual characters or ranges of characters. Inside this [ ] I see \\x00-\\x1F, / and \\x7F-\\xFF. / is a single character, and I suppose the ranges mean 00-1F in hexadecimal, and 7F-FF in hexadecimal. The ^ at the start means the [ ] does not match any of the things inside it.

JonathanGregory · 2024-10-14T17:06:18Z

JonathanGregory
Oct 14, 2024
Maintainer

Please note that @Dave-Allured has opened conventions issue 548 to delete the sentence, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." in Sect 2.3. This sentence was inserted into the working version by conventions issue 477 for various reasons, including to support IETF BCP 47 language tags, discussed in conventions issue 528, which is still ongoing.

If Dave's proposal is accepted, the characters allowed for attribute names will be the same as for variable names in CF 1.12, which is the same as in CF 1.11, the most recently reduced version.

0 replies

ethanrd · 2024-10-15T00:59:04Z

ethanrd
Oct 15, 2024
Maintainer

Hi all - Sorry I'm late to this discussion. A few thoughts as I'm starting to catch up:

Please DO NOT consider the NUG a reliable source for Unicode information. The sections that mention Unicode were written some time ago (2008) and without an in-depth understanding of Unicode. I do feel confident saying the intent at the time was that the names of all netCDF objects (dimension, variable, attribute, group, etc.) should be valid UTF-8 strings that are NFC normalized and do not contain any control characters.

I believe the netCDF-C library validates that names are NFC normalized UTF-8 strings and without control characters (in the ASCII range) when creating a new netCDF dataset but not when reading (and maybe not when renaming). I believe the netCDF-Java library behaves in a simi8lar manner though I haven't tested as much.

I agree with the comment above from Chris @ChrisBarker-NOAA about using Unicode categories (list) to specify allowed and/or not allowed characters. Also an earlier comment about reviewing other documents on Unicode for identifiers/names, e.g., how the Python Language defines the syntax for Identifiers.

0 replies

JonathanGregory · 2024-10-23T17:56:48Z

JonathanGregory
Oct 23, 2024
Maintainer

Dear Lars

To summarise my earlier posting, I think we should replace the first paragraph of 2.3 along these lines:

The NetCDF interface requires the following for the name of any variable, dimension, attribute and group:

It must begin with an ASCII letter (a-z or A-Z), an ASCII digit (0-9), ASCII underscore _ or an NFC-normalised Unicode codepoint encoded in UTF-8 and requiring more than one byte.
The rest of the name can include any ASCII character or multi-byte UTF-8 Unicode codepoint except that it must not include any ASCII control character (decimal 0-31, hexadecimal 00-1F), space (decimal 32, hexadecimal 20), slash / (decimal 47, hexadecimal 2F) or DEL (decimal 127, hexadecimal 7F).

In addition to the NetCDF requirements, in CF

The name is recommended to begin with a letter, not a digit or _.
The name is recommended not to include any multi-byte UTF-8 Unicode codepoint.

and either

The name is required not to include any of the ASCII characters !"#$%&'()*,:;<=>?[\]^`{|}~ or any Unicode codepoint in class Zs "space separator".

or

The name is recommended not to include any of the ASCII characters !"#$%&'()*,:;<=>?[\]^`{|}~.

The or version is the status quo. Whether to adopt the either alternative is the main point at issue, I believe.

Best wishes

Jonathan

0 replies

ChrisBarker-NOAA · 2024-10-23T18:34:39Z

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

It must begin with an ASCII letter (a-z or A-Z), an ASCII digit (0-9), ASCII underscore _ or an NFC-normalised Unicode codepoint encoded in UTF-8 and requiring more than one byte.

wait, really! WTF? that makes absolutely no sense, as I read it "an NFC-normalised Unicode codepoint encoded in UTF-8 and requiring more than one byte" -- that is EVERY code point in Unicode -- including punctuation, control characters, various whitespace, -- this list goes on. Very, very odd that they could disallow all the non letter and digit ASCII codepoints, but allow all the non-ascii ones -- Huh?

I think it was Ethan that said that the netcdf handling of Unicode should not be considered thoughtful.

Anyway -- we probably should bring this up with the netcdf folks, but in the meantime, CF can be more restrictive, and it absolutely should be.

Perhaps we can re-define all this with a more appropriate extension from ASCII to Unicode -- e.g. "control code points are disallowed", "Letters are allowed" -- obviously spelled out in the proper language of Unicode.

NOTE: this is distinct from the blacklist issue -- which I do support.

0 replies

ChrisBarker-NOAA · 2024-10-23T18:36:52Z

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

Side note:

The search on the NUG here:

https://docs.unidata.ucar.edu/nug/current/index.html

is broken (I get a 404) for me.

how do I report that?

4 replies

Dave-Allured Oct 23, 2024

Works for me. Double checked, not cached. Off topic, let's take this off line. Reply to me privately if needed. Go through my icon.

Dave-Allured Oct 23, 2024

Oh sorry, now I see that you mean the "search box". They fixed that in the new NUG version 2, currently in draft.
https://docs.unidata.ucar.edu/nug/2.0-draft/

ChrisBarker-NOAA Oct 23, 2024
Collaborator

OT -- but thanks!

unfortunately, while that search box doesn't 404, it seems to only be searching the headings -- so I"m available to find any discussion of Unicode :-( -- but yes, OT.

Dave-Allured Oct 23, 2024

No longer OT. Try "site:docs.unidata.ucar.edu/nug unicode" in generic search engine such as google.

This did not work as expected for me, but it got me to the file format appendices in both version 1 and version 2 docs. They both address Unicode, and say about the same things.

ChrisBarker-NOAA · 2024-10-23T23:37:46Z

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

Back to On Topic: with Google's help I found the relevant text in the NUG:
(https://docs.unidata.ucar.edu/nug/current/file_format_specifications.html)
I couldn't find it in the draft version 2...

Note on names: Earlier versions of the netCDF C-library reference implementation enforced a more restricted set of characters in creating new names, but permitted reading names containing arbitrary bytes.

This specification extends the permitted characters in names to include multi-byte UTF-8 encoded Unicode and additional printing characters from the US-ASCII alphabet. The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_' (reserved for special names with meaning to implementations, such as the “_FillValue” attribute). Subsequent characters may also include printing special characters, except for '/' which is not allowed in names. Names that have trailing space characters are also not permitted.

So that's the NUG text -- and its handling of the Unicode addition is odd (or poorly written, or ...)

Perhaps what they mean by: "a multi-byte UTF-8 character" is actually:

"A Unicode "Letter" character", i.e. (Lu | Ll | Lt)

or maybe all L* code points? or ???

In any case, we certainly don't want control code points in there, and having no ASCII punctuation, but allow other punctuation as the first character makes no sense. and can a name start with a "combining lowline"? (https://unicode-explorer.com/c/0332) [1]

Where would one go to suggest an update to the NUG?

But in the meantime, CF can specify this all more clearly and precisely.

Should we start a new discussion for that, and keep this one (re)focused on the Blacklist?

[1] Just for fun -- here's an experiment:

In [19]: name_with_leading_combining_lowline
Out[19]: '̲aname'

In [20]: print([ord(c) for c in name_with_leading_combining_lowline])
[818, 97, 110, 97, 109, 101]

Notice how when I print the name, it ends up combining leading lowline with the quote character -- fun!

1 reply

Dave-Allured Oct 24, 2024

Despite some recent claims to the contrary, I have found the character set descriptions in the NUG to be perfectly accurate, precise, and up to date. The NUG descriptions take a bit of study and getting used to, but they are right on the mark, including that regex-y thing in the BNF.

Jonathan's recent interpretations are also accurate and precise. I did not check every last character in his lists, but I have no reason for doubt.

The one thing that I have not checked is the NUG versus the actual character guards in the most recent netcdf-C code version. I recall no recent changes, therefore no reason to believe there is any discrepancy. That part of the code is quite readable. So if anyone cares that much, go for it.

Because of the accuracy of current understandings -- ouch -- for those who care to study very carefully -- I suggest do not start a new issue at this time. We already have too much chaos in the character and string arena going on right now. Stay focused.

Dave-Allured · 2024-10-24T00:10:29Z

Dave-Allured
Oct 24, 2024

Here is how to understand "multi-byte UTF-8 character" as used in the NUG. Their abbreviation is MUTF8.

Today's UTF-8 includes byte sequences of 1, 2, 3, and 4 bytes. MUTF8 is ALL legal sequences, except for the 1-byte encodings. If you combine the single-byte sequences with MUTF8, you get the complete UTF-8 set.

0 replies

ChrisBarker-NOAA · 2024-10-24T00:51:11Z

ChrisBarker-NOAA
Oct 24, 2024
Collaborator

"ALL legal sequences" -- fair enough, that's how I interpreted it too -- but allowing ALL of these, including as the leading character of a name, makes no sense at all.

So this:

"The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_'"

And as I parse it, you can't use any of the ASCII punctuation marks as a leading charactor, but you can use any non-ascii punctuation charactor, for instance. huh?

In fact, you can use ANY non-ascii "character" -- including combining ones, whitespace, line feeds, other control characters, etc, etc. Really? If you are going to do that, why have any rules at all?

This is reminiscent of the kerfuffle over Unicode as the core string type in Python3 -- the only really challenging problem was file names. (sure there were issues with existing mojibake's data, etc, but those were mostly surmountable).

[I'll bring this around to the topic at hand, I promise]

The big issue was that apparently on nix systems, filenames (paths, etc) are simply stored as a char, and the only special values are null and 47 (/ the ASCII forward slash). This all worked great in the ASCII days, and not too badly in the extended ANSI days (e.g. latin-1, etc, etc...). However, the result was that folks could use pretty much any encoding, all on the same file system, and there was no way to know what the encoding was for any given path.

And all that is totally fine if all you need to do is pass a char* around, split on the slash, and maybe compare to other filenames.

And that all worked fine in Python2, where a string was simply a null-terminated string of bytes (i.e. a char*).

Enter Python3 and Unicode -- now you had to decode what's in the char* in order for Python to. be able to store it in a string. and that's not possible if you don't know the encoding.

This was a very long kerfuffle -- with folks writing, e.g. unix utilities, saying, 'why can't I just pass around the pile of bytes? I don't care what characters they actually mean -- within the code, it's just a pile of bytes.

And within the code, sure -- who cares? But what happens when you want to read that filename from a file? (or a web service) or write it to a text file, or show it to a person on the screen, or ?? The fact is, that outside of a computer program, filenames are text, and it's really helpful to have them be well described, human readable, etc...

Back to the topic at hand:

The NUG has selected utf-8 (and NFC normalization) so at least that's not a problem. And I can easily write code that can work with variable names, attribute. names, etc with any old code points in them -- (I use Python, so if it is valid utf-8, it can be decoded into a Python string, and I can do all. sorts of stuff with it -- no problem) -- other systems could work directly with the utf-8 encoded bytes.

But for CF -- we want files to be both computer and human readable -- an ncdump of the file should be comprehendible (and not trash your terminal settings).

And for THAT, it's a good idea to put some restrictions on allowable code points.

BTW -- my idea to start another discussion was so that we could focus this one on only the Blacklist idea.

0 replies

larsbarring · 2024-10-24T14:27:13Z

larsbarring
Oct 24, 2024
Collaborator Author

I am sorry that the discussion has been complicated and perhaps prolonged by the naive, and not even fully correct tests that I showed in an early post. My apologies.

@JonathanGregory, @Dave-Allured: Thank you for establishing and clarify what the NUG allows for characters in names !

In the light of the rather short time to the deadline for CF version 1.12 may I suggest a two-step approach:

An issue is opened to amend the conventions with the text suggested by @JonathanGregory. Some further wordsmithing might be required, for example I would like to see some brief (one sentence) explanations for why we make recommendations (and not). This will make the conventions more clear.
We continue to discuss a "blacklist" here. I think that @ChrisBarker-NOAA has made some very good points. After all, one of the main objectives of CF is to develop rules and recommendations for sharing data and support interoperability. And if one rule is "You are free to use almost any combination you like of all available characters" it will of course support the freedom of the data writer, but does it help data sharing and interoperability? Is this the freedom data writers really want and need? I do not think that this necessarily will be simple or easy, but I think that is a conversation we should have. Because I think that we should avoid ending up in a situation where data users have to do octal/hex/whatever dumps of the output from ncdump -h to understand what is in a file. That would be like telling users that tools like ncdump is getting out of fashion ....

1 reply

Dave-Allured Oct 24, 2024

Please go ahead and split the issue. I agree that it will help to focus discussions.

ChrisBarker-NOAA · 2024-10-24T18:50:49Z

ChrisBarker-NOAA
Oct 24, 2024
Collaborator

There are two different discussion here:

"What to say."

and

"How to say it."

For the "How to say it" I recommend that we strive to use modern Unicode terminology that should be clear to scientists (with perhaps a few parenthetical notes about ASCII for us old timers).

For example:

"multi-byte UTF-8 Unicode codepoint"

That is clearly defined for UTF-8, which is required by netCDF.

However, for users that are working with a higher level system (such as Python, or Java, or C++ on Windows [1] or ...) "multibyte code point" isn't really a obvious concept. For example:

The degree symbol: °: is code point: U+00B0, decimal 176.

In Python, you can use it like so:

In [43]: name = "temp_in_\N{DEGREE SIGN}C"

In [44]: print(name)
temp_in_°C

So -- to a Python programmer, how do they know if they have used a "multibyte code point"?

They can look at the utf-8 encoding of the string:

In [42]: name.encode('utf-8')
Out[42]: b'temp_in_\xc2\xb0C'

And if you know how to read that, you can see that that degree symbol is taking up 2 bytes.

Or you can check the lengths:

In [48]: len(name)
Out[48]: 10

In [49]: len(name.encode('utf-8'))
Out[49]: 11

So then you know that there's one multi-byte character in there.

You can also see the code point values:

In [45]: [ord(c) for c in name]
Out[45]: [116, 101, 109, 112, 95, 105, 110, 95, 176, 67]

In that case -- any value above 127 (the ASCII range) is a multi-byte character.

I would argue that the last approach -- looking at the code point values, is the most clear and obvious way to understand what's going on.

So -- my proposal is that we use language along the lines of:

"code point above 127" rather

than

"multi-byte UTF-8 Unicode codepoint"

NOTE: I'd love to use "code point" as much as possible, rather than "character" -- though I do note that a lot of the Unicode docs do use "character". Why? because there are control codes, combining glyphs, etc, that aren't really what most people think of as "characters". But oh well.

I'll put my thoughts on the "What to Say" in another comment.

[1] IIUC, Java and Windows (and .NET) use UTF-16 natively. At least in C/C++, that's stored as a wchar_t type. Anyway, the point is that in some (most?) programming, one is not working with UTF-8 directly, but rather encoding to utf-8 on I/O, and internally working with code points ( maybe -- UTF-16 is an unfortunate mess :-( ).

0 replies

ChrisBarker-NOAA · 2024-10-24T19:27:02Z

ChrisBarker-NOAA
Oct 24, 2024
Collaborator

For the "What to say":

Jonathan suggested:

In addition to the NetCDF requirements, in CF

The name is recommended to begin with a letter, not a digit or _.

The name is recommended not to include any multi-byte UTF-8 Unicode codepoint.

So at this point, we are recommending to stick with ASCII, yes? Sounds good -- I do think we should expand that later, but better to expand later than to have to restrict later. I would switch that order though, as then there is less need to define "letter", e.g, to paraphase:

only use ASCII character for names
the first charactor should be a letter

That's simple, and where we were at back in the day, yes?

The name is recommended not to include any of the ASCII characters !"#$%&'()*,:;<=>?[]^`{|}~.

OK -- that's good -- that's essentially Lars' "blacklist", but it's grey for now, yes?

but

or any Unicode codepoint in class Zs "space separator".

That's a bit odd, as we already recommended no non-ascii above, yes? But sure, good to exclude the ASCII space, and while we are doing that, excluding al the Unicode spaces -- for future reference -- makes sense.

I am a bit wary: at this point we are stating that netCDF allows almost anything, but we recommend sticking with ASCII.

But if someone really doesn't want to stick with ASCII -- e.g. they want a variable name in a non-english language - -then they are on their own, and we offer no recommendations at all. That seems less than optimum.

Perhaps there's no time to hash it out in time for CF version 1.12 -- I take it it's already too late to disallow non-ascii in names?

0 replies

JonathanGregory · 2024-10-25T11:37:15Z

JonathanGregory
Oct 25, 2024
Maintainer

Dear all

Responding to a few recent remarks:

@Dave-Allured wrote, "Today's UTF-8 includes byte sequences of 1, 2, 3, and 4 bytes. MUTF8 is ALL legal sequences, except for the 1-byte encodings. If you combine the single-byte sequences with MUTF8, you get the complete UTF-8 set." I think that is correct.
I did check the C source code in order to produce my summary of what is legal in netCDF names. The regular expression given in the NUG Appendix B is correct but, as Lars remarked, not everyone is fluent in regular expressions, so CF can help by expressing it in English.
Chris's remark is at least partly correct that, "You can't use any of the ASCII punctuation marks as a leading character, but you can use any non-ASCII ... character -- including combining ones, whitespace, line feeds, other control characters." I don't know about line feeds and control characters. ASCII control characters (hex 00-1F, including linefeed 0A) are not allowed - if there are other such Unicode codepoints, they would be allowed. The ASCII space is not allowed, but other whitespace is allowed, that's right, surprising though it seems.
I agree with Chris's suggestion that, in our text, it would be useful to explain that "Unicode codepoint encoded in UTF-8 and requiring more than one byte" means any single character in a text string which takes more than one byte in the UTF-8 encoding.
Chris is correct that the existing CF text recommends not to use any non-ASCII character (all non-ASCII characters are multi-byte UTF-8 sequences).
Where I presented two options, the first one is what CF says now, and the second is Lars's alternative, that we should prohibit some characters that we currently recommend not to be used, including class Zs, which is presently among those which are recommended against (greylist) because they're not ASCII.
I agree with Lars's suggestion of opening an issue to clarify the convention. The text I suggested earlier was only "along the lines of", not a detailed proposal, and I agree it needs more clarification. It has two purposes: (a) state the netCDF rules clearly, (b) state the current CF recommendations clearly. The first paragraph of 2.3 has these purposes, but is not as clear as it could be. An issue to clarify the status quo could be a defect issue. The purpose is just to improve the text, not to change the convention.
On the other hand, we haven't yet agreed on whether CF should require certain characters not to be used, which are currently recommended not to be used. The characters concerned are the ASCII characters !"#$%&'()*,:;<=>?[\]^`{|}~ and any Unicode codepoint in class Zs "space separator". I would suggest, Lars, that you could open a separate issue for that, as an enhancement proposal, if you wish. As I said before, I think it could be supported by the CF principle that we try to minimise the potential for mistakes. Does anyone have a real use case of a mistake, confusion or waste of time that has already arisen from using any of those characters in a netCDF name?
If both of those issues are opened, we could probably conclude this discussion.
Also open is @Dave-Allured's issue to reverse the change agreed to CF1.12 which allows - and . in attribute names. If you have a view on this, please say so in that issue. If it isn't sufficiently supported to be accepted by the deadline, the change will be made in CF 1.12, as previously agreed, which inserts a sentence that explicitly says those two characters are fine, not greylisted, for attribute names.

Best wishes

Jonathan

0 replies

Dave-Allured · 2024-10-25T16:47:11Z

Dave-Allured
Oct 25, 2024

I am leaning toward the following. Note my previous remarks.

On characters and character sets, keep the CF text simple. Excruciatingly simple.
There is no need to repeat in detail the exact character rules from the NUG.
Blacklisting or greylisting should have an explicit justification (here in discussion, not in the text).
Not valid justifications include variable name in a programming language, and use without valid quotations in context.
From these last two, I get no need whatsoever for any blacklist or greylist. Therefore, no PR needed.

0 replies

ethanrd · 2024-10-25T17:36:59Z

ethanrd
Oct 25, 2024
Maintainer

As I mentioned in my comment above and I think in agreement with some comments by @ChrisBarker-NOAA , there are lots of characters in the multi-byte UTF-8 character set (non-ASCII UTF-8) that are allowed by the NUG that should not be used in netCDF object names (control characters, emoticons, and dingbats were the examples I gave above). To me, it is a mistake in the NUG to allow all non-ASCII UTF-8 characters.

However, that is another discussion. With respect to CF, I think there should be a strong recommendation to limit the characters used in netCDF object names and some details on a few of the reasons data producers might want to limit the set of characters allowed in netCDF object names.

Because of the complexity of Unicode, I suspect an allow list would work better than a disallow list and much of it should be based on Unicode character categories.

Anyway, here's my attempt to capture some reasons for limiting characters and levels of Unicode capabilities:

Though netCDF allows most Unicode characters in the names of netCDF objects (variable, attributes, groups, etc.), there are a number of reasons to limit the characters to a subset of those characters. Those reasons include interoperability with existing (often not Unicode aware) software and various security issues (see Unicode Technical Report #36: Security Considerations and Unicode Standard Annex (UAX) #31: Unicode Identifiers and Syntax).

For maximum interoperability with non-Unicode aware software, it is recommended that names should only contain [A-Za-z0-9_].

For minimal Unicode support, it is recommended that names should only contain characters in the following Unicode character categories of Lowercase Letter (Ll), Uppercase Letter (Lu), Titlecase Letter (Lt), Decimal Number (Nd), and Connector Punctuation (Pc).

There are other Unicode character categories that should probably be added to the above list, or to another list aimed at more maximal Unicode support.

0 replies

Dave-Allured · 2024-10-25T17:53:36Z

Dave-Allured
Oct 25, 2024

@ethanrd Okay. Given that the topic of character sets is hideously complicated, a general statement that object names should be sensible and legible could be okay. Surely we can rely on data producers to create reasonable object names without expanding CF with intricate rules. If the occasional dingbat or joiner slips in, I really don't care. I can digest any piece of Unicode whatsoever, that is thrown my way.

Meanwhile, character rules have unintended consequences, as has been noted several times in the history of netcdf.

0 replies

ChrisBarker-NOAA · 2024-10-25T18:16:34Z

ChrisBarker-NOAA
Oct 25, 2024
Collaborator

Surely we can rely on data producers to create reasonable object names without expanding CF with intricate rules.

The rules do not have to be intricate. In a way, not more intricate than they were with ASCII -- I think we should essentially translate the ASCII rules to Unicode terms -- e.g. no control characters, and things like that.

If the occasional dingbat or joiner slips in, I really don't care. I can digest any piece of Unicode whatsoever, that is thrown my way.

And that is the difference in perspective -- in many cases, you can treat a name as simply a bunch of bytes, with only a few values that have special meaning. Great. And that's why the netcdf lack-of-limitations is probably fine -- it does require NFC normalization, which is the one thing that "a string of bytes" would not work correctly without.

However, in CF, we have larger concerns -- human readability, and compatibility with other tools, etc, are important.

So text is not a "string of bytes" -- it's human readable text, and should be kept that way.

For CF, If the occasional dingbat or joiner slips in, then it could affect readability and downstream tools -- so we DO care. And frankly, with all the non-CF compliant files I see out there -- folks ARE going to stick oddball characters into strings, and it's OK that that wouldn't be CF compliant -- as you say, "I can digest any piece of Unicode whatsoever, that is thrown my way." so your tools won't break, but it could be ugly for users.

As for the "blacklist": I thought the idea was to capture the particular characters that could, e.g. break / confuse CDL (and other similar tools?) -- I think that's important, but maybe those are already not allowed in the existing rules (all being ASCII)

0 replies

ChrisBarker-NOAA · 2024-10-25T18:17:56Z

ChrisBarker-NOAA
Oct 25, 2024
Collaborator

Meanwhile, character rules have unintended consequences

Hmm -- yes. But I think being overly permissive upfront is more likely to create unintended consequences.

0 replies

Creating a blacklisting certain characters from variable and attribute names #323

larsbarring May 31, 2024 Collaborator

Topic for discussion

Replies: 43 comments · 39 replies

davidhassell May 31, 2024 Maintainer

larsbarring May 31, 2024 Collaborator Author

efisher008 Jun 13, 2024 Maintainer

sethmcg Jun 13, 2024 Collaborator

ChrisBarker-NOAA Jun 13, 2024 Collaborator

sethmcg Jun 14, 2024 Collaborator

ChrisBarker-NOAA Jun 14, 2024 Collaborator

sethmcg Jun 14, 2024 Collaborator

larsbarring Jun 17, 2024 Collaborator Author

larsbarring Oct 4, 2024 Collaborator Author

JonathanGregory Oct 4, 2024 Maintainer

DocOtak Oct 5, 2024 Maintainer

sethmcg Oct 8, 2024 Collaborator

ChrisBarker-NOAA Oct 8, 2024 Collaborator

DocOtak Oct 8, 2024 Maintainer

sethmcg Oct 15, 2024 Collaborator

DocOtak Oct 16, 2024 Maintainer

larsbarring Oct 8, 2024 Collaborator Author

larsbarring Oct 8, 2024 Collaborator Author

sethmcg Oct 8, 2024 Collaborator

larsbarring Oct 8, 2024 Collaborator Author

larsbarring Oct 8, 2024 Collaborator Author

ChrisBarker-NOAA Oct 8, 2024 Collaborator

larsbarring Oct 8, 2024 Collaborator Author

JonathanGregory Oct 8, 2024 Maintainer

larsbarring Oct 8, 2024 Collaborator Author

JonathanGregory Oct 10, 2024 Maintainer

JonathanGregory Oct 14, 2024 Maintainer

ethanrd Oct 15, 2024 Maintainer

JonathanGregory Oct 23, 2024 Maintainer

ChrisBarker-NOAA Oct 23, 2024 Collaborator

ChrisBarker-NOAA Oct 23, 2024 Collaborator

ChrisBarker-NOAA Oct 23, 2024 Collaborator

ChrisBarker-NOAA Oct 23, 2024 Collaborator

ChrisBarker-NOAA Oct 24, 2024 Collaborator

larsbarring Oct 24, 2024 Collaborator Author

ChrisBarker-NOAA Oct 24, 2024 Collaborator

ChrisBarker-NOAA Oct 24, 2024 Collaborator

JonathanGregory Oct 25, 2024 Maintainer

ethanrd Oct 25, 2024 Maintainer

ChrisBarker-NOAA Oct 25, 2024 Collaborator

ChrisBarker-NOAA Oct 25, 2024 Collaborator

larsbarring
May 31, 2024
Collaborator

Replies: 43 comments 39 replies

davidhassell
May 31, 2024
Maintainer

larsbarring May 31, 2024
Collaborator Author

efisher008 Jun 13, 2024
Maintainer

sethmcg Jun 13, 2024
Collaborator

ChrisBarker-NOAA
Jun 13, 2024
Collaborator

sethmcg
Jun 14, 2024
Collaborator

ChrisBarker-NOAA
Jun 14, 2024
Collaborator

sethmcg
Jun 14, 2024
Collaborator

larsbarring
Jun 17, 2024
Collaborator Author

larsbarring
Oct 4, 2024
Collaborator Author

JonathanGregory
Oct 4, 2024
Maintainer

DocOtak
Oct 5, 2024
Maintainer

sethmcg Oct 8, 2024
Collaborator

ChrisBarker-NOAA Oct 8, 2024
Collaborator

DocOtak Oct 8, 2024
Maintainer

sethmcg Oct 15, 2024
Collaborator

DocOtak Oct 16, 2024
Maintainer

larsbarring
Oct 8, 2024
Collaborator Author

larsbarring Oct 8, 2024
Collaborator Author

sethmcg Oct 8, 2024
Collaborator

larsbarring Oct 8, 2024
Collaborator Author

larsbarring
Oct 8, 2024
Collaborator Author

ChrisBarker-NOAA
Oct 8, 2024
Collaborator

larsbarring Oct 8, 2024
Collaborator Author

JonathanGregory
Oct 8, 2024
Maintainer

larsbarring Oct 8, 2024
Collaborator Author

JonathanGregory Oct 10, 2024
Maintainer

JonathanGregory
Oct 14, 2024
Maintainer

ethanrd
Oct 15, 2024
Maintainer

JonathanGregory
Oct 23, 2024
Maintainer

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

ChrisBarker-NOAA Oct 23, 2024
Collaborator

ChrisBarker-NOAA
Oct 23, 2024
Collaborator

ChrisBarker-NOAA
Oct 24, 2024
Collaborator

larsbarring
Oct 24, 2024
Collaborator Author

ChrisBarker-NOAA
Oct 24, 2024
Collaborator

ChrisBarker-NOAA
Oct 24, 2024
Collaborator

JonathanGregory
Oct 25, 2024
Maintainer

ethanrd
Oct 25, 2024
Maintainer

ChrisBarker-NOAA
Oct 25, 2024
Collaborator

ChrisBarker-NOAA
Oct 25, 2024
Collaborator