Blacklisting certain characters from variable and attribute names #323
-
Surely we don't want to disallow underscores!
-
Hmm -- I like this idea. But first I think we should make clear what the (long-term) goal is: Unicode is very complex, with a lot of subtleties. There are efforts to manage that with normalization (https://www.unicode.org/reports/tr15/) and with categorization of code points (the General Category: a partition of the characters into major classes such as letters, punctuation, and symbols, and further subclasses for each of the major classes). Etc. So I think we have essentially three options: (1) keep the current restrictive rules as they are; (2) adopt Unicode's own machinery (normalization and/or categories) to define what is allowed; or (3) allow anything except an explicit blacklist.
I think the whole point of this discussion is that we don't want to do (1) anymore. For (2) -- it seems appealing, but there's a lot of complexity, e.g. (from the Unicode spec)
So it can get messy. Nevertheless, there is precedent -- for instance, Python has the following rules: https://docs.python.org/3/reference/lexical_analysis.html#identifiers A bit messy, but do-able. However, there are still a number of complications -- one is NFKC normalization, and another is that Python treats some different Unicode characters as equivalent (e.g. Blackboard Bold "B", U+1D539, is the same as capital B, U+0042) -- but only in contexts where the normalization is done (e.g. processing source code, but not when meta-programming). Frankly, it's a bit of a mess if people really do use the broad range of allowable characters. That being said, I think that the CF problem is easier than Python's, as CF isn't providing normalization -- only enforcement. I'm inclined (at the moment -- I haven't thought it through too carefully) to go with (3) -- allow any Unicode code point except a given blacklist. Note that I say code point, not character, as some characters can be represented by different code points (e.g. accented characters). If we simply go by code point, then there is no issue of normalization, or anything else. (Hmm, option 3(b) -- any code point, but in a particular normalization?) Though maybe that's too much of a wild west?
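(As a concrete illustration of that equivalence -- a minimal sketch using Python's unicodedata; the example is mine, not from the thread:)

```python
import unicodedata

# MATHEMATICAL DOUBLE-STRUCK CAPITAL B (U+1D539) NFKC-normalizes to a
# plain capital B (U+0042) -- the equivalence Python applies when it
# processes identifiers in source code.
bb = "\U0001D539"
print(unicodedata.normalize("NFKC", bb) == "B")  # True
print(unicodedata.normalize("NFC", bb) == "B")   # False: NFC keeps them distinct
```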
-
I'd like to course-correct the discussion a bit, if I may. This is not a proposal to expand the list of allowed characters in a wide-reaching way. That's what #237 is about, and a number of folks (including me and Lars) concluded that it would be unwise; there are a lot of security and interoperability concerns that make it important to consider any expansions of the list carefully and cautiously before adding them. I believe what Lars is proposing is that we add an explicit, stand-alone listing of the sets of banned and allowed characters, rather than only having them defined implicitly in the text of section 2.3. I can see the value in that, but I think we shouldn't frame it as a list of banned characters, because that implies that anything not on the list is allowed, and as discussed in #237, there are important reasons that the default answer for whether a character is allowed should be "no". I think we should have an explicit list of allowed characters, with an accompanying list (maybe an extra column) of clarifications to cover the known disallowed characters that Lars suggests. So maybe something like:
-
@sethmcg: sorry about that -- I think it was me that expanded the conversation. However, the reason I did that is that I don't see how we can talk about a blacklist without the context of what's allowed, so I was trying to get at that. However, I think maybe I get it now -- this proposal for a "blacklist" is more about internally defining the rules clearly now, and guiding any potential expansion in the future -- e.g.: whatever we do, we won't allow THESE characters :-) I see the point of that, so carry on :-) To that point:
I find this odd to say -- are ANY other non-ASCII characters -- any number of other symbols, punctuation, etc. -- allowed? I think I get the point here, but it's an odd phrasing. I think the point is that folks may be tempted to (or accidentally) use another symbol that "looks like" a dash. In fact, I've had that issue in a totally different context, where something was copied and pasted from an application that had (helpfully) auto-changed an ASCII dash to an en-dash. So I don't see this as a blacklist so much as a "be cautious of these" list -- at least in that example. Which I do think is good to document. The real blacklist items are the ones that will break other aspects of CF / netCDF (e.g. have special meaning in CDL). -CHB
-
I hadn't thought about compiling the list of characters that we definitely don't want to add for various technical reasons, just to have a consolidated reference for what they are and why they're banned. I agree that that would be a very useful thing to have, but I'm not sure about adding it to CF proper. I worry that people would see it and think of it as the complete list of all disallowed characters, and that everything else is allowed. Maybe we want to have that list, but make it an adjunct document of some kind, like the Guidelines for Constructing Standard Names? Or put it in an appendix?
-
Wow, I was away from this issue for a few days, during which there has been a lot of activity and many good points. When opening this discussion, what I had in mind was a rather modest extension to section 2.3, where the relevant part reads
Essentially this allows, as a recommendation, the US-ASCII letters, digits and underscore (or their Unicode counterparts), as well as period and hyphen for attribute names. All other characters are implicitly not recommended (or "should not" be used), but not explicitly excluded or forbidden. What I had in mind was to marginally reduce this huge set of not-recommended characters by explicitly disallowing the few characters that we already now know will create problems. So far I am aware of the following, all within the US-ASCII character set: control characters (decimal 0 ... 31, 127), […]. Based on this, my simplistic suggestion is to add, immediately after the text cited above, a sentence something like
In this minimal way we avoid all complications in relation to Unicode, and focus on those few characters that we all agree, I think, cannot be used. All other punctuation (whether ASCII or Unicode), Unicode control characters and whatnot, remain as is, which basically means to be sorted out in the future.
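(As a sketch of how small the resulting check would be -- the character set below is assembled from the lists in this thread, not from any agreed wording:)

```python
# Hypothetical checker for the minimal ASCII blacklist discussed above:
# control characters (0-31, 127), space, '/', ':' and '\'.
BLACKLIST = set(map(chr, range(0x00, 0x20))) | {"\x7f", " ", "/", ":", "\\"}

def find_blacklisted(name: str) -> list[str]:
    """Return the blacklisted code points occurring in a proposed name."""
    return [f"U+{ord(c):04X}" for c in name if c in BLACKLIST]

print(find_blacklisted("air_temperature"))  # []
print(find_blacklisted("air temperature"))  # ['U+0020']
print(find_blacklisted("x:y/z"))            # ['U+003A', 'U+002F']
```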
-
I have now explored this in some more detail using a Python script to insert various Unicode characters into the variable name in a small .cdl file, and then using ncgen to generate a .nc file. In the same script I used NCO/ncrename to try to change the same character of a variable name in a working nc-file to each of the other characters in the list, and then used ncdump to create a cdl file. Thus it is not a full round-trip, because of the NCO step. I focussed on ASCII (decimal 0 - 127), ISO/IEC 8859-1 (decimal 0 - 255) and control (C1), as well as Unicode whitespace (WS) groups (all according to Wikipedia). Here is the result:
In doing this I used the most recent released version of the netCDF library tools (netCDF library version 4.9.2 of Jun 6 2024 10:57:38). With respect to ASCII, I think that this is a pretty strong indication of which characters (or groups of characters) should not be accepted in variable and attribute names. And, yes, I do think that it is better to be explicit about this and expressly rule out those characters we know are likely to cause problems, because the CF conventions are all about data exchange and interoperability. I think that it would be good to get such a statement into CF-1.12. What do you think? ping @sethmcg @ChrisBarker-NOAA @JonathanGregory @ethanrd @Dave-Allured @DocOtak @davidhassell
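(For reference, a minimal sketch of the ncgen half of such a test might look like the following; the CDL template and file names are my own illustrative choices, not the original script:)

```python
#!/usr/bin/env python3
"""For each candidate code point, insert it into a variable name in a
small CDL template and record whether ncgen accepts it."""
import os
import subprocess
import tempfile

CDL_TEMPLATE = """netcdf test {{
dimensions:
    x = 2 ;
variables:
    float va{ch}r(x) ;
data:
    va{ch}r = 1, 2 ;
}}
"""

def ncgen_accepts(cp: int) -> bool:
    """Return True if ncgen builds a .nc file from a name containing chr(cp)."""
    cdl = CDL_TEMPLATE.format(ch=chr(cp))
    with tempfile.TemporaryDirectory() as tmp:
        cdl_path = os.path.join(tmp, "test.cdl")
        nc_path = os.path.join(tmp, "test.nc")
        with open(cdl_path, "w", encoding="utf-8") as f:
            f.write(cdl)
        result = subprocess.run(["ncgen", "-o", nc_path, cdl_path],
                                capture_output=True)
        return result.returncode == 0 and os.path.exists(nc_path)

# ASCII and the Latin-1 supplement, as in the analysis above:
for cp in range(0x100):
    print(f"U+{cp:04X}: {'accepted' if ncgen_accepts(cp) else 'rejected'}")
```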
-
Dear @larsbarring et al. Thanks for your thorough investigation, Lars, and thanks everyone for the discussion. The text which Lars quoted above is not the working version. Following conventions issue #237, section 2.3 now reads
which is consistent with the conformance document. That is, as Lars says, we recommend against a lot of characters: all characters except letters, digits, underscores and (for attributes only) ASCII 2D (hyphen). In the discussion of conventions #237 we agreed that all characters are allowed, despite the recommendation (which is not a requirement) not to use the majority of them. Lars commented that the CF conventions "essentially provide a whitelist of explicitly allowed characters. All other characters are not recommended (or recommended against) but not explicitly disallowed. But throughout this conversation there have been several remarks that some characters should indeed be explicitly disallowed. This could easily be done by ... creating a blacklist." That's what this discussion is about, if I understand correctly. The last sentence of the working text as above is unsatisfactory, despite #237, because it says […]. The working text is also unsatisfactory because it implies that the NUG prohibits some characters ("it allows almost all Unicode characters ...") but it doesn't say which ones are not allowed. NUG Appendix B says that names should match the regular expression
I suppose we should understand the regular expression to begin with […]. Since ASCII is a subset of UTF-8, I think that by "multibyte UTF-8 encoded", the NUG must mean a Unicode character which is encoded in more than one byte by UTF-8. That is, MUTF8 doesn't include one-byte characters, among them the ASCII characters 00-7F. Do you agree? If that's correct, the NUG does not allow […]. I think we should explicitly state that we prohibit 00-1F, […]. Also, the CF working text is inconsistent with the NUG in saying "It is recommended that variable, dimension, attribute and group names begin with a letter". This is not merely a recommendation, because the NUG says that names must begin with a letter, digit, underscore or multi-byte UTF-8 character. We should fix this. Our text currently implies it's OK to start a name with a punctuation mark, for instance, which the NUG prohibits. Lars's experiment shows that […]. I think it would be reasonable for CF to prohibit all those characters which […]. We've decided to allow […].

Best wishes

Jonathan
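(Since the regular expression itself is not reproduced above, a hedged sketch of a checker for the rule as paraphrased in this comment -- first character a letter, digit, underscore, or multi-byte character; no control characters, DEL, or forward slash anywhere. This is one reading of the rule, not the NUG's actual expression:)

```python
import re

NAME_RE = re.compile(
    r"^[A-Za-z0-9_\u0080-\U0010FFFF]"  # first code point
    r"[^\x00-\x1F\x7F/]*$"             # remaining code points
)

for name in ["temperature", "_t2m", "°C", "bad/name", "9lives"]:
    print(name, bool(NAME_RE.match(name)))
# "bad/name" fails (forward slash); "9lives" passes (digit first is allowed).
```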
-
Looking over this and the long original question: is it worth separating variables into two categories -- variables meant to be interpreted in a CF way, and variables that are not? I'm of the opinion that variable names basically don't matter and that all of the actual information is going to be inside the attribute values. I would propose that for variables that are intended to be interpreted as CF variables, we are very restrictive: ASCII letters […]. I think that adding […]
-
A couple of further comments on my analysis and on the subsequent comments/responses:
However, allowing but not recommending all characters not explicitly disallowed by the NUG is problematic for the following reasons:
I suggest that these four points should form the basis for creating a "blacklist" of characters that CF explicitly disallows even though they are allowed by the NUG. In principle this is a breaking change to what we previously agreed on in cf-conventions/#237, which still belongs to the current draft version; in practice, the suggested characters to blacklist are typically not the ones one would expect to be prime targets for users to include in new files.
-
On a partly different aspect, @JonathanGregory commented
I fully agree. The question is how to fix it. @DocOtak noted
which refers to this issue, and in particular this comment. Before we fix this particular sentence I think we should get some input regarding their views. I will shortly make a comment over there.
-
I think the NUG special characters are the "blacklist" -- the rest could (should) be defined in terms of the Unicode categories: https://www.compart.com/en/unicode/category -- e.g. no control characters (ASCII DEL is a control character). And maybe it's time now to specify which categories are allowed / not allowed?
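(A hedged sketch of what a category-based rule could look like in practice, using Python's unicodedata; the allowed set below is illustrative, not a CF decision:)

```python
import unicodedata

# Illustrative allow-list of Unicode General Categories:
# L* = letters, Nd = decimal digits, Pc = connector punctuation (underscore).
ALLOWED = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Pc"}

def name_ok(name: str) -> bool:
    """True if every code point in the name is in an allowed category."""
    return all(unicodedata.category(c) in ALLOWED for c in name)

print(name_ok("air_temperature"))  # True
print(name_ok("air temperature"))  # False: space is category Zs
print(name_ok("temp\x7f"))         # False: DEL is category Cc (control)
```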
-
Dear all
Best wishes

Jonathan
-
Please note that @Dave-Allured has opened conventions issue 548 to delete the sentence, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." in Sect 2.3. This sentence was inserted into the working version by conventions issue 477 for various reasons, including to support IETF BCP 47 language tags, discussed in conventions issue 528, which is still ongoing. If Dave's proposal is accepted, the characters allowed for attribute names will be the same as for variable names in CF 1.12, which is the same as in CF 1.11, the most recently released version.
-
Hi all - Sorry I'm late to this discussion. A few thoughts as I'm starting to catch up: please DO NOT consider the NUG a reliable source for Unicode information. The sections that mention Unicode were written some time ago (2008) and without an in-depth understanding of Unicode. I do feel confident saying the intent at the time was that the names of all netCDF objects (dimension, variable, attribute, group, etc.) should be valid UTF-8 strings that are NFC normalized and do not contain any control characters. I believe the netCDF-C library validates that names are NFC-normalized UTF-8 strings without control characters (in the ASCII range) when creating a new netCDF dataset, but not when reading (and maybe not when renaming). I believe the netCDF-Java library behaves in a similar manner, though I haven't tested it as much. I agree with the comment above from Chris @ChrisBarker-NOAA about using Unicode categories (list) to specify allowed and/or not allowed characters. Also an earlier comment about reviewing other documents on Unicode for identifiers/names, e.g., how the Python Language defines the syntax for Identifiers.
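(As an illustration of the NFC point, a minimal sketch using Python's unicodedata -- my own example, not the C library's code; requires Python 3.8+ for is_normalized:)

```python
import unicodedata

# Reject names that are not NFC-normalized or that contain ASCII-range
# control characters, mirroring the intent described above.
def is_valid_name(name: str) -> bool:
    no_controls = not any(c < " " or c == "\x7f" for c in name)
    return no_controls and unicodedata.is_normalized("NFC", name)

print(is_valid_name("caf\u00e9"))   # True: precomposed e-acute
print(is_valid_name("cafe\u0301"))  # False: decomposed e + combining acute
print(is_valid_name("tab\tname"))   # False: contains a control character
```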
-
Dear Lars

To summarise my earlier posting, I think we should replace the first paragraph of 2.3 along these lines: The NetCDF interface requires the following for the name of any variable, dimension, attribute and group:
In addition to the NetCDF requirements, in CF
and either
or
The "or" version is the status quo. Whether to adopt the "either" alternative is the main point at issue, I believe.

Best wishes

Jonathan
-
Wait, really! WTF? That makes absolutely no sense. As I read it, "an NFC-normalised Unicode codepoint encoded in UTF-8 and requiring more than one byte" -- that is EVERY non-ASCII code point in Unicode -- including punctuation, control characters, various whitespace; the list goes on. Very, very odd that they could disallow all the non-letter and non-digit ASCII code points, but allow all the non-ASCII ones -- huh? I think it was Ethan that said that the netCDF handling of Unicode should not be considered thoughtful. Anyway -- we probably should bring this up with the netCDF folks, but in the meantime, CF can be more restrictive, and it absolutely should be. Perhaps we can re-define all this with a more appropriate extension from ASCII to Unicode -- e.g. "control code points are disallowed", "letters are allowed" -- obviously spelled out in the proper language of Unicode. NOTE: this is distinct from the blacklist issue -- which I do support.
-
Side note: the search on the NUG here: https://docs.unidata.ucar.edu/nug/current/index.html is broken (I get a 404) for me. How do I report that?
-
Back on topic: with Google's help I found the relevant text in the NUG:
So that's the NUG text -- and its handling of the Unicode addition is odd (or poorly written, or ...). Perhaps what they mean by "a multi-byte UTF-8 character" is actually "a Unicode "Letter" character", i.e. (Lu | Ll | Lt), or maybe all L* code points? Or ??? In any case, we certainly don't want control code points in there, and having no ASCII punctuation but allowing other punctuation as the first character makes no sense. And can a name start with a "combining lowline" (https://unicode-explorer.com/c/0332)? [1] Where would one go to suggest an update to the NUG? But in the meantime, CF can specify this all more clearly and precisely. Should we start a new discussion for that, and keep this one (re)focused on the blacklist? [1] Just for fun -- here's an experiment:
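(The original snippet was not preserved in this thread; the following is a hedged reconstruction of that kind of experiment, in Python:)

```python
# Hypothetical reconstruction -- a name whose first code point is
# COMBINING LOW LINE (U+0332), which the NUG rule appears to permit.
name = "\u0332var"

# When printed inside quotes, the combining mark attaches to whatever
# precedes it -- here the opening quote character itself:
print(f"'{name}'")
print(len(name))  # 4 code points, though it renders as 3 glyphs
```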
Notice how, when I print the name, the leading lowline ends up combining with the quote character -- fun!
-
Here is how to understand "multi-byte UTF-8 character" as used in the NUG. Their abbreviation is MUTF8. Today's UTF-8 includes byte sequences of 1, 2, 3, and 4 bytes. MUTF8 is ALL legal sequences except the 1-byte encodings. If you combine the single-byte sequences with MUTF8, you get the complete UTF-8 set.
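(To make the byte counts concrete, a small sketch -- the example characters are mine:)

```python
# UTF-8 sequence lengths for sample code points; everything except the
# first line falls under the NUG's MUTF8 (multi-byte) set.
for ch in ("A",            # U+0041 -> 1 byte  (ASCII, not MUTF8)
           "\u00b0",       # U+00B0 -> 2 bytes (degree sign)
           "\u20ac",       # U+20AC -> 3 bytes (euro sign)
           "\U0001d539"):  # U+1D539 -> 4 bytes (double-struck B)
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")
```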
-
"ALL legal sequences" -- fair enough, that's how I interpreted it too -- but allowing ALL of these, including as the leading character of a name, makes no sense at all. So this: "The first character of a name must be alphanumeric, a multi-byte UTF-8 character, or '_'" And as I parse it, you can't use any of the ASCII punctuation marks as a leading charactor, but you can use any non-ascii punctuation charactor, for instance. huh? In fact, you can use ANY non-ascii "character" -- including combining ones, whitespace, line feeds, other control characters, etc, etc. Really? If you are going to do that, why have any rules at all? This is reminiscent of the kerfuffle over Unicode as the core string type in Python3 -- the only really challenging problem was file names. (sure there were issues with existing mojibake's data, etc, but those were mostly surmountable). [I'll bring this around to the topic at hand, I promise] The big issue was that apparently on nix systems, filenames (paths, etc) are simply stored as a char, and the only special values are null and 47 (/ the ASCII forward slash). This all worked great in the ASCII days, and not too badly in the extended ANSI days (e.g. latin-1, etc, etc...). However, the result was that folks could use pretty much any encoding, all on the same file system, and there was no way to know what the encoding was for any given path. And all that is totally fine if all you need to do is pass a char* around, split on the slash, and maybe compare to other filenames. And that all worked fine in Python2, where a string was simply a null-terminated string of bytes (i.e. a char*). Enter Python3 and Unicode -- now you had to decode what's in the char* in order for Python to. be able to store it in a string. and that's not possible if you don't know the encoding. This was a very long kerfuffle -- with folks writing, e.g. unix utilities, saying, 'why can't I just pass around the pile of bytes? I don't care what characters they actually mean -- within the code, it's just a pile of bytes. And within the code, sure -- who cares? But what happens when you want to read that filename from a file? (or a web service) or write it to a text file, or show it to a person on the screen, or ?? The fact is, that outside of a computer program, filenames are text, and it's really helpful to have them be well described, human readable, etc... Back to the topic at hand: The NUG has selected utf-8 (and NFC normalization) so at least that's not a problem. And I can easily write code that can work with variable names, attribute. names, etc with any old code points in them -- (I use Python, so if it is valid utf-8, it can be decoded into a Python string, and I can do all. sorts of stuff with it -- no problem) -- other systems could work directly with the utf-8 encoded bytes. But for CF -- we want files to be both computer and human readable -- an ncdump of the file should be comprehendible (and not trash your terminal settings). And for THAT, it's a good idea to put some restrictions on allowable code points. BTW -- my idea to start another discussion was so that we could focus this one on only the Blacklist idea. |
-
I am sorry that the discussion has been complicated and perhaps prolonged by the naive, and not even fully correct, tests that I showed in an early post. My apologies. @JonathanGregory, @Dave-Allured: thank you for establishing and clarifying what the NUG allows for characters in names! In the light of the rather short time to the deadline for CF version 1.12, may I suggest a two-step approach:
-
There are two different discussions here: "What to say" and "How to say it." For the "How to say it" I recommend that we strive to use modern Unicode terminology that should be clear to scientists (with perhaps a few parenthetical notes about ASCII for us old timers). For example, "multi-byte UTF-8 Unicode codepoint": that is clearly defined for UTF-8, which is required by netCDF. However, for users that are working with a higher-level system (such as Python, or Java, or C++ on Windows [1], or ...), "multibyte code point" isn't really an obvious concept. For example, the degree symbol ° is code point U+00B0, decimal 176, and in Python you can use it directly in a string.
So -- to a Python programmer, how do they know if they have used a "multibyte code point"? They can look at the UTF-8 encoding of the string:
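(The original session output was not preserved; here is a hedged reconstruction with an assumed example name:)

```python
# Assumed example name containing the degree symbol (U+00B0):
name = "temp°"
print(name.encode("utf-8"))
# b'temp\xc2\xb0' -- the degree symbol becomes the two bytes C2 B0
```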
And if you know how to read that, you can see that the degree symbol is taking up 2 bytes. Or you can check the lengths: `len(name)` counts code points, while `len(name.encode('utf-8'))` counts bytes -- if the second is larger, you know there's a multi-byte character in there. You can also see the code point values with `[ord(c) for c in name]`; in that case, any value above 127 (the ASCII range) is a multi-byte character. I would argue that the last approach -- looking at the code point values -- is the most clear and obvious way to understand what's going on. So my proposal is that we use language along the lines of "code point above 127" rather than "multi-byte UTF-8 Unicode codepoint". NOTE: I'd love to use "code point" as much as possible, rather than "character" -- though I do note that a lot of the Unicode docs do use "character". Why? Because there are control codes, combining glyphs, etc., that aren't really what most people think of as "characters". But oh well. I'll put my thoughts on the "What to say" in another comment. [1] IIUC, Java and Windows (and .NET) use UTF-16 natively. At least in C/C++, that's stored as a wchar_t type. Anyway, the point is that in some (most?) programming, one is not working with UTF-8 directly, but rather encoding to UTF-8 on I/O and internally working with code points (maybe -- UTF-16 is an unfortunate mess :-( ).
-
For the "What to say": Jonathan suggested:
So at this point, we are recommending sticking with ASCII, yes? Sounds good -- I do think we should expand that later, but better to expand later than to have to restrict later. I would switch that order though, as then there is less need to define "letter", e.g., to paraphrase:
That's simple, and where we were at back in the day, yes?
OK -- that's good -- that's essentially Lars' "blacklist", but it's grey for now, yes? but
That's a bit odd, as we already recommended no non-ASCII above, yes? But sure, it's good to exclude the ASCII space, and while we are doing that, excluding all the Unicode spaces -- for future reference -- makes sense. I am a bit wary: at this point we are stating that netCDF allows almost anything, but we recommend sticking with ASCII. But if someone really doesn't want to stick with ASCII -- e.g. they want a variable name in a non-English language -- then they are on their own, and we offer no recommendations at all. That seems less than optimal. Perhaps there's no time to hash it out in time for CF version 1.12 -- I take it it's already too late to disallow non-ASCII in names?
-
Dear all

Responding to a few recent remarks:
Best wishes

Jonathan
-
I am leaning toward the following. Note my previous remarks.
-
As I mentioned in my comment above, and I think in agreement with some comments by @ChrisBarker-NOAA, there are lots of characters in the multi-byte UTF-8 character set (non-ASCII UTF-8) that are allowed by the NUG but should not be used in netCDF object names (control characters, emoticons, and dingbats were the examples I gave above). To me, it is a mistake in the NUG to allow all non-ASCII UTF-8 characters. However, that is another discussion. With respect to CF, I think there should be a strong recommendation to limit the characters used in netCDF object names, and some details on a few of the reasons data producers might want to limit the set of characters allowed in netCDF object names. Because of the complexity of Unicode, I suspect an allow list would work better than a disallow list, and much of it should be based on Unicode character categories. Anyway, here's my attempt to capture some reasons for limiting characters and levels of Unicode capabilities:
There are other Unicode character categories that should probably be added to the above list, or to another list aimed at more maximal Unicode support.
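(For concreteness, the General Categories of a few of the characters mentioned above -- my own illustration:)

```python
import unicodedata

# BEL (a control character), WHITE SMILING FACE, and an emoji:
for ch in ("\x07", "\u263a", "\U0001F600"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
# Prints Cc, So, So -- an allow-list built from letter and digit
# categories would exclude all three kinds of character.
```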
-
@ethanrd Okay. Given that the topic of character sets is hideously complicated, a general statement that object names should be sensible and legible could be okay. Surely we can rely on data producers to create reasonable object names without expanding CF with intricate rules. If the occasional dingbat or joiner slips in, I really don't care; I can digest any piece of Unicode whatsoever that is thrown my way. Meanwhile, character rules have unintended consequences, as has been noted several times in the history of netCDF.
-
The rules do not have to be intricate -- in a way, no more intricate than they were with ASCII. I think we should essentially translate the ASCII rules into Unicode terms -- e.g. no control characters, and things like that.
And that is the difference in perspective -- in many cases, you can treat a name as simply a bunch of bytes, with only a few values that have special meaning. Great. And that's why the netCDF lack-of-limitations is probably fine -- it does require NFC normalization, which is the one thing that "a string of bytes" would not work correctly without. However, in CF, we have larger concerns -- human readability, and compatibility with other tools, etc., are important. So text is not a "string of bytes" -- it's human-readable text, and should be kept that way. For CF, if the occasional dingbat or joiner slips in, then it could affect readability and downstream tools -- so we DO care. And frankly, with all the non-CF-compliant files I see out there -- folks ARE going to stick oddball characters into strings, and it's OK that that wouldn't be CF compliant. As you say, "I can digest any piece of Unicode whatsoever, that is thrown my way" -- so your tools won't break, but it could be ugly for users. As for the "blacklist": I thought the idea was to capture the particular characters that could, e.g., break / confuse CDL (and other similar tools?) -- I think that's important, but maybe those are already not allowed under the existing rules (all being ASCII).
-
Hmm -- yes. But I think being overly permissive upfront is more likely to create unintended consequences.
-
Topic for discussion
In #237 it was suggested to substantially relax restrictions on which characters are allowed in variable and attribute names. The conversation is still ongoing, and sprinkled through various comments there are examples of characters that should not be allowed, either because they have special meaning in the context of CF or netCDF as such, or because they have otherwise been identified as causing problems.
I suggest that we amend the text in section 2.3 to list which characters and character ranges CF explicitly disallows, i.e. creating a blacklist. While it may not be possible to identify all characters that should be in such a list (it may even evolve over time), I think that it is helpful to identify those characters that we now know belong to such a list.
So far I believe the following have been identified from the standard ASCII character set: `<space>`, control characters (decimal 0 ... 31, 127), `/`, `:`, and `\`. This blacklist should probably be expanded to also include Unicode control and whitespace ~~and underscore~~ characters. In addition, double underscores (`__`) have special meaning in relation to OGC netCDF-LD, specifically for prefixes, and should be mentioned as reserved for that purpose so as not to create interoperability clashes.