Allow non-English scripts for unquoted keys #891

Merged
merged 24 commits into from
Sep 11, 2022

Conversation

abelbraaksma
Contributor

@abelbraaksma abelbraaksma commented Mar 15, 2022

Fixes: #687

Support all international scripts, except for certain non-symbols, arrows and joins. This PR largely follows the recommendation in my comment (#687 (comment)), except where that conflicted with backward compat issues (i.e., we allow keys to start with a digit or dash, and that should stay that way).

The excluded ranges are mostly those that are really not "letterlike" (arrows, line drawing chars, spaces, punctuation). If someone feels that some of the excluded ranges should be included, by all means, let me know.

ABNF verified with https://tools.ietf.org/tools/bap/.

Here are some examples that will work once this is allowed. Tested with http://instaparse.mojombo.com/.

# Examples that are now valid unquoted keys
zeeën = "'seas' - Dutch"
Cuántos-años = "'How old' - Spanish"
الساعة = "arabic"
汉语大字典 = "cjk ideographs"
辭源 = "Ciyuan"
பெண்டிரேம் = "'we are women' - in Tamil"
गंगा = "'Ganges' - in Devanagari, Hindi"
העברית = "'Academy' - Hebrew"
Тіні-забутих-предків = "'Shadows of forgotten ancestors' - movie title in Ukrainian"
VäinöLinna = "finnish author"
బడికి = "'School' - in Telugu"
Người = "'person' - Vietnamese"
😂 = "x1F602 - smiley"
ᚠ = "'Cattle' - Rune"
𓆣𓆤𓆥𓆦 = "Egyptian hieroglyphs"
Fuß = "'Foot' - German"
français  = "'French' - French"

# These are still valid
- = "single dash"
3 = "digit"
_ = "underscore"

Support all international scripts, except for certain non-symbols, arrows and joins
Contributor

@ChristianSi ChristianSi left a comment

Good work!

Regarding the text in the written spec, I think the wording with "any letter-like Unicode character from any Unicode script" is good, but I also think the spec should explain what exactly that means. As I see it, the written spec should be self-contained and one should not have to read the ABNF to find out what it actually means. So, we could leave the paragraph as suggested and then add a sentence such as: "More specifically, the allowed characters are ASCII letters and digits (A-Za-z0-9), the underscore _ and the dash -, as well as the Unicode characters U+00C0 to U+00D6, U+00D8 to U+00F6, U+00F8 to U+00FF, U+0100 to U+02FF, U+0300 to U+037D" (etc.), thus translating the "unquoted-key-char" rule from the ABNF into readable text. It's just 8 lines in the ABNF, so it should be OK as a sentence (maybe set as its own paragraph).

@abelbraaksma
Contributor Author

It's just 8 lines in the ABNF, so it should be OK as a sentence (maybe set as its own paragraph).

@ChristianSi, I updated with a paragraph as a bulleted list, which I think is now quite readable (other comments/suggestions have been addressed).

@abelbraaksma
Contributor Author

abelbraaksma commented Mar 16, 2022

The error is in the build. That it says "All checks have failed" is a bit odd, and I'm not sure what to do about that:
image

@hukkin
Contributor

hukkin commented Mar 16, 2022

The error is in the build. That it says "All checks have failed" is a bit odd, and I'm not sure what to do about that:

The problem is that pre-commit.ci has been activated for this repository, but there is no configuration file for it. Not related to your changes at all.

@abelbraaksma
Contributor Author

abelbraaksma commented Mar 16, 2022

@arp242, (moving to the main thread so that our discussion remains visible in the future) as I read your comments, you basically have two concerns. One is about whether using Unicode Categories would be better suited and the second about why I included unassigned code-points.

Including unassigned code-points

This is very much "by design". We're trying to be as inclusive as possible. Since nobody can predict what code-points get assigned in the future, nor in what categories they get, the best we can do is include unassigned code-points, unless the block is excluded as "not a character" in Unicode (a few such ranges are in there).

This argument is precisely what large standards bodies like W3C use when they need to design specs. You want your spec to be future-proof and you cannot do that when you exclude ranges that may be assigned in the future.

Using categories vs fixed ranges

Categories are in flux and code-points get reassigned. This is not only a problem between Unicode versions, but also in libraries. For instance, certain characters in .NET Framework are in a different category than they are for .NET 5. Other libraries rely on whatever is the system default, which means running code on one system may fail on another.

From a parsing perspective, one could hard-code the code-points of the categories instead of naming them, which would probably be the way to go. But that is an arduous task and leads to unwieldy ABNF that is hard to analyze. Plus many code-points that are inside Lo category shouldn't be included and conversely, many code-points that aren't in one of the L categories should be added.

Finally, if you don't hard-code the categories, you'll need to fix the Unicode version. But suppose you'd pin it to version 5 (most widely implemented), then you miss out on all the good stuff since version 5. Suppose you'd pin it to version 10 (much less widely implemented). Now you're asking implementers that want to conform to TOML to implement all the ranges for categories of version 10. On top of that, users may find on their system that character "X" is in Ll, but TOML doesn't allow it because it pinned an older version. This breaks the "be liberal in what you accept" paradigm.

Btw, anybody writing an implementation for TOML that doesn't have a full Unicode library available can still easily write an implementation when only certain code-points need to be accepted. When using categories, this is much harder (as was already acknowledged in the original thread).
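To make the "no Unicode library needed" point concrete, here is a minimal sketch of a range-table check. The table is an illustrative subset only (the ASCII bare-key characters plus the Latin-1 ranges quoted earlier in this thread), not the full rule from the PR, and the names `ALLOWED_RANGES`, `is_allowed` and `is_bare_key` are made up for this example.

```python
from bisect import bisect_right

# Inclusive (first, last) code-point ranges, sorted by start.
# Illustrative subset only, not the complete unquoted-key-char rule.
ALLOWED_RANGES = [
    (0x2D, 0x2D),  # -
    (0x30, 0x39),  # 0-9
    (0x41, 0x5A),  # A-Z
    (0x5F, 0x5F),  # _
    (0x61, 0x7A),  # a-z
    (0xC0, 0xD6),  # Latin-1 letters up to, but not including, U+00D7 (multiplication sign)
    (0xD8, 0xF6),  # Latin-1 letters up to, but not including, U+00F7 (division sign)
    (0xF8, 0xFF),
]
_STARTS = [first for first, _ in ALLOWED_RANGES]


def is_allowed(char: str) -> bool:
    """Binary-search the range table; no Unicode database is consulted."""
    cp = ord(char)
    index = bisect_right(_STARTS, cp) - 1
    return index >= 0 and cp <= ALLOWED_RANGES[index][1]


def is_bare_key(key: str) -> bool:
    """A bare key is one or more allowed characters."""
    return len(key) > 0 and all(is_allowed(c) for c in key)


for key in ("zeeën", "Fuß", "a×b", "a b"):
    print(repr(key), is_bare_key(key))
```

Extending the check to further ranges only means adding rows to the table, which is the property being argued for here.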

Conclusion

In the end, there are pros and cons for either approach and I hear what you are saying. I've been in exactly that spot when we discussed certain specs in the W3C technical committees. In the end we decided to be inclusive rather than exclusive and it proved to be the easier path. I can kindly repeat my quote from the original thread here:

Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names.

and:

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.

@marzer
Contributor

marzer commented Mar 16, 2022

When using categories, this is much harder (as was already acknowledged in the original thread).

Can confirm. Massive PITA.

@abelbraaksma
Contributor Author

abelbraaksma commented Mar 16, 2022

Plus, many Unicode libraries assign different codepoints to categories.

Character classes are defined in Unicode; perhaps some libraries get them wrong, but I never encountered problems with this. AFAIK this is not a wide-spread problem(?)

Actually, it is. Even the latest version (v14.0) added a bunch of new code-points to the L category. So categories are in constant flux and each version expands on that. See https://www.unicode.org/reports/tr44/tr44-28.html#Unicode_14.0.0, specifically:

"Newly encoded modifier letters in the range U+10780..U+107BA were assigned the value Other_Lowercase, for consistency with other, similar modifier letters."

and:

"The General_Category for U+1734 HANUNOO PAMUDPOD was changed from Mn to Mc, for consistency in treatment with the newly encoded U+1715 TAGALOG PAMUDPOD."

Bottom line is, if we want to be stable, we'd best use a predefined, as wide as possible set that explicitly includes unassigned code-points.
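A small sketch of what this version dependence looks like in practice, using Python's standard `unicodedata` module purely as an example of a category-based library. The reported category for U+1734 depends on which Unicode version the interpreter bundles; the helper names are hypothetical.

```python
import unicodedata

# The category tables are tied to whatever Unicode version this interpreter bundles.
print("Unicode database version:", unicodedata.unidata_version)

# U+1734 HANUNOO PAMUDPOD: Mn before Unicode 14.0, Mc from 14.0 on (per the UAX #44 note above).
pamudpod = "\u1734"
print("U+1734 category:", unicodedata.category(pamudpod))


def is_letter_by_category(char: str) -> bool:
    """Category-based test: the answer can change with the bundled Unicode version."""
    return unicodedata.category(char).startswith("L")


def is_in_fixed_range(char: str, first: int, last: int) -> bool:
    """Range-based test: the answer depends only on the code point itself."""
    return first <= ord(char) <= last
```

Two interpreters with different bundled Unicode versions can disagree on the first function for the same input, but never on the second.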

@arp242
Contributor

arp242 commented Mar 16, 2022

You want your spec to be future-proof and you cannot do that when you exclude ranges that may be assigned in the future.

I don't really follow this to be honest. The goal of this PR is to allow only letter and digit-like symbols, but with such an inclusive range in the future it might include punctuation, new arrows, or other things now explicitly excluded. So you risk ending up with something very inconsistent. To me this seems much less future-proof.

Personally I'd rather erroneously exclude some symbols from the allowed list than accidentally include some we didn't intend, since this can be updated with a simple backwards-compatible spec change, whereas the reverse can't be fixed and we'll be stuck with it. That seems more future-proof to me. In the issue about this you mentioned "It follows the well-known adage: be liberal in what you accept", and I actually don't agree with that principle at all. There's been tons of discussions about this in the last few decades and no need to repeat it here, but this seems to be at the core of the issue.

We can also just say "include everything except =, [, ., ], some (or all?) spaces", or something to that effect. Do we really need to exclude things like arrows or block drawing characters? It's perhaps kind of silly to use it, but then again so is using Linear-A or various emojis, and that's allowed now too. That would make this entire discussion moot. I can't really think of any serious downsides right now.

Contributor

@ChristianSi ChristianSi left a comment

This looks very good to me now, except for a tiny formatting change I would suggest.

@ChristianSi
Contributor

ChristianSi commented Mar 17, 2022

@arp242 I don't quite understand the point you're trying to make. In one paragraph you suggest to be more restrictive, in the next one to be less so.

My own feeling, and your wavering maybe confirms it, is that this PR has already managed to strike a good balance. Also, as I understand it, this is very close to what XML/HTML do, and I'd say that if it works for those giants, it can work for TOML too!

@abelbraaksma
Contributor Author

We can also just say "include everything except =, [, ., ], some (or all?) spaces", or something to that effect. Do we really need to exclude things like arrows or block drawing characters?

Technically, there's nothing that stops us from doing that, I agree. There's however the long-standing consensus that "idents" need to follow certain rules and generally those rules have been to exclude characters or code-points that aren't letterlike. Fragment identifiers, perhaps the most widely used targets for URIs, follow the exact same logic as I've laid out here and it has been in use for over 25 years without issues.

It's perhaps kind of silly to use it, but then again so is using Linear-A or various emojis, and that's allowed now too.

Yes, emojis are silly, as is Linear-A. But even in the most recent Unicode version, characters have been added for oft-used languages like Telugu, Latin, CJK Ideographs, Mongolian. Obviously, they won't be immediately used by everyone, but that's not the point here. Since version 5.0 (Windows 7) or 5.1 (Windows 10) there have been many thousands of characters added.

I wouldn't oppose specifically excluding emojis. The xml:id spec is from 2005, the XML and related HTML specs that define Name and NCName are from 1998 and about. It was with great foresight that they decided not to lock in the character ranges, yet a small byproduct of that foresight led to weird things like emojis now being allowed in international domain names, identifiers, tagnames and html fragments. Nobody's ever really gotten worried about that.

If such massive standards (HTML, XML, URI) managed to deal with this for a quarter of a century, without change or backward/future compat issues, I think TOML can do so too and should follow their good example.

I don't really follow this to be honest.

I'm sorry about that. Apparently I lack the right words to explain myself clearly. If you look closely at the ranges, you'll notice that in the ASCII range a lot of characters are excluded, which can be kept for future use if TOML needs it (i.e., allow foo + bar or foo : bar to mean something).

With "future proof" I mean that it still allows TOML to change over time. It also means that there won't be new versions of TOML just to cater for changes in categories (which happen just about every year in Unicode).

Besides that, implementers that have to deal with libraries that do not have strong Unicode support (this includes .NET w.r.t. up-to-date categories) need to implement all this by hand. Keeping things simple helps in adoption.

Anyway, I value your input and point of view, and this is just my take on this feature change. While I do feel strongly about doing it the "standard, well-trodden way", I'm not married to any particular method. If the TOML powers that be disagree with what I'd call my "simple, yet inclusive" approach, I've no problem to switch gears and go with another flow ;).

@ChristianSi
Contributor

As for the "character" vs. "code point" issue, I don't have a strong preference, but I think we can continue to use the term "character", since it's convenient and widely understood. Serious confusion is unlikely, especially as we use the U+xxxx syntax to precisely define the exact ranges of code points (~ characters) allowed.

@abelbraaksma
Contributor Author

abelbraaksma commented Mar 17, 2022

@ChristianSi, looks like we submitted our comments at the same time. You've a much better way of summarizing than I do 🤣. I concur re code-points. I also think it makes the text a bit more palatable to read.

@arp242
Contributor

arp242 commented Mar 17, 2022

There's however the long-standing consensus that "idents" need to follow certain rules and generally those rules have been to exclude characters or code-points that aren't letterlike.

What is the motivation for this? And does this also apply to TOML?

I looked a bit at the xml:id specification, but I couldn't really find anything. I think it's important to look at the why of it rather than just copy things "because it's how it's done". TOML is not a programming language or XML, so what does and doesn't work in one area can actually work just fine here.

For example, almost all programming languages require identifiers to start with a number; otherwise 1 + 1 could refer to both digits or variables and you need type information during parsing to figure that out. - is typically excluded as well, to prevent confusion with the minus operator, etc. But that doesn't apply to TOML, and we already allow keys to start with numbers and contain -, as well as some other more relaxed rules.

@abelbraaksma
Contributor Author

abelbraaksma commented Mar 18, 2022

What is the motivation for this? And does this also apply to TOML?

If you don't, it becomes a mess. The 'M' in TOML stands for 'Minimal'. Using this fairly simple approach allows it to stay minimal. Using categories or hand-waving w.r.t. Unicode versions doesn't.

I looked a bit at the xml:id specification, but I couldn't really find anything.

The xml:id spec is a very simple spec that defines it as an NCName, which is a Name without the :. Conversely, such ids are used as URI fragment identifiers, which is most typically used to reference an ID or Name in an HTML, XML or like document (for which purpose they are restricted to the NCName production).

just copy things "because it's how it's done".

We don't just copy things, we pick what's useful if it serves our purpose. We've had months of discussion before we came to this PR. Have a look at the original issue please.

TOML is not a programming language or XML,

Nobody says TOML is a programming language. That doesn't mean we cannot borrow useful ideas elsewhere, whether it's from a supermarket, the Hitchhiker's Guide or ALGOL really doesn't matter. In this case we looked at past experiences, discussions and best practices in like fields.

so what does and doesn't work in one area can actually work just fine here.

Exactly! That's why we came up with this simple, unobtrusive approach.

For example, almost all programming languages require identifiers to start with a number

They don't, unless you mean labels.

- is typically excluded as well,

It isn't, unless you specifically mean a restriction in programming languages, which is irrelevant here. Closer to home you see it isn't excluded in DNS, HTML tag names, XPath and XQuery identifiers, UNC file names, paths in URLs etc, etc.

But that doesn't apply to TOML, and we already allow keys to start with numbers and contain -, as well as some other more relaxed rules.

Exactly, which is precisely why we made the rules the way they are. Nobody is suggesting to change that.

To give a counter example that follows your idea of using Unicode categories: MS SQL Server. You cannot create a database name with characters defined since 2002, as once they chose this approach, it became impossible to ever let go. Since then, MS SQL Server has been limited to Unicode 3.2. If they want to change this, they'll end up in version hell (old scripts won't be compatible with new scripts anymore and v.v.).

@abelbraaksma
Contributor Author

abelbraaksma commented Mar 18, 2022

But tbh, I feel like we’re going a bit in circles. I’m at fault too, for repeating my arguments.

Basically, there are two choices. We go for a simple approach that’s trivial to implement, is widely adopted and is easy to explain and understand but has a remote chance of some day including character assignments that weren’t anticipated; or we go for an approach that at face value seems sensible, but requires significant expansion of the ABNF, is hard to read, hard to maintain, asks a lot of implementers (one already tried and called it a major pita) and locks us in with a Unicode version.

To me, that’s not a hard choice to make.

@arp242
Contributor

arp242 commented Mar 18, 2022

I think there's some confusion here now. In my previous comment I was writing about the idea to allow all characters except a short list of symbols that would make parsing harder (=, [, ], etc.) or be greatly confusing (e.g. various space symbols). That is, I was continuing from my previous paragraph a few comments up:

We can also just say "include everything except =, [, ., ], some (or all?) spaces", or something to that effect. Do we really need to exclude things like arrows or block drawing characters? It's perhaps kind of silly to use it, but then again so is using Linear-A or various emojis, and that's allowed now too. That would make this entire discussion moot. I can't really think of any serious downsides right now.

My previous comment wasn't talking about using Unicode categories – that is a done discussion at this point as far as I'm concerned, because I don't actually see a reason to restrict bare keys to just digits and letters. I probably didn't make that too clear, sorry.

In TOML we already allow almost everything in quoted keys, so the only reason I can see to not allow something in bare keys is if it makes parsing harder or if it has the potential to be greatly confusing (such as various forms of spaces, and maybe some homoglyphs; not sure if that's worth the effort though).

The ABNF notation for this shouldn't be a huge deal, and implementing it shouldn't be either. Basically, it's just a modified table from what you wrote in this patch. Actually both the ABNF and implementation might be shorter and simpler.

What is the motivation for this? And does this also apply to TOML?
If you don't, it becomes a mess. The 'M' in TOML stands for 'Minimal'. Using this fairly simple approach allows it to stay minimal. Using categories or hand-waving w.r.t. Unicode versions doesn't.

I looked a bit at the xml:id specification, but I couldn't really find anything.
The xml:id spec is a very simple spec that defines it as an NCName, which is a Name without the :. Conversely, such ids are used as URI fragment identifiers, which is most typically used to reference an ID or Name in an HTML, XML or like document (for which purpose they are restricted to the NCName production).

Yeah, I looked at that, but unless I missed it (didn't read from cover to cover), it doesn't explain why these specific character ranges were chosen. Why are some characters excluded? For example it goes out of its way to forbid × and ÷, but it's not clear to me why? I guess the intent was to exclude math symbols (for whatever reason), but does include various math symbols in those ranges, so I'm left wondering: what's the problem with people using hello÷world = "value" in TOML? Why would we want to forbid that? And if there is a good reason, then why is hello₋world = "value" allowed? It seems rather inconsistent to me, and should either allow or forbid both (but I see no reason to forbid either).

For example, almost all programming languages require identifiers to start with a number

They don't, unless you mean labels.

I meant "forbid" instead of "require" 😅 Sorry, that was a confusing mixup.

@abelbraaksma
Contributor Author

abelbraaksma commented Mar 18, 2022

My previous comment wasn't talking about using Unicode categories – that is a done discussion at this point

Aha, I didn't get that, sorry.

it doesn't explain why these specific character ranges were chosen. Why are some characters excluded?

That's explained in the XML specs, I quoted that earlier, see #687 (comment).

For example it goes out of its way to forbid × and ÷, but it's not clear to me why? I guess the intent was to exclude math symbols (for whatever reason), but does include various math symbols in those ranges

That's historical. All Math Symbols from Unicode 2.0 were excluded. This part of the spec was written in 2001 and then went into CR/PR status, which froze this definition (XML itself used to use a more complicated range in the 1998 original spec but dropped that for the same reasons as mentioned before: prevent complex versioning and solid future-proofness). The symbols you probably talk about were added at a later date. We could certainly exclude those other ranges, but before you know it, you'll end up with a much more complicated production. Most well-known and oft-used symbols/separators etc are now excluded for the (imo) right reasons and the rest we should just ignore.

The discussion whether or not to include "everything" except a very small set should probably be done in a separate discussion/issue, as that wasn't the goal of the accepted change discussed in #687. Personally, I don't think there's much merit in including many well-known symbols, and it may be a pain for editor-writers and syntax highlighting, but ultimately that's for the larger community to decide.

@arp242
Contributor

arp242 commented Mar 18, 2022

That's explained in the XML specs, I quoted that earlier, see #687 (comment).

I guess I didn't read it as such 😅

Anyway, I think the key thing here is that you're not really "restricting identifiers" as it's already unrestricted with quoted keys, so it seems to me that this makes it a parsing/lexing issue only: you already need to deal with all of Unicode (including for syntax highlighters etc.) and implementation-wise it shouldn't be an issue anywhere I think (or at least: not more than it already is).

@abelbraaksma
Contributor Author

abelbraaksma commented Aug 16, 2022

@marzer, of course! 🤦‍♂️. How did I not get that 😆! Fixed.

@pradyunsg: I think this is ready for re-review.

@abelbraaksma abelbraaksma requested a review from pradyunsg August 16, 2022 11:04
@ChristianSi
Contributor

More than 3 weeks and still no news? Is the maintainership bottleneck raising its ugly head again? 😭

tuckstarrydell referenced this pull request Sep 11, 2022
This enforces that all Markdown files are wrapped at 80 characters,
in the style the prettier follows.
@marzer
Contributor

marzer commented Sep 12, 2022

Hype hype hype, thanks everyone for driving this (particularly @abelbraaksma who actually Did The Thing)

@abelbraaksma
Contributor Author

Thanks @pradyunsg for merging, thanks @marzer and @ChristianSi (and others!) for all the helpful comments. Glad we’re now going international!

@epage

epage commented Dec 19, 2022

Just wanting to catch up on this for when I need to implement this for my toml parser. How do the selected ranges compare to Unicode's identifier definition (Id_Start, Id_Continue, Xid_Start, Xid_Continue)?

If keeping the idea of "identifier" I believe this is what programming languages usually use and ideally we wouldn't be re-inventing the wheel.

@abelbraaksma
Contributor Author

@epage, Unicode identifiers assume that identifiers don’t start with a digit (of any kind), which is usually correct, but not in TOML, which allows numerical identifiers.

While I did look at those definitions, they didn’t suit our needs. We settled on the simplest possible ranges that are easy to implement. The ones from Unicode are typically way more complex and depend on character properties.
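As a rough illustration of that difference, Python's built-in `str.isidentifier()` can stand in for a Unicode-identifier check, since Python's own identifier rules build on XID_Start and XID_Continue. This is only a sketch: it is Python's variant of the rule, not the exact UAX #31 definition, and `is_unicode_identifier` is a hypothetical helper name.

```python
def is_unicode_identifier(key: str) -> bool:
    """Approximate a Unicode-identifier check via Python's own identifier rules."""
    return key.isidentifier()


# Keys taken from the examples in this PR's description.
candidates = ["zeeën", "Fuß", "_", "3", "-", "Cuántos-años", "😂"]

for key in candidates:
    print(f"{key!r:18} identifier-style: {is_unicode_identifier(key)}")
```

Keys that start with a digit, consist of or contain a dash, or use an emoji fail the identifier-style check, while all of them are meant to be valid bare keys under this PR.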

@ChristianSi
Contributor

TOML's rules for bare keys are pretty close to those used in XML for identifiers, IIRC.

@arp242
Contributor

arp242 commented Jan 16, 2023

I was writing test cases for this, and using a pirate flag (🏴‍☠️) doesn't work; this is:

     CPoint  Dec    UTF8        HTML       Name (Cat)
'🏴' U+1F3F4 127988 f0 9f 8f b4 &#x1f3f4;  WAVING BLACK FLAG (Other_Symbol)
'‍'  U+200D  8205   e2 80 8d    &zwj;      ZERO WIDTH JOINER (Format)
'☠'  U+2620  9760   e2 98 a0    &#x2620;   SKULL AND CROSSBONES (Other_Symbol)

The flag and ZWJ are fine, but the skull and crossbones isn't allowed in the current range.

Seems confusing since most emojis work. Took me quite a bit of time to figure out when modifying my parser to support this because I just assumed I missed something, but it turns out it's just not in the allowed range:

unquoted-key-char =/ %x2070-218F / %x2460-24FF          ; include super-/subscripts, letterlike/numberlike forms, enclosed alphanumerics
unquoted-key-char =/ %x2C00-2FEF / %x3001-D7FF          ; skip arrows, math, box drawing etc, skip 2FF0-3000 ideographic up/down markers and spaces

Looking at the U+2500..U+2bff range, I don't really see why we need to skip a lot of these things.
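A minimal sketch of the gap in question, using only the two ABNF lines quoted above. The other lines of the rule (which cover the flag and the ZWJ) are not reproduced here, and the names `GAP`, `in_skipped_gap` and `describe` are made up for this example.

```python
import unicodedata

# The two ranges on the first quoted ABNF line, and the start of the next line's first range.
LOWER_RANGES = [(0x2070, 0x218F), (0x2460, 0x24FF)]
NEXT_RANGE_START = 0x2C00

# Everything between them is skipped: U+2500..U+2BFF.
GAP = (LOWER_RANGES[-1][1] + 1, NEXT_RANGE_START - 1)


def in_skipped_gap(char: str) -> bool:
    """True if the code point lies in the block skipped between the two quoted lines."""
    return GAP[0] <= ord(char) <= GAP[1]


def describe(sequence: str) -> list[tuple[str, str, bool]]:
    """List (code point, name, falls-in-gap) for each code point of an emoji sequence."""
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"), in_skipped_gap(c))
        for c in sequence
    ]


# The pirate flag as listed in the table above: flag, ZWJ, skull and crossbones.
pirate_flag = "\U0001F3F4\u200D\u2620"

for row in describe(pirate_flag):
    print(row)
```

Only the last component of the sequence lands in the skipped block, which is why the sequence as a whole is rejected even though its first two code points are accepted.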


I know we discussed this before, but I still think we should either allow only letters+numbers or just allow almost everything (with a few exceptions); the current behaviour is just confusing. The examples use an emoji and ZWJ is explicitly allowed, so you'd expect all emojis to work, but it turns out only some emojis work. It just so happened by chance that "pirate flag" was the first emoji I tried, but there are probably others as well and with ZWJ combinations it'll be a whack-a-mole.

Either way, IMHO we should support all emojis or none. Many other ZWJ combinations do work fine; 🏳️‍🌈 (U+1F3F3 ZWJ U+1F308) or 🏴󠁧󠁢󠁷󠁬󠁳󠁿 is okay, but 🏳️‍⚧️ isn't (as U+26A7 isn't in the allowed range). In a quick test it seems all flags work, except two.

@pradyunsg
Member

@arp242 Could you file an issue for this?

@abelbraaksma abelbraaksma deleted the update-changelog branch January 17, 2023 11:53
arp242 added a commit to arp242/toml that referenced this pull request Jun 2, 2023
This backs out the unicode bare keys from toml-lang#891.

This does *not* mean we can't include it in a future 1.2 (or 1.3, or
whatever); just that right now there doesn't seem to be a clear
consensus regarding normalisation and which characters to include.
It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I
think we *should* have this so I'm not against the idea of the feature
as such, but things seem to be at a bit of a stalemate right now, and
this will allow TOML to move forward on other issues.

It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until
2019, and has only 11 upvotes. Other than that, the issue was raised
only once before in 2015 as far as I can find (toml-lang#337). I also can't
really find anyone asking for it in any of the HN threads on TOML.

All of this means we can push forward releasing TOML 1.1, giving people
access to the much more frequently requested relaxing of inline tables
(toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other
more minor things (e.g. `\e` has 12 upvotes in toml-lang#715).

Basically, a lot more people are waiting for this, and all things
considered this seems a better path forward for now, unless someone
comes up with a proposal which addresses all issues (I tried and thus
far failed).

I proposed this over here a few months ago, and the response didn't seem
too hostile to the idea:
toml-lang#966 (comment)
arp242 added a commit to arp242/toml that referenced this pull request Jun 2, 2023
This backs out the unicode bare keys from toml-lang#891.

This does *not* mean we can't include it in a future 1.2 (or 1.3, or
whatever); just that right now there doesn't seem to be a clear
consensus regarding normalisation and which characters to include.
It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I
think we *should* have this so I'm not against the idea of the feature
as such, but things seem to be at a bit of a stalemate right now, and
this will allow TOML to move forward on other fronts.

It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until
2019, and has only 11 upvotes. Other than that, the issue was raised
only once before in 2015 as far as I can find (toml-lang#337). I also can't
really find anyone asking for it in any of the HN threads on TOML.

Reverting this means we can go forward releasing TOML 1.1, giving people
access to the much more frequently requested relaxing of inline tables
(toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other
more minor things (e.g. `\e` has 12 upvotes in toml-lang#715).

Basically, a lot more people are waiting for this, and all things
considered this seems a better path forward for now, unless someone
comes up with a proposal which addresses all issues (I tried and thus
far failed).

I proposed this over here a few months ago, and the responses didn't
seem too hostile to the idea:
toml-lang#966 (comment)
arp242 added a commit to arp242/toml that referenced this pull request Sep 22, 2023
I believe this would greatly improve things and solves all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and ANY solution is a
trade-off. That said, I do believe some trade-offs are better than
others, and after looking at a bunch of different options I believe this
is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is the strongest argument in favour of this and the biggest
  improvement: we can't really do anything wrong here in a way that we
  can't correct later. Being conservative is probably the right way
  forward.

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work", but "this
  character works fine, but this very similar one doesn't". This shows up in
  a number of things:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  From the user's perspective this seems like a bug in the TOML parser.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when they
  don't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.
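
To illustrate the normalisation point from the list above, here is a small Python sketch. It assumes that "letters and digits" means the Unicode general categories `L*` and `Nd`, plus ASCII `-` and `_`; that reading, and the `is_letters_and_digits` name, are assumptions made for this example and not the exact wording of the proposal.

```python
import unicodedata


def is_letters_and_digits(key: str) -> bool:
    """Assumed rule: every character is a letter (L*), a decimal digit (Nd),
    or one of the ASCII characters '-' and '_'."""
    return len(key) > 0 and all(
        ch in "-_"
        or unicodedata.category(ch).startswith("L")
        or unicodedata.category(ch) == "Nd"
        for ch in key
    )


# The same visible text in two encodings.
composed = unicodedata.normalize("NFC", "Sinn_F\u00e9in")    # é is U+00E9
decomposed = unicodedata.normalize("NFD", "Sinn_F\u00e9in")  # e followed by U+0301

print(composed == decomposed)             # False: different code point sequences
print(is_letters_and_digits(composed))    # True: U+00E9 is a lowercase letter
print(is_letters_and_digits(decomposed))  # False: U+0301 is a combining mark (Mn)
```

Under such a rule only the precomposed spelling is accepted as a bare key, so two bare keys that differ only in normalisation form cannot both appear.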

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the code, adding multibyte support in the first place will
  probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I was in doubt over this for
  a long time, and we could also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)
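
The range-table check described in the first trade-off could look roughly like the Python sketch below. The table here is a tiny illustrative excerpt covering ASCII and some Latin letters, not the real 716-range table, and `in_table` / `is_bare_key` are hypothetical names.

```python
from bisect import bisect_right

# Sorted, non-overlapping (start, stop) code point ranges, both inclusive.
# Illustrative excerpt only; a generated table would be much longer.
ALLOWED = [
    (0x2D, 0x2D),   # -
    (0x30, 0x39),   # 0-9
    (0x41, 0x5A),   # A-Z
    (0x5F, 0x5F),   # _
    (0x61, 0x7A),   # a-z
    (0xC0, 0xD6),   # Latin letters before the multiplication sign
    (0xD8, 0xF6),   # Latin letters before the division sign
    (0xF8, 0x2C1),  # further Latin and modifier letters
]
STARTS = [start for start, _ in ALLOWED]


def in_table(cp: int) -> bool:
    """Binary search for the last range starting at or before `cp`."""
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= ALLOWED[i][1]


def is_bare_key(key: str) -> bool:
    """A key is valid if it is non-empty and every code point is in the table."""
    return len(key) > 0 and all(in_table(ord(ch)) for ch in key)


print(is_bare_key("Fuß"))
print(is_bare_key("a#b"))
```

No Unicode library is needed at run time; only the table itself has to be generated ahead of time.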

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

[1]: Aside: I encountered this just the other day as I created a TOML
     file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just
     wrote it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this pull request Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar one doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when they
  don't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.
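
The categories quoted in the a.toml to e.toml examples above can be checked with Python's standard `unicodedata` module. A small sketch; the `describe` helper is a made-up name for this example.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, general category, character name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


# The five characters used as keys in a.toml through e.toml.
samples = ["\u037E", "\u0387", "\u2013", "\u207B", "\uFF03"]

for ch in samples:
    print(*describe(ch))
```

None of the five has a general category starting with `L` or `N`, so a rule based on letters and digits would treat all of them the same way.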

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the code, adding multibyte support in the first place will
  probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I was in doubt over this for
  a long time, and we could also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)
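
As for the "allowed characters" table being easy to auto-generate: one possible way is sketched below in Python. It assumes the allowed set is general categories `L*` and `Nd` (an assumption for this example), and it uses whatever Unicode version the local Python ships with, so the number of ranges it prints will not necessarily match the 716 mentioned above. `build_ranges` and `letter_or_digit` are hypothetical names.

```python
import sys
import unicodedata


def build_ranges(allowed):
    """Collapse all code points accepted by `allowed` into (start, stop) pairs."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if not allowed(chr(cp)):
            continue
        if prev is not None and cp == prev + 1:
            prev = cp  # extend the current run
            continue
        if start is not None:
            ranges.append((start, prev))  # close the previous run
        start = prev = cp
    if start is not None:
        ranges.append((start, prev))
    return ranges


def letter_or_digit(ch: str) -> bool:
    """Assumed rule for this sketch: any letter (L*) or decimal digit (Nd)."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"


ranges = build_ranges(letter_or_digit)
print(len(ranges), "ranges, generated from Unicode", unicodedata.unidata_version)
```

The output can be dumped into a static table in the target language, so the parser itself does not need a Unicode database.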

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this pull request Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar one doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when they
  don't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I was in doubt over this for
  a long time, and we could also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)
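
On the Unicode version point: which code points count as letters depends on the version of the table or library a parser uses. A small Python illustration using the standard `unicodedata` module, which exposes both its current database and a frozen 3.2.0 one; the character picked here is just an example, and `category_in` is a made-up helper name.

```python
import unicodedata


def category_in(ucd, ch: str) -> str:
    """General category of `ch` according to the given Unicode database object."""
    return ucd.category(ch)


current = unicodedata            # the database bundled with this Python
frozen = unicodedata.ucd_3_2_0   # Unicode 3.2.0, kept in the stdlib for IDNA

capital_sharp_s = "\u1E9E"  # LATIN CAPITAL LETTER SHARP S, added in Unicode 5.1

print(current.unidata_version, category_in(current, capital_sharp_s))
print(frozen.unidata_version, category_in(frozen, capital_sharp_s))
```

A parser built against an older table rejects keys that a newer one accepts, which is why pinning a minimum Unicode version in the spec matters for interoperability.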

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this pull request Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and I've made it no secret that I feel the current
trade-off is a bad one. After looking at a bunch of different options I
believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later, unlike what we have now, which is "well I think it probably
  won't cause any problems, based on what these 5 European/American guys
  think, but if it does: we won't be able to correct it".

  Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar one doesn't". This shows
  up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all (especially outside of the
  Latin character range by the way, which shows the Euro/US bias in how
  it's written).

  People don't read specifications in great detail, nor should they.
  People try something and see if it works. Now it seems to work on
  first approximation, and then (possibly months or years later) it
  seems to "suddenly break". From the user's perspective this seems like
  a bug in the TOML parser, but it's not: it's a bug in the
  specification. It should either allow everything or nothing. This
  in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when they
  don't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables are also an issue with different scripts (Latin and
  Cyrillic is well-known), but this is less of an issue since it's not
  syntax, and also something that's fundamentally unavoidable in any
  multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We
  discussed whether TOML keys are "strings" or "identifiers" last week
  in toml-lang#966 and while views differ (mostly because they're both) it seems
  to me that making it map *closer* is better. This is a minor issue,
  but it's nice.
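
The cross-script confusables mentioned above (Latin and Cyrillic) can be seen in a few lines of Python. This is only an illustration of why the problem is unavoidable in a multi-script environment; the dictionary stands in for a parsed TOML table and is not tied to any TOML library.

```python
import unicodedata

latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A, looks the same in most fonts

# Two keys that render identically but are different strings.
table = {latin_a: 1, cyrillic_a: 2}
print(len(table))

for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Normalisation does not help: NFKC leaves the Cyrillic letter as it is.
print(unicodedata.normalize("NFKC", cyrillic_a) == latin_a)
```

Both characters are ordinary lowercase letters, so any rule that allows letters from all scripts has to accept both.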

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I was in doubt over this for
  a long time, and we could also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this pull request Sep 23, 2023
I believe this would greatly improve things and solve most of the
issues. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and I've made it no secret that I feel the current
trade-off is a bad one. After looking at a bunch of different options I
believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later, unlike what we have now, which is "well I think it probably
  won't cause any problems, based on what these 5 European/American guys
  think, but if it does: we won't be able to correct it".

  Being conservative with these kinds of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar one doesn't". This
  shows up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all (especially outside of the
  Latin character range by the way, which shows the Euro/US bias in how
  it's written).

  People don't read specifications in great detail, nor should they.
  People try something and see if it works. Now it seems to work on
  first approximation, and then (possibly months or years later) it
  seems to "suddenly break". From the user's perspective this seems like
  a bug in the TOML parser, but it's not: it's a bug in the
  specification. It should either allow everything or nothing. This
  in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables are also an issue across different scripts (Latin and
  Cyrillic is the well-known case), but this is less of an issue since
  it's not syntax, and it's also fundamentally unavoidable in any
  multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We
  discussed whether TOML keys are "strings" or "identifiers" last week
  in toml-lang#966 and while views differ (mostly because they're both) it seems
  to me that making it map *closer* is better. This is a minor issue,
  but it's nice.
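
To make the category argument above concrete, here is a minimal Python
sketch (not part of the proposal) that prints the Unicode general
category of the characters from a.toml through e.toml. It assumes that
"letters and digits" means general categories L* and Nd; that reading is
an assumption for illustration and may not match the exact ranges the
proposal lists. `is_letter_or_digit` is a hypothetical helper name.

```python
import unicodedata


def is_letter_or_digit(ch: str) -> bool:
    """Assumed rule: general category L* (letters) or Nd (decimal digits)."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"


# Code points taken from a.toml through e.toml above.
EXAMPLES = [0x037E, 0x0387, 0x2013, 0x207B, 0xFF03]

for cp in EXAMPLES:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: "
          f"{unicodedata.category(ch)}, allowed={is_letter_or_digit(ch)}")

# Normalisation: a precomposed letter and its decomposed form differ only by
# a combining mark (category Mn), which the assumed rule rejects.
composed = "\u00e9"                                  # é as one code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" + U+0301
print([unicodedata.category(c) for c in decomposed])
print(all(is_letter_or_digit(c) for c in composed))
print(all(is_letter_or_digit(c) for c in decomposed))
```

Under this assumed rule none of the five example characters counts as a
letter or digit, so all of them would be rejected consistently, and the
decomposed form of é is rejected because U+0301 is a combining mark.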

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code, adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I hesitated over this for a
  long time, and we could also use a more recent version. I feel this
  gives us a nice balance between reasonable interoperability and
  future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?).
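
As a rough illustration of the range-table check described in the first
trade-off, here is a minimal Python sketch. The table below is a
hypothetical, heavily truncated stand-in (ASCII plus a few Latin letter
ranges), not the auto-generated table from the proposal, and the names
`ALLOWED_RANGES`, `is_allowed` and `is_bare_key` are made up for this
example.

```python
from bisect import bisect_right

# Hypothetical, truncated table of inclusive (start, stop) code point ranges,
# sorted by start. A real table would be auto-generated and much longer.
ALLOWED_RANGES = [
    (0x2D, 0x2D),   # -
    (0x30, 0x39),   # 0-9
    (0x41, 0x5A),   # A-Z
    (0x5F, 0x5F),   # _
    (0x61, 0x7A),   # a-z
    (0xC0, 0xD6),   # Latin-1 letters, skipping U+00D7 MULTIPLICATION SIGN
    (0xD8, 0xF6),   # Latin-1 letters, skipping U+00F7 DIVISION SIGN
    (0xF8, 0x24F),  # rest of Latin-1, Latin Extended-A and Extended-B
]
STARTS = [start for start, _ in ALLOWED_RANGES]


def is_allowed(ch: str) -> bool:
    """Binary-search the table for the range that could contain ch."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last range whose start is <= cp
    return i >= 0 and cp <= ALLOWED_RANGES[i][1]


def is_bare_key(key: str) -> bool:
    """A bare key is non-empty and made only of allowed characters."""
    return len(key) > 0 and all(is_allowed(ch) for ch in key)


print(is_bare_key("Sinn_Féin"))
print(is_bare_key("a;b"))
```

With this stand-in table `Sinn_Féin` is accepted and `a;b` is rejected.
The lookup is a binary search, so a longer table mostly costs space
rather than time; a plain linear scan over the ranges would work too.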

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.