Allow non-English scripts for unquoted keys #891

abelbraaksma · 2022-03-15T12:16:53Z

Fixes: #687

Support all international scripts, except for certain non-symbols, arrows and joins. This PR largely follows the recommendation in my comment (#687 (comment)), except where that conflicted with backward compat issues (i.e., we allow keys to start with a digit or dash, and that should stay that way).

The excluded ranges are mostly those that are really not "letterlike" (arrows, line drawing chars, spaces, punctuation). If someone feels that some of the excluded ranges should be included, by all means, let me know.

ABNF verified with https://tools.ietf.org/tools/bap/.

Here are some examples that will work once this is allowed. Tested with http://instaparse.mojombo.com/.

# Examples that are now valid unquoted keys
zeeën = "'seas' - Dutch"
Cuántos-años = "'How old' - Spanish"
الساعة = "arabic"
汉语大字典 = "cjk ideographs"
辭源 = "Ciyuan"
பெண்டிரேம் = "'we are women' - in Tamil"
गंगा = "'Ganges' - in Devanagari, Hindi"
העברית = "'Academy' - Hebrew"
Тіні-забутих-предків = "'Shadows of forgotten ancestors' - movie title in Ukrainian"
VäinöLinna = "finnish author"
బడికి = "'School' - in Telugu"
Người = "'person' - Vietnamese"
😂 = "x1F602 - smiley"
ᚠ = "'Cattle' - Rune"
𓆣𓆤𓆥𓆦 = "Egyptian hieroglyphs"
Fuß = "'Foot' - German"
français  = "'French' - French"

# These are still valid
- = "single dash"
3 = "digit"
_ = "underscore"

Support all international scripts, except for certain non-symbols, arrows and joins

toml.md

ChristianSi

Good work!

Regarding the text in the written spec, I think the wording with "any letter-like Unicode character from any Unicode script" is good, but I also think the spec should explain what exactly that means. As I see it, the written spec should be self-contained and one should not have to read the ABNF to find out what it actually means. So, we could leave the paragraph as suggested and then add a sentence such as: "More specifically, the allowed characters are ASCII letters and digits (A-Za-z0-9), the underscore _ and the dash -, as well as the Unicode characters U+00C0 to U+00D6, U+00D8 to U+00F6, U+00F8 to U+00FF, U+0100 to U+02FF, U+0300 to U+037D" (etc.), thus translating the "unquoted-key-char" rule from the ABNF into readable text. It's just 8 lines in the ABNF, so it should be OK as a sentence (maybe set as its own paragraph).

toml.abnf

toml.md

abelbraaksma · 2022-03-16T10:59:54Z

It's just 8 lines in the ABNF, so it should be OK as a sentence (maybe set as its own paragraph).

@ChristianSi, I updated with a paragraph as a bulleted list, which I think is now quite readable (other comments/suggestions have been addressed).

abelbraaksma · 2022-03-16T11:01:06Z

The error in the build, which says "All checks have failed", is a bit odd; not sure what to do about that:

hukkin · 2022-03-16T11:09:30Z

The error in the build, which says "All checks have failed", is a bit odd; not sure what to do about that:

The problem is that pre-commit.ci has been activated for this repository, but there is no configuration file for it. Not related to your changes at all.

toml.abnf

abelbraaksma · 2022-03-16T14:50:44Z

@arp242, (moving to the main thread so that our discussion remains visible in the future) as I read your comments, you basically have two concerns. One is about whether using Unicode Categories would be better suited and the second about why I included unassigned code-points.

Including unassigned code-points

This is very much "by design". We're trying to be as inclusive as possible. Since nobody can predict what code-points get assigned in the future, nor in what categories they get, the best we can do is include unassigned code-points, unless the block is excluded as "not a character" in Unicode (a few such ranges are in there).

This argument is precisely what large standards bodies like W3C use when they need to design specs. You want your spec to be future-proof and you cannot do that when you exclude ranges that may be assigned in the future.

Using categories vs fixed ranges

Categories are in flux and code-points get reassigned. This is not only a problem between Unicode versions, but also in libraries. For instance, certain characters in .NET Framework are in a different category than they are for .NET 5. Other libraries rely on whatever is the system default, which means running code on one system may fail on another.

From a parsing perspective, one could hard-code the code-points of the categories instead of naming them, which would probably be the way to go. But that is an arduous task and leads to unwieldy ABNF that is hard to analyze. Plus many code-points that are inside Lo category shouldn't be included and conversely, many code-points that aren't in one of the L categories should be added.

Finally, if you don't hard-code the categories, you'll need to fix the Unicode version. But suppose you'd pin it to version 5 (most widely implemented), then you miss out on all the good stuff since version 5. Suppose you'd pin it to version 10 (much less widely implemented). Now you're asking implementers that want to conform to TOML to implement all the ranges for categories of version 10. Plus, users may find that on their system character "X" is in Ll, but TOML doesn't allow it because TOML is pinned to an older version. This breaks the "be liberal in what you accept" paradigm.

Btw, anybody writing an implementation for TOML that doesn't have a full Unicode library available can still easily write an implementation when only certain code-points need to be accepted. When using categories, this is much harder (as was already acknowledged in the original thread).

Conclusion

In the end, there are pros and cons for either approach and I hear what you are saying. I've been in exactly that spot when we discussed certain specs in the W3C technical committees. In the end we decided to be inclusive rather than exclusive and it proved to be the easier path. I can kindly repeat my quote from the original thread here:

Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names.

and:

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.

marzer · 2022-03-16T14:59:17Z

When using categories, this is much harder (as was already acknowledged in the original thread).

Can confirm. Massive PITA.

abelbraaksma · 2022-03-16T15:16:26Z

Plus, many Unicode libraries assign different codepoints to categories.

Character classes are defined in Unicode; perhaps some libraries get them wrong, but I never encountered problems with this. AFAIK this is not a wide-spread problem(?)

Actually, it is. Even the latest version (v14.0) added a bunch of new code-points to the L category. So categories are in constant flux and each version expands on that. See https://www.unicode.org/reports/tr44/tr44-28.html#Unicode_14.0.0, specifically:

"Newly encoded modifier letters in the range U+10780..U+107BA were assigned the value Other_Lowercase, for consistency with other, similar modifier letters."

and:

"The General_Category for U+1734 HANUNOO PAMUDPOD was changed from Mn to Mc, for consistency in treatment with the newly encoded U+1715 TAGALOG PAMUDPOD."

Bottom line is, if we want to be stable, we'd best use a predefined, as wide as possible set that explicitly includes unassigned code-points.
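To make the version dependence concrete, here is a minimal Python sketch (not part of the PR) that prints what the local Unicode database reports for the code points quoted above. The `describe` helper is a hypothetical name; the reported categories depend on the Unicode version bundled with the interpreter, so no particular output is assumed.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Summarise what this runtime's Unicode database says about a code point."""
    ch = chr(code_point)
    # Code points unknown to this Unicode version have no name and category "Cn".
    name = unicodedata.name(ch, "<unassigned or unnamed>")
    return f"U+{code_point:04X} {unicodedata.category(ch)} {name}"


print("Unicode database version:", unicodedata.unidata_version)
# U+1734 changed category in Unicode 14.0; U+1715 and U+10780 were new in 14.0.
for cp in (0x1734, 0x1715, 0x10780):
    print(describe(cp))
```

Running this on two interpreters with different Unicode databases may give different categories for the same code point, which is the instability described in the comment above.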

arp242 · 2022-03-16T15:31:31Z

You want your spec to be future-proof and you cannot do that when you exclude ranges that may be assigned in the future.

I don't really follow this to be honest. The goal of this PR is to allow only letter and digit-like symbols, but with such an inclusive range in the future it might include punctuation, new arrows, or other things now explicitly excluded. So you risk ending up with something very inconsistent. To me this seems much less

Personally I'd rather erroneously exclude some symbols from the allowed list than accidentally include some we didn't intend, since this can be updated with a simple backwards-compatible spec change, whereas the reverse can't be fixed and we'll be stuck with it. That seems more future-proof to me. In the issue about this you mentioned "It follows the well-known adage: be liberal in what you accept", and I actually don't agree with that principle at all. There's been tons of discussions about this in the last few decades and no need to repeat it here, but this seems to be at the core of the issue.

We can also just say "include everything except =, [, ., ], some (or all?) spaces", or something to that effect. Do we really need to exclude things like arrows or block drawing characters? It's perhaps kind of silly to use it, but then again so is using Linear-A or various emojis, and that's allowed now too. That would make this entire discussion moot. I can't really think of any serious downsides right now.

ChristianSi

This looks very good to me now, except for a tiny formatting change I would suggest.

toml.md

ChristianSi · 2022-03-17T17:16:43Z

@arp242 I don't quite understand the point you're trying to make. In one paragraph you suggest to be more restrictive, in the next one to be less so.

My own feeling, and your wavering maybe confirms it, is that this PR has already managed to strike a good balance. Also, as I understand it, this is very close to what XML/HTML do, and I'd say that if it works for those giants, it can work for TOML too!

abelbraaksma · 2022-03-17T17:19:42Z

We can also just say "include everything except =, [, ., ], some (or all?) spaces", or something to that effect. Do we really need to exclude things like arrows or block drawing characters?

Technically, there's nothing that stops us from doing that, I agree. There's however the long-standing consensus that "idents" need to follow certain rules and generally those rules have been to exclude characters or code-points that aren't letterlike. Fragment identifiers, perhaps the most widely used targets for URIs, follow the exact same logic as I've laid out here and it has been in use for over 25 years without issues.

It's perhaps kind of silly to use it, but then again so is using Linear-A or various emojis, and that's allowed now too.

Yes, emojis are silly, as is Linear-A. But even in the most recent Unicode version, characters have been added for oft-used languages like Telugu, Latin, CJK Ideographs, Mongolian. Obviously, they won't be immediately used by everyone, but that's not the point here. Since version 5.0 (Windows 7) or 5.1 (Windows 10) there have been many thousands of characters added.

I wouldn't oppose specifically excluding emojis. The xml:id spec is from 2005, the XML and related HTML specs that define Name and NCName are from 1998 and about. It was with great foresight that they decided not to lock in the character ranges, yet a small byproduct of that foresight led to weird things like emojis now being allowed in international domain names, identifiers, tagnames and html fragments. Nobody's ever really gotten worried about that.

If such massive standards (HTML, XML, URI) managed to deal with this for a quarter of a century, without change or backward/future compat issues, I think TOML can do so too and should follow their good example.

I don't really follow this to be honest.

I'm sorry about that. Apparently I lack the right words to explain myself clearly. If you look closely at the ranges, you'll notice that in the ASCII range a lot of characters are excluded, which can be used for future use if TOML needs it (i.e., allow foo + bar or foo : bar to mean something).

With "future proof" I mean that it still allows TOML to change over time. It also means that there won't be new versions of TOML just to cater for changes in categories (which happen just about every year in Unicode).

Besides that, implementers who have to deal with libraries that do not have strong Unicode support (this includes .NET w.r.t. up-to-date categories) need to implement all this by hand. Keeping things simple helps adoption.

Anyway, I value your input and point of view, and this is just my take on this feature change. While I do feel strongly about doing it the "standard, well-trodden way", I'm not married to any particular method. If the TOML powers that be disagree with what I'd call my "simple, yet inclusive" approach, I've no problem to switch gears and go with another flow ;).

ChristianSi · 2022-03-17T17:23:39Z

As for the "character" vs. "code point" issue, I don't have a strong preference, but I think we can continue to use the term "character", since it's convenient and widely understood. Serious confusion is unlikely, especially as we use the U+xxxx syntax to precisely define the exact ranges of code points (~ characters) allowed.

abelbraaksma · 2022-03-17T19:12:13Z

@ChristianSi, looks like we submitted our comments at the same time. You've a much better way of summarizing than I do 🤣. I concur re code-points. I also think it makes the text a bit more palatable to read.

arp242 · 2022-03-17T20:13:13Z

There's however the long-standing consensus that "idents" need to follow certain rules and generally those rules have been to exclude characters or code-points that aren't letterlike.

What is the motivation for this? And does this also apply to TOML?

I looked a bit at the xml:id specification, but I couldn't really find anything. I think it's important to look at the why of it rather than just copy things "because it's how it's done". TOML is not a programming language or XML, so what does and doesn't work in one area can actually work just fine here.

For example, almost all programming languages require identifiers to start with a number; otherwise 1 + 1 could refer to both digits or variables and you need type information during parsing to figure that out. - is typically excluded as well, to prevent confusion with the minus operator, etc. But that doesn't apply to TOML, and we already allow keys to start with numbers and contain -, as well as some other more relaxed rules.

abelbraaksma · 2022-03-18T00:29:39Z

What is the motivation for this? And does this also apply to TOML?

If you don't, it becomes a mess. The 'M' in TOML stands for 'Minimal'. Using this fairly simple approach allows it to stay minimal. Using categories or hand-waving w.r.t. Unicode versions doesn't.

I looked a bit at the xml:id specification, but I couldn't really find anything.

The xml:id spec is a very simple spec that defines it as an NCName, which is a Name without the :. Conversely, such ids are used as URI fragment identifiers, which is most typically used to reference an ID or Name in an HTML, XML or like document (for which purpose they are restricted to the NCName production).

just copy things "because it's how it's done".

We don't just copy things, we pick what's useful if it serves our purpose. We've had months of discussion before we came to this PR. Have a look at the original issue please.

TOML is not a programming language or XML,

Nobody says TOML is a programming language. That doesn't mean we cannot borrow useful ideas elsewhere, whether it's from a supermarket, the Hitchhiker's Guide or ALGOL really doesn't matter. In this case we looked at past experiences, discussions and best practices in like fields.

so what does and doesn't work in one area can actually work just fine here.

Exactly! That's why we came up with this simple, unobtrusive approach.

For example, almost all programming languages require identifiers to start with a number

They don't, unless you mean labels.

- is typically excluded as well,

It isn't, unless you specifically mean a restriction in programming languages, which is irrelevant here. Closer to home you see it isn't excluded in DNS, HTML tag names, XPath and XQuery identifiers, UNC file names, paths in URLs etc, etc.

But that doesn't apply to TOML, and we already allow keys to start with numbers and contain -, as well as some other more relaxed rules.

Exactly, which is precisely why we made the rules the way they are. Nobody is suggesting to change that.

To give a counter example that follows your idea of using Unicode categories: MS SQL Server. You cannot create a database name with characters defined since 2002, as once they chose this approach, it became impossible to ever let go. Since then, MS SQL Server has been limited to Unicode 3.2. If they want to change this, they'll end up in version hell (old scripts won't be compatible with new scripts anymore and v.v.).

abelbraaksma · 2022-03-18T00:45:58Z

But tbh, I feel like we’re going a bit in circles. I’m at fault too, for repeating my arguments.

Basically, there are two choices. We go for a simple approach that’s trivial to implement, is widely adopted and is easy to explain and understand but has a remote chance of some day including character assignments that weren’t anticipated; or we go for an approach that at face value seems sensible, but requires significant expansion of the ABNF, is hard to read, hard to maintain, asks a lot of implementers (one already tried and called it a major pita) and locks us in with a Unicode version.

To me, that’s not a hard choice to make.

arp242 · 2022-03-18T02:07:32Z

I think there's some confusion here now. In my previous comment I was writing about the idea to allow all characters except a short list of symbols that would make parsing harder (=, [, ], etc.) or be greatly confusing (e.g. various space symbols). That is, I was continuing from my previous paragraph a few comments up:

We can also just say "include everything except =, [, ., ], some (or all?) spaces", or something to that effect. Do we really need to exclude things like arrows or block drawing characters? It's perhaps kind of silly to use it, but then again so is using Linear-A or various emojis, and that's allowed now too. That would make this entire discussion moot. I can't really think of any serious downsides right now.

My previous comment wasn't talking about using Unicode categories – that is a done discussion at this point as far as I'm concerned, because I don't actually see a reason to restrict bare keys to just digits and letters. I probably didn't make that too clear, sorry.

In TOML we already allow almost everything in quoted keys, so the only reason I can see to not allow something in bare keys is if it makes parsing harder or if it has the potential to be greatly confusing (such as various forms of spaces, and maybe some homoglyphs; not sure if that's worth the effort though).

The ABNF notation for this shouldn't be a huge deal, and implementing it shouldn't be either. Basically, it's just a modified table from what you wrote in this patch. Actually both the ABNF and implementation might be shorter and simpler.
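As an illustration of how short such a deny-list check could be, here is a hedged Python sketch. The `SYNTAX_CHARS` set is an assumption: it contains the characters named in this comment (`=`, `[`, `.`, `]`) plus a few others that already carry meaning in TOML, and "spaces" is approximated with `str.isspace()`. A real proposal would also have to decide on control characters and invisible formatting characters, which this sketch ignores.

```python
# Hypothetical deny-list: the characters named in the comment above, plus
# other characters that already have a syntactic role in TOML (assumption).
SYNTAX_CHARS = set("=[].") | set("#\"'{},")


def allowed_in_bare_key(ch: str) -> bool:
    """Deny-list check: anything passes unless it is syntax or whitespace."""
    return ch not in SYNTAX_CHARS and not ch.isspace()


def is_bare_key(key: str) -> bool:
    """A bare key is non-empty and consists only of allowed characters."""
    return bool(key) and all(allowed_in_bare_key(ch) for ch in key)


for candidate in ["hello÷world", "→", "a b", "a=b", "a\u3000b"]:
    print(repr(candidate), is_bare_key(candidate))
```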

What is the motivation for this? And does this also apply to TOML?
If you don't, it becomes a mess. The 'M' in TOML stands for 'Minimal'. Using this fairly simple approach allows it to stay minimal. Using categories or hand-waving w.r.t. Unicode versions doesn't.

I looked a bit at the xml:id specification, but I couldn't really find anything.
The xml:id spec is a very simple spec that defines it as an NCName, which is a Name without the :. Conversely, such ids are used as URI fragment identifiers, which is most typically used to reference an ID or Name in an HTML, XML or like document (for which purpose they are restricted to the NCName production).

Yeah, I looked at that, but unless I missed it (didn't read from cover to cover), it doesn't explain why these specific character ranges were chosen. Why are some characters excluded? For example it goes out of its way to forbid × and ÷, but it's not clear to me why? I guess the intent was to exclude math symbols (for whatever reason), but does include various math symbols in those ranges, so I'm left wondering: what's the problem with people using hello÷world = "value" in TOML? Why would we want to forbid that? And if there is a good reason, then why is hello₋world = "value" allowed? It seems rather inconsistent to me, and should either allow or forbid both (but I see no reason to forbid either).

For example, almost all programming languages require identifiers to start with a number

They don't, unless you mean labels.

I meant "forbid" instead of "require" 😅 Sorry, that was a confusing mixup.

abelbraaksma · 2022-03-18T10:56:40Z

My previous comment wasn't talking about using Unicode categories – that is a done discussion at this point

Aha, I didn't get that, sorry.

it doesn't explain why these specific character ranges were chosen. Why are some characters excluded?

That's explained in the XML specs, I quoted that earlier, see #687 (comment).

For example it goes out of its way to forbid × and ÷, but it's not clear to me why? I guess the intent was to exclude math symbols (for whatever reason), but does include various math symbols in those ranges

That's historical. All Math Symbols from Unicode 2.0 were excluded. This part of the spec was written in 2001 and then went into CR/PR status, which froze this definition (XML itself used to use a more complicated range in the 1998 original spec but dropped that for the same reasons as mentioned before: prevent complex versioning and solid future-proofness). The symbols you probably talk about were added at a later date. We could certainly exclude those other ranges, but before you know it, you'll end up with a much more complicated production. Most well-known and oft-used symbols/separators etc are now excluded for the (imo) right reasons and the rest we should just ignore.

The discussion whether or not to include "everything" except a very small set should probably be done in a separate discussion/issue, as that wasn't the goal of the accepted change discussed in #687. Personally, I don't think there's much merit in including many well-known symbols, and it may be a pain for editor-writers and syntax highlighting, but ultimately that's for the larger community to decide.

arp242 · 2022-03-18T13:43:53Z

That's explained in the XML specs, I quoted that earlier, see #687 (comment).

I guess I didn't read it as such 😅

Anyway, I think the key thing here is that you're not really "restricting identifiers" as it's already unrestricted with quoted keys, so it seems to me that this makes it a parsing/lexing issue only: you already need to deal with all of Unicode (including for syntax highlighters etc.) and implementation-wise it shouldn't be an issue anywhere I think (or at least: not more than it already is).

abelbraaksma · 2022-08-16T11:03:53Z

@marzer, of course! 🤦‍♂️. How did I not get that 😆! Fixed.

@pradyunsg: I think this is ready for re-review.

ChristianSi · 2022-09-08T14:58:41Z

More than 3 weeks and still no news? Is the maintainership bottleneck raising its ugly head again? 😭

This enforces that all Markdown files are wrapped at 80 characters, in the style that prettier follows.

marzer · 2022-09-12T01:48:09Z

Hype hype hype, thanks everyone for driving this (particularly @abelbraaksma who actually Did The Thing)

abelbraaksma · 2022-09-12T14:18:00Z

Thanks @pradyunsg for merging, thanks @marzer and @ChristianSi (and others!) for all the helpful comments. Glad we’re now going international!

epage · 2022-12-19T13:37:13Z

Just wanting to catch up on this for when I need to implement this for my toml parser. How do the selected ranges compare to Unicode's identifier definition (Id_Start, Id_Continue, Xid_Start, Xid_Continue)?

If keeping the idea of "identifier" I believe this is what programming languages usually use and ideally we wouldn't be re-inventing the wheel.

abelbraaksma · 2022-12-19T19:39:45Z

@epage, Unicode identifiers assume that identifiers don’t start with a digit (of any kind), which is usually correct, but not in TOML, which allows numerical identifiers.

While I did look at those definitions, they didn’t suit our needs. We settled on the simplest possible ranges that are easy to implement. The ones from Unicode are typically way more complex and depend on character properties.
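A quick way to see the mismatch is Python's `str.isidentifier()`, which is based on XID_Start/XID_Continue (plus the underscore). This is only an illustration using keys from the examples at the top of this PR; it says nothing about how any TOML parser is implemented.

```python
# Keys taken from the PR description; all are meant to be valid bare keys.
candidates = ["zeeën", "汉语大字典", "_", "3", "-", "Cuántos-años"]

for key in candidates:
    # isidentifier() applies the XID-based identifier rules, so a leading
    # digit or any dash disqualifies the key even though TOML accepts it.
    print(f"{key!r}: XID-style identifier = {key.isidentifier()}")
```

Keys such as `3`, `-` and `Cuántos-años` fail the identifier test, so adopting the Unicode identifier properties unchanged would conflict with bare keys that TOML already allows.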

ChristianSi · 2022-12-23T13:22:55Z

TOML's rules for bare keys are pretty close to those used in XML for identifiers, IIRC.

arp242 · 2023-01-16T12:25:27Z

I was writing test cases for this, and using a pirate flag (🏴‍☠️) doesn't work; this is:

     CPoint  Dec    UTF8        HTML       Name (Cat)
'🏴' U+1F3F4 127988 f0 9f 8f b4 &#x1f3f4;  WAVING BLACK FLAG (Other_Symbol)
'‍'  U+200D  8205   e2 80 8d    &zwj;      ZERO WIDTH JOINER (Format)
'☠'  U+2620  9760   e2 98 a0    &#x2620;   SKULL AND CROSSBONES (Other_Symbol)

The flag and ZWJ are fine, but the skull and crossbones isn't allowed in the current range.

Seems confusing since most emojis work. Took me quite a bit of time to figure out when modifying my parser to support this because I just assumed I missed something, but turns out it's just not in the allowed range:

unquoted-key-char =/ %x2070-218F / %x2460-24FF          ; include super-/subscripts, letterlike/numberlike forms, enclosed alphanumerics
unquoted-key-char =/ %x2C00-2FEF / %x3001-D7FF          ; skip arrows, math, box drawing etc, skip 2FF0-3000 ideographic up/down markers and spaces

Looking at the U+2500..U+2bff range, I don't really see why we need to skip a lot of these things.

I know we discussed this before, but I still think we should either allow only letters+numbers or just allow almost everything (with a few exceptions); the current behaviour is just confusing. The examples use an emoji and ZWJ is explicitly allowed, so you'd expect all emojis to work, but it turns out only some emojis work. It just so happened by chance that "pirate flag" was the first emoji I tried, but there are probably others as well, and with ZWJ combinations it'll be a whack-a-mole.

Either way, IMHO we should support all emojis or none. Many other ZWJ combinations do work fine; 🏳️‍🌈 (U+1F3F3 ZWJ U+1F308) or 🏴󠁧󠁢󠁷󠁬󠁳󠁿 is okay, but 🏳️‍⚧️ isn't (as U+26A7 isn't in the allowed range). In a quick test it seems all flags work, except two.
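The behaviour described in this comment can be reproduced with a small range-table check. In the Python sketch below, the two rows marked "quoted above" come from the ABNF lines in this comment; every other row is an assumption about the rest of the `unquoted-key-char` rule and should be verified against toml.abnf before being relied on. The helper names are hypothetical.

```python
# Inclusive (start, stop) code point ranges for unquoted-key-char.
# ASSUMPTION: only the rows marked "quoted above" are taken from this thread;
# the other rows must be checked against toml.abnf.
UNQUOTED_KEY_RANGES = [
    (0x2D, 0x2D), (0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xB2, 0xB3), (0xB9, 0xB9), (0xBC, 0xBE),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x37D),
    (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x203F, 0x2040),
    (0x2070, 0x218F), (0x2460, 0x24FF),  # quoted above
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),  # quoted above
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def is_unquoted_key_char(ch: str) -> bool:
    """True if the code point falls inside any allowed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UNQUOTED_KEY_RANGES)


def rejected(key: str) -> list:
    """List the code points in key that the table does not allow."""
    return [f"U+{ord(ch):04X}" for ch in key if not is_unquoted_key_char(ch)]


pirate_flag = "\U0001F3F4\u200D\u2620"    # flag, ZWJ, skull and crossbones
rainbow_flag = "\U0001F3F3\u200D\U0001F308"
print("pirate flag rejects:", rejected(pirate_flag))
print("rainbow flag rejects:", rejected(rainbow_flag))
```

With this table the only rejected code point in the pirate flag sequence is U+2620, which sits in the skipped U+2500..U+2BFF block.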

pradyunsg · 2023-01-16T12:29:15Z

@arp242 Could you file an issue for this?

This backs out the unicode bare keys from toml-lang#891.

This does *not* mean we can't include it in a future 1.2 (or 1.3, or whatever); just that right now there doesn't seem to be a clear consensus regarding normalisation and which characters to include. It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I think we *should* have this so I'm not against the idea of the feature as such, but things seem to be at a bit of a stalemate right now, and this will allow TOML to move forward on other issues.

It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until 2019, and has only 11 upvotes. Other than that, the issue was raised only once before in 2015 as far as I can find (toml-lang#337). I also can't really find anyone asking for it in any of the HN threads on TOML.

All of this means we can push forward releasing TOML 1.1, giving people access to the much more frequently requested relaxing of inline tables (toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other more minor things (e.g. `\e` has 12 upvotes in toml-lang#715).

Basically, a lot more people are waiting for this, and all things considered this seems a better path forward for now, unless someone comes up with a proposal which addresses all issues (I tried and thus far failed).

I proposed this over here a few months ago, and the response didn't seem too hostile to the idea: toml-lang#966 (comment)
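For readers unfamiliar with the normalisation concern mentioned in this commit message, a minimal Python sketch: the same visible key can be spelled with a precomposed letter or with a base letter plus a combining mark, and the two spellings are different strings unless they are normalised. The key `café` is a made-up example; U+0301 falls in the "U+0300 to U+037D" range mentioned earlier in the thread.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both render the same, but they compare as different strings, so a parser
# that does not normalise would treat them as two distinct keys.
table = {composed: 1, decomposed: 2}
print("distinct keys:", len(table))

# After NFC normalisation the two spellings become identical.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print("equal after NFC:", same_after_nfc)

# The combining mark is a nonspacing mark, not a letter.
print("category of U+0301:", unicodedata.category("\u0301"))
```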

I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and ANY solution is a trade-off. That said, I do believe some trade-offs are better than others, and after looking at a bunch of different options I believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts.

  This is the strongest argument in favour of this and the biggest improvement: we can't really do anything wrong here in a way that we can't correct later. Being conservative is probably the right way forward.

- This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people use them and the specification even strongly discourages people from using them, which is why this has gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work", but "this character works fine, but this very similar doesn't".

  This shows up in a number of things:

      a.toml:
        Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation)
        Error: line 1: expected '.' or '=', but got ';' instead

      b.toml:
        Input: · = 42 # # U+0387 GREEK ANO TELEIA (Other_Punctuation)
        Error: (none)

      c.toml:
        Input: – = 42 # U+2013 EN DASH (Dash_Punctuation)
        Error: line 1: expected '.' or '=', but got '–' instead

      d.toml:
        Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol)
        Error: (none)

      e.toml:
        Input: ＃x = "commented ... or is it?" # # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
        Error: (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try something and see if it works. Now it seems to work on first approximation, and then (possibly months later) it seems to "break". From the user's perspective this seems like a bug in the TOML parser. There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:

      '＃' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50 SMALL COMMA (Other_Punctuation)
      '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation)
      '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the code, adding multibyte support in the first place will probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice.

  I chose Unicode 9 as everyone supports this; I deliberated over it for a long time, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running into this limitation, and it is really something that the IETF should address in a new RFC or something ("Extra Augmented BNF"?).

Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.
Fixes toml-lang#954 Fixes toml-lang#966 Fixes toml-lang#979 Ref toml-lang#687 Ref toml-lang#891 Ref toml-lang#941 [1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like: [1950] Labour = [13_266_176, 315, 617] Conservative = [12_492_404, 298, 619] Liberal = [ 2_621_487, 9, 475] Sinn_Fein = [ 23_362, 0, 2] That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.

I believe this would greatly improve things and solve all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO.

I don't think there are ANY perfect solutions here; *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we take will need to include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it". Being conservative for these kinds of things is good!

- This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people use them, which is why this had gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar one doesn't". This shows up in a number of things aside from emojis:

      a.toml:
        Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
        Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
        Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
        Error:   (none)

      c.toml:
        Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
        Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
        Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
        Error:   (none)

      e.toml:
        Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
        Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written).

  People don't read specifications in great detail, nor should they. People try something and see if it works. Now it seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:

      '＃' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50 SMALL COMMA (Other_Punctuation)
      '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation)
      '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables are also an issue with different scripts (Latin and Cyrillic is the well-known case), but this is less of an issue since it's not syntax, and also something that's fundamentally unavoidable in any multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger.

  The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code, adding multibyte support at all will be the harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2; Go is Unicode 8.0; etc. I don't think this is really much of an issue in practice.

  I chose Unicode 9 as everyone supports this; I deliberated over it for a long time, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability and future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running into this limitation, and it's really something that the IETF should address in a new RFC or something ("Extra Augmented BNF"?).

Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
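
The "loop over every character and check if it's in a range table" check from the commit message could look roughly like the Python sketch below. It is only an illustration, not code from any TOML parser: the names `ALLOWED_RANGES`, `is_bare_key_char` and `is_bare_key` are made up here, and the table is a hypothetical, heavily truncated stand-in for the roughly 716 generated ranges the message mentions.

```python
import bisect

# Hypothetical, heavily truncated table of inclusive (start, stop) code point
# ranges, sorted by start. A real table would be auto-generated from the
# Unicode Character Database; these few entries are for illustration only.
ALLOWED_RANGES = [
    (0x002D, 0x002D),  # '-' (kept for backward compatibility)
    (0x0030, 0x0039),  # ASCII digits 0-9
    (0x0041, 0x005A),  # A-Z
    (0x005F, 0x005F),  # '_'
    (0x0061, 0x007A),  # a-z
    (0x00C0, 0x00D6),  # Latin-1 letters, skipping U+00D7 MULTIPLICATION SIGN
    (0x00D8, 0x00F6),  # Latin-1 letters, skipping U+00F7 DIVISION SIGN
    (0x00F8, 0x00FF),  # Latin-1 letters
    (0x0388, 0x038A),  # Greek capitals with tonos; U+0387 ANO TELEIA is excluded
    (0x0410, 0x044F),  # basic Cyrillic letters
]

# Range starts only, so bisect can find the candidate range in O(log n).
_STARTS = [start for start, _ in ALLOWED_RANGES]


def is_bare_key_char(ch: str) -> bool:
    """Return True if the single character ch falls inside an allowed range."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp (or -1 if there is none).
    i = bisect.bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= ALLOWED_RANGES[i][1]


def is_bare_key(key: str) -> bool:
    """Return True if key is non-empty and every character is allowed."""
    return len(key) > 0 and all(is_bare_key_char(ch) for ch in key)


if __name__ == "__main__":
    print(is_bare_key("Sinn_F\u00e9in"))    # precomposed é (U+00E9)
    print(is_bare_key("Sinn_Fe\u0301in"))   # 'e' + combining acute (U+0301)
    print(is_bare_key("\uff03x"))           # FULLWIDTH NUMBER SIGN + 'x'
```

With this toy table the precomposed spelling of Sinn_Féin is accepted, while the decomposed spelling is rejected because the combining accent is not a letter or digit. That is the mechanism behind the normalisation argument in the commit message. The binary search keeps the per-character cost low even with several hundred ranges.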

Allow non-English scripts for unquoted keys

cc0a59e

Support all international scripts, except for certain non-symbols, arrows and joins

abelbraaksma mentioned this pull request Mar 15, 2022

Relax bare key restrictions to allow additional unicode letters and numbers #687

Closed

Update changelog and toml.md with non-English bare key examples

aacda84

hukkin reviewed Mar 15, 2022

toml.md (outdated, resolved)

ChristianSi suggested changes Mar 16, 2022

toml.abnf (outdated, resolved)

toml.md (outdated, resolved)

abelbraaksma added 2 commits March 16, 2022 11:21

Simplify abnf slightly

6c7e62d

Add full Unicode ranges to toml.md

ddf3fa5

ChristianSi suggested changes Mar 16, 2022

toml.md (outdated, resolved)

toml.md (outdated, resolved)

marzer reviewed Mar 16, 2022

toml.md (outdated, resolved)

abelbraaksma added 2 commits March 16, 2022 11:39

remove trailing spaces

40e6c13

improve wording

8e78ee9

arp242 reviewed Mar 16, 2022

toml.abnf (outdated, resolved)

ChristianSi suggested changes Mar 17, 2022

toml.md (outdated, resolved)

abelbraaksma requested a review from pradyunsg August 16, 2022 11:04

tuckstarrydell referenced this pull request Sep 11, 2022

Enforce wrapping lines at 80 characters with prettier

be51db4

This enforces that all Markdown files are wrapped at 80 characters, in the style that prettier follows.

pradyunsg approved these changes Sep 11, 2022

pradyunsg merged commit 13a3e63 into toml-lang:main Sep 11, 2022

awvwgk mentioned this pull request Sep 11, 2022

Support non-English scripts for bare keys toml-f/toml-f#111

Open

arp242 mentioned this pull request Oct 27, 2022

Add tests for unicode in bare keys toml-lang/toml-test#125

Closed

epage mentioned this pull request Dec 19, 2022

Support TOML 1.1 release toml-rs/toml#397

Open

arp242 mentioned this pull request Jan 16, 2023

Not all emojis work as bare keys #954

Open

abelbraaksma deleted the update-changelog branch January 17, 2023 11:53

SnoopJ mentioned this pull request Mar 9, 2023

Clarify that key uniqueness depends only on binary representation, recommend normalization #966

Open

arp242 mentioned this pull request Jun 2, 2023

Backout Unicode bare keys #979

Open

uyha mentioned this pull request Sep 21, 2023

An open letter for the release of v1.1.0 #989

Closed

arp242 mentioned this pull request Sep 23, 2023

Change bare key characters to Letter and Digit #990

Open

JamesParrott mentioned this pull request Oct 8, 2023

List index out of range + unparseable UTF8 chars uiri/toml#430

Closed



