Reconsider hex and/or octal integer formats #409

Closed
rmunn opened this issue Apr 28, 2016 · 37 comments · Fixed by #507

Comments

@rmunn

rmunn commented Apr 28, 2016

Issue #53 was closed in June 2014, because the decision at the time was to prefer simplicity of implementation. So because 0xff00ff or 0o755 were slightly harder to write parsers for than 16711935 or 493, the choice at the time was not to allow hex or octal numbers in TOML.

However, since that time, issue #263 has been decided the other way. Datetime values are non-trivial to parse, but are highly useful in some scenarios. So the decision was made to keep them in, because they are useful to some real users.

These two decisions are inconsistent. If datetimes are going to be in TOML, the same arguments can be (and have been) made for hex and octal representations of numbers, which are a lot easier to write a parser for than datetimes. Most languages already have a hex parser implementation that TOML parsers could take advantage of. And in any language that doesn't, parsing hex values is not complex. It's a problem with "Coding 101 homework" levels of difficulty, not "doctoral thesis" levels of difficulty.

And hex and octal values are useful in many scenarios that TOML is intended for, such as config files. Unix permissions use octal values: 0o755 is much easier to mentally translate to u+rwx, g+rx, o+rx than 491. Or was it 493 or 495? Quick, can you tell which of those three decimal values is the correct conversion of 0o755? I can't without a calculator, and I'd much rather see 0o755 in config files. Hex, of course, is highly useful when dealing with colors or bit flags. Neither is as common in config files as octal, but if we allow octal there's no good reason not to allow hex.
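For anyone who wants to check the arithmetic, here is a minimal Python sketch that converts the three decimal candidates back to octal. It only uses Python's built-in `oct()` and octal literals and says nothing about how a TOML parser would behave.

```python
# Convert each decimal candidate back to octal to see which one is 0o755.
candidates = (491, 493, 495)
as_octal = {value: oct(value) for value in candidates}

for value, text in as_octal.items():
    print(value, "->", text)

# The positional expansion of octal 755: 7*8^2 + 5*8^1 + 5*8^0.
expanded = 7 * 64 + 5 * 8 + 5
print(expanded == 0o755)
```

Only one of the three candidates round-trips to `0o755`, which is the point: the decimal form hides the per-digit structure that the octal form makes visible.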

Therefore, I would ask that #53 be revisited, either by reopening that issue and having the discussion there, or by starting a new discussion here. The reason for closing #53, to keep things simple for TOML implementations, has been abandoned by now, and there's no longer any reason not to allow hex and octal values.

@rmunn
Author

rmunn commented Apr 28, 2016

Also, I'll repeat the comment I made on issue #53 last month: if octal values are included, PLEASE don't repeat C's mistake, as so many programming languages have. A leading 0 in an integer should not change its meaning. The format for octal should parallel the format for hex: 0x123 for hex, and 0o123 for octal. (And 0b101 for binary, if binary is allowed.)

Additionally, I would recommend that the only integer format markers allowed be lowercase: 0x, 0o, and possibly 0b. In particular, I suggest that 0O (digit zero followed by capital letter O) be forbidden by the spec. It's too easy to mistake those two characters for each other in many fonts.

As for hex digits, either they should be lowercase-only (to match how 0x is the only format marker allowed) or they should allow mixed-case; I haven't made up my mind which would be better. I prefer to use lowercase hex digits myself, but some people prefer to see 0xDEADBEEF rather than 0xdeadbeef, so if I were making the decision, I would choose to allow mixed-case in hex digits. It doesn't complicate implementations much, and it lets people follow their preference.

Finally, the question of negative numbers comes up. What should -0x123 mean? What about -0xffff? What about -0xffffffffffffffff? I would recommend that negative numbers should only be allowed in decimal representation, and should be forbidden in hex, octal, and binary representations. So -0x123 would be an error, rather than converting to -291 in decimal.
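The rules suggested in this comment (lowercase `0x` / `0o` / `0b` prefixes only, mixed-case hex digits, no leading zero for octal, no sign) can be sketched in a few lines of Python. The function name `parse_prefixed_int` is hypothetical and chosen only for illustration; this is not code from any TOML parser, and underscores are deliberately left out here.

```python
import re

# Hypothetical helper illustrating the proposed rules, not a real TOML API.
# Lowercase prefixes only; hex digits may be mixed-case; no sign allowed.
_PREFIXED = re.compile(r"0x[0-9A-Fa-f]+|0o[0-7]+|0b[01]+")
_BASES = {"x": 16, "o": 8, "b": 2}


def parse_prefixed_int(text: str) -> int:
    """Parse a hex, octal, or binary literal, rejecting anything else."""
    if not _PREFIXED.fullmatch(text):
        raise ValueError(f"invalid prefixed integer: {text!r}")
    # text[1] is the base letter; the digits start at index 2.
    return int(text[2:], _BASES[text[1]])


def is_rejected(text: str) -> bool:
    """Return True when the literal is refused under the proposed rules."""
    try:
        parse_prefixed_int(text)
    except ValueError:
        return True
    return False
```

Under these assumptions `0xDEADBEEF`, `0o755`, and `0b101` parse, while `0O755`, `0X1f`, `0755`, and `-0x123` are all rejected.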

@acasajus

acasajus commented Jun 3, 2016

Also #54 got accepted

@FranklinYu

FranklinYu commented Jun 14, 2016

I think this is exactly what @BurntSushi was worried about when he hesitated to agree to #54: allowing any of the features is unfair to all the others, but allowing all of them would make this language no longer "minimal". I would suggest that this feature be included in the standard after most of the available parsers have implemented it, not the other way around; at that time it would be easier for @BurntSushi to decide to merge this feature into the standard.

@rmunn
Author

rmunn commented Jun 15, 2016

Letting the parser implementations drive the spec might be a good idea, but OTOH, that's how we got the mess that is Javascript. So while I agree that it would be good for parsers to implement this proposal (and it should be easy to implement since most languages have a built-in ability to parse ints in bases other than 10), I think we should also have a discussion about how the spec should specify it. In particular, I think it's VERY important to hash out how octal should be represented -- should a leading 0 signify octal, as it does in C? Or should the 0o755 syntax be the only way to specify octal numbers? That is a discussion that needs to happen in the spec, so that we don't have parsers implementing this in two mutually-incompatible ways. (I have a STRONG preference for the 0o syntax, as I've long felt that it was a major mistake in the C language to have leading zeroes in code change the meaning of an int literal. But if lots of people feel otherwise, then we should go with the Principle of Least Surprise.)

Also, as of this writing, a total of 9 unique people have reacted with thumbs-up emoji on either this proposal, or on my March 30th comment on #53. So far I have not seen a thumbs-down emoji or a "We shouldn't do this" response to me. Both @BurntSushi and @mojombo said "Not sure we'll need this, and it would complicate parsers" to the original proposal, but haven't yet responded to this one.

And since their original "Not sure we'll need this" response was a good one, and there does need to be a use-case to justify the extra work for parser implementors, here's a summary of the use cases:

Hex - colors (#ff00ff) and bit flags (0x01, 0x02, 0x04, 0x08, 0x10, 0x20...) seem to be the most likely use cases in configuration files. I could maybe also see specifying "magic numbers" via an array of byte values, e.g. utf8bom = [ 0xef, 0xbb, 0xbf ]. That one's less likely -- but colors and bit flags are probably going to be used a lot, and they really NEED hex to be comprehensible.

Octal - Unix file permissions (0o755, 0o640) really need octal to be comprehensible. Those tend to show up pretty often in configuration files, which seem to be one of the uses TOML is intended for.

Binary - No obvious use case. MAYBE some utility for bitmasks: 0b11111100 is slightly more obvious about which bits it masks out than 0xfc. But if a parser has implemented hex and octal (both of which do have genuine use cases that need them), then the extra cost to implement binary is trivial (in nearly every language, binary-parsing will be basically a copy and paste of octal-parsing, with 8 changed to 2).
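To make these use cases concrete, here is a small Python sketch using the values mentioned above. The flag names (`FLAG_A` and so on) are made up for illustration; the color, the byte-order-mark bytes, and the bitmask come from the comment.

```python
import codecs

# Hex color: each pair of hex digits is one channel.
color = 0xff00ff
red, green, blue = (color >> 16) & 0xff, (color >> 8) & 0xff, color & 0xff
print(red, green, blue)

# Bit flags: hypothetical names, one bit per flag.
FLAG_A, FLAG_B, FLAG_C = 0x01, 0x02, 0x04
enabled = FLAG_A | FLAG_C
flag_b_set = bool(enabled & FLAG_B)
print(flag_b_set)

# "Magic number" as an array of byte values.
utf8bom = [0xef, 0xbb, 0xbf]
bom_matches = bytes(utf8bom) == codecs.BOM_UTF8
print(bom_matches)

# The binary mask and its hex spelling are the same number.
same_mask = 0b11111100 == 0xfc
print(same_mask)
```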

@lmna

lmna commented Jun 15, 2016

Hex could be a poor man's surrogate for MAC addresses, IPv6 addresses, public key / certificate fingerprints, and RFC 4122 UUIDs. Underscores could be put in place of colons and dashes. Note that IPv6 addresses and UUIDs are 128 bits long.
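The 128-bit remark matters if a parser stores integers in a signed 64-bit type, which is the range the TOML spec expects as far as I know (treat that as an assumption here). A rough Python sketch with an arbitrary example UUID and an address from the IPv6 documentation prefix:

```python
import ipaddress
import uuid

INT64_MAX = 2**63 - 1  # largest value of a signed 64-bit integer

# Arbitrary example values, chosen only for illustration.
example_uuid = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
example_ipv6 = ipaddress.IPv6Address("2001:db8::1")

values = {"uuid": example_uuid.int, "ipv6": int(example_ipv6)}
for label, value in values.items():
    # Show the hex form, the number of bits needed, and whether it overflows.
    print(label, hex(value), value.bit_length(), value > INT64_MAX)
```

Both values need more than 64 bits, so under that assumption they could not be stored as plain hex integers without widening the integer type.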

@FranklinYu

FranklinYu commented Jun 16, 2016

How about having a "lowest standard" without any advanced feature, while keeping a "suggested standard" for all the advanced features? Something like "we do not impose this requirement on your implementation, but if you do want this feature, then implement it as below..."? Most of the advanced features, including Hex/Oct literals and Date/Datetime literals, all serve as extensions to the standard: anything satisfying the "lowest standard" will still be parsed as expected even by a parser supporting such advanced features.

It might be a bad example, but it reminds me of C vs C++ (bad example because actually not all C code compiles as C++ code).

I am wondering how @BurntSushi and @mojombo like this idea.

update

A better example is the Scheme specification branching into two specifications: R7RS (small) to keep minimalism and R7RS (large) for more functionality.

@tshepang

tshepang commented Dec 15, 2016

So, to summarise, add these 3 ways to represent numbers:

  • 0b1010
  • 0o12
  • 0xa

Have these rules:

  • all letters be lower-case, with the possible exception of allowing upper-case for hex
  • negative signs not allowed

Am I missing anything?

@rmunn
Author

rmunn commented Dec 16, 2016

That's all I wrote. I just noticed that I didn't mention underscores between digits, the way the spec allows for decimal integers. For consistency's sake, I think underscores between digits should also be allowed in hex, octal and binary as well, especially since that is what is allowed in languages like Java and F#. So if underscores are allowed in decimal numbers and not in hex/octal/binary, then that will violate the principle of least surprise.

@timbunce

timbunce commented Feb 7, 2017

Having just come across TOML I was delighted by everything until I noticed the very odd omission of hex literals (and octal and binary by extension). In cases where such values are natural, trying to use anything else goes directly against TOML's "easy to read" objective. Well, to be sure, 16711935 is easy to read but the meaning is much more clearly expressed as 0xff00ff.

@rmunn
Author

rmunn commented Feb 18, 2017

One further refinement of my design suggestion: underscores should be allowed between digits, but NOT inside the 0x / 0o / 0b prefix of a hex, octal or binary number. I.e., 0_xdeadbeef is not acceptable, but 0xdead_beef is allowable. Allowing 0_x, etc., would just make parsing harder for no benefit.

I have not yet decided whether underscores should be allowed between the prefix and the first digit of the number; technically, the x or o or b of the prefix is not a digit, so if we want consistency with the decimal-numbers rule "each underscore must be surrounded by at least one digit", then we wouldn't allow that. But once the 0x has been parsed, the rest is unambiguous, and I could see someone wanting to write 0x_ff_00_ff_80 to represent the RGBA color "magenta, 50% transparent". So I lean towards allowing underscores between the 0x / 0o / 0b prefix and the first digit of the number.

@rmunn
Author

rmunn commented Mar 4, 2017

I've looked further at two existing languages that allow underscores in number literals (Java and F#). In both of these languages, as in TOML, the underscore may appear ONLY between digits, and they do not count a base prefix (0x, etc.) as being a digit for this purpose. In other words, 0x_12_34 is a syntax error in both F# and Java since an underscore may not appear immediately after the base prefix.

To follow the principle of least surprise, I have therefore decided that my TOML spec proposal will use the same rule as Java and F#. So underscores MUST NOT appear immediately after the base prefix. The 0x_ff_00_ff_80 example I gave in my previous comment will be invalid, and would have to be written as 0xff_00_ff_80 to be valid.
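As a rough sketch of that rule for the hex case, in Python: every underscore must have a digit on both sides, so it can appear neither inside the prefix nor directly after it. The names `is_valid_hex_literal` and `hex_value` are hypothetical, and octal and binary would follow the same pattern with a different digit class.

```python
import re

# Hypothetical validator for the underscore rule described above.
# One or more digit groups, separated by single underscores, after a bare 0x.
_HEX = re.compile(r"0x[0-9A-Fa-f]+(?:_[0-9A-Fa-f]+)*")


def is_valid_hex_literal(text: str) -> bool:
    """Return True when text follows the proposed underscore placement."""
    return _HEX.fullmatch(text) is not None


def hex_value(text: str) -> int:
    """Return the integer value of a valid literal, ignoring underscores."""
    if not is_valid_hex_literal(text):
        raise ValueError(f"invalid hex literal: {text!r}")
    return int(text[2:].replace("_", ""), 16)


for literal in ("0xff_00_ff_80", "0x_ff_00_ff_80", "0_xdeadbeef", "0xdead_beef"):
    print(literal, is_valid_hex_literal(literal))
```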

@guai

guai commented Mar 30, 2017

Octals are useless. I see no use for them except one single case: Unix file permissions. But even Unix has a more user-friendly option with the u, g, o mnemonics.
Modern languages tend not to have octals because we don't have 16-bit platforms anymore.

@rmunn
Author

rmunn commented Mar 31, 2017

Octals are almost useless for everything except Unix file permissions, yes — but that's a major use case, and sufficient justification all by itself for including them. The letter-based permissions can be easier to read in some cases (especially for people who don't use Unix very much), but experienced Unix admins find 0o644 to be perfectly easy to read. In fact, I personally would rather express permissions as 0o644 than the equivalent u=rw,g=r,o=r, and plenty of experienced Unix users feel the same way. Note, for example, how the examples for setting file permissions in Ansible did not feel the need to explain that mode 0644 means u=rw,g=r,o=r — but they did feel the need to explain that the textual mode u=rw,g=r,o=r was equivalent to 0644, because the octal notation is more familiar than the textual representation to anyone who uses Unix a lot.
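For readers less used to the octal form, the mapping between the two notations is mechanical: each octal digit is one group of read (4), write (2), and execute (1) bits. A minimal Python sketch, with a hypothetical `symbolic_mode` helper that covers only the nine basic permission bits (no setuid, setgid, or sticky bit):

```python
def symbolic_mode(mode: int) -> str:
    """Render the low nine permission bits as chmod-style clauses."""
    clauses = []
    for who, shift in (("u", 6), ("g", 3), ("o", 0)):
        bits = (mode >> shift) & 0o7  # one octal digit per class
        perms = "".join(
            letter for letter, bit in (("r", 4), ("w", 2), ("x", 1)) if bits & bit
        )
        clauses.append(f"{who}={perms}")
    return ",".join(clauses)


print(symbolic_mode(0o644))
print(symbolic_mode(0o755))
```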

(There's actually another decent reason to use octal, and that's to more easily spot UTF-8 multi-byte sequences in Unicode data, but that's not a use case for TOML. I'm just mentioning it for curiosity's sake.)

@guai

guai commented Mar 31, 2017

@rmunn, how about expressing this single case with just strings like "0644"?
Octals in the form 0NNN tend to cause problems when people know nothing about octals, but know math :) where leading zeros can be omitted.
0oNNN is poorly recognizable when people have problems with their sight.

@wbober

wbober commented Nov 27, 2017

Any progress on this?

@rmunn I'd like to comment on the use cases from a hardware developer perspective. I'd like to use toml as a configuration language for a test rig. Hex and binary are very useful when you deal with hardware, for example, hex is used to refer to memory or register addresses. Binary is very useful when you deal with register values.

@guai

guai commented Nov 27, 2017

On octals: here are some relatively new langs that decided not to have them.

  • Ceylon

Integer literals with a leading zero, 0, are allowed, but unlike other C-like programming languages, such literals are not interpreted using octal notation.

  • Kotlin

NOTE: Octal literals are not supported.

  • D deprecates octals

Rationale
The use of a leading zero is confusing, as 0123 != 123.

@tshepang

@guai Octals should not be ambiguous if you prefix them with 0o, not just 0.

@BurntSushi
Member

As requested by #330 (comment), I'll weigh in.

First and foremost, this is a backwards compatible addition, since all conforming parsers today will return an error if a user types a hex/octal literal as proposed here. Therefore, there is no particular reason to render a verdict now.

Secondly, I'd personally be in favor of adding at least hex. Octal seems useful for file permissions. @mojombo what do you think?

@rmunn
Author

rmunn commented Nov 27, 2017

Many other new languages have decided to allow octal, but to settle on the 0o prefix instead of the confusing leading zero. https://en.wikipedia.org/wiki/Octal lists Haskell, OCaml, Perl 6, Python 3, Ruby, Tcl 9, and ECMAScript 6 as all supporting octal written in the 0o syntax. So 0o is basically the standard way of doing octal these days.

@BurntSushi
Member

I agree. If we do octal, we should use a 0o prefix.

@guai

guai commented Nov 27, 2017

I definitely agree that 0o is way better than just a 0 prefix, in most sane fonts at least. But I still don't see much use for it at all, because there are no 16-bit platforms out there. The only usable example mentioned in this thread is Unix filesystem access rights. And if there is exactly one use case, isn't it better to support it explicitly in the form of mnemonics?

@tshepang

@guai what are mnemonics?

@pradyunsg
Member

I'd like to see hex and octal literals. The former is common for representing multibit values and the latter is used for single bit values (like Unix permissions).

0xDead_Beef and 0o644_000 look good to me too. :)

@guai

guai commented Nov 27, 2017

@tshepang, it would be something like flags = rwxr-xr-x or flags = u=rwx,go=rx instead of hm... flags= 0o755 I guess

@tshepang

I find mnemonics clumsier, and dedicated support for them doesn't feel justified (use strings instead). OTOH octals are more general, and there probably is some other use for them beyond Unix file permissions.

@rmunn
Author

rmunn commented Nov 27, 2017

@guai - For the Unix access permissions use case, octal numbers are more widely used than the mnemonics, and especially in config files. I can't point you to any evidence for this assertion, since AFAIK nobody has done a statistical analysis. But in my experience, you'll see a lot more chmod 0755 commands than chmod u=rwx,g=rx,o=rx. It's more compact, just as easy to read for any experienced Unix user (and if they're setting permissions in config files, they're probably experienced) and it's what Unix users have come to expect.

And I'd be against a special-case flags = rwxr-xr-x or flags = u=rwx,go=rx syntax that would translate into numbers. Far too specialized; if a config file wants to allow that, strings are a much better fit there.

@guai

guai commented Nov 27, 2017

@rmunn, it's just that sort of craziness everyone got used to.

experienced Unix user

And who would be an average TOML user? In a neighboring thread I was told that the concept of an empty path is not obvious enough for TOML, but that is also something well known to every user familiar with any filesystem.

@rmunn
Author

rmunn commented Nov 27, 2017

@guai - The fact that you said on March 30 that "octals are useless" when they're used in just one use case (Unix file permissions) makes me think that you do most of your development on Windows. Is that correct? If so, you have relatively little experience with Unix, so you wouldn't know just how much more often the octal-number format is used in Unix permissions than the text format. But here's one data point to help convince you: I've been using Linux for about 20 years now, and I can look at permissions like 775 or 644 and tell you exactly what they mean. But every time I try to write the mnemonic permissions, I have to stop and say "Does the o in o=rx mean 'owner', or 'other'?" And then I have to look it up.

Anyway, I've made my point so it's time to move on to a different topic: should binary numbers (0b1101) be included as well?

Pro: Consistency, a.k.a. the "why not?" argument. If you've already written code to handle hex and octal numbers in your parser, handling binary numbers is trivial to add.
Con: Not often needed, a.k.a. the "why?" argument.

I was thinking that binary should be dropped from the proposal, but then @wbober mentioned an actual use case: config files for driving a hardware test rig. When you're writing a file to send a specific set of binary digits to a connector, and the connector's pins are numbered from (say) 0 to 15, it's easier to use pinout = 0b1101_0011_0111_0010 than pinout = 0xd372. The binary version of that number will let you see at a glance whether pin 12 has a high signal (1) or a low signal (0), whereas the hex version requires you to do a conversion in your head.
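The "conversion in your head" is a shift and a mask. A small Python sketch of the pin lookup, assuming pin 0 is the least significant bit (the comment does not say which end pin 0 is on); `pin_is_high` is a hypothetical helper name:

```python
# The pinout from the example above, pins 15 down to 0 reading left to right.
pinout = 0b1101_0011_0111_0010


def pin_is_high(value: int, pin: int) -> bool:
    """Return True when the given pin's bit is set (pin 0 = lowest bit)."""
    return bool((value >> pin) & 1)


print(pin_is_high(pinout, 12))
print(hex(pinout))
```

In the binary literal the answer for pin 12 can be read directly off the fourth digit from the left; from the hex form you first have to expand the leading hex digit into its four bits.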

So since there's a real user with a real use case (and because the cost of implementing binary format is trivial once you've added hex and octal formats), I'm now in favor of saying "Yes, let's add binary as well".

@BurntSushi
Member

BurntSushi commented Nov 27, 2017 via email

@guai

guai commented Nov 27, 2017

@rmunn, I have quite a lot of Unix experience, but still hate to convert those meaningless digits in my head all the time. It's just bad design, and is still there for legacy reasons.

I think binary is more useful than octal

But there is still a question left: will our user be experienced enough? If he is at least an experienced Unix user, then the point of this topic is OK, but many other decisions were made with less experienced users in mind, I think.

@pradyunsg
Member

I agree with @BurntSushi. Let's just wait. :)

@mojombo
Member

mojombo commented Nov 27, 2017

Thank you all for your patience and the excellent arguments presented here! It's been a year and a half since this was opened, and as I hoped, time would bear out which features would turn out to be important to real TOML users. I think I've seen enough evidence now that hex, octal, and binary all have reasonable use cases and should be included in TOML as first class citizens. I'll draw up a PR for their inclusion with 0x, 0o, and 0b prefixes respectively.

@pradyunsg
Member

This issue can be closed. :)

@tshepang

@pradyunsg why?

@pradyunsg
Member

Ah. My bad. I thought this was some other issue. :/

@mojombo
Member

mojombo commented Nov 29, 2017

See #507 for the proposal.

@rmunn
Author

rmunn commented Nov 29, 2017

One comment about underscores in numeric literals: my proposal so far has been that an underscore is not allowed between a hex/octal/binary prefix and the first digit of the number. That is, 0x_dead_beef is not allowed, and must be written as 0xdead_beef. I looked at Java and F#, and both of them followed that rule.

I have just learned that C# 7.2 will allow underscores between a prefix and the first digit, so that 0x_dead_beef would be a legal numeric literal in C# 7.2. At the moment, I'm inclined to not change my proposal, and have TOML forbid that syntax, because that's slightly easier for parsers to handle. If Java and F# follow C#'s lead and start allowing 0x_dead_beef literals, then I'd revise my proposal and suggest that the next version of TOML start allowing that as well, for the sake of least surprise.

But it's better to start out strict and then loosen restrictions later, because that keeps backward compatibility. I.e., if the original rule is that 0x_12_34 is not allowed, then everyone will write 0x12_34 in their TOML files, which will still be legal if the restriction is relaxed to allow 0x_12_34. However, if we start by allowing literals like 0x_12_34 and then wanted to shift to disallowing that syntax, then we'd end up invalidating existing config files.

So I recommend keeping the proposal as-is with regard to the underscore rules, but if C# 7.2's slightly looser underscore rules make their way into Java and F# (and other languages that I haven't looked at yet), then we can loosen TOML's underscore restrictions as well, in whatever future version of the TOML spec would be appropriate.
