Add low-level APIs for converting integers to strings #46281

SimonSapin · 2017-11-26T20:07:59Z

This adds the following new APIs on primitive integer types, unstable under the int_to_str feature gate:

impl $Int {
    pub fn to_uppercase_str_radix(self, buffer: &mut [u8], radix: u32) -> &mut str { /* … */ }
    pub fn to_lowercase_str_radix(self, buffer: &mut [u8], radix: u32) -> &mut str { /* … */ }
    pub fn to_str(self, buffer: &mut [u8]) -> &mut str { /* … */ }
    pub const MAX_STR_LEN: usize = /* … */;
}

They are the “mirror” APIs of from_str and from_str_radix.

The &mut [u8] -> &mut str pattern is the same as char::encode_utf8: take a (possibly stack-allocated) buffer, write into it, and return a slice of it. It is maximally flexible in terms of allocation/copying strategy.

to_str is equivalent to the existing impls of the fmt::Display trait (it uses the same optimized algorithm) but does not have the overhead of creating a fmt::Formatter and otherwise invoking the string formatting machinery. This overhead was deemed high enough to make a crate dedicated to avoiding it: https://github.com/dtolnay/itoa/

It always requires a buffer of at least MAX_STR_LEN bytes, even for values that would fit in a smaller buffer. This is to avoid adding bound checks to the algorithm. Alternatives are to add bound checks, or add them only for to_str and not Display with more (internal) generics magic.

to_lowercase_str_radix and to_lowercase_str_radix are similar to the existing impls of the fmt::LowerHex, UpperHex, Octal, and Binary traits; but they support more bases and behave differently for negative integers: they use a minus sign followed by the representation of the mathematically opposite integer, whereas the formatting traits use the two’s complement. For example:

assert_eq!(format!("{:X}", -26), "FFFFFFE6");
assert_eq!((-26).to_uppercase_str_radix(&mut [0; 8], 16), "-1A");

There’s two separate methods for upper v.s. lower case to avoid a bool parameter that wouldn’t be self-explanatory at call sites, or a dedicated enum that seems like more new API surface than is warranted. An alternative might be to pick one of upper or lower case and only provide that, but that seems arbitrary since neither is obviously superior IMO.

…r bound. For example, 99 would be assume to have decimal length 3 since it has the same bit length as 127.

…tegers

SimonSapin · 2017-11-26T20:08:19Z

CC @rust-lang/libs

sfackler · 2017-11-26T20:37:07Z

src/libcore/num/mod.rs

@@ -1315,6 +1420,98 @@ macro_rules! uint_impl {
            from_str_radix(src, radix)
        }

+        /// Write the representation in a given base in a pre-allocated buffer.
+        ///
+        /// Digits are a subset (depending on `radix`) of `0-9A-F`.


0-9A-Z, right?

Oops. Fixed.

hanna-kruppe · 2017-11-27T10:20:49Z

In general I'm in favor of exposing low-level APIs like that to be able to reuse the optimized libstd implementation as building block for tasks that can't or don't want to use std::fmt. However, this part:

to_lowercase_str_radix and to_lowercase_str_radix are similar to the existing impls of the fmt::LowerHex, UpperHex, Octal, and Binary traits; but [...] behave differently for negative integers

seems worrying. I can see the desire for the different behavior, but if these APIs are supposed to be the "high level of control" equivalent of std::fmt APIs, any difference in behavior is an unexpected footgun, especially when the difference is not due to accepting arbitrary bases rather than knowing the base ahead of time.

I'm also wondering whether it could be worthwhile to wait for const generics for the *_radix APIs. In the typical use cases I've seen, the radix is virtually always constant, and for the common case of a power-of-two radix, it being constant could enable nice optimizations.

SimonSapin · 2017-11-27T12:31:40Z

For what it’s worth, I believe that the footgun is that the formatting traits use the two’s complement representation in the first place. I trace the git history of this code and found no evidence that this isn’t accidental (such as discussion of this behavior or why it is desirable). I’d propose changing that behavior, but it’s been #[stable] long enough that we probably can’t. So all that’s left to do is documenting it #46285 and moving on.

I think of of these new APIs less as the low-level equivalent of formatting traits, and more as the inverse of from_str and from_str_radix. I’ve only mentioned the formatting trait to explain why the existing functionality is not sufficient.

As to making the radix a type parameter instead of a runtime parameter, we already have a from_str_radix that uses the latter and is stable. We can always add something else later, maybe to optimize hexadecimal specifically.

hanna-kruppe · 2017-11-27T13:12:34Z

For what it’s worth, I believe that the footgun is that the formatting traits use the two’s complement representation in the first place.

I fully agree, but if it's here to stay, internal consistency also matters.

I think of of these new APIs less as the low-level equivalent of formatting traits, and more as the inverse of from_str and from_str_radix

I didn't consider this. A valid perspective, but these new APIs are also solving the same problem as std::fmt solves, so they're also connected to std::fmt. So we're between a rock and a hard place -- either be consistent with the other integer formatting APIs, or be consistent with the inverse function. I'm almost tempted to suggest changing the behavior of std::fmt.

As to making the radix a type parameter instead of a runtime parameter, we already have a from_str_radix that uses the latter and is stable

Good point.

shepmaster · 2017-12-01T18:46:52Z

Slipped through assignment...

r? @dtolnay

dtolnay · 2017-12-07T18:35:50Z

src/libcore/num/mod.rs

+        /// assert_eq!(i16::min_value().to_str(&mut [0; i16::MAX_STR_LEN]), "-32768")
+        /// ```
+        #[unstable(feature = "int_to_str", issue = /* FIXME */ "0")]
+        pub fn to_str(self, buffer: &mut [u8]) -> &mut str {


I am not thrilled about how an inherent method to_str (which is almost never what you want) competes in rustdoc searches and autocomplete with a trait method to_string (which is almost always what you want). Would it make sense to name this in a scarier way?

dtolnay · 2017-12-07T18:42:01Z

It would be neat to explore how typenum could work for specifying the radix, similar to what #46281 (comment) suggested.

I am sympathetic to these being the inverse of from_str_radix and not the inverse of LowerHex etc. I think that is the right choice for what to expose.

println!("{}", <i16>::from_str_radix("-ff", 16).unwrap());
println!("{}", <i16>::from_str_radix("ffff", 16).unwrap_err());

Overall I would prefer to have these bake in a high-quality integer formatting library on crates.io first. Let's run it by the rest of the team in case someone wants to justify why these need to be exposed by std.

@rfcbot fcp close

rfcbot · 2017-12-07T18:42:03Z

Team member @dtolnay has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

SimonSapin · 2017-12-07T19:29:05Z

I don’t mind renaming to_str but I don’t have a good name idea right now. Do you have a suggestion?

It would be neat to explore how typenum could work for specifying the radix

Do we really want something like typenum in the standard library?

I would prefer to have these bake in a high-quality integer formatting library on crates.io first

Hasn’t that already happened with https://crates.io/crates/itoa ?

We could also keep only the to_str (possibly renamed) part of this PR, and leave the _radix case for later.

carols10cents · 2017-12-12T18:43:36Z

ping @BurntSushi @Kimundi @aturon @sfackler waiting for your ticky boxes here!

SimonSapin · 2017-12-13T22:34:24Z

@dtolnay, FCP to close immediately after stating some concerns feels premature. What do you think of not keeping the _radix method, and renaming to_str to write_to for example?

dtolnay

Thanks for the feedback Simon! I see what you mean. I apologize for not expressing my perspective clearly on the first try. It isn't so much that I had a few concerns -- it is that I have a general sense that it would be more productive to iterate on this feature in a library than adding it here now.

Even restricted to just the base-10 method, it feels like there is enough design space that we shouldn't hope to get it right on the first try. Having it bake in a library is better for everyone involved.

A library would be immediately available on the stable channel, so we can migrate code that currently uses itoa or similar to collect feedback. I would be happy to migrate serde_json as long as the new one is at least as fast.
A library can be versioned by Cargo so we avoid breaking nightly users as we settle on the best approach.

This is similar to the path that failure is following. For something like this, I believe bringing an established high-quality library to the table can be a faster path to stability in the standard library than iterating on #[unstable] stuff.

As an example of there being more design space than I think we have explored, here is a totally different approach that brings many advantages. Compare:

let mut buf = unsafe { mem::uninitialized::<[u8; i16::MAX_STR_LEN]>() };
let s = i.write_to(&mut buf);

// wrapper around uninitialized buffer of the right size
let mut buf = NumberBuffer::new();
let s = buf.stringify(i);

Advantages

Easier to use correctly.
Impossible to use incorrectly.
No more awkward MAX_STR_LEN constant.
No bounds check that your &[u8] is big enough.
No ambiguity about whether the &[u8] needs to be big enough to hold any integer, or just your integer.
No ambiguity about alignment of the result within the buffer.
Cannot panic.
Safe code gets to use an uninitialized buffer.
Googleable type name.
Forward compatible with controlling the base in a way that allocates a buffer of exactly the right size for your base using a const fn, struct NumberBuffer<const RADIX: usize = 10>.
No conflict with to_string autocomplete on the primitives.
The &mut is provided by autoref.
No ambiguity about whether the method stomps on the front of your buffer as scratch space.

Disadvantages

No longer exposes the ability to write right-aligned into an existing buffer. Unclear if this is ever valuable in practice.
Other things, I'm sure.

SimonSapin · 2017-12-14T15:42:32Z

As maintainer of the itoa crate, how do you feel about adding NumberBuffer (IntegerBuffer?) there?

dtolnay · 2017-12-14T18:19:36Z

I don't think I am going to have bandwidth for this in the near term. It would be better if someone else can drive this and changes are not blocked on me, so I think releasing a different library would make more sense.

rfcbot · 2017-12-19T21:05:53Z

🔔 This is now entering its final comment period, as per the review above. 🔔

bors · 2017-12-22T01:51:58Z

☔ The latest upstream changes (presumably #46922) made this pull request unmergeable. Please resolve the merge conflicts.

dtolnay · 2017-12-23T13:08:58Z

I will go ahead and close this. Thanks for the PR and discussion! Again I am not opposed to adding a low-level number to string API in the standard library but I believe it will be more productive to iterate on this in a crate. The nb crate name is available in case anyone wants to run with the NumberBuffer design. 😉

SimonSapin added 8 commits November 26, 2017 13:57

Extract decimal integer formatting logic into separate functions

41c8e45

Add a to_str method to unsigned integers, taking a pre-allocated buffer

33693eb

Unindent a no-op block

7cd95e2

Add a to_str method to signed integers, taking a pre-allocated buffer

334280c

UnsignedToStr::str_len: replace lookup table by some integer arithmetic

a3a0714

<$Int>::str_len based on bit length was wrong, always require an uppe…

3f9bacb

…r bound. For example, 99 would be assume to have decimal length 3 since it has the same bit length as 127.

Add to_uppercase_str_radix and to_lowercase_str_radix to primitive in…

3de7bfe

…tegers

Replace a magic number with a constant

53bd486

sfackler reviewed Nov 26, 2017

View reviewed changes

Typo fix

ad4edf7

kennytm added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Nov 27, 2017

rust-highfive assigned dtolnay Dec 1, 2017

dtolnay reviewed Dec 7, 2017

View reviewed changes

rfcbot added the proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. label Dec 7, 2017

carols10cents added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 12, 2017

dtolnay reviewed Dec 14, 2017

View reviewed changes

rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Dec 19, 2017

dtolnay closed this Dec 23, 2017

dtolnay mentioned this pull request Mar 17, 2018

Support other radix dtolnay/itoa#15

Closed

Add low-level APIs for converting integers to strings #46281

Add low-level APIs for converting integers to strings #46281

Uh oh!

Conversation

SimonSapin commented Nov 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SimonSapin commented Nov 26, 2017

Uh oh!

sfackler Nov 26, 2017

Choose a reason for hiding this comment

Uh oh!

SimonSapin Nov 26, 2017

Choose a reason for hiding this comment

Uh oh!

hanna-kruppe commented Nov 27, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SimonSapin commented Nov 27, 2017

Uh oh!

hanna-kruppe commented Nov 27, 2017

Uh oh!

shepmaster commented Dec 1, 2017

Uh oh!

dtolnay Dec 7, 2017

Choose a reason for hiding this comment

Uh oh!

dtolnay commented Dec 7, 2017

Uh oh!

rfcbot commented Dec 7, 2017 • edited by aturon Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SimonSapin commented Dec 7, 2017

Uh oh!

carols10cents commented Dec 12, 2017

Uh oh!

SimonSapin commented Dec 13, 2017

Uh oh!

dtolnay left a comment

Choose a reason for hiding this comment

Advantages

Disadvantages

Uh oh!

SimonSapin commented Dec 14, 2017

Uh oh!

dtolnay commented Dec 14, 2017

Uh oh!

rfcbot commented Dec 19, 2017

Uh oh!

bors commented Dec 22, 2017

Uh oh!

dtolnay commented Dec 23, 2017

Uh oh!

Uh oh!

SimonSapin commented Nov 26, 2017 •

edited

Loading

hanna-kruppe commented Nov 27, 2017 •

edited

Loading

rfcbot commented Dec 7, 2017 •

edited by aturon

Loading