Display impl for u128 and i128 is slow #44583

dtolnay · 2017-09-15T00:52:38Z

@henninglive has a significantly faster one in dtolnay/itoa#10 (comment). Let's provide the faster one in std::fmt!

Unoptimized:
	test bench_itoa::bench_u128_max ... bench:       3,009 ns/iter (+/- 99)

Manual linking:
	test bench_itoa::bench_u128_max ... bench:       1,537 ns/iter (+/- 59)

Inlined:
	test bench_itoa::bench_u128_max ... bench:       1,289 ns/iter (+/- 54)

henninglive · 2017-09-15T01:03:04Z

The implementation is copied from std::fmt; the only change is that I have inlined the __udivmodti4() implementation to work around #44545 until LLVM is smart enough to combine the calls. This is probably not relevant for std.

dtolnay · 2017-09-15T01:09:20Z

I guess I haven't looked at the two implementations carefully. Can we not inline __udivmodti4() in std?

henninglive · 2017-09-15T01:15:12Z

It’s possible, but we probably don’t want to do that. __udivmodti4() is an intrinsic passed to LLVM to be used to implement division for i128/u128 on architectures without native 128-bit support. It is an implementation detail of the architecture and is currently defined separately from the compiler itself. Std is going to be nearly as fast as the inlined version when #44545 is eventually fixed. We should probably wait for that, or we could let std link __udivmodti4() as it did in the past, rust-lang/rfcs#914.

cuviper · 2017-09-15T17:23:59Z

For platforms that don't have native 128-bit division, it may be faster to break it into parts that can do native division from there. Bases 2, 8, and 16 can split directly to 64-bit parts, and base-10 can split into three base-10¹³ parts. If some 32-bit targets don't have native 64-bit division, they may do even better with 5 base-10⁸ parts stored in 32-bit, etc.
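A rough sketch of the base-10¹³ split described above, assuming a hypothetical helper named `format_u128_chunked` (it is not part of std or itoa). It still needs two 128-bit divisions and two remainders to produce the limbs; the saving is that all per-digit work afterwards happens on native `u64` values.

```rust
/// Hypothetical helper: format a u128 in decimal by splitting it into
/// three base-10^13 limbs, each of which fits in a u64.
fn format_u128_chunked(n: u128) -> String {
    const BASE: u128 = 10_000_000_000_000; // 10^13

    // Two 128-bit div/rem pairs produce the limbs.
    let lo = (n % BASE) as u64;
    let rest = n / BASE;
    let mid = (rest % BASE) as u64;
    // rest / BASE is at most about 3.4 * 10^12, so it fits in a u64.
    let hi = (rest / BASE) as u64;

    // Only the most significant non-zero limb is printed without padding;
    // lower limbs are zero-padded to exactly 13 digits.
    if hi != 0 {
        format!("{}{:013}{:013}", hi, mid, lo)
    } else if mid != 0 {
        format!("{}{:013}", mid, lo)
    } else {
        format!("{}", lo)
    }
}
```

The same shape should apply to the 5 × 10⁸ variant for 32-bit targets, with `u32` limbs and 8-digit padding.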

cuviper · 2017-09-15T17:30:33Z

Hmm, I just read the implementation, and it already chunks it by 10000, so that much is probably pretty good already.
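For readers who have not looked at that code, here is a minimal sketch of what chunking by 10000 looks like. It is an illustration written for this thread, not the actual std::fmt source; the function name `format_by_10000` and the way the lookup table is built are assumptions.

```rust
/// Hypothetical sketch: decimal formatting that peels off four digits
/// per 128-bit division and emits them via a two-digit lookup table.
fn format_by_10000(mut n: u128) -> String {
    // Table holding "00", "01", ..., "99" back to back.
    let lut: Vec<u8> = (0..100u8)
        .flat_map(|i| [b'0' + i / 10, b'0' + i % 10])
        .collect();

    let mut buf = [0u8; 39]; // u128::MAX has 39 decimal digits
    let mut pos = buf.len();

    // One 128-bit division (plus remainder) per four output digits.
    while n >= 10_000 {
        let rem = (n % 10_000) as usize;
        n /= 10_000;
        let (hi, lo) = (rem / 100, rem % 100);
        pos -= 4;
        buf[pos..pos + 2].copy_from_slice(&lut[hi * 2..hi * 2 + 2]);
        buf[pos + 2..pos + 4].copy_from_slice(&lut[lo * 2..lo * 2 + 2]);
    }

    // At most four digits remain, so native-width arithmetic suffices.
    let mut n = n as usize;
    if n >= 100 {
        let lo = n % 100;
        n /= 100;
        pos -= 2;
        buf[pos..pos + 2].copy_from_slice(&lut[lo * 2..lo * 2 + 2]);
    }
    if n >= 10 {
        pos -= 2;
        buf[pos..pos + 2].copy_from_slice(&lut[n * 2..n * 2 + 2]);
    } else {
        pos -= 1;
        buf[pos] = b'0' + n as u8;
    }

    String::from_utf8(buf[pos..].to_vec()).unwrap()
}
```

With this shape, a 39-digit value costs nine 128-bit divisions, which is why the cost of each division dominates.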

henninglive · 2017-09-15T18:20:31Z

I haven’t tried it, but I believe the current implementation is going to be pretty terrible on platforms without u64. If I understand this code correctly, the __udivmodti4() implementation is going to use __udivmoddi4 and its brothers to do u64 operations, which means a lot of non-inlined function calls to divide two u128s.

dtolnay · 2017-09-16T21:07:34Z

I have yet another implementation in dtolnay/itoa#12. This one is 13x faster than std::fmt on my machine.

steveklabnik · 2020-01-21T10:54:23Z

Triage: this appears to have gotten even more extreme over time:

running 2 tests
test itoa   ... bench:         216 ns/iter (+/- 28)
test stdlib ... bench:       1,377 ns/iter (+/- 123)

The reproduction involves cargo features and so cannot be done on the playground, details below.

Cargo.toml:

[package]
name = "itoatest"
version = "0.1.0"
authors = ["Steve Klabnik <steve@steveklabnik.com>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies.itoa]
version = "*"
features = ["std", "i128"]

in benches/bench.rs:

#![feature(test)]
extern crate test;
extern crate itoa;

use test::Bencher;

#[bench]
fn stdlib(b: &mut Bencher) {
    b.iter(|| {
        let s = format!("{}", u128::max_value());
        std::hint::black_box(s);
    });
}

#[bench]
fn itoa(b: &mut Bencher) {
    b.iter(|| {
        let mut s = String::new();
        itoa::fmt(&mut s, u128::max_value()).unwrap();
        std::hint::black_box(s);
    });
}

Use less divisions in display u128/i128

This PR is an absolute mess, and I need to test if it improves the speed of fmt::Display for u128/i128, but I think it's correct. It hopefully is more efficient by cutting u128 into at most 2 u64s, and also chunks by 1e16 instead of just 1e4.

Also I specialized the implementations for uints to always be non-false because it bothered me that it was checked at all.

Do not merge until I benchmark it and also clean up the god awful mess of spaghetti.

Based on prior work in rust-lang#44583

cc: `@Dylan-DPC`

Due to work on `itoa` and suggestion in original issue: r? `@dtolnay`

mdibaiee · 2022-01-09T13:51:47Z

Is this still an open issue? I see a PR for this has been merged.

mati865 · 2022-01-15T14:41:33Z

Using @steveklabnik's example (with added #![feature(bench_black_box)]) I've got this:

running 2 tests
test itoa   ... bench:          34 ns/iter (+/- 0)
test stdlib ... bench:          54 ns/iter (+/- 0)

I think there is still some room for improvements but it's much better now.

dtolnay · 2022-01-15T17:18:29Z

That's likely the best std::fmt can do. The remaining 20ns discrepancy is not algorithmic, it is the overhead of the Formatter machinery.

dtolnay mentioned this issue Sep 15, 2017

Add support for 128-bit integers dtolnay/itoa#10

Merged

Mark-Simulacrum added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Sep 17, 2017

dtolnay mentioned this issue Feb 11, 2018

Write u128 using only two divisions dtolnay/itoa#12

Merged

Mark-Simulacrum added the E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. label May 27, 2018

JulianKnodt mentioned this issue Aug 28, 2020

Use less divisions in display u128/i128 #76017

Merged

dtolnay closed this as completed Jan 15, 2022

dtolnay added T-libs Relevant to the library team, which will review and decide on the PR/issue. and removed T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jan 15, 2022
