Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Some performance improvements #133

Merged
merged 6 commits into from
Dec 11, 2015
Merged

Conversation

koverstreet
Copy link
Contributor

I added add and subtract methods that take an operand by value, instead of by reference, so they can work in place instead of always allocating new BigInts for the result.

I'd like to hear if anyone knows a better way of doing this, though - with the various wrappers/forwardings doing this for all the operations is going to be a mess and add a ton of boilerplate. I haven't even tried to do a complete job of it yet.

I'm starting to think the sanest thing to do would be to only have in place implements of most of the operations - and have the wrappers clone() if necessary. That way, the only extra work we're doing is a memcpy(), which is a hell of a lot faster than a heap allocation.

@cuviper
Copy link
Member

cuviper commented Nov 16, 2015

CI failed on rust 1.0 .. We'll want to keep that working if possible.

I actually have been working on something similar. I'll try to clean that up and push to a branch soon so we can compare/merge approaches.

Do you have comparative benchmark results to share?

@koverstreet
Copy link
Contributor Author

On Mon, Nov 16, 2015 at 01:50:04AM -0800, Josh Stone wrote:

CI failed on rust 1.0 .. We'll want to keep that working if possible.

Yeah, I just saw that...

I actually have been working on something similar. I'll try to clean that up and push to a branch soon so we can compare/merge approaches.

Figures :) I'm definitely interested in seeing what you did. This is my "screw
around and learn Rust" project.

I'm also toying with inline assembly for the inner loops - mainly for the
purpose of getting away from DoubleBigDigit so all the arithmatic can be done in
the native word size. So far just toying, though.

Do you have comparative benchmark results to share?

Yeah, in the patch messages.

@koverstreet
Copy link
Contributor Author

Just got it working with inline asm and BigDigit size == usize, if anyone wants to take a look.

https://github.com/koverstreet/num/commits/asm

Very hacky but it does give a nice performance boost on the multiply benchmark.

@koverstreet
Copy link
Contributor Author

Found some new performance improvements and rolled them into the patch. The current numbers are:

Before:
multiply_0: 7,137 ns/iter (+/- 620)
multiply_1: 2,104,565 ns/iter (+/- 255,208)
multiply_2: 51,620,572 ns/iter (+/- 2,707,818)

After:
multiply_0: 634 ns/iter (+/- 61)
multiply_1: 60,992 ns/iter (+/- 6,843)
multiply_2: 6,849,681 ns/iter (+/- 388,000)

@cuviper
Copy link
Member

cuviper commented Nov 16, 2015

OK, I pushed my stuff to cuviper:wip. I think I've got the sort of val/ref forwarding changes you were thinking of. For the most part, my changes are more incremental tweaks than what you have. My performance improvement to Add/Sub is similar to yours, I think, but I definitely didn't improve Mul as much as you did. Take a look -- if it's not too hard for you to refactor your improvements on top of mine, that would be awesome... I think my forwarding simplifies it a bit, but let me know what you think.

As for asm stuff, that's a tricky door to open. It's a feature only for rust nightly, of course, and only for archs you implement, so there still has to be a pure-Rust fallback. I think if you want to explore this route, you'd be better off working on ramp.

@koverstreet
Copy link
Contributor Author

On Mon, Nov 16, 2015 at 11:23:50AM -0800, Josh Stone wrote:

OK, I pushed my stuff to cuviper:wip. I think I've got the sort of val/ref forwarding changes you were thinking of. For the most part, my changes are more incremental tweaks than what you have. My performance improvement to Add/Sub is similar to yours, I think, but I definitely didn't improve Mul as much as you did. Take a look -- if it's not too hard for you to refactor your improvements on top of mine, that would be awesome... I think my forwarding simplifies it a bit, but let me know what you think.

Yeah, I can do that. Any idea when your changes will go in?

As for asm stuff, that's a tricky door to open. It's a feature only for rust nightly, of course, and only for archs you implement, so there still has to be a pure-Rust fallback. I think if you want to explore this route, you'd be better off working on ramp.

Oh, I'm not in a rush to get the asm stuff in. But I do like my work to be used,
so I'd prefer to just do whatever is necessary to get it into num when it's
ready, no offense to the ramp project :)

Is it possible to use #[cfg] to switch off the rust version? Or would we be
stuck waiting until the oldest version of rust we have to support supports
whatever features we want to use?

@homu
Copy link
Contributor

homu commented Nov 19, 2015

☔ The latest upstream changes (presumably 658637d) made this pull request unmergeable. Please resolve the merge conflicts.

@cuviper
Copy link
Member

cuviper commented Nov 19, 2015

There you go, I pushed my changes. Thanks homu. :)

I don't know any way to #[cfg] the rust version -- I wish there were. We could add a manual asm or nightly feature in cargo, but nothing automatic that I'm aware of.

@hauleth
Copy link
Member

hauleth commented Nov 19, 2015

@cuviper there is none. Only way to support asm in num would be Carfo feature flag.

@koverstreet
Copy link
Contributor Author

On Wed, Nov 18, 2015 at 10:06:03PM -0800, Josh Stone wrote:

There you go, I pushed my changes. Thanks homu. :)

I don't know any way to #[cfg] the rust version -- I wish there were. We could add a manual asm or nightly feature in cargo, but nothing automatic that I'm aware of.

Damn, that sounds like a major oversight. If I'm feeling particularly ambitious
maybe I'll work on adding that to rustc.

That said - I also wrote a patch that implements all the low level arithmetic
(the same ones the asm patch implements) without using DoubleBigDigit, so
BigDigit can be changed to usize - that actually gets us most of the performance
of the asm patch (which is still rather simple and dumb).

I pushed that patch out, and also rebased everything on top of your stuff.

@cuviper
Copy link
Member

cuviper commented Nov 23, 2015

Let's step back a little and keep focus in this PR, and especially do what we can without breaking API. We can do more in followup PRs. So let's keep the basic add/sub/mul optimizations (first four commits). The ops against primitives can be their own PR, a lot like what #125 did for complex numbers. The BigDigit changes constitute an API break -- which I am open to, but we'd need to bump semver, and if we do this I'd like to hide BigDigit entirely as an implementation detail. (But perhaps some of your optimizations avoiding DoubleBigDigit are still usable right now, if they still help with u32 BigDigit.)

I'll add a few more inline comments too...

#[bench]
fn divide_2(b: &mut Bencher) {
divide_bench(b, 1 << 16, 1 << 12);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last one takes painfully long to benchmark, even after all of your optimizations. Let's leave it out or #[ignore] it for now. (Then use --ignored to run it anyway.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new performance is impressive enough that we can keep this after all.

@koverstreet
Copy link
Contributor Author

Thanks for the review feedback.

On Sun, Nov 22, 2015 at 05:31:23PM -0800, Josh Stone wrote:

Let's step back a little and keep focus in this PR, and especially do what we can without breaking API. We can do more in followup PRs. So let's keep the basic add/sub/mul optimizations (first four commits). The ops against primitives can be their own PR, a lot like what #125 did for complex numbers.

Alright, sounds good.

I do have a patch to improve divide performance - do you mind including that, to
fix the issue with the benchmark taking too long to run?

The only thing I haven't changed per your comments is add/add2 and the way carry
is handled - it's a relatively minor issue, but I feel better about handling the
carry within add2 instead of returning it.

Reason being in the other places where add2 is used (multiply/divide), if they
don't size a correctly to hold the result then that's a serious bug (we'd return
an incorrect result) - and the way I've got it now we always catch that with an
assert, but if add2 is returning a carry bit then it's up to the caller to
check.

We could handle that with more wrapper functions, but... it just seems not worth
it as long as the extra push/pop isn't an actual performance issue (and it
really shouldn't be, having to realloc the vec won't be the common case).

So I wouldn't mind if it's changed later but I lean towards leaving it with the
extra push/pop for now.

Aside from that - I incorporated all your review feedback, and pushed with just
the performance optimizations. Let me know if that works, or if you want me to
split the division optimizations off (but I hate having stuff multiply into a
crapton of different pull requests).

The BigDigit changes constitute an API break -- which I am open to, but we'd need to bump semver, and if we do this I'd like to hide BigDigit entirely as an implementation detail. (But perhaps some of your optimizations avoiding DoubleBigDigit are still usable right now, if they still help with u32 BigDigit.)

nod Not used to having to consider API breakages.

BTW, what I'd really like is to come up with a good way of separating out the
implementation of the low level arithmetic (into another mod?) with a nice
abstract interface, in a way that's easy to select which one to build with.

And then ideally keep the DoubleBigDigit implementation, along with my BigDigit
= usize implementation and hopefully eventually fancy asm ones. Partly because I
can completely believe the DoubleBigDigit implementation could be faster on
different cpus, but also because it's definitely more readable.

So maybe something to consider if/when we do an API break, if you know a nice
clean way of doing all that.

@cuviper
Copy link
Member

cuviper commented Nov 30, 2015

"The benchmark is too slow? Let me improve it by 500x..." That's just scary. I feel I now need to pick over that with a fine-toothed comb, but who am I to question Knuth's methods. 😄

Regarding the carry push/pop, it's a small point, but I'm afraid I still disagree. I think we'd be better served by having two variants of add2 -- one that returns the extra carry, and one that asserts there is no carry (probably built on the former). If you don't want to do this, I probably will, unless you can make a better case why this is harmful. I expect it should perform no worse -- occasionally even better for avoiding capacity growth -- and it's conceptually cleaner to me.

Regarding API and implementations -- technically we could do whatever we want in implementation even now, as long as the public interfaces remain. BigDigit and DoubleBigDigit could become vestigial types with no bearing on the true innards anymore. However, new(digits: Vec<BigDigit>) does have a strong implication about what it's doing, that this should be a fast constructor by reusing that vector, so we probably don't want to stray too far. When we finally do an API break, we should hide these innards completely so we're free to implement however we see best, rather than trying to juggle multiple implementations as you seem to suggest.

Patch comments inbound. (Some of these may not even be for new changes, just things I didn't notice before.)

let y = rng.gen_bigint(ybits);

b.iter(|| &x / &y);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about some pre-iter assertions that these huge divisions are actually correct? e.g. check the round trip to multiply and add the remainder back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Mon, Nov 30, 2015 at 03:38:38PM -0800, Josh Stone wrote:

  • let x = rng.gen_bigint(xbits);
  • let y = rng.gen_bigint(ybits);
  • b.iter(|| &x * &y);
    +}

+fn divide_bench(b: &mut Bencher, xbits: usize, ybits: usize) {

  • let seed: &[_] = &[1, 2, 3, 4];
  • let mut rng: StdRng = SeedableRng::from_seed(seed);
  • let x = rng.gen_bigint(xbits);
  • let y = rng.gen_bigint(ybits);
  • b.iter(|| &x / &y);
    +}

How about some pre-iter assertions that these huge divisions are actually correct? e.g. check the round trip to multiply and add the remainder back.

Well, then it wouln't be just benchmarking division - I do have a test that does
that though, I didn't push it because it takes way too long to run if you don't
do a release build.

I don't know a good way of putting in these kinds of torture tests where you
want it to run "as long as you have time for", not any fixed amount of time. I
guess I could just make it a #[ignore] test, like you suggested for the other
division test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "pre-iter" I meant before the b.iter() call which loops itself. Just like the gen_bigint calls are not included in the benchmark results -- this is just setup, and I'm only suggesting a quick sanity check before hammering at it. :)

}
}

fn mul3(x: &[BigDigit], y: &[BigDigit]) -> BigUint {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what is the 3 in mul3? But anyway, this function is only used by Mul::mul, which itself isn't doing anything else, which raises the question why this should be separate at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Mon, Nov 30, 2015 at 03:42:11PM -0800, Josh Stone wrote:

  •            p.data.extend(repeat(0).take(len));
    
  •            mac3(&mut p.data[..], &j0.data.data[..], &j1.data.data[..]);
    
  •            p = p.normalize();
    
  •            sub2(&mut acc[b..], &p.data[..]);
    
  •        },
    
  •        Minus   => {
    
  •            mac3(&mut acc[b..], &j0.data.data[..], &j1.data.data[..]);
    
  •        },
    
  •        NoSign  => (),
    
  •    }
    
  • }
    +}

+fn mul3(x: &[BigDigit], y: &[BigDigit]) -> BigUint {

Just curious, what is the 3 in mul3? But anyway, this function is only used by Mul::mul, which itself isn't doing anything else, which raises the question why this should be separate at all.

Back when I still had an add3 and needed to distinguish, I was just using the
same convention you use in assembly for talking about ops that overwrite one of
the arguments for the result, or have another argument for the destination (e.g.
two argument add/mul that overwrites an argument for the dest, or 3 argument
that doesn't - or mac3 vs. mac4 (multiply accumulate)). It probably doesn't make
much sense anymore but it still aesthetically pleases me somehow :)

Though if we do ever (hopefully) get alloca() or some other way of moving heap
allocations to the stack, it could make sense to add a mul2() - avoid doing any
heap allocations by putting all the temporaries (including the result) on the
stack, and then return the result by copying over one of the (heap allocated)
arguments.

It does also get used in other patches I have (e.g. divide and conquer division
is able to make use of multiplying slices).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you're counting number of arguments and the result (when present). OK then.

I see your divide-and-conquer branch, thanks for not stacking that on just yet. :)
We'll get this PR wrapped up first and I'll happily push a new num release for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Tue, Dec 01, 2015 at 04:50:28PM -0800, Josh Stone wrote:

I see your divide-and-conquer branch, thanks for not stacking that on just yet. :)
We'll get this PR wrapped up first and I'll happily push a new num release for it.

BTW - I haven't been able to get divide and conquor division to even be as fast
as Knuth, so I doubt I'll be pushing that out any time soon.

It turns out that division of really large numbers is in general (with any
algorithm I've looked at) pretty dependent on multiplication performance - it
looks like well before you get to the point where it's worth doing something
fancier than knuth's algorithm, our current bottleneck is that karatsuba
multiplication doesn't scale that high.

So probably the next thing to do, algorithm wise, is FFT multiplication (or
maybe toom-3, but from what I was reading if you have a good, well optimized
implementation of FFT multiplication it's debatable whether toom-3 is worth it).

But I didn't study any of the math for that, so I'm not sure I'm feeling that
ambitious :)

@koverstreet
Copy link
Contributor Author

On Mon, Nov 30, 2015 at 03:38:14PM -0800, Josh Stone wrote:

"The benchmark is too slow? Let me improve it by 500x..." That's just scary.
I feel I now need to pick over that with a fine-toothed comb, but who am I to
question Knuth's methods. 😄

:D

The code isn't fresh in my head anymore but I had a suspicion as I was working
on it that the old code had the algorithm slightly wrong - at least, the logic
didn't quite match Knuth's algorithm but I didn't bother to try and figure out
if it affected the behaviour. I dug out my copy of TAOCP and just stuck to
Knuth's algorithm when I was working on it.

Regarding the carry push/pop, it's a small point, but I'm afraid I still disagree. I think we'd be better served by having two variants of add2 -- one that returns the extra carry, and one that asserts there is no carry (probably built on the former). If you don't want to do this, I probably will, unless you can make a better case why this is harmful. I expect it should perform no worse -- occasionally even better for avoiding capacity growth -- and it's conceptually cleaner to me.

Alright, I can do the wrapper thing.

Regarding API and implementations -- technically we could do whatever we want in implementation even now, as long as the public interfaces remain. BigDigit and DoubleBigDigit could become vestigial types with no bearing on the true innards anymore. However, new(digits: Vec<BigDigit>) does have a strong implication about what it's doing, that this should be a fast constructor by reusing that vector, so we probably don't want to stray too far. When we finally do an API break, we should hide these innards completely so we're free to implement however we see best, rather than trying to juggle multiple implementations as you seem to suggest.

Good point. I went and split out the implementations of the low level arithmetic
into different modules like I was talking about:

koverstreet@ff06399

btw - the operations I have the low level arithmatic module exposing right now
(like div_wide, which I saw you mentioned in another email) are ones that match
what you can do with (x86 at least, I don't know arm) assembly - that is,
there's no good reasons these operations available as intrinsics and my hope is
that's what happens eventually.

Although for add with carry/subtract with borrow, my language nerd friend (who's
working on Swift, and thus gets to interact with the llvm guys) tells me that
llvm can't really model instructions with two different outputs - the carry bit

  • so that may never happen.

Re: the vector constructor, if there's actual outside demand for that kind of
functionality (and I suppose there probably is) then it probably makes sense to
support it for multiple sizes of digit (say someone's bridging between gmp or
some other library and num; they're going to be stuck with whatever size digit
that other library uses).

We could do fast in place converion of vectors of arbitrary size digits to
biguints with a tiny bit of unsafe code - just cast (is it transmute in rust?)
the vector to the new type and then just fix the endianness thing in place.

So if you want to keep that functionality and hide BigDigit I could do that.

let (s_len, o_len) = (a.len(), b.len());
if s_len < o_len { return Less; }
if s_len > o_len { return Greater; }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make these a/b_len too please.

@cuviper
Copy link
Member

cuviper commented Dec 2, 2015

The new CI failure is in num-macros vs nightly rust -- I'll take a look at that.

I wouldn't even change the size of the public BigDigit without calling that a break. But if we used TheRealBigDigit = usize internally (or even ditched giving it a name), that would be fine.

Adding vector conversions for any size u8..u64 sounds intriguing. The example in Vec::from_raw_parts (1.0, nightly) is a good template. You'd just have to pad zeros for alignment, tear it apart, restore with new lengths, and fix the endianness as you said.

Ideas for new PRs, not here. ;)

@cuviper
Copy link
Member

cuviper commented Dec 10, 2015

A few review items haven't been addressed yet:

  • Sanity checks in the benchmarks during setup, before iteration.
  • Add a debug assert for div_wide overflow conditions
  • Replace public Add/Sub on slices with private functions.
  • Update cmp_slice names to a/b_len.

They're all simple things, so I don't want to hold back this PR too long. If you're already on holiday, maybe I'll go ahead and tweak these myself so we can merge. :)

@koverstreet
Copy link
Contributor Author

On Thu, Dec 10, 2015 at 10:11:57AM -0800, Josh Stone wrote:

A few review items haven't been addressed yet:

  • Sanity checks in the benchmarks during setup, before iteration.
  • Add a debug assert for div_wide overflow conditions
  • Replace public Add/Sub on slices with private functions.
  • Update cmp_slice names to a/b_len.

They're all simple things, so I don't want to hold back this PR too long. If you're already on holiday, maybe I'll go ahead and tweak these myself so we can merge. :)

Whoops, sorry, I got distracted with other code. I'll fix those up today :)

Add benchmarks that test multiplies/divides of different sizes
@koverstreet
Copy link
Contributor Author

On Thu, Dec 10, 2015 at 10:11:57AM -0800, Josh Stone wrote:

A few review items haven't been addressed yet:

  • Sanity checks in the benchmarks during setup, before iteration.

Did you want something besides just testing multiply/divide against each other?
Because I added a test that does that.

This is needed for the multiply optimizations in the next patch.
@cuviper
Copy link
Member

cuviper commented Dec 10, 2015

Did you want something besides just testing multiply/divide against each
other?

That's fine. I just want a little confidence in what we're benchmarking. :)

Because I added a test that does that.

Right, but that torture test is slow and ignored, and it's not even testing
the same inputs. That's fine though, since we can easily let the benchmark
sanity-check itself.

@koverstreet
Copy link
Contributor Author

On Thu, Dec 10, 2015 at 01:45:41PM -0800, Josh Stone wrote:

Did you want something besides just testing multiply/divide against each
other?

That's fine. I just want a little confidence in what we're benchmarking. :)

Because I added a test that does that.

Right, but that torture test is slow and ignored, and it's not even testing
the same inputs. That's fine though, since we can easily let the benchmark
sanity-check itself.

The torture test is doing exactly what I thought you meant for what to test in
the benchmark... or did you mean something else?

@cuviper
Copy link
Member

cuviper commented Dec 10, 2015 via email

@koverstreet
Copy link
Contributor Author

On Thu, Dec 10, 2015 at 02:07:47PM -0800, Josh Stone wrote:

But it's ignored, for good reason, and not using the same input values as
the benchmark.

Let's not belabor this small point. If it's still unclear, leave it to me
to add later.

Ok. Change it however you feel like :)

let mut ret = BigUint::from_slice(b);
sub2(&mut ret.data[..], a);
BigInt::from_biguint(Minus, ret.normalize())
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely one of these branches should be Plus? If so, we have insufficient test coverage to notice this, so please add a new test for it too. That might be best as a direct sub_sign unit test, so it's not relying on particular multiplication details to reach this point.

But you had this right before, and multiplication tests still passed, so I wonder how much that calling code is being exercised. In fact, I just put a panic! in sub_sign, and it wasn't reached at all until I enabled the torture test. With that test we can also see the plain failure here.

If we reduce the iterations of that torture test, we can leave it enabled even in debug testing. Just 1000 iterations still triggers this particular issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Thu, Dec 10, 2015 at 04:06:49PM -0800, Josh Stone wrote:

  •    return ll + mm.shl_unit(half_len) + hh.shl_unit(half_len \* 2);
    
  • match cmp_slice(a, b) {
  •    Greater => {
    
  •        let mut ret = BigUint::from_slice(a);
    
  •        sub2(&mut ret.data[..], b);
    
  •        BigInt::from_biguint(Minus, ret.normalize())
    
  •    },
    
  •    Less    => {
    
  •        let mut ret = BigUint::from_slice(b);
    
  •        sub2(&mut ret.data[..], a);
    
  •        BigInt::from_biguint(Minus, ret.normalize())
    
  •    },
    

Surely one of these branches should be Plus? If so, we have insufficient test coverage to notice this, so please add a new test for it too. That might be best as a direct sub_sign unit test, so it's not relying on particular multiplication details to reach this point.

But you had this right before, and multiplication tests still passed, so I wonder how much that calling code is being exercised. In fact, I just put a panic! in sub_sign, and it wasn't reached at all until I enabled the torture test. With that test we can also see the plain failure here.

swear A bug exactly there is why I overloaded add/subtract for slices...
pretty much reintroduced the same bug when I took it back out. Augh.

Yeah, part of the problem is that karatsuba multiplication doesn't become
worthwhile until the numbers are decently sized.

Does rust support gprof yet? code coverage would be really good to have for code
like this.

If we reduce the iterations of that torture test, we can leave it enabled even in debug testing. Just 1000 iterations still triggers this particular issue.

Ok yeah, good idea. Maybe I'll keep a second version that's #ignored that runs
lots more iterations.

The main idea here is to do as much as possible with slices, instead of
allocating new BigUints (= heap allocations).

Current performance:

multiply_0:	10,507 ns/iter (+/- 987)
multiply_1:	2,788,734 ns/iter (+/- 100,079)
multiply_2:	69,923,515 ns/iter (+/- 4,550,902)

After this patch, we get:

multiply_0:	364 ns/iter (+/- 62)
multiply_1:	34,085 ns/iter (+/- 1,179)
multiply_2:	3,753,883 ns/iter (+/- 46,876)
Before:
test divide_0      ... bench:       4,058 ns/iter (+/- 255)
test divide_1      ... bench:     304,507 ns/iter (+/- 28,063)
test divide_2      ... bench: 668,293,969 ns/iter (+/- 25,383,239)

After:
test divide_0      ... bench:         874 ns/iter (+/- 71)
test divide_1      ... bench:      16,641 ns/iter (+/- 1,205)
test divide_2      ... bench:   1,336,888 ns/iter (+/- 77,450)
@cuviper
Copy link
Member

cuviper commented Dec 11, 2015

@homu r+ fa372e2

@homu
Copy link
Contributor

homu commented Dec 11, 2015

⚡ Test exempted - status

@homu homu merged commit fa372e2 into rust-num:master Dec 11, 2015
homu added a commit that referenced this pull request Dec 11, 2015
[RFC] Some performance improvements

I added add and subtract methods that take an operand by value, instead of by reference, so they can work in place instead of always allocating new BigInts for the result.

I'd like to hear if anyone knows a better way of doing this, though - with the various wrappers/forwardings doing this for all the operations is going to be a mess and add a ton of boilerplate. I haven't even tried to do a complete job of it yet.

I'm starting to think the sanest thing to do would be to only have in place implements of most of the operations - and have the wrappers clone() if necessary. That way, the only extra work we're doing is a memcpy(), which is a hell of a lot faster than a heap allocation.
@cuviper
Copy link
Member

cuviper commented Dec 11, 2015

Thanks for your hard work!

Does rust support gprof yet?

Not directly, see rust-lang/rfcs#646. Apparently kcov works though, like in this tutorial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants