Add ArrayVec and AccumulateVec to reduce heap allocations during interning of slices #37270

Merged
merged 3 commits into from
Oct 26, 2016

Mark-Simulacrum
Member

@Mark-Simulacrum Mark-Simulacrum commented Oct 19, 2016

Updates mk_tup, mk_type_list, and mk_substs to allow interning directly from iterators. The previous PR, #37220, changed some of the calls to pass a borrowed slice from Vec instead of directly passing the iterator, and these changes further optimize that to avoid the allocation entirely.

This change yields 50% fewer malloc calls in some cases (https://pastebin.mozilla.org/8921686). It also yields decent, though not amazing, performance improvements:

futures-rs-test  4.091s vs  4.021s --> 1.017x faster (variance: 1.004x, 1.004x)
helloworld       0.219s vs  0.220s --> 0.993x faster (variance: 1.010x, 1.018x)
html5ever-2016-  3.805s vs  3.736s --> 1.018x faster (variance: 1.003x, 1.009x)
hyper.0.5.0      4.609s vs  4.571s --> 1.008x faster (variance: 1.015x, 1.017x)
inflate-0.1.0    3.864s vs  3.883s --> 0.995x faster (variance: 1.232x, 1.005x)
issue-32062-equ  0.309s vs  0.299s --> 1.033x faster (variance: 1.014x, 1.003x)
issue-32278-big  1.614s vs  1.594s --> 1.013x faster (variance: 1.007x, 1.004x)
jld-day15-parse  1.390s vs  1.326s --> 1.049x faster (variance: 1.006x, 1.009x)
piston-image-0. 10.930s vs 10.675s --> 1.024x faster (variance: 1.006x, 1.010x)
reddit-stress    2.302s vs  2.261s --> 1.019x faster (variance: 1.010x, 1.026x)
regex.0.1.30     2.250s vs  2.240s --> 1.005x faster (variance: 1.087x, 1.011x)
rust-encoding-0  1.895s vs  1.887s --> 1.005x faster (variance: 1.005x, 1.018x)
syntex-0.42.2   29.045s vs 28.663s --> 1.013x faster (variance: 1.004x, 1.006x)
syntex-0.42.2-i 13.925s vs 13.868s --> 1.004x faster (variance: 1.022x, 1.007x)

We implement a small-size-optimized vector, intended primarily for collecting iterators that are presumed to be short. Once built, this vector cannot be "upsized/reallocated" into a heap-allocated vector, since that would require (slow) branching logic; heap allocation is possible only during the initial collection from an iterator.

We make the new AccumulateVec and ArrayVec generic over implementors of the Array trait, of which there is currently one, [T; 8]. In the future, this is likely to expand to other values of N.
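
As a rough illustration of the "decide once, at collection time" design described above, here is a minimal sketch. It is not the code from this PR: the names `Accumulate` and `CAP` are hypothetical, and for brevity the inline storage is a plain `[T; CAP]` that requires `T: Copy + Default`, instead of the uninitialized storage a real `ArrayVec` would use.

```rust
use std::ops::Deref;

/// Inline capacity. The description above names `[T; 8]` as the only supported array.
const CAP: usize = 8;

/// Hypothetical stand-in for `AccumulateVec`: the variant is chosen once, while collecting.
pub enum Accumulate<T> {
    /// At most `CAP` elements, stored inline with no heap allocation.
    Inline { items: [T; CAP], len: usize },
    /// Fallback when the iterator cannot promise to fit inline.
    Heap(Vec<T>),
}

impl<T: Copy + Default> FromIterator<T> for Accumulate<T> {
    fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
        let iter = iter.into_iter();
        match iter.size_hint().1 {
            // Only go inline when the iterator reports an upper bound that fits.
            Some(upper) if upper <= CAP => {
                let mut items = [T::default(); CAP];
                let mut len = 0;
                for el in iter {
                    // An iterator that under-reports its length panics here on the
                    // bounds check instead of writing out of bounds.
                    items[len] = el;
                    len += 1;
                }
                Accumulate::Inline { items, len }
            }
            // Unknown or too-large upper bound: collect straight into a `Vec`.
            _ => Accumulate::Heap(iter.collect()),
        }
    }
}

impl<T> Deref for Accumulate<T> {
    type Target = [T];
    fn deref(&self) -> &[T] {
        match self {
            Accumulate::Inline { items, len } => &items[..*len],
            Accumulate::Heap(v) => v.as_slice(),
        }
    }
}
```

Because there is no `push`, the inline variant never has to check whether it should spill to the heap; a caller only ever reads the result as a slice.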

Huge thanks to @nnethercote for collecting the performance and other statistics mentioned above.

@eddyb
Member

eddyb commented Oct 19, 2016

cc @rust-lang/lang We'd like to have impls of SliceInternable for &[T]/&Vec<T> and for iterators.
This is a powerful pattern: unlike IntoIterator, it forces callers to choose between providing an existing in-memory container and providing an iterator when no such container is available, without duplicating APIs.
However, this currently results in a conflicting impls error, because Iterator is from another crate and is not #[fundamental]. Do you think we can make it fundamental? There is precedent: according to a comment, the Fn traits are fundamental so that regex can have impls for them that do not conflict with its &str impls.
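
To make the conflict concrete, here is a small sketch. The trait shape, the method name `intern_with`, and the `FromIter` wrapper are all hypothetical; the wrapper is only one way to sidestep the overlap today and is not what this PR proposes.

```rust
/// Hypothetical shape of the trait: anything that can be viewed as a slice to intern.
pub trait SliceInternable<T> {
    fn intern_with<R, F: FnOnce(&[T]) -> R>(self, f: F) -> R;
}

/// Callers that already have a container pass the slice directly; nothing is copied.
impl<'a, T> SliceInternable<T> for &'a [T] {
    fn intern_with<R, F: FnOnce(&[T]) -> R>(self, f: F) -> R {
        f(self)
    }
}

/// Newtype that marks "this is an iterator" explicitly.
pub struct FromIter<I>(pub I);

impl<T, I: Iterator<Item = T>> SliceInternable<T> for FromIter<I> {
    fn intern_with<R, F: FnOnce(&[T]) -> R>(self, f: F) -> R {
        // A `Vec` keeps the sketch short; the point of the PR is to avoid this allocation.
        let buf: Vec<T> = self.0.collect();
        f(&buf)
    }
}

// The impl one would actually like to write is
//     impl<T, I: Iterator<Item = T>> SliceInternable<T> for I { ... }
// but next to the `&[T]` impl the compiler rejects it as conflicting, because an
// upstream crate could in principle implement `Iterator` for `&[T]` later.
```

With the wrapper, a call site reads `FromIter(iter).intern_with(...)`, which is exactly the extra noise a fundamental `Iterator` would remove.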

@bors
Contributor

bors commented Oct 19, 2016

☔ The latest upstream changes (presumably #37269) made this pull request unmergeable. Please resolve the merge conflicts.

type Target = [T];
fn deref(&self) -> &Self::Target {
let values = &self.values[..self.count];
unsafe { &*(values as *const [_] as *const [T]) }
Member

Isn't this a good place to use from_raw_parts and skip the bounds check for the slice? Or does it not matter?

Member

@eddyb eddyb Oct 19, 2016

Oh I see, didn't think of it like that. Yeah it would be slightly less cruft to optimize away.
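
For comparison, the two ways of producing the slice can be sketched side by side. This is an illustration under assumptions: `Buf8` is a hypothetical type, it uses `std::mem::MaybeUninit` (which did not exist when this PR was written) in place of the PR's own `ManuallyDrop` union, and it is restricted to `T: Copy` so that no `Drop` impl is needed.

```rust
use std::mem::MaybeUninit;
use std::slice;

/// Hypothetical fixed-capacity buffer; only the first `count` slots are initialized.
pub struct Buf8<T: Copy> {
    count: usize,
    values: [MaybeUninit<T>; 8],
}

impl<T: Copy> Buf8<T> {
    pub fn from_slice(src: &[T]) -> Self {
        assert!(src.len() <= 8, "at most 8 elements fit inline");
        let mut values = [MaybeUninit::uninit(); 8];
        for (slot, el) in values.iter_mut().zip(src) {
            slot.write(*el);
        }
        Buf8 { count: src.len(), values }
    }

    /// Shape of the code under review: slice first (bounds check), then cast.
    pub fn as_slice_checked(&self) -> &[T] {
        let values = &self.values[..self.count];
        // SAFETY: every element of `values` was initialized in `from_slice`.
        unsafe { &*(values as *const [MaybeUninit<T>] as *const [T]) }
    }

    /// Suggested shape: build the slice from pointer and length, with no bounds check.
    pub fn as_slice_raw(&self) -> &[T] {
        // SAFETY: `count <= 8` and the first `count` slots are initialized.
        unsafe { slice::from_raw_parts(self.values.as_ptr() as *const T, self.count) }
    }
}
```

Both functions return the same slice; the second relies on the `count <= 8` invariant instead of re-checking it on every call.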

}

#[allow(unions_with_drop_fields, dead_code)]
union ManuallyDrop<T> {
Member

Since rustc devs are the UB police, is this guaranteed to be convertible to T, even without #[repr(C)]? (I'm reading this PR with interest for the future of the external arrayvec and smallvec crates of course.)

Member

It's technically in newtype form, so we can guarantee it at no cost. However, adding #[repr(C)] would be better here, for now.

Member Author

Adding #[repr(C)] would require that T is repr(C), I thought? Or is that just a lint, and we can allow it away?

Member

Should work fine, try it out.


impl<T> Drop for SmallVec8<T> {
fn drop(&mut self) {
for i in 0..self.count {
Member

If you have DerefMut, this is just drop_in_place(&mut self[..])

Member Author

Adding DerefMut feels unsafe since we then have potential for safe mutation of the stack allocated buffer, which we probably don't want to allow?

Member

I don't see what's unsound about modifying the buffer, since it's only making a slice of the valid/initialized element range.

Member

It's perfectly safe, it can't do anything harmful, not even semantically.
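
The suggestion in this thread can be sketched as follows. Again this is a hypothetical `Buf8` built on `std::mem::MaybeUninit` (an assumption; the PR uses its own union), not the PR's `SmallVec8`. `DerefMut` only exposes the initialized prefix, so safe code can overwrite elements but cannot change the length or reach uninitialized slots.

```rust
use std::mem::MaybeUninit;
use std::ops::{Deref, DerefMut};
use std::{ptr, slice};

/// Hypothetical fixed-capacity buffer; only the first `count` slots are initialized.
pub struct Buf8<T> {
    count: usize,
    values: [MaybeUninit<T>; 8],
}

impl<T> Buf8<T> {
    pub fn new() -> Self {
        Buf8 { count: 0, values: std::array::from_fn(|_| MaybeUninit::uninit()) }
    }

    pub fn push(&mut self, el: T) {
        assert!(self.count < 8, "Buf8 is full");
        self.values[self.count].write(el);
        self.count += 1;
    }
}

impl<T> Deref for Buf8<T> {
    type Target = [T];
    fn deref(&self) -> &[T] {
        // SAFETY: the first `count` slots are initialized.
        unsafe { slice::from_raw_parts(self.values.as_ptr() as *const T, self.count) }
    }
}

impl<T> DerefMut for Buf8<T> {
    fn deref_mut(&mut self) -> &mut [T] {
        // SAFETY: same invariant as `deref`; the slice never covers uninitialized slots.
        unsafe { slice::from_raw_parts_mut(self.values.as_mut_ptr() as *mut T, self.count) }
    }
}

impl<T> Drop for Buf8<T> {
    fn drop(&mut self) {
        // Drops exactly the initialized elements, replacing a manual `0..count` loop.
        unsafe { ptr::drop_in_place(&mut self[..]) }
    }
}
```

Since elements are pushed into the struct itself, an unwinding panic while filling it still runs this `Drop` impl and releases whatever was already stored.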

@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from 80c2bcf to bad8be6 Compare October 19, 2016 12:54
ManuallyDrop { empty: () },
ManuallyDrop { empty: () },
ManuallyDrop { empty: () },
];
Member

These two variables need to be in a SmallVec8 variable for unwinding to drop the right values.

ManuallyDrop { empty: () },
];
for el in iter {
values[count] = ManuallyDrop { value: el };
Member

This should use field assignment, i.e .value = el.

Member Author

@Mark-Simulacrum Mark-Simulacrum left a comment

Leaving comments to myself, mainly, but a few questions for @eddyb as well.


pub fn mk_substs_trait(self,
s: Ty<'tcx>,
t: &[Ty<'tcx>])
Member Author

This should probably take I: SliceInternable, but then the current code doesn't support passing it to iter::chain, since SliceInternable is not an Iterator. @eddyb Thoughts on what we should/could do here?

Member

Most users are genuinely slice though. Not high-priority.

@@ -298,20 +275,20 @@ impl<'a, 'gcx, 'tcx> Substs<'tcx> {
target_substs: &Substs<'tcx>)
-> &'tcx Substs<'tcx> {
let defs = tcx.lookup_generics(source_ancestor);
-Substs::new(tcx, target_substs.iter().chain(&self[defs.own_count()..]).cloned())
+tcx.mk_substs(target_substs.iter().chain(&self[defs.own_count()..]).cloned())
Member Author

We want to avoid Iterator::cloned(), but the types don't match up here to allow that, since we're iterating over &Substs, not Substs. I'm not sure whether there is a good solution here, or what it would be; I believe this is semi-common though.

Member

You can't avoid an iterator here. .cloned() is only suspicious when used on a slice that could be passed directly; there's no cloning cost, since these are just pointer copies.

Member Author

Can you clarify whether it's possible to avoid .cloned() here? The target_substs and (I believe) the slice into self are both &Kind instead of Kind. Or is it Ty?

Member

It's not but that's irrelevant here. cloned is not bad, it's just a sign of a slice iterator, which in some cases (not this) can be replaced with the slice itself.
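
A small sketch of the point being made: chaining two slice iterators yields references, so `.cloned()` is needed to get owned items, and for `Copy` handles that clone is a plain copy. The function `rebase` and its parameters are hypothetical and only mirror the shape of the call in the hunk above.

```rust
/// Hypothetical helper: all of `target`, followed by the tail of `own` past `skip`.
fn rebase<T: Copy>(target: &[T], own: &[T], skip: usize) -> Vec<T> {
    // Both halves iterate by reference (`&T`); `cloned()` turns each `&T` into a `T`.
    // For `Copy` types, such as pointer-sized handles, this is a bitwise copy.
    target.iter().chain(&own[skip..]).cloned().collect()
}
```

When the input is a single existing slice there is nothing to chain, and the slice can be handed over as is; that is the case where `.cloned()` would be a sign of unnecessary work.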

@@ -800,7 +800,7 @@ fn build_free<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>,
-> TerminatorKind<'tcx> {
let free_func = tcx.lang_items.require(lang_items::BoxFreeFnLangItem)
.unwrap_or_else(|e| tcx.sess.fatal(&e));
-let substs = Substs::new(tcx, iter::once(Kind::from(data.item_ty)));
+let substs = tcx.mk_substs(iter::once(Kind::from(data.item_ty)));
Member Author

I feel like this might be easier to optimize if we skip the iter::once, since we're not joining it with anything anyway. Leaving this as a note to self, mostly, since I'm not sure we can omit the iterator with the current implementation.

Member

intern_substs with a slice.

@@ -146,7 +146,7 @@ impl<'a, 'gcx, 'tcx> Cx<'a, 'gcx, 'tcx> {
params: &[Ty<'tcx>])
-> (Ty<'tcx>, Literal<'tcx>) {
let method_name = token::intern(method_name);
-let substs = Substs::new_trait(self.tcx, self_ty, params);
+let substs = self.tcx.mk_substs_trait(self_ty, params);
Member Author

Should we be taking a params: SliceInternable<...> instead of &[...] here?

Member

I wish. Most of these take a slice though.

@@ -1663,8 +1663,8 @@ impl<'o, 'gcx: 'tcx, 'tcx> AstConv<'gcx, 'tcx>+'o {
hir::TyTup(ref fields) => {
let flds = fields.iter()
.map(|t| self.ast_ty_to_ty(rscope, &t))
-.collect();
-tcx.mk_tup(flds)
+.collect::<Vec<_>>();
Member Author

Should collect into a MaybeSmallVec8.

Member

Can't this use an itera- ah shucks, the slice impl problem.

@@ -88,7 +88,7 @@ impl<'a, 'gcx, 'tcx> FnCtxt<'a, 'gcx, 'tcx> {

// Tuple up the arguments and insert the resulting function type into
// the `closures` table.
-fn_ty.sig.0.inputs = vec![self.tcx.mk_tup(fn_ty.sig.0.inputs)];
+fn_ty.sig.0.inputs = vec![self.tcx.mk_tup(&fn_ty.sig.0.inputs)];
Member Author

Should we be allocating a vec![...] here? I feel like we might be able to change the struct's type to a SmallVec8?

Member

The correct thing to do is to intern a slice, but that's a bit more complicated of a refactoring.

@@ -3733,11 +3733,11 @@ impl<'a, 'gcx, 'tcx> FnCtxt<'a, 'gcx, 'tcx> {
};
err_field = err_field || t.references_error();
t
-}).collect();
+}).collect::<Vec<_>>();
Member Author

@Mark-Simulacrum Mark-Simulacrum Oct 19, 2016

Don't collect into a Vec. (this requires slice impl)

unsafe {
v.values[v.count].value = el;
}
v.count += 1;
Member

If you make this loop an Extend impl, you can then have fill_item take T: Extend<Kind> + Deref<Target=[Kind]>.
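
The bound suggested here lets one function fill either a `Vec` or a stack-backed collection: it reads what has been accumulated so far through `Deref` and appends through `Extend`. A minimal sketch, with a hypothetical `fill_squares` standing in for `fill_item` and `u32` standing in for `Kind`:

```rust
use std::ops::Deref;

/// Hypothetical stand-in for `fill_item`: appends `n` values, each derived from
/// the number of elements already present.
fn fill_squares<T>(out: &mut T, n: usize)
where
    T: Extend<u32> + Deref<Target = [u32]>,
{
    for _ in 0..n {
        // Read access goes through `Deref<Target = [u32]>`.
        let next = out.len() as u32;
        // Write access goes through `Extend<u32>`.
        out.extend(std::iter::once(next * next));
    }
}
```

`Vec<u32>` satisfies both bounds, so existing callers keep working, and any other collection that implements `Extend` plus `Deref` to a slice can be passed as well.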

@eddyb
Member

eddyb commented Oct 19, 2016

EDIT: Moved this to a PR comment to be able to link it.

Oh, turns out #[repr(C)] is wrong, for subtle reasons! The size of a C struct/union is an aligned size of its contents, but its alignment can be higher than the alignments of any fields.

That is, there is a "minimum struct/union" alignment (not sure where in the C standard, but LLVM supports it and so do we! I just checked and no built-in target specs have minimum ABI alignment for struct/union that's larger than one byte, just "preferred alignment").
And if this alignment is larger than the size of the contents, the size of the struct/union will be the alignment.

The correct thing to do is #[repr(transparent)] from rust-lang/rfcs#1758.

Thanks to @ubsan and @retep998 on IRC for figuring this out!
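
For readers on a current toolchain, the layout guarantee being asked for can be sketched with a struct. This is an illustration only: `#[repr(transparent)]` is available on structs in today's stable Rust, whereas the type in this PR is a union, and the names `Wrapper` and `same_layout` are hypothetical.

```rust
use std::mem::{align_of, size_of};

/// Single-field wrapper whose layout is guaranteed to match that of `T`.
#[repr(transparent)]
pub struct Wrapper<T>(pub T);

/// Checks that wrapping does not change size or alignment for a given `T`.
pub fn same_layout<T>() -> bool {
    size_of::<Wrapper<T>>() == size_of::<T>() && align_of::<Wrapper<T>>() == align_of::<T>()
}
```

With `#[repr(C)]` the same check could fail on a target whose ABI imposes a minimum struct/union alignment, which is the subtlety described above.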

}

-fn fill_item<FR, FT>(substs: &mut Vec<Kind<'tcx>>,
+fn fill_item<FR, FT, T: Extend<Kind<'tcx>> + Deref<Target=[Kind<'tcx>]>>(substs: &mut T,
Member

Put the bounds in the where clause.

@@ -679,7 +679,7 @@ fn subst_ty_renumber_bound() {
env.t_fn(&[t_param], env.t_nil())
};

-let substs = Substs::new(env.infcx.tcx, iter::once(Kind::from(t_rptr_bound1)));
+let substs = env.infcx.tcx.mk_substs(iter::once(Kind::from(t_rptr_bound1)));
Member

This can pass a slice to intern_substs.

@@ -714,7 +714,7 @@ fn subst_ty_renumber_some_bounds() {
env.t_pair(t_param, env.t_fn(&[t_param], env.t_nil()))
};

-let substs = Substs::new(env.infcx.tcx, iter::once(Kind::from(t_rptr_bound1)));
+let substs = env.infcx.tcx.mk_substs(iter::once(Kind::from(t_rptr_bound1)));
Member

Same as above.

@@ -776,7 +776,7 @@ fn subst_region_renumber_region() {
env.t_fn(&[env.t_rptr(re_early)], env.t_nil())
};

-let substs = Substs::new(env.infcx.tcx, iter::once(Kind::from(re_bound1)));
+let substs = env.infcx.tcx.mk_substs(iter::once(Kind::from(re_bound1)));
Member

Same as above.

@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from f21961f to a36e00c Compare October 19, 2016 18:39
@bors
Contributor

bors commented Oct 19, 2016

☔ The latest upstream changes (presumably #37220) made this pull request unmergeable. Please resolve the merge conflicts.

@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from a36e00c to 1a3cb0f Compare October 19, 2016 20:43
@@ -191,12 +191,13 @@ impl<'a, 'gcx, 'tcx> Substs<'tcx> {
}
}

-fn fill_item<FR, FT, T: Extend<Kind<'tcx>> + Deref<Target=[Kind<'tcx>]>>(substs: &mut T,
+fn fill_item<T, FR, FT>(substs: &mut T,
tcx: TyCtxt<'a, 'gcx, 'tcx>,
defs: &ty::Generics<'tcx>,
mk_region: &mut FR,
mk_type: &mut FT)
Member

All of these arguments need to be reindented.

@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from d39d8f4 to 25d3b4a Compare October 19, 2016 21:05
@Mark-Simulacrum
Member Author

Mark-Simulacrum commented Oct 19, 2016

Performance, compared with current master, stage2 release builds (configured with no arguments):

futures-rs-test  4.726s vs  4.729s --> 0.999x faster (variance: 1.027x, 1.018x)
helloworld       0.317s vs  0.305s --> 1.039x faster (variance: 1.082x, 1.167x)
html5ever-2016-  6.668s vs  6.647s --> 1.003x faster (variance: 1.013x, 1.014x)
hyper.0.5.0      5.478s vs  5.368s --> 1.020x faster (variance: 1.045x, 1.023x)
inflate-0.1.0    4.397s vs  4.360s --> 1.008x faster (variance: 1.007x, 1.014x)
issue-32062-equ  0.408s vs  0.420s --> 0.969x faster (variance: 1.229x, 1.110x)
issue-32278-big  2.117s vs  2.028s --> 1.044x faster (variance: 1.021x, 1.069x)
jld-day15-parse  1.660s vs  1.684s --> 0.986x faster (variance: 1.050x, 1.026x)
piston-image-0. 12.664s vs 12.481s --> 1.015x faster (variance: 1.003x, 1.009x)
regex.0.1.30     2.780s vs  2.768s --> 1.005x faster (variance: 1.139x, 1.033x)
rust-encoding-0  2.342s vs  2.373s --> 0.987x faster (variance: 1.087x, 1.093x)
syntex-0.42.2   31.017s vs 30.904s --> 1.004x faster (variance: 1.035x, 1.001x)
syntex-0.42.2-i 16.111s vs 16.202s --> 0.994x faster (variance: 1.008x, 1.005x)

@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from 25d3b4a to 7517e34 Compare October 19, 2016 21:41
@eddyb
Member

eddyb commented Oct 19, 2016

@Mark-Simulacrum I'm a bit surprised, I would've expected more wins given how high allocation was on the profile. Maybe the SmallVec8 case isn't hit that often? @nnethercote might want to take a look.

@nikomatsakis
Contributor

OK, so, @rust-lang/lang: @eddyb and I had a short chat about fundamental traits on IRC. I am not comfortable with declaring Iterator to be fundamental because I hold out hope we can remove the notion of fundamental traits and replace it with negative impls (though I'm not 100% sure this is a good idea). But I'd hate to take any hasty steps that make it harder to change fundamental in the future.

However, I would be comfortable with doing a targeted extension to fundamental. Basically a mildly hacky negative impl. For example, we discussed allowing #[fundamental(slice)] which would say "this trait is fundamental for slice types". This could be applied to Iterator.

We would thus be committed to adding negative impls of Iterator to [T] in the future if we changed from fundamental to explicit negative impls. But we basically are anyway, given how IntoIterator is set up and so forth.

How do people feel about that? (I'm happy to guide @Mark-Simulacrum through the steps to do that change.)

@withoutboats
Contributor

@nikomatsakis first paragraph is how I felt when I saw @eddyb's question also. One of the biggest problems with #[fundamental] is that it is a potentially breaking change to implement a fundamental trait for an existing type, and this is totally undocumented.

#[fundamental(slice)] avoids that problem, at least. It is certainly a hack though, and I'm a little uncomfortable about the possibility that negative impls don't work out and now we have this hack in the language forever. People are also pretty likely to rely on it without realizing it - an impl for [T] and an impl for I: Iterator<Item = T> are probably one of the most common pairs of conflicting impls that people are running into right now.

@eddyb
Member

eddyb commented Oct 19, 2016

Do note that we don't care about [T] (only sized types are involved), but &[T] which is "slice reference".
That's why I suggested #[fundamental(primitive)] on IRC.

@nikomatsakis
Contributor

nikomatsakis commented Oct 19, 2016

@withoutboats

#[fundamental(slice)] avoids that problem, at least. It is certainly a hack though, and I'm a little uncomfortable about the possibility that negative impls don't work out and now we have this hack in the language forever.

I guess part of this is that, if I were happy with #[fundamental], I'd probably be ok w/ Iterator being fundamental. IOW, if negative impls didn't work out, we could remove #[fundamental(slice)] and just mark Iterator as #[fundamental] -- after all we need some answer to the existing closure traits, which are fundamental. Put another way the whole goal with fundamental was to do something we wouldn't probably want to stabilize, but which let us do the things we knew we would need at minimum. So whatever solution we adopt should be as powerful.

But yeah this does mean we are committing to either Iterator being fundamental or being able to declare a trait not implemented for slice types (in some way).


@eddyb tbh I meant &[T] when I wrote "slice", though I guess that's sloppy terminology on my part.

@arielb1
Contributor

arielb1 commented Oct 22, 2016

@eddyb

IIRC, LLVM can optimize even the malloc case better if you have a fixed capacity (i.e. use RawVec instead of Vec) - re the lack of inductive resizing elimination. Just call the resulting thing AccumulateVec and be done with it.

pub fn intern_substs(self, ts: &[Kind<'tcx>]) -> &'tcx Slice<Kind<'tcx>> {
if ts.len() == 0 {
return unsafe {
mem::transmute(slice::from_raw_parts(0x1 as *mut Kind<'tcx>, 0))
Member

This can be just Slice::empty() like the other one.

@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch 3 times, most recently from b49bfe9 to d56f099 Compare October 25, 2016 00:33
@Mark-Simulacrum Mark-Simulacrum changed the title from "Utilize TypedArena::alloc_slice and new MaybeSmallVec8 in mk_* interners" to "Add ArrayVec and AccumulateVec to reduce heap allocations during interning of slices" Oct 25, 2016
@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from d56f099 to c751299 Compare October 25, 2016 01:45
@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from c751299 to a2efca9 Compare October 26, 2016 00:09
// option. This file may not be copied, modified, or distributed
// except according to those terms.

//! A vector type intended to be used for collecting from iterators onto the stack.
Contributor

I'd like a slightly more expansive comment that mentions what the threshold is (8) and what happens when it's exceeded.

// option. This file may not be copied, modified, or distributed
// except according to those terms.

//! A small vector type intended for storing <= 8 values on the stack.
Contributor

Likewise, please explain what happens if 8 is exceeded.

@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from a2efca9 to 2059fdc Compare October 26, 2016 01:57
@Mark-Simulacrum Mark-Simulacrum force-pushed the smallvec-optimized-arenas branch from 2059fdc to 989eba7 Compare October 26, 2016 02:06
@eddyb
Member

eddyb commented Oct 26, 2016

@bors r+

@bors
Contributor

bors commented Oct 26, 2016

📌 Commit 989eba7 has been approved by eddyb

@eddyb eddyb removed the I-nominated and T-lang labels Oct 26, 2016
@bors
Contributor

bors commented Oct 26, 2016

⌛ Testing commit 989eba7 with merge a6b3b01...

bors added a commit that referenced this pull request Oct 26, 2016
…ddyb

Add ArrayVec and AccumulateVec to reduce heap allocations during interning of slices

@nnethercote
Contributor

I did some malloc profiling on Oct 24 (before this landed) and again on Oct 27 (after this landed). I expect the vast majority of changes are due to this PR. Note that these are total/cumulative numbers, i.e. not the number of live blocks/bytes at any point. (The latter numbers barely changed because the allocations avoided by this PR are all short-lived.)

futures-rs-test-all
- tot_alloc:    1,707,560,318 bytes in 6,425,790 blocks
- tot_alloc:    1,684,053,208 bytes in 4,534,644 blocks

helloworld
- tot_alloc:    16,319,139 bytes in 55,957 blocks
- tot_alloc:    16,293,064 bytes in 54,758 blocks

html5ever-2016-08-25
- tot_alloc:    2,694,623,813 bytes in 13,261,685 blocks
- tot_alloc:    2,669,785,090 bytes in 11,574,173 blocks

hyper.0.5.0
- tot_alloc:    2,041,311,149 bytes in 7,190,317 blocks
- tot_alloc:    2,025,502,800 bytes in 6,044,778 blocks

inflate-0.1.0
- tot_alloc:    1,354,371,607 bytes in 3,480,682 blocks
- tot_alloc:    1,347,002,012 bytes in 2,989,486 blocks

issue-32062-equality-relations-complexity
- tot_alloc:    53,487,018 bytes in 485,981 blocks
- tot_alloc:    49,690,973 bytes in 159,117 blocks

issue-32278-big-array-of-strings
- tot_alloc:    646,450,757 bytes in 3,599,115 blocks
- tot_alloc:    623,056,568 bytes in 2,598,961 blocks

jld-day15-parser
- tot_alloc:    545,848,751 bytes in 3,111,576 blocks
- tot_alloc:    525,497,296 bytes in 1,389,803 blocks

piston-image-0.10.3
- tot_alloc:    5,015,922,795 bytes in 19,904,822 blocks
- tot_alloc:    4,921,821,045 bytes in 13,080,308 blocks

reddit-stress
- tot_alloc:    842,543,754 bytes in 4,627,120 blocks
- tot_alloc:    827,087,063 bytes in 3,681,204 blocks

regex.0.1.30
- tot_alloc:    1,242,156,462 bytes in 4,072,953 blocks
- tot_alloc:    1,233,722,963 bytes in 3,456,286 blocks

rust-encoding-0.3.0
- tot_alloc:    846,368,853 bytes in 2,923,601 blocks
- tot_alloc:    842,673,347 bytes in 2,663,391 blocks

syntex-0.42.2
- tot_alloc:    12,563,080,222 bytes in 42,469,971 blocks
- tot_alloc:    12,462,476,374 bytes in 34,220,542 blocks

syntex-0.42.2-incr-clean
- tot_alloc:    5,890,957,601 bytes in 25,516,470 blocks
- tot_alloc:    5,795,781,970 bytes in 18,186,472 blocks
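
To put the block counts in proportion, the relative reduction can be computed directly from the pairs above. This sketch assumes that the first line of each pair is the Oct 24 (before) measurement and the second the Oct 27 (after) one; the function and constant names are illustrative.

```rust
/// Fraction of allocations ("blocks") eliminated between two measurements.
fn block_reduction(before: u64, after: u64) -> f64 {
    1.0 - after as f64 / before as f64
}

/// Block counts copied from the listing above: (benchmark, before, after).
const SAMPLES: [(&str, u64, u64); 3] = [
    ("jld-day15-parser", 3_111_576, 1_389_803),
    ("issue-32062-equality-relations-complexity", 485_981, 159_117),
    ("helloworld", 55_957, 54_758),
];

/// One formatted line per sample, e.g. for printing in a summary.
fn report() -> Vec<String> {
    SAMPLES
        .iter()
        .map(|&(name, before, after)| {
            format!("{}: {:.1}% fewer blocks", name, 100.0 * block_reduction(before, after))
        })
        .collect()
}
```

For these three samples the reduction is roughly 55%, 67%, and 2% respectively, which matches the "50% fewer malloc calls in some cases" claim in the description while showing that small programs barely change.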

@brson brson added the relnotes label Nov 1, 2016