-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ArrayVec and AccumulateVec to reduce heap allocations during interning of slices #37270
Add ArrayVec and AccumulateVec to reduce heap allocations during interning of slices #37270
Conversation
f963bbd
to
beee655
Compare
cc @rust-lang/lang We'd like to have impls of |
☔ The latest upstream changes (presumably #37269) made this pull request unmergeable. Please resolve the merge conflicts. |
type Target = [T]; | ||
fn deref(&self) -> &Self::Target { | ||
let values = &self.values[..self.count]; | ||
unsafe { &*(values as *const [_] as *const [T]) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a good place to use from_raw_parts and skip the bounds check for the slice? Or does it not matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, didn't think of it like that. Yeah it would be slightly less cruft to optimize away.
} | ||
|
||
#[allow(unions_with_drop_fields, dead_code)] | ||
union ManuallyDrop<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since rustc devs are the UB police, is this guaranteed to be convertible to T
, even without #[repr(C)]
? (I'm reading this PR with interest for the future of the external arrayvec and smallvec crates of course.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's technically in newtype form, so we can guarantee it at no cost. However, adding #[repr(C)]
would be better here, for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding #[repr(C)]
would require that T is repr(C), I thought? Or is that just a lint, and we can allow it away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should work fine, try it out.
|
||
impl<T> Drop for SmallVec8<T> { | ||
fn drop(&mut self) { | ||
for i in 0..self.count { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have DerefMut, this is just drop_in_place(&mut self[..])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding DerefMut feels unsafe since we then have potential for safe mutation of the stack allocated buffer, which we probably don't want to allow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see what's unsound about modifying the buffer, since it's only making a slice of the valid/initialized element range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's perfectly safe, it can't do anything harmful, not even semantically.
80c2bcf
to
bad8be6
Compare
ManuallyDrop { empty: () }, | ||
ManuallyDrop { empty: () }, | ||
ManuallyDrop { empty: () }, | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two variables need to be in a SmallVec8
variable for unwinding to drop the right values.
ManuallyDrop { empty: () }, | ||
]; | ||
for el in iter { | ||
values[count] = ManuallyDrop { value: el }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use field assignment, i.e .value = el
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving comments to myself, mainly, but a few questions for @eddyb as well.
|
||
pub fn mk_substs_trait(self, | ||
s: Ty<'tcx>, | ||
t: &[Ty<'tcx>]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably take I: SliceInternable
, but then the current code doesn't support passing it to iter::chain, since SliceInternable
is not an Iterator
. @eddyb Thoughts on what we should/could do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most users are genuinely slice though. Not high-priority.
@@ -298,20 +275,20 @@ impl<'a, 'gcx, 'tcx> Substs<'tcx> { | |||
target_substs: &Substs<'tcx>) | |||
-> &'tcx Substs<'tcx> { | |||
let defs = tcx.lookup_generics(source_ancestor); | |||
Substs::new(tcx, target_substs.iter().chain(&self[defs.own_count()..]).cloned()) | |||
tcx.mk_substs(target_substs.iter().chain(&self[defs.own_count()..]).cloned()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to avoid Iterator::cloned()
, but the types don't match up here to allow that, since we're iterating over &Substs
, not Substs
. I'm not sure if there is/what the good solution here is; I believe this is semi-common though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't avoid an iterator here. .cloned()
is suspicious when used on a slice that can be passed directly - there's no cloning cost, these are just pointer copies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify as to if it's possible to avoid .cloned()
here? The target_substs
and (I believe) the slice into self
are both &Kind instead of Kind. Or is it Ty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not but that's irrelevant here. cloned
is not bad, it's just a sign of a slice iterator, which in some cases (not this) can be replaced with the slice itself.
} | ||
|
||
#[allow(unions_with_drop_fields, dead_code)] | ||
union ManuallyDrop<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding #[repr(C)]
would require that T is repr(C), I thought? Or is that just a lint, and we can allow it away?
|
||
impl<T> Drop for SmallVec8<T> { | ||
fn drop(&mut self) { | ||
for i in 0..self.count { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding DerefMut feels unsafe since we then have potential for safe mutation of the stack allocated buffer, which we probably don't want to allow?
@@ -800,7 +800,7 @@ fn build_free<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>, | |||
-> TerminatorKind<'tcx> { | |||
let free_func = tcx.lang_items.require(lang_items::BoxFreeFnLangItem) | |||
.unwrap_or_else(|e| tcx.sess.fatal(&e)); | |||
let substs = Substs::new(tcx, iter::once(Kind::from(data.item_ty))); | |||
let substs = tcx.mk_substs(iter::once(Kind::from(data.item_ty))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this might be easier to optimize if we skip the iter::once
, since we're not joining it with anything anyway. Leaving this is a note to self, mostly, since I'm not sure we can omit the iterator with the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intern_substs
with a slice.
@@ -146,7 +146,7 @@ impl<'a, 'gcx, 'tcx> Cx<'a, 'gcx, 'tcx> { | |||
params: &[Ty<'tcx>]) | |||
-> (Ty<'tcx>, Literal<'tcx>) { | |||
let method_name = token::intern(method_name); | |||
let substs = Substs::new_trait(self.tcx, self_ty, params); | |||
let substs = self.tcx.mk_substs_trait(self_ty, params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be taking a params: SliceInternable<...>
instead of &[...]
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish. Most of these take a slice though.
@@ -1663,8 +1663,8 @@ impl<'o, 'gcx: 'tcx, 'tcx> AstConv<'gcx, 'tcx>+'o { | |||
hir::TyTup(ref fields) => { | |||
let flds = fields.iter() | |||
.map(|t| self.ast_ty_to_ty(rscope, &t)) | |||
.collect(); | |||
tcx.mk_tup(flds) | |||
.collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should collect into a MaybeSmallVec8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this use an itera- ah shucks, the slice impl problem.
@@ -88,7 +88,7 @@ impl<'a, 'gcx, 'tcx> FnCtxt<'a, 'gcx, 'tcx> { | |||
|
|||
// Tuple up the arguments and insert the resulting function type into | |||
// the `closures` table. | |||
fn_ty.sig.0.inputs = vec![self.tcx.mk_tup(fn_ty.sig.0.inputs)]; | |||
fn_ty.sig.0.inputs = vec![self.tcx.mk_tup(&fn_ty.sig.0.inputs)]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be allocating a vec![...]
here? I feel like we might be able to change the struct's type to a SmallVec8
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correct thing to do is to intern a slice, but that's a bit more complicated of a refactoring.
@@ -3733,11 +3733,11 @@ impl<'a, 'gcx, 'tcx> FnCtxt<'a, 'gcx, 'tcx> { | |||
}; | |||
err_field = err_field || t.references_error(); | |||
t | |||
}).collect(); | |||
}).collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't collect into a Vec. (this requires slice impl)
unsafe { | ||
v.values[v.count].value = el; | ||
} | ||
v.count += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you make this loop an Extend
impl, you can then have fill_item
take T: Extend<Kind> + Deref<Target=[Kind]>
.
EDIT: Moved this to a PR comment to be able to link it. Oh, turns out That is, there is a "minimum struct/union" alignment (not sure where in the C standard, but LLVM supports it and so do we! I just checked and no built-in target specs have minimum ABI alignment for struct/union that's larger than one byte, just "preferred alignment"). The correct thing to do is Thanks to @ubsan and @retep998 on IRC for figuring this out! |
} | ||
|
||
fn fill_item<FR, FT>(substs: &mut Vec<Kind<'tcx>>, | ||
fn fill_item<FR, FT, T: Extend<Kind<'tcx>> + Deref<Target=[Kind<'tcx>]>>(substs: &mut T, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put the bounds in the where clause.
@@ -679,7 +679,7 @@ fn subst_ty_renumber_bound() { | |||
env.t_fn(&[t_param], env.t_nil()) | |||
}; | |||
|
|||
let substs = Substs::new(env.infcx.tcx, iter::once(Kind::from(t_rptr_bound1))); | |||
let substs = env.infcx.tcx.mk_substs(iter::once(Kind::from(t_rptr_bound1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can pass a slice to intern_substs
.
@@ -714,7 +714,7 @@ fn subst_ty_renumber_some_bounds() { | |||
env.t_pair(t_param, env.t_fn(&[t_param], env.t_nil())) | |||
}; | |||
|
|||
let substs = Substs::new(env.infcx.tcx, iter::once(Kind::from(t_rptr_bound1))); | |||
let substs = env.infcx.tcx.mk_substs(iter::once(Kind::from(t_rptr_bound1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
@@ -776,7 +776,7 @@ fn subst_region_renumber_region() { | |||
env.t_fn(&[env.t_rptr(re_early)], env.t_nil()) | |||
}; | |||
|
|||
let substs = Substs::new(env.infcx.tcx, iter::once(Kind::from(re_bound1))); | |||
let substs = env.infcx.tcx.mk_substs(iter::once(Kind::from(re_bound1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
f21961f
to
a36e00c
Compare
☔ The latest upstream changes (presumably #37220) made this pull request unmergeable. Please resolve the merge conflicts. |
a36e00c
to
1a3cb0f
Compare
@@ -191,12 +191,13 @@ impl<'a, 'gcx, 'tcx> Substs<'tcx> { | |||
} | |||
} | |||
|
|||
fn fill_item<FR, FT, T: Extend<Kind<'tcx>> + Deref<Target=[Kind<'tcx>]>>(substs: &mut T, | |||
fn fill_item<T, FR, FT>(substs: &mut T, | |||
tcx: TyCtxt<'a, 'gcx, 'tcx>, | |||
defs: &ty::Generics<'tcx>, | |||
mk_region: &mut FR, | |||
mk_type: &mut FT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these arguments need to be reindented.
d39d8f4
to
25d3b4a
Compare
Performance, compared with current master, stage2 release builds (configured with no arguments):
|
25d3b4a
to
7517e34
Compare
@Mark-Simulacrum I'm surprised a bit, I would've expected more wins given how high allocation was on the profile. Maybe the |
OK, so, @rust-lang/lang: @eddyb and I had a short chat about fundamental traits on IRC. I am not comfortable with declaring However, I would be comfortable with doing a targeted extension to fundamental. Basically a mildly hacky negative impl. For example, we discussed allowing We would thus be committed to adding negative impls of How do people feel about that? (I'm happy to guide @Mark-Simulacrum through the steps to do that change.) |
@nikomatsakis first paragraph is how I felt when I saw @eddyb's question also. One of the biggest problems with
|
Do note that we don't care about |
I guess part of this is that, if I were happy with But yeah this does mean we are committing to either @eddyb tbh I meant |
IIRC, LLVM can optimize even the malloc case better if you have a fixed capacity (i.e. use |
pub fn intern_substs(self, ts: &[Kind<'tcx>]) -> &'tcx Slice<Kind<'tcx>> { | ||
if ts.len() == 0 { | ||
return unsafe { | ||
mem::transmute(slice::from_raw_parts(0x1 as *mut Kind<'tcx>, 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be just Slice::empty()
like the other one.
b49bfe9
to
d56f099
Compare
d56f099
to
c751299
Compare
c751299
to
a2efca9
Compare
// option. This file may not be copied, modified, or distributed | ||
// except according to those terms. | ||
|
||
//! A vector type intended to be used for collecting from iterators onto the stack. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like a slightly more expansive comment that mentions what the threshold is (8) and what happens when it's exceeded.
// option. This file may not be copied, modified, or distributed | ||
// except according to those terms. | ||
|
||
//! A small vector type intended for storing <= 8 values on the stack. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, please explain what happens if 8 is exceeded.
a2efca9
to
2059fdc
Compare
AccumulateVec is generic over the Array trait, which is currently only implemented for [T; 8].
2059fdc
to
989eba7
Compare
@bors r+ |
📌 Commit 989eba7 has been approved by |
⌛ Testing commit 989eba7 with merge a6b3b01... |
…ddyb Add ArrayVec and AccumulateVec to reduce heap allocations during interning of slices Updates `mk_tup`, `mk_type_list`, and `mk_substs` to allow interning directly from iterators. The previous PR, #37220, changed some of the calls to pass a borrowed slice from `Vec` instead of directly passing the iterator, and these changes further optimize that to avoid the allocation entirely. This change yields 50% less malloc calls in [some cases](https://pastebin.mozilla.org/8921686). It also yields decent, though not amazing, performance improvements: ``` futures-rs-test 4.091s vs 4.021s --> 1.017x faster (variance: 1.004x, 1.004x) helloworld 0.219s vs 0.220s --> 0.993x faster (variance: 1.010x, 1.018x) html5ever-2016- 3.805s vs 3.736s --> 1.018x faster (variance: 1.003x, 1.009x) hyper.0.5.0 4.609s vs 4.571s --> 1.008x faster (variance: 1.015x, 1.017x) inflate-0.1.0 3.864s vs 3.883s --> 0.995x faster (variance: 1.232x, 1.005x) issue-32062-equ 0.309s vs 0.299s --> 1.033x faster (variance: 1.014x, 1.003x) issue-32278-big 1.614s vs 1.594s --> 1.013x faster (variance: 1.007x, 1.004x) jld-day15-parse 1.390s vs 1.326s --> 1.049x faster (variance: 1.006x, 1.009x) piston-image-0. 10.930s vs 10.675s --> 1.024x faster (variance: 1.006x, 1.010x) reddit-stress 2.302s vs 2.261s --> 1.019x faster (variance: 1.010x, 1.026x) regex.0.1.30 2.250s vs 2.240s --> 1.005x faster (variance: 1.087x, 1.011x) rust-encoding-0 1.895s vs 1.887s --> 1.005x faster (variance: 1.005x, 1.018x) syntex-0.42.2 29.045s vs 28.663s --> 1.013x faster (variance: 1.004x, 1.006x) syntex-0.42.2-i 13.925s vs 13.868s --> 1.004x faster (variance: 1.022x, 1.007x) ``` We implement a small-size optimized vector, intended to be used primarily for collection of presumed to be short iterators. This vector cannot be "upsized/reallocated" into a heap-allocated vector, since that would require (slow) branching logic, but during the initial collection from an iterator heap-allocation is possible. We make the new `AccumulateVec` and `ArrayVec` generic over implementors of the `Array` trait, of which there is currently one, `[T; 8]`. In the future, this is likely to expand to other values of N. Huge thanks to @nnethercote for collecting the performance and other statistics mentioned above.
I did some malloc profiling on Oct 24 (before this landed) and again on Oct 27 (after this landed). I expect the vast majority of changes are due to this PR. Note that these are total/cumulative numbers, i.e. not the number of live blocks/bytes at any point. (The latter numbers barely changed because the allocations avoided by this PR are all short-lived.)
|
Updates
mk_tup
,mk_type_list
, andmk_substs
to allow interning directly from iterators. The previous PR, #37220, changed some of the calls to pass a borrowed slice fromVec
instead of directly passing the iterator, and these changes further optimize that to avoid the allocation entirely.This change yields 50% less malloc calls in some cases. It also yields decent, though not amazing, performance improvements:
We implement a small-size optimized vector, intended to be used primarily for collection of presumed to be short iterators. This vector cannot be "upsized/reallocated" into a heap-allocated vector, since that would require (slow) branching logic, but during the initial collection from an iterator heap-allocation is possible.
We make the new
AccumulateVec
andArrayVec
generic over implementors of theArray
trait, of which there is currently one,[T; 8]
. In the future, this is likely to expand to other values of N.Huge thanks to @nnethercote for collecting the performance and other statistics mentioned above.