Add CGU size heuristic for partitioning #47415

varkor · 2018-01-13T15:56:54Z

This addresses the concern of #47316 by estimating CGU size based on
the size of its MIR. Looking at the size estimate differences for a
small selection of crates, this heuristic produces different orderings,
which should more accurately reflect optimisation time. (Fixes #47316.)

r? @michaelwoerister

varkor · 2018-01-13T16:00:31Z

src/librustc_mir/monomorphize/partitioning.rs

+
+    // Since `sort_by_key` currently recomputes the keys for each comparison,
+    // we can save unnecessary recomputations by storing size estimates for
+    // each `MonoItem`. Storing estimates for `CodegenUnit` might be preferable,


I'm not sure whether it's worth trying to store estimates for CodegenUnit instead. I found it somewhat troublesome to use CodegenUnit as a key, but perhaps it'd be worth introducing an index for the units to use instead.

JustAPerson · 2018-01-13T21:35:49Z

I was also looking at this issue.
I found another spot where cgus are sorted by "size", so perhaps this should be extracted to a method.

varkor · 2018-01-13T23:54:06Z

Good spot! You're right — I'll include that in the changes. (The fact that each CodegenUnit there is wrapped in an Arc makes it slightly more awkward, so I'll sort it out tomorrow.)
Sorry about the collision — I should really get into the habit of posting on an issue before experimenting with it 😬

michaelwoerister · 2018-01-15T11:46:54Z

Thank you both for looking into this, @varkor and @JustAPerson!

Here is what I would do:

Make mono_item_size_estimate() a method of MonoItem so it's available everywhere.
Cache the size estimate for CodegenUnits somewhere in order not to recompute them again and again. Computing them in the closure passed to sort() will mean computing them multiple times for each item because sorting will compare each item more than once. You could store the size estimate in a field of CodegenUnit for example. You just have to make sure to keep it updated properly during the various stages of partitioning.

varkor · 2018-01-15T12:17:24Z

Okay, I was wondering whether I could get away with extending CodegenUnit to store a size estimate, but I wasn't sure if that was too large a change for one optimisation. That should make things cleaner — I shall do that!

JustAPerson · 2018-01-15T18:44:47Z

@varkor perhaps update the comments at those two sites, because now they're wrong.

varkor · 2018-01-15T18:53:36Z

@JustAPerson: I saw the comments, thought to myself: "I must remember to update those before committing", and then completely forgot! 😅 Thank you!

michaelwoerister · 2018-01-16T10:37:00Z

OK, that looks like a correct implementation to me. Now that I see it though, I think it is doing more work than necessary:

After inlining, the sizes of all MonoItems that have already been present before inlining are computed a second time.
We are estimating the size of each monomorphized instance individually, even though the size does not depend on the concrete type arguments of the instance. That is, we are bound to compute the same size for foo<u32>, foo<Bar>, foo<()>, etc. because tcx.instance_mir() will give as the exact same MIR object for each of those.

I think (2) is more of a problem than (1). It could be solved by adding an instance_def_size_estimate query (and base MonoItem::size_estimate() on that). That would make sure that we estimate the size of each InstanceDef only once and hit the cache for subsequents calls. As a side effect, issue (1) would be much less of a problem because recomputing becomes cheaper.

Such a query could also be optimized by just assigning a fixed small size for compiler generated shims like InstanceDef::FnPtrShim, InstanceDef::Virtual, and so on -- as we already do for statics.

Would one of you be willing to implement that?

varkor · 2018-01-16T12:43:11Z

@JustAPerson: If you'd like to finish the change off to share the work, you're very welcome to! I guess the easiest way would be to make a PR on my branch, which would then appear here after being merged in? I'm happy to make the change if you don't want to though!

JustAPerson · 2018-01-16T18:09:39Z

@varkor I appreciate the offer but I think I'm a bit busy today so go ahead :)

varkor · 2018-01-16T22:29:00Z

InstanceDef now caches its size estimate, so as long as each type instantiation of type-parameterised MonoItem uses the same InstanceDef, it shouldn't have to recompute the size.

michaelwoerister · 2018-01-17T11:08:22Z

Hm, OK, interesting approach; and certainly efficient! But it doesn't feel quite right to me.

The first reason is that it's hard to uphold the condition you state: The code will copy InstanceDefs around now; but someone else might come around later and change that without noticing the detrimental effects. And it's easy to get something wrong here without noticing. For example, your code copies the Instance, caches the size estimate in the copy -- and then immediately discard it, leaving the Instance we'll be probing later still with None:
a20b94d#diff-2f7b82f74576ca9bedc54d0bcfb29ca4R31
You could get around that with a std::cell::Cell but that's asking for trouble. I try to avoid interior mutability as much as possible.

The second reason is that it kind of conflates separate concerns by putting something into InstanceDef that doesn't strictly belong there. It's a bit of a code smell.

I suggest to go the "query" route. It might be slightly less efficient but it will be more robust in the end.

If you need any help with setting up a new query, let me know!

varkor · 2018-01-18T00:47:59Z

@michaelwoerister: Ah, I misunderstood — I didn't realise that queries were provided with define_maps!. (This has been a good learning experience 🙂) I haven't had a chance today, but I'll try to fix this tomorrow.

michaelwoerister · 2018-01-18T11:44:24Z

Oh sorry, yeah, it's easy to slip into rustc-internals-lingo without noticing. Yes, "query" is what we call the things in define_maps!(). Here's a rough writeup on the concept if you are interested: https://github.com/nikomatsakis/rustc-on-demand-incremental-design-doc/blob/master/0000-rustc-on-demand-and-incremental.md

This addresses the concern of rust-lang#47316 by estimating CGU size based on the size of its MIR. Looking at the size estimate differences for a small selection of crates, this heuristic produces different orderings, which should more accurately reflect optimisation time. Fixes rust-lang#47316.

varkor · 2018-01-19T00:53:21Z

I've added an instance_def_size_estimate query as you suggested. There are a couple of places I was a bit unsure about, being unfamiliar with queries in general (annotated, to check they're all right).

varkor · 2018-01-19T00:54:55Z

src/librustc/dep_graph/dep_node.rs

@@ -638,6 +638,7 @@ define_dep_nodes!( <'tcx>
    [input] TargetFeaturesWhitelist,
    [] TargetFeaturesEnabled(DefId),

+    [] InstanceDefSizeEstimate { instance_def: InstanceDef<'tcx> },


Is [] correct here? I wasn't sure what this meant.

Yes, that's correct. Leaving the brackets empty means that this is a regular query/dep-node.

varkor · 2018-01-19T00:55:40Z

src/librustc/ty/maps/plumbing.rs

@@ -761,6 +761,7 @@ pub fn force_from_dep_node<'a, 'gcx, 'lcx>(tcx: TyCtxt<'a, 'gcx, 'lcx>,
        DepKind::EraseRegionsTy |
        DepKind::NormalizeTy |
        DepKind::SubstituteNormalizeAndTestPredicates |
+        DepKind::InstanceDefSizeEstimate |


Is this the correct location? Or should it force! something like some of the queries below?

Yes that's the correct location. Very good! :)

force_from_dep_node takes a DepNode and tries to execute the query corresponding to it. However, a DepNode only consists of a DepKind and an opaque hash value (which acts as a UUID). It is only in special cases, like when the hash value is actually a DefPathHash that we can look up in a table, that we can map from DepNode to the correct query and its arguments. For ty::Instance we don't have such a table that would make it possible to do the mapping.

michaelwoerister

Excellent! This looks almost done. I left some comments below. I'm impressed by how well you managed to navigate our woefully under-documented query system :)

michaelwoerister · 2018-01-19T11:01:32Z

src/librustc/dep_graph/dep_node.rs

@@ -638,6 +638,7 @@ define_dep_nodes!( <'tcx>
    [input] TargetFeaturesWhitelist,
    [] TargetFeaturesEnabled(DefId),

+    [] InstanceDefSizeEstimate { instance_def: InstanceDef<'tcx> },


Yes, that's correct. Leaving the brackets empty means that this is a regular query/dep-node.

michaelwoerister · 2018-01-19T11:04:03Z

src/librustc/mir/mono.rs

+    pub fn size_estimate(&self) -> usize {
+        // Should only be called if `estimate_size` has previously been called.
+        assert!(self.size_estimate.is_some());
+        self.size_estimate.unwrap()


You could replace this combination of assert!() and unwrap() with a call to Option::expect(). That amounts to the same thing but is a bit more concise.

michaelwoerister · 2018-01-19T11:10:49Z

src/librustc/mir/mono.rs

@@ -140,6 +176,7 @@ impl<'tcx> HashStable<StableHashingContext<'tcx>> for CodegenUnit<'tcx> {
        let CodegenUnit {
            ref items,
            name,
+            ..


We keep these destructuring lets exhaustive in order not to miss handling fields that are added later. In a case like this you can explicitly ignore the field:

let CodegenUnit { ref items, name, // The size estimate is not relevant to the hash size_estimate: _, } = *self;

Making these exhaustive has proven to be effective for preventing errors introduced by unrelated refactorings.

michaelwoerister · 2018-01-19T11:17:24Z

src/librustc/ty/maps/plumbing.rs

@@ -761,6 +761,7 @@ pub fn force_from_dep_node<'a, 'gcx, 'lcx>(tcx: TyCtxt<'a, 'gcx, 'lcx>,
        DepKind::EraseRegionsTy |
        DepKind::NormalizeTy |
        DepKind::SubstituteNormalizeAndTestPredicates |
+        DepKind::InstanceDefSizeEstimate |


Yes that's the correct location. Very good! :)

force_from_dep_node takes a DepNode and tries to execute the query corresponding to it. However, a DepNode only consists of a DepKind and an opaque hash value (which acts as a UUID). It is only in special cases, like when the hash value is actually a DefPathHash that we can look up in a table, that we can map from DepNode to the correct query and its arguments. For ty::Instance we don't have such a table that would make it possible to do the mapping.

michaelwoerister · 2018-01-19T11:20:32Z

src/librustc/ty/mod.rs

+        InstanceDef::Item(def_id) => {
+            let mir = tcx.optimized_mir(def_id);
+            mir.basic_blocks().iter().map(|bb| bb.statements.len()).sum()
+        },


I think we should estimate the size of DropGlue via its MIR.

Also a little bit of clean up.

michaelwoerister · 2018-01-22T14:24:44Z

This looks pretty finished to me. Going to tentatively approve. Let me know if there's still something you wanted to do, @varkor. Thanks for the great PR!

@bors r+

bors · 2018-01-26T00:04:36Z

⌛ Testing commit 62703cf with merge 94f7e25e204d8c8fdde9f55ae8a9e08512f7d63d...

alexcrichton · 2018-01-26T00:18:43Z

@bors: retry

prioritizing rollup withthis in it

bors · 2018-01-26T01:58:33Z

⌛ Testing commit 62703cf with merge 061b57274985c173d76ab948510d43168ddc0a12...

alexcrichton · 2018-01-26T02:12:18Z

@bors: retry

prioritizing rollup with this in it

bors · 2018-01-26T09:17:15Z

⌛ Testing commit 62703cf with merge 903f08a...

@michaelwoerister

…ster Add CGU size heuristic for partitioning This addresses the concern of #47316 by estimating CGU size based on the size of its MIR. Looking at the size estimate differences for a small selection of crates, this heuristic produces different orderings, which should more accurately reflect optimisation time. (Fixes #47316.) r? @michaelwoerister

bors · 2018-01-26T12:17:57Z

💔 Test failed - status-appveyor

alexcrichton · 2018-01-26T14:57:05Z

@bors: retry

Spurious appveyor 32-bit test timeouts #46903

bors · 2018-01-26T15:38:36Z

⌛ Testing commit 62703cf with merge 8952c0af0dcc0effb95544b730b6cd1e396c1feb...

alexcrichton · 2018-01-26T15:38:55Z

@bors: retry

prioritizing rollup with this in it

bors · 2018-01-26T17:32:17Z

⌛ Testing commit 62703cf with merge 1e968bad4676ef2cabe13b1c9eca697997151ada...

alexcrichton · 2018-01-26T17:41:32Z

@bors: retry

prioritizing rollup with this in it

Rollup of 19 pull requests - Successful merges: #47415, #47437, #47439, #47453, #47460, #47502, #47529, #47600, #47607, #47618, #47626, #47656, #47668, #47696, #47701, #47705, #47710, #47711, #47719 - Failed merges: #47455, #47521

Add slice::sort_by_cached_key as a memoised sort_by_key At present, `slice::sort_by_key` calls its key function twice for each comparison that is made. When the key function is expensive (which can often be the case when `sort_by_key` is chosen over `sort_by`), this can lead to very suboptimal behaviour. To address this, I've introduced a new slice method, `sort_by_cached_key`, which has identical semantic behaviour to `sort_by_key`, except that it guarantees the key function will only be called once per element. Where there are `n` elements and the key function is `O(m)`: - `slice::sort_by_cached_key` time complexity is `O(m n log m n)`, compared to `slice::sort_by_key`'s `O(m n + n log n)`. - `slice::sort_by_cached_key` space complexity remains at `O(n + m)`. (Technically, it now reserves a slice of size `n`, whereas before it reserved a slice of size `n/2`.) `slice::sort_unstable_by_key` has not been given an analogue, as it is important that unstable sorts are in-place, which is not a property that is guaranteed here. However, this also means that `slice::sort_unstable_by_key` is likely to be slower than `slice::sort_by_cached_key` when the key function does not have negligible complexity. We might want to explore this trade-off further in the future. Benchmarks (for a vector of 100 `i32`s): ``` # Lexicographic: `|x| x.to_string()` test bench_sort_by_key ... bench: 112,638 ns/iter (+/- 19,563) test bench_sort_by_cached_key ... bench: 15,038 ns/iter (+/- 4,814) # Identity: `|x| *x` test bench_sort_by_key ... bench: 1,346 ns/iter (+/- 238) test bench_sort_by_cached_key ... bench: 1,839 ns/iter (+/- 765) # Power: `|x| x.pow(31)` test bench_sort_by_key ... bench: 3,624 ns/iter (+/- 738) test bench_sort_by_cached_key ... bench: 1,997 ns/iter (+/- 311) # Abs: `|x| x.abs()` test bench_sort_by_key ... bench: 1,546 ns/iter (+/- 174) test bench_sort_by_cached_key ... bench: 1,668 ns/iter (+/- 790) ``` (So it seems functions that are single operations do perform slightly worse with this method, but for pretty much any more complex key, you're better off with this optimisation.) I've definitely found myself using expensive keys in the past and wishing this optimisation was made (e.g. for rust-lang#47415). This feels like both desirable and expected behaviour, at the small cost of slightly more stack allocation and minute degradation in performance for extremely trivial keys. Resolves rust-lang#34447.

rust-highfive assigned michaelwoerister Jan 13, 2018

varkor commented Jan 13, 2018

View reviewed changes

kennytm added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 14, 2018

varkor force-pushed the cgu-partition-heuristic branch 2 times, most recently from f055bd0 to f085542 Compare January 15, 2018 18:31

varkor force-pushed the cgu-partition-heuristic branch from a20b94d to 98752ea Compare January 19, 2018 00:33

varkor added 4 commits January 19, 2018 00:41

Refactor CodegenUnit size estimates

e60b0f8

Update comments about the partitioning inefficiency

c8e9da4

Add instance_def_size_estimate query

5c9a8b5

varkor force-pushed the cgu-partition-heuristic branch from 98752ea to 5c9a8b5 Compare January 19, 2018 00:51

varkor commented Jan 19, 2018

View reviewed changes

michaelwoerister reviewed Jan 19, 2018

View reviewed changes

Estimate size of InstanceDef::DropGlue more accurately

62703cf

Also a little bit of clean up.

bors merged commit 62703cf into rust-lang:master Jan 26, 2018

varkor deleted the cgu-partition-heuristic branch January 27, 2018 14:31

varkor mentioned this pull request Mar 1, 2018

Add slice::sort_by_cached_key as a memoised sort_by_key #48639

Merged

Add CGU size heuristic for partitioning #47415

Add CGU size heuristic for partitioning #47415

Conversation

varkor commented Jan 13, 2018 • edited Loading

Choose a reason for hiding this comment

JustAPerson commented Jan 13, 2018

varkor commented Jan 13, 2018

michaelwoerister commented Jan 15, 2018

varkor commented Jan 15, 2018

JustAPerson commented Jan 15, 2018

varkor commented Jan 15, 2018

michaelwoerister commented Jan 16, 2018

varkor commented Jan 16, 2018

JustAPerson commented Jan 16, 2018

varkor commented Jan 16, 2018

michaelwoerister commented Jan 17, 2018

varkor commented Jan 18, 2018 • edited Loading

michaelwoerister commented Jan 18, 2018

varkor commented Jan 19, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelwoerister left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelwoerister commented Jan 22, 2018

bors commented Jan 26, 2018

alexcrichton commented Jan 26, 2018

bors commented Jan 26, 2018

alexcrichton commented Jan 26, 2018

bors commented Jan 26, 2018

bors commented Jan 26, 2018

alexcrichton commented Jan 26, 2018

bors commented Jan 26, 2018

alexcrichton commented Jan 26, 2018

bors commented Jan 26, 2018

alexcrichton commented Jan 26, 2018

varkor commented Jan 13, 2018 •

edited

Loading

varkor commented Jan 18, 2018 •

edited

Loading