for a more even partitioning inline before merge #65281

Closed
wants to merge 2 commits

Conversation

andjo403
Contributor

andjo403 commented Oct 10, 2019

consider the size of the inlined items for a more even partitioning

For me, this change takes the compile time for script-servo-opt from 306s to 249s.

Edit: these times are for a 32-thread CPU.

cc #64913

@rust-highfive
Collaborator

r? @petrochenkov

(rust_highfive has picked a reviewer for you, use r? to override)

rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) label Oct 10, 2019
@wesleywiser
Member

@bors try @rust-timer queue

@bors
Contributor

bors commented Oct 10, 2019

⌛ Trying commit 6396e58bc8d6fdf1646161ea28e01445343e3a74 with merge 977495b636d186bcf98d6f7360b5d74c4d1bf5d6...

@bors
Contributor

bors commented Oct 10, 2019

☀️ Try build successful - checks-azure
Build commit: 977495b636d186bcf98d6f7360b5d74c4d1bf5d6 (977495b636d186bcf98d6f7360b5d74c4d1bf5d6)

@nnethercote
Contributor

@rust-timer build 977495b636d186bcf98d6f7360b5d74c4d1bf5d6

@rust-timer
Collaborator

Queued 977495b636d186bcf98d6f7360b5d74c4d1bf5d6 with parent 58b5491, future comparison URL.

@petrochenkov
Contributor

r? @michaelwoerister

@petrochenkov petrochenkov added S-waiting-on-perf Status: Waiting on a perf run to be completed. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 11, 2019
@michaelwoerister
Member

Thanks for the PR, @andjo403! This might be a great find!

Generally, the changes look good to me. I want to think a bit more about whether there was a reason not to do things in this order in the first place and, if so, whether that reason still applies today.

Also, it looks like some of the code here does some unnecessary work around size estimation and mono-item placement (even before this PR). That should be fixed as part of this PR.

I'll take another (closer) look on Monday.

@rust-timer
Collaborator

Finished benchmarking try commit 977495b636d186bcf98d6f7360b5d74c4d1bf5d6, comparison URL.

@Eh2406
Contributor

Eh2406 commented Oct 11, 2019

The perf results are underwhelming. Is it possible that the perf machine is not powerful enough for this to help?

@mati865
Contributor

mati865 commented Oct 11, 2019

IIRC perf is running on a 4C/8T CPU and the OP has 32 threads. That can make a big difference.
cc @Mark-Simulacrum

@andjo403
Contributor Author

As there are 16 CGUs by default, at least 16 threads are probably needed for the large time win.

@andjo403
Contributor Author

andjo403 commented Oct 11, 2019

Also, the wall times show the result better.
Edit: not just better; this PR should only affect wall time, since the total amount of work done is the same, only ordered better for parallel execution.

@Mark-Simulacrum
Member

Yes, perf may not be sufficiently parallel to see improvements here. However, the lack of regressions seems good - if we are reliably seeing good results locally, I think landing probably makes sense from a performance perspective?

@michaelwoerister
Member

Hm, so one interesting thing here is that adding up size estimates is not quite right for predicting the size of the merged CGU, because with inlining there can be duplicates among the mono-items. For example, say we have cgu1 and cgu2, containing regular functions F1 and F2 respectively. Let's say both F1 and F2 each pull in the inline functions I1, I2, and I3. So after inlining we'd have the following CGUs:

cgu1 = { F1, I1, I2, I3 }
cgu2 = { F2, I1, I2, I3 }

Let's say (for simplicity's sake) that all functions have a size of 1. Then CGU sizes would be:

size_of(cgu1) = 4
size_of(cgu2) = 4
size_of(cgu1) + size_of(cgu2) = 8
size_of(merge(cgu1, cgu2)) = 5

I'm not sure whether it would make a difference if size estimates for merged CGUs took that into account.
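
A minimal, self-contained Rust illustration of that arithmetic (hypothetical item names, every function counted as size 1 as above): summing the per-CGU estimates double-counts the shared inline items, while the merged CGU deduplicates them.

    use std::collections::HashSet;

    fn main() {
        let cgu1: HashSet<&str> = HashSet::from(["F1", "I1", "I2", "I3"]);
        let cgu2: HashSet<&str> = HashSet::from(["F2", "I1", "I2", "I3"]);

        // Summing the two estimates double-counts I1, I2 and I3.
        assert_eq!(cgu1.len() + cgu2.len(), 8);

        // The merged CGU contains each item only once.
        assert_eq!(cgu1.union(&cgu2).count(), 5);
    }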

@michaelwoerister
Member

@nnethercote, what's your take on these performance numbers?

There are some regressions (like regex-debug and encoding-debug) that might be real and not just noise. On the other hand, this optimization makes a lot of sense.

@andjo403
Contributor Author

I do not understand how there can be duplicates after the merge? There is only one map with all the mono items. It was to handle the non-duplicated functions that I was calling the estimate function after each merge.

@michaelwoerister
Member

@andjo403 You are right, there cannot be duplicates after merging. What I was writing about is that the pre-merge logic does not account for duplicates.

I'm thinking of something like the following example:

cgu1 = { F1, I1, I2, I3, I4 }
cgu2 = { F2, I1, I2, I3 }
cgu3 = { F3, I5, I6, I7 }

The logic in the PR would produce:

cgu1 = { F1, I1, I2, I3, I4 }
cgu2.3 = { F2, F3, I1, I2, I3, I5, I6, I7 }

while a smarter algorithm could produce:

cgu1.2 = { F1, F2, I1, I2, I3, I4 }
cgu3 = { F3, I5, I6, I7 }

I don't know if such examples are common in practice. It might be worth a try though.

Note that the algorithm currently in the compiler would still do worse than the one in this PR :)

@andjo403
Contributor Author

If we think that it is too much overhead to call estimate_size for the CGU, it is possible to add a counter that only sums the sizes of the mono items that are actually added to the merged CGU and then call modify_size_estimate with that, but in the logs I have looked at this takes less than 1 ms.

About the regressions: from what I can see, it is LLVM_module_optimize_module_passes that accounts for most of them. I do not understand how this split affects the debug builds, where no inlining is happening.
For opt builds, I think it is possible that the amount of inlining can affect how much work LLVM can do.

@andjo403
Contributor Author

I have been thinking about a smarter merge that takes the duplicated functions into account, but I have not been able to find an algorithm that is not brute force, and brute force would take too long.

@michaelwoerister
Member

I have been thinking about a smarter merge that takes the duplicated functions into account, but I have not been able to find an algorithm that is not brute force, and brute force would take too long.

Yes, the naive approach is probably at least O(n²) where n=number of CGUs. Sounds like a fun challenge :) But not necessarily for this PR.
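
For concreteness, a hypothetical sketch of that naive quadratic approach (plain HashSets of item names stand in for real CGUs, every item counts as size 1, and at least two CGUs are assumed): at each step, merge the pair whose deduplicated union is smallest.

    use std::collections::HashSet;

    // Repeatedly merge the pair of CGUs with the smallest union
    // (deduplicated item count) until only `target` CGUs remain.
    // Real code would sum per-item size estimates instead.
    fn merge_overlap_aware(
        mut cgus: Vec<HashSet<String>>,
        target: usize,
    ) -> Vec<HashSet<String>> {
        while cgus.len() > target {
            let mut best = (0, 1, usize::MAX);
            for i in 0..cgus.len() {
                for j in (i + 1)..cgus.len() {
                    let union = cgus[i].union(&cgus[j]).count();
                    if union < best.2 {
                        best = (i, j, union);
                    }
                }
            }
            let (i, j, _) = best;
            let absorbed = cgus.swap_remove(j); // j > i, so index i stays valid
            cgus[i].extend(absorbed);
        }
        cgus
    }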

@nnethercote
Contributor

@nnethercote, what's your take on these performance numbers?

Those numbers are very messy and inconclusive. Unfortunately we can't use instruction counts for this one, so we have to look at wall times, which have high variance. Based on just the numbers, I would say that the combination of improvements and regressions means it's not a clear win, so I would be conservative and recommend against landing it. But I don't know anything about the theory of the patch, and whether that is compelling.

It might be interesting to do another perf run and see how much the numbers change. That would at least give an idea of what is and is not noise.

@andjo403
Contributor Author

@michaelwoerister if we are to try another perf run, did you have some changes that you wanted first? I think it is also a good idea to rebase before the rerun, as the base can change the outcome as well.

@michaelwoerister
Member

Thanks for taking a look, @nnethercote! We'll do another perf-run.

@andjo403, yes, depending on how much you want to tinker:

  1. Add a comment to the CodegenUnit::size_estimate field, saying this field is initialized during the CGU partitioning pass and can always be expected to be Some after that.
  2. Make size estimation more efficient by locally caching the sizes of each mono-item and updating sizes during merging. You could have something like:
    struct SizeEstimator {
        cache: FxHashMap<MonoItem, usize>,
    }

    impl SizeEstimator {
        fn size_estimate(&mut self, mono_item: &MonoItem) -> usize { ... }
    }
    and then do this during merging:
    for (k, v) in smallest.items_mut().drain() {
        // Compute the size before `k` is moved into the other CGU's map.
        let item_size = estimator.size_estimate(&k);
        // Only grow the estimate when the item was not already present.
        if second_smallest.items_mut().insert(k, v).is_none() {
            second_smallest.modify_size_estimate(item_size);
        }
    }
    Or you could just make size estimation for CodegenUnit use the SizeEstimator. In any case, feel free to move the entire size estimation logic to partitioning.rs and just leave accessor methods on CodegenUnit.

consider the size of the inlined items for a more even partitioning
@andjo403
Contributor Author

@michaelwoerister rebased and made an attempt to address the comments

@michaelwoerister
Member

Thanks, @andjo403! Let's do the perf run.

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Oct 17, 2019

⌛ Trying commit a465d97 with merge f0ea016...

bors added a commit that referenced this pull request Oct 17, 2019
for a more even partitioning inline before merge

@bors
Contributor

bors commented Oct 17, 2019

☀️ Try build successful - checks-azure
Build commit: f0ea016 (f0ea016f70fa7e96f190c675f339961f950dab17)

@rust-timer
Collaborator

Queued f0ea016 with parent 7e49800, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit f0ea016, comparison URL.

@michaelwoerister
Member

Thanks for implementing size estimation caching, @andjo403! The new code is actually nicer than before.

From the second perf run it looks like webrender-opt clean and tokio-webpush-simple-opt clean consistently regress by 3-4%. style-servo-opt clean is similar but with only 1% wall-time regression. Looking at the detailed view, it looks like both cases spend more time in LLVM IR generation ("codegen_module") and LLVM itself. This leads me to believe that for those two crates the old scheme, by chance, produces less LLVM IR because fewer inline functions get duplicated.

I'm a bit conflicted about what to do with the PR. It seems like an intrinsically good idea but it does not seem to universally improve the situation. I'm still wondering if there is a (simple) way to make merging smarter with respect to inline function duplication. We wouldn't need perfect results.

@andjo403
Contributor Author

@michaelwoerister I cannot find a "simple" solution for the merge. I have been looking into some kind of MinHash, but that will only tell which sets of functions are similar, so it can be used to sort; the question then is still how to select what to merge, as we do not want to merge two large CGUs only because they are similar. I'm OK with closing the PR if this is not what we want, even if it is sad for me, who has many cores and cannot use them.
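
For reference, a minimal self-contained MinHash sketch (hypothetical; real code would hash rustc's MonoItems rather than strings). Signatures that agree in many slots indicate CGUs with similar item sets, which could be used to sort merge candidates:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Hash an item together with a per-round seed.
    fn seeded_hash(seed: u64, item: &str) -> u64 {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        h.finish()
    }

    // MinHash signature of a CGU's item set: one minimum per seed.
    fn signature(items: &[&str], rounds: u64) -> Vec<u64> {
        (0..rounds)
            .map(|seed| {
                items
                    .iter()
                    .map(|item| seeded_hash(seed, item))
                    .min()
                    .unwrap_or(u64::MAX)
            })
            .collect()
    }

    // The fraction of matching slots estimates the Jaccard similarity.
    fn similarity(a: &[u64], b: &[u64]) -> f64 {
        let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
        same as f64 / a.len() as f64
    }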

@michaelwoerister
Member

@andjo403 Would you be up for doing a local run of our benchmark suite? Then we would have a clearer picture of how performance looks at higher core counts. We could also maybe make the behavior depend on the actual core count of the machine (although that might turn out to be tricky, because the jobserver gives out tokens dynamically).
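
As a hypothetical sketch of such core-count-dependent behavior (std::thread::available_parallelism is a stand-in here; as noted, the real constraint is that the jobserver hands out tokens dynamically):

    use std::thread;

    // Pick a CGU count from the visible core count, never going below
    // the current default of 16 mentioned earlier in this thread.
    fn cgu_count_for_machine() -> usize {
        let cores = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        cores.max(16)
    }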

@andjo403
Contributor Author

[image: local benchmark results, top 12 largest changes]
style-servo and script-servo failed due to #65846. It seems like there is no point in this PR without the smarter merge function, but at least we know that there is potential for gains in this area.

Shall I close this PR?
Where can the discussion about how to do the merge continue?

@michaelwoerister
Member

Thanks for collecting that data, @andjo403!

I concur that the results are too mixed for making this change. Let's close this PR but let's also make sure to record the findings in a followup issue.

@andjo403
Contributor Author

@michaelwoerister I think I have changed my mind: maybe we should merge it anyway, as the current perf numbers depend on luck more than on some deliberate algorithm. I also think that the MIR inliner PR #68213 has the same problem.
Do you think it is possible to have some more people look at this?
Maybe the partitioning should get a working meeting or something.

@michaelwoerister
Member

@andjo403 It probably makes sense for someone to look into this. What we have at the moment is strictly a "better than nothing" solution.
I personally will not have time to do something here. The people around the MIR inlining work might be better equipped?

@tgnottingham
Contributor

@andjo403, with #86777 merged, would you like to reopen this and see how it performs? It's still going to come down to luck of the partition to a degree, at least until we make some other things smarter, but I think that this was a step in the right direction.
