for a more even partitioning inline before merge #65281

Closed
wants to merge 2 commits

Conversation

andjo403
Contributor

andjo403 commented Oct 10, 2019

consider the size of the inlined items for a more even partitioning

For me, this change takes the compile time for script-servo-opt from 306s to 249s.

Edit: these times are for a 32-thread CPU.

cc #64913

@rust-highfive
Collaborator

r? @petrochenkov

(rust_highfive has picked a reviewer for you, use r? to override)

rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) label Oct 10, 2019
@wesleywiser
Member

@bors try @rust-timer queue

@bors
Contributor

bors commented Oct 10, 2019

⌛ Trying commit 6396e58bc8d6fdf1646161ea28e01445343e3a74 with merge 977495b636d186bcf98d6f7360b5d74c4d1bf5d6...

@bors
Contributor

bors commented Oct 10, 2019

☀️ Try build successful - checks-azure
Build commit: 977495b636d186bcf98d6f7360b5d74c4d1bf5d6 (977495b636d186bcf98d6f7360b5d74c4d1bf5d6)

@nnethercote
Contributor

@rust-timer build 977495b636d186bcf98d6f7360b5d74c4d1bf5d6

@rust-timer
Collaborator

Queued 977495b636d186bcf98d6f7360b5d74c4d1bf5d6 with parent 58b5491, future comparison URL.

@petrochenkov
Contributor

r? @michaelwoerister

@petrochenkov petrochenkov added S-waiting-on-perf Status: Waiting on a perf run to be completed. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 11, 2019
@michaelwoerister
Member

Thanks for the PR, @andjo403! This might be a great find!

Generally, the changes look good to me. I want to think a bit more about whether there was a reason not to do things in this order in the first place and, if so, whether that reason still applies today.

Also, it looks like some of the code here does some unnecessary work around size estimation and mono-item placement (even before this PR). That should be fixed as part of this PR.

I'll take another (closer) look on Monday.

@rust-timer
Collaborator

Finished benchmarking try commit 977495b636d186bcf98d6f7360b5d74c4d1bf5d6, comparison URL.

@Eh2406
Contributor

Eh2406 commented Oct 11, 2019

The perf results are underwhelming. Is it possible that the perf machine is not powerful enough for this to help?

@mati865
Contributor

mati865 commented Oct 11, 2019

IIRC perf is running on a 4C/8T CPU and the OP has 32 threads. That can make a big difference.
cc @Mark-Simulacrum

@andjo403
Contributor Author

As there are 16 CGUs by default, at least 16 threads are probably needed for the large time win.

@andjo403
Contributor Author

andjo403 commented Oct 11, 2019

Also, the wall times show the result better.
Edit: not just better; this PR should only affect wall time, since the total amount of work done is the same, only ordered better for parallel execution.

@Mark-Simulacrum
Member

Yes, perf may not be sufficiently parallel to see improvements here. However, the lack of regressions seems good - if we are reliably seeing good results locally, I think landing probably makes sense from a performance perspective?

@michaelwoerister
Member

Hm, so one interesting thing here is that adding up size estimates is not quite right for predicting the size of the merged CGU, because with inlining there can be duplicates among the mono-items. For example, say we have cgu1 and cgu2, containing regular functions F1 and F2 respectively. Let's say both F1 and F2 each pull in the inline functions I1, I2, and I3. So after inlining we'd have the following CGUs:

cgu1 = { F1, I1, I2, I3 }
cgu2 = { F2, I1, I2, I3 }

Let's say (for simplicity's sake) that all functions have a size of 1. Then CGU sizes would be:

size_of(cgu1) = 4
size_of(cgu2) = 4
size_of(cgu1) + size_of(cgu2) = 8
size_of(merge(cgu1, cgu2)) = 5

I'm not sure whether it would make a difference if size estimates for merged CGUs took that into account.
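
A minimal, self-contained Rust illustration of that arithmetic (hypothetical item names, every function counted as size 1 as above): summing the per-CGU estimates double-counts the shared inline items, while the merged CGU deduplicates them.

    use std::collections::HashSet;

    fn main() {
        let cgu1: HashSet<&str> = HashSet::from(["F1", "I1", "I2", "I3"]);
        let cgu2: HashSet<&str> = HashSet::from(["F2", "I1", "I2", "I3"]);

        // Summing the two estimates double-counts I1, I2 and I3.
        assert_eq!(cgu1.len() + cgu2.len(), 8);

        // The merged CGU contains each item only once.
        assert_eq!(cgu1.union(&cgu2).count(), 5);
    }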

@michaelwoerister
Member

@nnethercote, what's your take on these performance numbers?

There are some regressions (like regex-debug and encoding-debug) that might be real and not just noise. On the other hand, this optimization makes a lot of sense.

@andjo403
Contributor Author

I do not understand how there can be duplicates after the merge? There is only one map with all the mono items. It was to handle the non-duplicated functions that I was calling the estimate function after each merge.

@michaelwoerister
Member

@andjo403 You are right, there cannot be duplicates after merging. What I was writing about is that the pre-merge logic does not account for duplicates.

I'm thinking of something like the following example:

cgu1 = { F1, I1, I2, I3, I4 }
cgu2 = { F2, I1, I2, I3 }
cgu3 = { F3, I5, I6, I7 }

The logic in the PR would produce:

cgu1 = { F1, I1, I2, I3, I4 }
cgu2.3 = { F2, F3, I1, I2, I3, I5, I6, I7 }

while a smarter algorithm could produce:

cgu1.2 = { F1, F2, I1, I2, I3, I4 }
cgu3 = { F3, I5, I6, I7 }

I don't know if such examples are common in practice. It might be worth a try though.

Note that the algorithm currently in the compiler would still do worse than the one in this PR :)

@andjo403
Contributor Author

If we think that it is too much overhead to call estimate_size for the CGU, it is possible to add a counter that only sums the sizes of the mono items that are actually added to the merged CGU and then call modify_size_estimate with that, but in the logs I have looked at this takes less than 1 ms.

About the regressions: from what I can see, it is LLVM_module_optimize_module_passes that accounts for most of them. I do not understand how this split affects the debug builds, where no inlining is happening.
For opt builds, I think it is possible that the amount of inlining can affect how much work LLVM can do.

@andjo403
Contributor Author

I have been thinking about a smarter merge that takes the duplicated functions into account, but I have not been able to find an algorithm that is not brute force, and brute force would take too long.

@michaelwoerister
Member

I have been thinking about a smarter merge that takes the duplicated functions into account, but I have not been able to find an algorithm that is not brute force, and brute force would take too long.

Yes, the naive approach is probably at least O(n²) where n=number of CGUs. Sounds like a fun challenge :) But not necessarily for this PR.
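
For concreteness, a hypothetical sketch of that naive quadratic approach (plain HashSets of item names stand in for real CGUs, every item counts as size 1, and at least two CGUs are assumed): at each step, merge the pair whose deduplicated union is smallest.

    use std::collections::HashSet;

    // Repeatedly merge the pair of CGUs with the smallest union
    // (deduplicated item count) until only `target` CGUs remain.
    // Real code would sum per-item size estimates instead.
    fn merge_overlap_aware(
        mut cgus: Vec<HashSet<String>>,
        target: usize,
    ) -> Vec<HashSet<String>> {
        while cgus.len() > target {
            let mut best = (0, 1, usize::MAX);
            for i in 0..cgus.len() {
                for j in (i + 1)..cgus.len() {
                    let union = cgus[i].union(&cgus[j]).count();
                    if union < best.2 {
                        best = (i, j, union);
                    }
                }
            }
            let (i, j, _) = best;
            let absorbed = cgus.swap_remove(j); // j > i, so index i stays valid
            cgus[i].extend(absorbed);
        }
        cgus
    }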

@nnethercote
Contributor

@nnethercote, what's your take on these performance numbers?

Those numbers are very messy and inconclusive. Unfortunately we can't use instruction counts for this one, so we have to look at wall times, which have high variance. Based on just the numbers, I would say that the combination of improvements and regressions means it's not a clear win, so I would be conservative and recommend against landing it. But I don't know anything about the theory of the patch, and whether that is compelling.

It might be interesting to do another perf run and see how much the numbers change. That would at least give an idea of what is and is not noise.

@andjo403
Contributor Author

@michaelwoerister if we are to try another perf run, did you have some changes that you wanted first? I think it is also a good idea to rebase before the rerun, as the base can change the outcome as well.

@michaelwoerister
Member

Thanks for taking a look, @nnethercote! We'll do another perf-run.

@andjo403, yes, depending on how much you want to tinker:

  1. Add a comment to the CodegenUnit::size_estimate field, saying this field is initialized during the CGU partitioning pass and can always be expected to be Some after that.
  2. Make size estimation more efficient by locally caching the sizes of each mono-item and updating sizes during merging. You could have something like:
    struct SizeEstimator {
        cache: FxHashMap<MonoItem, usize>,
    }

    impl SizeEstimator {
        fn size_estimate(&mut self, mono_item: &MonoItem) -> usize { ... }
    }
    and then do this during merging:
    for (k, v) in smallest.items_mut().drain() {
        // Compute the size before `k` is moved into the other CGU's map.
        let item_size = estimator.size_estimate(&k);
        // Only grow the estimate when the item was not already present.
        if second_smallest.items_mut().insert(k, v).is_none() {
            second_smallest.modify_size_estimate(item_size);
        }
    }
    Or you could just make size estimation for CodegenUnit use the SizeEstimator. In any case, feel free to move the entire size estimation logic to partitioning.rs and just leave accessor methods on CodegenUnit.

consider the size of the inlined items for a more even partitioning
@andjo403
Contributor Author

@michaelwoerister rebased and made an attempt to address the comments

@michaelwoerister
Member

Thanks, @andjo403! Let's do the perf run.

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Oct 17, 2019

⌛ Trying commit a465d97 with merge f0ea016...

bors added a commit that referenced this pull request Oct 17, 2019
for a more even partitioning inline before merge

@bors
Contributor

bors commented Oct 17, 2019

☀️ Try build successful - checks-azure
Build commit: f0ea016 (f0ea016f70fa7e96f190c675f339961f950dab17)

@rust-timer
Collaborator

Queued f0ea016 with parent 7e49800, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit f0ea016, comparison URL.

@michaelwoerister
Member

Thanks for implementing size estimation caching, @andjo403! The new code is actually nicer than before.

From the second perf run it looks like webrender-opt clean and tokio-webpush-simple-opt clean consistently regress by 3-4%. style-servo-opt clean is similar but with only 1% wall-time regression. Looking at the detailed view, it looks like both cases spend more time in LLVM IR generation ("codegen_module") and LLVM itself. This leads me to believe that for those two crates the old scheme, by chance, produces less LLVM IR because fewer inline functions get duplicated.

I'm a bit conflicted about what to do with the PR. It seems like an intrinsically good idea but it does not seem to universally improve the situation. I'm still wondering if there is a (simple) way to make merging smarter with respect to inline function duplication. We wouldn't need perfect results.

@andjo403
Contributor Author

@michaelwoerister I cannot find a "simple" solution for the merge. I have been looking into some kind of MinHash, but that will only tell which sets of functions are similar, so it can be used to sort; the question then is still how to select what to merge, as we do not want to merge two large CGUs only because they are similar. I'm OK with closing the PR if this is not what we want, even if it is sad for me, who has many cores and cannot use them.
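
For reference, a minimal self-contained MinHash sketch (hypothetical; real code would hash rustc's MonoItems rather than strings). Signatures that agree in many slots indicate CGUs with similar item sets, which could be used to sort merge candidates:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Hash an item together with a per-round seed.
    fn seeded_hash(seed: u64, item: &str) -> u64 {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        h.finish()
    }

    // MinHash signature of a CGU's item set: one minimum per seed.
    fn signature(items: &[&str], rounds: u64) -> Vec<u64> {
        (0..rounds)
            .map(|seed| {
                items
                    .iter()
                    .map(|item| seeded_hash(seed, item))
                    .min()
                    .unwrap_or(u64::MAX)
            })
            .collect()
    }

    // The fraction of matching slots estimates the Jaccard similarity.
    fn similarity(a: &[u64], b: &[u64]) -> f64 {
        let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
        same as f64 / a.len() as f64
    }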

@michaelwoerister
Member

@andjo403 Would you be up for doing a local run of our benchmark suite? Then we would have a clearer picture of how performance looks at higher core counts. We could also maybe make the behavior depend on the actual core count of the machine (although that might turn out to be tricky, because the jobserver gives out tokens dynamically).
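
As a hypothetical sketch of such core-count-dependent behavior (std::thread::available_parallelism is a stand-in here; as noted, the real constraint is that the jobserver hands out tokens dynamically):

    use std::thread;

    // Pick a CGU count from the visible core count, never going below
    // the current default of 16 mentioned earlier in this thread.
    fn cgu_count_for_machine() -> usize {
        let cores = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        cores.max(16)
    }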

@andjo403
Contributor Author

[image: local benchmark results, top 12 largest changes]
style-servo and script-servo failed due to #65846. It seems like there is no point in this PR without the smarter merge function, but at least we know that there is potential for gains in this area.

Shall I close this PR?
Where can the discussion about how to do the merge continue?

@michaelwoerister
Member

Thanks for collecting that data, @andjo403!

I concur that the results are too mixed for making this change. Let's close this PR but let's also make sure to record the findings in a followup issue.

@andjo403
Contributor Author

@michaelwoerister I think I have changed my mind: maybe we should merge it anyway, as the current perf numbers depend on luck more than on some deliberate algorithm. I also think that the MIR inliner PR #68213 has the same problem.
Do you think it is possible to have some more people look at this?
Maybe the partitioning should get a working meeting or something.

@michaelwoerister
Member

@andjo403 It probably makes sense for someone to look into this. What we have at the moment is strictly a "better than nothing" solution.
I personally will not have time to do something here. The people around the MIR inlining work might be better equipped?

@tgnottingham
Contributor

@andjo403, with #86777 merged, would you like to reopen this and see how it performs? It's still going to come down to luck of the partition to a degree, at least until we make some other things smarter, but I think that this was a step in the right direction.
