clang is slower than gcc when compiling some code in chrome #31385
Comments
assigned to @RKSimon |
I've taken a look at what is killing us in APInt bit manipulation of multi-word (> 64-bit) values. The three patterns in particular are:

Bits |= APInt::getBitsSet(SizeInBits, Lo, Hi);
Bits |= Sub.zextOrTrunc(SizeInBits).shl(Lo);
APInt Sub = Bits.lshr(Lo).zextOrTrunc(SubSizeInBits);

with 'Bits' typically 128/256/512 bits in size and 'Sub' always 64 bits or less. With this in mind, APInt methods that would speed things up would be along the lines of the following (along with some questions on the exact behaviour we want):

// Set all bits to true from loBit (inclusive) to hiBit (exclusive).
// Insert bits with Sub starting from offset.
// Extract bits from loBit (inclusive) to hiBit (exclusive). |
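For reference, a minimal self-contained sketch of those three patterns (the wrapper function and its name are mine, not from the thread); each one allocates at least one temporary multi-word APInt along the way:

    #include "llvm/ADT/APInt.h"
    using llvm::APInt;

    // Assumes Bits.getBitWidth() == SizeInBits and Sub is <= 64 bits wide.
    void hotPatterns(APInt &Bits, const APInt &Sub, unsigned SizeInBits,
                     unsigned SubSizeInBits, unsigned Lo, unsigned Hi) {
      // 1. OR in a contiguous run of set bits: getBitsSet materializes a
      //    temporary multi-word APInt just to be consumed by operator|=.
      Bits |= APInt::getBitsSet(SizeInBits, Lo, Hi);

      // 2. Insert a small value at a bit offset: zextOrTrunc and shl each
      //    allocate another full-width temporary.
      Bits |= Sub.zextOrTrunc(SizeInBits).shl(Lo);

      // 3. Extract a small value from a bit offset: lshr copies the whole
      //    multi-word value before it is truncated back down.
      APInt Extracted = Bits.lshr(Lo).zextOrTrunc(SubSizeInBits);
      (void)Extracted;
    }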
+Simon and Chandler for combineX86ShufflesRecursively |
https://reviews.llvm.org/D30265 proposes APInt::setBits to try and avoid: Bits |= APInt::getBitsSet(SizeInBits, Lo, Hi); |
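A before/after sketch of what that buys (my illustration; only setBits itself is from the review):

    #include "llvm/ADT/APInt.h"
    using llvm::APInt;

    void setRange(APInt &Bits, unsigned Lo, unsigned Hi) {
      // Before: materializes a temporary multi-word APInt just to OR it in.
      //   Bits |= APInt::getBitsSet(Bits.getBitWidth(), Lo, Hi);

      // After D30265: flips the words in place, no temporary allocation.
      Bits.setBits(Lo, Hi); // sets [Lo, Hi): loBit inclusive, hiBit exclusive
    }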
I really appreciate the self-contained test case. :) Here's a proposal for a small fix to the IR optimizer (opt): For this test, that will just be noise until the backend is sped up, but there have been recent complaints (but no test cases that I know of) about InstCombine / InstSimplify compile-time in general, so hopefully that makes all optimized compiles a little faster. |
https://reviews.llvm.org/rL295898 |
Can we mark this as fixed? |
My changes should have close to no difference on this benchmark, so you can safely ignore those. Here's what I see running on 4GHz Haswell with a recent release build of clang:

$ ./clang -v
$ time ./clang -target x86_64-cros-linux-gnu -O2 row_common.cc -c -o /dev/null -march=corei7
real 0m6.254s
real 0m1.292s
real 0m1.138s

So I'd say there's still a lot of potential to improve the backend. We're still combining shuffles recursively for an SSE3 or SSE4 target for a very long time. |
I still want to get the APInt::insertBits implementation done, after which I need to start breaking down where the time is going in combineX86ShufflesRecursively. |
I've added a patch for APInt::insertBits - https://reviews.llvm.org/D30780 |
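A sketch of how the new helpers replace the old allocation-heavy patterns (the wrapper functions are mine for illustration; insertBits is from D30780, and APInt::extractBits covers the third pattern):

    #include "llvm/ADT/APInt.h"
    using llvm::APInt;

    // Insert: replaces Bits |= Sub.zextOrTrunc(W).shl(Lo) and its two
    // full-width temporaries with one in-place copy of Sub's words.
    void insertSub(APInt &Bits, const APInt &Sub, unsigned Lo) {
      Bits.insertBits(Sub, Lo); // writes Sub's bits into Bits at bit Lo
    }

    // Extract: the counterpart for APInt Sub = Bits.lshr(Lo).zextOrTrunc(N),
    // reading NumBits bits starting at Lo without copying the whole value.
    APInt extractSub(const APInt &Bits, unsigned NumBits, unsigned Lo) {
      return Bits.extractBits(NumBits, Lo);
    }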
combineX86ShufflesRecursively - I've avoided some APInt malloc calls in rL297267 |
Improved llvm::extractConstantMask in rL297381 |
It's getting better. I'm down to 5.05 secs, 23% faster than the corei7 run in comment 7. |
Committed the APInt::insertBits support at rL297458 |
Another good improvement on my build machine: 4.35 sec (16% faster than the last check). The base (not corei7) build is still running 1.25 sec, so there's still a lot of room for improvement. An Instruments profile on macOS says that the hottest path is 8-deep on combineX86ShufflesRecursively(). Can we limit the recursion and still catch most shuffle optimizations? combineShuffle() is ~80% of the total compile time. |
In case this helps narrow down the slow path - we only hit this with -mssse3 (3 's'), and it disappears with -mavx:

$ time ./clang ... -msse2
$ time ./clang ... -msse3
$ time ./clang ... -mssse3
$ time ./clang ... -msse4.1
$ time ./clang ... -msse4.2
$ time ./clang ... -mavx
$ time ./clang ... -mavx2 |
For this case at least, perf goes over the cliff at Depth > 7. Ie, with this patch:

Index: lib/Target/X86/X86ISelLowering.cpp
--- lib/Target/X86/X86ISelLowering.cpp (revision 297460)
...we're about 1.2 sec for the whole compile. There are a couple of regression test failures with that change, so we need to weigh the compile-time tradeoff? The depth could be scaled with optimization level? There's also a micro-optimization opportunity for the loop that starts with:

// Merge this shuffle operation's mask into our accumulated mask.

That loop has 4 or more integer div/rem ops. Might be possible to hoist and/or combine and/or avoid some of those? |
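To make the depth-cap idea concrete, here's a toy standalone sketch (the names, types, and placement of the threshold are invented; this is not the actual patch):

    #include <optional>
    #include <vector>

    using ShuffleMask = std::vector<int>;

    // Beyond this depth the search cost explodes while matches get rare;
    // Depth >= 8 corresponds to the "cliff" at Depth > 7 observed above.
    constexpr unsigned MaxShuffleCombineDepth = 8;

    std::optional<ShuffleMask> combineRecursively(const ShuffleMask &Mask,
                                                  unsigned Depth) {
      if (Depth >= MaxShuffleCombineDepth)
        return std::nullopt; // Give up; keep the existing shuffle chain.
      // Merge this node's mask into the accumulated mask, then recurse
      // into each operand with Depth + 1 (elided in this sketch).
      return Mask;
    }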
rL305284 avoids the unnecessary 'large' APInt creations during vector constant extraction on the test cases. |
https://reviews.llvm.org/rL305284 improved usage of APInt, so let's see where we stand today. First, a new baseline clang built from r305283. Again, the command-line template for the timing is:

$ time ./clang ... -msse2

And now apply r305284 and rebuild:

$ time ./clang ... -msse2 |
Because I still can't keep the marketing names and feature sets straight: the "corei7" in the problem description is a Nehalem, and that has SSE4.2. |
Trying to make sense of the data:
I think we need some guidance from the Chrome folks as to where we stand:
Regardless of that, I'll try to get some new profile data. |
The SSE2 regressions from the baseline can probably be attributed to us now trying to combine PINSRW(PEXTRW()) patterns, which row_common has quite a few of. Unfortunately they are all for vector transpose patterns that won't actually combine - hence going to full shuffle combine depth for nothing. Ideally for runtime performance these would become nested trees of unpacks, but I don't think we do much to match such patterns. |
As expected, a profile (on macOS) shows that 66% of the time for the SSE4.2 compile is now going to combineX86ShufflesRecursively(). As mentioned in comment 16, this code has a bunch of div/rem ops, and LLVM should do a better job of optimizing that (bug 31028), but we can hack this a bit: Getting rid of the div/rem makes the function ~14% faster, which makes this entire compile ~9% faster for the SSE4.2 config:

$ time ./clang ... -msse4.2

This was 2.5 sec with pipeline-clogging div/rem on Haswell. |
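The underlying trick, as a hedged standalone sketch (the function name and simplified merge arithmetic are mine, not the committed change): the scale factor between two x86 shuffle mask widths is always a power of two, so each div/rem pair can become a shift and a mask:

    #include <cassert>
    #include <vector>

    // Upscale a shuffle mask by a power-of-two Scale using shift/and
    // instead of div/rem. Sentinel (negative) entries omitted for brevity.
    std::vector<int> scaleMask(const std::vector<int> &Mask, unsigned Scale) {
      assert(Scale && (Scale & (Scale - 1)) == 0 && "Scale must be a power of two");
      const unsigned Shift = __builtin_ctz(Scale); // log2(Scale), GCC/Clang builtin
      const unsigned Rem = Scale - 1;
      std::vector<int> Scaled(Mask.size() * Scale);
      for (unsigned i = 0, e = Scaled.size(); i != e; ++i) {
        // Was: Mask[i / Scale] and i % Scale -- a div and a rem per element.
        Scaled[i] = Mask[i >> Shift] * (int)Scale + (int)(i & Rem);
      }
      return Scaled;
    }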
Updated timing after the micro-optimization hackery in:

$ time ./clang ... -msse2

Based on the profile, I think we could still improve this substantially by doing some kind of shuffle mask constant caching, but I'd definitely like to hear from someone what the perf goal is. |
Hi, this is the bug reporter. I filed the bug because these compile times looked like outliers that might indicate some low-hanging fruit (or bugs). Nothing is blocked on this as far as I know. |
OK, thanks for the info. I'm still not sure when we can declare victory. :) |
Updated timing on the same Haswell system as before using clang built from r323118. Performance change (negative numbers are bad) relative to the measurements in comment 23:

$ time ./clang ... -msse2

So except for 'ssse3' -- which was significantly worse than everything else last time -- we have regressed since r305414. |
...but Simon has a patch proposal to limit the shuffle recursion while maintaining existing codegen in all current regression tests (so an improvement over the naive limiting that I suggested in comment 16). With that patch applied, I show (perf relative to comment 26):

$ time ./clang ... -msse2

So that should eliminate x86 shuffle recursion overhead as a compile-time concern. |
|
rL323320 - resolving this. There might be further micro-optimizations that can be made but the main bulk of it is done now. |
Extended Description
This is about compile time, not code quality.
When building chrome, some objects take noticeably longer to compile with clang than with gcc. Here are the top 5:
diff (s), clang-4.0 time (s), gcc-4.9.2 time (s), object
17.752848, 58.020705, 40.267857, obj/third_party/sqlite/chromium_sqlite3/sqlite3.o
10.927715, 15.727189, 4.799474, obj/third_party/libyuv/libyuv/row_common.o
8.964672, 19.499460, 10.534788, obj/v8/v8_base/wasm-interpreter.o
7.865181, 11.146169, 3.280988, obj/third_party/libvpx/libvpx/variance.o
5.407040, 15.328771, 9.921731, obj/components/policy/cloud_policy_proto_generated_compile_proto/cloud_policy.pb.o
Although this is comparing apples to oranges, it would be nice to make clang run faster if there is some low-hanging fruit. One possibility is row_common.o, which is compiled with -march=corei7; removing that option makes it 4x faster.
$ time clang -target x86_64-cros-linux-gnu -O2 -c row_common.cc -march=corei7
real 0m6.920s
user 0m6.900s
sys 0m0.022s
$ time clang -target x86_64-cros-linux-gnu -O2 -c row_common.cc
real 0m1.565s
user 0m1.526s
sys 0m0.037s
Top 10 time-consuming functions:
Overhead Command Shared Object Symbol
........ ................ ................ .................................