
clang is slower than gcc when compiling some codes in chrome #31385

Closed
llvmbot opened this issue Feb 22, 2017 · 30 comments
Labels: backend:X86, bugzilla (Issues migrated from bugzilla), slow-compile

llvmbot (Member) commented Feb 22, 2017

Bugzilla Link: 32037
Resolution: FIXED
Resolved on: Jan 24, 2018 03:57
Version: trunk
OS: Linux
Blocks: #24719
Attachments: row_common.cc, preprocessed
Reporter: LLVM Bugzilla Contributor
CC: @chandlerc, @topperc, @zmodem, @hfinkel, @lalozano, @RKSimon, @rnk, @rotateright
Fixed by commit(s): 323320

Extended Description

This is about compile time, not code quality.

When building Chrome, some objects take noticeably longer to compile with clang than with gcc. Here are the top 5:

diff (s) / clang-4.0 (s) / gcc-4.9.2 (s) / object
17.752848 / 58.020705 / 40.267857 / obj/third_party/sqlite/chromium_sqlite3/sqlite3.o
10.927715 / 15.727189 /  4.799474 / obj/third_party/libyuv/libyuv/row_common.o
 8.964672 / 19.499460 / 10.534788 / obj/v8/v8_base/wasm-interpreter.o
 7.865181 / 11.146169 /  3.280988 / obj/third_party/libvpx/libvpx/variance.o
 5.407040 / 15.328771 /  9.921731 / obj/components/policy/cloud_policy_proto_generated_compile_proto/cloud_policy.pb.o

Although this is comparing apples to oranges, it would be nice to make clang run faster if there is some low-hanging fruit. One candidate is row_common.o, which is compiled with -march=corei7; removing that option makes the compile about 4x faster.

$ time clang -target x86_64-cros-linux-gnu -O2 -c row_common.cc -march=corei7
real 0m6.920s
user 0m6.900s
sys 0m0.022s

$ time clang -target x86_64-cros-linux-gnu -O2 -c row_common.cc
real 0m1.565s
user 0m1.526s
sys 0m0.037s

Top 10 time-consuming functions:

Overhead Command Shared Object Symbol

........ ................ ................ .................................

12.08%  clang-4.0         clang-4.0         [.] combineX86ShufflesRecursively                               
 8.81%  clang-4.0         libc-2.23.so      [.] malloc                                                      
 8.05%  clang-4.0         clang-4.0         [.] llvm::extractConstantMask                                   
 6.86%  clang-4.0         clang-4.0         [.] llvm::APInt::shlSlowCase                                    
 4.84%  clang-4.0         clang-4.0         [.] llvm::APInt::lshr                                           
 2.13%  clang-4.0         libstdc++.so.6.0  [.] operator new                                                
 1.92%  clang-4.0         clang-4.0         [.] peekThroughBitcasts                                         
 1.63%  clang-4.0         clang-4.0         [.] llvm::DecodePSHUFBMask                                      
 1.61%  clang-4.0         clang-4.0         [.] llvm::APInt::operator|=                                     
 1.39%  clang-4.0         clang-4.0         [.] llvm::APInt::initSlowCase
llvmbot (Member, Author) commented Feb 22, 2017

assigned to @RKSimon

RKSimon (Collaborator) commented Feb 22, 2017

I've taken a look; what is killing us is APInt bit manipulation of multi-word (> 64-bit) values.

The 3 patterns in particular are:

Bits |= APInt::getBitsSet(SizeInBits, Lo, Hi);

Bits |= Sub.zextOrTrunc(SizeInBits).shl(Lo);

APInt Sub = Bits.lshr(Lo).zextOrTrunc(SubSizeInBits);

with 'Bits' typically 128/256/512 bits in size and 'Sub' always 64 bits or less.

With this in mind, APInt methods that would speed things up would be along the lines of (along with some questions on the exact behaviour we want):

// Set all bits to true from loBit (inclusive) to hiBit (exclusive).
// Wraparound behaviour is identical to APInt::getBitsSet.
void APInt::setBits(unsigned loBit, unsigned hiBit);

// Insert bits with Sub starting from offset.
// ?? Wraparound behaviour
// ?? Replace existing bits or OR with them?
void APInt::insertBits(APInt Sub, unsigned offset);

// Extract bits from loBit (inclusive) to hiBit (exclusive).
// ?? Wraparound behaviour
APInt APInt::extractBits(unsigned loBit, unsigned hiBit);
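
For concreteness, here is a minimal usage sketch (an illustration, not the actual patches) of how the three hot patterns above read once such helpers exist. The signatures follow what eventually landed in llvm/ADT/APInt.h, which differs slightly from the proposal text: extractBits takes a width and a start bit, and insertBits overwrites the destination bits rather than ORing into them.

#include "llvm/ADT/APInt.h"
using llvm::APInt;

void demo(APInt &Bits, const APInt &Sub, unsigned SubSizeInBits,
          unsigned Lo, unsigned Hi) {
  // Was: Bits |= APInt::getBitsSet(SizeInBits, Lo, Hi);
  Bits.setBits(Lo, Hi);

  // Was: Bits |= Sub.zextOrTrunc(SizeInBits).shl(Lo);
  // Equivalent only when the destination bits are already zero, since
  // insertBits replaces rather than ORs.
  Bits.insertBits(Sub, Lo);

  // Was: APInt Sub = Bits.lshr(Lo).zextOrTrunc(SubSizeInBits);
  APInt Extracted = Bits.extractBits(SubSizeInBits, Lo);
  (void)Extracted;
}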

zmodem (Collaborator) commented Feb 22, 2017

+Simon and Chandler for combineX86ShufflesRecursively

RKSimon (Collaborator) commented Feb 22, 2017

> With this in mind, APInt methods that would speed things up would be along
> the lines of (along with some questions on the exact behaviour we want):
>
> // Set all bits to true from loBit (inclusive) to hiBit (exclusive).
> // Wraparound behaviour is identical to APInt::getBitsSet.
> void APInt::setBits(unsigned loBit, unsigned hiBit);

https://reviews.llvm.org/D30265 proposes APInt::setBits to try and avoid:

Bits |= APInt::getBitsSet(SizeInBits, Lo, Hi);

rotateright (Contributor) commented:

I really appreciate the self-contained test case. :)

Here's a proposal for a small fix to the IR optimizer (opt):
https://reviews.llvm.org/D30270

For this test, that will just be noise until the backend is sped up, but there have been recent complaints (though no test cases that I know of) about InstCombine / InstSimplify compile time in general, so hopefully it makes all optimized compiles a little faster.

rotateright (Contributor) commented:

> I really appreciate the self-contained test case. :)
>
> Here's a proposal for a small fix to the IR optimizer (opt):
> https://reviews.llvm.org/D30270

https://reviews.llvm.org/rL295898
https://reviews.llvm.org/rL296129

rnk (Collaborator) commented Mar 1, 2017

Can we mark fixed?

rotateright (Contributor) commented:

My changes should make close to no difference on this benchmark, so you can safely ignore those.

Here's what I see running on 4GHz Haswell with a recent release build of clang:

$ ./clang -v
clang version 5.0.0 (trunk 296695)
Target: x86_64-apple-darwin16.4.0

$ time ./clang -target x86_64-cros-linux-gnu -O2 row_common.cc -c -o /dev/null -march=corei7

real 0m6.254s
user 0m6.215s
sys 0m0.016s
bin $ time ./clang -target x86_64-cros-linux-gnu -O2 row_common.cc -c -o /dev/null

real 0m1.292s
user 0m1.263s
sys 0m0.014s
bin $ time ./clang -target x86_64-cros-linux-gnu -O2 row_common.cc -c -o /dev/null -mavx

real 0m1.138s
user 0m1.110s
sys 0m0.012s


So I'd say there's still a lot of potential to improve the backend. We're still combining shuffles recursively for an SSE3 or SSE4 target for a very long time.

RKSimon (Collaborator) commented Mar 2, 2017

I still want to get the APInt::insertBits implementation done, after which I need to start breaking down where the time is going in combineX86ShufflesRecursively.

RKSimon (Collaborator) commented Mar 9, 2017

I've added a patch for APInt::insertBits - https://reviews.llvm.org/D30780

RKSimon (Collaborator) commented Mar 9, 2017

combineX86ShufflesRecursively - I've avoided some APInt malloc calls in rL297267

RKSimon (Collaborator) commented Mar 9, 2017

Improved llvm::extractConstantMask in rL297381

rotateright (Contributor) commented:

> combineX86ShufflesRecursively - I've avoided some APInt malloc calls in
> rL297267

It's getting better: I'm down to 5.05 sec, 23% faster than the corei7 run in comment 7.

RKSimon (Collaborator) commented Mar 10, 2017

Committed the APInt::insertBits support at rL297458

rotateright (Contributor) commented:

> Committed the APInt::insertBits support at rL297458

Another good improvement on my build machine: 4.35 sec (16% faster than the last check).

The base (not corei7) build still runs in 1.25 sec, so there's still a lot of room for improvement.

An Instruments profile on macOS says that the hottest path is 8-deep on combineX86ShufflesRecursively(). Can we limit the recursion and still catch most shuffle optimizations? combineShuffle() is ~80% of the total compile time.

rotateright (Contributor) commented:

In case this helps narrow down the slow path: the slowdown first appears with -mssse3 (three 's'), persists through SSE4.x, and disappears with -mavx:

$ time ./clang ... -msse2
user 0m1.254s

$ time ./clang ... -msse3
user 0m1.257s

$ time ./clang ... -mssse3
user 0m4.258s

$ time ./clang ... -msse4.1
user 0m4.354s

$ time ./clang ... -msse4.2
user 0m4.374s

$ time ./clang ... -mavx
user 0m1.030s

$ time ./clang ... -mavx2
user 0m1.370s

rotateright (Contributor) commented:

For this case at least, perf goes over the cliff at Depth > 7. I.e., with this patch:

Index: lib/Target/X86/X86ISelLowering.cpp
--- lib/Target/X86/X86ISelLowering.cpp (revision 297460)
+++ lib/Target/X86/X86ISelLowering.cpp (working copy)
@@ -27573,7 +27573,7 @@
                                     const X86Subtarget &Subtarget) {
   // Bound the depth of our recursive combine because this is ultimately
   // quadratic in nature.
-  if (Depth > 8)
+  if (Depth > 6)

...we're at about 1.2 sec for the whole compile. There are a couple of regression test failures with that change, so we'd need to weigh the compile-time tradeoff. Perhaps the depth could be scaled with the optimization level?

There's also a micro-optimization opportunity for the loop that starts with:

// Merge this shuffle operation's mask into our accumulated mask.

That loop has 4 or more integer div/rem ops. It might be possible to hoist, combine, or avoid some of those; a rough sketch of the power-of-two case follows.
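
For illustration only (this is not the actual X86ISelLowering.cpp loop, and the names are placeholders): when the ratio between the two mask widths is known to be a power of two, the per-iteration div/rem can be replaced by a shift and a bitmask computed once outside the loop.

#include <cassert>
#include <vector>

// Sketch of merging a narrow mask into a wider accumulated mask without
// per-element division; the merge formula itself is a stand-in.
void mergeMasks(std::vector<int> &RootMask, const std::vector<int> &OpMask) {
  assert(!OpMask.empty() && RootMask.size() % OpMask.size() == 0);
  unsigned Ratio = RootMask.size() / OpMask.size();
  assert((Ratio & (Ratio - 1)) == 0 && "expected a power-of-two ratio");
  unsigned Shift = __builtin_ctz(Ratio); // log2(Ratio), hoisted out of the loop (GCC/Clang builtin)
  unsigned LaneMask = Ratio - 1;

  for (unsigned I = 0, E = RootMask.size(); I != E; ++I) {
    // Was: unsigned OpIdx = I / Ratio; unsigned Lane = I % Ratio;
    unsigned OpIdx = I >> Shift;
    unsigned Lane = I & LaneMask;
    RootMask[I] = OpMask[OpIdx] * int(Ratio) + int(Lane); // placeholder merge step
  }
}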

RKSimon (Collaborator) commented Jun 13, 2017

rL305284 avoids the unnecessary 'large' APInt creations during vector constant extraction on the test cases.

rotateright (Contributor) commented:

https://reviews.llvm.org/rL305284 improved usage of APInt, so let's see where we stand today.

First, a new baseline clang built from r305283. Again, the command-line template for the timing is:
$ time ./clang -target x86_64-cros-linux-gnu -O2 row_common.cc -c -o /dev/null -mXXXX

$ time ./clang ... -msse2
user 0m3.062s <--- big regression from r296695 / comment 15
$ time ./clang ... -msse3
user 0m3.045s <--- big regression from r296695 / comment 15
$ time ./clang ... -mssse3
user 0m6.019s <--- big regression from r296695 / comment 15
$ time ./clang ... -msse4.1
user 0m4.203s <--- about the same as r296695 / comment 15
$ time ./clang ... -msse4.2
user 0m4.195s <--- about the same as r296695 / comment 15
$ time ./clang ... -mavx
user 0m0.809s <--- improvement from r296695 / comment 15
$ time ./clang ... -mavx2
user 0m1.234s <--- improvement from r296695 / comment 15

And now apply r305284 and rebuild:

$ time ./clang ... -msse2
user 0m2.207s <--- big win from r305283, but still slow?
$ time ./clang ... -msse3
user 0m2.199s <--- big win from r305283, but still slow?
$ time ./clang ... -mssse3
user 0m3.619s <--- big win from r305283, but still slow?
$ time ./clang ... -msse4.1
user 0m2.508s <--- big win from r305283
$ time ./clang ... -msse4.2
user 0m2.516s <--- big win from r305283
$ time ./clang ... -mavx
user 0m0.778s
$ time ./clang ... -mavx2
user 0m1.089s

rotateright (Contributor) commented:

Because I still can't keep the marketing names and feature sets straight: the "corei7" in the problem description is a Nehalem, and that has SSE4.2.

rotateright (Contributor) commented:

Trying to make sense of the data:

  1. The corei7 / SSE4.2 case started at 6.2sec. After several improvements through r305283, it got down to 4.2sec. r305284 got it down to 2.5sec. So we're 2.5x faster than where we started for this config.

  2. The default SSE2 path has regressed to be 1.7x slower than it used to be.

  3. With the SSE4.2 win and SSE2 regression, the SSE4.2 compile is now 1.14x slower than the default SSE2.

  4. The problem description said the SSE4.2 path was 4.4x slower than the default SSE2 path. On my machine (Haswell 4GHz and r296695), it was 3.5x slower. Either way, the difference should be much smaller now.

  5. The best case AVX compile is now ~3x faster than SSE2 / SSE4.2.


I think we need some guidance from the Chrome folks as to where we stand:

  1. Is perf for this particular case acceptable now?
  2. How does clang compare to gcc at this point?

Regardless of that, I'll try to get some new profile data.

RKSimon (Collaborator) commented Jun 13, 2017

The SSE2 regressions from the baseline can probably be attributed to us now trying to combine PINSRW(PEXTRW()) patterns, which row_common has quite a few of.

Unfortunately they are all for vector transpose patterns that won't actually combine, so we recurse to the full shuffle combine depth for nothing. Ideally for runtime performance these would become nested trees of unpacks, but I don't think we do much to match such patterns.
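
For reference, a "nested tree of unpacks" for a 4x4 transpose of 32-bit lanes looks roughly like the following SSE2 intrinsics sketch (illustrative only; not taken from row_common or any LLVM patch):

#include <emmintrin.h>

// Transpose a 4x4 matrix of 32-bit elements held in four XMM registers.
static inline void transpose4x4_epi32(__m128i &r0, __m128i &r1,
                                      __m128i &r2, __m128i &r3) {
  // Level 1: interleave 32-bit lanes of adjacent rows.
  __m128i t0 = _mm_unpacklo_epi32(r0, r1); // a0 b0 a1 b1
  __m128i t1 = _mm_unpackhi_epi32(r0, r1); // a2 b2 a3 b3
  __m128i t2 = _mm_unpacklo_epi32(r2, r3); // c0 d0 c1 d1
  __m128i t3 = _mm_unpackhi_epi32(r2, r3); // c2 d2 c3 d3
  // Level 2: interleave 64-bit halves to complete the transpose.
  r0 = _mm_unpacklo_epi64(t0, t2);         // a0 b0 c0 d0
  r1 = _mm_unpackhi_epi64(t0, t2);         // a1 b1 c1 d1
  r2 = _mm_unpacklo_epi64(t1, t3);         // a2 b2 c2 d2
  r3 = _mm_unpackhi_epi64(t1, t3);         // a3 b3 c3 d3
}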

rotateright (Contributor) commented:

As expected, a profile (on macOS) shows that 66% of the time for the SSE4.2 compile is now going to combineX86ShufflesRecursively().

As mentioned in comment 16, this code has a bunch of div/rem ops, and LLVM should do a better job of optimizing that (bug 31028), but we can hack this a bit:
https://reviews.llvm.org/D34174

Getting rid of the div/rem makes the function ~14% faster, which makes this entire compile ~9% faster for the SSE4.2 config:

$ time ./clang ... -msse4.2
user 0m2.304s

This was 2.5 sec with pipeline-clogging div/rem on Haswell.

rotateright (Contributor) commented:

Updated timing after the micro-optimization hackery in:
https://reviews.llvm.org/rL305398
https://reviews.llvm.org/rL305414

$ time ./clang ... -msse2
user 0m2.024s <--- about 10% faster
$ time ./clang ... -msse3
user 0m2.047s <--- about 10% faster
$ time ./clang ... -mssse3
user 0m3.191s <--- about 10% faster
$ time ./clang ... -msse4.1
user 0m2.261s <--- about 10% faster
$ time ./clang ... -msse4.2
user 0m2.256s <--- about 10% faster
$ time ./clang ... -mavx
user 0m0.782s
$ time ./clang ... -mavx2
user 0m1.055s


Based on the profile, I think we could still improve this substantially by doing some kind of shuffle mask constant caching, but I'd definitely like to hear from someone what the perf goal is.
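
One possible shape for such a cache, sketched with standard containers rather than LLVM's (the key type and the invalidation strategy are open questions, so treat this purely as an illustration of the memoization idea):

#include <unordered_map>
#include <vector>

using DecodedMask = std::vector<int>;

// Memoize decoded shuffle-mask constants so repeated combine attempts on
// the same constant don't pay the decode cost each time.
class ShuffleMaskCache {
  std::unordered_map<const void *, DecodedMask> Cache;

public:
  template <typename DecodeFn>
  const DecodedMask &getOrDecode(const void *Constant, DecodeFn Decode) {
    auto It = Cache.find(Constant);
    if (It == Cache.end())
      It = Cache.emplace(Constant, Decode(Constant)).first;
    return It->second;
  }
};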

llvmbot (Member, Author) commented Jun 14, 2017

Hi, this is the bug reporter. The bug was reported because these objects looked like outliers that might indicate some low-hanging fruit (or bugs). Nothing is blocked on this as far as I know.

rotateright (Contributor) commented:

> Hi, this is the bug reporter. The bug was reported because these objects
> looked like outliers that might indicate some low-hanging fruit (or bugs).
> Nothing is blocked on this as far as I know.

OK, thanks for the info. I'm still not sure when we can declare victory. :)
We've made some improvements, but I think there's more that could be done.
But if nobody actually cares because this is acceptable and/or this isn't very common...

rotateright (Contributor) commented:

Updated timing on the same Haswell system as before using clang built from r323118. Performance change (negative numbers are bad) relative to the measurements in comment 23:

$ time ./clang ... -msse2
user 0m2.353s <--- -14%
$ time ./clang ... -msse3
user 0m2.344s <--- -13%
$ time ./clang ... -mssse3
user 0m2.605s <--- +22%
$ time ./clang ... -msse4.1
user 0m2.832s <--- -20%
$ time ./clang ... -msse4.2
user 0m2.837s <--- -20%
$ time ./clang ... -mavx
user 0m1.039s <--- -25%
$ time ./clang ... -mavx2
user 0m1.357s <--- -22%


So except for 'ssse3', which was significantly worse than everything else last time, we have regressed since r305414.

rotateright (Contributor) commented:

...but Simon has a patch proposal to limit the shuffle recursion while maintaining existing codegen in all current regression tests (so an improvement over the naive limiting that I suggested in comment 16).

With that patch applied, I show (perf relative to comment 26):

$ time ./clang ... -msse2
user 0m1.080s <--- +118%
$ time ./clang ... -msse3
user 0m1.076s <--- +118%
$ time ./clang ... -mssse3
user 0m1.052s <--- +148%
$ time ./clang ... -msse4.1
user 0m1.129s <--- +151%
$ time ./clang ... -msse4.2
user 0m1.139s <--- +149%
$ time ./clang ... -mavx
user 0m1.017s <--- +2%
$ time ./clang ... -mavx2
user 0m1.295s <--- +5%


So that should eliminate x86 shuffle recursion overhead as a compile-time concern.

RKSimon (Collaborator) commented Jan 22, 2018

> ...but Simon has a patch proposal to limit the shuffle recursion while
> maintaining existing codegen in all current regression tests (so an
> improvement over the naive limiting that I suggested in comment 16).

https://reviews.llvm.org/D42378

RKSimon (Collaborator) commented Jan 24, 2018

rL323320 - resolving this. There might be further micro-optimizations to be made, but the bulk of the work is done now.

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
This issue was closed.