
Add support for unified builds to drastically speed up compilation time #177

Closed · wants to merge 3 commits

Conversation

@sethfowler (Contributor) commented Dec 10, 2016

I'm just going to give a high level description in this PR because Chris and I have talked about this IRL and there is pretty good documentation in the code.

These patches modify p4c's build system to support unified compilation. bootstrap.sh now runs gen-unified-makefile.py, which walks over the source tree, reads the existing automake files, and generates a new automake file which is included in Makefile.am. The new automake file has rules that generate unified .cpp files as part of the build process and hook up the dependencies correctly. Please take a look at the comments at the top of gen-unified-makefile.py for documentation.

To make this work, some changes are needed to the existing automake files. The high level idea is that gen-unified-makefile.py takes care of generating foo_SOURCES variables. Instead of adding files to foo_SOURCES, automake files should add them to either foo_UNIFIED or foo_NONUNIFIED. Files in foo_UNIFIED are subject to unified compilation; files in foo_NONUNIFIED will be compiled independently, as before. Yacc files, for example, have to go in foo_NONUNIFIED. gen-unified-makefile.py collects all of the foo_UNIFIED and foo_NONUNIFIED variables it encounters and generates foo_SOURCES definitions which are fed into automake to produce the final makefile. Again, there's more documentation in the tool. There's a fair amount of error checking to make it hard to mess things up.
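
To make the convention concrete, a hypothetical Makefile.am fragment might look like this (the file names are invented for illustration; gen-unified-makefile.py emits the matching frontend_SOURCES, so that variable is never written by hand):

# Files subject to unified compilation.
frontend_UNIFIED = \
    typeChecker.cpp \
    constantFolding.cpp \
    inlining.cpp

# Files that must be compiled independently (e.g. Yacc output).
frontend_NONUNIFIED = \
    p4parser.ypp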

The generated .cpp files are split into chunks, each containing up to N files. N is a tunable parameter; there's a tradeoff between single-threaded compilation time (for which you want a very large N) and better parallelism / incremental builds (for which you want a smaller N). If N gets too small, you start to lose the benefits of unified compilation, so balancing these factors is tricky and is best done experimentally. The default value I've chosen of '10' is a good compromise based on the experiments I've conducted; over time we can re-tune if necessary.
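
For example, with the default N of 10, a module with 23 files in foo_UNIFIED would compile as three generated translation units, shaped roughly like this (file and chunk names are invented for illustration):

// foo-unified-1.cpp (generated): the first chunk of up to 10 files
#include "pass01.cpp"
#include "pass02.cpp"
// ... through pass10.cpp; foo-unified-2.cpp covers the next 10 files,
// and foo-unified-3.cpp the remaining 3.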

Because unified builds merge multiple independent source files into a single translation unit, you can run into name clashes when several files use the same name for a static declaration or a declaration in an anonymous namespace, since those declarations are normally confined to a single translation unit. p4c only hit this issue in one case: parserUnroll.cpp has an anonymous namespace that contains some declarations that clash with other declarations elsewhere in the project. I've fixed this by converting the anonymous namespace into a named namespace.
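
To illustrate the failure mode and the fix (with invented names; these are not the actual p4c declarations):

// Before: anonymous namespaces from different .cpp files merge into one
// translation unit under unified compilation, so two files that both
// declare 'struct VisitedStates' this way trigger a redefinition error.
namespace { struct VisitedStates { /* ... */ }; }

// After: a named namespace keeps the declarations distinct regardless of
// which files land in the same unified chunk.
namespace ParserUnroll { struct VisitedStates { /* ... */ }; }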

One other aside: there's no benefit to including the headers in foo_SOURCES. Since automake infers dependencies automatically, headers were never needed there for dependency reasons; listing them is recommended only so that commands like make dist include them in a distribution tarball. Taking this into account, I removed the headers when I replaced foo_SOURCES with foo_UNIFIED, because including them would just mean filtering them out immediately in gen-unified-makefile.py to avoid producing spurious dependencies in the generated automake file. I could do that filtering (automake, after all, does so when building), but I think it's better not to include the headers in the first place, so as not to mislead people into thinking that listing them serves any function. This also brings us in line with most other build systems I've used.

So after my efforts to improve compilation performance this week, where does this leave us? Going through my compilation performance patch stack:

Original state, -j8:               142.03 real    951.23 user    26.34 sys
-O0 patch, -j8:                    113.92 real    787.10 user    26.57 sys
Irgenerator patch, -j8:            108.66 real    773.05 user    26.30 sys
Unified compilation patch, -j8:     35.11 real    203.78 user     7.83 sys

So we go from 142 to 35 seconds. Pretty good!

I think we can actually substantially improve on this when we have time to circle around and look at compilation performance again. Much of the remaining overhead seems to come from template instantiations, and some judicious code rearrangement is likely to eliminate a lot of that cost. Even right now it's very easy, by setting the chunk size very large in bootstrap.sh, to compile the entire project in 30 seconds. Merging every file in the project into a single unified .cpp file got me closer to 20 seconds, so consider that a limit study of sorts.

If you want to test locally, remember that you need to destroy your build directory and rerun bootstrap.sh.

@sethfowler sethfowler self-assigned this Dec 10, 2016
@sethfowler sethfowler requested a review from ChrisDodd December 10, 2016 03:13
@antoninbas (Member) commented Dec 10, 2016

Not trying to block this pull request per se, as I don't contribute to this repo, but are you all sure a unity build is desirable?
This guy makes a couple of good points: https://engineering-game-dev.com/2009/12/15/the-evils-of-unity-builds/. For example, I see that you may have already run into the anonymous namespace issue... Basically, I believe that name conflicts for local-scope declarations are a non-negligible issue.
Did you consider turning on pre-compiled headers on demand or using ccache instead?

@sethfowler (Contributor, Author) commented Dec 10, 2016

Thanks for the feedback, Antonin. I read through the blog article you linked, and while there are some insights there, I think that as a commentary on unified compilation it's a bit of a straw man. The criticisms are focused on a naive implementation of unified compilation. If you simply combine every .cpp file in the entire project into a single translation unit, he's correct that you'll encounter a lot of downsides. However, that's not what is generally done in practice, and that's not what the code in this PR does.

One of the primary differences between the system that the blog article discussed and this PR is that this PR only combines source files which are in the same module. What "module" means is determined by the automake rules - i.e., foo_SOURCES constitutes a single module, distinct from bar_SOURCES. In p4c that pretty much maps to directories, though it doesn't have to. This means that the issue of namespace clashes arising from using declarations, static declarations, and the like is restricted to a small, manageable scope. My experience has been that this rarely arises as an issue in practice, and given C++'s One Definition Rule, any issue that did arise would be caught at compile time.

As far as the tradeoffs with parallelism and incremental builds, there is more detailed discussion in the original comment, but the short version is that individual modules are automatically split into several smaller translation units if they become too large. This is another difference between what this PR does and the approach discussed in the blog post. As mentioned in the original comment, the threshold for "too large" is a value that has to be tuned, but making a change is as easy as changing one number in bootstrap.sh. I ran some experiments prior to submitting the PR and concluded that a maximum of 10 source files per unified compilation translation unit seems to offer a good tradeoff, at least as a starting point.

Regarding precompiled headers, in some ways you can think of them as behaving like unified compilation, but for headers rather than source files. You run into some of the same issues; for example, they can massively worsen incremental build performance if you tune them wrong, because they cause source files to depend on headers that they don't actually include. They're also a bit wonky and compiler specific. In practice my experience has been that you get the same or better benefits with unified compilation with a lot less manual tuning and pain. YMMV, of course.

You also mentioned ccache, but I think that's best thought of as being orthogonal to this PR. ccache is useful, but it can only help in specific situations. If you change a header file that's included in several source files, ccache offers zero help. Unified compilation, by contrast, will often speed things up drastically. The two have different strengths, and using both is what I'd recommend.

Mozilla uses unified compilation heavily and at scale, and my experience has been that there are few downsides in practice, and the productivity gains offered by faster builds are massive. As the numbers in the original comment show, I saw a 68% reduction in build times with this PR applied. An improvement of that magnitude in the edit-compile-test cycle for every member of the team is nothing to sniff at, and I think that should be kept in mind when weighing the arguments here. One reason I wanted to get this done immediately, before I started writing a lot of p4c code, is because I felt this would have more long term impact than almost any other single improvement I could make.

One thing I agree with in that blog article is that there is no substitute for fixing issues in the source code itself. Changes like improving the code's modularity and splitting headers up so that files include no more than what they need are really important. They can make a night-and-day difference in compilation times. That's something that we should definitely keep in mind as we continue to work on p4c.

for file in file_set:
# We can't handle expansions ('$FOO or $(foo)') in 'foo_UNIFIED',
# because we're not actually evaluating the makefile, so report an error
# if we find anyway.
@sethfowler (Contributor, Author)

s/anyway/any. I'll fix that before merging.

@antoninbas (Member)

I guess we'll have to agree to disagree. Changing the language semantics by merging multiple translation units is a big deal for me. Whether these files are in the same automake "module" is irrelevant to me. I do agree that productivity gain is important though :)
For automated p4c regressions, I would recommend setting the gen-unified-makefile.py N argument to something large (ideally large enough to include all the files of a module in a single chunk). This way you will guarantee that the unity build is correct for every N potentially chosen by BF developers / customers / open-source users (i.e. no naming conflicts in the unified translation units, no unexpected behavior because of preprocessor macros). Maybe the special value N == 0 could be used to have a single chunk per module.

@ChrisDodd (Contributor) left a comment

My main concern here is that, as written, this will break the LOG code, which depends on a single static _file_log_level variable per source file that holds the logging detail level for that source file. This could easily be fixed (as I note below).

Also required are corresponding changes to Makefile.am in the p4c-extension-tofino repo.

I'm also a little concerned that this will slow down incremental builds of one or two source files (the common case during development) in order to speed up the rarer full rebuilds. And while faster builds are nice, the test suite takes longer than a rebuild, so this attacks only a small part of the downtime. Overall I expect the time savings to be a wash.

# generates on stdout:
# #include "foo.cpp"
# #include "bar.cpp"
# #include "baz.cpp"
@ChrisDodd (Contributor)

In order not to break the per-source-file logging code, we could instead generate this as:

#define LOG_CONTROL_VAR _file_log_level_foo
static int _file_log_level_foo __attribute__((unused)) = -1;
#include "foo.cpp"
#undef LOG_CONTROL_VAR

for each source file, and change lib/log.h slightly to make use of LOG_CONTROL_VAR.
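
For concreteness, the lib/log.h side might look something like this (a sketch only, not the actual p4c code; the real macros are more elaborate):

// lib/log.h (sketch)
#ifndef LOG_CONTROL_VAR
#define LOG_CONTROL_VAR _file_log_level  // non-unified builds keep the old name
#endif
// LOGGING expands at each point of use, so every #included source file
// sees whichever LOG_CONTROL_VAR its generated wrapper defined.
#define LOGGING(n) (LOG_CONTROL_VAR >= (n))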

@sethfowler (Contributor, Author)

As we discussed IRL, this is being handled in PR #180, which we'll land before this one.

@sethfowler (Contributor, Author)

Typically if you're touching header files, and not just source files, I'd expect the time savings to be quite significant. It's certainly something that needs to be tuned though. In the limit, N=1 will reproduce the current situation, but as Antonin suggested we can use a very large N for builds on the continuous integration infrastructure, which would all be full builds, so we'll definitely get a win there at a minimum.

That's a good point about the log level. I'll get that fixed.

@sethfowler (Contributor, Author)

I totally agree about the test suite, by the way. I'll investigate improving the test suite speed as well.

@sethfowler (Contributor, Author)

OK, I've added a new commit that should fix the build with p4c-extension-tofino. The problem is that p4c-extension-tofino adds new source files to ir_SOURCES, which gen-unified-makefile.py forbids since ir_SOURCES is now automatically generated.

The new commit adds a new variable to Makefile.am, extension_frontend_SOURCES, which extensions can use to add new source files to libfrontend.a. (We could just use libfrontend_a_SOURCES, but that seemed a bit icky to me.) This new variable is then used in p4c-extension-tofino to add its new IR classes, instead of adding those classes directly to ir_SOURCES.
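
Usage would look roughly like this (a hypothetical fragment; the actual paths and assignment style live in p4c-extension-tofino):

# In the extension's automake fragment: contribute sources to
# libfrontend.a without touching the auto-generated ir_SOURCES.
extension_frontend_SOURCES += \
    tofino/ir/tofino-ir.cpp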

That's all that's needed to fix the build. After seeing how this is being used, I decided that ir/ir-generated.cpp would be better off living outside of ir_SOURCES, especially since it's already built via rules located in the top-level Makefile.am, so I've removed it from ir_NONUNIFIED and placed it directly in libfrontend_a_SOURCES. That change is also made in the new commit.

I think we're good to go now. We'll have to merge the corresponding PR in the other repo at the same time as this one.

@sethfowler (Contributor, Author)

FYI: I've written the necessary patches to make p4c-extension-tofino build with unified compilation as well. (It's really easy once this PR is in place.) Here are some numbers for building p4c-extension-tofino and p4c together, as you'd do during a normal development workflow:

(current master)  time make -j8:   201.33 real   1426.61 user   47.08 sys
(unified)         time make -j8:    55.97 real    344.69 user   13.21 sys

As you can see, that's an improvement of 72% in terms of wall clock time. Nice.
