
Ccache support #44

Merged · 4 commits merged into capnproto:master from the ccache-support branch on Apr 30, 2021
Conversation

@vlovich (Contributor) commented Apr 23, 2021

There are a few things that were needed here.

  • EKAM_BYPASS_DIRS - a colon-separated list of directories that Ekam treats as passthrough. Needed for the ccache directory (see the sketch after this list).
  • Support for treating /var/run/<UID> as a temporary directory. This could be worked around with ccache environment variables, but since it's a systemd convention, I figured it makes more sense as part of Ekam.
  • Add a wrapper for stat64. Not sure why this wasn't causing problems before, but ccache was tripping up on it.
  • Strip off compiler wrappers when linking (otherwise the executable path is nonsense).
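
As a rough illustration of the first item, here is a minimal C++ sketch of parsing a colon-separated `EKAM_BYPASS_DIRS` value and checking whether a path falls under one of the bypass directories. The function names are hypothetical, and this is not the code from the PR (the actual change lives in Ekam's C interception layer):

```cpp
// Hypothetical sketch: parse EKAM_BYPASS_DIRS and decide whether a given
// path should skip Ekam's remapping.
#include <cstdlib>
#include <string>
#include <vector>

static std::vector<std::string> parseBypassDirs() {
  std::vector<std::string> dirs;
  const char* env = std::getenv("EKAM_BYPASS_DIRS");
  if (env == nullptr) return dirs;
  std::string value = env;
  size_t start = 0;
  while (start <= value.size()) {
    size_t end = value.find(':', start);
    if (end == std::string::npos) end = value.size();
    std::string dir = value.substr(start, end - start);
    if (!dir.empty()) {
      if (dir.back() != '/') dir += '/';   // match on whole directory prefixes
      dirs.push_back(dir);
    }
    start = end + 1;
  }
  return dirs;
}

static bool shouldBypass(const std::string& path,
                         const std::vector<std::string>& bypassDirs) {
  for (const std::string& dir : bypassDirs) {
    if (path.compare(0, dir.size(), dir) == 0) return true;  // prefix match
  }
  return false;
}
```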

@vlovich (Contributor, Author) commented Apr 23, 2021

This cuts down true incremental build times, but only slightly on Workers (~2m 46s => ~2m 25s). I noticed the following stats (zeroed before the build, but with a fully populated cache):

ccache -s
cache directory                     /home/vlovich/.cache/ccache
primary config                      /home/vlovich/.config/ccache/ccache.conf
secondary config (readonly)         /etc/ccache.conf
stats updated                       Thu Apr 22 18:45:54 2021
stats zeroed                        Thu Apr 22 18:44:05 2021
cache hit (direct)                   278
cache hit (preprocessed)               2
cache miss                             0
cache hit rate                    100.00 %
preprocessor error                   131
unsupported code directive             1
could not write to output file       120
cleanups performed                     0
files in cache                       559
cache size                          94.2 MB
max cache size                      20.0 GB

I think the preprocessor errors are perhaps from Ekam invoking the compiler in an interim state? If so, that could be a large part of the cost, since compiler invocations can be expensive when the compiler is actually evaluating things. It's certainly in the right ballpark as an explanation (we have 121 executables vs. 131 preprocessor errors, so maybe some executables take multiple invocations to resolve?).

The "could not write to output file" count might indicate some remaining bug within Ekam that's preventing ccache from materializing the cached file in the output location. I didn't see any errors in the ccache log, and the Ekam log wasn't helpful either (it's a lot of log content, though, so it's possible I missed something).

Overall this does seem to indicate that Ekam itself may be adding a noticeable amount of overhead to do its magic "no build files". If that's the case, the benefit of caching (and distributed compilation) disappears pretty quickly. Or it could be a mix of magic discovery and interposition (they're too intertwined to make a reasonable guess).

@vlovich (Contributor, Author) commented Apr 23, 2021

Actually, thinking about this more, the preprocessor errors wouldn't be linkage errors, and we don't even call ccache for linking. I might need to investigate more thoroughly via more ccache debug logs to see what the preprocessor errors are. Are there scenarios where Ekam ends up calling the compiler in a way that can fail and requires calling it another time?

@kentonv (Member) commented Apr 23, 2021

Does this actually work with Ekam's execution model?

Consider that Ekam frequently speculatively executes compiler commands and kills them early if it decides that some other action has changed what the result would be. For example, Ekam might try to compile foo.c++, which includes foo.capnp.h, which doesn't exist yet, producing an error. But then a separate capnp rule might generate foo.capnp.h. Ekam knows that this will change the result of the foo.c++ compile so it re-runs it -- and if it was already running, it kills the already-running compile command. Or, more commonly, if the developer modifies a file while a build is running, Ekam might kill all the compile commands that are currently running which transitively depend on the modified file.
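
(To make the behavior described above concrete, here is a minimal C++ sketch of the kill-and-re-run idea. The type and function names are hypothetical, not Ekam's actual classes.)

```cpp
// Illustrative pseudologic only; hypothetical names, not Ekam's real classes.
#include <map>
#include <set>
#include <string>

struct Action {
  std::set<std::string> tagsQueried;   // tags this action searched for last run
  bool running = false;
  void kill()    { running = false; }  // placeholder: would signal the child process
  void enqueue() { /* placeholder: would push the action onto the run queue */ }
};

// Called when a new tag (e.g. the generated header foo.capnp.h) appears.
void onTagProvided(const std::string& tag,
                   std::map<std::string, std::set<Action*>>& waiters) {
  for (Action* action : waiters[tag]) {
    // The answer previously given to this action for `tag` ("not found")
    // has changed, so its previous result is stale: kill it if it's still
    // running and schedule a fresh run.
    if (action->running) action->kill();
    action->enqueue();
  }
}
```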

Is ccache designed in such a way that being randomly killed won't result in cache corruption?

@vlovich (Contributor, Author) commented Apr 23, 2021

I think that's fine in terms of cache integrity. ccache being terminated unexpectedly isn't generally the source of its corruption issues (skimming its NEWS file, those have more to do with not playing nice with the compiler or with NFS). It also checksums the cache, so even if something did corrupt, in theory it shouldn't be a problem. Given how robust and widely used ccache is, I'm not concerned about this aspect.

I'm curious under what timing scenario the compiler would have encountered a missing file yet still be running when the missing file is generated. The compiler should exit pretty quickly on any such error.

What you've described does explain why there's no benefit from ccache: all the speculative compilation is actually expensive. We could probably speed up Ekam significantly by doing something as simple as prioritizing some of the rules (e.g. run any source code generation rules first, since they should drastically reduce the number of erroneous compiles).

@kentonv (Member) commented Apr 23, 2021

I'm curious under what timing scenario the compiler would have encountered a missing file yet still be running when the missing file is generated. The compiler should exit pretty quickly on any such error.

Yes, it's probably rare for compile steps, but if the code gen happened to run at the same time as the compile that needed it, I think it could happen.

The more common scenario is linking: as soon as Ekam sees an object file with a main symbol, it'll start trying to link it, repeatedly failing because some symbols are still missing. Meanwhile, the missing symbols keep showing up as concurrent compiles complete, causing Ekam to cancel and re-run the link repeatedly...

@vlovich (Contributor, Author) commented Apr 23, 2021

Ekam doesn't wait for the entire set of missing symbols for the current linkage to be resolved before trying to link?

On the topic of this PR itself, this commit is harmless & fixes some issues within Ekam (e.g. stat64 wasn't wrapped). I think I'll try to tackle having Ekam do less work at some point, as it seems like it's costing us ~2 minutes 20 seconds out of ~2 minutes 50 seconds of the overall build, which seems a bit high (plus it'll also make the ccache/distcc support more meaningful).

@kentonv (Member) commented Apr 26, 2021

Ekam doesn't wait for the entire set of missing symbols for the current linkage to be resolved before trying to link?

No. The Ekam engine only knows: "When I last ran this action, it looked for all of these tags, and I gave this set of answers. If any of my answers change, I need to re-run the action." The rule implementation doesn't have any way to tell the engine "Don't bother running me again unless all of these tags are found." Note that searching for a tag that doesn't exist isn't necessarily an error -- for example, the compiler commonly searches the include path for headers. For any system header that you include, the compiler will first look for the header in the local repo, resulting in Ekam saying "not found", after which it will look in the system include directory. So if we blocked re-running of actions with "missing dependencies", pretty much all compile actions would always be blocked from re-running.

It might be reasonable to de-prioritize actions that have known missing dependencies, as long as they do still run once there's no other actions left to run.
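
(A minimal sketch of that de-prioritization idea, assuming a hypothetical two-queue scheduler; nothing like this is implemented in this PR.)

```cpp
// Hypothetical scheduling tweak: run actions with known-missing tags only
// once nothing else is runnable, while still guaranteeing they eventually run.
#include <deque>
#include <string>

struct PendingAction {
  std::string name;
  bool hadMissingTags;  // last run searched for tags that were never found
};

class Scheduler {
public:
  void enqueue(const PendingAction& a) {
    (a.hadMissingTags ? deferred_ : ready_).push_back(a);
  }

  // Returns true and fills `out` if any action is runnable.
  bool next(PendingAction& out) {
    std::deque<PendingAction>& q = ready_.empty() ? deferred_ : ready_;
    if (q.empty()) return false;
    out = q.front();
    q.pop_front();
    return true;
  }

private:
  std::deque<PendingAction> ready_;
  std::deque<PendingAction> deferred_;  // de-prioritized, still eventually run
};
```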

@kentonv (Member) left a comment

On the topic of this PR itself, this commit is harmless

Unfortunately, this isn't true -- this PR contains several bugs. If it isn't adding much real value, it might make sense to shelve it rather than iterate on it.

it seems like it's costing us ~2 minutes 20 seconds out of ~2 minutes 50 seconds of the overall build

I think that's an exaggeration. Remember that the build does a lot of things other than compiling -- e.g. it links a lot of binaries and runs a lot of tests. ccache isn't eliminating those.

Review comments were left on src/ekam/CppActionFactory.cpp and src/ekam/rules/intercept.c (since resolved).
@vlovich (Contributor, Author) commented Apr 27, 2021

I think that's an exaggeration. Remember that the build does a lot of things other than compiling -- e.g. it links a lot of binaries and runs a lot of tests. ccache isn't eliminating those.

True, but given that the UI and htop seem to indicate a non-trivial amount of time is spent compiling, you would expect ccache to show a non-trivial speedup. I didn't mean to state it as conclusive evidence. It could be that ccache support is still incomplete in some way that keeps it from realizing the gains you might expect. It's hard to say for sure, but I think reducing Ekam overbuild is part of the strategy to get clarity on that (otherwise it's too hard to spot issues in the ccache logs).

Unfortunately, this isn't true -- this PR contains several bugs. If it isn't adding much real value, it might make sense to shelve it rather than iterate on it.

Did you find any other issues beyond the strdup/free bug? Based on my experience looking into this, it's not any one PR that will resolve this, and maintaining an out-of-tree set of PRs will be a nightmare. Provided a change is a no-op when building without ccache, my preference would be to merge it in to allow forward progress on figuring out what's happening with ccache (if I do work to reduce Ekam overbuild, it's hard to verify the impact on ccache without having ccache support, and if that support is out-of-tree, juggling it becomes really annoying to manage).

@vlovich force-pushed the ccache-support branch 2 times, most recently from 948e621 to 9571f7b on April 27, 2021 17:23
vlovich added a commit to vlovich/ekam that referenced this pull request Apr 29, 2021
Distcc/ccache/sccache can wrap the compiler command line and speed up the build. Setting `CXX_WRAPPER` to the path to that binary lets Ekam invoke it correctly when compiling and doesn't break linking.

See discussion in capnproto#44 for more context.
vlovich added 4 commits April 29, 2021 07:45

This environment variable is a colon-separated list of paths, each with a trailing '/', for which remapping should be skipped. This is useful for things like ccache, which tries to access `~/.ccache` to write the cache result. Since there's no actual guarantee that ccache is writing to ~/.ccache (it could be overridden by a setting or a runtime environment variable), it's up to the end user to orchestrate this correctly.

systemd recommends using /var/run/X/ (where X is your UID) for temporary directories for security reasons, and ccache makes use of it if available. Add this to the list of temporary paths that are allow-listed to bypass remapping.

Not sure why we hadn't hit this elsewhere, but ccache does resolve stat to stat64 when building.
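
(For readers unfamiliar with the interception mechanism, here is a hedged C++ sketch of the dlsym(RTLD_NEXT, ...) forwarding pattern a stat64 wrapper relies on. The real wrapper in intercept.c is C, exports the `stat64` symbol itself, and remaps paths through Ekam's own logic; `remapPath` below is purely a placeholder so the sketch stands alone.)

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT and the stat64 definitions on glibc
#endif
#include <dlfcn.h>
#include <sys/stat.h>
#include <string>

// Placeholder for Ekam's path remapping; the real logic rewrites paths into
// Ekam's virtual filesystem layout.
static std::string remapPath(const char* path) { return path; }

// Forward to the next stat64 in link order, after remapping the path.
// (The real interposer defines the `stat64` symbol directly; this uses a
// different name only so the sketch compiles standalone.)
static int wrappedStat64(const char* path, struct stat64* buf) {
  using Stat64Fn = int (*)(const char*, struct stat64*);
  static Stat64Fn realStat64 =
      reinterpret_cast<Stat64Fn>(dlsym(RTLD_NEXT, "stat64"));
  std::string mapped = remapPath(path);
  return realStat64(mapped.c_str(), buf);
}
```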

Distcc/ccache/sccache can wrap the compiler command line and speed up the build. Setting `CXX_WRAPPER` to the path to that binary lets Ekam invoke it correctly when compiling and doesn't break linking.

See discussion in capnproto#44 for more context.
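
(A hypothetical illustration of the wrapper-stripping idea, not the actual CppActionFactory change: if the compile command starts with the binary named by `CXX_WRAPPER`, drop it before deriving the link command, so the real compiler driver and the derived executable path stay sensible. `stripWrapper` is an assumed name.)

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// If the configured command begins with the CXX_WRAPPER binary, remove the
// wrapper element and return the rest of the command unchanged.
std::vector<std::string> stripWrapper(std::vector<std::string> command) {
  const char* wrapper = std::getenv("CXX_WRAPPER");
  if (wrapper != nullptr && !command.empty() && command.front() == wrapper) {
    command.erase(command.begin());  // e.g. drop "ccache", keep "clang++ ..."
  }
  return command;
}
```
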
@kentonv merged commit e450aaa into capnproto:master on Apr 30, 2021