automatic recompilation of stale cache files #12458

stevengj · 2015-08-04T20:56:21Z

This fixes #12259 by automating the recompilation of stale cache files whenever you require a module (e.g. by using) and a stale cache file is found. Currently, staleness is judged by timestamps; other mechanisms like checksums are left to future PRs.

For example, here is a typical session after I had done Base.compile(:PyPlot) and then updated a file in Compat:

julia> using PyPlot
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/Compat.ji for module Compat.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/LaTeXStrings.ji for module LaTeXStrings.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/PyPlot.ji for module PyPlot.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/PyCall.ji for module PyCall.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/Color.ji for module Color.
INFO: Recompiling stale cache file /Users/stevenj/.julia/lib/v0.4/FixedPointNumbers.ji for module FixedPointNumbers.

This patch supersedes #12445.

Like #12445, it adds a list of the dependency files to the .ji file, with a new include_dependency(path) function to manually supply non-include dependencies for a module.
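A minimal sketch of how `include_dependency` might be used, assuming a module that reads a data file when it loads. The helper name `load_words` and the temporary file are hypothetical and only for illustration; the call has no effect unless a cache file is being generated.

```julia
# Hypothetical helper: read a data file and register it as a dependency,
# so that editing the file would mark the module's cache file as stale.
function load_words(path::AbstractString)
    include_dependency(path)   # no effect outside of cache generation
    return readlines(path)
end

# Stand-in for a data file shipped next to the module source.
path, io = mktemp()
write(io, "alpha\nbeta\n")
close(io)

words = load_words(path)
println(words)
```

In a real package the call would normally sit at the top level of the module, next to the code that reads the file.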

(Also as discussed in #12445, I changed the serialization metadata to use bigendian rather than littleendian storage, for ease of reading back in via ntoh.)
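To illustrate the byte-order point, here is a small sketch of writing a length-prefixed path in big-endian form and reading it back with `ntoh`. The `Int32` length prefix and the names `write_path` and `read_path` are assumptions made for this example, not the actual .ji layout.

```julia
# Write a string preceded by its byte length, stored big-endian.
function write_path(io::IO, path::AbstractString)
    write(io, hton(Int32(ncodeunits(path))))  # host order -> big-endian
    write(io, path)
    return nothing
end

# Read a string written by write_path.
function read_path(io::IO)
    n = ntoh(read(io, Int32))                 # big-endian -> host order
    return String(read(io, Int(n)))
end

buf = IOBuffer()
write_path(buf, "abc")
bytes = take!(buf)

println(bytes[1:4])                 # the length prefix as stored
println(read_path(IOBuffer(bytes)))
```

Because the stored form is fixed, the same bytes read back correctly on both little-endian and big-endian hosts.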

Once #12448 lands, it will be easy to add other information about the dependencies (e.g. checksums) if that is judged necessary in the future, but I think it is better to develop that incrementally. (Merging a hash/checksum algorithm is a nontrivial task that is best left to a separate PR.)

Note that you still have to manually Base.compile a module at least once—this PR only automates the recompilation of the image. A separate PR can implement a @cacheable tag (or whatever) to allow modules to opt-in for automatic initial compilation.

jakebolewski · 2015-08-04T21:04:39Z

Looks great. Could we restrict the verbose info output to interactive sessions only?

stevengj · 2015-08-04T21:15:33Z

@jakebolewski, sure I guess that seems reasonable.

However, I think that may end up suppressing some of the info output even in "interactive" sessions, because some of the compiles are triggered within Base.compile, which executes in a separate non-interactive process. (e.g. the compilation of FixedPointNumbers above is triggered by the compilation of Color.) ~~I take that back, all the info calls should be from the main process.~~ Yep, this happens e.g. whenever recompilation is triggered by an updated file in the module, because then it calls Base.compile before requiring any other modules.

...Okay, what I can do is to have info output only if isinteractive() || 0 != ccall(:jl_generating_output, Cint, ()), which enables it during Base.compile sessions too.
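A sketch of that condition wrapped in a helper. The names `should_report` and `report_recompile` are hypothetical, and the message text simply mirrors the session output shown earlier in this thread.

```julia
# True in interactive sessions and while Julia is generating output
# (for example while a cache file is being compiled).
should_report() = isinteractive() || 0 != ccall(:jl_generating_output, Cint, ())

# Print the recompilation notice only when should_report() holds.
# Returns whether anything was printed.
function report_recompile(io::IO, cachefile::AbstractString, mod::Symbol)
    should_report() || return false
    println(io, "INFO: Recompiling stale cache file $cachefile for module $mod.")
    return true
end

printed = report_recompile(stdout, "Compat.ji", :Compat)
```

Run as a plain script, this prints nothing; in the REPL or during cache generation it prints the notice.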

johnmyleswhite · 2015-08-04T21:17:49Z

So glad this looks to be ready for 0.4. Thank you, @stevengj

KristofferC · 2015-08-04T21:29:07Z

This is game changing!

timholy · 2015-08-04T23:58:27Z

Yes, this really puts the icing on the cake. 0.4 is awesome.

vtjnash · 2015-08-05T03:49:42Z

AppVeyor got confused (assigned two jobs the same number). AFAICT the build cannot be restarted from here, so you'll have to force-push a new SHA1 hash if you want to trigger a new build.

(also Travis is having issues with their Mac cluster, so those are backed up right now too)

stevengj · 2015-08-05T04:44:45Z

@vtjnash, I already pushed a couple of patches since the initial AppVeyor failure, and the same failure happened again.

vtjnash · 2015-08-05T04:48:18Z

I had to manually increment the AppVeyor "next build" number to un-stick it. Builds should be working again now, however.

ViralBShah · 2015-08-05T08:44:27Z

We have all green now!

timholy · 2015-08-05T17:47:40Z

LGTM

stevengj · 2015-08-05T17:54:32Z

I'll merge by the end of the day if there are no objections.

tkelman · 2015-08-05T22:19:58Z

src/dump.c

+// are include depenencies
+void jl_serialize_dependency_list(ios_t *s)
+{
+     size_t total_size = 0;


indentation again

stevengj · 2015-08-05T23:59:46Z

(rebased to squash commits, merge with #12448, fix indentation)

automatic recompilation of stale cache files

timholy · 2015-08-06T03:04:32Z

Hooray!!

tkelman · 2015-08-06T06:04:17Z

base/docs/helpdb.jl

+used via `include`.   It has no effect outside of compilation.
+```
+"""
+include_dependency


Newly added functions do not appear to get correctly populated by doc/genstdlib.jl. I think you need to add an opening signature for this somewhere in the RST docs for it to work?

I don't really understand the new doc system for the manual. @one-more-minute, can you clarify what we are supposed to do? I'd really rather just put the documentation inline in loading.jl.

Putting the docs inline should work fine. As with the rest of the docstrings, genstdlib.jl will look for .. function:: include_dependency(...) and splice the doc string it finds there, so if that doesn't exist already it needs to be added (or there wouldn't be any way to know where the doc string should go).

When I have more free time I'm going to put up a bunch of examples of how to work with this, which will hopefully make it clearer.

tkelman · 2015-08-11T05:41:49Z

I'm already seeing timestamps cause issues here: I edit packages and the corresponding .ji files do not get recompiled correctly. As I mentioned elsewhere, I work over scp a lot, editing files on remote machines, and the timestamps are not always preserved correctly. I can potentially cope by finding the right settings to tweak on every editor and ftp client multiplied by every machine I use, but this is not a very robust solution.

StefanKarpinski · 2015-08-11T06:13:28Z

I have to wonder if there's some hybrid hashing/timestamp solution to this.

tkelman · 2015-08-11T08:00:13Z

What would that look like? Check time stamp first, then check hash if equal? Doesn't seem like that would be all that beneficial over checking just the hashes.

Maybe opt in to "I work across multiple machines therefore need this to be more careful" somehow? Is there some way to make this user-configurable? I'm trying to override the recompile_stale function to a pessimistic always-recompile version (just as a start, for the sake of testing) via Base.recompile_stale(mod, cachefile) = Base.create_expr_cache(Base.find_in_path(string(mod)), cachefile) but that doesn't seem to do the job.

vchuravy · 2015-08-11T08:07:05Z

base/loading.jl

+        end
+        modules, files = cache_dependencies(io)
+        for f in files
+            if mtime(f) > cachefile_mtime


Maybe instead of checking whether mtime(f) > cachefile_mtime this should be mtime(f) != cachefile_mtime to ensure that the cache is in sync with the file content. Otherwise if you change machines and one of your clocks is skewed a file change might happen "in the past".

cache_dependencies returns all of the included files, and compiling a .ji doesn't modify the mtime for each of those included files, so I suspect this would result in recompiling every time.

I see what you mean. Somehow I took it that we store all those mtimes and if one of them changes we should recompile. We are clearly not doing that, sorry for the noise.

With #12559, now we do store all the mtimes.
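The three variants discussed in this review thread can be compared on plain numbers. This is only a sketch with made-up mtimes and hypothetical function names, not the code in base/loading.jl.

```julia
# (1) The check in this PR: stale if any dependency is newer than the cache.
stale_if_newer(dep_mtimes, cache_mtime) = any(t -> t > cache_mtime, dep_mtimes)

# (2) Comparing each dependency against the cache file's own mtime with !=
#     is true for practically every file, so it would recompile every time.
stale_if_differs_from_cache(dep_mtimes, cache_mtime) =
    any(t -> t != cache_mtime, dep_mtimes)

# (3) Store one mtime per dependency and compare each file with its own
#     recorded value. A missing file yields NaN, which never compares equal.
stale_if_changed(recorded::Dict{String,Float64}, current::Dict{String,Float64}) =
    any(get(current, f, NaN) != t for (f, t) in recorded)

cache_mtime = 1000.0
recorded = Dict("a.jl" => 900.0, "b.jl" => 950.0)

# Nothing has changed since the cache was written.
unchanged = copy(recorded)
println(stale_if_newer(values(unchanged), cache_mtime))
println(stale_if_differs_from_cache(values(unchanged), cache_mtime))
println(stale_if_changed(recorded, unchanged))

# b.jl is replaced by an older copy (or edited on a machine with a skewed clock).
replaced = Dict("a.jl" => 900.0, "b.jl" => 800.0)
println(stale_if_newer(values(replaced), cache_mtime))
println(stale_if_changed(recorded, replaced))
```

Only the per-file comparison both avoids needless recompilation and notices the replaced file.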

pao · 2015-08-11T12:41:20Z

What would that look like? Check time stamp first, then check hash if equal? Doesn't seem like that would be all that beneficial over checking just the hashes.

"env.Decider('MD5-timestamp'): as of SCons 0.98, you can set the Decider function on an environment. MD5-timestamp says if the timestamp matches, don't bother re-MD5ing the file. This can give huge speedups. See the man page for info." (quoting https://bitbucket.org/scons/scons/wiki/GoFastButton)
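A rough Julia sketch of the same idea: trust a matching timestamp, and hash only when the timestamp differs. SHA-1 from the SHA standard library stands in for MD5 here, and `DepRecord`, `record`, and `is_stale` are hypothetical names.

```julia
using SHA

# What would be stored per dependency at compile time.
struct DepRecord
    mtime::Float64
    digest::Vector{UInt8}
end

record(path) = DepRecord(mtime(path), sha1(read(path)))

function is_stale(rec::DepRecord, path)
    mtime(path) == rec.mtime && return false   # timestamp matches: skip hashing
    return sha1(read(path)) != rec.digest      # otherwise let the content decide
end

path, io = mktemp()
write(io, "f(x) = x + 1\n")
close(io)
rec = record(path)

# Simulate a file whose timestamp changed but whose content did not
# (e.g. copied over scp), and one whose content really changed.
touched = DepRecord(rec.mtime - 10.0, rec.digest)
edited  = DepRecord(rec.mtime - 10.0, sha1("old contents"))

println(is_stale(rec, path))
println(is_stale(touched, path))
println(is_stale(edited, path))
```

The hash is computed only on the slow path, so an unchanged file costs one `mtime` call.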

stevengj · 2015-08-11T12:42:54Z

The discussion in #12259 seemed to indicate that a checksum would not be too expensive. If you want one, the steps would be:

1. Pick a checksum algorithm and merge a decent implementation of it into base.
2. Modify this dependency code to store the checksum in the .ji file and to check it when determining staleness.
   - To ensure that the file is not edited between including the file and computing the checksum, you might want to modify include file to compute the checksum as it goes along. This would be most convenient with a C checksum implementation.

I would check the timestamp before checking the checksum, mainly because it gives a little extra protection against the unlikely event of checksum collisions yielding false negatives. (I'm assuming we would use a checksum optimized for speed here, like CRC32, rather than a cryptographically secure checksum.) As @pao says, we can also store the timestamp in the file and check whether it matches, not just whether it is > mtime, as an optimization (although this only helps in the case where the file is stale, which is slow anyway).

stevengj · 2015-08-11T12:45:28Z

We could store the ~~checksums~~ timestamps in the .ji file and then check whether they match as @pao implied, rather than are > mtime. That would (a) protect against most clock-skew problems, (b) protect against the case where you replace a module file with an older file, which also requires recompilation, and (c) be super-easy and quick to implement.

pao · 2015-08-11T12:49:52Z

CRC32 is great for bit error detection, but I'm not sure about possibly large intentional changes (this site notes that crc32("plumless") == crc32("buckeroo")). I wouldn't go further than git on this though and figure SHA1 is the upper bound on complexity.

On a reasonably large scons-built project, I've never run into an MD5 hash collision causing a miscompile.
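The claimed collision is easy to check with a few lines of Julia. Below is a minimal bitwise sketch of the standard reflected CRC-32 (polynomial 0xEDB88320). The name `crc32_ieee` is made up for this example, and a production version would be table-driven.

```julia
# Bit-at-a-time CRC-32 (reflected, initial value and final XOR 0xFFFFFFFF).
function crc32_ieee(data::AbstractVector{UInt8})
    crc = 0xffffffff
    for byte in data
        crc ⊻= UInt32(byte)
        for _ in 1:8
            # Shift right; fold in the polynomial when the low bit was set.
            crc = (crc & 0x00000001) != 0 ? (crc >> 1) ⊻ 0xedb88320 : crc >> 1
        end
    end
    return crc ⊻ 0xffffffff
end

crc32_ieee(s::AbstractString) = crc32_ieee(codeunits(s))

# Compare the two words quoted above.
println(string(crc32_ieee("plumless"), base = 16))
println(string(crc32_ieee("buckeroo"), base = 16))
println(crc32_ieee("plumless") == crc32_ieee("buckeroo"))
```

With only 32 bits of output, collisions between unrelated inputs are expected to turn up far more readily than with a 160-bit hash.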

vchuravy · 2015-08-11T13:02:48Z

Instead of adding a hashing algorithm to base we could also use git_odb_hashfile [1] from libgit2

https://libgit2.github.com/libgit2/#HEAD/group/odb/git_odb_hashfile

ScottPJones · 2015-08-11T13:10:42Z

Remember, this is going to be done every time a module is loaded.
What worked very well, in a very high volume environment, in my past, was:

1) check length
2) check timestamp ==
3) check major/minor versions
4) check CRC-32
This isn't meant to prevent intentional changes, it is meant to keep an optimization from giving incorrect results. Doing MD5 or SHA starts cutting into the benefit of the optimization you were trying to get in the first place.
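A sketch of such a cascade, cheapest check first. It assumes the "version" is that of the runtime that built the cache, and it uses `crc32c` from the CRC32c standard library, which is the Castagnoli variant rather than classic CRC-32. `CacheStamp`, `stamp`, and `is_fresh` are hypothetical names.

```julia
using CRC32c

# What would be recorded for a source file when the cache is built.
struct CacheStamp
    nbytes::Int
    mtime::Float64
    version::VersionNumber
    crc::UInt32
end

stamp(path, version) =
    CacheStamp(filesize(path), mtime(path), version, crc32c(read(path)))

# Every check must pass; any failure rejects the cache early.
function is_fresh(s::CacheStamp, path, version::VersionNumber)
    isfile(path) || return false
    filesize(path) == s.nbytes || return false                      # 1) length
    mtime(path) == s.mtime || return false                          # 2) timestamp ==
    (version.major, version.minor) ==
        (s.version.major, s.version.minor) || return false          # 3) major/minor
    return crc32c(read(path)) == s.crc                              # 4) checksum
end

path, io = mktemp()
write(io, "f(x) = x + 1\n")
close(io)
s = stamp(path, v"0.4.0")

fresh_same_minor = is_fresh(s, path, v"0.4.2")
fresh_new_minor  = is_fresh(s, path, v"0.5.0")

write(path, "f(x) = x + 100\n")   # the edit changes the file length
fresh_after_edit = is_fresh(s, path, v"0.4.2")

println((fresh_same_minor, fresh_new_minor, fresh_after_edit))
```

The checksum is read only when the three cheaper checks have already passed.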

stevengj · 2015-08-11T14:08:07Z

@vchuravy, I agree that if we want a cryptographically secure hash, we should just use SHA1 from libgit2. @pao, of course with a cryptographically secure hash, a collision is astronomically unlikely.

tkelman · 2015-08-11T20:38:34Z

The need for a hash is now significantly lower, I don't think it's immediately pressing.

I'm having trouble getting git_odb_hashfile or the equivalent git hash-object to give results that are consistent with sha1sum. And since part of this would need to be accessible from C, I'm not sure if requiring libjulia to be linked against libgit2 is something we absolutely want. Nettle.jl has SHA1 implementations that work well, but better to avoid bringing in a new dependency library of that size. There's a BSD-licensed C implementation of SHA1 here https://github.com/dottedmag/libsha1/blob/master/sha1.c or a public-domain one here https://github.com/minix3/minix/blob/master/common/lib/libc/hash/sha1/sha1.c that are both pretty short.

StefanKarpinski · 2015-08-11T21:06:40Z

Does git maybe hash the file including some implied header or something?
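For reference, git hashes a blob as the header `blob <size>\0` followed by the content, which is why the result differs from sha1sum of the bare file. A small sketch using the SHA standard library; the function names are hypothetical.

```julia
using SHA

# SHA-1 of the raw bytes, as sha1sum would report it.
plain_sha1(data::AbstractVector{UInt8}) = bytes2hex(sha1(Vector{UInt8}(data)))

# SHA-1 as git computes it for a blob: header "blob <size>\0" plus content.
function git_blob_sha1(data::AbstractVector{UInt8})
    header = Vector{UInt8}("blob $(length(data))\0")
    return bytes2hex(sha1(vcat(header, data)))
end

empty_file = UInt8[]
println(plain_sha1(empty_file))
println(git_blob_sha1(empty_file))
```

For an empty file the two digests are the familiar `da39a3ee...` (plain SHA-1) and `e69de29b...` (the empty git blob).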

ScottPJones · 2015-08-11T21:38:03Z

@tkelman A CRC-32 in C is very fast, and not at all much code (Intel even has an assembly instruction for a CRC-32 step). It would be easy to add it to the julia core, and not link against anything else.
SHA1 is really over-engineering, and would just hurt the performance that the caching is trying to improve in the first place.

tkelman · 2015-08-11T21:48:49Z

We also already have murmurhash easily available from either C or Julia, worth benchmarking and thinking about hash size/collision likelihood tradeoffs there. Unless I find a way to hit any more trouble I don't think we need to worry about it for now. Will let you know if my programmatically generated .jl file ideas get anywhere.

ScottPJones · 2015-08-11T21:52:10Z

What are you doing with programmatically generating Julia code? Sounds interesting.
(very heavy use of code generation on a distributed system is why I've been a PITA about using a hash instead of / or with a timestamp).

tkelman · 2015-08-12T04:45:23Z

https://www.youtube.com/watch?v=-SBw6nVvJSo

ScottPJones · 2015-08-12T05:38:13Z

Murmurhash3 would be good too. I hope Julia has more success than the Brain in taking over the world!

stevengj added the compiler:precompilation Precompilation of modules label Aug 4, 2015

stevengj mentioned this pull request Aug 4, 2015

store/retrieve dependencies in .ji file #12445

Closed

3 tasks

stevengj mentioned this pull request Aug 5, 2015

static compile part 6: automated opt-in compilation #12462

Closed

stevengj force-pushed the ji_rebuild branch from 451ab02 to 2631a34 Compare August 5, 2015 04:52

ViralBShah added this to the 0.4.0 milestone Aug 5, 2015

stevengj mentioned this pull request Aug 5, 2015

add "#pragma compile [true]" for opt-in to automatic compilation #12475

Closed

automatic recompilation of stale cache files (fixes JuliaLang#12259)

6ed7412

stevengj force-pushed the ji_rebuild branch from 2631a34 to 6ed7412 Compare August 5, 2015 23:35

stevengj added a commit that referenced this pull request Aug 6, 2015

Merge pull request #12458 from stevengj/ji_rebuild

55cfad9

automatic recompilation of stale cache files

stevengj merged commit 55cfad9 into JuliaLang:master Aug 6, 2015

stevengj deleted the ji_rebuild branch August 6, 2015 02:58

tkelman reviewed Aug 6, 2015

stevengj mentioned this pull request Aug 6, 2015

__precompile__(isprecompilable) function for automated opt-in to module precompilation #12491

Merged

vchuravy reviewed Aug 11, 2015

stevengj mentioned this pull request Aug 11, 2015

cache dependency mtimes and check equality, not mtime(dep) <= mtime(cache) #12559

Merged

tkelman mentioned this pull request Aug 5, 2016

Precompile should check staleness using a hash instead of a timestamp #17845

Closed
