-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compilation: detect and regenerate missing cache artifacts #18792
Conversation
While it is not clear why artifacts are occasionally missing for which the manifest remains intact, regardless of how it happens, the cache should be at least somewhat resilient to kills, crashes, power loss, etc. affecting compilation processes anyway. Workaround ziglang#18763
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cache should be at least somewhat resilient to kills, crashes, power loss, etc.
It is resilient to these things already by writing the manifest file only after all the artifacts are created. This workaround is not desirable because it masks symptoms of a non understood phenomenon.
I don't want this workaround in the codebase. We need to figure out the true cause and address it.
This makes no sense to me. The OS is under no obligation to physically perform writes to disk in the same order as they occur in the program. Its only obligation is to atomically update the filesystem cache in a way consistent with some total ordering of all fs operations, but after a power loss, a journal fs is only consistent with some arbitrary, possibly truncated, alternative ordering of events, where the end result would have been the same only if they were not truncated.
I agree with this sentiment, but to be clear, this change does not mask the error from occurring, but rather, after a compilation has returned an error and failed, this change allows the next run to recover without requiring all of the caches to be deleted first. My opinion is that the manifest should also contain a hash of the output artifact(s) to allow for automatic fs corruption detection and recovery in a way that is more robust than the simple exists check in this PR. |
I really don't think that's the right tradeoff here. An additional hashing step on every compilation at the end is already a lot, but to additionally check the hash on a cache hit? That's too much to ask. It's unfortunate that file systems do not expose the primitives that databases need in order to synchronize writes relative to each other. I don't even think fsync is the right approach here. Ultimately, these conditions are too rare and too costly to work around. We lose more than we gain, given the file system primitives available to us. Does power loss, kills, and crashes even have anything to do with #18763? It looks to me like the latest hypothesis is a cron job to delete the tmp directory racing against the zig cache. |
Fixes extracted to #18801 |
I wanted to add: I think it's worth noting that crashes are distinct from the other two because they are not subject to the out-of-order possibilities. The atomic weaknesses of the OS you mentioned above don't apply to crashes. |
While it is not clear why artifacts are occasionally missing for which the manifest remains intact, regardless of how it happens, the cache should be at least somewhat resilient to kills, crashes, power loss, etc. affecting compilation processes anyway.
Workaround #18763