Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compilation: detect and regenerate missing cache artifacts #18792

Closed
wants to merge 1 commit into from

Conversation

jacobly0
Copy link
Member

@jacobly0 jacobly0 commented Feb 3, 2024

While it is not clear why artifacts are occasionally missing for which the manifest remains intact, regardless of how it happens, the cache should be at least somewhat resilient to kills, crashes, power loss, etc. affecting compilation processes anyway.

Workaround #18763

While it is not clear why artifacts are occasionally missing for which
the manifest remains intact, regardless of how it happens, the cache
should be at least somewhat resilient to kills, crashes, power loss,
etc. affecting compilation processes anyway.

Workaround ziglang#18763
Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cache should be at least somewhat resilient to kills, crashes, power loss, etc.

It is resilient to these things already by writing the manifest file only after all the artifacts are created. This workaround is not desirable because it masks symptoms of a non understood phenomenon.

I don't want this workaround in the codebase. We need to figure out the true cause and address it.

@jacobly0
Copy link
Member Author

jacobly0 commented Feb 3, 2024

It is resilient to these things already by writing the manifest file only after all the artifacts are created.

This makes no sense to me. The OS is under no obligation to physically perform writes to disk in the same order as they occur in the program. Its only obligation is to atomically update the filesystem cache in a way consistent with some total ordering of all fs operations, but after a power loss, a journal fs is only consistent with some arbitrary, possibly truncated, alternative ordering of events, where the end result would have been the same only if they were not truncated.

This workaround is not desirable because it masks symptoms of a non understood phenomenon.

I agree with this sentiment, but to be clear, this change does not mask the error from occurring, but rather, after a compilation has returned an error and failed, this change allows the next run to recover without requiring all of the caches to be deleted first.

My opinion is that the manifest should also contain a hash of the output artifact(s) to allow for automatic fs corruption detection and recovery in a way that is more robust than the simple exists check in this PR.

@andrewrk
Copy link
Member

andrewrk commented Feb 4, 2024

I really don't think that's the right tradeoff here. An additional hashing step on every compilation at the end is already a lot, but to additionally check the hash on a cache hit? That's too much to ask.

It's unfortunate that file systems do not expose the primitives that databases need in order to synchronize writes relative to each other. I don't even think fsync is the right approach here. Ultimately, these conditions are too rare and too costly to work around. We lose more than we gain, given the file system primitives available to us.

Does power loss, kills, and crashes even have anything to do with #18763? It looks to me like the latest hypothesis is a cron job to delete the tmp directory racing against the zig cache.

@jacobly0
Copy link
Member Author

jacobly0 commented Feb 4, 2024

Fixes extracted to #18801

@jacobly0 jacobly0 closed this Feb 4, 2024
@andrewrk
Copy link
Member

andrewrk commented Feb 4, 2024

loss, kills, and crashes

I wanted to add: I think it's worth noting that crashes are distinct from the other two because they are not subject to the out-of-order possibilities. The atomic weaknesses of the OS you mentioned above don't apply to crashes.

@jacobly0 jacobly0 deleted the paranoid-cache branch February 4, 2024 04:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants