std.Build.Cache.hit: more discipline in error handling #22202
Previous commits 2b09299 and 4ea2f44 had this text:

> There are no dir components, so you would think that this was unreachable, however we have observed on macOS two processes racing to do openat() with O_CREAT manifest in ENOENT.

This appears to have been a misunderstanding based on the issue report #12138 and corresponding PR #12139, in which the steps to reproduce removed the cache directory in a loop which also executed detached Zig compiler processes. There is no evidence for the macOS kernel bug; however, the ENOENT is easily explained by the removal of the cache directory.

This commit reverts those commits, ultimately reporting the ENOENT as an error rather than repeating the create file operation. However, this commit also adds an explicit error set to `std.Build.Cache.hit` and changes `failed_file_index` to a proper diagnostic field that fully communicates what failed, leading to more informative error messages on failure to check the cache.

The equivalent failure, when it occurs for AstGen, performs a fatal process kill, the reasoning being that the compiler has an invariant of the cache directory not being yanked out from underneath it while executing. This could be made a more granular error in the future, but I suspect such a thing is not valuable to pursue.

Related to #18340 but does not solve it.
Well, I think this is evidence:
Because the only place this directory is removed is in the same CI script which failed, which is currently running.
On aarch64-macos:
So the originally reported issue (#12138) is indeed accurate about this macOS kernel behavior. Given this, I can think of two workarounds:
@jacobly0 pointed out that with Pseudocode:
Linux does not need this workaround since ENOENT always means the directory was deleted. It is unclear whether other POSIX operating systems need this workaround. However, the path forward is clear: do not implement workarounds if we have the ability to pressure the faulty systems to improve. Apple doesn't give a flying fuck about Zig so we're stuck working around their crap, however there is a chance that not implementing this workaround for one of the BSDs, for instance, leads to a developer actually fixing the bug.
The previous commit cast doubt upon the initial report about macOS kernel behavior, identifying another reason that ENOENT could be returned from file creation. However, it is demonstrable that ENOENT can be returned in both cases:

1. create file race
2. handle refers to deleted directory

This commit re-introduces the workaround for the file creation race on macOS. However, it does not unconditionally retry: it first tries again with O_EXCL to disambiguate the error condition that has occurred.
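The retry described in this commit message can be sketched against the POSIX API. This is a minimal illustration in C, not the Zig change itself. The names `create_manifest`, `create_result`, and `demo_deleted_dir` are hypothetical, and the reading that `EEXIST` on the `O_EXCL` retry indicates a lost creation race while a second `ENOENT` indicates a deleted directory is an assumption drawn from the discussion above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical outcome codes for this sketch; not names from the Zig source. */
enum create_result { CREATED, LOST_RACE_RETRY, DIR_DELETED, OTHER_ERROR };

/* Create (or open) `name` relative to `dir_fd`. On ENOENT, retry once with
 * O_EXCL to tell the two possible causes apart (assumed interpretation):
 *   EEXIST -> another process created the file first, so the caller can
 *             loop and open it normally;
 *   ENOENT -> the directory behind dir_fd no longer exists. */
enum create_result create_manifest(int dir_fd, const char *name, int *out_fd) {
    int fd = openat(dir_fd, name, O_RDWR | O_CREAT, 0644);
    if (fd >= 0) {
        *out_fd = fd;
        return CREATED;
    }
    if (errno != ENOENT) return OTHER_ERROR;

    /* ENOENT is ambiguous here, so retry with O_EXCL instead of blindly. */
    fd = openat(dir_fd, name, O_RDWR | O_CREAT | O_EXCL, 0644);
    if (fd >= 0) {
        *out_fd = fd;
        return CREATED;
    }
    if (errno == EEXIST) return LOST_RACE_RETRY;
    if (errno == ENOENT) return DIR_DELETED;
    return OTHER_ERROR;
}

/* Exercises only the deleted-directory case: hold a handle to a temporary
 * directory, remove the directory, then try to create a file through the
 * stale handle. The creation race itself is timing dependent and is not
 * reproduced here. */
enum create_result demo_deleted_dir(void) {
    char path[] = "/tmp/cache-sketch-XXXXXX";
    if (mkdtemp(path) == NULL) return OTHER_ERROR;

    int dir_fd = open(path, O_RDONLY | O_DIRECTORY);
    if (dir_fd < 0) {
        rmdir(path);
        return OTHER_ERROR;
    }

    int fd = -1;
    enum create_result r = create_manifest(dir_fd, "manifest", &fd);
    if (r == CREATED) {
        close(fd);
        unlinkat(dir_fd, "manifest", 0); /* empty the directory again */
    }

    rmdir(path); /* the handle now refers to a deleted directory */

    r = create_manifest(dir_fd, "manifest", &fd);
    if (r == CREATED) close(fd);
    close(dir_fd);
    return r;
}
```

On Linux, `demo_deleted_dir()` is expected to return `DIR_DELETED`, because both `openat()` calls fail with `ENOENT` once the directory is gone. A caller that receives `LOST_RACE_RETRY` would loop back and attempt the plain open again, rather than reporting an error.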
Previous commits 2b09299 and 4ea2f44 had this text:

> There are no dir components, so you would think that this was unreachable, however we have observed on macOS two processes racing to do openat() with O_CREAT manifest in ENOENT.

This is indeed true, as verified by the snippet of code in a comment below this PR description. However, ENOENT can also be returned when a handle refers to a deleted directory. Thus, the previous workaround that unconditionally retried the file system operation is not sufficient. This patch changes the workaround to retry with `O_EXCL`, disambiguating which error condition has occurred.

This patch also adds an explicit error set to `std.Build.Cache.hit` and changes `failed_file_index` to a proper diagnostic field that fully communicates what failed, leading to more informative error messages on failure to check the cache.

The equivalent failure, when it occurs for AstGen, performs a fatal process kill, the reasoning being that the compiler has an invariant of the cache directory not being yanked out from underneath it while executing. This could be made a more granular error in the future, but I suspect such a thing is not valuable to pursue.
Mitigates #18340 but does not solve it.