Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std.Build.Cache.hit: more discipline in error handling #22202

Merged
merged 2 commits into from
Dec 11, 2024
Merged

Conversation

andrewrk
Copy link
Member

@andrewrk andrewrk commented Dec 11, 2024

Previous commits

2b09299
4ea2f44

had this text:

There are no dir components, so you would think that this was
unreachable, however we have observed on macOS two processes racing to
do openat() with O_CREAT manifest in ENOENT.

This is indeed true, as verified by the snippet of code in a below comment on this PR description. However, ENOENT is also possible to be returned when a handle refers to a deleted directory. Thus, the previous workaround that unconditionally retried the file system operation is not sufficient. This patch changes the workaround to retry with O_EXCL, disambiguating which error condition has occurred.

This patch also adds an explicit error set to std.Build.Cache.hit as well as changing the failed_file_index to a proper diagnostic field that fully communicates what failed, leading to more informative error messages on failure to check the cache.

The equivalent failure when occurring for AstGen performs a fatal process kill, reasoning being that the compiler has an invariant of the cache directory not being yanked out from underneath it while executing. This could be made a more granular error in the future but I suspect such thing is not valuable to pursue.

Mitigates #18340 but does not solve it.

Previous commits

2b09299
4ea2f44

had this text:

> There are no dir components, so you would think that this was
> unreachable, however we have observed on macOS two processes racing to
> do openat() with O_CREAT manifest in ENOENT.

This appears to have been a misunderstanding based on the issue
report #12138 and corresponding PR #12139 in which the steps to
reproduce removed the cache directory in a loop which also executed
detached Zig compiler processes.

There is no evidence for the macOS kernel bug however the ENOENT is
easily explained by the removal of the cache directory.

This commit reverts those commits, ultimately reporting the ENOENT as an
error rather than repeating the create file operation. However this
commit also adds an explicit error set to `std.Build.Cache.hit` as well
as changing the `failed_file_index` to a proper diagnostic field that
fully communicates what failed, leading to more informative error
messages on failure to check the cache.

The equivalent failure when occuring for AstGen performs a fatal process
kill, reasoning being that the compiler has an invariant of the cache
directory not being yanked out from underneath it while executing. This
could be made a more granular error in the future but I suspect such
thing is not valuable to pursue.

Related to #18340 but does not solve it.
@andrewrk andrewrk added breaking Implementing this issue could cause existing code to no longer compile or have different behavior. standard library This issue involves writing Zig code for the standard library. labels Dec 11, 2024
@andrewrk
Copy link
Member Author

There is no evidence for the macOS kernel bug however the ENOENT is easily explained by the removal of the cache directory.

Well, I think this is evidence:

error: cache directory '/Users/ci/zach2/_work/zig/zig/build-debug/zig-global-cache/z/' unexpectedly removed during compiler execution
error: cache directory '/Users/ci/zach2/_work/zig/zig/build-debug/zig-global-cache/z/' unexpectedly removed during compiler execution
error: cache directory '/Users/ci/zach2/_work/zig/zig/build-debug/zig-global-cache/z/' unexpectedly removed during compiler execution

Because the only place this directory is removed is in the same CI script which failed, which is currently running.

@andrewrk
Copy link
Member Author

On aarch64-macos:

% cat repro.zig
const std = @import("std");
pub fn main() !void {
    const dir = try std.fs.cwd().makeOpenPath("dir", .{});
    try dir.deleteFile("file");
    const thread1 = try std.Thread.spawn(.{}, thread, .{dir});
    defer thread1.join();
    const thread2 = try std.Thread.spawn(.{}, thread, .{dir});
    defer thread2.join();
}
fn thread(dir: std.fs.Dir) void {
    const file = dir.createFile("file", .{}) catch |err| @panic(@errorName(err));
    defer file.close();
}
% zig run repro.zig
thread 2427287 panic: FileNotFound

So the originally reported issue (#12138) is indeed accurate about this macOS kernel behavior.

Given this, I can think of two workarounds:

  • Handle error.FileNotFound on macOS by attempting to create a temporary test file in the directory using a random number, in order to disambiguate whether the error code comes from the directory no longer existing (ENOENT also occurs for creating the secondary test file), or from the file creation race (operation should be retried).
  • On macOS, attempt the file open X times before concluding the directory has been removed.

@andrewrk
Copy link
Member Author

andrewrk commented Dec 11, 2024

@jacobly0 pointed out that with .exclusive = true you get EEXIST instead of ENOENT. This suggests another workaround: on macOS, retry with .exclusive = true and then handle EEXIST with a retry of the original syscall, and handle ENOENT as directory deletion.

Pseudocode:

loop:
  createFile exclusive=false
  ENOENT: (macOS)
    createFile exclusive=true
    EEXIST:
      goto loop
    ENOENT:
      directory deletion detected
  ENOENT: (Linux)
    directory deletion detected

Linux does not need this workaround since ENOENT always means the directory was deleted.

It is unclear whether other POSIX operating systems need this workaround. However, the path forward is clear: do not implement workarounds if we have the ability to pressure the faulty systems to improve. Apple doesn't give a flying fuck about Zig so we're stuck working around their crap, however there is a chance that not implementing this workaround for one of the BSDs, for instance, leads to a developer actually fixing the bug.

src/Compilation.zig Outdated Show resolved Hide resolved
The previous commit cast doubt upon the initial report about macOS
kernel behavior, identifying another reason that ENOENT could be
returned from file creation.

However, it is demonstrable that ENOENT can be returned for both cases:
1. create file race
2. handle refers to deleted directory

This commit re-introduces the workaround for the file creation race on
macOS however it does not unconditionally retry - it first tries again
with O_EXCL to disambiguate the error condition that has occurred.
@andrewrk andrewrk merged commit 3670910 into master Dec 11, 2024
10 checks passed
@andrewrk andrewrk deleted the Cache.hit branch December 11, 2024 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Implementing this issue could cause existing code to no longer compile or have different behavior. standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants