occasional error:FileNotFound with empty zig cache on MacOS #18763

Closed
motiejus opened this issue Jan 31, 2024 · 21 comments
@motiejus
Contributor

motiejus commented Jan 31, 2024

hermetic_cc_toolchain users on MacOS occasionally see error: FileNotFound when compiling a zig program for the first time:

$ ZIG_LOCAL_CACHE_DIR=/tmp/zig-cache ZIG_GLOBAL_CACHE_DIR=/tmp/zig-cache zig build-exe -target x86_64-macos-none -mcpu=baseline -fstrip -OReleaseSafe zig-wrapper.zig
error: FileNotFound

Once a user hits this error, subsequent runs fail with the same error. The only known mitigation is to wipe the zig cache and try again, after which the build always succeeds.
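
For illustration, the wipe-and-retry mitigation could be scripted roughly as below. This is a minimal sketch, not what hermetic_cc_toolchain actually ships: retry_with_clean_cache is a hypothetical helper name, and it assumes no other zig process is using the cache directory while it is being removed.

```shell
# retry_with_clean_cache CACHE_DIR CMD [ARGS...]
# Hypothetical helper: run CMD once; if it fails, remove CACHE_DIR
# completely and run CMD one more time.
retry_with_clean_cache() {
    cache_dir=$1
    shift
    if "$@"; then
        return 0
    fi
    # Assumption: nothing else is reading or writing the cache right now.
    rm -rf -- "$cache_dir"
    "$@"
}

# Usage (paths taken from the report above):
#   export ZIG_LOCAL_CACHE_DIR=/tmp/zig-cache ZIG_GLOBAL_CACHE_DIR=/tmp/zig-cache
#   retry_with_clean_cache /tmp/zig-cache zig build-exe zig-wrapper.zig
```

The helper retries after any failure, not only FileNotFound, so a genuine compile error would simply be reported twice.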

A few comments:

  • I wasn't able to reproduce this in a "lab" no matter how much I tried.
  • Only MacOS users are known to be affected. This report has details about x86_64, I am not sure if I received reports on aarch64.
  • These reports spike after a MacOS system upgrade. Users have learned to remove the zig cache directory and move on, which made it quite difficult to get hold of a failing cache (which we do have now!).

Attaching an archive of /tmp/zig-cache from the machine that just got this error.

If there is anything more I can provide, let me know, I will do my best to instruct the users to do so.

Files and versions

cc @jacobly0 to whom this might be of interest, he worked on cache races before.

ghost pushed a commit to uber/hermetic_cc_toolchain that referenced this issue Jan 31, 2024
Now that we have a proper bug report upstream, we can apply the
workaround in the toolchain repository behind the scenes. Thanks
@nadavwe!

Zig upstream issue: ziglang/zig#18763
motiejus pushed a commit to uber/hermetic_cc_toolchain that referenced this issue Jan 31, 2024
Now that we have a proper bug report upstream, we can apply the
workaround in the toolchain repository behind the scenes. Thanks
@nadavwe!

Zig upstream issue: ziglang/zig#18763
@jacobly0 jacobly0 added the zig build system std.Build, the build runner, `zig build` subcommand, package management label Feb 2, 2024
@jacobly0
Member

jacobly0 commented Feb 3, 2024

I'd be curious what the stderr output will be on version 0.12.0-dev.2475+dcaf43674 or later, which is expected to be a bit more descriptive. Also, can you confirm whether any other zig processes might be accessing /tmp/zig-cache at the same time as this failed command?

jacobly0 added a commit to jacobly0/zig that referenced this issue Feb 3, 2024
While it is not clear why artifacts are occasionally missing for which
the manifest remains intact, regardless of how it happens, the cache
should be at least somewhat resilient to kills, crashes, power loss,
etc. affecting compilation processes anyway.

Workaround ziglang#18763
@jacobly0
Member

jacobly0 commented Feb 3, 2024

I have written a workaround that, while it is not expected to prevent this error from occurring, which would require actually knowing how it is even happening, should at least change the mitigation to simply "try again" with no cache wipe required.

@motiejus
Contributor Author

motiejus commented Feb 3, 2024

Also, can you confirm whether any other zig processes might be accessing /tmp/zig-cache at the same time as this failed command?

It is very unlikely that there were multiple invocations of zig build-exe when this failed. I.e. the toolchain does this single action before doing any parallel work. When I say unlikely, I mean the users should go out of their way to do it; but not impossible.

I have written a workaround that, while it is not expected to prevent this error from occurring, which would require actually knowing how it is even happening, should at least change the mitigation to simply "try again" with no cache wipe required.

I suggest I reword the error message to instruct the user to:

  • paste the full output to this issue.
  • retry the command (with your workaround when landed, but no clearing of the cache).

... and then we wait for an undetermined amount of time.

How does that sound?

@mikdusan
Member

mikdusan commented Feb 3, 2024

On the assumption that this only happens when zig-cache is in /tmp, is it possible the systems have been customized and are cleaning /tmp on a daily schedule? macos can be tweaked with /etc/periodic.conf or /etc/periodic.conf.local. See man periodic.conf.

@motiejus
Contributor Author

motiejus commented Feb 3, 2024

On the assumption that this only happens when zig-cache is in /tmp, is it possible the systems have been customized and are cleaning /tmp on a daily schedule? macos can be tweaked with /etc/periodic.conf or /etc/periodic.conf.local. See man periodic.conf.

That could be. We don't control when users clean their cache.

ghost pushed a commit to uber/hermetic_cc_toolchain that referenced this issue Feb 7, 2024
Since there was movement on the upstream issue
(ziglang/zig#18763), we can now gather more
information. Not removing the cache directory, asking users to
collaborate.
@mikdusan
Member

A question of detail: is the error message just error: FileNotFound, or something with text like this:

error: unexpected error: parsing input file failed with error FileNotFound

@jacobly0
Member

jacobly0 commented Feb 13, 2024

@mikdusan On the version of zig where the reproduction happened, the output was just error: FileNotFound, but on master the error is now expected to be something like error: unexpected error: parsing input file failed with error FileNotFound.

@mikdusan
Member

mikdusan commented Feb 13, 2024

I'm speculating that an external event such as factory macos cleaning /tmp periodically is the culprit. Unfortunately latest macos has changed from scripts to a binary doing the work, so I don't have all the details, but the default removal on macos 14 is by script /usr/libexec/tmp_cleaner, set to clean files not accessed for 3 days. The script runs every midnight local time. In zig-cache, one of the oldest hits is libcompiler_rt.a, so it's not a stretch to see that the zig-cache/o entry can be removed while other parts of zig-cache remain.

To simulate this removal:

  1. nuke the cache
  2. create a bare bone exe pub fn main() void {}
  3. build the exe and populate cache
zig build-exe z0.zig
  4. nuke objects in cache (we really just want to kill off compiler_rt stuff) but me lazy
rm -fr zig-cache/o
  5. try and build exe again
zig build-exe z0.zig
error: unexpected error: parsing input file failed with error FileNotFound
    note: while parsing /Users/mike/project/zig/work/main/zig-cache/o/8c7fefdddea19f3e901a5a677a6fec7a/libcompiler_rt.a
  6. repeat last step but this time with debug version of compiler with logging enabled:
stage4/bin/zig build-exe --debug-log compilation z0.zig
.
.
.
Semantic Analysis [1738] debug(compilation): CacheMode.whole cache hit for compiler_rt
error: unexpected error: parsing input file failed with error FileNotFound
    note: while parsing /tmp/zig-cache/o/8c7fefdddea19f3e901a5a677a6fec7a/libcompiler_rt.a

Thus we are getting an impossible cache hit on compiler_rt. Removing certain other portions of the remaining cache, like the h/ tree, causes a proper rebuild.

I'm certain it is not a goal to handle sporadic deletions of the zig-cache tree by external actors... but shouldn't compiler_rt properly re-institute itself in the cache given such a removal?
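
A rough way to spot the state produced by the simulation above, before invoking the compiler, might look like the sketch below. It assumes that h/ holds the cache manifests and o/ holds the built artifacts, and cache_looks_partial is a hypothetical helper name. It only catches the case where no regular files are left under o/ at all; a cache that has lost just a few artifacts would go unnoticed.

```shell
# cache_looks_partial CACHE_DIR
# Hypothetical heuristic: succeed (exit 0) when CACHE_DIR/h still has
# entries but no regular file is left anywhere under CACHE_DIR/o.
cache_looks_partial() {
    dir=$1
    # No h/ tree at all: nothing that could produce a stale hit.
    [ -d "$dir/h" ] || return 1
    [ -n "$(ls -A "$dir/h" 2>/dev/null)" ] || return 1
    # Look for at least one regular file under o/ (empty dirs do not count).
    [ -z "$(find "$dir/o" -type f 2>/dev/null | head -n 1)" ]
}

# Usage:
#   if cache_looks_partial zig-cache; then echo "zig-cache looks partially deleted"; fi
```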

@mikdusan mikdusan reopened this Feb 13, 2024
@motiejus
Contributor Author

I am planning to release hermetic_cc_toolchain tomorrow which will include the updated Zig SDK and instructions to paste the error message here. Hopefully we will get a meaningful report.

I'm speculating that an external event such as factory macos cleaning /tmp periodically is the culprit. Unfortunately latest macos has changed from scripts to a binary doing the work so I don't have all the details but I will assume that removal is by some kind of date.

Could be. If you can help me with instructions on how to check it, I would pass it on to affected-colleagues-of-the-past to confirm or deny that this exists on their machine.

@mikdusan
Member

Could be. If you can help me with instructions on how to check it, I would pass it on to affected-colleagues-of-the-past to confirm or deny that this exists on their machine.

I checked on macos 11 and macos 14, tmp cleanup is enabled by default.
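
One rough way for an affected user to see which cache files would currently qualify for that cleanup is to list files by access time. This is only an approximation of the "not accessed for 3 days" rule quoted earlier: list_stale_cache_files is a hypothetical helper name, GNU and BSD find round day boundaries differently, and access times may not be updated on volumes mounted with noatime.

```shell
# list_stale_cache_files CACHE_DIR
# Hypothetical helper: print regular files under CACHE_DIR whose last
# access time is roughly three or more days in the past.
list_stale_cache_files() {
    find "$1" -type f -atime +2
}

# Usage:
#   list_stale_cache_files /tmp/zig-cache
```

If libcompiler_rt.a or other files under o/ show up in the output while the cache is still in regular use, that would be consistent with the periodic cleanup theory.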

@andrewrk andrewrk removed the zig build system std.Build, the build runner, `zig build` subcommand, package management label Feb 13, 2024
@andrewrk
Member

Zig doesn't implement GC on the cache yet, so I understand the desire to use /tmp, but the bottom line is that it's not supported to race file deletion from the zig cache while using the compiler.

As it stands, the user must only delete the zig-cache directory when they have independently ensured that no compiler processes will run from start to finish of the deletion operation. Also they must fully delete the directory; running the compiler after a partial deletion is not supported.
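
As a sketch of the "fully delete" part only: renaming the directory out of the way before removing it means that an interrupted removal leaves no half-deleted tree at the cache path. This is a suggestion under stated assumptions and not something Zig provides; wipe_cache_fully is a hypothetical helper name, and the caller still has to ensure that no compiler process is running for the whole operation.

```shell
# wipe_cache_fully CACHE_DIR
# Hypothetical helper: move CACHE_DIR aside in one rename, then delete
# the renamed copy. Does NOT check for running zig processes.
wipe_cache_fully() {
    cache_dir=$1
    [ -e "$cache_dir" ] || return 0
    trash="${cache_dir%/}.removing.$$"
    mv -- "$cache_dir" "$trash" && rm -rf -- "$trash"
}

# Usage:
#   wipe_cache_fully /tmp/zig-cache
```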

@motiejus
Contributor Author

Cache purging by MacOS sounds like a more and more plausible explanation.

Any tips on what would be a better place to store zig cache on MacOS? It needs to be an absolute path, because of bazel limitations.

@mikdusan
Member

mikdusan commented Feb 13, 2024

I understand the wrapper goal is to set both local/global cache to the same value.
Perhaps simply set them both to what the default global cache likes to be:

export ZIG_GLOBAL_CACHE_DIR=$HOME/.cache/zig
export ZIG_LOCAL_CACHE_DIR=$ZIG_GLOBAL_CACHE_DIR

edit: if it's desirable to have a separate tree, perhaps $HOME/.cache/hermetic-zig-cache

@jacobly0
Member

jacobly0 commented Feb 13, 2024

/var/tmp is a marginally better place to store a cache, although it won't be cleared every reboot on mac like /tmp was. At least on my linux machine, the only actual difference is that /var/tmp is bigger, and so it is where the package manager compiles code from.

The correct place to put a non-user cache is /var/cache of course, but there won't be any writable directories in there unless created during a package installation.

@mikdusan
Member

here's a find:

man confstr
getconf DARWIN_USER_CACHE_DIR
_CS_DARWIN_USER_TEMP_DIR
  Provides the path to a user's temporary items directory. The directory will be created if it does not
  already exist. This directory is created with access permissions of 0700 and restricted by the
  umask(2) of the calling process and is a good location for temporary files.

  By default, files in this location may be cleaned (removed) by the system if they are not accessed in
  3 days.

_CS_DARWIN_USER_CACHE_DIR
  Provides the path to the user's cache directory. The directory will be created if it does not already
  exist. This directory is created with access permissions of 0700 and restricted by the umask(2) of
  the calling process and is a good location for user cache data as it will not be automatically
  cleaned by the system.

  Files in this location will be removed during safe boot.

DARWIN_USER_TEMP_DIR looks like /var/folders/pv/rrs7y6q14q11zz5hg2sqxj0m0000gn/T/ and is basically a private-/tmp. Same default cleanup policy 3+ days, but it only deletes regular files so dirs are left behind. Also on reboot, everything is removed. Getting this value is probably easier because $TMPDIR is defined as such.

DARWIN_USER_CACHE_DIR looks like /var/folders/pv/rrs7y6q14q11zz5hg2sqxj0m0000gn/C/ (notice C for cache instead of T for temp) and is private to each user. This is a decent fit if you want all of:

  1. private to user
  2. cleanup everything on boot
  3. no periodic cleanup of any kind

but getting this value would require adding confstr to std.c or shelling out to getconf. Obviously this would need to be macos-conditional.
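
The shelling-out variant could look roughly like this in a wrapper script. It is a minimal sketch: pick_cache_base is a hypothetical helper name, the zig subdirectory is an arbitrary choice, and the non-macOS fallback to XDG_CACHE_HOME (or $HOME/.cache) is an assumption that was not discussed in this thread.

```shell
# pick_cache_base
# Hypothetical helper: print a per-user cache base directory.
pick_cache_base() {
    if [ "$(uname -s)" = "Darwin" ]; then
        # macOS only: per-user cache dir, not cleaned periodically.
        base=$(getconf DARWIN_USER_CACHE_DIR)
    else
        # Assumption: fall back to the conventional user cache location.
        base=${XDG_CACHE_HOME:-$HOME/.cache}
    fi
    # getconf returns a trailing slash on macOS; strip it.
    printf '%s\n' "${base%/}"
}

export ZIG_GLOBAL_CACHE_DIR="$(pick_cache_base)/zig"
export ZIG_LOCAL_CACHE_DIR="$ZIG_GLOBAL_CACHE_DIR"
```

Because the resulting path differs per user, this only helps where the cache location can be computed at run time.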

@motiejus
Contributor Author

motiejus commented Feb 14, 2024

I understand the wrapper goal is to set both local/global cache to the same value. Perhaps simply set them both to what the default global cache likes to be:

export ZIG_GLOBAL_CACHE_DIR=$HOME/.cache/zig
export ZIG_LOCAL_CACHE_DIR=$ZIG_GLOBAL_CACHE_DIR

This path will need to appear in --sandbox_add_mount_pair bazel arguments: https://github.com/uber/hermetic_cc_toolchain/?tab=readme-ov-file#usage

This option only accepts full paths, without variable expansion. Since this option violates hermeticity, Bazel treats it quite strictly (and I don't presume any PRs would be accepted to, say, allow environment variables there).

/var/tmp is a marginally better place to store a cache, although it won't be cleared every reboot on mac like /tmp was. At least on my linux machine, the only actual difference is that /var/tmp is bigger, and so it is where the package manager compiles code from.

I don't need the cache to be cleared on every reboot. /var/tmp seems like an excellent choice, provided:

  1. it's writable.
  2. it doesn't get cleared automatically.

I will confirm both today and, voilà, we may have a resolution to this.
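
The first condition can be checked with a small probe like the one below (a sketch; dir_is_writable is a hypothetical helper name). The second condition cannot be verified by a script like this, since it depends on the system's cleanup configuration.

```shell
# dir_is_writable DIR
# Hypothetical helper: succeed if a file can be created in DIR.
dir_is_writable() {
    # Create and immediately remove a uniquely named probe file.
    probe=$(mktemp "$1/zig-cache-probe.XXXXXX" 2>/dev/null) || return 1
    rm -f -- "$probe"
}

# Usage:
#   dir_is_writable /var/tmp && echo "/var/tmp is writable"
```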

The correct place to put a non-user cache is /var/cache of course, but there won't be any writable directories in there unless created during a package installation.

Ideally it should be in ~/.cache/zig, but it's been rejected due to the reasons above.

@jacobly0
Member

jacobly0 commented Feb 14, 2024

Yeah that really doesn't leave much choice, as the only directories you can reasonably expect to be writable on linux are:

  • /tmp
  • /var/tmp
  • /dev/mqueue
  • /dev/shm
  • $XDG_RUNTIME_DIR
  • $HOME

I should also note that there is configuration on macOS (at least the older versions) for excluding patterns from the tmp cleaning, but obviously fewer user configuration requirements would be better.

motiejus added a commit to motiejus/hermetic_cc_toolchain that referenced this issue Feb 14, 2024
MacOS has a cronjob that deletes files older than 3 days from /tmp.
That's not good for zig cache: ziglang/zig#18763

Switch to /var/tmp, which does not seem to be randomly wiped at runtime.

This breaking change means the next hermetic_cc_toolchain will be
v3.0.0.
@motiejus
Contributor Author

Thanks everyone and sorry for such an alarm. I now have high confidence that moving cache to /var/tmp will fix this.

This was truly a long-time and multi-person effort to find!

@andrewrk
Member

No need to apologize.

Maybe there is still room for improving the error message here: it could have saved time if it clearly indicated that the problem occurred due to files missing from the global cache directory. I want to be careful not to offer misleading hints, but any clarification that could have led to the diagnosis more quickly would make sense to add to the error reporting.

@jacobly0
Member

jacobly0 commented Feb 14, 2024

it could have saved time if it clearly indicated the problem occurred due to files missing from the global cache directory.

Note that after the recent linker changes, the status quo with lld and the self-hosted linker on master is:

error: ld.lld: cannot open ~/.cache/zig/o/7a704696d970e90cbff9db1a66fc9583/libc.a: No such file or directory

error: unexpected error: parsing input file failed with error FileNotFound
    note: while parsing ~/.cache/zig/o/7a704696d970e90cbff9db1a66fc9583/libc.a

@andrewrk
Member

andrewrk commented Feb 15, 2024

My only suggestion would be

-error: unexpected error: parsing input file failed with error FileNotFound
+error: unable to parse input file: FileNotFound
    note: while parsing ~/.cache/zig/o/7a704696d970e90cbff9db1a66fc9583/libc.a

but that has nothing to do with this issue. So I think there is nothing else to do.
