macOS: Cannot delete sandbox directory after action execution #2371

Closed
bendowski opened this issue Jan 18, 2017 · 17 comments

@bendowski

bendowski commented Jan 18, 2017

Description of the problem / feature request / question:

After running tests, I see this warning:

WARNING: Cannot delete sandbox directory after action execution: /private/var/tmp/_bazel_user/81e2f173210dcd47c26dbf4a42147a8b/bazel-sandbox/2551a3b4-abe3-4c34-a39d-37b44153e291-0 (java.io.IOException: /private/var/tmp/_bazel_user/81e2f173210dcd47c26dbf4a42147a8b/bazel-sandbox/2551a3b4-abe3-4c34-a39d-37b44153e291-0/execroot/master/_tmp/tests_2 (Directory not empty)).

Please let me know how I can provide more debugging information.

Environment info

  • Operating System:
    macOS Sierra 10.12.2

  • Bazel version (output of bazel info release):
    release 0.4.3-homebrew

@hermione521
Contributor

We used to have this problem, although I don't remember the cause. I remember @philwo had a fix (95b16a8), so I'm not sure whether this should still happen now...

@laszlocsomor
Contributor

@benol : Thanks for the bug report! I'm trying to collect some hopefully useful data:

  • Do you see this warning consistently or just intermittently?
  • Have you seen it with just this target or with other tests too?
  • Is the directory left behind after Bazel finishes? Are you able to list its contents, and does it indeed look non-empty?
  • Are you aware of any workaround?

@bendowski
Author

bendowski commented Jan 24, 2017

  • I can see it consistently for one target that creates a lot of files in TEST_TMPDIR and runs other processes from within the test.
  • Only in this target.
  • It's left on disk and it's non-empty.

In our case we spawn a new JVM to work, inside the sandbox, on files in the TEST_TMPDIR directory. After Bazel fails to clean up the sandbox, I can see this directory left there, with a lock file used by the child process.

So it's possible that Bazel fails to kill all child processes before deleting the sandbox. It would seem the child process was either still alive and holding the lock, or was killed forcefully and Bazel then failed to delete the locked file.

After Bazel prints the warning and leaves the sandbox directory on disk, I can delete it manually with a simple "rm -rf /private/var/.../bazel-sandbox/hashhash".

We don't have this problem on Linux.

@laszlocsomor
Contributor

Thanks, that's great info.

So it sounds like the sandbox cleanup is broken -- it doesn't kill all child processes and fails to clean up all directories.

@hermione521 : does that sound like a plausible root cause? I think we could repro this with an action that spawns a process which holds on to a file descriptor and doesn't terminate.
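
The repro idea above could be sketched roughly as follows. Nothing here is taken from the original report: the helper names (`spawn_lingering_child`, `wait_for_lock`), the `child.lock` file name, the bounded lifetime, and the periodic file creation are all assumptions made for illustration. The child is given a finite lifetime only so that the sketch cleans up after itself.

```python
import os
import subprocess
import sys
import time

# Hypothetical child program (an assumption, not the reporter's code): it holds
# a lock file open and keeps creating files until its lifetime runs out.
CHILD_SOURCE = r"""
import os, sys, time
tmpdir, lifetime = sys.argv[1], float(sys.argv[2])
lock = open(os.path.join(tmpdir, "child.lock"), "w")  # held open until exit
deadline = time.time() + lifetime
counter = 0
while time.time() < deadline:
    # Assumption: entries that appear while cleanup is running are what could
    # make a recursive delete fail with "Directory not empty".
    with open(os.path.join(tmpdir, "file_%d" % counter), "w") as f:
        f.write("x")
    counter += 1
    time.sleep(0.05)
"""


def spawn_lingering_child(tmpdir, lifetime_seconds=30.0):
    """Start a child that keeps running after the caller returns."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, tmpdir, str(lifetime_seconds)],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def wait_for_lock(tmpdir, timeout_seconds=5.0):
    """Poll until the child has created its lock file; False on timeout."""
    lock_path = os.path.join(tmpdir, "child.lock")
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if os.path.exists(lock_path):
            return True
        time.sleep(0.05)
    return False
```

A test body would call `spawn_lingering_child(os.environ["TEST_TMPDIR"])`, optionally wait for the lock file, and then return without waiting for the child, so the child is still writing when the sandbox is cleaned up. Whether this actually triggers the warning is untested.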

@hermione521
Contributor

I tried several things but still can't reproduce it. It would be very helpful if you could provide a minimal example that reproduces it. Thank you!

@ittaiz
Member

ittaiz commented Jan 27, 2017

I actually got a similar message, but I'm not sure it was deterministic.
I was wading through several issues, so I ignored it for the time being, but if it surfaces again I'll try to triage it and produce a minimal example.

@hermione521
Contributor

@ittaiz that would be very helpful! Thank you in advance!

@philwo philwo self-assigned this Feb 22, 2017
@ittaiz
Member

ittaiz commented Mar 12, 2017

@hermione521 this happened to me again today, but since I'm generating a big Bazel codebase (or rather generating hundreds of build files via a migration tool from Maven), I don't think I can produce a minimal example, mainly because I don't know why this happens.
Any chance you can give me a few concrete steps you'd like me to take when this occurs, to capture the state of the workspace?

@ittaiz
Member

ittaiz commented Mar 20, 2017

@hermione521 ping? It happened to me again. A small repro isn't likely, but maybe I can dig up some more details if you point me in the right direction.

@hermione521
Contributor

Hmm.. I don't have any ideas other than cleaning them up manually.. Let's ping @philwo to see if he can help.

@philwo
Member

philwo commented Mar 21, 2017

I'm currently doing a big round of bug-fixes and will try to fix whatever might cause this. I'll specifically look into whether there are any race conditions in Bazel's process management on macOS.

I'll follow up on this, but if I don't, please ping this bug in ~1 week and I'll send a status update. :)

@hermione521 hermione521 removed their assignment Mar 27, 2017
@ittaiz
Member

ittaiz commented Mar 29, 2017 via email

@philwo
Member

philwo commented Apr 6, 2017

Hi @ittaiz,

I've been working on a fix for this over the last few days and hope to get it submitted tomorrow. Will ping this bug then!

Philipp

bazel-io pushed a commit that referenced this issue Apr 24, 2017
This uses Linux's PR_SET_CHILD_SUBREAPER and FreeBSD's PROC_REAP_ACQUIRE features to become an init-like process for all (grand)children spawned by process-wrapper, which allows us to a) kill them reliably and then b) wait for them reliably. Before this change, we only killed the main child, waited for it, then fired off a kill -9 on the process group, without waiting for it. This led to a race condition where Bazel would try to use or delete files that were still held open by children of the main child and thus to bugs like #2371.

This means we now have reliable process management on Linux, FreeBSD and Windows. Unfortunately I couldn't find any feature like this on macOS, so this is the only OS that will still have this race condition.

PiperOrigin-RevId: 153817210
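
The Linux half of the mechanism described in this commit message can be illustrated with a small sketch. It is not the process-wrapper code, only a Python approximation under stated assumptions: the function names `become_subreaper` and `run_and_reap` are hypothetical, `prctl` is reached through `ctypes`, and the constant 36 is the value of `PR_SET_CHILD_SUBREAPER` in `<linux/prctl.h>`. It only works on Linux; as the message says, no equivalent feature was found on macOS.

```python
import ctypes
import os
import signal
import subprocess
import sys

PR_SET_CHILD_SUBREAPER = 36  # value from <linux/prctl.h>


def become_subreaper():
    """Ask the Linux kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    result = libc.prctl(
        ctypes.c_int(PR_SET_CHILD_SUBREAPER),
        ctypes.c_ulong(1),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def run_and_reap(argv):
    """Run argv, then kill and wait for every descendant it left behind.

    Returns the pids reaped after the main child itself has exited.
    """
    become_subreaper()
    # start_new_session=True gives the main child its own process group,
    # whose id equals the child's pid; its descendants inherit that group.
    main_child = subprocess.Popen(argv, start_new_session=True)
    main_child.wait()
    try:
        os.killpg(main_child.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # nothing was left in the group
    reaped = []
    while True:
        try:
            # Blocks until some child exits. Orphaned grandchildren count as
            # our children because we are their subreaper.
            pid, _status = os.waitpid(-1, 0)
        except ChildProcessError:
            break  # no children remain
        reaped.append(pid)
    return reaped
```

Once `run_and_reap` returns, no descendant of the command is still running, so deleting its working directory no longer races with a process that has files open in it. Note that `os.waitpid(-1, 0)` reaps any child of the calling process, so this sketch assumes the caller has no unrelated children.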
@ulfjack
Contributor

ulfjack commented Jun 29, 2017

@philwo can we close this now?

@philwo
Member

philwo commented Jul 10, 2017

The fix has been rolled back, so maybe we should keep it open. On the other hand... did this happen to someone in the last month? If not, we can also close it, I don't mind.

@ittaiz
Member

ittaiz commented Jul 10, 2017 via email

@philwo
Member

philwo commented Jul 19, 2018

Closing. Please re-open if someone sees this again.

@philwo philwo closed this as completed Jul 19, 2018