
uninterruptible sleep in linux-sandbox processes - bazel hang only fixed with reboot #3494

Closed
brunobowden opened this issue Aug 2, 2017 · 12 comments
Labels: category: sandboxing, team-Local-Exec, untriaged

Comments

brunobowden commented Aug 2, 2017

Summary

A group of us at the same company have been seeing repeated Bazel hangs for months now. Bazel hangs indefinitely and can't be killed with kill -9. The cause appears to be multiple linux-sandbox processes stuck in the uninterruptible "D" state (waiting on unknown I/O operations). Only a reboot clears the problem, but thankfully it does so reliably.

What's curious is that it has happened on multiple engineers' laptops as well as desktop machines. The cause is unclear, but possible contributing factors are:

  • Bazel Cache Server - we regularly switch from connected to disconnected builds
  • Ubuntu 16.04
  • TensorFlow (perhaps because it's doing larger builds?)
  • WORKSPACE (ours is substantial in size, but it's unclear whether that's relevant)

A very similar report from @igor0 describes an identical issue when building TensorFlow. Given that this hasn't been tracked as a separate issue, I'm wondering whether it's something specific to our build setup or a more widely seen problem:
tensorflow/tensorflow#717 (comment)

Reproduction Steps

Unknown. Happens fairly infrequently, maybe every few hundred builds?

Environment info

  • Ubuntu 16.04.2 LTS
  • Bazel version: 0.5.2

Have you found anything relevant by searching the web?

Nothing found on SO, GitHub or the web aside from issue 717:
tensorflow/tensorflow#717 (comment)

Other Info

Bazel hangs either after the INFO line or partway through the build (shown below):

Building bazel target: //XXX/...
INFO: Reading 'startup' options from /home/dev/XXX/tools/bazel.rc: --host_jvm_args=-Dbazel.DigestFunction=SHA1
.
____Loading package: third_party/XXX
____Loading package: third_party/XXX
____Loading package: third_party/XXX
____Loading package: third_party/XXX
____Loading package: third_party/gtest
____Loading package: third_party/tensorflow/tensorflow/cc
____Loading package: third_party/tensorflow/tensorflow/user_ops
____Loading package: third_party/tensorflow/tensorflow/core
____Loading package: third_party/tensorflow/tensorflow/core/platform/default/gpu
$ ps aux | grep bazel
username   8756  0.0  0.0  14356  1208 ?        D    17:43   0:00 /home/username/.cache/bazel/_bazel_username/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/_bin/linux-sandbox -- /bin/true
username   8789 52.3  1.6 2342356 276336 pts/18 Sl+  17:45  26:48 bazel-bin/XXX -p XXX
username  11415  0.0  0.0 135368  6612 ?        Sl   18:30   0:00 /usr/bin/bazel build -- //XXX/... -//third_party/...
username  11443  4.3  4.0 9961172 669896 ?      Ssl  18:30   0:18 bazel(dev) -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/username/.cache/bazel/_bazel_username/d80c0ea9e24d79779b730e31ad544b46 -Xverify:none -Djava.util.logging.config.file=/home/username/.cache/bazel/_bazel_username/d80c0ea9e24d79779b730e31ad544b46/javalog.properties -Djava.library.path=/home/username/.cache/bazel/_bazel_username/install/23983a8c74aab550c8acc97c193a0f1f/_embedded_binaries/ -Dfile.encoding=ISO-8859-1 -Dbazel.DigestFunction=SHA1 -jar /home/username/.cache/bazel/_bazel_username/install/23983a8c74aab550c8acc97c193a0f1f/_embedded_binaries/A-server.jar --max_idle_secs=10800 --connect_timeout_secs=10 --install_base=/home/username/.cache/bazel/_bazel_username/install/23983a8c74aab550c8acc97c193a0f1f --install_md5=23983a8c74aab550c8acc97c193a0f1f --output_base=/home/username/.cache/bazel/_bazel_username/d80c0ea9e24d79779b730e31ad544b46 --workspace_directory=/home/username/dev/XXX --deep_execroot --experimental_oom_more_eagerly_threshold=100 --nofatal_event_bus_exceptions --client_debug=false --host_jvm_args=-Dbazel.DigestFunction=SHA1 --product_name=Bazel --option_sources=host_Ujvm_Uargs:/home/username/dev/XXX/tools/bazel.rc
username  11791  0.0  0.0  13328  3092 ?        S    18:30   0:00 /home/username/.cache/bazel/_bazel_username/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/_bin/linux-sandbox -- /bin/true
username  11792  0.0  0.0  14356  1212 ?        D    18:30   0:00 /home/username/.cache/bazel/_bazel_username/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/_bin/linux-sandbox -- /bin/true
username  12202  0.0  0.0  14228  1016 pts/19   S+   18:37   0:00 grep --color=auto bazel
brunobowden (Author)

We have a theory that this is due to the build-cache server. What's unusual with us is that we regularly switch between the sandboxed and remote build strategies. That may be why this issue hasn't been noticed by others. I'd be happy to dig in with more debug information if you could let me know what would be helpful.

Two more failures happened within the last 24 hours from the same engineer:

$~/dev/DIR$ ps aux | grep bazel
XXX  17988  0.0  0.0  14356  1208 ?        D    16:48   0:00 /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/_bin/linux-sandbox -- /bin/true
XXX  18001  0.0  0.0  14228  1016 pts/2    S+   16:49   0:00 grep --color=auto bazel
$
$:~/dev/DIR$ bazel info
INFO: Reading 'startup' options from /home/XXX/dev/DIR/tools/bazel.rc: --host_jvm_args=-Dbazel.DigestFunction=SHA1
.......
bazel-bin: /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/bazel-out/local_linux-fastbuild/bin
bazel-genfiles: /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/bazel-out/local_linux-fastbuild/genfiles
bazel-testlogs: /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/bazel-out/local_linux-fastbuild/testlogs
character-encoding: file.encoding = ISO-8859-1, defaultCharset = ISO-8859-1
command_log: /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/command.log
committed-heap-size: 315MB
execution_root: /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__
gc-count: 5
gc-time: 52ms
install_base: /home/XXX/.cache/bazel/_bazel_XXX/install/23983a8c74aab550c8acc97c193a0f1f
java-home: /usr/lib/jvm/java-8-oracle/jre
java-runtime: Java(TM) SE Runtime Environment (build 1.8.0_131-b11) by Oracle Corporation
java-vm: Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode) by Oracle Corporation
max-heap-size: 3728MB
message_log: /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/message.log
output_base: /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46
output_path: /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/execroot/DIR/bazel-out
package_path: %workspace%
release: release 0.5.2
server_pid: 18029
used-heap-size: 89MB
workspace: /home/XXX/dev/DIR
$:~/dev/DIR$ ps aux | grep bazel
XXX  17591  0.0  0.0 135368  6320 ?        Sl   16:48   0:00 /usr/bin/bazel build -- //DIR/... -//third_party/...
XXX  17620 91.8  3.5 9961176 574416 ?      Ssl  16:48   0:21 bazel(dev) -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46 -Xverify:none -Djava.util.logging.config.file=/home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/javalog.properties -Djava.library.path=/home/XXX/.cache/bazel/_bazel_XXX/install/23983a8c74aab550c8acc97c193a0f1f/_embedded_binaries/ -Dfile.encoding=ISO-8859-1 -Dbazel.DigestFunction=SHA1 -jar /home/XXX/.cache/bazel/_bazel_XXX/install/23983a8c74aab550c8acc97c193a0f1f/_embedded_binaries/A-server.jar --max_idle_secs=10800 --connect_timeout_secs=10 --install_base=/home/XXX/.cache/bazel/_bazel_XXX/install/23983a8c74aab550c8acc97c193a0f1f --install_md5=23983a8c74aab550c8acc97c193a0f1f --output_base=/home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46 --workspace_directory=/home/XXX/dev/DIR --deep_execroot --experimental_oom_more_eagerly_threshold=100 --nofatal_event_bus_exceptions --client_debug=false --host_jvm_args=-Dbazel.DigestFunction=SHA1 --product_name=Bazel --option_sources=host_Ujvm_Uargs:/home/XXX/dev/DIR/tools/bazel.rc
XXX  17987  0.0  0.0  13328  3020 ?        S    16:48   0:00 /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/_bin/linux-sandbox -- /bin/true
XXX  17988  0.0  0.0  14356  1208 ?        D    16:48   0:00 /home/XXX/.cache/bazel/_bazel_XXX/d80c0ea9e24d79779b730e31ad544b46/execroot/__main__/_bin/linux-sandbox -- /bin/true
XXX  17995  0.0  0.0  14228   976 pts/2    S+   16:49   0:00 grep --color=auto bazel

dslomov (Contributor) commented Aug 4, 2017

@philwo could you take a look?

igor0 commented Aug 4, 2017

A few comments from me, based on what I saw in tensorflow/tensorflow#717 (comment).

@brunobowden: if you get a reproduction with a bazel process in an uninterruptible sleep, can you get its stack trace? For example, if the PID of your hung process is 17988, you can run:

sudo cat /proc/17988/stack

Hopefully the stack trace will pinpoint where in the kernel Bazel is stuck.
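
If several linux-sandbox processes are stuck at once, something like this (just a sketch, assuming the hung processes show up in ps as "linux-sandbox" in state D) dumps all of their kernel stacks in one go:

# List linux-sandbox processes in uninterruptible sleep and dump their
# kernel stacks; /proc/<pid>/stack needs root to read.
for pid in $(ps -eo pid,stat,comm | awk '$2 ~ /^D/ && $3 == "linux-sandbox" {print $1}'); do
  echo "=== pid $pid ==="
  sudo cat /proc/$pid/stack
done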

I am aware of two possible forms of this issue:

  • If your bazel build touches any storage on NFS and the NFS server is inaccessible, bazel build will hang in an uninterruptible sleep. (At least this is the behavior with the default mount options.)
  • Even if your bazel build does not touch any NFS storage, but you have NFS shares mounted locally, bazel startup seems to enumerate all of the local mounts, presumably as part of setting up the Linux mount namespace. If any of those mounts are hung, bazel will hang in uninterruptible sleep on startup, even if the mounts are completely unrelated to what bazel is doing.

The first form of the issue is really not the fault of bazel, since any application accessing the mount will behave similarly. The second form of the issue could be argued to be an issue with bazel, or possibly the implementation of Linux namespaces (if that's where the behavior is coming from).

brunobowden (Author)

@igor0 - you were right on target with your NFS speculation. Here's the stack trace:

$ sudo cat /proc/8118/stack 
[<ffffffffc0e99b14>] rpc_wait_bit_killable+0x24/0xa0 [sunrpc]
[<ffffffffc0e9aa0a>] __rpc_execute+0x15a/0x460 [sunrpc]
[<ffffffffc0e9cfa1>] rpc_execute+0x61/0xc0 [sunrpc]
[<ffffffffc0e8de04>] rpc_run_task+0x104/0x140 [sunrpc]
[<ffffffffc0e8df50>] rpc_call_sync+0x50/0xc0 [sunrpc]
[<ffffffffc17a3609>] nfs3_rpc_wrapper.constprop.12+0x39/0xa0 [nfsv3]
[<ffffffffc17a42b6>] nfs3_proc_getattr+0x56/0xb0 [nfsv3]
[<ffffffffc188e6ca>] __nfs_revalidate_inode+0xca/0x350 [nfs]
[<ffffffffc1885b25>] nfs_weak_revalidate+0xb5/0x110 [nfs]
[<ffffffff8f85023a>] complete_walk+0x6a/0x80
[<ffffffff8f851ab1>] path_lookupat+0x91/0x110
[<ffffffff8f854d21>] filename_lookup+0xb1/0x180
[<ffffffff8f854ec6>] user_path_at_empty+0x36/0x40
[<ffffffff8f868f89>] do_mount+0x69/0xd20
[<ffffffff8f869f75>] SyS_mount+0x95/0xe0
[<ffffffff8fed3ebb>] entry_SYSCALL_64_fastpath+0x1e/0xad
[<ffffffffffffffff>] 0xffffffffffffffff

Here's a summarized ps aux output with the stuck process (it's in the D state, so uninterruptible rather than a true zombie):

$ ps -aux | grep bazel
....
XXXX      8114  0.0  0.0  13328  3064 ?        S    18:15   0:00 /home/XXXX/.cache/bazel/_bazel_XXXX/862dc87c72b9cb65b3a70327b1cd1b26/execroot/__main__/_bin/linux-sandbox -- /bin/true
XXXX      8118  0.0  0.0  14356  1212 ?        D    18:15   0:00 /home/XXXX/.cache/bazel/_bazel_XXXX/862dc87c72b9cb65b3a70327b1cd1b26/execroot/__main__/_bin/linux-sandbox -- /bin/true
XXXX      8569  0.0  0.0  14224  1084 pts/2    S+   18:19   0:00 grep --color=auto bazel
XXXX     19051  0.3  2.6 6928252 441788 ?      Sl   17:47   0:06 /home/XXXX/.cache/bazel/_bazel_XXXX/862dc87c72b9cb65b3a70327b1cd1b26/execroot/__main__/external/local_jdk/bin/java -Xbootclasspath/p:external/bazel_tools/third_party/java/jdk/langtools/javac-9-dev-r4023-2.jar -XX:+TieredCompilation -XX:TieredStopAtLevel=1 -jar external/bazel_tools/tools/jdk/JavaBuilder_deploy.jar --persistent_worker
....

brunobowden (Author)

This would also match up with what I mentioned earlier about the problem occurring when we switched from using the build-cache to --config=away. Perhaps it just coincides with the mounts becoming inaccessible and has nothing to do with the build-cache itself.

brunobowden (Author)

I just had another occurrence with exactly the same stack trace (copied below). I won't report further stack traces unless I see something different from what was reported before. Logging the user out and back in doesn't fix the issue on Ubuntu; only a restart does.

$ sudo cat /proc/19042/stack
[<ffffffffc1097b14>] rpc_wait_bit_killable+0x24/0xa0 [sunrpc]
[<ffffffffc1098a0a>] __rpc_execute+0x15a/0x460 [sunrpc]
[<ffffffffc109afa1>] rpc_execute+0x61/0xc0 [sunrpc]
[<ffffffffc108be04>] rpc_run_task+0x104/0x140 [sunrpc]
[<ffffffffc108bf50>] rpc_call_sync+0x50/0xc0 [sunrpc]
[<ffffffffc1c61609>] nfs3_rpc_wrapper.constprop.12+0x39/0xa0 [nfsv3]
[<ffffffffc1c622b6>] nfs3_proc_getattr+0x56/0xb0 [nfsv3]
[<ffffffffc19e16ca>] __nfs_revalidate_inode+0xca/0x350 [nfs]
[<ffffffffc19d8b25>] nfs_weak_revalidate+0xb5/0x110 [nfs]
[<ffffffff9a45022a>] complete_walk+0x6a/0x80
[<ffffffff9a451aa1>] path_lookupat+0x91/0x110
[<ffffffff9a454c91>] filename_lookup+0xb1/0x180
[<ffffffff9a454e36>] user_path_at_empty+0x36/0x40
[<ffffffff9a468e39>] do_mount+0x69/0xd20
[<ffffffff9a469e25>] SyS_mount+0x95/0xe0
[<ffffffff9aad3ebb>] entry_SYSCALL_64_fastpath+0x1e/0xad
[<ffffffffffffffff>] 0xffffffffffffffff

philwo (Member) commented Aug 10, 2017

I think you found the correct cause here. The sandbox iterates over all mount points in order to make them read-only in its mount namespace via a bind/remount mount syscall, so I could imagine this causing a hang when an NFS mount is not available.

I don't think there's a way to avoid this. I would advise simply not having inaccessible NFS mount points.
If anyone has a suggestion for how to detect and work around this on the sandbox side, I'd be happy to discuss it though :)
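
For the curious, a rough shell sketch of that pattern (not Bazel's actual code, which lives in the linux-sandbox binary) is below; the remount of a dead NFS mount point is where the uninterruptible sleep would happen:

# Rough sketch only: inside a fresh mount namespace, bind-remount every
# existing mount point read-only. The mount(2) call blocks in
# uninterruptible sleep if a mount point is backed by an unreachable
# NFS server.
sudo unshare --mount /bin/sh <<'EOF'
while read -r _dev mountpoint _rest; do
  mount -o remount,ro,bind "$mountpoint" 2>/dev/null  # hangs on a dead NFS mount
done < /proc/self/mounts
EOF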

brunobowden (Author)

We've figured out a workaround: if you just umount -l the disconnected mounts, bazel immediately starts working again. Sometime later the uninterruptible processes disappear without any further intervention... presumably the lazy unmount lets the outstanding I/O eventually time out, which allows the stuck processes to exit. A transcript follows, with a sketch for spotting the wedged mounts after it.

$ ps -aux | grep bazel
....
bruno    27082  0.0  0.0  14356  1212 ?        D    12:21   0:00 /home/bruno/.cache/bazel/_bazel_bruno/cb25f06abe1c2f7684e373eaf9cdc17e/execroot/__main__/_bin/linux-sandbox -- /bin/true
$
# lazy unmount
$ sudo umount -l /XXX
$
# away mode required as build cache is inaccessible
$ bazel build --config=away XXX
INFO: Reading 'startup' options from /home/bruno/av/tools/bazel.rc: --host_jvm_args=-Dbazel.DigestFunction=SHA1
.......
INFO: Found 123 targets...
INFO: Elapsed time: 13.553s, Critical Path: 6.09s
# bazel is now working again
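
To see which mounts are wedged before lazy-unmounting them, a hypothetical helper like the one below (not part of Bazel; the 5-second timeout is arbitrary) probes each NFS mount and flags the ones that don't respond:

# Probe every NFS mount with a short timeout; anything that doesn't answer
# is a candidate for 'sudo umount -l'. Note that on some badly hung mounts
# even this probe can wedge despite the timeout.
while read -r _dev mountpoint fstype _rest; do
  case "$fstype" in
    nfs|nfs4)
      if ! timeout -k 2 5 stat -t "$mountpoint" > /dev/null 2>&1; then
        echo "unresponsive: $mountpoint  (try: sudo umount -l $mountpoint)"
      fi
      ;;
  esac
done < /proc/self/mounts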

igor0 commented Aug 10, 2017

@philwo: is the sandbox code able to report errors, say by printing to stderr?

If that is an option, a possible improvement would be to print an error if a remount takes more than, say, 60 seconds. When you hit this issue, seeing an error message that tells you what's wrong would be very helpful.
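
Until something like that exists inside the sandbox itself, an external watchdog is a crude stopgap (a sketch, not the in-sandbox error reporting proposed above; note that etimes is time since process start, which only approximates time spent blocked):

# Warn about linux-sandbox processes that are in uninterruptible sleep ("D")
# and have been alive for more than 60 seconds.
ps -eo pid,stat,etimes,comm | awk '$2 ~ /^D/ && $3 > 60 && $4 == "linux-sandbox" {
  print "possible hung sandbox: pid " $1 ", ~" $3 "s old; check /proc/" $1 "/stack"
}'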

brunobowden (Author)

Bonus points if you can detect which mount it's blocked on and suggest unmounting it, if that can be done safely.

werkt (Contributor) commented Jan 2, 2018

We have disabled the SandboxModule locally to combat this. I have also tried augmenting the mount enumeration code to find mount points matching a particular filesystem type or prefix and remount those as a read-only empty tmpfs instead (since a private unmount doesn't seem to be possible, and would likely exhibit the hang anyway), similar to the extra mount point locations that can be specified as options.

I suppose it is our fault for 'letting' our machines operate in the degraded state of possibly hung mounts, but I have several issues with the existing solution:

  • There is no way to disable the isSupported invocation of linux-sandbox through options (specifying an entirely standalone spawn/genrule strategy does not prevent it): on every build (not per daemon instance, incidentally), bazel will try to invoke linux-sandbox /bin/true. I'm not sure why it has to call isSupported each time...
  • There are no options passed to this isSupported invocation by default (we added a passthrough from the execution options).
  • There are no options to control which modules are included in BazelMain::BAZEL_MODULES, which would make it possible to remove a bazel daemon's ability to support Sandbox (or standalone, or any other problematic or experimental module).

Compounding this is the fact that sandboxing buys us so little that we long ago decided to turn it off in favor of buildfarm remote execution: as described here in no uncertain terms, an unreachable remotely served NFS mount is exactly the kind of problem the sandbox's mount enumeration gleefully exposes, while remote execution cuts it off entirely.

jin added the team-Local-Exec and untriaged labels on May 6, 2020
jmmv (Contributor) commented May 14, 2020

I don't think we can do anything reasonable here in the presence of non-functional NFS mounts. The sandbox can be disabled, so I think that's a valid workaround. And last I looked, we now cache the sandboxing state for the life of the Bazel server rather than checking it on every build as we did before.
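
For reference, disabling the local sandbox from a .bazelrc looks roughly like this (the standalone strategy flags are the ones mentioned earlier in this thread; exact flag names vary across Bazel releases, so check bazel help build for yours):

# .bazelrc: run spawns and genrules without the linux-sandbox
# (older releases; newer ones also accept --spawn_strategy=local)
build --spawn_strategy=standalone
build --genrule_strategy=standalone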
