test hung due to Go runtime hang during panic #1876
On helios-1.0.21316, using this build:
(based on oxidecomputer/garbage-compactor#6) I was trying to reproduce the test failures mentioned in #1223 using:
The first full run took 23s (see above). The second run has been hung for 17+ hours. Where is it?
It's been stuck for 17 hours:
Based on prstat, it's not on CPU. It's used very little CPU time:
and those "user" and "sys" numbers aren't going up. What's it doing?
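For reference, that check amounts to something like the following (the pid here is a placeholder):

```
# Per-LWP microstate accounting: shows whether any thread is on CPU or
# still accruing user/system time (the pid is a placeholder).
prstat -mLp 12345 1 1

# Elapsed wall-clock time vs. accumulated CPU time for the process.
ps -o pid,etime,time,args -p 12345
```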
Most interesting to me are thread 1 (which appears to be blocked on a lock while panicking from the Go allocator) and thread 14 (which appears to be blocked on a lock while panicking during garbage collection). I've saved a core file, though I'm not sure what we'll be able to do with it. I'm not sure what other live state we might be able to gather here. In case it's useful, here are the open files:
stdout and stderr are pipes to the Rust test runner, so we won't see the output until/unless I kill this process. In case that doesn't happen for whatever reason, I've saved a core file of the Rust program, though again I'm not sure what we'll be able to do with that. I'm not sure what other state will be useful, so I'm going to collect a bunch.
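Roughly this sort of thing, using the usual proc tools (the pid and output file names are placeholders):

```
# Thread stacks, open files, address-space mappings, and args/environment
# of the hung process, plus a core file taken without killing it.
pstack 12345    > hung.pstack.txt
pfiles 12345    > hung.pfiles.txt
pmap -x 12345   > hung.pmap.txt
pargs -ae 12345 > hung.pargs.txt
gcore -o hung.core 12345
```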
The test program is logging to this file (found by looking at
I didn't find anything of particular note in the logs. I think this makes sense because we haven't actually started CockroachDB yet -- this is all copied from the seed directory. But I'll tar up the whole directory while it remains hung, just in case:
I'm about out of ideas for what to collect from the running state. I'm going to try sending SIGABRT to the Go process, as I think that's supposed to dump out some debugging stuff. I expect at that point we'll get any stdout/stderr that was emitted.
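That is, something like this (the pid is a placeholder); the Go runtime is supposed to treat SIGABRT as fatal and exit with a stack dump of all goroutines, with the GOTRACEBACK environment variable controlling how much detail it prints:

```
# Ask the Go runtime to abort; on the way out it should dump goroutine
# stacks to stderr (which is one of the pipes mentioned above).
kill -ABRT 12345
```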
Well, I killed the process with:
and the test run immediately completed with:
And there's nothing new in the test log. I forgot that we don't dump the stdout/stderr of that command on failure -- we really should. I've also tar'd up the relevant test directories, to the extent that they still exist:
The individual core files are too large for GitHub and it also doesn't support tarballs, so I've put this on our lab machine "catacomb" in
I'd say this is indeed illumos#15254 (#1146) and possibly a second bug in the Go runtime that I'll file separately. From the core file: in thread 1, we can find the
The call site is at +999 (one instruction before the return address of +99e), and shortly before that we load a non-local address into %rax and put it onto the stack.
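In mdb terms, that looks something like the following; the core-file name, addresses, and character count here are placeholders rather than the real values:

```
# Disassemble the function containing the call site, then print the bytes
# of the Go string with an explicit character count rather than as a
# C string (addresses and the count are placeholders).
mdb core.cockroach <<'EOF'
4a1999::dis
4c56a0,0t38/c
EOF
```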
(I cheated and figured out the string length here ahead of time. That's because the actual string is not NUL-terminated in memory, so if we tell mdb to print it as a string, it runs on into a bunch of unrelated adjacent strings.) So far, this looks like one of the variants of #1146. Let's take a look at this span. I loaded the same ::typedef data that I used in #1146. Let's grab the argument to
I expected the argument to be one word ahead of the return address (
going by:
So: does allocBits look corrupted by the same root cause as #1146? Recall that the signature is a repeating 32-byte pattern, the last 16 bytes of which are always zero, and at least some of the leading 16 bytes are not zero.
Not so much. What about gcmarkBits?
Bingo. How long is this run? I used my new tool:
Sure enough, that second hit refers to a similarly-sized run (it covers most of the 64 KiB mapping) that starts shortly before the address in question. This has all the hallmarks of illumos#15254 and is consistent with the analysis in #1146 for the "sweep increased allocation count" failure mode. So I think it's fair to say we hit this bug here, at which point all bets are off. Remember that we also had thread 14 panicking from reportZombies -- another failure mode we've analyzed from the same problem. Again, looking through the stack (this time in the reportZombies frame), we find a valid-looking mspan (in the right place this time) whose allocBits and gcmarkBits are both in that same region:
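Concretely, that check is along these lines in mdb (the core-file name, addresses, and the mspan type name are placeholders, and the type definitions have to be loaded with ::typedef first, as in #1146):

```
# Print the span's two bitmap pointers, then dump a few 8-byte words at
# each to look for the signature: repeating 32-byte blocks whose last 16
# bytes are zero while some of the first 16 bytes are not.
mdb core.cockroach <<'EOF'
4c5a28::print -a mspan allocBits gcmarkBits
c000f00000,8/J
c000f10000,8/J
EOF
```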
So we appear to be panicking twice in this same Go process in two different ways resulting from the %ymm0 corruption issue. I have not dug into the hang itself, but at this point we're far enough off the rails that I'm not sure how interesting the rest of it is. I'm filing a Go bug in case the Go team wants to dig further into the hang.
Filed golang/go#57420.
Also filed https://www.illumos.org/issues/15276 for the pstack issue.
While trying to reproduce #1223 with CockroachDB 22.1.9, I ran into a case where a test hung because the "cockroach version" command that we run during the test suite itself hung, apparently while panicking. Details coming.