Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No crash reason in logs despite roachtest: tpcc/mixed-headroom/n5cpu16 failing due to process exiting with exit code 1 #2039

Closed
joshimhoff opened this issue Oct 21, 2022 · 3 comments

Comments

@joshimhoff
Copy link
Contributor

joshimhoff commented Oct 21, 2022

@renatolabs was debugging a roachtest failure and did the following the discover that the cause of the failure was #2019:

The logs didn’t include anything indeed, which is why this has been a tough test failure to debug (we need to fix whatever is not right with logging). I instrumented exit.WithCode as suggested in this thread but that led nowhere. To get the stack trace, I compiled cockroach with a patched version of the stdlib that dumps a stack trace on os.Exit, and running the test with that

This indicates a bug of some kind. We do not expect to have a process crash without any info about the crash reason in the logs, if the CRDB process is doing the crashing (as opposed to the OOM killer or similar). With that said, it might be a test-only bug.

Here's a theory: Within a roachtest context but NOT within a production context, we are using the pebble default logger, instead of the CRDB logger. The latter takes care to get a log line in an expected file; the former does not seem to, based on the presence of this bug. See the pebble default logger here:

// Fatalf implements the Logger.Fatalf interface.
Within the context of #2019, I believe it is called here: https://github.com/cockroachdb/pebble/blob/master/internal/base/filenames.go#L167. Actually, I think the stacktrace given at #2019 confirms that within the context of the roachtest we are using the pebble default logger: github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/logger.go:32 +0x49 is in the stacktrace, and that corresponds to this line in the code:
os.Exit(1)
With all that said, I am not sure why the pebble default logger fails to get a log line in an expected file pre-crash.

@joshimhoff
Copy link
Contributor Author

@renatolabs and I think the two bugs together lead to the bad behavior of no log entry in the logs.

In the context of the roachtest mentioned above, as already discussed here, we call the pebble default logger instead of the CRDB logging lib. That is, we do the following as per these lines:

	_ = log.Output(2, fmt.Sprintf(format, args...))
	os.Exit(1)

Above should be fine, as it is exactly what log does: https://cs.opensource.google/go/go/+/master:src/log/log.go;l=226!

BUT we also are affected by the bug @srosenberg found:

I think the issue is that the log_bridge.go writes all stdLog messages under Severity_INFO which means they don’t get flushed after log.Fatalf (from pebble) returns, and the next call is os.Exit(1).
Together, they lead to no log entry in the logs, as the log entry is not flushed before the process exists, since it is logged at severity INFO.

I don't understand the details of the bug @srosenberg found. In particular, I'd note I have found many panics in the stderr logs in production, suggesting that a successful flush at least happens in that case. Stan, do you want to elaborate on it a bit?

@joshimhoff
Copy link
Contributor Author

Whoops. I meant to open this in the CRDB repo.

@joshimhoff
Copy link
Contributor Author

Closing in favor of cockroachdb/cockroach#90483.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant