roachtest: synctest failed on Pebble #48603
Hi @petermattis, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
See cockroachdb#48603 Release note: None
There might've been some code drift since May when we skipped it, because charybdefs fails to build now. I'll look into that first. I found two main areas where a filesystem error is considered unrecoverable in Pebble: Writing & syncing the WAL: We can add a facility to catch |
Fix the synctest roachtest and `cockroach debug synctest` commands. It fixes a compilation issue in setting up the roachtest dependency and hooks up a custom Pebble Logger to handle calls to Fatalf that occur with some filesystem errors. Fix cockroachdb#48603. Release note: none
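The idea of routing Fatalf through a custom logger can be sketched as below. This is a minimal illustration, not the code from the fix: the `Logger` interface is declared locally as an assumption about the shape of such a hook, and `catchingLogger`, `fatalError` and `runCatchingFatal` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Logger is a locally declared stand-in for a storage-engine logger with a
// fatal hook. It is an assumption for illustration, not imported from Pebble.
type Logger interface {
	Infof(format string, args ...interface{})
	Fatalf(format string, args ...interface{})
}

// fatalError is the value panicked by catchingLogger.Fatalf.
type fatalError struct{ msg string }

func (e *fatalError) Error() string { return e.msg }

// catchingLogger turns Fatalf into a panic carrying *fatalError, so a test
// harness can recover instead of the process exiting.
type catchingLogger struct{}

func (catchingLogger) Infof(format string, args ...interface{}) {
	fmt.Printf(format+"\n", args...)
}

func (catchingLogger) Fatalf(format string, args ...interface{}) {
	panic(&fatalError{msg: fmt.Sprintf(format, args...)})
}

// runCatchingFatal runs fn and converts a Fatalf panic into an error.
// Any other panic is re-raised unchanged.
func runCatchingFatal(fn func(Logger)) (err error) {
	defer func() {
		if r := recover(); r != nil {
			fe, ok := r.(*fatalError)
			if !ok {
				panic(r)
			}
			err = fe
		}
	}()
	fn(catchingLogger{})
	return nil
}

func main() {
	err := runCatchingFatal(func(l Logger) {
		// Stand-in for a WAL sync that hits an injected filesystem error.
		l.Fatalf("WAL sync failed: %v", errors.New("injected EIO"))
	})
	fmt.Println("recovered:", err)
}
```

One caveat with this shape: `recover` only catches a panic raised on the same goroutine, so a Fatalf issued from a background goroutine would need a different mechanism (for example, recording the error and signalling the test).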
I'm not sure if I have a bug in my adaptation of the test or there's a real issue here, but using the adapted test I can consistently produce a disk stall that seemingly never recovers. The stalled goroutine stack trace looks something like:
I haven't been able to produce these stalls with charybdefs disabled, which makes me think this is not a real disk stall caused by high write volume. I tried using
The rest of the strace is futex syscalls until my introspection code times out and prints the stack trace and kills the test. It seems suspicious that the errno here is |
This program also hangs.
RocksDB does not use non-blocking IO, whereas Go does, right? It seems like there's no real issue here, other than that we need charybdefs to not return specifically EAGAIN. |
I thought non-blocking IO on file descriptors wasn't a useful thing. Perhaps the issue is that the Go runtime always retries system calls on EAGAIN. I agree with your conclusion that changing charybdefs to not return EAGAIN will likely fix the issue. |
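A toy model of the hypothesis in this comment: if a caller treats EAGAIN as "try again" and the fault injector answers EAGAIN every time, the call never completes. This is not the Go runtime's actual code; `readFunc`, `retryOnEAGAIN`, `alwaysEAGAIN` and `alwaysEIO` are hypothetical names, and the attempt cap exists only so the sketch terminates.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// readFunc stands in for a read system call (hypothetical signature).
type readFunc func(buf []byte) (int, error)

// retryOnEAGAIN keeps calling read while it fails with EAGAIN and gives up
// after maxAttempts. Without the cap, a filesystem that always answers
// EAGAIN would keep this loop spinning forever.
func retryOnEAGAIN(read readFunc, buf []byte, maxAttempts int) (n, attempts int, err error) {
	for attempts < maxAttempts {
		attempts++
		n, err = read(buf)
		if !errors.Is(err, syscall.EAGAIN) {
			// Success or a non-retryable error: return immediately.
			return n, attempts, err
		}
	}
	return 0, attempts, err
}

// alwaysEAGAIN models a fault injector that returns EAGAIN on every call.
func alwaysEAGAIN(buf []byte) (int, error) { return 0, syscall.EAGAIN }

// alwaysEIO models an injector that returns a non-retryable error.
func alwaysEIO(buf []byte) (int, error) { return 0, syscall.EIO }

func main() {
	buf := make([]byte, 8)

	_, attempts, err := retryOnEAGAIN(alwaysEAGAIN, buf, 1000)
	fmt.Println("EAGAIN injector: attempts =", attempts, "err =", err)

	_, attempts, err = retryOnEAGAIN(alwaysEIO, buf, 1000)
	fmt.Println("EIO injector: attempts =", attempts, "err =", err)
}
```

With the EIO injector the error surfaces on the first attempt, which matches the suggestion that having charybdefs return something other than EAGAIN would avoid the hang.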
If that's the case, I wonder if any distributed file systems might ever return |
Possibly. Maybe. I thought we had evidence they were due to threads being stuck in the kernel forever. |
Can confirm - in some of those cases, the Go runtime wasn't involved at all. We saw |
I think the action item here is to first get scylladb/charybdefs#24 across the line (or change the roachtest to use our own fork). Not sure on relative priority here. |
Drive-by comment (came across while looking for examples of failure-injection in roachtests)... seems that a simple stopgap is to intercept |
Linking this to #80986. If we remove We'll need to make a call as to whether to keep this test around (and land the fix to |
manually reviewed and brought up to date |
@jbowens, here is my current understanding. Please correct where mistaken.
btw, I am confused by the cockroach/pkg/cli/debug_synctest.go Lines 90 to 102 in b3c64ee
generation decides the directory for the DB, aren't we opening an empty DB each time, so the previous writes won't be seen. I must be missing something obvious.
|
Yeah, that's right.
Oh wow, I've completely forgotten about that. Yeah, they might be sufficient.
Yeah, I think that's true. Relatedly, we have #96670 tracking better EAR VFS crash testing. In #102492 we stopped using charybdefs for disk stall roachtests because it was flaky during setup and teardown, so +1 to "questionable value to fix and maintain".
It looks like it intends for clean non-corrupted databases to exit with |
Thanks for the responses. I am going to delete this test. |
This test has been skipped since May 2020, because Pebble code is written to crash on an error when writing to the MANIFEST and the WAL, since it is not viable to continue. Additionally, there were problems with hangs due to charybdefs returning EAGAIN (which have since been fixed).

We are deleting this now since:
- The purpose of this test is to test that Pebble does not lose data due to filesystem errors. It does so in a cruder manner than Pebble's strict MemFS based testing (which came later). Given the existence of the latter, there is no good reason to retain this test.
- Working around charybdefs flakiness (see cockroachdb#102492), and Pebble's failstop-on-error behavior is not worth the initial and ongoing maintenance effort.

Epic: none
Fixes: cockroachdb#48603
Release note: None
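For readers unfamiliar with the "strict" in-memory filesystem idea mentioned above, here is a toy model of the principle: writes that were never synced are discarded on a simulated crash. This is a sketch under that assumption only; `strictFile` and its methods are hypothetical and are not Pebble's MemFS API.

```go
package main

import "fmt"

// strictFile is a toy in-memory file that tracks which bytes have been made
// durable by Sync. It is a hypothetical type, not Pebble's MemFS.
type strictFile struct {
	data   []byte // everything written so far
	synced int    // length of the prefix known to be durable
}

// Write appends to the file without making the bytes durable.
func (f *strictFile) Write(p []byte) { f.data = append(f.data, p...) }

// Sync marks everything written so far as durable.
func (f *strictFile) Sync() { f.synced = len(f.data) }

// Crash simulates power loss: bytes written after the last Sync are lost.
func (f *strictFile) Crash() { f.data = f.data[:f.synced] }

func main() {
	f := &strictFile{}
	f.Write([]byte("a=1;"))
	f.Sync()
	f.Write([]byte("b=2;")) // never synced
	f.Crash()
	fmt.Printf("after crash: %q\n", f.data)
}
```

After the simulated crash only the synced record remains, so a test built on this model can assert that every write it acknowledged after a Sync is still readable.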
106917: roachtest: delete synctest r=RaduBerinde,itsbilal a=sumeerbhola

This test has been skipped since May 2020, because Pebble code is written to crash on an error when writing to the MANIFEST and the WAL, since it is not viable to continue. Additionally, there were problems with hangs due to charybdefs returning EAGAIN (which have since been fixed).

We are deleting this now since:
- The purpose of this test is to test that Pebble does not lose data due to filesystem errors. It does so in a cruder manner than Pebble's strict MemFS based testing (which came later). Given the existence of the latter, there is no good reason to retain this test.
- Working around charybdefs flakiness (see #102492), and Pebble's failstop-on-error behavior is not worth the initial and ongoing maintenance effort.

Epic: none
Fixes: #48603
Release note: None

106941: build: use pnpm for cluster-ui auto-publishing r=nathanstilwell,rickystewart a=sjbarag

The pkg/ui/ tree recently switched to pnpm (from yarn) for package management [1], but the GitHub workflow to automatically publish new versions of pkg/ui/workspaces/cluster-ui wasn't updated to align with that. Use pnpm for dependency management when auto-publishing canary and full-release versions of cluster-ui.

Release note: None
Epic: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Sean Barag <barag@cockroachlabs.com>
Seen while running roachtests on #48145

The `no message of desired type` is a strange error message. Given the error injection that `synctest` performs, a crash here does not seem unexpected. Hmm, this is not running `cockroach` normally, but running `cockroach debug synctest`. Perhaps that isn't compatible with Pebble. I'm not seeing any facility to catch log fatals.

Jira issue: CRDB-4311