roachtest: stability problems during tpcc-{5,10,20}k on {6,12,24} nodes #31409
Comments
Which cluster is this? P.S. You grabbed the logs for
andy-1539640859-tpccbench-nodes-6-cpu-16-partition
It looks like the cluster has been wiped. Update: oops, I fooled myself by not accounting for the cluster being run by a different user. Logs are still there.
Some sort of memory leak was occurring only on
I'm not seeing anything interesting in the logs, but I don't have time to do a thorough check right now.
The filename of the heap_profiler profile used for this is
which puts this close to the crash, probably within a minute, because otherwise I'd expect there to be a second profile (but there isn't). What we're seeing is lots of memory allocated while unmarshaling entries. I don't understand the last row of boxes here; perhaps @nvanbenschoten or @petermattis can help me interpret it.

Looking at alloc_space, things are a lot more spread out. Of the bigger offenders, the largest is

Looking at the logs, I'm seeing this node require lots of Raft snapshots just before it dies. I think what's happening here is that during the restore, n3 falls behind on a number of Raft groups. It then needs lots of snapshots, and those contain boatloads of AddSSTable commands, which are now inlined. Things go south quickly from here.

I looked at the size of the snapshots. They don't seem absurdly high, but they are all pulled into memory at the receiving end.
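To make the memory-amplification concern concrete, here is a minimal sketch (not CockroachDB's actual snapshot receive path; all names are made up for illustration) of what "pulled into memory at the receiving end" implies: every streamed chunk is retained until the whole snapshot has arrived, so a snapshot carrying large inlined AddSSTable payloads occupies its full size in RAM on the receiver.

```go
// Minimal sketch (not the actual CockroachDB code path) of why inbound
// snapshots are memory-hungry when fully buffered before application:
// every received batch is retained until the snapshot is complete, so a
// snapshot with large inlined AddSSTable payloads occupies its full size
// in RAM on the receiving node, on top of transient decoding allocations.
package main

import "fmt"

// recvBatch stands in for one streamed chunk of snapshot data.
type recvBatch []byte

// receiveSnapshot buffers all batches in memory and only then "applies"
// them. The batches/apply parameters exist only for this illustration.
func receiveSnapshot(batches []recvBatch, apply func([]byte)) int {
	var buf []byte
	for _, b := range batches {
		buf = append(buf, b...) // the entire snapshot accumulates here
	}
	apply(buf)
	return len(buf)
}

func main() {
	// Three 64 MB chunks => ~192 MB held in memory for one in-flight snapshot.
	batches := []recvBatch{
		make(recvBatch, 64<<20),
		make(recvBatch, 64<<20),
		make(recvBatch, 64<<20),
	}
	total := receiveSnapshot(batches, func(data []byte) {})
	fmt.Printf("buffered %d MB before applying\n", total>>20)
}
```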
As far as timing goes, the test runs from ~10pm to ~11pm. Here's the number of Raft snapshots within each minute:
Now those are lots of snapshots, but many of them are small. Let's restrict to snapshots that are at least 10mb:
They all come in just as things take a severe turn south. This is worth a ping to #10320 (KV-level memory accounting), though that would be the second line of defense. The first question is why the cluster decided it was OK to not take n3 into account when truncating the Raft logs for lots of ranges. We can see from the logs that there are Raft snapshots essentially throughout the test, which is unexpected: Raft snapshots are a recovery mechanism that shouldn't usually be triggered during normal cluster operation.
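For reference, per-minute counts like the ones discussed above can be derived from the node logs with a small script along the following lines. This is only a sketch: the log filename and the "Raft snapshot" substring are assumptions, and the exact message differs between CockroachDB versions.

```go
// Rough sketch of bucketing Raft-snapshot log lines per minute. The log
// filename and the "Raft snapshot" substring are assumptions for
// illustration; adjust them to the actual log message in your version.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	f, err := os.Open("cockroach.log") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "Raft snapshot") {
			continue
		}
		// CockroachDB log lines are glog-style, e.g. "I181015 22:09:32.178 ...";
		// bucket by the date plus the HH:MM prefix of the timestamp.
		fields := strings.Fields(line)
		if len(fields) < 2 || len(fields[1]) < 5 {
			continue
		}
		counts[fields[0][1:]+" "+fields[1][:5]]++
	}
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s %d\n", k, counts[k])
	}
}
```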
16 of these big snapshots arrive at an initialized replica (i.e. one that knows its bounds). The other 28 hit one that doesn't know its bounds. This ratio is approximately the same when you include all snapshots (105 vs 268), so it's just that we saw more and bigger Raft snapshots, not different ones. (Computed via
cockroach/pkg/storage/raft_log_queue.go Lines 110 to 116 in a0b7cd4
suggests that we'll fall back to Raft snapshots when the estimated Raft log size is larger than the replica itself, and I suspect that this is a little noisy/aggressive for small replicas. But especially for this situation there's another catch, namely that the target size is additionally clamped to 4MB (which is going to kick in in the vast majority of cases):
cockroach/pkg/storage/raft_log_queue.go Line 122 in a0b7cd4
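To paraphrase the heuristic being discussed (this is a simplification of the linked raft_log_queue.go code, not a copy of it): the truncation target tracks the replica's size but is clamped to 4MB, and once the estimated Raft log size exceeds the target, the queue truncates up to the quorum commit index, leaving slower followers to be caught up via Raft snapshot.

```go
// Paraphrase of the truncation heuristic described above, not the exact
// code in pkg/storage/raft_log_queue.go.
package main

import "fmt"

const maxRaftLogTargetSize = 4 << 20 // the 4MB clamp discussed above

func targetLogSize(replicaSizeBytes int64) int64 {
	target := replicaSizeBytes
	if target > maxRaftLogTargetSize {
		target = maxRaftLogTargetSize
	}
	return target
}

// truncateTo picks the truncation index: if the estimated log size exceeds
// the target, truncate to the quorum commit index even if some followers
// (like n3 here) haven't caught up yet; otherwise only truncate entries
// that every follower has already applied.
func truncateTo(estLogSize, replicaSize int64, quorumIndex, slowestFollowerIndex uint64) uint64 {
	if estLogSize > targetLogSize(replicaSize) {
		return quorumIndex
	}
	return slowestFollowerIndex
}

func main() {
	// A single sideloaded SST of ~32MB (hypothetical size) counted toward the
	// log-size estimate blows straight past the 4MB clamp, so the lagging
	// follower at index 90 is cut off and will need a Raft snapshot.
	fmt.Println(truncateTo(32<<20, 64<<20, 100, 90)) // -> 100
}
```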
My reading of this is that once you have a few SSTs in your replica, you're going to aggressively truncate to the quorum commit index, which basically means leaving a lagging replica behind. I assume that due to the geo-distributed nature of this cluster, n3 is prone to being that odd one out, and even more so once it gets to apply all of these snapshots, which come with a large memory amplification, perhaps exacerbated by the sideloading of the SSTs.
Yes, the Raft log is limited to a 4MB max size, but proposal quota is supposed to avoid ever hitting that limit. And even if we do hit the truncation limit and allow a replica to fall behind and require a Raft snapshot, the Raft snapshots are serialized by the Raft snapshot queue, so there should be a limited number outbound, and inbound snapshots are also serialized. I'm confused about where the excess memory usage is coming from.
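For context, the proposal quota mechanism referred to here can be pictured with the simplified sketch below (not CockroachDB's quota pool implementation; sizes and names are illustrative): a proposal must acquire quota proportional to its payload before it is proposed, and quota is only returned once the followers have appended the entries, so a slow follower back-pressures the leaseholder instead of letting the log grow past the truncation threshold.

```go
// Simplified illustration of a proposal quota pool, not CockroachDB's
// implementation: proposals must acquire quota sized to their payload, and
// quota is released only after the followers have caught up, so a lagging
// follower throttles the leaseholder rather than letting the Raft log grow
// past the truncation threshold.
package main

import "fmt"

type quotaPool struct {
	available int64
}

func newQuotaPool(size int64) *quotaPool { return &quotaPool{available: size} }

// acquire reports whether the pool can cover the proposal; a real pool would
// block the caller until release() returns enough quota.
func (q *quotaPool) acquire(n int64) bool {
	if n > q.available {
		return false
	}
	q.available -= n
	return true
}

// release returns quota once followers have appended the entries.
func (q *quotaPool) release(n int64) { q.available += n }

func main() {
	q := newQuotaPool(1 << 20)        // 1MB pool (illustrative size)
	fmt.Println(q.acquire(512 << 10)) // true: proposal admitted
	fmt.Println(q.acquire(768 << 10)) // false: must wait for followers
	q.release(512 << 10)
	fmt.Println(q.acquire(768 << 10)) // true once quota returns
}
```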
This is interesting. We limit the size of the Raft log because when the Raft log gets large, applying the entries can be slower than sending a snapshot. But that heuristic breaks down when the Raft log contains sideloaded SSTables.
See cockroachdb#31409. Release note: None
I haven't dared to look more, but this is definitely not what we want to see while running this test:
Seeing long-lived stretches of
I wonder if this is related to #31330. @benesch, as an aside: this looks like a merge, but why is the event type "unknown"? This range generally seems to be hanging around alone, which I think means it must've been merged away (I'm running master at 6773854, which has the merge queue on).
The "snapshot intersects existing range" error is logged at such a high rate that all interesting history has been rotated away from the log files. Edit: experimentally measured a rate of ~114 of these log messages per second.
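One generic mitigation for this kind of log spam is to throttle the message, as in the sketch below (a plain time-based gate written for illustration, not whatever rate limiting CockroachDB applies): emit the error at most once per interval and fold suppressed occurrences into a counter.

```go
// Minimal time-based log throttle, sketched as a generic mitigation for the
// log spam described above; it is not the throttling CockroachDB uses. The
// error is printed at most once per interval, with a count of suppressed
// occurrences folded into the next emitted line.
package main

import (
	"fmt"
	"time"
)

type throttledLogger struct {
	every      time.Duration
	last       time.Time
	suppressed int
}

func (t *throttledLogger) Logf(format string, args ...interface{}) {
	now := time.Now()
	if now.Sub(t.last) < t.every {
		t.suppressed++
		return
	}
	if t.suppressed > 0 {
		fmt.Printf("(%d similar messages suppressed) ", t.suppressed)
		t.suppressed = 0
	}
	t.last = now
	fmt.Printf(format+"\n", args...)
}

func main() {
	lg := &throttledLogger{every: 5 * time.Second}
	// At ~114 errors/second this would emit one line per 5s instead of ~570;
	// in this fast demo loop only the first call is printed.
	for i := 0; i < 1000; i++ {
		lg.Logf("snapshot intersects existing range (occurrence %d)", i)
	}
}
```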
Touches cockroachdb#31409. Release note: None
Touches cockroachdb#31409. Release note: None
When a Range has followers that aren't replicating properly, splitting that range results in a right-hand side with followers in a similar state. Certain workloads (restore/import/presplit) can run large numbers of splits against a given range, and this can result in a large number of Raft snapshots that backs up the Raft snapshot queue. Ideally we'd never have any ranges that require a snapshot, but over the last weeks it has become clear that this is very difficult to achieve since the knowledge required to decide whether a snapshot can efficiently be prevented is distributed across multiple nodes that don't share the necessary information. This is a bit of a nuclear option to prevent the likely last big culprit in large numbers of Raft snapshots in cockroachdb#31409. With this change, we should expect to see Raft snapshots regularly when a split/scatter phase of an import/restore is active, but never large volumes at once.

Release note: None

When a Range has followers that aren't replicating properly, splitting that range results in a right-hand side with followers in a similar state. Certain workloads (restore/import/presplit) can run large numbers of splits against a given range, and this can result in a large number of Raft snapshots that backs up the Raft snapshot queue. Ideally we'd never have any ranges that require a snapshot, but over the last weeks it has become clear that this is very difficult to achieve since the knowledge required to decide whether a snapshot can efficiently be prevented is distributed across multiple nodes that don't share the necessary information. This commit is a bit of a nuclear option to prevent the likely last big culprit in large numbers of Raft snapshots in cockroachdb#31409. With this change, we should expect to see Raft snapshots regularly when a split/scatter phase of an import/restore is active, but never large volumes at once (except perhaps for an initial spike). Splits are delayed only for manual splits. In particular, the split queue is not affected and could in theory cause Raft snapshots. However, at the present juncture, adding delays in the split queue could cause problems as well, so we retain the previous behavior there, which isn't known to have caused problems. More follow-up work in the area of Raft snapshots will be necessary to add some more sanity to this area of the code.

Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.
32594: storage: delay manual splits that would result in more snapshots r=petermattis a=tbg

This is unpolished, but I had used an earlier version of this with what at the time looked like success. At this point I suspect that this is the best way to suppress Raft snapshot growth in IMPORT/RESTORE. (Definitely needs tests).

----

When a Range has followers that aren't replicating properly, splitting that range results in a right-hand side with followers in a similar state. Certain workloads (restore/import/presplit) can run large numbers of splits against a given range, and this can result in a large number of Raft snapshots that backs up the Raft snapshot queue. Ideally we'd never have any ranges that require a snapshot, but over the last weeks it has become clear that this is very difficult to achieve since the knowledge required to decide whether a snapshot can efficiently be prevented is distributed across multiple nodes that don't share the necessary information. This is a bit of a nuclear option to prevent the likely last big culprit in large numbers of Raft snapshots in #31409. With this change, we should expect to see Raft snapshots regularly when a split/scatter phase of an import/restore is active, but never large volumes at once.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
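The idea in this PR can be pictured with the following sketch (an illustration of the approach described in the commit message, not the actual change; the needsSnapshot callback and all parameters are hypothetical): before executing a manually requested split, wait, bounded by a maximum delay, until no follower of the range appears to need a Raft snapshot.

```go
// Illustration of the "delay manual splits" idea from the PR above, not the
// actual implementation: before carrying out a manually requested split, wait
// (bounded by maxDelay) until no follower of the range appears to need a Raft
// snapshot, so the split doesn't mint a right-hand side that immediately needs
// one too. The needsSnapshot callback and the parameters are hypothetical.
package main

import (
	"fmt"
	"time"
)

func maybeDelaySplit(needsSnapshot func() int, maxDelay, tick time.Duration) time.Duration {
	deadline := time.Now().Add(maxDelay)
	var waited time.Duration
	for time.Now().Before(deadline) {
		if needsSnapshot() == 0 {
			return waited // all followers caught up; split immediately
		}
		time.Sleep(tick)
		waited += tick
	}
	return waited // give up and split anyway after maxDelay
}

func main() {
	behind := 3 // pretend three followers currently need snapshots
	waited := maybeDelaySplit(func() int {
		if behind > 0 {
			behind-- // followers gradually catch up
		}
		return behind
	}, 500*time.Millisecond, 50*time.Millisecond)
	fmt.Printf("delayed split by %s\n", waited)
}
```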
This is a dangerous condition. Adding this to the health checker has the additional benefit of logging it during the nightly restore/import tests, which can in turn help diagnose whether a particular run is affected by cockroachdb#31409. Release note: None
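A health-checker rule along these lines (a sketch with a hypothetical metric name and threshold, not the committed code) captures the condition: warn whenever the pending Raft snapshot count stays above a small threshold.

```go
// Sketch of a health check that warns when the number of pending Raft
// snapshots is elevated, in the spirit of the commit above. The metric name,
// threshold, and gauge source are hypothetical, not the committed code.
package main

import "fmt"

type metrics map[string]int64

func checkPendingSnapshots(m metrics, threshold int64) (warnings []string) {
	if v := m["queue.raftsnapshot.pending"]; v > threshold {
		warnings = append(warnings,
			fmt.Sprintf("raft snapshot queue has %d pending snapshots (threshold %d)", v, threshold))
	}
	return warnings
}

func main() {
	m := metrics{"queue.raftsnapshot.pending": 120}
	for _, w := range checkPendingSnapshots(m, 10) {
		fmt.Println("health check:", w)
	}
}
```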
Touches cockroachdb#31409. Release note: None
The nightlies post-#32594 look very promising (they all passed and didn't show Raft snapshot buildup). I'm going to optimistically close this, but @ajwerner and @awoods187, please let me know if imports fail in the way specific to this issue.
Describe the problem
Dead node with 9k underreplicated ranges when running tpcc
To Reproduce
Modified to use roachtest from 2 days ago:
Modified test to use partitioning and 6 nodes:
Ran:
bin/roachtest bench '^tpccbench/nodes=6/cpu=16/partition$$' --wipe=false --user=andy
Expected behavior
No dead nodes
Additional data / screenshots
Environment:
Dead node logs:
cockroach.log