roachtest: stability problems during tpcc-{5,10,20}k on {6,12,24} nodes #31409

Closed

awoods187 opened this issue Oct 15, 2018 · 64 comments

Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

awoods187 (Contributor) commented Oct 15, 2018

Describe the problem

Dead node with 9k under-replicated ranges when running TPC-C.

To Reproduce

Used a roachtest build from two days ago, with the test modified to use partitioning and 6 nodes:

@@ -675,11 +675,12 @@ func registerTPCCBench(r *registry) {
                        // StoreDirVersion: "2.0-5",
                },
                {
-                       Nodes: 3,
+                       Nodes: 6,
                        CPUs:  16,
 
-                       LoadWarehouses: 2000,
-                       EstimatedMax:   1300,
+                       LoadWarehouses: 5000,
+                       EstimatedMax:   3000,
+                       LoadConfig:     singlePartitionedLoadgen,
                },

Ran:

bin/roachtest bench '^tpccbench/nodes=6/cpu=16/partition$$' --wipe=false --user=andy

Expected behavior
No dead nodes

Additional data / screenshots
image

Environment:

  • CockroachDB version 2.1 Beta 1008

Dead node logs:
cockroach.log

petermattis (Collaborator) commented Oct 15, 2018

Which cluster is this?

PS You grabbed the logs for n5, not n3. The confusion between roachprod indexes and node IDs strikes again.

awoods187 (Contributor, Author):

andy-1539640859-tpccbench-nodes-6-cpu-16-partition

petermattis (Collaborator) commented Oct 15, 2018

It looks like the cluster has been wiped.

Update: oops, I fooled myself by not accounting for the cluster being run by a different user. The logs are still there.

awoods187 (Contributor, Author):

cockroach.log

petermattis (Collaborator):

n3 hit an OOM: fatal error: runtime: out of memory. The previous runtime.go message shows:

I181015 23:09:34.905569 360 server/status/runtime.go:465  [n3] runtime stats: 13 GiB RSS, 348 goroutines, 7.0 GiB/1.4
GiB/8.7 GiB GO alloc/idle/total, 3.6 GiB/4.4 GiB CGO alloc/total, 98011.8 CGO/sec, 59.0/34.5 %(u/s)time, 0.0 %gc (0x),
 258 MiB/6.7 MiB (r/w)net

Some sort of memory leak was occurring only on n3:

screen shot 2018-10-15 at 7 58 56 pm

petermattis (Collaborator):

I'm not seeing anything interesting in the logs, but I don't have time to do a thorough check right now.

tbg (Member) commented Oct 16, 2018

inuse_space.svg.txt

The filename of the heap_profiler profile used for this is

memprof.fraction_system_memory.000000012938502144_2018-10-15T23_08_34.684

which puts this close to the crash, probably within a minute of it; otherwise I'd expect there to be a second profile (and there isn't).

What we're seeing is lots of memory allocated while unmarshaling Raft entries.

image

I don't understand the last row of boxes here; perhaps @nvanbenschoten or @petermattis can help me understand what it means.

image

Looking at alloc_space, things are a lot more spread out. Of the bigger offenders, the largest is

image

Looking at the logs, I'm seeing this node require lots of Raft snapshots just before it dies. I think what's happening here is that during the restore, n3 falls behind on a number of Raft groups. Now it needs lots of snapshots, and those will contain boatloads of AddSSTable commands, which are now inlined. Things go south quickly from here.
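
To make the suspected memory amplification concrete, here's a minimal Go sketch of the inlining step (hypothetical names, not the actual crdb types): every sideloaded AddSSTable payload that has to be shipped to a lagging follower is read back into memory in full, so a burst of catch-up traffic means many multi-MB payloads buffered at once.

package main

import (
    "fmt"
    "os"
)

// entry is a toy stand-in for a Raft log entry. AddSSTable payloads are
// normally sideloaded to files on disk and the entry only carries a
// reference; when entries have to be shipped to a follower, the payload is
// inlined again.
type entry struct {
    inline         []byte // payload already in memory, if any
    sideloadedPath string // path to the on-disk SST, if sideloaded
}

// inlineSideloaded materializes every sideloaded payload in memory. If a
// lagging follower needs many such entries at once, each multi-MB SST ends
// up buffered in RAM simultaneously.
func inlineSideloaded(ents []entry) (int, error) {
    total := 0
    for i := range ents {
        if ents[i].inline != nil || ents[i].sideloadedPath == "" {
            continue
        }
        data, err := os.ReadFile(ents[i].sideloadedPath) // pulls the whole SST into memory
        if err != nil {
            return total, err
        }
        ents[i].inline = data
        total += len(data)
    }
    return total, nil
}

func main() {
    ents := []entry{{sideloadedPath: "/tmp/example.sst"}}
    if n, err := inlineSideloaded(ents); err == nil {
        fmt.Printf("inlined %d bytes\n", n)
    }
}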

I looked at the size of the snapshots. They don't seem absurdly high, but they are all pulled into memory at the receiving end:

rg --no-line-number -o 'encoded size=\d+' cockroach.log | sed -e 's/encoded size.//' | sort -rn | less
70188352
70188159
70188102
70188021
70187982
70187952
70187904
70187893
70187674
70187642
70187629
70187525
70187415
70187372
70187360
70187295
70187281
70184560
70183990
70183925
70183778
70183635
70183617
70183585
64491018
63721883
35022151
35016766
35012746
35012711
35012618
35012599
35012556
35012552
35012549
35012503
35012470
35012423
35012406
35012402
35012394
35012392
35012294
35012294
35012292
35012251
35012073
34377270
34377190
34377064
34377061
34376805
34376748
27622064
26108517
19901334
12975627
12079662
8582929
8221034
176283
95793
87492
... many more but smaller

tbg (Member) commented Oct 16, 2018

As far as timing goes, the test runs from ~10pm to ~11pm. Here's the number of Raft snapshots within each minute:

$ rg --no-line-number 'applying Raft snapshot at index' cockroach.log | cut -c 1-13 | uniq -c
   1 I181015 22:06
   1 I181015 22:08
   1 I181015 22:21
   3 I181015 22:48
   9 I181015 22:49
   7 I181015 22:50
   1 I181015 22:51
   3 I181015 22:52
   1 I181015 22:53
   1 I181015 22:54
   2 I181015 22:58
   7 I181015 22:59
   7 I181015 23:00
   5 I181015 23:01
  19 I181015 23:02
   2 I181015 23:03
  40 I181015 23:04
  59 I181015 23:05
  72 I181015 23:06
  62 I181015 23:07
  50 I181015 23:08
  20 I181015 23:09

Now those are lots of snapshots, but many of them are small. Let's restrict to snapshots that are at least 10mb:

$ rg --no-line-number 'applying Raft snapshot at index' cockroach.log | rg 'encoded size=\d{5}' | cut -c 1-13 | uniq -c
   3 I181015 23:04
   8 I181015 23:05
   8 I181015 23:06
   8 I181015 23:07
  10 I181015 23:08
   7 I181015 23:09

They all come in just as things take a severe turn south. This is worth a ping to #10320 (KV-level memory accounting), though that would be the second line of defense. The first question is why the cluster decided that it was OK to not take n3 into account when truncating the Raft logs for lots of ranges.

We can see from the logs that there are Raft snapshots essentially throughout the test, but this is unexpected. Raft snapshots are a recovery mechanism that shouldn't usually be triggered during normal cluster operations.

tbg (Member) commented Oct 16, 2018

16 of these big snapshots arrive at an initialized replica (i.e. one that knows its bounds). The other 28 hit one that doesn't know its bounds. The ratio is approximately the same when you include all snapshots (105 vs. 268), so it's just that we saw more and bigger Raft snapshots, not different ones.

(Computed via rg --no-line-number 'applying Raft snapshot at index' cockroach.log | rg 'encoded size=\d{5}' | grep -cF /Table etc)

tbg (Member) commented Oct 16, 2018

// We target the raft log size at the size of the replicated data. When
// writing to a replica, it is common for the raft log to become larger than
// the replicated data as the raft log contains the overhead of the
// BatchRequest which includes the full transaction state as well as begin
// and end transaction operations. If the estimated raft log size becomes
// larger than the replica size, we're better off recovering the replica
// using a snapshot.

suggests that we'll fall back to Raft snapshots when the estimated Raft log size is larger than the replica, and I suspect that this is a little noisy/aggressive for small replicas. But especially in this situation there's another catch: the target size is additionally clamped to 4 MB (which is going to kick in in the vast majority of cases):

targetSize = raftLogMaxSize

My reading of this is that once you have a few SSTs in your replica, you're going to aggressively truncate to the quorum commit index, which is basically saying that it's leaving a lagging replica behind.
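
To spell out that reading, here's a minimal sketch of the truncation decision as described above (illustrative names and structure, not the actual crdb code): the target size is the replicated data size clamped to 4 MB, and once the log exceeds it we truncate to the quorum commit index even if that orphans a lagging follower.

package main

import "fmt"

const raftLogMaxSize = 4 << 20 // 4 MB clamp

// truncateIndex decides how far the Raft log may be truncated.
// quorumIndex: the index known to be committed by a quorum.
// behindIndex: the match index of the slowest follower.
func truncateIndex(raftLogSize, replicaSize int64, quorumIndex, behindIndex uint64) uint64 {
    // Target the log size at the replicated data size, clamped to 4 MB.
    targetSize := replicaSize
    if targetSize > raftLogMaxSize {
        targetSize = raftLogMaxSize
    }
    if raftLogSize > targetSize {
        // Log considered too big: truncate up to the quorum commit index,
        // even if that cuts off a lagging follower (it will need a snapshot).
        return quorumIndex
    }
    // Otherwise keep enough log to catch the slowest follower up via entries.
    return behindIndex
}

func main() {
    // With a few 32 MB SSTs in the log, raftLogSize blows past the 4 MB clamp
    // almost immediately, so we truncate to the quorum index and orphan n3.
    fmt.Println(truncateIndex(96<<20, 512<<20, 1000, 800)) // -> 1000
}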

I assume that due to the geo-distributed nature of this cluster, n3 is prone to being that odd one out and even more so once it gets to apply all of these snapshots, which come with a large memory amplification, perhaps exacerbated by the sideloading of the SSTs.

petermattis (Collaborator):

Yes, the Raft log is limited to a 4 MB max size, but proposal quota is supposed to avoid ever hitting that limit. And even if we do hit the truncation limit and allow a replica to fall behind and require a Raft snapshot, the Raft snapshots are serialized by the Raft snapshot queue so there should be a limited number outbound, and inbound snapshots are also serialized. I'm confused about where the excess memory usage is coming from.

> My reading of this is that once you have a few SSTs in your replica, you're going to aggressively truncate to the quorum commit index, which is basically saying that it's leaving a lagging replica behind.

This is interesting. We limit the size of the Raft log because when the Raft log gets large, applying the entries can be slower than sending a snapshot. But that heuristic breaks down when the Raft log contains side-loaded SSTables.
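
For reference, a toy sketch of the proposal-quota idea mentioned above (illustrative only, not the actual crdb implementation): the leader may only have a bounded number of proposal bytes outstanding, and quota is returned only once every live follower has appended the entry, so a slow-but-live follower throttles the leader rather than silently falling arbitrarily far behind.

package main

import (
    "fmt"
    "sync"
)

// quotaPool is a toy proposal quota pool: a fixed budget of "in flight"
// proposal bytes.
type quotaPool struct {
    mu    sync.Mutex
    cond  *sync.Cond
    avail int64
}

func newQuotaPool(size int64) *quotaPool {
    q := &quotaPool{avail: size}
    q.cond = sync.NewCond(&q.mu)
    return q
}

// acquire blocks the proposer until n bytes of quota are available.
func (q *quotaPool) acquire(n int64) {
    q.mu.Lock()
    defer q.mu.Unlock()
    for q.avail < n {
        q.cond.Wait()
    }
    q.avail -= n
}

// release returns quota once all live followers have appended the entry.
func (q *quotaPool) release(n int64) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.avail += n
    q.cond.Broadcast()
}

func main() {
    q := newQuotaPool(1 << 20) // 1 MB of proposal quota
    q.acquire(256 << 10)       // propose a 256 KB command
    // ...once every follower has appended it:
    q.release(256 << 10)
    fmt.Println("quota cycle complete")
}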

@tbg tbg changed the title Dead node when running tpcc 5k with a 6 node cluster and partitioning roachtest: stability problems during tpcc-{3,5}k on {6,12} nodes Oct 16, 2018
@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Oct 16, 2018
@tbg tbg added this to the 2.1 milestone Oct 16, 2018
@tbg tbg changed the title roachtest: stability problems during tpcc-{3,5}k on {6,12} nodes roachtest: stability problems during tpcc-{5,10,20}k on {6,12} nodes Oct 16, 2018
@tbg tbg changed the title roachtest: stability problems during tpcc-{5,10,20}k on {6,12} nodes roachtest: stability problems during tpcc-{5,10,20}k on {6,12,24} nodes Oct 16, 2018
@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. labels Oct 16, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 16, 2018
@tbg tbg self-assigned this Oct 17, 2018

tbg (Member) commented Oct 17, 2018

I haven't dared to look more, but this is definitely not what we want to see while running this test:

root@localhost:26257/defaultdb> select * from crdb_internal.gossip_alerts;
  node_id | store_id | category |            description             | value
+---------+----------+----------+------------------------------------+-------+
        5 |        5 | metrics  | queue.raftsnapshot.process.failure |    10
        3 |        3 | metrics  | queue.raftsnapshot.process.failure |    38
        2 |        2 | metrics  | queue.raftsnapshot.process.failure |     2
        6 |        6 | metrics  | queue.raftsnapshot.process.failure |    14
(4 rows)

Time: 873.526µs

tbg (Member) commented Oct 17, 2018

Seeing long-lived stretches of

E181017 09:54:03.321069 399050 storage/queue.go:791  [n2,raftsnapshot,s2,r20548/2:/Table/57/1/3565/{1/-1…-5/-2…}] snapshot failed: (n4,s4):1: remote couldn't accept Raft snapshot 8299c9a5 at applied index 22 with error: [n4,s4],r20548: cannot apply snapshot: snapshot intersects existing range [n4,s4,r20328/1:/Table/5{7/1/3542…-8/1/0/1/…}]
E181017 09:54:03.322758 398811 storage/queue.go:791  [n2,raftsnapshot,s2,r21421/2:/Table/57/1/3656/8/-476/11…] snapshot failed: (n4,s4):1: remote couldn't accept Raft snapshot cd7551ad at applied index 15 with error: [n4,s4],r21421: cannot apply snapshot: snapshot intersects existing range [n4,s4,r20328/1:/Table/5{7/1/3542…-8/1/0/1/…}]
E181017 09:54:03.324610 399052 storage/queue.go:791  [n2,raftsnapshot,s2,r24034/2:/Table/57/1/4754/1/-14{5…-4…}] snapshot failed: (n4,s4):1: remote couldn't accept Raft snapshot 32526b2c at applied index 34 with error: [n4,s4],r24034: cannot apply snapshot: snapshot intersects existing range [n4,s4,r20328/1:/Table/5{7/1/3542…-8/1/0/1/…}]
E181017 09:54:03.326988 398863 storage/queue.go:791  [n2,raftsnapshot,s2,r22671/2:/Table/57/1/3612/2/-676/5…] snapshot failed: (n4,s4):1: remote couldn't accept Raft snapshot c4a4a879 at applied index 15 with error: [n4,s4],r22671: cannot apply snapshot: snapshot intersects existing range [n4,s4,r20328/1:/Table/5{7/1/3542…-8/1/0/1/…}]

I wonder if this is related to #31330.
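
For context, the rejection above boils down to an overlap check on the receiving store. A minimal sketch of the idea (hypothetical names, not the actual crdb code): a snapshot can only be applied if its key span doesn't intersect some *other* replica already on the store, e.g. a wide pre-split replica like r20328 above that hasn't processed the splits (or a merge) yet.

package main

import "fmt"

// span is a toy key span [start, end).
type span struct{ start, end string }

func (a span) overlaps(b span) bool {
    return a.start < b.end && b.start < a.end
}

// replica is a toy placeholder for an existing replica on a store.
type replica struct {
    rangeID int
    bounds  span
}

// canApplySnapshot sketches the check behind the "snapshot intersects
// existing range" error: a snapshot for rangeID may only be applied if its
// key span doesn't overlap a different replica already on the store.
func canApplySnapshot(store []replica, rangeID int, snapBounds span) error {
    for _, r := range store {
        if r.rangeID != rangeID && r.bounds.overlaps(snapBounds) {
            return fmt.Errorf("cannot apply snapshot: snapshot intersects existing range r%d", r.rangeID)
        }
    }
    return nil
}

func main() {
    store := []replica{{rangeID: 20328, bounds: span{"/Table/57/1/3542", "/Table/58/1/0/1"}}}
    err := canApplySnapshot(store, 20548, span{"/Table/57/1/3565/1", "/Table/57/1/3565/5"})
    fmt.Println(err) // the snapshot's span is still covered by r20328, so it's rejected
}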

@benesch as an aside, this looks like a merge, but why is the event type "unknown"?

image

This range generally seems to be hanging around alone, which I think means it must've been merged away (I'm running master at 6773854 which has the merge queue on).

image

tbg (Member) commented Oct 17, 2018

The "snapshot intersects existing range" error is logged at such a high rate that all interesting history has been rotated out of the log files.

edit: experimentally measured a rate of ~114 of these log messages per second.

tbg added a commit to tbg/cockroach that referenced this issue Nov 26, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 26, 2018
craig bot pushed a commit that referenced this issue Nov 26, 2018
32588: storage: indicate when a split causes a Raft snapshot r=petermattis a=tbg

Touches #31409.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Nov 26, 2018
When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.

Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.

This is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in cockroachdb#31409.

With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Nov 26, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 27, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 28, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 29, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 29, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 29, 2018
When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.

Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.

This commit is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in cockroachdb#31409.

With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once (except perhaps for an initial spike).

Splits are delayed only for manual splits. In particular, the split
queue is not affected and could in theory cause Raft snapshots. However,
at the present juncture, adding delays in the split queue could cause
problems as well, so we retain the previous behavior there which isn't
known to have caused problems.

More follow-up work in the area of Raft snapshots will be necessary to
add some more sanity to this area of the code.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
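
A rough sketch of the kind of backpressure described in this commit message (illustrative names, polling approach, and bounds; not the actual patch): a manual split waits, up to some cap, while any follower of the range still needs a Raft snapshot, since splitting at that point would just mint a right-hand side whose followers also need snapshots.

package main

import (
    "fmt"
    "time"
)

// followerState is a toy view of a follower's replication progress.
type followerState struct {
    needsSnapshot bool
}

// maybeDelaySplit holds back a *manual* split (with a capped wait) while any
// follower of the range still needs a Raft snapshot. After maxDelay the
// split proceeds regardless.
func maybeDelaySplit(followers func() []followerState, maxDelay time.Duration) time.Duration {
    const tick = 100 * time.Millisecond
    var waited time.Duration
    for waited < maxDelay {
        healthy := true
        for _, f := range followers() {
            if f.needsSnapshot {
                healthy = false
                break
            }
        }
        if healthy {
            break // safe(ish) to split now
        }
        time.Sleep(tick)
        waited += tick
    }
    return waited
}

func main() {
    calls := 0
    followers := func() []followerState {
        calls++
        // Pretend the laggard catches up after a few checks.
        return []followerState{{needsSnapshot: calls < 3}}
    }
    waited := maybeDelaySplit(followers, 2*time.Second)
    fmt.Printf("delayed split by %s\n", waited)
}
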
tbg added a commit to tbg/cockroach that referenced this issue Nov 30, 2018
craig bot pushed a commit that referenced this issue Nov 30, 2018
32594: storage: delay manual splits that would result in more snapshots r=petermattis a=tbg

This is unpolished, but I had used an earlier version of this with what at the
time looked like success. At this point I suspect that this is the best way to
suppress Raft snapshot growth in IMPORT/RESTORE.

(Definitely needs tests).


Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Nov 30, 2018
This is a dangerous condition. Adding this to the health checker has the
additional benefit of logging it during the nightly restore/import
tests, which can in turn help diagnose whether a particular run is
affected by cockroachdb#31409.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Nov 30, 2018

tbg (Member) commented Dec 3, 2018

The nightlies post #32594 look very promising (they all passed, and didn't have Raft snapshot buildup).

I'm going to optimistically close this, but @ajwerner and @awoods187 please let me know when imports fail (in the way specific to this issue).
