roachtest: stability problems during tpcc-{5,10,20}k on {6,12,24} nodes #31409

Closed

awoods187 opened this issue Oct 15, 2018 · 64 comments

Labels: C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

awoods187 (Contributor) commented Oct 15, 2018

Describe the problem

Dead node with 9k under-replicated ranges when running TPC-C.

To Reproduce

Used a roachtest build from two days ago, with the test modified to use partitioning and 6 nodes:

@@ -675,11 +675,12 @@ func registerTPCCBench(r *registry) {
                        // StoreDirVersion: "2.0-5",
                },
                {
-                       Nodes: 3,
+                       Nodes: 6,
                        CPUs:  16,
 
-                       LoadWarehouses: 2000,
-                       EstimatedMax:   1300,
+                       LoadWarehouses: 5000,
+                       EstimatedMax:   3000,
+                       LoadConfig:     singlePartitionedLoadgen,
                },

Ran:

bin/roachtest bench '^tpccbench/nodes=6/cpu=16/partition$$' --wipe=false --user=andy

Expected behavior
No dead nodes

Additional data / screenshots
image

Environment:

  • CockroachDB version 2.1 Beta 1008

Dead node logs:
cockroach.log

petermattis (Collaborator) commented Oct 15, 2018

Which cluster is this?

PS You grabbed the logs for n5, not n3. The confusion between roachprod indexes and node IDs strikes again.

awoods187 (Contributor, Author):

andy-1539640859-tpccbench-nodes-6-cpu-16-partition

petermattis (Collaborator) commented Oct 15, 2018

It looks like the cluster has been wiped.

Update: oops, I fooled myself by not accounting for the cluster being run by a different user. The logs are still there.

awoods187 (Contributor, Author):

cockroach.log

petermattis (Collaborator):

n3 hit an OOM: fatal error: runtime: out of memory. The previous runtime.go message shows:

I181015 23:09:34.905569 360 server/status/runtime.go:465  [n3] runtime stats: 13 GiB RSS, 348 goroutines, 7.0 GiB/1.4
GiB/8.7 GiB GO alloc/idle/total, 3.6 GiB/4.4 GiB CGO alloc/total, 98011.8 CGO/sec, 59.0/34.5 %(u/s)time, 0.0 %gc (0x),
 258 MiB/6.7 MiB (r/w)net

Some sort of memory leak was occurring only on n3:

screen shot 2018-10-15 at 7 58 56 pm

petermattis (Collaborator):

I'm not seeing anything interesting in the logs, but I don't have time to do a thorough check right now.

tbg (Member) commented Oct 16, 2018

inuse_space.svg.txt

The filename of the heap_profiler profile used for this is

memprof.fraction_system_memory.000000012938502144_2018-10-15T23_08_34.684

which puts this close to the crash, probably within a minute of it; otherwise I'd expect there to be a second profile (and there isn't).

What we're seeing is lots of memory allocated while unmarshaling Raft entries.

image

I don't understand the last row of boxes here; perhaps @nvanbenschoten or @petermattis can help me understand what it means.

image

Looking at alloc_space, things are a lot more spread out. Of the bigger offenders, the largest is

image

Looking at the logs, I'm seeing this node require lots of Raft snapshots just before it dies. I think what's happening here is that during the restore, n3 falls behind on a number of Raft groups. Now it needs lots of snapshots, and those will contain boatloads of AddSSTable commands, which are now inlined. Things go south quickly from here.
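
To make the suspected memory amplification concrete, here's a minimal Go sketch of the inlining step (hypothetical names, not the actual crdb types): every sideloaded AddSSTable payload that has to be shipped to a lagging follower is read back into memory in full, so a burst of catch-up traffic means many multi-MB payloads buffered at once.

package main

import (
    "fmt"
    "os"
)

// entry is a toy stand-in for a Raft log entry. AddSSTable payloads are
// normally sideloaded to files on disk and the entry only carries a
// reference; when entries have to be shipped to a follower, the payload is
// inlined again.
type entry struct {
    inline         []byte // payload already in memory, if any
    sideloadedPath string // path to the on-disk SST, if sideloaded
}

// inlineSideloaded materializes every sideloaded payload in memory. If a
// lagging follower needs many such entries at once, each multi-MB SST ends
// up buffered in RAM simultaneously.
func inlineSideloaded(ents []entry) (int, error) {
    total := 0
    for i := range ents {
        if ents[i].inline != nil || ents[i].sideloadedPath == "" {
            continue
        }
        data, err := os.ReadFile(ents[i].sideloadedPath) // pulls the whole SST into memory
        if err != nil {
            return total, err
        }
        ents[i].inline = data
        total += len(data)
    }
    return total, nil
}

func main() {
    ents := []entry{{sideloadedPath: "/tmp/example.sst"}}
    if n, err := inlineSideloaded(ents); err == nil {
        fmt.Printf("inlined %d bytes\n", n)
    }
}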

I looked at the size of the snapshots. They don't seem absurdly high, but they are all pulled into memory at the receiving end:

rg --no-line-number -o 'encoded size=\d+' cockroach.log | sed -e 's/encoded size.//' | sort -rn | less
70188352
70188159
70188102
70188021
70187982
70187952
70187904
70187893
70187674
70187642
70187629
70187525
70187415
70187372
70187360
70187295
70187281
70184560
70183990
70183925
70183778
70183635
70183617
70183585
64491018
63721883
35022151
35016766
35012746
35012711
35012618
35012599
35012556
35012552
35012549
35012503
35012470
35012423
35012406
35012402
35012394
35012392
35012294
35012294
35012292
35012251
35012073
34377270
34377190
34377064
34377061
34376805
34376748
27622064
26108517
19901334
12975627
12079662
8582929
8221034
176283
95793
87492
... many more but smaller

tbg (Member) commented Oct 16, 2018

As far as timing goes, the test runs from ~10pm to ~11pm. Here's the number of Raft snapshots within each minute:

$ rg --no-line-number 'applying Raft snapshot at index' cockroach.log | cut -c 1-13 | uniq -c
   1 I181015 22:06
   1 I181015 22:08
   1 I181015 22:21
   3 I181015 22:48
   9 I181015 22:49
   7 I181015 22:50
   1 I181015 22:51
   3 I181015 22:52
   1 I181015 22:53
   1 I181015 22:54
   2 I181015 22:58
   7 I181015 22:59
   7 I181015 23:00
   5 I181015 23:01
  19 I181015 23:02
   2 I181015 23:03
  40 I181015 23:04
  59 I181015 23:05
  72 I181015 23:06
  62 I181015 23:07
  50 I181015 23:08
  20 I181015 23:09

Now those are lots of snapshots, but many of them are small. Let's restrict to snapshots that are at least 10mb:

$ rg --no-line-number 'applying Raft snapshot at index' cockroach.log | rg 'encoded size=\d{5}' | cut -c 1-13 | uniq -c
   3 I181015 23:04
   8 I181015 23:05
   8 I181015 23:06
   8 I181015 23:07
  10 I181015 23:08
   7 I181015 23:09

They all come in just as things take a severe turn south. This is worth a ping to #10320 (KV-level memory accounting), though that would be the second line of defense. The first question is why the cluster decided that it was OK to not take n3 into account when truncating the Raft logs for lots of ranges.

We can see from the logs that there are Raft snapshots essentially throughout the test, but this is unexpected. Raft snapshots are a recovery mechanism that shouldn't usually be triggered during normal cluster operations.

tbg (Member) commented Oct 16, 2018

16 of these big snapshots arrive at an initialized replica (i.e. one that knows its bounds). The other 28 hit one that doesn't know its bounds. The ratio is approximately the same when you include all snapshots (105 vs. 268), so it's just that we saw more and bigger Raft snapshots, not different ones.

(Computed via rg --no-line-number 'applying Raft snapshot at index' cockroach.log | rg 'encoded size=\d{5}' | grep -cF /Table etc)

tbg (Member) commented Oct 16, 2018

// We target the raft log size at the size of the replicated data. When
// writing to a replica, it is common for the raft log to become larger than
// the replicated data as the raft log contains the overhead of the
// BatchRequest which includes the full transaction state as well as begin
// and end transaction operations. If the estimated raft log size becomes
// larger than the replica size, we're better off recovering the replica
// using a snapshot.

suggests that we'll fall back to Raft snapshots when the estimated Raft log size is larger than the replica, and I suspect that this is a little noisy/aggressive for small replicas. But especially in this situation there's another catch: the target size is additionally clamped to 4 MB (which is going to kick in in the vast majority of cases):

targetSize = raftLogMaxSize

My reading of this is that once you have a few SSTs in your replica, you're going to aggressively truncate to the quorum commit index, which is basically saying that it's leaving a lagging replica behind.
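
To spell out that reading, here's a minimal sketch of the truncation decision as described above (illustrative names and structure, not the actual crdb code): the target size is the replicated data size clamped to 4 MB, and once the log exceeds it we truncate to the quorum commit index even if that orphans a lagging follower.

package main

import "fmt"

const raftLogMaxSize = 4 << 20 // 4 MB clamp

// truncateIndex decides how far the Raft log may be truncated.
// quorumIndex: the index known to be committed by a quorum.
// behindIndex: the match index of the slowest follower.
func truncateIndex(raftLogSize, replicaSize int64, quorumIndex, behindIndex uint64) uint64 {
    // Target the log size at the replicated data size, clamped to 4 MB.
    targetSize := replicaSize
    if targetSize > raftLogMaxSize {
        targetSize = raftLogMaxSize
    }
    if raftLogSize > targetSize {
        // Log considered too big: truncate up to the quorum commit index,
        // even if that cuts off a lagging follower (it will need a snapshot).
        return quorumIndex
    }
    // Otherwise keep enough log to catch the slowest follower up via entries.
    return behindIndex
}

func main() {
    // With a few 32 MB SSTs in the log, raftLogSize blows past the 4 MB clamp
    // almost immediately, so we truncate to the quorum index and orphan n3.
    fmt.Println(truncateIndex(96<<20, 512<<20, 1000, 800)) // -> 1000
}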

I assume that due to the geo-distributed nature of this cluster, n3 is prone to being that odd one out and even more so once it gets to apply all of these snapshots, which come with a large memory amplification, perhaps exacerbated by the sideloading of the SSTs.

petermattis (Collaborator):

Yes, the Raft log is limited to a 4 MB max size, but proposal quota is supposed to avoid ever hitting that limit. And even if we do hit the truncation limit and allow a replica to fall behind and require a Raft snapshot, the Raft snapshots are serialized by the Raft snapshot queue so there should be a limited number outbound, and inbound snapshots are also serialized. I'm confused about where the excess memory usage is coming from.

> My reading of this is that once you have a few SSTs in your replica, you're going to aggressively truncate to the quorum commit index, which is basically saying that it's leaving a lagging replica behind.

This is interesting. We limit the size of the Raft log because when the Raft log gets large, applying the entries can be slower than sending a snapshot. But that heuristic breaks down when the Raft log contains side-loaded SSTables.
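
For reference, a toy sketch of the proposal-quota idea mentioned above (illustrative only, not the actual crdb implementation): the leader may only have a bounded number of proposal bytes outstanding, and quota is returned only once every live follower has appended the entry, so a slow-but-live follower throttles the leader rather than silently falling arbitrarily far behind.

package main

import (
    "fmt"
    "sync"
)

// quotaPool is a toy proposal quota pool: a fixed budget of "in flight"
// proposal bytes.
type quotaPool struct {
    mu    sync.Mutex
    cond  *sync.Cond
    avail int64
}

func newQuotaPool(size int64) *quotaPool {
    q := &quotaPool{avail: size}
    q.cond = sync.NewCond(&q.mu)
    return q
}

// acquire blocks the proposer until n bytes of quota are available.
func (q *quotaPool) acquire(n int64) {
    q.mu.Lock()
    defer q.mu.Unlock()
    for q.avail < n {
        q.cond.Wait()
    }
    q.avail -= n
}

// release returns quota once all live followers have appended the entry.
func (q *quotaPool) release(n int64) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.avail += n
    q.cond.Broadcast()
}

func main() {
    q := newQuotaPool(1 << 20) // 1 MB of proposal quota
    q.acquire(256 << 10)       // propose a 256 KB command
    // ...once every follower has appended it:
    q.release(256 << 10)
    fmt.Println("quota cycle complete")
}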

@tbg tbg changed the title Dead node when running tpcc 5k with a 6 node cluster and partitioning roachtest: stability problems during tpcc-{3,5}k on {6,12} nodes Oct 16, 2018
@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Oct 16, 2018
@tbg tbg added this to the 2.1 milestone Oct 16, 2018
@tbg tbg changed the title roachtest: stability problems during tpcc-{3,5}k on {6,12} nodes roachtest: stability problems during tpcc-{5,10,20}k on {6,12} nodes Oct 16, 2018
@tbg tbg changed the title roachtest: stability problems during tpcc-{5,10,20}k on {6,12} nodes roachtest: stability problems during tpcc-{5,10,20}k on {6,12,24} nodes Oct 16, 2018
@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-investigation Further steps needed to qualify. C-label will change. labels Oct 16, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 16, 2018
@tbg tbg self-assigned this Oct 17, 2018

tbg (Member) commented Oct 17, 2018

I haven't dared to look more, but this is definitely not what we want to see while running this test:

root@localhost:26257/defaultdb> select * from crdb_internal.gossip_alerts;
  node_id | store_id | category |            description             | value
+---------+----------+----------+------------------------------------+-------+
        5 |        5 | metrics  | queue.raftsnapshot.process.failure |    10
        3 |        3 | metrics  | queue.raftsnapshot.process.failure |    38
        2 |        2 | metrics  | queue.raftsnapshot.process.failure |     2
        6 |        6 | metrics  | queue.raftsnapshot.process.failure |    14
(4 rows)

Time: 873.526µs

tbg (Member) commented Oct 17, 2018

Seeing long-lived stretches of

E181017 09:54:03.321069 399050 storage/queue.go:791  [n2,raftsnapshot,s2,r20548/2:/Table/57/1/3565/{1/-1…-5/-2…}] snapshot failed: (n4,s4):1: remote couldn't accept Raft snapshot 8299c9a5 at applied index 22 with error: [n4,s4],r20548: cannot apply snapshot: snapshot intersects existing range [n4,s4,r20328/1:/Table/5{7/1/3542…-8/1/0/1/…}]
E181017 09:54:03.322758 398811 storage/queue.go:791  [n2,raftsnapshot,s2,r21421/2:/Table/57/1/3656/8/-476/11…] snapshot failed: (n4,s4):1: remote couldn't accept Raft snapshot cd7551ad at applied index 15 with error: [n4,s4],r21421: cannot apply snapshot: snapshot intersects existing range [n4,s4,r20328/1:/Table/5{7/1/3542…-8/1/0/1/…}]
E181017 09:54:03.324610 399052 storage/queue.go:791  [n2,raftsnapshot,s2,r24034/2:/Table/57/1/4754/1/-14{5…-4…}] snapshot failed: (n4,s4):1: remote couldn't accept Raft snapshot 32526b2c at applied index 34 with error: [n4,s4],r24034: cannot apply snapshot: snapshot intersects existing range [n4,s4,r20328/1:/Table/5{7/1/3542…-8/1/0/1/…}]
E181017 09:54:03.326988 398863 storage/queue.go:791  [n2,raftsnapshot,s2,r22671/2:/Table/57/1/3612/2/-676/5…] snapshot failed: (n4,s4):1: remote couldn't accept Raft snapshot c4a4a879 at applied index 15 with error: [n4,s4],r22671: cannot apply snapshot: snapshot intersects existing range [n4,s4,r20328/1:/Table/5{7/1/3542…-8/1/0/1/…}]

I wonder if this is related to #31330.
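
For context, the rejection above boils down to an overlap check on the receiving store. A minimal sketch of the idea (hypothetical names, not the actual crdb code): a snapshot can only be applied if its key span doesn't intersect some *other* replica already on the store, e.g. a wide pre-split replica like r20328 above that hasn't processed the splits (or a merge) yet.

package main

import "fmt"

// span is a toy key span [start, end).
type span struct{ start, end string }

func (a span) overlaps(b span) bool {
    return a.start < b.end && b.start < a.end
}

// replica is a toy placeholder for an existing replica on a store.
type replica struct {
    rangeID int
    bounds  span
}

// canApplySnapshot sketches the check behind the "snapshot intersects
// existing range" error: a snapshot for rangeID may only be applied if its
// key span doesn't overlap a different replica already on the store.
func canApplySnapshot(store []replica, rangeID int, snapBounds span) error {
    for _, r := range store {
        if r.rangeID != rangeID && r.bounds.overlaps(snapBounds) {
            return fmt.Errorf("cannot apply snapshot: snapshot intersects existing range r%d", r.rangeID)
        }
    }
    return nil
}

func main() {
    store := []replica{{rangeID: 20328, bounds: span{"/Table/57/1/3542", "/Table/58/1/0/1"}}}
    err := canApplySnapshot(store, 20548, span{"/Table/57/1/3565/1", "/Table/57/1/3565/5"})
    fmt.Println(err) // the snapshot's span is still covered by r20328, so it's rejected
}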

@benesch as an aside, this looks like a merge, but why is the event type "unknown"?

image

This range generally seems to be hanging around alone, which I think means it must've been merged away (I'm running master at 6773854 which has the merge queue on).

image

tbg (Member) commented Oct 17, 2018

The "snapshot intersects existing range" error is logged at such a high rate that all interesting history has been rotated out of the log files.

edit: experimentally measured a rate of ~114 of these log messages per second.

tbg added a commit to tbg/cockroach that referenced this issue Nov 26, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 26, 2018
craig bot pushed a commit that referenced this issue Nov 26, 2018
32588: storage: indicate when a split causes a Raft snapshot r=petermattis a=tbg

Touches #31409.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Nov 26, 2018
When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.

Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.

This is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in cockroachdb#31409.

With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Nov 26, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 27, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 28, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 29, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 29, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 29, 2018
When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.

Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.

This commit is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in cockroachdb#31409.

With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once (except perhaps for an initial spike).

Splits are delayed only for manual splits. In particular, the split
queue is not affected and could in theory cause Raft snapshots. However,
at the present juncture, adding delays in the split queue could cause
problems as well, so we retain the previous behavior there which isn't
known to have caused problems.

More follow-up work in the area of Raft snapshots will be necessary to
add some more sanity to this area of the code.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
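
A rough sketch of the kind of backpressure described in this commit message (illustrative names, polling approach, and bounds; not the actual patch): a manual split waits, up to some cap, while any follower of the range still needs a Raft snapshot, since splitting at that point would just mint a right-hand side whose followers also need snapshots.

package main

import (
    "fmt"
    "time"
)

// followerState is a toy view of a follower's replication progress.
type followerState struct {
    needsSnapshot bool
}

// maybeDelaySplit holds back a *manual* split (with a capped wait) while any
// follower of the range still needs a Raft snapshot. After maxDelay the
// split proceeds regardless.
func maybeDelaySplit(followers func() []followerState, maxDelay time.Duration) time.Duration {
    const tick = 100 * time.Millisecond
    var waited time.Duration
    for waited < maxDelay {
        healthy := true
        for _, f := range followers() {
            if f.needsSnapshot {
                healthy = false
                break
            }
        }
        if healthy {
            break // safe(ish) to split now
        }
        time.Sleep(tick)
        waited += tick
    }
    return waited
}

func main() {
    calls := 0
    followers := func() []followerState {
        calls++
        // Pretend the laggard catches up after a few checks.
        return []followerState{{needsSnapshot: calls < 3}}
    }
    waited := maybeDelaySplit(followers, 2*time.Second)
    fmt.Printf("delayed split by %s\n", waited)
}
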
tbg added a commit to tbg/cockroach that referenced this issue Nov 30, 2018
craig bot pushed a commit that referenced this issue Nov 30, 2018
32594: storage: delay manual splits that would result in more snapshots r=petermattis a=tbg

This is unpolished, but I had used an earlier version of this with what at the
time looked like success. At this point I suspect that this is the best way to
suppress Raft snapshot growth in IMPORT/RESTORE.

(Definitely needs tests).


Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Nov 30, 2018
This is a dangerous condition. Adding this to the health checker has the
additional benefit of logging it during the nightly restore/import
tests, which can in turn help diagnose whether a particular run is
affected by cockroachdb#31409.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Nov 30, 2018

tbg (Member) commented Dec 3, 2018

The nightlies post #32594 look very promising (they all passed, and didn't have Raft snapshot buildup).

I'm going to optimistically close this, but @ajwerner and @awoods187 please let me know when imports fail (in the way specific to this issue).
