
backup: OOM during TPC-H scalefactor=10 restore running rc2 #15681

Closed · rjnn opened this issue May 3, 2017 · 20 comments

Labels: C-investigation Further steps needed to qualify. C-label will change.

@rjnn (Contributor) commented May 3, 2017

Cluster navy, which had been rolling-upgraded to rc2, was running a TPC-H scalefactor=10 RESTORE job. There was an existing TPC-H scalefactor=1 database on navy, so I ran the following commands:
DROP DATABASE tpch; (ctrl-c-ed out)
CREATE DATABASE tpch;
RESTORE tpch.* FROM azure://... (ctrl-c-ed out)

Everything proceeded fine for a couple of hours until navy-0001 OOMed. The restore continued after the OOM on node 1 until it reached 0.998769998550415, at which point progress stopped increasing.

The logs on navy-0001 have been preserved on the machine, and it has not been restarted in order to facilitate debugging.

@rjnn rjnn added the C-investigation Further steps needed to qualify. C-label will change. label May 3, 2017
@rjnn rjnn added this to the 1.0 milestone May 3, 2017
@danhhz (Contributor) commented May 4, 2017

Arjun mentioned offline that the memprofiles and logs from right after the crash are on navy 1 in ~/restore.rc2.oom.logs.tgz. Looking at memprof.2017-05-03T22_20_43.523, the first two entries for inuse_space are:

 2720.12MB 75.24% 75.24%  2720.12MB 75.24%  github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal
  594.02MB 16.43% 91.67%   594.02MB 16.43%  github.com/cockroachdb/cockroach/pkg/storage.encodeRaftCommand

Here's the call stack for that Unmarshal (is there an easier way to get this in pprof?)
[image: pprof call graph showing the call stack for raftpb.(*Entry).Unmarshal]

cc @petermattis

@danhhz (Contributor) commented May 4, 2017

I wonder if it has something to do with raft.limitSize or raft.(*unstable).slice, both called from raft.(*raftLog).slice. What's bizarre is that raft.(*raft).becomeLeader is the ultimate source of the slice being handed back, but it never saves it; it just hands it to numOfPendingConf.

Also cc @a-robinson

@danhhz (Contributor) commented May 4, 2017

It's possible that we could work around this for 1.0 by reducing the batchSizeBytes constant in import.go, which would make each entry smaller (at the cost of reduced RESTORE throughput). Though I think we'd need to understand the leak before saying for sure that it would help.
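
For concreteness, here is a minimal sketch of the batching shape being described, with hypothetical names (maxBatchSizeBytes, flush); this is not the actual import.go code, just the pattern of flushing once a byte cap is hit:

// A hypothetical, simplified version of the byte-capped batching pattern.
package restoresketch

const maxBatchSizeBytes = 8 << 20 // hypothetical cap; lowering it shrinks each Raft entry

type kv struct{ key, value []byte }

// restoreBatches groups the restored KVs into byte-capped batches and hands
// each one to flush (which would issue the WriteBatch request), so no single
// proposed entry can grow beyond roughly maxBatchSizeBytes.
func restoreBatches(kvs []kv, flush func([]kv) error) error {
	var batch []kv
	var batchBytes int
	for _, e := range kvs {
		batch = append(batch, e)
		batchBytes += len(e.key) + len(e.value)
		if batchBytes >= maxBatchSizeBytes {
			if err := flush(batch); err != nil {
				return err
			}
			batch, batchBytes = nil, 0
		}
	}
	if len(batch) > 0 {
		return flush(batch)
	}
	return nil
}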

@petermattis (Collaborator) commented:

We cache entries returned from Replica.Entries in the Raft entry cache. But that cache holds individual entries and it is limited in size.
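
For illustration only, a minimal sketch (with hypothetical names, not the actual storage-level cache) of a byte-bounded cache that holds individual entries and evicts old ones once it exceeds its budget:

// A hypothetical byte-bounded cache of individual Raft entries, assuming
// unique, monotonically increasing entry indexes.
package entrycachesketch

type entry struct {
	index uint64
	data  []byte
}

type entryCache struct {
	maxBytes int
	curBytes int
	order    []uint64         // insertion order, for FIFO eviction
	entries  map[uint64]entry // individual entries, keyed by index
}

func newEntryCache(maxBytes int) *entryCache {
	return &entryCache{maxBytes: maxBytes, entries: map[uint64]entry{}}
}

// add caches a single entry and then evicts the oldest cached entries until
// the total cached bytes are back under the budget, so the cache stays
// bounded no matter how large individual entries are.
func (c *entryCache) add(e entry) {
	c.entries[e.index] = e
	c.order = append(c.order, e.index)
	c.curBytes += len(e.data)
	for c.curBytes > c.maxBytes && len(c.order) > 0 {
		oldest := c.order[0]
		c.order = c.order[1:]
		c.curBytes -= len(c.entries[oldest].data)
		delete(c.entries, oldest)
	}
}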

@petermattis (Collaborator) commented:

I'm not sure if there is a memory leak at all. These are the Go alloc values from the last 20 runtime.go log messages:

3.3 GiB
6.1 GiB
7.0 GiB
1.5 GiB
2.8 GiB
4.0 GiB
6.2 GiB
8.4 GiB
2.8 GiB
3.5 GiB
4.0 GiB
5.7 GiB
6.9 GiB
964 MiB
1.0 GiB
2.2 GiB
3.8 GiB
6.3 GiB
7.0 GiB
8.2 GiB
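
(For reference, a hedged sketch of where a "Go alloc" figure like these comes from; the actual periodic stats logging in runtime.go may differ in detail.)

package allocsketch

import (
	"log"
	"runtime"
	"time"
)

// logGoAlloc periodically reads the runtime memory stats and logs the live
// heap size. A sawtooth in successive values, as in the list above, means
// the GC is reclaiming memory between peaks rather than leaking it.
func logGoAlloc(interval time.Duration) {
	for range time.Tick(interval) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("go alloc: %.1f GiB", float64(ms.HeapAlloc)/float64(1<<30))
	}
}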

@petermattis (Collaborator) commented:

memprof.2017-05-03T22_17_13.523 shows the top 2 allocators as:

 2136.01MB 69.46% 69.46%  2136.01MB 69.46%  github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal
  632.23MB 20.56% 90.02%   632.23MB 20.56%  github.com/cockroachdb/cockroach/pkg/storage.encodeRaftCommand

10 seconds later, memprof.2017-05-03T22_17_23.523 shows:

  621.10MB 67.49% 67.49%   621.10MB 67.49%  github.com/cockroachdb/cockroach/pkg/storage.encodeRaftCommand
   99.58MB 10.82% 78.31%    99.58MB 10.82%  runtime.gobytes

@danhhz (Contributor) commented May 4, 2017

Interesting. So it's not leaking, just peaking too high?

@petermattis (Collaborator) commented:

Something like that. I'm very surprised the GC isn't doing a better job here. I think we should save away the logs and profiles and then try another restore on navy. I'd like to run with GODEBUG=gctrace=1.

@danhhz (Contributor) commented May 4, 2017

sgtm. @justinj mentioned that he wants some production experience, so I'll let him run it (and handhold as necessary)

@danhhz (Contributor) commented May 4, 2017

This is running now

danhhz added a commit to danhhz/cockroach that referenced this issue May 4, 2017
There doesn't seem to be a large benefit to running these WriteBatch
requests in parallel, since they just end up contending for disk. May as
well rate limit them to smooth out the disk usage a bit. Maybe this will
help with cockroachdb#15681? Dunno.

name              old time/op    new time/op    delta
ClusterRestore-8    6.46µs ±10%    7.53µs ± 5%  +16.49%         (p=0.000 n=11+5)

name              old speed      new speed      delta
ClusterRestore-8  19.1MB/s ± 9%  16.4MB/s ± 5%  -14.32%         (p=0.000 n=11+5)
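
(A hedged sketch of the kind of rate limiting the commit message describes; the function and limiter setup are hypothetical, not the actual change.)

package ratelimitsketch

import (
	"context"

	"golang.org/x/time/rate"
)

// sendWriteBatches issues the WriteBatch payloads one at a time, waiting on
// the limiter before each send so the aggregate write rate stays bounded and
// disk usage is smoothed out instead of spiking.
func sendWriteBatches(ctx context.Context, limiter *rate.Limiter, batches [][]byte,
	send func(context.Context, []byte) error) error {
	for _, b := range batches {
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if err := send(ctx, b); err != nil {
			return err
		}
	}
	return nil
}

A caller would construct the limiter with something like rate.NewLimiter(rate.Limit(batchesPerSecond), 1), tuned to what the slowest disk can sustain.
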
@danhhz (Contributor) commented May 5, 2017

Forgot to update this yesterday. The issue reproduced readily when running a RESTORE while DROPing a large database.

Running pprof top5 on the last 20 memprofiles shows that the raftpb.(*Entry).Unmarshal inuse_space mirrors the memory growth and GC that @petermattis pointed out above. I can't say I understand the proximate cause, but it sounds very similar to @a-robinson's latest update in #15702.

Also noticed that node 3 (the one that OOM'd) has /mnt mounted as fuseblk, which seems like it could explain some of this.

@a-robinson is looking at the logs to see if this is the same issue as #15702. Assigning to him while that happens.

In the meantime, we should consider documenting as a known limitation that running RESTORE alongside other exceptionally disk-heavy commands (such as DROP) should be avoided if possible. We should possibly also document that having one node with a much slower disk is a bad idea.

@danhhz danhhz assigned a-robinson and unassigned dt, maddyblue, danhhz and benesch May 5, 2017
@petermattis (Collaborator) commented:

@danhhz Did you see this assertion failure on navy 5:

cockroach: /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/version_set.cc:1400: void rocksdb::VersionStorageInfo::AddFile(int, rocksdb::FileMetaData*, rocksdb::Logger*): Assertion 'false' failed.

@petermattis (Collaborator) commented:

Both navy 3 and 5 have the fuseblk devices:

~ crl-ssh navy all mount | grep /mnt
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/sdb1 on /mnt type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/sdb1 on /mnt type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)

@danhhz (Contributor) commented May 5, 2017

Ah yes. Forgot to include that in my notes. navy 5 hung yesterday (I couldn't even ssh in) and had to be restarted from the Azure portal, so anything wonky with it could be related to that.

@petermattis (Collaborator) commented:

I wonder if we're hitting some sort of election death spiral. If reading the unapplied tail of the log takes close to or longer than the Raft election timeout, we could get into a situation where we're constantly calling elections: a node becomes the leader but never maintains leadership, because another follower times out, calls an election, and tries to become leader itself.
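
(A back-of-the-envelope illustration of that condition; the tick interval and election-timeout tick count below are assumed round numbers for illustration, not this cluster's actual settings.)

package electionsketch

import "time"

const (
	raftTickInterval  = 200 * time.Millisecond // assumed tick interval
	raftElectionTicks = 15                     // assumed ticks before a follower calls an election
	electionTimeout   = raftElectionTicks * raftTickInterval // ~3s under these assumptions
)

// spiralPossible reports whether replaying the unapplied log tail takes long
// enough that followers time out and call a fresh election before the new
// leader can make progress, restarting the cycle.
func spiralPossible(logTailReadTime time.Duration) bool {
	return logTailReadTime >= electionTimeout
}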

@a-robinson (Contributor) commented:

Yeah, that's exactly what I was calling out in #15702 (comment)

@petermattis (Collaborator) commented:

> Yeah, that's exactly what I was calling out in #15702 (comment)

Ah, I should have read that more closely. At least we're independently thinking along the same lines.

@petermattis petermattis modified the milestones: 1.1, 1.0 Jun 1, 2017
@danhhz (Contributor) commented Jul 17, 2017

@a-robinson is there anything to be done here for 1.1? This was the same as #15702 which has since been fixed (though also not closed), right?

@a-robinson (Contributor) commented:

We concluded that this was the same as #15702, but #15702 hasn't been fixed. They're both essentially equivalent to #15723, so we could close them as dupes.

@danhhz (Contributor) commented Jul 17, 2017

Fine by me! duplicate of #15702
