
backup: OOM during TPC-H scalefactor=10 restore running rc2 #15681

Closed · rjnn opened this issue May 3, 2017 · 20 comments

Labels: C-investigation Further steps needed to qualify. C-label will change.

@rjnn (Contributor) commented May 3, 2017

Cluster navy, which had been rolling-upgraded to rc2, was running a TPC-H scalefactor=10 RESTORE job. There was an existing TPC-H scalefactor=1 database on navy, so I ran the following commands:
DROP DATABASE tpch; (ctrl-c-ed out)
CREATE DATABASE tpch;
RESTORE tpch.* FROM azure://... (ctrl-c-ed out)

Everything proceeded fine for a couple of hours until navy-0001 OOMed. The restore continued after the OOM on node 1 until it reached 0.998769998550415, at which point progress stopped increasing.

The logs on navy-0001 have been preserved on the machine, and it has not been restarted in order to facilitate debugging.

@rjnn rjnn added the C-investigation Further steps needed to qualify. C-label will change. label May 3, 2017
@rjnn rjnn added this to the 1.0 milestone May 3, 2017
@danhhz (Contributor) commented May 4, 2017

Arjun mentioned offline that the memprofiles and logs from right after the crash are on navy 1 in ~/restore.rc2.oom.logs.tgz. Looking at memprof.2017-05-03T22_20_43.523, the first two entries for inuse_space are:

 2720.12MB 75.24% 75.24%  2720.12MB 75.24%  github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal
  594.02MB 16.43% 91.67%   594.02MB 16.43%  github.com/cockroachdb/cockroach/pkg/storage.encodeRaftCommand

Here's the call stack for that Unmarshal (is there an easier way to get this in pprof?)
[image: pprof call graph showing the call stack for raftpb.(*Entry).Unmarshal]

cc @petermattis

@danhhz (Contributor) commented May 4, 2017

I wonder if it has something to do with raft.limitSize or raft.(*unstable).slice, both called from raft.(*raftLog).slice. What's bizarre is that raft.(*raft).becomeLeader is the ultimate source of the slice being handed back, but it never saves it; it just hands it to numOfPendingConf.

Also cc @a-robinson

@danhhz (Contributor) commented May 4, 2017

It's possible that we could work around this for 1.0 by reducing the batchSizeBytes constant in import.go, which would make each entry smaller (at the cost of reduced RESTORE throughput). Though I think we'd need to understand the leak before saying for sure that it would help.
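
For concreteness, here is a minimal sketch of the batching shape being described, with hypothetical names (maxBatchSizeBytes, flush); this is not the actual import.go code, just the pattern of flushing once a byte cap is hit:

// A hypothetical, simplified version of the byte-capped batching pattern.
package restoresketch

const maxBatchSizeBytes = 8 << 20 // hypothetical cap; lowering it shrinks each Raft entry

type kv struct{ key, value []byte }

// restoreBatches groups the restored KVs into byte-capped batches and hands
// each one to flush (which would issue the WriteBatch request), so no single
// proposed entry can grow beyond roughly maxBatchSizeBytes.
func restoreBatches(kvs []kv, flush func([]kv) error) error {
	var batch []kv
	var batchBytes int
	for _, e := range kvs {
		batch = append(batch, e)
		batchBytes += len(e.key) + len(e.value)
		if batchBytes >= maxBatchSizeBytes {
			if err := flush(batch); err != nil {
				return err
			}
			batch, batchBytes = nil, 0
		}
	}
	if len(batch) > 0 {
		return flush(batch)
	}
	return nil
}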

@petermattis (Collaborator) commented:

We cache entries returned from Replica.Entries in the Raft entry cache. But that cache holds individual entries and it is limited in size.
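
For illustration only, a minimal sketch (with hypothetical names, not the actual storage-level cache) of a byte-bounded cache that holds individual entries and evicts old ones once it exceeds its budget:

// A hypothetical byte-bounded cache of individual Raft entries, assuming
// unique, monotonically increasing entry indexes.
package entrycachesketch

type entry struct {
	index uint64
	data  []byte
}

type entryCache struct {
	maxBytes int
	curBytes int
	order    []uint64         // insertion order, for FIFO eviction
	entries  map[uint64]entry // individual entries, keyed by index
}

func newEntryCache(maxBytes int) *entryCache {
	return &entryCache{maxBytes: maxBytes, entries: map[uint64]entry{}}
}

// add caches a single entry and then evicts the oldest cached entries until
// the total cached bytes are back under the budget, so the cache stays
// bounded no matter how large individual entries are.
func (c *entryCache) add(e entry) {
	c.entries[e.index] = e
	c.order = append(c.order, e.index)
	c.curBytes += len(e.data)
	for c.curBytes > c.maxBytes && len(c.order) > 0 {
		oldest := c.order[0]
		c.order = c.order[1:]
		c.curBytes -= len(c.entries[oldest].data)
		delete(c.entries, oldest)
	}
}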

@petermattis (Collaborator) commented:

I'm not sure if there is a memory leak at all. These are the Go alloc values from the last 20 runtime.go log messages:

3.3 GiB
6.1 GiB
7.0 GiB
1.5 GiB
2.8 GiB
4.0 GiB
6.2 GiB
8.4 GiB
2.8 GiB
3.5 GiB
4.0 GiB
5.7 GiB
6.9 GiB
964 MiB
1.0 GiB
2.2 GiB
3.8 GiB
6.3 GiB
7.0 GiB
8.2 GiB
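
(For reference, a hedged sketch of where a "Go alloc" figure like these comes from; the actual periodic stats logging in runtime.go may differ in detail.)

package allocsketch

import (
	"log"
	"runtime"
	"time"
)

// logGoAlloc periodically reads the runtime memory stats and logs the live
// heap size. A sawtooth in successive values, as in the list above, means
// the GC is reclaiming memory between peaks rather than leaking it.
func logGoAlloc(interval time.Duration) {
	for range time.Tick(interval) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("go alloc: %.1f GiB", float64(ms.HeapAlloc)/float64(1<<30))
	}
}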

@petermattis (Collaborator) commented:

memprof.2017-05-03T22_17_13.523 shows the top 2 allocators as:

 2136.01MB 69.46% 69.46%  2136.01MB 69.46%  github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal
  632.23MB 20.56% 90.02%   632.23MB 20.56%  github.com/cockroachdb/cockroach/pkg/storage.encodeRaftCommand

10 seconds later, memprof.2017-05-03T22_17_23.523 shows:

  621.10MB 67.49% 67.49%   621.10MB 67.49%  github.com/cockroachdb/cockroach/pkg/storage.encodeRaftCommand
   99.58MB 10.82% 78.31%    99.58MB 10.82%  runtime.gobytes

@danhhz (Contributor) commented May 4, 2017

Interesting. So it's not leaking, just peaking too high?

@petermattis (Collaborator) commented:

Something like that. I'm very surprised the GC isn't doing a better job here. I think we should save away the logs and profiles and then try another restore on navy. I'd like to run with GODEBUG=gctrace=1.

@danhhz (Contributor) commented May 4, 2017

sgtm. @justinj mentioned that he wants some production experience, so I'll let him run it (and handhold as necessary)

@danhhz (Contributor) commented May 4, 2017

This is running now

danhhz added a commit to danhhz/cockroach that referenced this issue May 4, 2017
There doesn't seem to be a large benefit to running these WriteBatch
requests in parallel, since they just end up contending for disk. May as
well rate limit them to smooth out the disk usage a bit. Maybe this will
help with cockroachdb#15681? Dunno.

name              old time/op    new time/op    delta
ClusterRestore-8    6.46µs ±10%    7.53µs ± 5%  +16.49%         (p=0.000 n=11+5)

name              old speed      new speed      delta
ClusterRestore-8  19.1MB/s ± 9%  16.4MB/s ± 5%  -14.32%         (p=0.000 n=11+5)
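
(A hedged sketch of the kind of rate limiting the commit message describes; the function and limiter setup are hypothetical, not the actual change.)

package ratelimitsketch

import (
	"context"

	"golang.org/x/time/rate"
)

// sendWriteBatches issues the WriteBatch payloads one at a time, waiting on
// the limiter before each send so the aggregate write rate stays bounded and
// disk usage is smoothed out instead of spiking.
func sendWriteBatches(ctx context.Context, limiter *rate.Limiter, batches [][]byte,
	send func(context.Context, []byte) error) error {
	for _, b := range batches {
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		if err := send(ctx, b); err != nil {
			return err
		}
	}
	return nil
}

A caller would construct the limiter with something like rate.NewLimiter(rate.Limit(batchesPerSecond), 1), tuned to what the slowest disk can sustain.
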
@danhhz (Contributor) commented May 5, 2017

Forgot to update this yesterday. The issue reproduced readily when running a RESTORE while DROPing a large database.

Running pprof top5 on the last 20 memprofiles shows that the raftpb.(*Entry).Unmarshal inuse_space mirrors the memory growth and GC that @petermattis pointed out above. I can't say I understand the proximate cause, but it sounds very similar to @a-robinson's latest update in #15702.

Also noticed that node 3 (the one that OOM'd) has /mnt mounted as fuseblk, which seems like it could explain some of this.

@a-robinson is looking at the logs to see if this is the same issue as #15702. Assigning to him while that happens.

In the meantime, we should consider documenting as a known limitation that running RESTORE alongside other exceptionally disk-heavy commands (such as DROP) should be avoided if possible. We should possibly also document that having one node with a much slower disk is a bad idea.

@danhhz danhhz assigned a-robinson and unassigned dt, maddyblue, danhhz and benesch May 5, 2017
@petermattis (Collaborator) commented:

@danhhz Did you see this assertion failure on navy 5:

cockroach: /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/version_set.cc:1400: void rocksdb::VersionStorageInfo::AddFile(int, rocksdb::FileMetaData*, rocksdb::Logger*): Assertion 'false' failed.

@petermattis (Collaborator) commented:

Both navy 3 and 5 have the fuseblk devices:

~ crl-ssh navy all mount | grep /mnt
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/sdb1 on /mnt type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)
/dev/sdb1 on /mnt type fuseblk (rw,relatime,user_id=0,group_id=0,allow_other,blksize=4096)
/dev/sdb1 on /mnt type ext4 (rw,relatime,data=ordered)

@danhhz (Contributor) commented May 5, 2017

Ah yes. Forgot to include that in my notes. navy 5 hung yesterday (I couldn't even ssh in) and had to be restarted from the Azure portal, so anything wonky with it could be related to that.

@petermattis (Collaborator) commented:

I wonder if we're hitting some sort of election death spiral. If reading the unapplied tail of the log takes close to or longer than the Raft election timeout, we could get into a situation where we're constantly calling elections: a node becomes the leader but never maintains leadership, because another follower times out, calls an election, and tries to become leader itself.
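
(A back-of-the-envelope illustration of that condition; the tick interval and election-timeout tick count below are assumed round numbers for illustration, not this cluster's actual settings.)

package electionsketch

import "time"

const (
	raftTickInterval  = 200 * time.Millisecond // assumed tick interval
	raftElectionTicks = 15                     // assumed ticks before a follower calls an election
	electionTimeout   = raftElectionTicks * raftTickInterval // ~3s under these assumptions
)

// spiralPossible reports whether replaying the unapplied log tail takes long
// enough that followers time out and call a fresh election before the new
// leader can make progress, restarting the cycle.
func spiralPossible(logTailReadTime time.Duration) bool {
	return logTailReadTime >= electionTimeout
}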

@a-robinson (Contributor) commented:

Yeah, that's exactly what I was calling out in #15702 (comment)

@petermattis (Collaborator) commented:

> Yeah, that's exactly what I was calling out in #15702 (comment)

Ah, I should have read that more closely. At least we're independently thinking along the same lines.

@petermattis petermattis modified the milestones: 1.1, 1.0 Jun 1, 2017
@danhhz (Contributor) commented Jul 17, 2017

@a-robinson is there anything to be done here for 1.1? This was the same as #15702 which has since been fixed (though also not closed), right?

@a-robinson (Contributor) commented:

We concluded that this was the same as #15702, but #15702 hasn't been fixed. They're both essentially equivalent to #15723, so we could close them as dupes.

@danhhz (Contributor) commented Jul 17, 2017

Fine by me! duplicate of #15702
