stability: OOM on delta #10427

Closed

bdarnell opened this issue Nov 3, 2016 · 66 comments

Labels: S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

bdarnell (Contributor) commented Nov 3, 2016

One node on delta (running 9ce8833) ran out of memory, with a large spike in goroutine count right before the end.

[memory and goroutine graph omitted]

The logs show a lot of intent resolutions being cancelled in the command queue:

W161103 01:21:41.071928 22449127 storage/replica.go:1240  [n8,s16,r2341/7:/Table/55/1/197{22994…-34433…}] context canceled while in command queue: ResolveIntent [/Table/55/1/1973297837063008645/"986e18e6-d1fc-48b9-8153-27e0148c025d"/5778676/0,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min)

and raftMu being held for over a second at a time (across multiple ranges).

This node had also experienced problems under the previous build, which we had blamed on prevote.

Many goroutines are blocked at replica.go:1221 (waiting for cmdQMu) for very long times. There are 10k+ of these, but panicparse isn't grouping them all together for some reason:

1173: semacquire [15~264 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:423]
    sync                sema.go:47                  runtime_Semacquire(#42751)
    sync                mutex.go:85                 (*Mutex).Lock(#42750)
    syncutil            timedmutex.go:84            (*TimedMutex).Lock(#32315)
    storage             replica.go:1221             (*Replica).beginCmds(#32314, #254751, *, *, *, *, 0)
    storage             replica.go:1628             (*Replica).tryAddWriteCmd(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             replica.go:1603             (*Replica).addWriteCmd(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             replica.go:1104             (*Replica).Send(#32314, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             store.go:2333               (*Store).Send(#1687, #254751, *, *, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    storage             stores.go:187               (*Stores).Send(#804, #254751, *, 0, 0, #246, 0x7, 0x8ec, 0, 0, ...)
    server              node.go:843                 (*Node).batchInternal.func1(0, 0)
    stop                stopper.go:245              (*Stopper).RunTaskWithErr(#751, *, 0, 0)
    server              node.go:854                 (*Node).batchInternal(#519, #254751, *, *, *, *, 0x603fa7)
    server              node.go:874                 (*Node).Batch(#519, #254751, *, *, 0, 0xac1, 0xac1)
    roachpb             api.pb.go:1512              _Internal_Batch_Handler(#41, #519, #254751, *, *, 0, 0, 0, 0, *)
    grpc                server.go:611               (*Server).processUnaryRPC(#691, #110, *, *, #726, #75, 0, 0, 0)
    grpc                server.go:769               (*Server).handleStream(#691, #110, *, *, 0)
    grpc                server.go:422               (*Server).serveStreams.func1.1(*, #691, #110, *, *)

It looks like the command queue has gotten large:

1: runnable [Created by grpc.(*Server).serveStreams.func1 @ server.go:423]
    interval            interval.go:678             (*Node).doMatch(#54501, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34715)
    interval            interval.go:692             (*Node).doMatch(#54500, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34714)
    interval            interval.go:680             (*Node).doMatch(#54502, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34716)
    interval            interval.go:680             (*Node).doMatch(#54503, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34717)
    interval            interval.go:692             (*Node).doMatch(#54499, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34713)
    interval            interval.go:692             (*Node).doMatch(#54498, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34712)
    interval            interval.go:680             (*Node).doMatch(#54504, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34718)
    interval            interval.go:692             (*Node).doMatch(#44087, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34711)
    interval            interval.go:692             (*Node).doMatch(#76407, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34710)
    interval            interval.go:692             (*Node).doMatch(#64129, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #4909)
    interval            interval.go:692             (*Node).doMatch(#41627, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #28365)
    interval            interval.go:680             (*Node).doMatch(#19169, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #34719)
    interval            interval.go:692             (*Node).doMatch(#40511, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #35269)
    interval            interval.go:680             (*Node).doMatch(#47831, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #30706)
    interval            interval.go:680             (*Node).doMatch(#58539, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #13753)
    interval            interval.go:692             (*Node).doMatch(#10407, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #58350)
    interval            interval.go:680             (*Node).doMatch(#46061, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, #62, #9663)
    interval            interval.go:673             (*Tree).DoMatching(#18943, #22866, #64177, 0, 0x1, #64177, 0x1, 0x1, 0x1)
    storage             command_queue.go:338        (*CommandQueue).getOverlaps(#18943, #64177, 0, 0x1, #64177, 0x1, 0x1, #244168, 0x287c7, 0x2e400)
    storage             command_queue.go:177        (*CommandQueue).getWait(#18943, #104238, #57186, 0x1f7, 0x1f7, 0, 0, 0)
    storage             replica.go:1222             (*Replica).beginCmds(#32008, #254751, #81445, #72445, #22867, #49198, 0)
    storage             replica.go:1628             (*Replica).tryAddWriteCmd(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             replica.go:1603             (*Replica).addWriteCmd(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             replica.go:1104             (*Replica).Send(#32008, #254751, #81445, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             store.go:2333               (*Store).Send(#873, #254751, #81444, #255981, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    storage             stores.go:187               (*Stores).Send(#804, #254751, #81443, 0, 0, #245, 0x5, 0x8d4, 0, 0, ...)
    server              node.go:843                 (*Node).batchInternal.func1(0, 0)
    stop                stopper.go:245              (*Stopper).RunTaskWithErr(#751, #22868, 0, 0)
    server              node.go:854                 (*Node).batchInternal(#519, #254751, #81441, #72444, #81441, #46953, 0x603fa7)
    server              node.go:874                 (*Node).Batch(#519, #254751, #81441, #72444, 0, 0xad1, 0xad1)
    roachpb             api.pb.go:1512              _Internal_Batch_Handler(#41, #519, #254751, #81440, #69183, 0, 0, 0, 0, #255031)
    grpc                server.go:611               (*Server).processUnaryRPC(#691, #110, #1601, #45897, #726, #75, 0, 0, 0)
    grpc                server.go:769               (*Server).handleStream(#691, #110, #1601, #45897, 0)
    grpc                server.go:422               (*Server).serveStreams.func1.1(#1519, #691, #110, #1601, #45897)

This is similar to what I saw with race builds in #10388.

bdarnell added the S-1-stability label Nov 3, 2016
bdarnell (Contributor Author) commented Nov 3, 2016

I think this may be the result of @spencerkimball's #10279 to stop trying to cancel raft commands when the context is canceled. Previously, the number of async intent resolutions that could be queued up at a time was bounded, because we couldn't start a new one until an old one timed out. Now, when the old one times out, it stays queued up in the replica but the intentResolver is freed to go queue up another one. This allows the intentResolver to queue up work faster than it can be completed.

tbg (Member) commented Nov 3, 2016

We should consider using the semaphore mode in which you don't block when you can't proceed. Then most of these intent resolutions would simply be skipped, which I think makes sense overall (what goroutines would the old code block? Presumably goroutines that would either hang around for nothing, or even block things we would want to happen faster).
The GC queue would have to deal with larger numbers of intents, but it must be able to handle that anyway.
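A minimal sketch of that non-blocking "try-acquire" semaphore mode, using a plain buffered channel as the semaphore; the names (sem, resolve) are illustrative, not the actual CockroachDB API:

// maybeResolveAsync is a sketch of the "don't block when you can't proceed"
// mode: sem is a buffered channel acting as a counting semaphore, and resolve
// stands in for the actual intent-resolution work.
func maybeResolveAsync(sem chan struct{}, resolve func()) (started bool) {
	select {
	case sem <- struct{}{}: // claimed a slot without blocking
		go func() {
			defer func() { <-sem }() // release the slot when the work is done
			resolve()
		}()
		return true
	default:
		// Semaphore is full: skip this resolution and leave the intents
		// for the GC queue to clean up later.
		return false
	}
}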

bdarnell (Contributor Author) commented Nov 3, 2016

We're not seeing goroutines blocked on the semaphore. The goroutines that accumulate are the intent resolutions themselves, blocked on the command queue.

bdarnell (Contributor Author) commented Nov 4, 2016

A contributing factor here is that whenever there is a noop in the request (because of the way DistSender split things up), the command queue sees it as a span from KeyMin to the actual key. This creates a lot of unnecessary contention in the command queue. I don't think that's enough to explain the whole problem, but it sure doesn't help.

Fixing this is trivial (just filter out noops before the command queue), but my test is doing something I don't quite understand, so I'm still looking at it.
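A sketch of that trivial fix, with hypothetical types standing in for the real request/span plumbing: drop the noops before the batch's spans are handed to the command queue, so they never register as [KeyMin, key) intervals.

// span and request are stand-ins for the real roachpb types.
type span struct{ start, end string }

type request interface {
	isNoop() bool
	span() span
}

// commandQueueSpans collects the spans the command queue should track,
// skipping noop requests so they don't contribute phantom spans that
// overlap with everything.
func commandQueueSpans(reqs []request) []span {
	spans := make([]span, 0, len(reqs))
	for _, r := range reqs {
		if r.isNoop() {
			continue
		}
		spans = append(spans, r.span())
	}
	return spans
}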

bdarnell (Contributor Author) commented Nov 6, 2016

PRs #10470 and #10487 address factors that could result in a death spiral when command queue contention gets too high. However, I don't have a convincing theory about what might have triggered this sudden spike.

The focus on intent resolutions may have been a red herring - they're noisy in the logs, but that doesn't mean they're the root cause. I don't think we'll be able to uncover any more at this point. Maybe with the two command queue fixes in place we won't run out of memory as quickly and we'll be able to see more in the logs. Unless anyone else has ideas I'm planning to close this when #10487 is merged.

bdarnell (Contributor Author) commented Nov 7, 2016

Closing now that the two PRs above have been merged.

bdarnell closed this as completed Nov 7, 2016
petermattis (Collaborator) commented:

A node on blue just died with similar symptoms.

[six dashboard screenshots omitted]

Note that the inflection point in the memory graph is when the node was restarted. The Raft tick and Raft time graphs are interesting. The node appeared to be up, but wasn't ticking or doing Raft work for many minutes. There should be something in the logs.

petermattis reopened this Nov 7, 2016
a-robinson (Contributor) commented:

Well, I'm finally seeing the logs from around the time of the crash (for whatever reason I was only seeing logs up through 11:11 in the cockroach.stderr file until just recently even though the crash happened around 16:56 -- it seems as if the process restarting again recently fixed the problem).

There was a ton of contention. There are an incredible number of complaints about raftMu being held for longer than a second or two in handleRaftReady, like this:

W161107 16:56:27.402206 122 storage/replica.go:537  [n5,s5,r3297/31:/Table/51/1/76{57985…-60253…}] raftMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReady for 2.102556799s (>500ms):
goroutine 122 [running]:
runtime/debug.Stack(0x1d41d68, 0x1d41da8, 0x47)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/util/syncutil.ThresholdLogger.func1(0x7d52787f)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:65 +0xe1
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Unlock(0xc420e4c778)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:92 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReady(0xc420e4c700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2139 +0xee
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processReady(0xc4208ed180, 0xce1)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3174 +0x13e
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc420956780, 0xc4203ea000)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:230 +0x28c
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2()
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:181 +0x33
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4203ea000, 0xc4211c3e00)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0x7d
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:197 +0x66

These were going on basically nonstop. A little earlier, a bunch of work got added to the GC queue at peer suggestion:

I161107 16:55:21.914463 768 storage/store.go:2964  [n5,s5] [n5,s5,r2743/57:/Table/51/1/4{698925…-701153…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.935723 796 storage/store.go:2964  [n5,s5] [n5,s5,r1257/67:/Table/51/1/784{08551…-30898…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.938703 768 storage/store.go:2964  [n5,s5] [n5,s5,r5431/31:/Table/51/1/53{07262…-29575…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.939009 768 storage/store.go:2964  [n5,s5] [n5,s5,r4914/31:/Table/51/1/53{28626…-30871…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.940001 825 storage/store.go:2964  [n5,s5] [n5,s5,r2237/56:/Table/51/1/748{41331…-63794…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.940114 825 storage/store.go:2964  [n5,s5] [n5,s5,r2199/65:/Table/51/1/201{03979…-26553…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.951457 825 storage/store.go:2964  [n5,s5] [n5,s5,r1902/58:/Table/51/1/106{44313…-66968…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.957773 807 storage/store.go:2964  [n5,s5] [n5,s5,r926/64:/Table/51/1/440{56700…-79286…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.964253 825 storage/store.go:2964  [n5,s5] [n5,s5,r2020/55:/Table/51/1/60{29223…-31472…}]: added to replica GC queue (peer suggestion)
I161107 16:55:21.964370 825 storage/store.go:2964  [n5,s5] [n5,s5,r533/74:/Table/51/1/409{06422…-28901…}]: added to replica GC queue (peer suggestion)

This succession of additions happened multiple times over the couple of minutes the node was alive, and is actually the first thing that happened when the node came up at 16:54:

I161107 16:54:03.645026 1 server/config.go:443  1 storage engine initialized
I161107 16:54:03.647703 1 base/node_id.go:62  [n5] NodeID set to 5
I161107 16:54:04.029349 1 storage/store.go:1039  [n5] [n5,s5,r4769/?:/Table/51/1/166{08488…-31015…}]: added to replica GC queue
I161107 16:54:04.462060 1 storage/store.go:1039  [n5] [n5,s5,r4576/?:/Table/51/1/398{49584…-72060…}]: added to replica GC queue
I161107 16:54:04.679272 1 storage/store.go:1039  [n5] [n5,s5,r5630/?:/Table/51/1/509{68739…-90947…}]: added to replica GC queue
I161107 16:54:05.161409 1 storage/store.go:1039  [n5] [n5,s5,r4471/?:/Table/51/1/79{37765…-40018…}]: added to replica GC queue
I161107 16:54:05.317056 1 storage/store.go:1039  [n5] [n5,s5,r1928/?:/Table/51/1/877{57258…-79898…}]: added to replica GC queue
W161107 16:54:05.431494 1 storage/store.go:772  [n5] storeMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).Start for 1.783579241s (>100ms):

petermattis (Collaborator) commented:

It makes sense that we'd discover a bunch of GC work from peers when a dead node restarts. While the node was down, a number of ranges would have been rebalanced away from it.

Can you get a count of the number of ranges that were added to the GC queue?

a-robinson (Contributor) commented:

The entire cockroach.stderr file only has 711 lines containing "added to replica GC queue", which doesn't seem that terrible.

The node is up and running again now without any runaway memory usage or other notable problems, so I don't know that the debug endpoint is going to have what we need.

petermattis (Collaborator) commented:

I was thinking of something like: grep "added to replica GC queue" cockroach.stderr | egrep -o -w 'r[0-9]+' | sort -u | wc -l

a-robinson (Contributor) commented Nov 7, 2016

509 over the entire file, but narrowing down on the part of the log that we're actually interested in:

$ grep "161107 16:5" logs/cockroach.stderr | grep "added to replica" | egrep -o -w 'r[0-9]+' | sort -u | wc -l
92
$ grep "161107 16:5" logs/cockroach.stderr | grep "added to replica" | egrep -o -w 'r[0-9]+' | wc -l
92

petermattis (Collaborator) commented:

Yeah, that doesn't sound too bad.

a-robinson (Contributor) commented:

Someone else may want to see if they can get anything out of the log, because I'm not finding anything of interest. Even filtering out all the mutex-related errors, pretty much everything in the logs is along the lines of one of these messages:

I161107 16:55:21.964370 825 storage/store.go:2964  [n5,s5] [n5,s5,r533/74:/Table/51/1/409{06422…-28901…}]: added to replica GC queue (peer suggestion)
W161107 16:55:22.355916 124 storage/store.go:3186  [n5,s5,r5374/27:/Table/51/1/5{083521…-105910…}] handle raft ready: 2.8s
I161107 16:55:25.508395 186 server/status/runtime.go:228  runtime stats: 7.1 GiB RSS, 4515 goroutines, 2.9 GiB/1.0 MiB/3.1 GiB GO alloc/idle/total, 3.8 GiB/4.6 GiB CGO alloc/total, 95145.30cgo/sec, 2.52/0.71 %(u/s)time, 0.01 %gc (1x)
I161107 16:55:42.530690 61 gossip/gossip.go:1024  [n5] starting client to distant node 9 to tighten network graph
I161107 16:55:42.531399 9287 gossip/client.go:125  [n5] started gossip client to cockroach-blue-09:26257
I161107 16:55:42.551774 9287 gossip/client.go:130  [n5] closing client to node 9 (cockroach-blue-09:26257): received forward from node 9 to 10 (cockroach-blue-10:26257)
I161107 16:55:42.552097 9259 gossip/client.go:125  [n5] started gossip client to cockroach-blue-10:26257
W161107 16:55:42.688340 196 storage/engine/rocksdb.go:1010  batch [374173/25003827/0] commit took 560.665507ms (>500ms):
I161107 16:55:46.677276 7761 storage/replica_raftstorage.go:592  [n5,s5] [n5,s5,r3698/32:/Table/51/1/735{10631…-33281…}]: with replicaID 32, applying Raft snapshot at index 191559 (id=66d60744, encoded size=16, 53 rocksdb batches, 95 log entries)
E161107 16:56:32.687724 195 storage/queue.go:568  [n5,replicate] on [n5,s5,r1221/80:/Table/51/1/338{08766…-31261…}]: [n5,s5,r1221/80:/Table/51/1/338{08766…-31261…}]: could not obtain lease: range 1221 was not found

Maybe the handful of slow rocksdb batch commits is interesting (there are some that took over a second), but I don't see much else.

petermattis (Collaborator) commented:

Can you point me to the log file you're looking at? Or perhaps attach it to this issue.

a-robinson (Contributor) commented:

After some filtering/uniqing, the only remotely interesting line other than the ones mentioned above is:

I161107 16:54:25.376472 120 storage/replica.go:2073  [n5,s5,r4437/33:/Table/51/1/30{18051…-20300…}] not quiescing: 1 pending commands

Here's the log from just the two minutes that the process was alive:
cockroach-stderr.txt

petermattis (Collaborator) commented:

The long rocksdb batch commits are due to some mega-batches:

W161107 16:55:28.987855 196 storage/engine/rocksdb.go:1010  batch [381884/25515249/0] commit took 549.423208ms (>500ms):
W161107 16:55:42.688340 196 storage/engine/rocksdb.go:1010  batch [374173/25003827/0] commit took 560.665507ms (>500ms):
W161107 16:55:46.739576 196 storage/engine/rocksdb.go:1010  batch [384459/25687421/0] commit took 561.526045ms (>500ms):
W161107 16:55:53.756479 196 storage/engine/rocksdb.go:1010  batch [376423/25151353/0] commit took 758.771507ms (>500ms):

The numbers after batch [x/y/z] indicate the batches contained 370k-380k keys and ~25MB of data. I imagine these correspond to the replica GCs. Hmm, the stack traces show some are for replica removal and some are for snapshots.

Because of the way rocksdb performs writes, one long batch commit can delay others being committed at the same time:

W161107 16:55:53.769931 125 storage/engine/rocksdb.go:1010  batch [3/237/3] commit took 669.24103ms (>500ms):
W161107 16:55:53.781257 120 storage/engine/rocksdb.go:1010  batch [3/237/3] commit took 657.3112ms (>500ms):
W161107 16:55:57.715262 123 storage/engine/rocksdb.go:1010  batch [3/237/3] commit took 1.513268124s (>500ms):
W161107 16:55:57.715388 120 storage/engine/rocksdb.go:1010  batch [3/237/3] commit took 1.510755812s (>500ms):
W161107 16:55:57.716082 125 storage/engine/rocksdb.go:1010  batch [3/237/3] commit took 1.511698954s (>500ms):
W161107 16:55:57.719473 122 storage/engine/rocksdb.go:1010  batch [3/236/3] commit took 1.516734078s (>500ms):
W161107 16:55:57.719844 124 storage/engine/rocksdb.go:1010  batch [3/237/3] commit took 1.516342961s (>500ms):
W161107 16:55:57.731454 126 storage/engine/rocksdb.go:1010  batch [3/237/3] commit took 1.527611561s (>500ms):
W161107 16:55:57.781258 7761 storage/engine/rocksdb.go:1010  batch [383548/25707337/57] commit took 1.513055214s (>500ms):

Notice the timestamps here.
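A deliberately simplified model of that interaction (rocksdb's real write path uses grouped writers rather than a plain mutex; all names here are illustrative): small commits report long latencies because they queue behind the mega-batch that is still syncing.

package main

import (
	"fmt"
	"sync"
	"time"
)

// writeQueue models the serialization point: only one batch commits at a time.
var writeQueue sync.Mutex

// commit models committing a batch whose write+sync takes writeTime.
func commit(name string, writeTime time.Duration) {
	start := time.Now()
	writeQueue.Lock() // small batches queue behind the mega-batch here
	time.Sleep(writeTime)
	writeQueue.Unlock()
	fmt.Printf("%s commit took %s\n", name, time.Since(start))
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); commit("mega-batch", 550*time.Millisecond) }()
	time.Sleep(10 * time.Millisecond) // a tiny batch arrives just after
	go func() { defer wg.Done(); commit("tiny batch", time.Millisecond) }()
	wg.Wait()
}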

bdarnell (Contributor Author) commented Nov 8, 2016

I saw the same thing in #10494, and we've been tracking this in #9663.

a-robinson (Contributor) commented:

A node on delta just OOM'ed with similar symptoms. It's:

  • Logging an incredible number of contexts being canceled while in the command queue (15+ log messages about it per second, it appears):
W161109 14:26:52.742925 27966194 storage/replica.go:1281  [n3,s6,r1263/8:/Table/55/1/170{79398…-85057…}] context canceled while in command queue: ResolveIntent [/Table/55/1/1708290858787580620/"86276f6a-6277-4e53-accc-1385e209b1f9"/8879936/0,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min)
  • Applying Raft snapshots that are generating a bunch of rocksdb batches:
I161109 14:26:40.669825 27963166 storage/replica_raftstorage.go:592  [n3,s6,r1362/7:/Table/55/1/747{56600…-62275…}] [n3,s6,r1362/7:/Table/55/1/747{56600…-62275…}]: with replicaID 7, applying Raft snapshot at index 1701380 (id=258d5db5, encoded size=16, 49 rocksdb batches, 771 log entries)
  • Taking a long time to apply some large rocksdb batches:
W161109 14:26:38.649584 27960772 storage/engine/rocksdb.go:1010  batch [346997/23330254/52] commit took 1.903706226s (>500ms):
goroutine 27960772 [running]:
runtime/debug.Stack(0x1764020, 0x236b9a0, 0xc420010280)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Commit(0xc4fd8f48c0, 0xc4d6bd84b0, 0x2371140)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1010 +0x212
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applySnapshot(0xc4e1595180, 0x7f2f4b672138, 0xc4d6bd84b0, 0x6f4eada590fd1607, 0x7d23a3818ca8fc85, 0x34e0, 0xc4d1f97dc0, 0x36, 0x40, 0xc4d1f97e00, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raftstorage.go:674 +0xd20
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc4e1595180, 0x6f4eada590fd1607, 0x7d23a3818ca8fc85, 0x34e0, 0xc4d1f97dc0, 0x36, 0x40, 0xc4d1f97e00, 0x35, 0x40, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2200 +0x11eb
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest(0xc421c78000, 0x7f2f4b672138, 0xc4a59aa7b0, 0xc4774fb8b8, 0x6f4eada590fd1607, 0x7d23a3818ca8fc85, 0x34e0, 0xc4d1f97dc0, 0x36, 0x40, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2936 +0xc60
github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleSnapshot(0xc421c78000, 0xc4774fb860, 0x7f2f4b6380f0, 0xc48acd5680, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2590 +0x6cd
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1.1(0x2371fe0, 0xc48acd5680, 0xc4205aad20, 0xc48b640738, 0xc48b640740)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:361 +0x17f
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1(0x7f2f4b672138, 0xc4a59aa660)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:362 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc420436120, 0x202f05c, 0x19, 0x16b, 0x0, 0x0, 0xc4a59aa720, 0x7f2f4b672138, 0xc4a59aa660)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:264 +0xdf
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:265 +0x234

petermattis (Collaborator) commented:

Why are we applying so many snapshots? This reminds me of the snapshot death spirals that we were experiencing in late summer. Can you grep for whether these are Raft or preemptive snapshots (applying Raft snapshot vs applying preemptive snapshot)? The former indicates that the node fell behind for some reason.

a-robinson (Contributor) commented:

cockroach@cockroach-delta-03:~$ grep "applying" logs/cockroach.stderr | wc
    509   12216  144593
cockroach@cockroach-delta-03:~$ grep "applying Raft snapshot" logs/cockroach.stderr | wc
    416    9984  123082
cockroach@cockroach-delta-03:~$ grep "applying preemptive snapshot" logs/cockroach.stderr | wc
     93    2232   21511

petermattis (Collaborator) commented:

That's a lot of Raft snapshots. We need to figure out why this node is getting behind in the first place. I'd look for the first occurrence and see what the node was doing around then.

a-robinson (Contributor) commented:

Summary of some high-level investigating:

  • The node was started around 8:13 after more than 2 hours of not running
  • For about 3 hours, most of its raft traffic was snapshots (no proposals, votes, etc.). The cluster as a whole was experiencing a huge amount of raft voting at this time
  • The node has had practically no leaseholders since it first went down around 6:50
  • Over the same time period it's had a preposterous amount of snapshot activity: if I'm reading the Snapshots graph on the Replicas page right, it's been pegged at around 97-98 GB per minute of snapshot traffic ever since it came back online
  • It looks like things started getting out of control in the 11:15-14:30 range once the node started trying to do more than just snapshotting, causing goroutines and memory to grow until it OOM'ed

The fact that a node is doing nothing but snapshotting for 3+ hours after coming back on is crazy. Maybe the block_writer workload is particularly bad, but it looks like we have a real problem on our hands.

@petermattis - would you like me to continue looking, or for someone from @cockroachdb/stability to take over?

a-robinson (Contributor) commented:

Alright, I may have found the issue:

In processIntents, we tell RunLimitedAsyncTask to run a function that calls resolveIntents. This uses the replica's store's stopper and the intent resolver's semaphore.

This would be fine if resolveIntents didn't also call RunLimitedAsyncTask itself. This uses the intent resolver's store's stopper and the intent resolver's semaphore, meaning that it's using the same semaphore.

This effectively becomes a deadlock if there aren't any async tasks from resolveIntents finishing, which could either be due to some underlying issue in the batches they're trying to run or due to terribly bad luck in all 100 of the tasks getting started from processIntents and then getting blocked in resolveIntents on trying to start a new async task.
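A distilled sketch of the deadlock scenario described above (toy helper, not the real RunLimitedAsyncTask): if every outer task holds a slot in the semaphore and then blocks trying to acquire another slot from the same semaphore, no slot is ever released.

// runLimited is a toy stand-in for RunLimitedAsyncTask: it blocks until a
// slot in sem is free, then runs task in a goroutine that holds that slot.
func runLimited(sem chan struct{}, task func()) {
	sem <- struct{}{} // blocks while all slots are taken
	go func() {
		defer func() { <-sem }()
		task()
	}()
}

// processIntentsSketch mirrors the shape described above: the outer task holds
// a slot in sem while the inner call tries to acquire another slot from the
// very same semaphore. If all slots end up held by outer tasks that are parked
// on the inner acquisition, no slot is ever released and everything wedges.
func processIntentsSketch(sem chan struct{}, sendResolveIntentBatch func()) {
	runLimited(sem, func() {
		runLimited(sem, sendResolveIntentBatch) // re-enters the same semaphore
	})
}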

petermattis (Collaborator) commented:

The call to resolveIntents from processIntents passes wait = true. Doesn't that avoid the recursive use of RunLimitedAsyncTask?

spencerkimball (Member) commented:

Ben wrote the intent resolver, though I'm happy to help out here.

On Mon, Nov 21, 2016 at 3:40 PM Alex Robinson <notifications@github.com> wrote:

I'm not aware of any 7 min timeouts.

Yeah, there's nothing special about that 7 minute number given that the same pattern of (context cancellations -> long-held mutexes being released -> short period of life -> more intent_resolver throttling -> long period of deadness) happened again after a 2 minute break, then a 4 minute break, etc. Something else (not a particular amount of time passing) is causing the contexts to be canceled.

a-robinson (Contributor) commented:

Although I suppose that only blocks up the intent resolver, and shouldn't actually block callers of processIntentsAsync due to #10867. It will keep intents from getting cleaned up, though, which could bog down other parts of the system.

a-robinson (Contributor) commented:

Ah, you're right, it does look like we're avoiding that recursive use of RunLimitedAsyncTask, so the issue is deeper than that. It's probably worth @bdarnell / @spencerkimball taking a closer look here.

a-robinson (Contributor) commented:

Particularly, it's likely worth understanding what's causing the closure at

if err := func() error {
to consume 2.5 gigs of memory.

a-robinson (Contributor) commented:

It's unfortunate that the goroutine_profile process was (like the node_exporter) stopped without being restarted on Friday. I've kicked it back off again in case we decide to restart the process.

petermattis (Collaborator) commented:

I think we should wipe delta and try this again. We're encountering errors pretty reliably by starting from a fresh cluster.

bdarnell (Contributor Author) commented:

Yeah, I'm the expert on intentResolver these days.

Particularly, it's likely worth understanding what's causing

if err := func() error {
to consume 2.5 gigs of memory.

#10867 made it possible to queue up unbounded amounts of work in processIntentsAsync. This is where the bulk of the memory used by those tasks gets allocated, so it's unsurprising that when there's a deadlock in the intent resolver, we run out of memory here. The question is where the deadlock is (especially since #10867 means I don't think we're calling into the intent resolver while holding any important locks).

I'm starting to suspect that this and #10698 might have the same root cause, of some command getting submitted to the local replica, racing against its removal, and getting stuck.

petermattis (Collaborator) commented:

Any ideas what the problem with replica removal might be? We cancel local proposals when a replica is removed in Store.removeReplicaImpl. I would think that would unstick any proposal, but perhaps something isn't working correctly there. Or perhaps we somehow allow a proposal to sneak in after the replica has been destroyed. That doesn't seem possible because we cancel the local proposals and mark the replica destroyed atomically.

bdarnell (Contributor Author) commented:

We cancel all local proposals when the replica is removed in Store.removeReplicaImpl, but in order to get there we have to perform a consistent read on the range descriptor. If we're talking about removing a meta range, then this might entail sending a request to the local replica, which will get stuck trying to acquire the lease (which doesn't time out).

This code has been sketchy for some time, but there have been a few changes recently that touch related code that might account for the sudden appearance of problems (all of them by @spencerkimball, coincidentally: AmbiguousResultError, parallel DistSender, and the changes to cancellation in beginCmds). The way it's currently supposed to work (assuming we're doing a consistent read on a meta range, and we have a local replica which is slated for removal) is that we send a request to our local replica, it hangs trying to get a lease, SendNextTimeout kicks in and the DistSender tries one or both of the other replicas, and gets either NotLeaderErrors or success from them. Once it gets a successful response, it cancels the hung command on the local replica.

One possible change would be to cancel any pending lease requests when we have an event that triggers a replica GC. I think that might help dislodge things that are getting stuck.
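A hedged sketch of that idea (hypothetical fields and names, not the actual pendingLeaseRequest code): keep the cancel function for the in-flight lease request around, so a replica-GC trigger can dislodge anything stuck waiting on a lease that will never be acquired.

package sketch

import (
	"context"
	"sync"
)

// pendingLease is a stand-in for the state guarding an in-flight lease request.
type pendingLease struct {
	mu     sync.Mutex
	cancel context.CancelFunc // cancels the outstanding lease acquisition, if any
}

// start records the cancel function when a lease request is kicked off.
func (p *pendingLease) start(parent context.Context) context.Context {
	ctx, cancel := context.WithCancel(parent)
	p.mu.Lock()
	p.cancel = cancel
	p.mu.Unlock()
	return ctx
}

// onReplicaGCTriggered is what the suggestion above amounts to: when an event
// queues this replica for GC, abort the pending lease request so commands
// blocked behind it can return instead of hanging indefinitely.
func (p *pendingLease) onReplicaGCTriggered() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cancel != nil {
		p.cancel()
		p.cancel = nil
	}
}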

bdarnell (Contributor Author) commented:

After wiping and restarting delta, we're seeing another node with growing memory consumption and goroutine counts. Most of those are in tryAddWriteCmd (the first stack is from incoming RPCs, and the second is from the internal/local fast path in DistSender).

goroutine profile: total 2941
1916 @ 0x5f93ea 0x608ae9 0x6078dc 0xaa9405 0xaa87d0 0xaa3fb2 0xafb56e 0xb09f72 0xa562c1 0x6e8180 0xa42fbd 0xa432bb 0x93ff8d 0xcf2c60 0xcf4910 0xcf969b 0x62a551
#	0xaa9404	github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryAddWriteCmd+0x8a4	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1867
...
391 @ 0x5f93ea 0x608ae9 0x6078dc 0xaa9405 0xaa87d0 0xaa3fb2 0xafb56e 0xb09f72 0xa562c1 0x6e8180 0xa42fbd 0xa432bb 0xff6115 0xfede14 0xfe8ccb 0xfe91ce 0xfebe6f 0xffde4d 0x6ea39c 0x62a551
#	0xaa9404	github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryAddWriteCmd+0x8a4			/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1867

There's also some lock contention in Replica.propose:

161 @ 0x5f93ea 0x5f94de 0x60965d 0x6092f0 0x6a4740 0xdd6ab4 0xaaa8f4 0xaa8d20 0xaa87d0 0xaa3fb2 0xafb56e 0xb09f72 0xa562c1 0x6e8180 0xa42fbd 0xa432bb 0x93ff8d 0xcf2c60 0xcf4910 0xcf969b 0x62a551
#	0x6092ef	sync.runtime_Semacquire+0x2f							/usr/local/go/src/runtime/sema.go:47
#	0x6a473f	sync.(*Mutex).Lock+0xcf								/usr/local/go/src/sync/mutex.go:85
#	0xdd6ab3	github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Lock+0x33	/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:86
#	0xaaa8f3	github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose+0x1e3		/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2072

8 goroutines are attempting to acquire a lease:

8 @ 0x5f93ea 0x608ae9 0x6078dc 0xaa9405 0xaa87d0 0xaa3fb2 0xb2156f 0x6ea1df 0x62a551
#	0xaa9404	github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryAddWriteCmd+0x8a4				/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1867
#	0xaa87cf	github.com/cockroachdb/cockroach/pkg/storage.(*Replica).addWriteCmd+0x8f				/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1705
#	0xaa3fb1	github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send+0x221					/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1172
#	0xb2156e	github.com/cockroachdb/cockroach/pkg/storage.(*pendingLeaseRequest).InitOrJoinRequest.func1+0x1de	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_range_lease.go:133
#	0x6ea1de	github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1+0xde			/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:264

Moving on from goroutine stacks to /debug/requests, here are a couple of long-running intent resolution traces:

2016/11/23 03:40:08.404258	1.629734	(*intentResolver).resolveIntents
03:40:08.404270	 .    12	... resolving intents [wait=true]
03:40:08.404416	 .   146	... [n5] querying next range at /Table/51/1/219755434109322358/"49178b11-190a-4469-9d4f-39ac3767302a"/2809/0
03:40:08.404427	 .    11	... [n5] meta descriptor lookup
03:40:08.404441	 .    15	... [n5] looked up range descriptor from cache
03:40:08.404496	 .    55	... [n5] sending RPC for batch: 1 ResolveIntent, 4 Noop
03:40:08.956160	 .551664	... [n5] querying next range at /Table/51/1/832410399915310057/"49178b11-190a-4469-9d4f-39ac3767302a"/2810/0
03:40:08.956165	 .     5	... [n5] meta descriptor lookup
03:40:08.956178	 .    13	... [n5] looked up range descriptor from cache
03:40:08.956251	 .    73	... [n5] querying next range at /Table/51/1/1237744213661419168/"49178b11-190a-4469-9d4f-39ac3767302a"/2803/0
03:40:08.956261	 .     9	... [n5] meta descriptor lookup
03:40:08.956267	 .     7	... [n5] looked up range descriptor from cache
03:40:08.956319	 .    52	... [n5] querying next range at /Table/51/1/7023611640861738801/"49178b11-190a-4469-9d4f-39ac3767302a"/2802/0
03:40:08.956323	 .     4	... [n5] meta descriptor lookup
03:40:08.956330	 .     7	... [n5] looked up range descriptor from cache
03:40:08.956362	 .    31	... [n5] querying next range at /Table/51/1/7309774248664141470/"49178b11-190a-4469-9d4f-39ac3767302a"/2806/0
03:40:08.956379	 .    17	... [n5] meta descriptor lookup
03:40:08.956388	 .     9	... [n5] looked up range descriptor from cache
03:40:08.956413	 .    25	... [n5] sending RPC for batch: 1 ResolveIntent, 4 Noop
2016/11/23 03:40:08.375431	1.660041	(*intentResolver).resolveIntents
03:40:08.375443	 .    12	... resolving intents [wait=true]
03:40:08.375615	 .   172	... [n5] querying next range at /Table/51/1/222321605578102281/"0a23f68e-5373-4d74-8587-3c33f18e7459"/4862/0
03:40:08.375627	 .    13	... [n5] meta descriptor lookup
03:40:08.375642	 .    15	... [n5] looked up range descriptor from cache
03:40:08.375700	 .    57	... [n5] querying next range at /Table/51/1/856598158964203092/"0a23f68e-5373-4d74-8587-3c33f18e7459"/4866/0
03:40:08.375742	 .    43	... [n5] meta descriptor lookup
03:40:08.375755	 .    13	... [n5] looked up range descriptor from cache
03:40:08.375806	 .    51	... [n5] querying next range at /Table/51/1/1537613722308926215/"0a23f68e-5373-4d74-8587-3c33f18e7459"/4865/0
03:40:08.375815	 .     9	... [n5] meta descriptor lookup
03:40:08.375825	 .    10	... [n5] looked up range descriptor from cache
03:40:08.375867	 .    42	... [n5] querying next range at /Table/51/1/2871749494348429500/"0a23f68e-5373-4d74-8587-3c33f18e7459"/4864/0
03:40:08.375874	 .     7	... [n5] meta descriptor lookup
03:40:08.375886	 .    12	... [n5] looked up range descriptor from cache
03:40:08.375927	 .    41	... [n5] querying next range at /Table/51/1/2952671718014397498/"0a23f68e-5373-4d74-8587-3c33f18e7459"/4869/0
03:40:08.375935	 .     8	... [n5] meta descriptor lookup
03:40:08.375952	 .    17	... [n5] looked up range descriptor from cache
03:40:08.376004	 .    52	... [n5] sending RPC for batch: 1 ResolveIntent, 8 Noop
03:40:09.029082	 .653078	... [n5] querying next range at /Table/51/1/5965338343911269226/"0a23f68e-5373-4d74-8587-3c33f18e7459"/4870/0
03:40:09.029091	 .     9	... [n5] meta descriptor lookup
03:40:09.029110	 .    18	... [n5] looked up range descriptor from cache
03:40:09.029193	 .    83	... [n5] querying next range at /Table/51/1/8512685040791409210/"0a23f68e-5373-4d74-8587-3c33f18e7459"/4863/0
03:40:09.029203	 .    10	... [n5] meta descriptor lookup
03:40:09.029219	 .    16	... [n5] looked up range descriptor from cache
03:40:09.029267	 .    47	... [n5] querying next range at /Table/51/1/9116955393426193730/"0a23f68e-5373-4d74-8587-3c33f18e7459"/4867/0
03:40:09.029273	 .     6	... [n5] meta descriptor lookup
03:40:09.029286	 .    13	... [n5] looked up range descriptor from cache
03:40:09.029331	 .    46	... [n5] sending RPC for batch: 1 ResolveIntent, 8 Noop

It's doing a lot of meta descriptor lookups in between sending the actual batches. This suggests that either we're iterating through the entire keyspace instead of seeking to the next range for which we have an actual request, or we're seeking and re-seeking many times due to concurrent splits. But in any case those range descriptor lookups are hitting the cache and are quite fast; most of the time is spent in the real batches (which are not being parallelized, maybe because there's a local replica?).

The logs are full of "util/stop/stopper.go:294 stopper throttling task from storage/intent_resolver.go:437 due to semaphore", but it looks like the only goroutines getting blocked for this are the ones in intentResolver.Start.

bdarnell (Contributor Author) commented:

Also from the logs, there are bursts of this:

W161123 03:48:42.880232 11461039 storage/replica.go:1881  [n5,s9,r67/2:/Table/51/1/{791610…-806288…}] context cancellation after 3.0s of attempting command Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), ResolveIntent [/Table/51/1/8054586357341410321/"9d7d6a52-cb67-4f73-9607-546f79ba698a"/4613/0,/Min)
W161123 03:48:42.886792 11461039 storage/intent_resolver.go:391  [n5,s9,r35/2:/Table/51/1/{564867…-707383…}]: failed to resolve intents: result is ambiguous (context deadline exceeded)
W161123 03:48:42.932955 11465409 storage/replica.go:1881  [n5,s9,r67/2:/Table/51/1/{791610…-806288…}] context cancellation after 2.5s of attempting command Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), ResolveIntent [/Table/51/1/7950393756809084247/"5149d4ae-5104-4d54-aa76-727c33b89107"/5080/0,/Min)
W161123 03:48:42.933261 11465409 storage/intent_resolver.go:391  [n5,s9,r26/2:/Table/51/1/3{464150…-605030…}]: failed to resolve intents: result is ambiguous (context deadline exceeded)
W161123 03:48:43.082086 11469911 storage/replica.go:1881  [n5,s9,r67/2:/Table/51/1/{791610…-806288…}] context cancellation after 2.4s of attempting command Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), ResolveIntent [/Table/51/1/8023876340673471143/"869bb784-cc75-40da-b1a8-ea5f963d2c92"/4356/0,/Min), Noop [/Min,/Min), Noop [/Min,/Min)
W161123 03:48:43.082313 11466505 storage/intent_resolver.go:391  [n5,s9,r22/2:/Table/51{-/1/13930…}]: failed to resolve intents: result is ambiguous (context deadline exceeded)
W161123 03:48:43.178631 11467250 storage/replica.go:1881  [n5,s9,r67/2:/Table/51/1/{791610…-806288…}] context cancellation after 3.0s of attempting command Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), Noop [/Min,/Min), ResolveIntent [/Table/51/1/7990530510029727228/"6d18efc8-c572-461a-9b78-d61ac7cde794"/2979/0,/Min), Noop [/Min,/Min)
W161123 03:48:43.179005 11467035 storage/intent_resolver.go:391  [n5,s9,r24/2:/Table/51/1/2{307963…-448065…}]: failed to resolve intents: result is ambiguous (context deadline exceeded)

I wonder how often we're hitting context cancellation without logging this message because tryAbandon returned false. Then we block the caller indefinitely. (I thought with the change to endCmds we would be able to return early in this case?)

bdarnell (Contributor Author) commented:

This node ran out of memory and died 53 minutes after startup, with memory growing linearly the whole time. The goroutine count showed only a slight upward trend: over the same period in which memory usage more than doubled, it went from 3000 to 3500 goroutines. After restarting the node, memory usage is flat. I was too slow to grab heap profiles from the time when it was running out of memory.

tamird (Contributor) commented Nov 23, 2016

Based on your comment in #10970, it sounds like this was not seen again (and #11191 was seen instead), correct?

bdarnell (Contributor Author) commented:

Since delta was last wiped and restarted, we saw this OOM once and the consistency issue #11191 on two nodes. I didn't push any new binaries; it was running b56e3c7-dirty the whole time. Only @petermattis can tell us exactly what's in that, but it looks like #10970 was included in this build, so that PR did not fix the problem.

petermattis (Collaborator) commented:

That build is the ReplicaState snapshot optimization. It was dirty because of untracked log files in the build directory. It definitely contained #10970. We were crossing our fingers that would address the OOM. Looks like it didn't. Back to debugging.
bdarnell (Contributor Author) commented:

Calls to the local server are made synchronously; maybe the problem is that #10970 is unable to cancel those. Putting those in a goroutine is the next thing I'd try.
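A minimal sketch of that idea (generic stand-in types, not the actual DistSender code): run the blocking local call in its own goroutine and select on the context, so the caller can be cancelled even though the local call itself keeps running.

package sketch

import "context"

// localResponse is a stand-in for the actual batch response type.
type localResponse struct{}

// sendLocalAsync wraps a blocking local call so the caller can give up when
// ctx is cancelled. The call still runs to completion in the background; the
// buffered channel keeps the finished goroutine from leaking.
func sendLocalAsync(ctx context.Context, call func() (*localResponse, error)) (*localResponse, error) {
	type result struct {
		resp *localResponse
		err  error
	}
	ch := make(chan result, 1)
	go func() {
		resp, err := call()
		ch <- result{resp, err}
	}()
	select {
	case r := <-ch:
		return r.resp, r.err
	case <-ctx.Done():
		return nil, ctx.Err() // the caller is unblocked; the call finishes on its own
	}
}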

petermattis (Collaborator) commented:

Putting those in a goroutine is the next thing I'd try.

Sure, I can try that. Is there any additional information you want to gather on the consistency check failure? That's rather worrisome. I struggle to see how #10931 could have caused it, but perhaps I fouled something up there.

bdarnell (Contributor Author) commented:

I don't know what to look for with #11191; it's definitely worrisome. Maybe run the consistency checker more often and the raft log GC less often? (But even with the raft logs I don't know exactly what I'd find that would be useful.)

#10931 looks safe to me, but I'll pick over it again.

petermattis added a commit to petermattis/cockroach that referenced this issue Nov 23, 2016
Perform local calls in a goroutine so that they can be cancelled and so
that they don't block the caller indefinitely.

See cockroachdb#10427
tamird (Contributor) commented Dec 1, 2016

Closed by #11196, which made the intent resolver apply backpressure.
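A sketch of backpressure in that spirit (illustrative only, not the actual #11196 change): the producer must acquire a semaphore slot before the async resolution is launched, so callers slow down when the resolver is saturated instead of queueing unbounded work.

package sketch

import "context"

// resolveAsyncWithBackpressure blocks the caller until either a slot frees up
// or its context is cancelled, bounding the amount of queued resolution work.
func resolveAsyncWithBackpressure(ctx context.Context, sem chan struct{}, resolve func()) error {
	select {
	case sem <- struct{}{}: // backpressure: wait here while the resolver is busy
	case <-ctx.Done():
		return ctx.Err()
	}
	go func() {
		defer func() { <-sem }() // free the slot when the resolution finishes
		resolve()
	}()
	return nil
}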

tamird closed this as completed Dec 1, 2016