stability: OOM on delta #10427
I think this may be the result of @spencerkimball's #10279, which stopped trying to cancel raft commands when the context is canceled. Previously, the number of async intent resolutions that could be queued up at a time was bounded, because we couldn't start a new one until an old one timed out. Now, when the old one times out, it stays queued up in the replica but the intentResolver is freed to go queue up another one. This allows the intentResolver to queue up work faster than it can be completed.
We should consider using the semaphore mode in which you don't block when you can't proceed. Then most of these intent resolutions would simply be skipped, which I think makes sense overall (what goroutines would the old code block? Presumably goroutines that would either hang around for nothing, or even block things we would want to happen faster).
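For concreteness, here is a minimal Go sketch of that non-blocking semaphore mode, built on a buffered channel; trySem and resolveIntentsAsync are illustrative names rather than the actual intentResolver API:

```go
package main

import "fmt"

// trySem is a counting semaphore built on a buffered channel.
type trySem chan struct{}

// tryAcquire takes a slot without blocking and reports whether one was free.
func (s trySem) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// release returns a slot to the semaphore.
func (s trySem) release() { <-s }

func main() {
	// Allow at most 100 in-flight async intent resolutions.
	sem := make(trySem, 100)

	resolveIntentsAsync := func(work func()) {
		if !sem.tryAcquire() {
			// Semaphore full: skip this resolution instead of queueing
			// unbounded work behind the command queue.
			fmt.Println("skipping intent resolution")
			return
		}
		go func() {
			defer sem.release()
			work()
		}()
	}

	done := make(chan struct{})
	resolveIntentsAsync(func() {
		fmt.Println("resolving intents")
		close(done)
	})
	<-done
}
```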
We're not seeing goroutines blocked on the semaphore. The goroutines that accumulate are the intent resolutions themselves, blocked on the command queue.
A contributing factor here is that whenever there is a noop in the request (because of the way DistSender split things up), the command queue sees it as a span from KeyMin to the actual key. This creates a lot of unnecessary contention in the command queue. I don't think that's enough to explain the whole problem, but it sure doesn't help. Fixing this is trivial (just filter out noops before the command queue), but my test is doing something I don't quite understand so I'm still looking at it.
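A rough sketch of that trivial fix, filtering noops out before their spans reach the command queue; the request type here is a simplified stand-in, not the actual roachpb or command queue code:

```go
package main

import "fmt"

// request is a simplified stand-in for one entry in a batch; a noop
// carries no meaningful key span.
type request struct {
	noop     bool
	startKey string
	endKey   string
}

// spansForCommandQueue returns the key spans the command queue should
// track for a batch, skipping noops so they don't contribute a span
// anchored at KeyMin that collides with everything else.
func spansForCommandQueue(reqs []request) [][2]string {
	var spans [][2]string
	for _, r := range reqs {
		if r.noop {
			continue // filtered out before touching the command queue
		}
		spans = append(spans, [2]string{r.startKey, r.endKey})
	}
	return spans
}

func main() {
	batch := []request{
		{noop: true},                 // a noop produced by DistSender's split
		{startKey: "a", endKey: "b"}, // the real request
	}
	fmt.Println(spansForCommandQueue(batch)) // only the real span remains
}
```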
PRs #10470 and #10487 address factors that could result in a death spiral when command queue contention gets too high. However, I don't have a convincing theory about what might have triggered this sudden spike. The focus on intent resolutions may have been a red herring - they're noisy in the logs, but that doesn't mean they're the root cause. I don't think we'll be able to uncover any more at this point. Maybe with the two command queue fixes in place we won't run out of memory as quickly and we'll be able to see more in the logs. Unless anyone else has ideas I'm planning to close this when #10487 is merged.
Closing now that the two PRs above have been merged.
A node on delta just OOM'ed. Note that the inflection point in the memory graph is when the node was restarted. The Raft tick and Raft time graphs are interesting. The node appeared to be up, but wasn't ticking or doing Raft work for many minutes. There should be something in the logs.
Well, I'm finally seeing the logs from around the time of the crash (for whatever reason I was only seeing logs up through 11:11 in the cockroach.stderr file until just recently even though the crash happened around 16:56 -- it seems as if the process restarting again recently fixed the problem). There was a ton of contention. There are an incredible number of complaints about mutex holds longer than a second or two on
These were going on basically nonstop. A little earlier, a bunch of work got added to the GC queue at peer suggestion:
This succession of things being added happened multiple times over the relevant couple of minutes that the node was alive, and is actually the first thing that happened when the node came up at 16:54:
It makes sense that we'd discover a bunch of GC work from peers when a dead node restarts. When the peer was down a number of ranges would have been rebalanced away from it. Can you get a count of the number of ranges that were added to the GC queue?
The entire cockroach.stderr file only has 711 lines containing "added to replica GC queue", which doesn't seem that terrible. The node is up and running again now without any runaway memory usage or other notable problems, so I don't know that the debug endpoint is going to have what we need.
I was thinking of something like:
509 over the entire file, but narrowing down on the part of the log that we're actually interested in:
Yeah, that doesn't sound too bad.
Someone else may want to see if they can get anything out of the log, because I'm not finding anything of interest. Even filtering out all the mutex-related errors, pretty much everything in the logs is along the lines of one of these messages:
Maybe the handful of slow rocksdb batch commits is interesting (there are some that took over a second), but I don't see much else.
Can you point me to the log file you're looking at? Or perhaps attach it to this issue.
After some filtering/uniqing, the only remotely interesting line other than the ones mentioned above is:
Here's the log from just the two minutes that the process was alive:
The long rocksdb batch commits are due to some mega-batches:
Those numbers after
The way rocksdb performs writes, one long batch commit can affect others being committed at the same time:
Notice the timestamps here.
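To illustrate the effect with a toy example (this collapses RocksDB's group commit down to a single shared mutex, which is a simplification), a small batch that arrives while a mega-batch is committing observes the mega-batch's latency:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// commitMu stands in for a write path that serializes batch commits; the
// point is only that commit latency is shared, not the RocksDB details.
var commitMu sync.Mutex

func commitBatch(writeTime time.Duration, started chan<- struct{}) {
	commitMu.Lock()
	defer commitMu.Unlock()
	if started != nil {
		close(started) // signal that this commit is now in progress
	}
	time.Sleep(writeTime) // simulate the actual write/sync work
}

func main() {
	var wg sync.WaitGroup
	megaStarted := make(chan struct{})

	wg.Add(1)
	go func() {
		defer wg.Done()
		commitBatch(900*time.Millisecond, megaStarted) // the mega-batch
	}()

	<-megaStarted // the mega-batch now holds the commit path
	start := time.Now()
	commitBatch(1*time.Millisecond, nil) // a tiny batch arriving concurrently
	// The tiny batch's observed commit time includes the mega-batch's tail.
	fmt.Printf("small batch commit took ~%v\n", time.Since(start).Round(100*time.Millisecond))
	wg.Wait()
}
```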
A node on delta just OOM'ed with similar symptoms. It's:
Why are we applying so many snapshots? This reminds me of the snapshot death spirals that we were experiencing in late summer. Can you grep for whether these are
That's a lot of
Summary of some high-level investigation:
The fact that a node is doing nothing but snapshotting for 3+ hours after coming back up is crazy. Maybe the block_writer workload is particularly bad, but it looks like we have a real problem on our hands. @petermattis - would you like me to continue looking, or for someone from @cockroachdb/stability to take over?
Alright, I may have found the issue: In
This would be fine if
This effectively becomes a deadlock if there aren't any async tasks from
The call to
Ben wrote the intent resolver, though I'm happy to help out here.
Although I suppose that only blocks up the intent resolver, and shouldn't actually block callers of processIntentsAsync due to #10867. It will keep intents from getting cleaned up, though, which could bog down other parts of the system.
Ah, you're right, it does look like we're avoiding that recursive use of
Particularly, it's likely worth understanding what's causing the behavior at cockroach/pkg/storage/replica_command.go line 706 (commit a7b7048).
It's unfortunate that the goroutine_profile process was (like the node_exporter) stopped without being restarted on Friday. I've kicked it back off again in case we decide to restart the process.
I think we should wipe
Yeah, I'm the expert on intentResolver these days.
#10867 made it possible to queue up unbounded amounts of work in processIntentsAsync. This is where the bulk of the memory used by those tasks gets allocated, so it's unsurprising that when there's a deadlock in the intent resolver, we run out of memory here. The question is where the deadlock is (especially since #10867 made it so that I don't think we're calling into the intent resolver while holding any important locks). I'm starting to suspect that this and #10698 might have the same root cause: some command getting submitted to the local replica, racing against its removal, and getting stuck.
Any ideas what the problem with replica removal might be? We cancel local proposals when a replica is removed in Store.removeReplicaImpl.
We cancel all local proposals when the replica is removed in Store.removeReplicaImpl, but in order to get there we have to perform a consistent read on the range descriptor. If we're talking about removing a meta range, then this might entail sending a request to the local replica, which will get stuck trying to acquire the lease (which doesn't time out). This code has been sketchy for some time, but there have been a few changes recently that touch related code that might account for the sudden appearance of problems (all of them by @spencerkimball, coincidentally: AmbiguousResultError, parallel DistSender, and the changes to cancellation in beginCmds).

The way it's currently supposed to work (assuming we're doing a consistent read on a meta range, and we have a local replica which is slated for removal) is that we send a request to our local replica, it hangs trying to get a lease, SendNextTimeout kicks in and the DistSender tries one or both of the other replicas, and gets either NotLeaderErrors or success from them. Once it gets a successful response, it cancels the hung command on the local replica.

One possible change would be to cancel any pending lease requests when we have an event that triggers a replica GC. I think that might help dislodge things that are getting stuck.
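A stripped-down sketch of that flow, where the request to the hung local replica is raced against the other replicas and canceled as soon as any of them responds; sendWithNextTimeout and sendFn are illustrative names, not the actual DistSender code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendFn stands in for an RPC to a single replica; the local replica may
// hang (e.g. stuck acquiring a lease) until its context is canceled.
type sendFn func(ctx context.Context) (string, error)

// sendWithNextTimeout mimics the flow described above: try the first
// replica, fan out to the next one after sendNextTimeout, and cancel all
// outstanding attempts as soon as any replica responds successfully.
func sendWithNextTimeout(ctx context.Context, sendNextTimeout time.Duration, replicas []sendFn) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the hung attempts once we have an answer

	results := make(chan string, len(replicas))
	errs := make(chan error, len(replicas))
	for i, send := range replicas {
		send := send
		go func(delay time.Duration) {
			select {
			case <-time.After(delay): // stagger attempts by sendNextTimeout
			case <-ctx.Done():
				return
			}
			if resp, err := send(ctx); err != nil {
				errs <- err
			} else {
				results <- resp
			}
		}(time.Duration(i) * sendNextTimeout)
	}

	var lastErr error
	for range replicas {
		select {
		case resp := <-results:
			return resp, nil
		case err := <-errs:
			lastErr = err
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", lastErr
}

func main() {
	local := func(ctx context.Context) (string, error) {
		<-ctx.Done() // hung trying to acquire a lease, until canceled
		return "", ctx.Err()
	}
	remote := func(ctx context.Context) (string, error) { return "range descriptor", nil }

	resp, err := sendWithNextTimeout(context.Background(), 100*time.Millisecond, []sendFn{local, remote})
	fmt.Println(resp, err) // the remote answer arrives; the local attempt is canceled
}
```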
After wiping and restarting delta, we're seeing another node with growing memory consumption and goroutine counts. Most of those are in tryAddWriteCmd (the first stack is from incoming rpcs, and the second is from the internal/local fast path in DistSender).
There's also some lock contention in Replica.propose:
8 goroutines are attempting to acquire a lease:
Moving on from goroutine stacks to /debug/requests, here are a couple of long-running intent resolution traces:
It's doing a lot of meta descriptor lookups in between sending the actual batches. This suggests that either we're iterating through the entire keyspace instead of seeking to the next range for which we have an actual request, or we're seeking and re-seeking many times due to concurrent splits. But in any case those range descriptor lookups are hitting the cache and are quite fast; most of the time is spent in the real batches (which are not being parallelized, maybe because there's a local replica?). The logs are full of
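As an aside on the "seeking to the next range" point: a tiny illustrative sketch of seeking directly to the ranges that actually contain request keys instead of iterating over every range in the keyspace (the data layout here is purely hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// Purely illustrative: ranges are represented by their sorted start keys
// (the first range starts at KeyMin, here ""), and the requests we care
// about by their sorted keys. Rather than visiting every range in the
// keyspace, seek directly to the range containing each request key.
func rangesWithRequests(rangeStarts, requestKeys []string) []int {
	var touched []int
	for _, key := range requestKeys {
		// Smallest index whose start key is >= key...
		i := sort.SearchStrings(rangeStarts, key)
		// ...then step back if key falls strictly inside the previous range.
		if i == len(rangeStarts) || rangeStarts[i] != key {
			i--
		}
		if n := len(touched); n == 0 || touched[n-1] != i {
			touched = append(touched, i)
		}
	}
	return touched
}

func main() {
	rangeStarts := []string{"", "f", "m", "t"} // four ranges covering the keyspace
	requestKeys := []string{"a", "b", "x"}     // keys with intents to resolve
	fmt.Println(rangesWithRequests(rangeStarts, requestKeys)) // [0 3]: the middle ranges are never visited
}
```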
Also from the logs, there are bursts of this:
I wonder how often we're hitting context cancellation without logging this message because
This node ran out of memory and died 53 minutes after startup, with memory growing linearly the whole time. Goroutine count showed a slight upward trend, but over the same period in which memory usage more than doubled, it only went from 3000 to 3500 goroutines. After restarting the node, memory usage is flat. I was too slow to grab heap profiles from the time when it was running out of memory.
Since delta was last wiped and restarted, we saw this OOM once and the consistency issue #11191 on two nodes. I didn't push any new binaries; it was running b56e3c7-dirty the whole time. Only @petermattis can tell us exactly what's in that, but it looks like #10970 was included in this build, so that PR did not fix the problem.
That build is the ReplicaState snapshot optimization. It was dirty because
Calls to the local server are made synchronously; maybe the problem is that #10970 is unable to cancel those. Putting those in a goroutine is the next thing I'd try. |
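A minimal sketch of what putting the local call in a goroutine could look like, so the caller can bail out on context cancellation while the call itself keeps running; callLocalAsync is an illustrative name, not the actual change:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callLocalAsync wraps a synchronous local-server call in a goroutine so
// the caller can give up when its context is canceled, rather than being
// blocked indefinitely behind the local call.
func callLocalAsync(ctx context.Context, call func() (string, error)) (string, error) {
	type result struct {
		resp string
		err  error
	}
	done := make(chan result, 1) // buffered so the goroutine never blocks on send
	go func() {
		resp, err := call()
		done <- result{resp, err}
	}()
	select {
	case r := <-done:
		return r.resp, r.err
	case <-ctx.Done():
		// The local call keeps running, but the caller is no longer stuck.
		return "", ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	_, err := callLocalAsync(ctx, func() (string, error) {
		time.Sleep(time.Hour) // a local call stuck behind the command queue
		return "ok", nil
	})
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true: the caller gave up
}
```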
Sure, I can try that. Is there any additional information you want to gather on the consistency check failure? That's rather worrisome. I struggle to see how #10931 could have caused it, but perhaps I fouled something up there.
Perform local calls in a goroutine so that they can be cancelled and so that they don't block the caller indefinitely. See cockroachdb#10427
Closed by #11196 which made the intent resolver apply backpressure.
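For contrast with the earlier non-blocking semaphore sketch, here is a minimal sketch of backpressure in that spirit: callers block (or give up on context cancellation) once a bounded number of async resolutions are in flight. This is illustrative only, not the actual #11196 change:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// boundedTasks applies backpressure: callers block (or give up when their
// context is canceled) once the limit of in-flight async tasks is reached,
// instead of queueing unbounded work.
type boundedTasks struct {
	sem chan struct{}
}

func newBoundedTasks(limit int) *boundedTasks {
	return &boundedTasks{sem: make(chan struct{}, limit)}
}

// run starts task asynchronously once a slot is available.
func (b *boundedTasks) run(ctx context.Context, task func()) error {
	select {
	case b.sem <- struct{}{}: // blocks here once the limit is reached
	case <-ctx.Done():
		return ctx.Err()
	}
	go func() {
		defer func() { <-b.sem }() // free the slot when the task finishes
		task()
	}()
	return nil
}

func main() {
	var wg sync.WaitGroup
	tasks := newBoundedTasks(2) // at most two resolutions in flight

	for i := 0; i < 5; i++ {
		i := i
		wg.Add(1)
		// Once two resolutions are in flight, this call blocks until one finishes.
		if err := tasks.run(context.Background(), func() {
			defer wg.Done()
			fmt.Println("resolving intents for batch", i)
		}); err != nil {
			wg.Done()
		}
	}
	wg.Wait()
}
```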
One node on delta (running 9ce8833) ran out of memory, with a large spike in goroutine count right before the end. This node had also experienced problems with the previous build (when prevote was enabled).
The logs show a lot of intent resolutions being cancelled in the command queue, and raftMu being held for over a second at a time (across multiple ranges).
Many goroutines are blocked at replica.go:1221 (waiting for cmdQMu) for very long times (there are 10k+ of these, but panicparse isn't grouping them all together for some reason):
It looks like the command queue has gotten large:
This is similar to what I saw with race builds in #10388.