
ResourceExhausted while retrieving snapshot (grpc: received message larger than max) #5919

Closed
danielmai opened this issue Jul 9, 2020 · 3 comments
Labels
area/operations Related to operational aspects of the DB, including signals, flags, env vars, etc. kind/bug Something is broken. status/accepted We accept to investigate/work on it.

Comments

@danielmai
Contributor

What version of Dgraph are you using?

v20.03.1

Have you tried reproducing the issue with the latest release?

No

What is the hardware spec (RAM, OS)?

Kubernetes, 40 GB memory per Alpha

Steps to reproduce the issue (command/config used to run Dgraph).

  1. Run 6-node Dgraph cluster (3 Zeros, 3 Alphas)
  2. Get an Alpha to retrieve a snapshot from a peer

Expected behaviour and actual result.

Snapshot retrieval should succeed. Instead, it fails with a gRPC ResourceExhausted error and restarts.

The error logs show this:

alpha-2 (snapshot receiver):

The receiver is partway through a snapshot when it hits an error saying a single message exceeds the 2 GB gRPC maximum, and it then immediately requests a new snapshot.

E0709 21:41:57.526150       1 draft.go:1056] While retrieving snapshot, error: cannot retrieve snapshot from peer: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (2147483648 vs. 2147483647). Retrying...
I0709 21:41:57.526195       1 draft.go:112] Operation completed with id: opSnapshot
I0709 21:41:57.626301       1 draft.go:147] Operation started with id: opSnapshot
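
For reference, 2147483647 is math.MaxInt32, the receive-size cap commonly set on Go gRPC clients. A minimal, self-contained sketch of how such a cap is configured (the dial options are real grpc-go API; the peer address is a placeholder):

package main

import (
	"log"
	"math"

	"google.golang.org/grpc"
)

func main() {
	// grpc-go rejects any incoming message larger than the configured
	// maximum, which produces exactly the "received message larger than
	// max (2147483648 vs. 2147483647)" error seen above.
	conn, err := grpc.Dial(
		"alpha-1:7080", // placeholder peer address
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(math.MaxInt32)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}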

alpha-1 (snapshot sender):

The sender had been streaming a snapshot for 46 minutes when a new stream started. It reported an error on the original stream 20 seconds later:

I0709 21:41:47.327839       1 log.go:34] Sending Snapshot Time elapsed: 46m13s, bytes sent: 21 GB, speed: 7.7 MB/sec
I0709 21:41:57.626109       1 snapshot.go:186] Got StreamSnapshot request: context:<id:3 group:1 addr:"<redacted>:7080" > index:34604729 read_ts:72055639 since_ts:57373150 
...
E0709 21:42:07.674264       1 snapshot.go:188] While streaming snapshot: context canceled. Reporting failure.

More logs around the same time:

I0709 21:41:14.491801       1 log.go:34] Sending Snapshot Created batch of size: 2.7 GB in 27.135662379s.
W0709 21:41:23.218177       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 290ms. Breakdown: [{sync 290ms} {disk 0s} {proposals 0s} {advance 0s}] Num entries: 1. MustSync: true
W0709 21:41:33.037682       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 391ms. Breakdown: [{sync 391ms} {disk 0s} {proposals 0s} {advance 0s}] Num entries: 1. MustSync: true
I0709 21:41:47.327801       1 log.go:34] Sending Snapshot Created batch of size: 211 MB in 32.60734449s.
I0709 21:41:47.327839       1 log.go:34] Sending Snapshot Time elapsed: 46m13s, bytes sent: 21 GB, speed: 7.7 MB/sec
I0709 21:41:57.626109       1 snapshot.go:186] Got StreamSnapshot request: context:<id:3 group:1 addr:"<redacted>:7080" > index:34604729 read_ts:72055639 since_ts:57373150 
I0709 21:41:57.626179       1 snapshot.go:124] Waiting to reach timestamp: 72055639
I0709 21:41:58.022547       1 log.go:34] Sending Snapshot Created batch of size: 4.3 MB in 12.156603ms.
I0709 21:41:58.090950       1 log.go:34] Sending Snapshot Created batch of size: 4.3 MB in 12.166202ms.
I0709 21:41:58.337845       1 log.go:34] Sending Snapshot Created batch of size: 4.3 MB in 12.752222ms.
I0709 21:41:58.366555       1 log.go:34] Sending Snapshot Created batch of size: 4.3 MB in 11.302674ms.
I0709 21:41:58.428954       1 log.go:34] Sending Snapshot Created batch of size: 4.3 MB in 13.484547ms.
I0709 21:41:58.445024       1 log.go:34] Sending Snapshot Created batch of size: 4.3 MB in 14.333675ms.
W0709 21:41:58.579282       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 739ms. Breakdown: [{sync 739ms} {disk 0s} {proposals 0s} {advance 0s}] Num entries: 1. MustSync: true
W0709 21:41:59.217652       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 629ms. Breakdown: [{sync 629ms} {disk 0s} {proposals 0s} {advance 0s}] Num entries: 2. MustSync: true
W0709 21:41:59.423229       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 206ms. Breakdown: [{sync 205ms} {disk 0s} {proposals 0s} {advance 0s}] Num entries: 5. MustSync: true
I0709 21:41:59.620173       1 log.go:34] Sending Snapshot Created batch of size: 4.3 MB in 1.173804391s.
I0709 21:42:01.321820       1 log.go:34] Sending Snapshot Created batch of size: 151 MB in 1.379716895s.
W0709 21:42:02.817896       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 672ms. Breakdown: [{disk 591ms} {sync 81ms} {proposals 0s} {advance 0s}] Num entries: 1. MustSync: true
I0709 21:42:04.654627       1 log.go:34] Sending Snapshot Created batch of size: 142 MB in 1.532204734s.
E0709 21:42:07.674264       1 snapshot.go:188] While streaming snapshot: context canceled. Reporting failure.
W0709 21:42:12.350625       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 232ms. Breakdown: [{sync 232ms} {disk 0s} {proposals 0s} {advance 0s}] Num entries: 1. MustSync: true
W0709 21:42:22.945681       1 draft.go:1183] Raft.Ready took too long to process: Timer Total: 528ms. Breakdown: [{sync 527ms} {disk 0s} {proposals 0s} {advance 0s}] Num entries: 1. MustSync: true
I0709 21:42:23.446618       1 log.go:34] Sending Snapshot Created batch of size: 104 MB in 18.61112033s.
I0709 21:42:42.394785       1 log.go:34] Sending Snapshot Created batch of size: 203 MB in 18.575915681s.
I0709 21:42:42.394857       1 log.go:34] Sending Snapshot Time elapsed: 44s, bytes sent: 630 MB, speed: 14 MB/sec
I0709 21:42:57.963867       1 log.go:34] Sending Snapshot Created batch of size: 216 MB in 15.330505804s.
I0709 21:42:57.963904       1 log.go:34] Sending Snapshot Time elapsed: 01m00s, bytes sent: 845 MB, speed: 14 MB/sec
@danielmai danielmai added kind/bug Something is broken. status/accepted We accept to investigate/work on it. area/operations Related to operational aspects of the DB, including signals, flags, env vars, etc. labels Jul 9, 2020
@martinmr
Contributor

martinmr commented Jul 9, 2020

The problem is here:

stream.Send = func(list *bpb.KVList) error {
	// Wrap the entire KVList in a single pb.KVS message and send it
	// over the wire in one shot, with no check on its encoded size.
	kvs := &pb.KVS{Kv: list.Kv}
	num += len(kvs.Kv)
	return out.Send(kvs)
}

If kvs is bigger than 2 GB, we'll attempt to send it all at once over the wire.

The fix can be made in either of two places:

  • In Badger, avoid creating batches larger than the maximum gRPC message size.
  • Modify the Send function to split the list into multiple smaller lists and send them one by one (see the sketch below).
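
A minimal sketch of the second option, assuming a hypothetical maxBatchSize cap and that bpb.KV exposes a gogoproto-style Size() method (as Badger's generated types do); the chunking below is illustrative, not the actual fix that landed:

stream.Send = func(list *bpb.KVList) error {
	// Hypothetical per-message cap, chosen well below the 2 GB gRPC limit.
	const maxBatchSize = 256 << 20 // 256 MB

	var batch []*bpb.KV
	var size int
	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		num += len(batch)
		err := out.Send(&pb.KVS{Kv: batch})
		batch, size = nil, 0
		return err
	}
	for _, kv := range list.Kv {
		// Start a new message before this KV would push the batch over the cap.
		if size+kv.Size() > maxBatchSize {
			if err := flush(); err != nil {
				return err
			}
		}
		batch = append(batch, kv)
		size += kv.Size()
	}
	return flush()
}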

@jarifibrahim thoughts?

@poonai
Contributor

poonai commented Jul 10, 2020

I think we need to tweak discardN to send small snapshots one by one, instead of splitting the single snapshot into multiple requests.

@martinmr
Contributor

Fixed by dgraph-io/badger#1412
