storage/bulk: split and scatter during ingest #35016
When many processes are concurrently bulk-ingesting without explicit pre-splitting, it is easy for them to all suddenly overwhelm a given range with SSTs before it gets a chance to update its stats and start to split.

True pre-splitting is sometimes difficult -- absent a pre-read to sample the data to ingest or an external source of that information (like a BACKUP descriptor), until we actually read the data and are ready to ingest it, we don't know where to split.

However, if we produce an SST large enough that we opt to flush it based on its size, given that we usually chunk at known range boundaries, we can reasonably assume it will make a decently full range. Thus, it is probably worth splitting off, and scattering, a new range for it before we send it out.

As the target key-space becomes more split, the range-aware chunking will mean we start flushing smaller SSTs. We have no idea when sending a small SST whether it is going to make the target range over-full or not, so it doesn't make sense to indiscriminately pre-split for it. But happily, we probably don't need to -- the fact that we chunked based on a known split suggests the target keyspace is already well enough split that we're at least routing work to different ranges, so we can let the ranges split themselves as they fill up from there on.

Release note: none.
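To make the flush-time decision concrete, here is a minimal sketch of the idea, assuming a hypothetical adminClient interface and sstBatcher type in place of the real KV client and the actual code in pkg/storage/bulk/sst_batcher.go:

```go
package bulk

import "context"

// adminClient is a hypothetical stand-in for the KV client the batcher uses;
// the real code issues AdminSplit / AdminScatter / AddSSTable RPCs.
type adminClient interface {
	AdminSplit(ctx context.Context, splitKey []byte) error
	AdminScatter(ctx context.Context, key []byte) error
	AddSSTable(ctx context.Context, start, end, data []byte) error
}

// sstBatcher sketches just the flush-time decision described above.
type sstBatcher struct {
	db       adminClient
	preSplit bool // e.g. false for RESTORE, which pre-splits on its own
}

// flush sends the buffered SST. If the flush was triggered by size rather
// than by reaching a known range boundary, the buffered data likely fills a
// range on its own, so we split off and scatter a new range at its start key
// before sending it.
func (b *sstBatcher) flush(ctx context.Context, start, end, data []byte, dueToSize bool) error {
	if b.preSplit && dueToSize {
		if err := b.db.AdminSplit(ctx, start); err != nil {
			return err
		}
		if err := b.db.AdminScatter(ctx, start); err != nil {
			return err
		}
	}
	return b.db.AddSSTable(ctx, start, end, data)
}
```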
tbg
left a comment
Something that I'm mildly worried about is that you may not see your own splits. This is not a new concern and what I mean is the following: if you run AdminSplit, the DistSender won't discover the split until it sends a request to the RHS. Since we split at the start key, the request that ends up discovering the split is presumably the AddSSTable itself, which makes this an expensive proposition.
Really DistSender should in the common case discover the splits as they happen without the need for an errant request to do the work, but I'm not sure that's how it works today, and I also don't know that we would find out naturally.
You could send a manual Get to the split key after the split; that should fix it. Actually the scatter should do that job already, come to think of it. But maybe this phenomenon is at work in any of the other places that split in restore/import. What do you think?
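As a rough illustration of that suggestion (not code from the PR; kvDB and splitAndWarmCache are hypothetical stand-ins for the real client):

```go
package bulk

import "context"

// kvDB is a hypothetical subset of the KV client, just enough to illustrate
// the suggestion above.
type kvDB interface {
	AdminSplit(ctx context.Context, splitKey []byte) error
	Get(ctx context.Context, key []byte) ([]byte, error)
}

// splitAndWarmCache splits at key and then issues a cheap point read at that
// key, so the DistSender walks to the new right-hand range and refreshes its
// range descriptor cache before the expensive AddSSTable is sent there.
func splitAndWarmCache(ctx context.Context, db kvDB, key []byte) error {
	if err := db.AdminSplit(ctx, key); err != nil {
		return err
	}
	_, err := db.Get(ctx, key)
	return err
}
```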
The change itself LGTM.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/storageccl/import.go, line 132 at r1 (raw file):
}
const presplit = false // RESTORE does its own split-and-scatter.
One day I'll understand who does what where and why. Not today, I suppose, unless you'd humor me by adding a few lines of commentary in a convenient place (here?)
pkg/storage/bulk/sst_batcher.go, line 182 at r1 (raw file):
b.sstWriter.DataSize, roachpb.PrettyPrintKey(nil, start))
if err := b.db.AdminSplit(ctx, start, start); err != nil {
	return err
Do you really want errors here to end the party? Maybe spurious errors aren't actually as common with AdminSplit as I expected. (If you change it, I would also make sure that you don't scatter if the split already failed).
Hmm, I'm now seeing your comment on scatter below. You seem to know what works and what doesn't. Carry on.
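For concreteness, a best-effort variant of the pre-split along the lines discussed above might look like the following sketch (splitScatterer and maybeSplitAndScatter are hypothetical, not what the PR implements):

```go
package bulk

import (
	"context"
	"log"
)

// splitScatterer is a hypothetical subset of the KV client, used only to
// sketch the best-effort error handling discussed above.
type splitScatterer interface {
	AdminSplit(ctx context.Context, splitKey []byte) error
	AdminScatter(ctx context.Context, key []byte) error
}

// maybeSplitAndScatter treats the pre-split as best-effort: if the split
// fails, it logs and skips the scatter instead of failing the flush, since
// the pre-split is only an optimization.
func maybeSplitAndScatter(ctx context.Context, db splitScatterer, key []byte) {
	if err := db.AdminSplit(ctx, key); err != nil {
		log.Printf("failed to pre-split at %q: %v; skipping scatter", key, err)
		return // don't scatter when the split didn't happen
	}
	if err := db.AdminScatter(ctx, key); err != nil {
		log.Printf("failed to scatter at %q: %v", key, err)
	}
}
```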
Oh and I was curious -- did you see this being a problem in practice, or is this PR just based on what you're expecting to be a problem in practice?
this was a reaction to running a big dist ingest IMPORT -- it locked up pretty badly with 7 nodes all at idle CPU waiting on addsst RPCs, and one node sitting with a bunch of requests waiting on spanlatchmanager acquire and only one at a time actually applying (which gets much more expensive when ingesting into non-empty key space, as index and fast-import do). With this change, it failed differently -- initially everything seemed blocked on AdminSplit calls, then one by one the nodes seemed to unblock and start making progress. That said, some of the nodes made it to 10 minutes of waiting on a given AdminSplit call -- like literally the goroutine dump showed a single goroutine with a stack in AdminSplit that had been sitting on one for that long.
Interesting. Is this unexpected? I've been suspecting that this overlapping SST business really drives Rocks into the ground and yeah, the span latch manager won't like it either. Is that what you're seeing? Look for goroutines in
Adding