Conversation

@dt dt commented Feb 16, 2019

When many processes are concurrently bulk-ingesting without explicit
pre-splitting, it is easy for them to all suddenly overwhelm a given
range with SSTs before it gets a chance to update its stats and start to
split.

True pre-splitting is sometimes difficult -- absent a pre-read to sample
the data to ingest or an external source of that information (like a
BACKUP descriptor), we don't know where to split until we actually read
the data and are ready to ingest it.

However, if we produce an SST large enough that we opt to flush it based
on its size, given that we usually chunk at known range boundaries, we
can reasonably assume it will make a decently full range. Thus, it
probably would be helpful to split off, and scatter, a new range for it
before we send it out.

As the target key-space becomes more split, the range-aware chunking
will mean we start flushing smaller SSTs. When sending a small SST, we
have no idea whether it is going to leave the target range over-full or
not, so it doesn't make sense to pre-split indiscriminately. But
happily, we probably don't need to -- the fact that we chunked based on
a known split suggests the target keyspace is already well enough split
that we're at least routing work to different ranges, so we can let
the ranges split themselves as they fill up from there on.

Release note: none.
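
To make the mechanism concrete, here is a minimal sketch of the flush-time check, assuming a simplified admin interface; the interface and helper names are invented for illustration, and the actual change lives in pkg/storage/bulk/sst_batcher.go:

```go
package bulk

import (
	"context"
	"log"
)

// adminDB captures the two admin RPCs the sketch needs; the real batcher uses
// a *client.DB, whose AdminSplit/AdminScatter signatures differ by version.
type adminDB interface {
	AdminSplit(ctx context.Context, key []byte) error
	AdminScatter(ctx context.Context, key []byte) error
}

// maybePresplit is a hypothetical helper: when a buffered SST is being flushed
// purely because of its size (rather than because we crossed a known range
// boundary), split off and scatter a new range at its start key before the
// AddSSTable request is sent.
func maybePresplit(ctx context.Context, db adminDB, start []byte, sizeFlush bool) error {
	if !sizeFlush {
		// The flush was triggered by a known range boundary, which suggests the
		// target keyspace is already split well enough; let ranges split
		// themselves as they fill up.
		return nil
	}
	if err := db.AdminSplit(ctx, start); err != nil {
		return err
	}
	if err := db.AdminScatter(ctx, start); err != nil {
		// Scatter is best-effort; a failure shouldn't fail the ingestion.
		log.Printf("scatter at %q failed: %v", start, err)
	}
	return nil
}
```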

@dt dt requested review from a team and danhhz February 16, 2019 00:12

@dt dt force-pushed the split-in-ingest branch from f52575d to 4a0641b on February 16, 2019 00:14

@tbg tbg left a comment

Something that I'm mildly worried about is that you may not see your own splits. This is not a new concern, and what I mean is the following: if you run AdminSplit, the DistSender won't discover the split until it sends a request to the RHS. Since we split at the start key, the request that discovers the split is presumably the AddSSTable itself, which makes this an expensive proposition.

Really DistSender should in the common case discover the splits as they happen without the need for an errant request to do the work, but I'm not sure that's how it works today, and I also don't know that we would find out naturally.

You could send a manual Get to the split key after the split, that should fix it. Actually the scatter should do that job already, come to think of it. But maybe this phenomenon is at work in any of the other places that split in restore/import. What do you think?
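
A rough sketch of that workaround, with an invented kvReader interface standing in for the real *client.DB handle, might look like:

```go
package bulk

import (
	"context"
	"log"
)

// kvReader captures the single call this sketch needs; in CockroachDB this
// would be a *client.DB (its Get signature may differ by version).
type kvReader interface {
	Get(ctx context.Context, key []byte) ([]byte, error)
}

// primeRangeCache sends a throwaway point read to the split key so that the
// DistSender routes a cheap request to the new right-hand range and refreshes
// its range descriptor cache before the expensive AddSSTable is sent.
func primeRangeCache(ctx context.Context, db kvReader, splitKey []byte) {
	// The value is irrelevant; the read exists only for its side effect.
	if _, err := db.Get(ctx, splitKey); err != nil {
		log.Printf("cache-priming read at %q failed: %v", splitKey, err)
	}
}
```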

The change itself LGTM.

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @dt)


pkg/ccl/storageccl/import.go, line 132 at r1 (raw file):

	}

	const presplit = false // RESTORE does its own split-and-scatter.

One day I'll understand who does what where and why. Not today, I suppose, unless you'd humor me by adding a few lines of commentary in a convenient place (here?)


pkg/storage/bulk/sst_batcher.go, line 182 at r1 (raw file):

			b.sstWriter.DataSize, roachpb.PrettyPrintKey(nil, start))
		if err := b.db.AdminSplit(ctx, start, start); err != nil {
			return err

Do you really want errors here to end the party? Maybe spurious errors aren't actually as common with AdminSplit as I expected. (If you change it, I would also make sure that you don't scatter if the split already failed).
Hmm, I'm now seeing your comment on scatter below. You seem to know what works and what doesn't. Carry on.
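
For illustration, the more forgiving behavior suggested here might look roughly like the following, reusing the adminDB interface from the sketch under the PR description (names are illustrative only):

```go
// bestEffortPresplit treats both the split and the scatter as best-effort:
// failures are logged, never returned, and the scatter is skipped when the
// split failed, so we don't scatter a range that was never created.
func bestEffortPresplit(ctx context.Context, db adminDB, start []byte) {
	if err := db.AdminSplit(ctx, start); err != nil {
		log.Printf("presplit at %q failed, skipping scatter: %v", start, err)
		return // don't scatter a split that never happened
	}
	if err := db.AdminScatter(ctx, start); err != nil {
		log.Printf("scatter at %q failed: %v", start, err)
	}
}
```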

tbg commented Feb 19, 2019 via email

dt commented Feb 19, 2019

this was a reaction to running a big distributed-ingest IMPORT -- it locked up pretty badly, with 7 nodes all at idle CPU waiting on AddSSTable RPCs, and one node sitting with a bunch of requests waiting on spanlatchmanager acquisition and only one at a time actually applying (which gets much more expensive when ingesting into non-empty key space, as index and fast-import do).

With this change, it failed differently -- initially everything seemed blocked on AdminSplit calls, then one by one the nodes seemed to unblock and start making progress. That said, some of the nodes made it to 10 minutes of waiting on a given AdminSplit call -- literally, the goroutine dump showed a single goroutine, with a stack in AdminSplit, that had been sitting on one select for 10 minutes.

tbg commented Feb 20, 2019

Interesting. Is this unexpected? I've been suspecting that this overlapping-SST business really drives Rocks into the ground, and yeah, the span latch manager won't like it either. Is that what you're seeing? Look for goroutines in engine._C.
An AdminSplit shouldn't block for ten minutes, so here too I'd be interested in learning what it actually blocked on.

dt commented Apr 2, 2019

Adding AddSSTable to the split-backpressure list made automatic splitting work again, so I'm not currently working on pre-splitting. I might come back to this -- if we know we're going to add a full range's worth of data, it might make sense to pre-split before doing so, but with automatic splits working again, I'm going to punt on this for now.
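
For context, the backpressure idea amounts to something like the sketch below; all names here are invented and the real logic lives in the storage package, so treat this only as an illustration of the shape of the check, not the actual implementation:

```go
package storage

import "context"

// rangeState is a stand-in for the per-range bookkeeping the real code keeps.
type rangeState struct {
	sizeBytes    int64         // current logical size of the range
	maxSizeBytes int64         // configured split threshold
	splitDone    chan struct{} // closed when a queued split finishes
}

// maybeBackpressure blocks a backpressurable command (with AddSSTable now on
// that list) when the range has grown far past its target size, giving a
// pending split a chance to complete before more data piles in.
func maybeBackpressure(ctx context.Context, r *rangeState) error {
	const backpressureMultiplier = 2
	if r.sizeBytes < backpressureMultiplier*r.maxSizeBytes {
		return nil // plenty of headroom; evaluate the command immediately
	}
	select {
	case <-r.splitDone: // wait for the range to split back under its threshold
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```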

@dt dt deleted the split-in-ingest branch April 2, 2019 14:36