storage/bulk: split and scatter during ingest #35016
When many processes are concurrently bulk-ingesting without explicit pre-splitting, it is easy for them to all suddenly overwhelm a given range with SSTs before it gets a chance to update its stats and start to split.

True pre-splitting is sometimes difficult -- absent a pre-read to sample the data to ingest or an external source of that information (like a BACKUP descriptor), until we actually read the data and are ready to ingest it, we don't know where to split.

However, if we produce an SST large enough that we opt to flush it based on its size, given that we usually chunk at known range boundaries, we can reasonably assume it will make a decently full range. Thus, it is probably worth splitting off, and scattering, a new range for it before we send it out.

As the target key-space becomes more split, the range-aware chunking will mean we start flushing smaller SSTs. We have no idea when sending a small SST whether it is going to make the target range over-full or not, so it doesn't make sense to indiscriminately pre-split for it. But happily, we probably don't need to -- the fact that we chunked based on a known split suggests the target keyspace is already well enough split that we're at least routing work to different ranges, so we can let the ranges split themselves as they fill up from there on.

Release note: none.
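To make the flush-time decision concrete, here is a minimal sketch of the idea, assuming a hypothetical adminClient interface and sstBatcher type in place of the real KV client and the actual code in pkg/storage/bulk/sst_batcher.go:

```go
package bulk

import "context"

// adminClient is a hypothetical stand-in for the KV client the batcher uses;
// the real code issues AdminSplit / AdminScatter / AddSSTable RPCs.
type adminClient interface {
	AdminSplit(ctx context.Context, splitKey []byte) error
	AdminScatter(ctx context.Context, key []byte) error
	AddSSTable(ctx context.Context, start, end, data []byte) error
}

// sstBatcher sketches just the flush-time decision described above.
type sstBatcher struct {
	db       adminClient
	preSplit bool // e.g. false for RESTORE, which pre-splits on its own
}

// flush sends the buffered SST. If the flush was triggered by size rather
// than by reaching a known range boundary, the buffered data likely fills a
// range on its own, so we split off and scatter a new range at its start key
// before sending it.
func (b *sstBatcher) flush(ctx context.Context, start, end, data []byte, dueToSize bool) error {
	if b.preSplit && dueToSize {
		if err := b.db.AdminSplit(ctx, start); err != nil {
			return err
		}
		if err := b.db.AdminScatter(ctx, start); err != nil {
			return err
		}
	}
	return b.db.AddSSTable(ctx, start, end, data)
}
```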
tbg
left a comment
Something that I'm mildly worried about is that you may not see your own splits. This is not a new concern and what I mean is the following: if you run AdminSplit, the DistSender won't discover the split until it sends a request to the RHS. Since we split at the start key, the request that ends up discovering the split is presumably the AddSSTable itself, which makes this an expensive proposition.
Really DistSender should in the common case discover the splits as they happen without the need for an errant request to do the work, but I'm not sure that's how it works today, and I also don't know that we would find out naturally.
You could send a manual Get to the split key after the split; that should fix it. Actually the scatter should do that job already, come to think of it. But maybe this phenomenon is at work in any of the other places that split in restore/import. What do you think?
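As a rough illustration of that suggestion (not code from the PR; kvDB and splitAndWarmCache are hypothetical stand-ins for the real client):

```go
package bulk

import "context"

// kvDB is a hypothetical subset of the KV client, just enough to illustrate
// the suggestion above.
type kvDB interface {
	AdminSplit(ctx context.Context, splitKey []byte) error
	Get(ctx context.Context, key []byte) ([]byte, error)
}

// splitAndWarmCache splits at key and then issues a cheap point read at that
// key, so the DistSender walks to the new right-hand range and refreshes its
// range descriptor cache before the expensive AddSSTable is sent there.
func splitAndWarmCache(ctx context.Context, db kvDB, key []byte) error {
	if err := db.AdminSplit(ctx, key); err != nil {
		return err
	}
	_, err := db.Get(ctx, key)
	return err
}
```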
The change itself LGTM.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @dt)
pkg/ccl/storageccl/import.go, line 132 at r1 (raw file):
}
const presplit = false // RESTORE does its own split-and-scatter.
One day I'll understand who does what where and why. Not today, I suppose, unless you'd humor me by adding a few lines of commentary in a convenient place (here?)
pkg/storage/bulk/sst_batcher.go, line 182 at r1 (raw file):
b.sstWriter.DataSize, roachpb.PrettyPrintKey(nil, start))
if err := b.db.AdminSplit(ctx, start, start); err != nil {
	return err
Do you really want errors here to end the party? Maybe spurious errors aren't actually as common with AdminSplit as I expected. (If you change it, I would also make sure that you don't scatter if the split already failed).
Hmm, I'm now seeing your comment on scatter below. You seem to know what works and what doesn't. Carry on.
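For concreteness, a best-effort variant of the pre-split along the lines discussed above might look like the following sketch (splitScatterer and maybeSplitAndScatter are hypothetical, not what the PR implements):

```go
package bulk

import (
	"context"
	"log"
)

// splitScatterer is a hypothetical subset of the KV client, used only to
// sketch the best-effort error handling discussed above.
type splitScatterer interface {
	AdminSplit(ctx context.Context, splitKey []byte) error
	AdminScatter(ctx context.Context, key []byte) error
}

// maybeSplitAndScatter treats the pre-split as best-effort: if the split
// fails, it logs and skips the scatter instead of failing the flush, since
// the pre-split is only an optimization.
func maybeSplitAndScatter(ctx context.Context, db splitScatterer, key []byte) {
	if err := db.AdminSplit(ctx, key); err != nil {
		log.Printf("failed to pre-split at %q: %v; skipping scatter", key, err)
		return // don't scatter when the split didn't happen
	}
	if err := db.AdminScatter(ctx, key); err != nil {
		log.Printf("failed to scatter at %q: %v", key, err)
	}
}
```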
Oh and I was curious -- did you see this being a problem in practice, or is this PR just based on what you're expecting to be a problem in practice?
this was a reaction to running a big dist ingest IMPORT -- it locked up pretty badly with 7 nodes all at idle CPU waiting on addsst RPCs, and one node sitting with a bunch of requests waiting on spanlatchmanager acquire and only one at a time actually applying (which gets much more expensive when ingesting into non-empty key space, as index and fast-import do). With this change, it failed differently -- initially everything seemed blocked on AdminSplit calls, then one by one the nodes seemed to unblock and start making progress. That said, some of the nodes made it to 10 minutes of waiting on a given AdminSplit call -- like literally the goroutine dump showed a single goroutine with a stack in AdminSplit that had been sitting on one for that long.
Interesting. Is this unexpected? I've been suspecting that this overlapping SST business really drives Rocks into the ground and yeah, the span latch manager won't like it either. Is that what you're seeing? Look for goroutines in
Adding