batcheval: estimate stats in EvalAddSSTable #35231
Conversation
Some potential follow-ups: we could add a stats field in the AddSSTable request so the caller could do the computation for the SST it is sending, and then the eval could use that instead of computing it itself. Moving that computation would probably help significantly when lots of producers are all sending to a single range at the same time.
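For concreteness, a rough sketch of that follow-up, assuming a hypothetical MVCCStats field on AddSSTableRequest and a hypothetical computeSSTStats helper (neither exists in this change):

```go
// evalStats returns the stats delta for the SST being ingested. If the
// producer attached precomputed stats to the request (hypothetical field),
// trust them; otherwise fall back to scanning the SST at evaluation time.
func evalStats(req *roachpb.AddSSTableRequest) (enginepb.MVCCStats, error) {
	if req.MVCCStats != nil {
		// The caller computed these while building the SST, so evaluation
		// doesn't need to iterate over the file again.
		return *req.MVCCStats, nil
	}
	// Fallback: iterate the SST's contents and compute its stats here.
	return computeSSTStats(req.Data)
}
```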
danhhz
left a comment
The code change looks good to me, but I'd like someone more familiar with stats to sign off on the idea.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @danhhz, @dt, and @nvanbenschoten)
pkg/storage/batcheval/cmd_add_sstable.go, line 58 at r1 (raw file):
}
stats.ContainsEstimates = true
ms.Add(stats)
This deserves a big comment, what you have in the commit message would be perfect.
Force-pushed a216566 to 8def39f
dt
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)
pkg/storage/batcheval/cmd_add_sstable.go, line 58 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
This deserves a big comment, what you have in the commit message would be perfect.
commented bigly.
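For readers without the diff handy, the commented code ends up looking roughly like this (a paraphrase of the idea, not the literal committed text):

```go
// The SST may contain keys that already exist in the range (e.g. on a retry),
// in which case simply adding the SST's stats to the range's stats would
// double-count them. Rather than pay for a recomputation of the overlapping
// span on every ingestion, optimistically assume nothing is shadowed and mark
// the result as an estimate; a later recomputation (e.g. during a split or a
// consistency check) corrects any discrepancy.
stats.ContainsEstimates = true
ms.Add(stats)
```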
I'm thinking of a follow-up on this to have IMPORT/RESTORE/IndexBackfiller explicitly send a recompute request to the TableSpan or IndexSpan it loaded when it is done. It looks like Recompute is a point request though, so I think I might actually want a …
You could also do the same trick we do with ExportRequest etc., and manually do it via a sender.
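A minimal sketch of that approach, assuming the existing point RecomputeStatsRequest and the client.SendWrapped / NonTransactionalSender plumbing; discovering each range's start key (e.g. by walking range descriptors over the loaded span) is left out:

```go
// recomputeStatsForRanges sends a point RecomputeStatsRequest to each range
// we ingested into, addressed by that range's start key.
func recomputeStatsForRanges(ctx context.Context, db *client.DB, startKeys []roachpb.Key) error {
	for _, k := range startKeys {
		req := &roachpb.RecomputeStatsRequest{
			RequestHeader: roachpb.RequestHeader{Key: k},
		}
		if _, pErr := client.SendWrapped(ctx, db.NonTransactionalSender(), req); pErr != nil {
			return pErr.GoError()
		}
	}
	return nil
}
```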
nvb
left a comment
Triggering a recomputation to correct the stats after this is all done sounds like a good idea.
It's also worth pointing out how splitting works, both normally and in relation to the ContainsEstimates flag. When we split a range, we already need to recompute the stats of the LHS. In addition, when splitting a range with estimated stats, we also recompute the stats of the RHS. What this means in practice is that we should be periodically correcting the estimated stats that we're creating here as ranges split, which is a good thing. The downside is that splits will become more expensive due to the ContainsEstimates flag, which may cut into the perf win this provides.
Speaking of which, do you have any numbers about the improvement we see due to this?
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)
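To illustrate the split behavior described above (a sketch under the assumption of a helper like rditer.ComputeStatsForRange, not the literal splitTrigger code):

```go
// splitStats mirrors the behavior described above: the LHS is always
// recomputed from scratch; the RHS is normally derived by subtraction, but if
// the pre-split stats carry estimates (e.g. from AddSSTable ingestion) it is
// recomputed as well, which corrects the estimates but makes the split pricier.
func splitStats(
	reader engine.Reader, origMS enginepb.MVCCStats, leftDesc, rightDesc roachpb.RangeDescriptor, nowNanos int64,
) (left, right enginepb.MVCCStats, err error) {
	if left, err = rditer.ComputeStatsForRange(&leftDesc, reader, nowNanos); err != nil {
		return
	}
	right = origMS
	right.Subtract(left)
	if origMS.ContainsEstimates {
		right, err = rditer.ComputeStatsForRange(&rightDesc, reader, nowNanos)
	}
	return
}
```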
@nvanbenschoten I took a stab at sending out explicit recomputes in #35251. W.r.t. splits having to recompute: that's still better than what's currently happening. With the code before this change, every AddSSTable has to recompute the existing stats for the span it covers, so the cost is proportional to the span it covers, not to the amount of data added. Currently, in index backfill, we're ingesting many small (in size) but wide (in overlap) SSTs. These won't fill (and thus split) a range for some time, so we'll do many of these ingestions per split, recomputing the same stats over and over. Before this change, I was seeing some runs where, due to luck/data distribution, one node would get hit with a bunch of the …
Force-pushed 8def39f to 0069d8c
This helper is intended to be called on the span(s) in which a caller has been doing bulk ingestion, to trigger consistency checks on the range(s) in that span. This might be a good idea on its own, but it also has the side effect of fixing up any MVCC stats inaccuracies that might have been introduced by estimates during ingestion. It is probably not critical to actually call this: eventually the consistency queue will run anyway and fix up those stats if needed. However, if we know we'll call it right away -- specifically before we mark the ingesting operation as completed and turn those ranges over to real traffic that might expect correct-ish stats -- we might be able to make coarser, cheaper estimates, for instance during SST ingestion where the current, accurate stats recomputation is expensive. See cockroachdb#35231.
Release note: None
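A minimal sketch of such a helper, assuming the existing ranged CheckConsistencyRequest and the same client plumbing as above (the names here are illustrative, not the committed API):

```go
// checkConsistencyOverSpan asks every range overlapping a bulk-ingested span
// to run a consistency check, which as a side effect recomputes and corrects
// any estimated MVCC stats left behind by ingestion.
func checkConsistencyOverSpan(ctx context.Context, db *client.DB, sp roachpb.Span) error {
	req := &roachpb.CheckConsistencyRequest{
		RequestHeader: roachpb.RequestHeader{Key: sp.Key, EndKey: sp.EndKey},
	}
	// A ranged request fans out to every range overlapping [Key, EndKey).
	if _, pErr := client.SendWrapped(ctx, db.NonTransactionalSender(), req); pErr != nil {
		return pErr.GoError()
	}
	return nil
}
```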
Force-pushed 0069d8c to 29a7f8f
@nvanbenschoten asked what stats might be off with this change. The only way the stats are different here is if the SST contains a key already in RocksDB, in which case we should only count it once -- and for the copy in the SST. I checked what stats we actually compute from the added file to see what we'd overcount in the case of a retry: live_bytes, live_count, key_bytes, key_count, val_bytes, val_count. Everything else is zero, so I don't think repeatedly adding an SST would change them. We might think a range is bigger than it is, split it, recompute, and then realize it is undersized and merge it. That should be very rare -- we pretty rarely ingest the same key more than once.
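To make the retry scenario concrete, a test-style sketch; makeSingleKeySST, evalAddSSTable, and recomputeRangeStats are hypothetical helpers standing in for the real setup:

```go
// A sketch, not the actual test code: ingest the same one-key SST twice, as a
// retry would, then compare the accumulated (estimated) stats against a
// ground-truth recomputation.
func TestAddSSTableRetryOvercount(t *testing.T) {
	eng := engine.NewInMem(roachpb.Attributes{}, 1<<20)
	defer eng.Close()

	ts := hlc.Timestamp{WallTime: 1}
	sst := makeSingleKeySST(t, "a", ts, "v")

	var ms enginepb.MVCCStats
	evalAddSSTable(t, eng, &ms, sst) // first ingestion
	evalAddSSTable(t, eng, &ms, sst) // retry: the same key is ingested again

	actual := recomputeRangeStats(t, eng) // ground truth from the data itself
	diff := ms
	diff.Subtract(actual)
	// diff should be nonzero only in the fields the SST itself contributes --
	// LiveBytes, LiveCount, KeyBytes, KeyCount, ValBytes, ValCount -- each
	// overcounted by one key's worth. Every other field in the SST's stats is
	// zero, so those stay exact even across retries.
}
```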
As discussed offline with @nvanbenschoten and @bdarnell, we have renewed interest in this as it might be the best way we have to cut the latency impact of slow …
The change still looks good to me. Thanks for confirming that everything is good with the parts of the stats that can get out of whack. Do we expect this to have a performance impact on RESTORE and IMPORT as well? Were those two double-counting their stats previously too?
Copy-pasting my concerns from Slack:
In the end, I am okay with this change because the benefit outweighs the risk. The recomputation is only triggered by the consistency checker when there's a mismatch, so we wouldn't spend much extra time scanning the range after it has been corrected. The ContainsEstimates flag is poisonous and we'll still need to carry out work like the above if we want to get rid of it, but that'll need some sort of migration, for which the ship has sailed this release cycle.
@nvanbenschoten RESTORE and IMPORT add SSTs with negligible overlap of existing keys (only on retries) and spend basically no time computing stats for the existing keys spanned by the added SST, so they will see ~no change here -- this will really only show up in index backfills or in some cases for experimental direct import (non-ordered data, either in a non-ordered PK or from secondary indexes). @tbg the benefit is very significant in index backfills: since we need to flush while reading the table and buffering produced index data to keep buffering memory bounded, the amount that, after sorting it by destination range, gets sent in one SST to a given range can sometimes be small, and we may end up sending many such SSTs over the course of the backfill. However, the stats recompute when ingesting each of those covers all existing keys spanned by the SST, so the insidious part is that the cost of ingesting, say, 50KB of index entries isn't proportional to 50KB but to the 25MB of existing data they span, and we pay it over and over on every little file. If we can avoid that, and just pay it once later on split, that can be a huge win.
@dt right, I'm aware, and that's why I'm fine with this change. Not doing it that way is basically just terrible. This leaves us with a bit of fallout to cover but we should be able to get to it. |
If there are only 50KB of updates to a given range, is it still worthwhile to ingest an SST instead of a regular write batch? What's the cutoff at which an SST ingestion makes sense?
@bdarnell we thought about that and I think it is a good idea at some size -- we opted not to do it right away as we wanted only one path exercised/debugged for now, but I think it could be a good optimization to add to the BulkAdder at some point. I used a small SST as a particularly illustrative example of why recomputing stats is expensive, but even for a decently sized one, it is still slow -- we at least get more ingested for the recompute so maybe we do fewer, but requests that end up behind one still see the recompute latency.
bors r=nvanbenschoten
Build failed
Force-pushed 29a7f8f to 3546b94
At @nvanbenschoten's suggestion, I changed it to focus specifically on testing cases where the stats estimates over/undercount. I had to remove the randomness to make it easy to assert exactly what would/wouldn't over- and undercount (at least without adding a ton of MVCC logic to the test), but that made it practical to actually document which counts did/didn't reflect keys shadowing/being shadowed.
Force-pushed 9473204 to a1ee5e4
nvb
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @danhhz and @dt)
pkg/storage/batcheval/cmd_add_sstable_test.go, line 250 at r2 (raw file):
	kvs[i].Value = roachpb.MakeValueFromBytes([]byte(in[i].v)).RawBytes
}
sort.Slice(kvs, func(i, j int) bool { return kvs[i].Key.Less(kvs[j].Key) })
Should this sort based on timestamp for equal keys?
pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):
added 4
"added 4 keys"
pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):
thinks
nit: we haven't done this yet, so should these all be future tense?
pkg/storage/batcheval/cmd_add_sstable_test.go, line 291 at r2 (raw file):
colission
collision
pkg/storage/batcheval/cmd_add_sstable_test.go, line 380 at r2 (raw file):
}()
if actual, estimated, expDelta := afterStats.KeyCount, evaledStats.KeyCount, delta.KeyCount; actual-estimated != expDelta {
Would MVCCStats.Subtract allow you to compare the delta against expDelta using a single reflect.DeepEqual?
Also, do we even need the reflect.DeepEqual? Aren't MVCCStats objects comparable?
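I.e., something along these lines, as a sketch of the suggested simplification (using the variables from the snippet above):

```go
// Subtract the evaluated (estimated) stats from the recomputed stats and
// compare the whole struct at once; MVCCStats has only scalar fields, so ==
// works and reflect.DeepEqual isn't needed.
afterStats.Subtract(evaledStats)
if afterStats != delta {
	t.Errorf("mvcc stats mismatch: got %+v, expected %+v", afterStats, delta)
}
```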
Force-pushed a1ee5e4 to d943d4d
dt
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten)
pkg/storage/batcheval/cmd_add_sstable_test.go, line 250 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should this sort based on timestamp for equal keys?
that is MVCCKey.Less, so I think it already does?
pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
thinks
nit: we haven't done this yet, so should these all be future tense?
Done.
pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
added 4"added 4 keys"
Done.
pkg/storage/batcheval/cmd_add_sstable_test.go, line 291 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
colission
collision
Done.
pkg/storage/batcheval/cmd_add_sstable_test.go, line 380 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Would MVCCStats.Subtract allow you to compare the delta against expDelta using a single reflect.DeepEqual?
Also, do we even need the reflect.DeepEqual? Aren't MVCCStats objects comparable?
Done.
nvb
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale)
bors r+
35231: batcheval: estimate stats in EvalAddSSTable r=dt a=dt
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Build succeeded
Adding the SST's stats directly, without removing the underlying stats and computing merged stats,
avoids some potentially expensive repeated reads/recomputation of the underlying existing data when
repeatedly adding sparse, wide SSTs to existing, full-ish ranges.
When adding one big SST to an empty range, the previous approach was not particularly expensive:
recomputing the existing stats was cheap, as the underlying range was empty-ish, and computing
the merged stats was able to mostly iterate sequentially over just the SST.
However, when adding an SST with keys distributed over an existing, non-empty range, the
recomputation of the existing span stats actually has to read and process a large amount
of existing data, and the merged recompute has to iterate over it as well -- and in doing
so has to jump between the Go-side merge iterator and the underlying Rocks-backed range.
Instead, we can optimistically assume all the keys/data in the SST being added are in fact
new -- i.e. not shadowing existing keys -- and simply add its stats to the range
stats without recomputing. This will be incorrect in some cases -- when a key does shadow
an existing one or, in particular, when retrying and all keys are shadowed -- but by flipping
the flag in the stats to indicate that the stats contain estimates, we can document our
assumption's imperfection. When the discrepancy is found and fixed by a recompute, the flag
will prevent a consistency checker error from being raised.
These estimated stats are probably good enough as-is, but a follow-up could send an explicit
CheckConsistency command to all ranges that were sent SSTs during a given bulk operation to
expedite the correction of any incorrect estimates.
Release note (performance improvement): Speed up bulk data ingestion during index backfills and IMPORT.