
Conversation

@dt
Contributor

@dt dt commented Feb 27, 2019

Adding the SST's stats directly, without removing the underlying stats and computing merged stats,
avoids some potentially expensive repeated reads/recomputation of the underlying existing data when
repeatedly adding sparse, wide SSTs to existing, full-ish ranges.

When adding one big SST to an empty range, the previous approach was not particularly expensive:
recomputing the existing stats was cheap, as the underlying range was empty-ish, and computing
the merged stats was able to mostly iterate sequentially over just the SST.

However, when adding an SST with keys distributed over an existing, non-empty range, the
recomputation of the existing span stats actually has to read and process a large amount
of existing data, and the merged recompute has to iterate over it as well -- and in doing
so has to jump between the Go-side merge iterator and the underlying Rocks-backed range.

Instead, we can optimistically assume that all the keys/data in the SST being added are in fact
new -- i.e. not shadowing existing keys -- and simply add its stats to the range
stats without recomputing. This will be incorrect in some cases -- when a key does shadow
an existing one or, in particular, when retrying and all keys are shadowed -- but by flipping the flag
in the stats to indicate that they contain estimates, we can document our assumption's
imperfection. When the discrepancy is found and fixed by a recompute, the flag will prevent
a consistency checker error from being raised.
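For concreteness, a minimal sketch of the shape of this change (the function name and placement are illustrative, not the actual code in cmd_add_sstable.go):

    import "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"

    // addEstimatedSSTStats folds stats computed over just the ingested SST into
    // the range's stats, without re-reading the existing data the SST overlaps.
    // sstStats is assumed to have been computed by iterating only the SST.
    func addEstimatedSSTStats(ms *enginepb.MVCCStats, sstStats enginepb.MVCCStats) {
        // Keys in the SST that shadow existing keys are now double counted, so
        // flag the result as an estimate; a later recompute can correct it
        // without the consistency checker treating the mismatch as fatal.
        sstStats.ContainsEstimates = true
        ms.Add(sstStats)
    }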

These estimated stats are probably good enough 'as-is', but a followup could send an explicit
CheckConsistency command to all ranges that were sent SSTs during a given bulk operation to
expedite the correction of any incorrect estimates.

Release note (performance improvement): Speed up bulk data ingestion during index backfills and IMPORT.

@dt dt requested review from a team, danhhz and nvb February 27, 2019 14:07
@cockroach-teamcity
Member

This change is Reviewable

@dt dt added the do-not-merge bors won't merge a PR with this label. label Feb 27, 2019
@dt
Contributor Author

dt commented Feb 27, 2019

Some potential follow-ups: we could add a stats field in the AddSSTable request so the caller could do the computation for the SST it is sending and then the eval could use that instead of computing itself. Moving that computation would probably help significantly when lots of producers are all sending to a single range at the same time.
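A rough sketch of that follow-up (the PrecomputedStats field and both names below are hypothetical -- nothing here exists in this PR):

    import "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"

    // Hypothetical: the producer attaches the MVCCStats it computed for the SST
    // it built, and evaluation trusts them instead of recomputing on the
    // (possibly hot) range.
    type hypotheticalAddSSTableArgs struct {
        Data             []byte
        PrecomputedStats *enginepb.MVCCStats
    }

    func statsForIngestion(args hypotheticalAddSSTableArgs, computeFromSST func([]byte) enginepb.MVCCStats) enginepb.MVCCStats {
        if args.PrecomputedStats != nil {
            return *args.PrecomputedStats // work already done by the producer
        }
        return computeFromSST(args.Data) // fall back to computing at eval time
    }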

Contributor

@danhhz danhhz left a comment


Code change :lgtm: but I'd like someone more familiar with stats to sign off on the idea.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz, @dt, and @nvanbenschoten)


pkg/storage/batcheval/cmd_add_sstable.go, line 58 at r1 (raw file):

	}
	stats.ContainsEstimates = true
	ms.Add(stats)

This deserves a big comment, what you have in the commit message would be perfect.

@dt dt force-pushed the addsst-stats-recompute branch from a216566 to 8def39f Compare February 27, 2019 19:06
Contributor Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)


pkg/storage/batcheval/cmd_add_sstable.go, line 58 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

This deserves a big comment, what you have in the commit message would be perfect.

commented bigly.

@dt
Contributor Author

dt commented Feb 27, 2019

I'm thinking of a followup on this to have IMPORT/RESTORE/IndexBackfiller explicitly send a recompute request to the TableSpan or IndexSpan it loaded when it is done. It looks like Recompute is a point request though, so I think I might actually want a ConsistencyCheck that gets fanned out and then does the recompute on each, though it looks like I'll need to expose that on client.DB.

@danhhz
Contributor

danhhz commented Feb 27, 2019

it looks like I'll need to expose that on client.DB

You could also do the same trick we do with ExportRequest etc, and manually do it via a sender.
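For illustration, roughly what that could look like (the helper name is mine; a sketch under the assumption that a CheckConsistencyRequest addressed to the whole ingested span fans out to each overlapping range):

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/internal/client"
        "github.com/cockroachdb/cockroach/pkg/roachpb"
    )

    // checkConsistencyOverSpan sends a CheckConsistency over the span that was
    // bulk-ingested, via the non-transactional sender (the same way requests
    // like ExportRequest are sent), so any estimated stats get recomputed.
    func checkConsistencyOverSpan(ctx context.Context, db *client.DB, sp roachpb.Span) error {
        req := &roachpb.CheckConsistencyRequest{
            RequestHeader: roachpb.RequestHeader{Key: sp.Key, EndKey: sp.EndKey},
        }
        _, pErr := client.SendWrapped(ctx, db.NonTransactionalSender(), req)
        return pErr.GoError()
    }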

Contributor

@nvb nvb left a comment


:lgtm: triggering a recomputation to correct the stats after this is all done sounds like a good idea.

It's also worth pointing out how splitting works, both normally and in relation to the ContainsEstimates flag. When we split a range, we already need to recompute the stats of the LHS. In addition, when splitting a range with estimated stats, we also recompute the stats of the RHS. What this means in practice is that we should be periodically correcting the estimated stats that we're creating here as ranges split, which is a good thing. The downside is that splits will become more expensive due to the ContainsEstimates flag, which may cut into the perf win this provides.

Speaking of which, do you have any numbers about the improvement we see due to this?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)

@dt
Contributor Author

dt commented Feb 28, 2019

@nvanbenschoten I took a stab at sending out explicit recomputes in #35251.

w.r.t. splits having to recompute: that's still better than what's currently happening.

With the code before this change, every AddSSTable has to recompute the existing stats for the span it covers, so the cost is proportional to the span covered, not the amount of data added. Currently in index backfill, we're ingesting many small (in size) but wide (in overlap) SSTs. These won't fill (and thus split) a range for some time, so we'll do many of these ingestions per split, recomputing the same stats over and over.

Before this change, I was seeing some runs where, due to luck/data distribution, one node would get hit with a bunch of the AddSSTable requests from all the other nodes at once and bring the overall job to a crawl. Looking at that one node, profiling showed it spending 30%+ of its CPU time in C.MVCCComputeStats as called by evalAddSSTable. If I can get through those calls more cheaply, and then just need to recompute on splits and once at the end, that's still way better than on every ingestion.

@dt dt force-pushed the addsst-stats-recompute branch from 8def39f to 0069d8c Compare February 28, 2019 15:20
dt added a commit to dt/cockroach that referenced this pull request Feb 28, 2019
This helper is intended to be called on the span(s) in which a caller
has been doing bulk-ingestion, to trigger consistency checks on the
range(s) in that span. This might be a good idea on its own but also has
the side effect of fixing up any MVCC stats inaccuracies that might
have been introduced by estimates during ingestion.

It is probably not critical to actually call this: eventually the
consistency queue will get to it anyway and fix up those stats if needed.
However, if we know we'll call it right away -- specifically before we
mark the operation that was ingesting as completed and turn those ranges
over to real traffic that might expect correct-ish stats -- we might be
able to make coarser, cheaper estimates, for instance during SST
ingestion where the current, accurate stats recomputation is expensive.

See cockroachdb#35231.

Release note: None
@dt dt removed the do-not-merge bors won't merge a PR with this label. label Feb 28, 2019
@dt dt force-pushed the addsst-stats-recompute branch from 0069d8c to 29a7f8f Compare April 1, 2019 22:39
@dt
Contributor Author

dt commented Apr 1, 2019

@nvanbenschoten asked what stats might be off with this change. The only way the stats are different here is if the SST contains a key already in rocks, in which case we should only count it once -- and for the copy in the SST. I checked what stats we actually compute from the added file to see what we'd overcount in the case of a retry: live_bytes, live_count, key_bytes, key_count, val_bytes, val_count. Everything else is zero, so I don't think repeatedly adding an SST would change them. I think we might think a range is bigger than it is, split it, recompute, and then realize it is undersized and merge it. That should be very rare -- we pretty rarely ingest the same key more than once.
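To make the retry case concrete, a small sketch of what re-adding an identical SST would overcount (the function is illustrative; the fields are the ones listed above, in their Go spelling):

    import "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"

    // retryOvercount is the error introduced by re-ingesting an identical SST:
    // exactly one extra SST's worth of these six fields. Every other MVCCStats
    // field computed from the file is zero, so repeats leave it unchanged.
    func retryOvercount(sstStats enginepb.MVCCStats) enginepb.MVCCStats {
        return enginepb.MVCCStats{
            LiveBytes: sstStats.LiveBytes,
            LiveCount: sstStats.LiveCount,
            KeyBytes:  sstStats.KeyBytes,
            KeyCount:  sstStats.KeyCount,
            ValBytes:  sstStats.ValBytes,
            ValCount:  sstStats.ValCount,
        }
    }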

@dt
Contributor Author

dt commented Apr 1, 2019

As discussed offline with @nvanbenschoten and @bdarnell, we have renewed interest in this as it might be the best way we have to cut the latency impact of slow evalAddSSTable calls blocking other traffic. Does this need anything before we can merge it to master?

@nvb
Contributor

nvb commented Apr 2, 2019

The change still looks good to me. Thanks for confirming that everything is good with the parts of the stats that can get out of whack.

Do we expect this to have a performance impact on RESTORE and IMPORT as well? Were those two double counting their stats previously too?

@tbg
Member

tbg commented Apr 2, 2019

Copy-pasting my concerns from slack:

I’m generally open to the idea of using approximate stats for backfills, recomputing over and over again does seem bad. But I’m concerned that this relies on the stats recomputation in ways it wasn’t productionized for (it’s generally a little wonky and I’m happy we’re only using it in timeseries). The main shortcomings are a) it works by running a consistency check (which does a full scan) which then triggers a RecomputeStats request which does another recomputation b) it doesn’t reset ContainsEstimates — this is because between recomputing the stats and introducing the corrective delta, there could’ve been another approximation added to the log and we definitely want to be able to assert that !ContainsEstimates => stats agree with recomputation (edited)

the work to do here is basically to make ContainsEstimates an enum{Accurate, Adjusting, Estimated} and let RecomputeStats go from estimated->adjusting->accurate while any write that introduces an estimate goes back to Estimated
and of course the backfill should also proactively do the work of getting the stats back to accurate, we don’t want to do it as fallout of the consistency checker which may not even be turned on.
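A sketch of the enum shape described above (names are illustrative, not from the codebase):

    // StatsEstimateState would replace the ContainsEstimates bool. RecomputeStats
    // moves Estimated -> Adjusting -> Accurate, while any write that introduces
    // an estimate resets the state to Estimated.
    type StatsEstimateState int

    const (
        StatsAccurate  StatsEstimateState = iota // stats agree with a recomputation
        StatsAdjusting                           // corrective delta applied, not yet confirmed
        StatsEstimated                           // stats may diverge from the real values
    )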

In the end, I am okay with this change because the benefit outweighs the risk. The recomputation is only triggered by the consistency checker when there's a mismatch, so we wouldn't spend much extra time scanning the range after it has been corrected. The ContainsEstimates flag is poisonous and we'll still need to carry out work like the above if we want to get rid of it, but that'll need some sort of migration for which the ship has sailed this release cycle.

@dt
Contributor Author

dt commented Apr 2, 2019

@nvanbenschoten RESTORE and IMPORT add SSTs with negligible overlap of existing keys (only on retries), spend basically no time computing stats for the existing keys spanned by the added SST, and will see ~no change here -- this really only shows up in index backfills, or in some cases for experimental direct import (non-ordered data, either in a non-ordered PK or from secondary indexes).

@tbg the benefit is very significant in index backfills: since we need to flush while reading the table and buffering produced index data to keep buffering memory bounded, the amount that, after sorting it by destination range, gets sent in one SST to a given range can sometimes be small, and we may end up sending many such SSTs over the course of the backfill. However the stats recompute when ingesting each of those is for all existing keys spanned by the SST, so the insidious part of this is that the cost of ingesting, say, 50kb of index entries isn't proportional to 50kb but to the 25mb of existing data they span, and we pay it over and over on every little file. If we can avoid that, and just pay it once later on split, that can be a huge win.

@tbg
Member

tbg commented Apr 2, 2019

@dt right, I'm aware, and that's why I'm fine with this change. Not doing it that way is basically just terrible. This leaves us with a bit of fallout to cover but we should be able to get to it.

@bdarnell
Contributor

bdarnell commented Apr 2, 2019

However the stats recompute when ingesting each of those is for all existing keys spanned by the SST, so the insidious part of this is that the cost of ingesting, say, 50kb of index entries isn't proportional to 50kb but to the 25mb of existing data they span, and we pay it over and over on every little file.

If there are only 50KB of updates to a given range, is it still worthwhile to ingest an SST instead of a regular write batch? What's the cutoff at which an SST ingestion makes sense?

@dt
Contributor Author

dt commented Apr 2, 2019

@bdarnell we thought about that and I think it is a good idea at some size -- we opted not to do it right away as we wanted only one path exercised / debugged for now, but I think it could be a good optimization to add to the BulkAdder at some point. I used a small SST as a particularly illustrative example of why recomputing stats is expensive, but even for a decently sized one, it is still slow -- we at least get more ingested for the recompute so maybe we do fewer, but requests that end up behind one still see the recompute latency.

@dt
Contributor Author

dt commented Apr 2, 2019

bors r=nvanbenschoten

@craig
Contributor

craig bot commented Apr 2, 2019

Build failed

@dt dt force-pushed the addsst-stats-recompute branch from 29a7f8f to 3546b94 Compare April 3, 2019 03:01
@dt
Contributor Author

dt commented Apr 3, 2019

The TestAddSSTableMVCCStats test didn't have much to test after this change -- it used to confirm the stats matched exactly after applying lots of randomly generated SSTs to some randomly generated test data, and now that we're not computing exact stats, that invariant no longer holds, so there wasn't much left for it to verify.

At @nvanbenschoten's suggestion, I changed it to focus specifically on testing cases where the stats estimates over/undercount. I had to remove the randomness to make it easy to assert exactly what would / wouldn't over- and undercount (at least without adding a ton of MVCC logic to the test), but that made it practical to actually document which counts did/didn't reflect keys shadowing/being shadowed.

@dt dt force-pushed the addsst-stats-recompute branch 3 times, most recently from 9473204 to a1ee5e4 Compare April 3, 2019 03:33
Contributor

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @danhhz and @dt)


pkg/storage/batcheval/cmd_add_sstable_test.go, line 250 at r2 (raw file):

		kvs[i].Value = roachpb.MakeValueFromBytes([]byte(in[i].v)).RawBytes
	}
	sort.Slice(kvs, func(i, j int) bool { return kvs[i].Key.Less(kvs[j].Key) })

Should this sort based on timestamp for equal keys?


pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):

added 4

"added 4 keys"


pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):

thinks

nit: we haven't done this yet, so should these all be future tense?


pkg/storage/batcheval/cmd_add_sstable_test.go, line 291 at r2 (raw file):

colission

collision


pkg/storage/batcheval/cmd_add_sstable_test.go, line 380 at r2 (raw file):

	}()

	if actual, estimated, expDelta := afterStats.KeyCount, evaledStats.KeyCount, delta.KeyCount; actual-estimated != expDelta {

Would MVCCStats.Subtract allow you to compare the delta against expDelta using a single reflect.DeepEqual?

Also, do we even need the reflect.DeepEqual? Aren't MVCCStats objects comparable?
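For reference, roughly what the suggested comparison would look like inside the test (variable names are guesses at the test's locals, so treat them as illustrative):

    // MVCCStats.Subtract mutates its receiver, and MVCCStats is a struct of
    // plain scalars, so the whole delta can be compared with == directly.
    got := afterStats
    got.Subtract(beforeStats)
    if got != expDelta {
        t.Errorf("stats delta %+v != expected delta %+v", got, expDelta)
    }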

@dt dt force-pushed the addsst-stats-recompute branch from a1ee5e4 to d943d4d Compare April 3, 2019 03:44
Contributor Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten)


pkg/storage/batcheval/cmd_add_sstable_test.go, line 250 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this sort based on timestamp for equal keys?

that is MVCCKey.Less, so I think it already does?


pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

thinks

nit: we haven't done this yet, so should these all be future tense?

Done.


pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
added 4

"added 4 keys"

Done.


pkg/storage/batcheval/cmd_add_sstable_test.go, line 291 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
colission

collision

Done.


pkg/storage/batcheval/cmd_add_sstable_test.go, line 380 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would MVCCStats.Subtract allow you to compare the delta against expDelta using a single reflect.DeepEqual?

Also, do we even need the reflect.DeepEqual? Aren't MVCCStats objects comparable?

Done.

Contributor

@nvb nvb left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

@dt
Contributor Author

dt commented Apr 3, 2019

bors r+

craig bot pushed a commit that referenced this pull request Apr 3, 2019
35231: batcheval: estimate stats in EvalAddSSTable r=dt a=dt


Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig
Contributor

craig bot commented Apr 3, 2019

Build succeeded

@craig craig bot merged commit d943d4d into cockroachdb:master Apr 3, 2019
@dt dt deleted the addsst-stats-recompute branch April 4, 2019 02:09