
Conversation

@dt
Contributor

@dt dt commented Feb 27, 2019

Adding the SST's stats directly, without removing the underlying stats and computing merged stats,
avoids some potentially expensive repeated reads/recomputation of the underlying existing data when
repeatedly adding sparse, wide SSTs to existing, full-ish ranges.

When adding one big SST to an empty range, the previous approach was not particularly expensive:
recomputing the existing stats was cheap, as the underlying range was empty-ish, and computing
the merged stats was able to mostly iterate sequentially over just the SST.

However, when adding an SST with keys distributed over an existing, non-empty range, the
recomputation of the existing span stats actually has to read and process a large amount
of existing data, and the merged recompute has to iterate over it as well -- and in doing
so has to jump between the Go-side merge iterator and the underlying Rocks-backed range.

Instead, we can optimistically assume that all the keys/data in the SST being added are in fact
new -- i.e. not shadowing existing keys -- and simply add its stats to the range
stats without recomputing. This will be incorrect in some cases -- when a key does shadow
an existing one or, in particular, when retrying and all keys are shadowed -- but by flipping the flag
in the stats to indicate that they contain estimates, we can document our assumption's
imperfection. When the discrepancy is found and fixed by a recompute, the flag will prevent
a consistency checker error from being raised.
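For concreteness, a minimal sketch of the shape of this change (the function name and placement are illustrative, not the actual code in cmd_add_sstable.go):

    import "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"

    // addEstimatedSSTStats folds stats computed over just the ingested SST into
    // the range's stats, without re-reading the existing data the SST overlaps.
    // sstStats is assumed to have been computed by iterating only the SST.
    func addEstimatedSSTStats(ms *enginepb.MVCCStats, sstStats enginepb.MVCCStats) {
        // Keys in the SST that shadow existing keys are now double counted, so
        // flag the result as an estimate; a later recompute can correct it
        // without the consistency checker treating the mismatch as fatal.
        sstStats.ContainsEstimates = true
        ms.Add(sstStats)
    }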

These estimated stats are probably good enough 'as-is', but a followup could send an explicit
CheckConsistency command to all ranges that were sent SSTs during a given bulk operation to
expedite the correction of any incorrect estimates.

Release note (performance improvement): Speed up bulk data ingestion during index backfills and IMPORT.

@dt dt requested review from a team, danhhz and nvb February 27, 2019 14:07
@cockroach-teamcity
Member

This change is Reviewable

@dt dt added the do-not-merge bors won't merge a PR with this label. label Feb 27, 2019
@dt
Contributor Author

dt commented Feb 27, 2019

Some potential follow-ups: we could add a stats field in the AddSSTable request so the caller could do the computation for the SST it is sending and then the eval could use that instead of computing itself. Moving that computation would probably help significantly when lots of producers are all sending to a single range at the same time.
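A rough sketch of that follow-up (the PrecomputedStats field and both names below are hypothetical -- nothing here exists in this PR):

    import "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"

    // Hypothetical: the producer attaches the MVCCStats it computed for the SST
    // it built, and evaluation trusts them instead of recomputing on the
    // (possibly hot) range.
    type hypotheticalAddSSTableArgs struct {
        Data             []byte
        PrecomputedStats *enginepb.MVCCStats
    }

    func statsForIngestion(args hypotheticalAddSSTableArgs, computeFromSST func([]byte) enginepb.MVCCStats) enginepb.MVCCStats {
        if args.PrecomputedStats != nil {
            return *args.PrecomputedStats // work already done by the producer
        }
        return computeFromSST(args.Data) // fall back to computing at eval time
    }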

Contributor

@danhhz danhhz left a comment


Code change :lgtm: but I'd like someone more familiar with stats to sign off on the idea.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz, @dt, and @nvanbenschoten)


pkg/storage/batcheval/cmd_add_sstable.go, line 58 at r1 (raw file):

	}
	stats.ContainsEstimates = true
	ms.Add(stats)

This deserves a big comment, what you have in the commit message would be perfect.

@dt dt force-pushed the addsst-stats-recompute branch from a216566 to 8def39f Compare February 27, 2019 19:06
Contributor Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)


pkg/storage/batcheval/cmd_add_sstable.go, line 58 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

This deserves a big comment, what you have in the commit message would be perfect.

commented bigly.

@dt
Contributor Author

dt commented Feb 27, 2019

I'm thinking of a followup on this to have IMPORT/RESTORE/IndexBackfiller explicitly send a recompute request to the TableSpan or IndexSpan it loaded when it is done. It looks like Recompute is a point request though, so I think I might actually want a ConsistencyCheck that gets fanned out and then does the recompute on each, though it looks like I'll need to expose that on client.DB.

@danhhz
Contributor

danhhz commented Feb 27, 2019

it looks like I'll need to expose that on client.DB

You could also do the same trick we do with ExportRequest etc, and manually do it via a sender.
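For illustration, roughly what that could look like (the helper name is mine; a sketch under the assumption that a CheckConsistencyRequest addressed to the whole ingested span fans out to each overlapping range):

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/internal/client"
        "github.com/cockroachdb/cockroach/pkg/roachpb"
    )

    // checkConsistencyOverSpan sends a CheckConsistency over the span that was
    // bulk-ingested, via the non-transactional sender (the same way requests
    // like ExportRequest are sent), so any estimated stats get recomputed.
    func checkConsistencyOverSpan(ctx context.Context, db *client.DB, sp roachpb.Span) error {
        req := &roachpb.CheckConsistencyRequest{
            RequestHeader: roachpb.RequestHeader{Key: sp.Key, EndKey: sp.EndKey},
        }
        _, pErr := client.SendWrapped(ctx, db.NonTransactionalSender(), req)
        return pErr.GoError()
    }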

Contributor

@nvb nvb left a comment


:lgtm: triggering a recomputation to correct the stats after this is all done sounds like a good idea.

It's also worth pointing out how splitting works, both normally and in relation to the ContainsEstimates flag. When we split a range, we already need to recompute the stats of the LHS. In addition, when splitting a range with estimated stats, we also recompute the stats of the RHS. What this means in practice is that we should be periodically correcting the estimated stats that we're creating here as ranges split, which is a good thing. The downside is that splits will become more expensive due to the ContainsEstimates flag, which may cut into the perf win this provides.

Speaking of which, do you have any numbers about the improvement we see due to this?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)

@dt
Contributor Author

dt commented Feb 28, 2019

@nvanbenschoten I took a stab at sending out explicit recomputes in #35251.

w.r.t. splits having to recompute: that's still better than what's currently happening.

With the code before this change, every AddSSTable has to recompute the existing stats for the span it covers, so the cost is proportional to the span covered, not the amount of data added. Currently in index backfill, we're ingesting many small (in size) but wide (in overlap) SSTs. These won't fill (and thus split) a range for some time, so we'll do many of these ingestions per split, recomputing the same stats over and over.

Before this change, I was seeing some runs where, due to luck/data distribution, one node would get hit with a bunch of the AddSSTable requests from all the other nodes at once and bring the overall job to a crawl. Looking at that one node, profiling showed it spending 30%+ of its CPU time in C.MVCCComputeStats as called by evalAddSSTable. If I can get through those calls more cheaply, and then just need to recompute on splits and once at the end, that's still way better than on every ingestion.

@dt dt force-pushed the addsst-stats-recompute branch from 8def39f to 0069d8c Compare February 28, 2019 15:20
dt added a commit to dt/cockroach that referenced this pull request Feb 28, 2019
This helper is intended to be called on the span(s) in which a caller
has been doing bulk-ingestion, to trigger consistency checks on the
range(s) in that span. This might be a good idea on its own but also has
the side effect of fixing up any MVCC stats inaccuracies that might
have been introduced by estimates during ingestion.

It is probably not critical to actually call this: eventually the
consistency queue will get to it anyway and fix up those stats if needed.
However, if we know we'll call it right away -- specifically before we
mark the operation that was ingesting as completed and turn those ranges
over to real traffic that might expect correct-ish stats -- we might be
able to make coarser, cheaper estimates, for instance during SST
ingestion where the current, accurate stats recomputation is expensive.

See cockroachdb#35231.

Release note: None
@dt dt removed the do-not-merge bors won't merge a PR with this label. label Feb 28, 2019
@dt dt force-pushed the addsst-stats-recompute branch from 0069d8c to 29a7f8f Compare April 1, 2019 22:39
@dt
Contributor Author

dt commented Apr 1, 2019

@nvanbenschoten asked what stats might be off with this change. The only way the stats are different here is if the SST contains a key already in rocks, in which case we should only count it once -- and for the copy in the SST. I checked what stats we actually compute from the added file to see what we'd overcount in the case of a retry: live_bytes, live_count, key_bytes, key_count, val_bytes, val_count. Everything else is zero, so I don't think repeatedly adding an SST would change them. I think we might think a range is bigger than it is, split it, recompute, and then realize it is undersized and merge it. That should be very rare -- we pretty rarely ingest the same key more than once.
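To make the retry case concrete, a small sketch of what re-adding an identical SST would overcount (the function is illustrative; the fields are the ones listed above, in their Go spelling):

    import "github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"

    // retryOvercount is the error introduced by re-ingesting an identical SST:
    // exactly one extra SST's worth of these six fields. Every other MVCCStats
    // field computed from the file is zero, so repeats leave it unchanged.
    func retryOvercount(sstStats enginepb.MVCCStats) enginepb.MVCCStats {
        return enginepb.MVCCStats{
            LiveBytes: sstStats.LiveBytes,
            LiveCount: sstStats.LiveCount,
            KeyBytes:  sstStats.KeyBytes,
            KeyCount:  sstStats.KeyCount,
            ValBytes:  sstStats.ValBytes,
            ValCount:  sstStats.ValCount,
        }
    }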

@dt
Contributor Author

dt commented Apr 1, 2019

As discussed offline with @nvanbenschoten and @bdarnell, we have renewed interest in this as it might be the best way we have to cut the latency impact of slow evalAddSSTable calls blocking other traffic. Does this need anything before we can merge it to master?

@nvb
Contributor

nvb commented Apr 2, 2019

The change still looks good to me. Thanks for confirming that everything is good with the parts of the stats that can get out of whack.

Do we expect this to have a performance impact on RESTORE and IMPORT as well? Were those two double counting their stats previously too?

@tbg
Member

tbg commented Apr 2, 2019

Copy-pasting my concerns from slack:

I’m generally open to the idea of using approximate stats for backfills, recomputing over and over again does seem bad. But I’m concerned that this relies on the stats recomputation in ways it wasn’t productionized for (it’s generally a little wonky and I’m happy we’re only using it in timeseries). The main shortcomings are a) it works by running a consistency check (which does a full scan) which then triggers a RecomputeStats request which does another recomputation b) it doesn’t reset ContainsEstimates — this is because between recomputing the stats and introducing the corrective delta, there could’ve been another approximation added to the log and we definitely want to be able to assert that !ContainsEstimates => stats agree with recomputation (edited)

the work to do here is basically to make ContainsEstimates an enum{Accurate, Adjusting, Estimated} and let RecomputeStats go from estimated->adjusting->accurate while any write that introduces an estimate goes back to Estimated
and of course the backfill should also proactively do the work of getting the stats back to accurate, we don’t want to do it as fallout of the consistency checker which may not even be turned on.
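A sketch of the enum shape described above (names are illustrative, not from the codebase):

    // StatsEstimateState would replace the ContainsEstimates bool. RecomputeStats
    // moves Estimated -> Adjusting -> Accurate, while any write that introduces
    // an estimate resets the state to Estimated.
    type StatsEstimateState int

    const (
        StatsAccurate  StatsEstimateState = iota // stats agree with a recomputation
        StatsAdjusting                           // corrective delta applied, not yet confirmed
        StatsEstimated                           // stats may diverge from the real values
    )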

In the end, I am okay with this change because the benefit outweighs the risk. The recomputation is only triggered by the consistency checker when there's a mismatch, so we wouldn't spend much extra time scanning the range after it has been corrected. The ContainsEstimates flag is poisonous and we'll still need to carry out work like the above if we want to get rid of it, but that'll need some sort of migration for which the ship has sailed this release cycle.

@dt
Contributor Author

dt commented Apr 2, 2019

@nvanbenschoten RESTORE and IMPORT add SSTs with negligible overlap of existing keys (only on retries), spend basically no time computing stats for the existing keys spanned by the added SST, and will see ~no change here -- this really only shows up in index backfills, or in some cases for experimental direct import (non-ordered data, either in a non-ordered PK or from secondary indexes).

@tbg the benefit is very significant in index backfills: since we need to flush while reading the table and buffering produced index data to keep buffering memory bounded, the amount that, after sorting it by destination range, gets sent in one SST to a given range can sometimes be small, and we may end up sending many such SSTs over the course of the backfill. However the stats recompute when ingesting each of those is for all existing keys spanned by the SST, so the insidious part of this is that the cost of ingesting, say, 50kb of index entries isn't proportional to 50kb but to the 25mb of existing data they span, and we pay it over and over on every little file. If we can avoid that, and just pay it once later on split, that can be a huge win.

@tbg
Member

tbg commented Apr 2, 2019

@dt right, I'm aware, and that's why I'm fine with this change. Not doing it that way is basically just terrible. This leaves us with a bit of fallout to cover but we should be able to get to it.

@bdarnell
Contributor

bdarnell commented Apr 2, 2019

However the stats recompute when ingesting each of those is for all existing keys spanned by the SST, so the insidious part of this is that the cost of ingesting, say, 50kb of index entries isn't proportional to 50kb but to the 25mb of existing data they span, and we pay it over and over on every little file.

If there are only 50KB of updates to a given range, is it still worthwhile to ingest an SST instead of a regular write batch? What's the cutoff at which an SST ingestion makes sense?

@dt
Contributor Author

dt commented Apr 2, 2019

@bdarnell we thought about that and I think it is a good idea at some size -- we opted not to do it right away as we wanted only one path exercised / debugged for now, but I think it could be a good optimization to add to the BulkAdder at some point. I used a small SST as a particularly illustrative example of why recomputing stats is expensive, but even for a decently sized one, it is still slow -- we at least get more ingested for the recompute so maybe we do fewer, but requests that end up behind one still see the recompute latency.

@dt
Contributor Author

dt commented Apr 2, 2019

bors r=nvanbenschoten

@craig
Contributor

craig bot commented Apr 2, 2019

Build failed

@dt dt force-pushed the addsst-stats-recompute branch from 29a7f8f to 3546b94 Compare April 3, 2019 03:01
@dt
Contributor Author

dt commented Apr 3, 2019

The TestAddSSTableMVCCStats test didn't have much to test after this change -- it used to confirm the stats matched exactly after applying lots of randomly generated SSTs to some randomly generated test data, and now that we're not computing exact stats, that invariant no longer holds, so there wasn't much left for it to verify.

At @nvanbenschoten's suggestion, I changed it to focus specifically on testing cases where the stats estimates over/undercount. I had to remove the randomness to make it easy to assert exactly what would / wouldn't over- and undercount (at least without adding a ton of MVCC logic to the test), but that made it practical to actually document which counts did/didn't reflect keys shadowing/being shadowed.

@dt dt force-pushed the addsst-stats-recompute branch 3 times, most recently from 9473204 to a1ee5e4 Compare April 3, 2019 03:33
Contributor

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @danhhz and @dt)


pkg/storage/batcheval/cmd_add_sstable_test.go, line 250 at r2 (raw file):

		kvs[i].Value = roachpb.MakeValueFromBytes([]byte(in[i].v)).RawBytes
	}
	sort.Slice(kvs, func(i, j int) bool { return kvs[i].Key.Less(kvs[j].Key) })

Should this sort based on timestamp for equal keys?


pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):

added 4

"added 4 keys"


pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):

thinks

nit: we haven't done this yet, so should these all be future tense?


pkg/storage/batcheval/cmd_add_sstable_test.go, line 291 at r2 (raw file):

colission

collision


pkg/storage/batcheval/cmd_add_sstable_test.go, line 380 at r2 (raw file):

	}()

	if actual, estimated, expDelta := afterStats.KeyCount, evaledStats.KeyCount, delta.KeyCount; actual-estimated != expDelta {

Would MVCCStats.Subtract allow you to compare the delta against expDelta using a single reflect.DeepEqual?

Also, do we even need the reflect.DeepEqual? Aren't MVCCStats objects comparable?
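For reference, roughly what the suggested comparison would look like inside the test (variable names are guesses at the test's locals, so treat them as illustrative):

    // MVCCStats.Subtract mutates its receiver, and MVCCStats is a struct of
    // plain scalars, so the whole delta can be compared with == directly.
    got := afterStats
    got.Subtract(beforeStats)
    if got != expDelta {
        t.Errorf("stats delta %+v != expected delta %+v", got, expDelta)
    }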

@dt dt force-pushed the addsst-stats-recompute branch from a1ee5e4 to d943d4d Compare April 3, 2019 03:44
Contributor Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten)


pkg/storage/batcheval/cmd_add_sstable_test.go, line 250 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this sort based on timestamp for equal keys?

that is MVCCKey.Less, so I think it already does?


pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

thinks

nit: we haven't done this yet, so should these all be future tense?

Done.


pkg/storage/batcheval/cmd_add_sstable_test.go, line 283 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
added 4

"added 4 keys"

Done.


pkg/storage/batcheval/cmd_add_sstable_test.go, line 291 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
colission

collision

Done.


pkg/storage/batcheval/cmd_add_sstable_test.go, line 380 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would MVCCStats.Subtract allow you to compare the delta against expDelta using a single reflect.DeepEqual?

Also, do we even need the reflect.DeepEqual? Aren't MVCCStats objects comparable?

Done.

Contributor

@nvb nvb left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

@dt
Contributor Author

dt commented Apr 3, 2019

bors r+

craig bot pushed a commit that referenced this pull request Apr 3, 2019
35231: batcheval: estimate stats in EvalAddSSTable r=dt a=dt


Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig
Contributor

craig bot commented Apr 3, 2019

Build succeeded

@craig craig bot merged commit d943d4d into cockroachdb:master Apr 3, 2019
@dt dt deleted the addsst-stats-recompute branch April 4, 2019 02:09