
schema changes: pre-backfill without online traffic then incremental backfill after #36850

Closed
dt opened this issue Apr 15, 2019 · 6 comments
Labels
A-disaster-recovery A-schema-changes C-performance Perf of queries or internals. Solution not expected to change functional behavior. C-wishlist A wishlist feature. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@dt
Member

dt commented Apr 15, 2019

TL;DR

Pre-backfill a new index before it gets any online traffic, then start online traffic and catch it up using a new "incremental backfill" that is cheaper, so online traffic mixes with less backfill traffic.

Background

Currently our schema change process follows the online schema change process documented in the F1 paper pretty closely: we advance through a series of states, but most importantly, we ensure all nodes have advanced to a state in which they update the new schema element during all online traffic before we start backfilling. We do this so we can be sure we correctly process every row: if every node is sure to update the new schema element for any added, changed, or removed row before the backfiller starts, we don't have to worry about the backfiller missing a row added or changed while it runs.

Motivation

However, a bulk operation like a backfill is often tuned differently and behaves differently from online traffic, and mixing the two can hurt both: backfill operations want to be large and get lots of work done per request (to amortize the fixed request overhead), while online operations are small and need to return quickly. Thus user-observed tail latencies may suffer when small, quick online requests are serialized with big, slow bulk operations. At the same time, having even just a few point writes from online traffic on disk in a range can mean all bulk ingestion to that range is forced to go into L0 instead of a lower level. All that data may then need to be completely rewritten by compaction, increasing the cost of the backfill significantly. Thus neither the backfill nor the online traffic is particularly happy with the other happening at the same time in the same range (TODO: measure this; see point (1) below).

We mix this traffic for the reason mentioned above: the fact that the online updates are happening while we backfill is how we ensure the backfill doesn't miss a new or changed row. However, we already have a built-in way to tell what, if anything, changed in a table since a given time, and indeed have built incremental BACKUP on precisely this: MVCC iteration with a lower time-bound.
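To make that mechanism concrete, here is a minimal, self-contained Go sketch of the filtering a lower-time-bound read performs; the versionedKV type and changedSince helper are hypothetical illustrations, not the real MVCC iterator or ExportRequest API.

```go
// A sketch of the idea behind a lower-time-bound scan: given versioned KVs,
// emit only the versions written after t0 (and at or before t1), which is
// all an incremental pass needs to see.
package main

import "fmt"

type versionedKV struct {
	key       string
	value     string
	timestamp int64 // wall time of the MVCC version, in nanoseconds
}

// changedSince filters a history of versions down to those in the half-open
// interval (t0, t1], i.e. what changed since the pre-backfill read time t0,
// up to the chosen catch-up time t1.
func changedSince(history []versionedKV, t0, t1 int64) []versionedKV {
	var out []versionedKV
	for _, kv := range history {
		if kv.timestamp > t0 && kv.timestamp <= t1 {
			out = append(out, kv)
		}
	}
	return out
}

func main() {
	history := []versionedKV{
		{"a", "v1", 100}, // already covered by the pre-backfill read at t0=150
		{"a", "v2", 200}, // changed after t0: the incremental pass must see it
		{"b", "v1", 300}, // added after t0: the incremental pass must see it
	}
	for _, kv := range changedSince(history, 150, 400) {
		fmt.Println(kv.key, kv.value, kv.timestamp)
	}
}
```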

Proposal

We could use this in schema changes as well: before altering a table to send online traffic writes to our new index or column, we could instead just run the backfill of it into an otherwise unused keyspace. After that "prebackfill", we could use the same mechanism as incremental BACKUP, but instead of just writing what changed to files, feed it back into our backfiller, making an "incremental backfill" that can catch us up.

In practice, this could then look something like the following (a rough Go sketch follows the list):
a) run a prebackfill, reading at time t0 and writing to the index's own key space.
b) advance schema states so that online traffic starts writing to the new key space.
c) wait until schema states / leases ensure all online traffic is issuing those writes, then pick a time t1.
d) run an incremental backfill from time t0 to t1.
e) make the descriptors public.
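Sketched in Go, that sequence might look roughly like this; every identifier here (backfill, advanceToDeleteAndWriteOnly, waitForLeases, makePublic) is a hypothetical placeholder standing in for the real schema changer and backfiller machinery, not an actual API.

```go
// A hedged sketch of steps (a)-(e) as a schema-changer driver sequence.
package main

import (
	"context"
	"fmt"
	"time"
)

// backfill stands in for reading the table as of asOf and, when since is
// non-zero, only processing rows whose MVCC timestamp is newer than since
// (the "incremental backfill").
func backfill(ctx context.Context, asOf, since time.Time) error {
	fmt.Printf("backfill as of %v, since %v\n", asOf, since)
	return nil
}

func advanceToDeleteAndWriteOnly(ctx context.Context) error { return nil }
func waitForLeases(ctx context.Context) error               { return nil }
func makePublic(ctx context.Context) error                  { return nil }

func runSchemaChange(ctx context.Context) error {
	// (a) pre-backfill into the new, still-unused keyspace, reading at t0.
	t0 := time.Now()
	if err := backfill(ctx, t0, time.Time{}); err != nil {
		return err
	}
	// (b)+(c) start online writes to the new keyspace and wait until every
	// node is guaranteed to be issuing them, then pick t1.
	if err := advanceToDeleteAndWriteOnly(ctx); err != nil {
		return err
	}
	if err := waitForLeases(ctx); err != nil {
		return err
	}
	t1 := time.Now()
	// (d) incremental backfill: only rows that changed in (t0, t1].
	if err := backfill(ctx, t1, t0); err != nil {
		return err
	}
	// (e) make the new element public.
	return makePublic(ctx)
}

func main() {
	if err := runSchemaChange(context.Background()); err != nil {
		fmt.Println(err)
	}
}
```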

Rough Implementation

Given that we have most of the pieces already in place, I think this could be a pretty straightforward project. I imagine the implementation plan looks something like this:

  1. Get a speed-of-light number: actually measure this cost by skipping the current index states, going straight to the backfill, and then making the new element public directly, with heavy traffic on the table. If it isn't a win, we should focus further efforts elsewhere -- the rest of these steps are just for getting correctness back after making such a change.
  2. Add a new state for pre-backfilling that doesn't trigger online writes (see the sketch after this list).
  3. Add a lower-time-bound table reader (or teach the chunkbackfiller to read tables a different way). This will require either adding a lower time bound to the Scan it runs or teaching TableReader to send ExportRequests and polishing up ExportRequest some.
  4. Update the schema changer to start in pre-backfill and run the backfiller there with no time bound before stepping through the current states, then add the time bound to the catch-up backfill after delete_and_write_only.
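As a rough illustration of step 2, here is a hedged sketch of what a pre-backfill mutation state could look like; the enum and helper below are invented for illustration and do not reflect the actual descriptor state machine.

```go
// A sketch of a pre-backfill state in which the new index exists in the
// descriptor but is not yet maintained by online traffic.
package main

import "fmt"

type mutationState int

const (
	statePreBackfill        mutationState = iota // backfiller only; online writes ignore it
	stateDeleteOnly                              // online traffic removes entries but doesn't add them
	stateDeleteAndWriteOnly                      // online traffic fully maintains the new index
	statePublic                                  // visible to reads
)

// writtenByOnlineTraffic reports whether a write to the table must also
// update the in-progress index in the given state. The key property of the
// proposed pre-backfill state is that it returns false, so the initial
// backfill runs with no concurrent online writes in its keyspace.
func writtenByOnlineTraffic(s mutationState) bool {
	return s >= stateDeleteOnly
}

func main() {
	for _, s := range []mutationState{statePreBackfill, stateDeleteOnly, stateDeleteAndWriteOnly, statePublic} {
		fmt.Println(s, writtenByOnlineTraffic(s))
	}
}
```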

More random details/thoughts:

This looks a lot like CDC if you squint, but I suspect we don't actually want to use actual CDC: what we want -- no changes while we're working, then all the changes -- seems like the opposite of what CDC wants to give you, i.e. a nearly realtime feed of changes. We probably could subscribe to the whole table, but we'd just need to buffer everything during the potentially long and unbounded backfill. I think incremental backup is closer to what we're looking for.

If for some reason we a) see a big benefit to this and b) discover that under heavy traffic too much changes during the online part for a single catch-up pass, we could potentially add more incremental passes that hopefully shrink in size as we catch up.
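A hedged sketch of what those repeated passes could look like; the function names, return values, and thresholds are invented for illustration.

```go
// Keep running cheaper catch-up passes until the amount of changed data is
// small enough to absorb in the final, bounded pass.
package main

import (
	"fmt"
	"time"
)

// incrementalPass stands in for backfilling everything that changed in
// (since, now] and reporting roughly how much data it had to process.
func incrementalPass(since, now time.Time) (bytesProcessed int64) {
	// ... read with a lower time bound and feed the backfiller ...
	return 0
}

// catchUp loops until a pass processes at most smallEnough bytes (or we hit
// maxPasses), and returns the timestamp the final pass must start from.
func catchUp(start time.Time, smallEnough int64, maxPasses int) time.Time {
	since := start
	for i := 0; i < maxPasses; i++ {
		now := time.Now()
		n := incrementalPass(since, now)
		fmt.Printf("pass %d: %d bytes of changes since %v\n", i, n, since)
		since = now
		if n <= smallEnough {
			break // remaining delta is small; finish under delete_and_write_only
		}
	}
	return since
}

func main() {
	_ = catchUp(time.Now().Add(-time.Hour), 64<<20, 5)
	fmt.Println("ready for final catch-up pass")
}
```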

The reader+writer with an incremental reader might also be how we do "almost online" table rewrites (CREATE AS, changing the primary key, repartitioning, etc.).

This assumes we can run the whole change within the GC window. This is probably fine for a PoC, but we shouldn't assume it long-term -- if we're actively running a job that depends on MVCC revisions in a given window, we should prevent GC from removing those revisions.

Jira issue: CRDB-4482

@dt dt assigned dt and thoszhang Apr 15, 2019
@dt dt added A-disaster-recovery A-schema-changes C-performance Perf of queries or internals. Solution not expected to change functional behavior. C-wishlist A wishlist feature. labels Apr 15, 2019
@dt dt changed the title schema changes: pre-backfill before starting online traffic to reduce impact on live traffic schema changes: pre-backfill without online traffic then incremental backfill after Apr 15, 2019
@dt
Member Author

dt commented Apr 15, 2019

This is currently just an initial brain dump of something I've mentioned a couple of times recently, and it seemed like it was time to put some thoughts in writing (particularly if it is potentially offered as a later stepping stone for table-rewriting jobs building on a fast CREATE AS).
cc @lucy-zhang @rolandcrosby @vivekmenezes (and @danhhz since there's some non-zero conceptual, if not code overlap with CDC).

also, h/t to @andy-kimball for getting me thinking about this, with what I think might have just been an offhand comment asking why we didn't do, in broad strokes, something like this.

@danhhz
Contributor

danhhz commented Apr 15, 2019

This all seems pretty tractable to me.

If for some reason we a) see a big benefit to this and b) discover that under heavy traffic too much changes during the online part for a single catch-up pass, we could potentially add more incremental passes that hopefully shrink in size as we catch up.

I've thought about this a little in the context of an extremely large table (terabytes, petabytes) with a high write rate. We'd likely want to do AddSSTable-only backfill passes until we get "close enough" then switch to something like WriteBatch for the backfill once we enter delete_and_write_only. This is likely overkill for v1, but I'd like to have a plausible answer to this hypothetical at some point.
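For illustration, a hedged sketch of that switchover heuristic; the type, function, and threshold are hypothetical, not an actual implementation.

```go
// Use bulk (AddSSTable-style) ingestion for large catch-up deltas, and fall
// back to ordinary batched writes once the remaining delta is small enough
// to apply under delete_and_write_only.
package main

import "fmt"

type ingestMethod string

const (
	bulkSST      ingestMethod = "AddSSTable" // cheap per byte, heavy per request
	batchedWrite ingestMethod = "WriteBatch" // plays better with concurrent online writes
)

// chooseIngest picks how the next catch-up pass should write its output,
// based on an estimate of how much data changed since the last pass.
func chooseIngest(estimatedDeltaBytes, switchoverBytes int64) ingestMethod {
	if estimatedDeltaBytes > switchoverBytes {
		return bulkSST
	}
	return batchedWrite
}

func main() {
	fmt.Println(chooseIngest(10<<30, 256<<20)) // 10 GiB delta: keep bulk ingesting
	fmt.Println(chooseIngest(64<<20, 256<<20)) // 64 MiB delta: switch to batched writes
}
```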

This looks a lot like CDC if you squint, but I suspect we don't actually want to use actual CDC: what we want -- no changes while we're working, then all the changes -- seems like the opposite of what CDC wants to give you, i.e. a nearly realtime feed of changes. We probably could subscribe to the whole table, but we'd just need to buffer everything during the potentially long and unbounded backfill. I think incremental backup is closer to what we're looking for.

Agreed. CDC doesn't really seem to me like a building block for this; why buffer when we have the source of truth available?

@vivekmenezes
Contributor

This makes a lot of sense, especially if you can create the initial backfilled index very fast, which I'm pretty sure you can. You're right that a major savings is from this index getting into RocksDB L6.

#19749 is the issue related to time-bound scans.

@ajwerner
Contributor

Protected timestamps mitigate the GC problems outlined in the description.

I look forward to understanding not how this approach improves the performance of the backfill but rather how it improves the performance of foreground workloads which either:

A) Are highly contended
B) Rely on the 1PC optimization

@ajwerner
Contributor

One issue with concurrent writes to indexes is that they make AddSSTable much worse and tend to lead to backpressure. Relatively uniform writes to a table that has an index being added can bring that index build to a crawl as compaction debt builds up quickly. This is readily observed in 20.1 with both Pebble and RocksDB. Maybe it is better with 20.2 Pebble. Another note is that we don't rebalance ranges based on applying SSTs, so we tend not to spread this compaction load around in large clusters. @dt do you think that last point is worthy of its own issue?

@postamar
Contributor

MVCC-compatible backfills do almost exactly what's described in this issue, so I'm going to close this. Please re-open if I'm wrong.

@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-disaster-recovery T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023