
BR can reduce scatter operations via coarse-grained scattering #27234

Closed
YuJuncen opened this issue Aug 16, 2021 · 7 comments
Labels
component/br (This issue is related to BR of TiDB.) · sig/migrate · type/enhancement (The issue or PR belongs to an enhancement.)

Comments

@YuJuncen
Contributor

Enhancement

Currently, BR uses the following procedure to scatter regions:

  1. BR collects keys for splitting regions from the end keys of files and the rewrite rules.
  2. BR splits at those keys via the BatchSplit interface.
  3. After the region split completes, BR creates the scatter operators.
  4. After those operators finish or time out, or 3 minutes have passed, BR reports the scatter as done.

At step (3), the number of scatter operators equals the number of newly split regions. In practice, this can be a huge number of operators (i.e. concurrency * pd-concurrency), which is slow and timeout-prone.

In fact, given that the number of stores is usually far smaller than concurrency * pd-concurrency, and the cost of splitting is much lower than that of scattering, we can use a better scatter strategy:

  1. BR collects keys as usual.
  2. BR divides the collected keys into equal parts. The number of parts could equal the number of stores, or n times that.
  3. BR splits the key space according to these parts, then scatters the resulting coarse regions.
  4. BR executes the fine-grained split inside the scattered regions.

As an example, suppose there is a set of 129 keys [k1, k2, ..., k129] to split, in a cluster with 3 stores:

  1. We split at [k43, k86], producing 3 regions: [k1, k43), [k43, k86), [k86, k129).
  2. Given there are only 3 stores, we can scatter those 3 (relatively) bigger regions onto the 3 stores.
[3 scatter operators generated]

Store 1: [k1, k43)
Store 2: [k43, k86)
Store 3: [k86, k129)
  3. We can then split them in place at their new stores.
Store 1: [k1, k2), ..., [k42, k43)
Store 2: [k43, k44), ..., [k85, k86)
Store 3: [k86, k87), ..., [k128, k129)

Compared to the current way:

  1. We split everything at one store.
Store 1: [k1, k2), ..., [k42, k43), [k43, k44), ..., [k85, k86), [k86, k87), ..., [k128, k129)
Store 2: ∅
Store 3: ∅
  2. We scatter those regions across the stores.
[128 scatter operators generated]

Store 1: [k1, k2), [k43, k44) ...
Store 2: [k42, k43), [k85, k86), ...
Store 3: [k86, k87), [k128, k129), ...

The number of scatter operators reduces from O(files + rewrite rules) to O(stores).

@YuJuncen YuJuncen added the type/enhancement The issue or PR belongs to an enhancement. label Aug 16, 2021
@YuJuncen
Contributor Author

/component br
/sig migrate

@ti-chi-bot ti-chi-bot added component/br This issue is related to BR of TiDB. sig/migrate labels Aug 16, 2021
@YuJuncen
Contributor Author

YuJuncen commented Aug 16, 2021

Further refactoring and optimizations

The batcher was adapted to split huge tables into small batches, to avoid generating huge split & scatter batches. After this optimization, it would be better to send huge batches instead.

So when meeting huge tables, there is no need to split them into 128-sized chunks (or we may still need to split them, but into much larger chunks, because sorting the ranges of a really huge table could be hard?); we can send them directly to the Splitter. (There would be a thread pool limiting the download & ingest part, so it would be OK to send huge batches to the Importer.)

Then the BlankTablesAfterSend field in DrainResult can be removed, and even the complex batcher.drainRanges method can be removed as well.

@3pointer
Contributor

The number of scatter operators reduces from O(files + rewrite rules) to O(stores).

Changing "split 128 regions & scatter 128 regions" into "split 3 (store-count) regions & scatter 3 regions & split 128/3 regions in each" sounds like a big optimization, @kennytm WDYT 🤔?

@kennytm
Contributor

kennytm commented Aug 16, 2021

🤔 This surely reduces a lot of scatter calls. One effect, though, is that a long range of similar keys ends up clustered in the same configuration. IIUC, the scatter result may look like this:

Potential scatter result, existing (each circle represents a region, each row is a store, 🔴 = leader, 🟢 = follower, ⚪ = not exists):

  1. ⚪🟢🟢⚪🟢🟢🟢⚪🔴🟢⚪⚪🟢⚪⚪⚪🔴⚪⚪🟢🟢🔴🟢🔴⚪🟢🔴🟢⚪🔴🟢⚪⚪⚪⚪
  2. ⚪🟢⚪🟢🔴⚪⚪⚪⚪⚪🔴🟢⚪🔴🟢🔴⚪🟢⚪⚪⚪🟢⚪🟢⚪🔴⚪🔴🟢🟢🟢🟢🔴🔴🔴
  3. 🔴⚪🔴🟢⚪⚪⚪🔴🟢⚪⚪🔴🟢⚪⚪⚪⚪🟢🟢🟢⚪🟢⚪⚪🔴🟢🟢⚪🟢🟢🔴🟢⚪🟢🟢
  4. 🟢🔴🟢🔴🟢🟢🟢🟢⚪🟢🟢⚪🔴🟢🟢🟢🟢🔴🔴⚪🟢⚪🔴⚪🟢⚪🟢⚪🔴⚪⚪⚪🟢🟢🟢
  5. 🟢⚪⚪⚪⚪🔴🔴🟢🟢🔴🟢🟢⚪🟢🔴🟢🟢⚪🟢🔴🔴⚪🟢🟢🟢⚪⚪🟢⚪⚪⚪🔴🟢⚪⚪

Potential scatter result, proposed:

  1. ⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪🔴🔴🔴🔴🔴🔴🔴🔴🔴🔴🔴🔴🔴🔴
  2. 🔴🔴🔴🔴🔴🔴🔴🟢🟢🟢🟢🟢🟢🟢🔴🔴🔴🔴🔴🔴🔴⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪
  3. 🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢⚪⚪⚪⚪⚪⚪⚪🟢🟢🟢🟢🟢🟢🟢
  4. ⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪⚪🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢
  5. 🟢🟢🟢🟢🟢🟢🟢🔴🔴🔴🔴🔴🔴🔴🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢⚪⚪⚪⚪⚪⚪⚪

(We also see that stores 3 and 4 never get any leaders, due to the random nature of scatter. We would need a deterministic version of scattering, perhaps via manual migration, to ensure a uniform distribution.)
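A minimal sketch of what "a deterministic version of scattering" could mean: place leaders round-robin over the stores. The `roundRobinLeaders` helper is hypothetical; a real implementation would issue transfer-leader / move-peer operators through PD:

```go
package main

import "fmt"

// Deterministically assign each coarse region's leader to a store in
// round-robin order, instead of relying on random scatter.
func roundRobinLeaders(regionIDs, storeIDs []uint64) map[uint64]uint64 {
	leaders := make(map[uint64]uint64, len(regionIDs))
	for i, r := range regionIDs {
		leaders[r] = storeIDs[i%len(storeIDs)]
	}
	return leaders
}

func main() {
	// 5 coarse regions over 3 stores: every store gets at least one leader.
	leaders := roundRobinLeaders(
		[]uint64{101, 102, 103, 104, 105},
		[]uint64{1, 2, 3},
	)
	fmt.Println(leaders)
}
```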

@YuJuncen
Contributor Author

YuJuncen commented Aug 16, 2021

@kennytm For the unbalanced-leader problem, it seems several transfer-leader operators after splitting could help? (BTW, nearby keys sharing the same configuration (i.e. the same peers, though possibly different leaders) seems harmless?)

@kennytm
Contributor

kennytm commented Aug 16, 2021

Is transfer-leader much cheaper than scattering (the empty regions)?

@YuJuncen
Contributor Author

Close this due to:
According to friends from the PD team, it isn't good if adjoining key ranges share the same configuration, and the cost of scattering empty regions should not be much higher than that of transferring leaders.


4 participants