Merge pull request #6 from tikv/Connor1996-patch-1

Add batch split
tikv · Dec 20, 2018 · f1f14fc · f1f14fc
2 parents d6939c1 + dec91b9
commit f1f14fc
Showing 1 changed file with 213 additions and 0 deletions.
diff --git a/text/2018-10-25-batch-split.md b/text/2018-10-25-batch-split.md
@@ -0,0 +1,213 @@
+# Batch Split
+
+## Summary
+
+Support a `BatchSplit` feature that splits one Region into multiple Regions at
+a time if the size or the number of keys exceeds a threshold. This includes
+modifications of both TiKV and PD. For TiKV, every round of split-check
+produces multiple split keys instead of one and changes inner split related
+interface into batch style. For PD, add RPCs `AskBatchSplit` and
+`ReportBatchSplit` to allow TiKV to ask for `region_id` and `peer_id` in batch.
+
+## Motivation
+
+Current split only splits one Region at a time. It may be very slow when a
+sequential write is too fast, namely, the split speed cannot keep up with
+write speed. A slow split can lead to large Regions. In this case, if a snapshot
+is triggered, it will occupy a lot of I/O and make everything slow. Also, it is
+hard to schedule hotspots for a large Region, so it makes performance even
+worse.
+
+## Detailed design
+
+### RPC interface
+
+#### PD
+
+```protobuf
+service PD {
+    // ...
+
+    rpc AskSplit(AskSplitRequest) returns (AskSplitResponse) {
+        // Use AskBatchSplit instead.
+        option deprecated = true;
+    }
+    rpc ReportSplit(ReportSplitRequest) returns (ReportSplitResponse) {
+        // Use ResportBatchSplit instead.
+        option deprecated = true;
+    }
+    rpc AskBatchSplit(AskBatchSplitRequest) returns (AskBatchSplitResponse) {}
+    rpc ReportBatchSplit(ReportBatchSplitRequest) returns (ReportBatchSplitResponse) {}
+}
+
+message AskBatchSplitRequest {
+    RequestHeader header = 1;
+    metapb.Region region = 2;
+    uint32 split_count = 3;
+}
+
+message SplitID {
+    uint64 new_region_id = 1;
+    repeated uint64 new_peer_ids = 2;
+}
+
+message AskBatchSplitResponse {
+    ResponseHeader header = 1;
+    repeated SplitID ids = 2;
+}
+
+message ReportBatchSplitRequest {
+    RequestHeader header = 1;
+    repeated metapb.Region regions = 2;
+}
+
+message ReportBatchSplitResponse {
+    ResponseHeader header = 1;
+}
+```
+
+Add `AskBatchSplit` to replace `AskSplit`. It is called when TiKV produces some
+split keys for one Region and asks PD to allocate new `region_id` and `peer_id`
+for that Region. `split_count` in `AskBatchSplitRequest` indicates the number
+of Regions to be generated, and `AskBatchSplitResponse` returns all new
+allocated IDs to TiKV.
+
+Add `ReportBatchSplit` to replace `ReportBatchSplit`. It is called when TiKV
+finishes splitting a Region. `ReportBatchSplitRequest` takes all metas of a new
+generated Region for PD to update PD's related information.
+
+For compatibility issue, the old interface is not deleted but set to
+deprecated.
+
+#### TiKV
+
+```protobuf
+message SplitRequest {
+    // ...
+    // Will be ignored in batch split. Use `BatchSplitRequest::right_derive` instead.
+    bool right_derive = 4 [deprecated=true];
+}
+
+message BatchSplitRequest {
+    repeated SplitRequest requests = 1;
+    // If true, the last Region obtains the origin region_id,
+    // and other Regions use new Ids.
+    bool right_derive = 2;
+}
+
+message BatchSplitResponse {
+    repeated metapb.Region regions = 1;
+}
+
+enum AdminCmdType {
+    // ...
+    Split = 2 [deprecated=true];
+    // ...
+    BatchSplit = 10;
+}
+
+message AdminRequest {
+    // ...
+    SplitRequest split = 3 [deprecated=true];
+    // ...
+    BatchSplitRequest splits = 10;
+}
+
+message AdminResponse {
+    // ...
+    SplitResponse split = 3 [deprecated=true];
+    // ...
+    BatchSplitResponse splits = 10;
+}
+```
+
+Add a new admin command type `BatchSplit` with related request and response.
+`BatchSplitRequest` wraps multiple `SplitRequest` along with `right_derive`
+which invalidates the `right_derive` in each `SplitRequest`.
+
+When in the rolling upgrade process, new TiKVs are mixed up with old TiKVs, so
+old command type `Split` still needs to be preserved.
+
+### Implementation in TiKV
+
+#### How to produce multiple split keys
+
+First introducing one config `batch_split_limit` to limit the number of produced
+split keys in a batch. If it is unlimited, for a once split check, it scans all
+over the Region's range, and in some extreme case, this would cause performance
+issue.
+
+SizeChecker and KeysChecker can be rewritten to produce multiple keys, and the
+general logic of SizeChecker and KeysChecker are similar. The only difference
+between them is one splits Region based on size and the other splits Region
+based on the number of keys. So here we mainly describe the logic of
+SizeChecker:
+
+- **Before:** it scans key-value pairs in a Region's range sequentially to
+  accumulate their size as `total_size` and stops once the size reaches
+  `region_max_size` or scans to the end of the range. If `total_size` is
+  smaller than `region_max_size` at the end, checker wouldn't produce any split
+  key; if not, it regards the very key at which `total_size`  reaches
+  `region_split_size` as split key.
+- **After:** it scans key-value pairs in a Region's range sequentially to
+  accumulate their size as `total_size` and stops once the size reaches
+  `region_split_size * (batch_split_limit-1) + region_max_size` or scans to the
+  end of the range. During the scan process, it records the key as a split key
+  every `region_split_size`, but after finishing scanning, it may discard the
+  last split key if the size of the rest is not bigger than `region_max_size -
+  region_split_size`. With this algorithm, if `batch_split_limit` is set to 1,
+  TiKV can perfectly behave the same way as the split without batch.
+
+#### Compatibility concern
+
+The general process in raftstore changes a little. It mainly replaces `Split`
+with `BatchSplit`. But one thing should be noted that during the rolling
+upgrade, PD version control will refuse the `AskBatchSplit` request, thus split
+can't be performed during this process until all TiKVs bump to a new version.
+To let TiKV know whether `AskBatchSplit` fails for compatibility or not, we
+introduce a new error type for `ResponseHeader`:
+
+```protobuf
+enum ErrorType {
+    // ...
+    INCOMPATIBLE_VERSION = 5;
+}
+```
+
+So once TiKV gets `AskBatchSplitResponse` with
+`ErrorType::INCOMPATIBLE_VERSION`, it uses the original `AskSplit` instead of
+`AskBatchSplit`, and all the following processes will degrade to the original
+way. So the original code path is not deleted.
+
+#### Approximate split key
+
+What we said above can ease the problem. However, scanning a large Region can
+also consume a lot of time and CPU. The test shows that large Regions can still
+easily show up even with batch split implemented, although split is speeded up.
+
+When a Region becomes large enough, it's more practical to divide it into
+smaller chunks quickly. This can be achieved via size estimation, which can be
+calculated from SST properties. Although it may not be accurate enough, it's
+okay for a large Region.
+
+So if the size of Region is larger than `region_max_size * batch_split_limit *
+2`, TiKV uses an approximate way to produce split keys. The approximate way is
+quite similar to the algorithm we describe above, but to estimate TiKV uses
+approximate size of the Region and the number of keys in the Region's range to
+calculate the average distance between two SST property keys, and produces a
+split key every `region_split_size / distance` keys.
+
+## Drawbacks
+
+- When the approximate way is used, Region may split into several
+  disproportional Regions due to size estimation.
+
+## Alternatives
+
+None
+
+## Unresolved questions
+
+Generally, it is more urgent to split a large Region, so we can change the
+split check queue from a naive FIFO queue to a priority queue so that a large
+Region can be split early and quickly.