make region size dynamic #82
Conversation
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
text/0082-dynamic-size-region.md (outdated)

### Dynamic size

The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions. So there are two types of split, hotspot split and general size split. Hotspot
To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions.

Does "simplified" here mean simplifying the description of the idea in this document, or simplifying the implementation?
Is 512MiB fixed?
So does "simplified" mean the hot region size is 1/2 of the normal one? Is that fixed or configurable?
How much performance improvement is expected from reducing the hot region size to 1/2?
It simplifies the implementation. I don't think it's 1/2, but rather 5% (512MiB of 10GiB). It's configurable. Using a small size makes it easy to schedule around and balance quickly.

Sorry, I misread it. 5% and configurable sounds reasonable.
text/0082-dynamic-size-region.md (outdated)

### Bucket

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate
Does splitting buckets mean splitting the region? What do you mean by "scan"? Could you make the bucket concept clearer?
I'm not sure why you think it's related to region split. As described in the text and the image, buckets are logical segments of a region's range.
As per your description, a bucket is a logical concept.
No matter how you split buckets, they are still in the same region. Unless the owning region splits, I am not sure how splitting a physical region into logical buckets can improve performance.
No matter how you split buckets, they are still in the same region.

Yes, that's exactly what it means.
Buckets are mainly used for two purposes: 1. collecting access statistics, which PD can use to detect hotspots and decide whether to split a region; 2. TiKV-side concurrency: a full scan request will be divided into smaller concurrent scans.
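For illustration only (not from the RFC), a minimal Go sketch of the second purpose, with hypothetical types: a full scan range is cut at the bucket boundaries so each piece can be scanned concurrently against the same snapshot.

```go
package main

import "fmt"

// splitByBuckets cuts the requested scan range [start, end) at every bucket
// boundary that falls strictly inside it, so each piece can be scanned by a
// separate worker. Keys are simplified to strings here.
func splitByBuckets(start, end string, bucketKeys []string) [][2]string {
	var parts [][2]string
	cur := start
	for _, k := range bucketKeys {
		if k > cur && k < end {
			parts = append(parts, [2]string{cur, k})
			cur = k
		}
	}
	return append(parts, [2]string{cur, end})
}

func main() {
	bucketKeys := []string{"b", "d", "f", "h"} // boundaries of one region's buckets
	// A full scan of [a, g) becomes four smaller concurrent scans.
	fmt.Println(splitByBuckets("a", "g", bucketKeys)) // [[a b] [b d] [d f] [f g]]
}
```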
When a region becomes hot, does the bucket need to be re-split since it uses a different split policy?
Yes, it works much like the current region scan split.
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions. So there are two types of split, hotspot split and general size split. Hotspot
split is triggered by PD. General size split is triggered by TiKV itself.
Will it include hotspot and region size?
What does "it" mean?
"It" refers to PD.
No, PD only needs to do hotspot split.
Does it mean the current load-based split logic will move to PD?
Yes, except for the split key and statistics collection.
### Flow report

Buckets statistics are reported in a standalone stream. They will be reported more frequently
In the past, TiKV separated reads and writes into different channels. Should it unify them in one channel?
Either way is OK. It depends on the implementation difficulty.
controllable to let PD split hotspots and schedule to get balance.

Now that a region can be in 10GiB, a full scan can make the response exceed the limit of gRPC,
which is 4GiB. So instead of unary, we need to use server streaming RPC to return the response as
we need to use server streaming RPC

Is this confirmed or just a candidate solution? An alternative is to make clients send ranges instead of regions; maybe we can discuss which is better.
It's a candidate solution. I prefer that the client sends a unary request with the whole range. It's simple and efficient. Otherwise, TiDB may waste CPU on batching RPCs. A similar problem can be found here: tikv/client-go#342.
But the streaming API is deprecated in TiDB.
It would be much easier to retry if TiDB split the request by buckets; each bucket could be sent concurrently.
If there is a leader change, a single bucket request can be sent to the new leader, and the previously successful bucket requests don't need to be retried.
If the request is split on the TiKV side, the snapshot is retrieved once, so there is no case where some buckets have to be retried due to a leader change.
If TiKV is slow and TiDB frequently hits timeouts, retrying the whole region request would make the situation much worse.
Then TiDB can change its retry behavior; it can easily know which ranges are missing.
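To illustrate that point, a minimal Go sketch (hypothetical keyRange type; not actual client-go code) of computing the ranges still missing after a partial response, so only those parts are re-sent:

```go
package main

import (
	"fmt"
	"sort"
)

type keyRange struct{ start, end string }

// missingRanges returns the parts of req not covered by the already-answered
// sub-ranges (assumed to lie within req), so only those parts are retried,
// e.g. against a new leader.
func missingRanges(req keyRange, done []keyRange) []keyRange {
	sort.Slice(done, func(i, j int) bool { return done[i].start < done[j].start })
	var missing []keyRange
	cur := req.start
	for _, d := range done {
		if d.start > cur {
			missing = append(missing, keyRange{cur, d.start})
		}
		if d.end > cur {
			cur = d.end
		}
	}
	if cur < req.end {
		missing = append(missing, keyRange{cur, req.end})
	}
	return missing
}

func main() {
	req := keyRange{"a", "z"}
	done := []keyRange{{"a", "d"}, {"f", "k"}} // sub-ranges that already succeeded
	fmt.Println(missingRanges(req, done))      // [{d f} {k z}]
}
```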
I left unary vs. streaming as an unresolved question and will make a decision after evaluating maintainability and performance.
/cc @solotzg you may also be interested in this RFC.
...and improvement on replication
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
What's the conclusion about tikv/pd#4326? Will PD split regions by placement rules or not? @disksing

In the long term, I think the answer is yes, especially when systems without TiFlash have fewer regions and perform better. But in the short term, we can hold off on the task to avoid too much work.
A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate
range size. Every bucket should have the size about 128MiB. Their ranges and stats are all reported to PD. PD
If all the bucket stats are reported to PD, it would consume a lot of memory.
We only need to report bucket stats when the read/write workload exceeds a threshold.
So for most of the regions, we only need to report the ranges of the buckets.
The total memory on PD would be greatly reduced.
The bucket size is about 128MiB, which means buckets can consume as much memory as the current small regions do. As far as I know, that's not an issue yet.
@coocood Good point. Do you know roughly what the memory consumption per region in PD is today?
For 128MiB buckets and a 1PB total database size, that's about 8 million buckets.
@BusyJay For the bucket stats report, will both the leader and the followers need to report their buckets? I guess it's very likely that, for the same region, the bucket ranges will differ. If only the leader reports the bucket stats, then PD will have to add the follower bucket stats when calculating machine-level stats, or machine-level stats will have to be reported separately.
Currently, small regions report statistics on both the leader and the followers. Only the top-N (1000) hot regions' stats of a node are reported, not all of them. Maybe buckets can keep the same behavior.
...the bucket ranges will differ.

It doesn't matter. Only the leader's buckets are used, and followers' stats will be merged into the leader's using an algorithm similar to the one described in the flow report section.

Only the top-N (1000) hot regions' stats of a node are reported, not all of them.

I'm worried about accuracy and message layout. I prefer to report all metrics and do the filtering on the PD side. Let's see how it performs first.
text/0082-dynamic-size-region.md (outdated)

### Bucket

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
Do we need to introduce a bucket version field in metapb::Region? So each time we update the buckets, the bucket version increases.
No, buckets are not shared by regions; they are a peer's own property, so metapb::Region should not be modified. I'm not sure whether a version is useful, as eventual consistency is acceptable.
If the buckets are not included in metapb::Region, then there should be a new type metapb::RegionBuckets.
TiDB can query this info from PD and use it to build requests.
text/0082-dynamic-size-region.md (outdated)

apply to make single region apply logs faster. Since unorder apply is a standalone feature, I’m not
going into details here.

For read hotspots, split should be triggered by PD, which can utilize the global statics from all
s/statics/stats
### Dynamic size

The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
I wonder whether 512MiB is too big if the hotspot is a small table. Will the split size have a lower limit?
512MiB is too big if the hotspot is a small table

It's configurable. As long as only a small number of hotspots is maintained, the size of a hotspot doesn't matter.

Will the split size have a lower limit?

Yes, and it's configured as 512MiB for now.
Do we use the same heartbeat interval for both cold and hot regions?
It's unspecified. Either hibernating or staying awake sounds fine to me.
text/0082-dynamic-size-region.md (outdated)

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate
range size. Every bucket should have the size about 128MiB. Their ranges and stats are all reported to PD. PD
can detect hot buckets from all regions and decide whether need to split new hotspots and schedules them to
I think this part needs more detail. cc @bufferflies.
BTW, is it recommended to split the region only at bucket boundaries? The split key may be more reasonably determined by query statistics; the key may come from a query.
I think this part needs more detail

It's covered at https://github.com/tikv/rfcs/pull/82/files#diff-ec76198abf750cd69e94e0e14564aea21f6f2de06d0bf85829d6d68969f36d20R107-R110.

...from the boundaries of buckets

Yes, the split key is chosen from the boundaries of buckets. As long as a hotspot is not accidentally split in two, I think it's no worse than using query statistics. Query statistics are unpredictable and change frequently, which would make buckets hard to maintain.
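For illustration only (the RFC intentionally leaves the concrete algorithm open), a Go sketch of one way PD could pick a split key from the reported bucket boundaries so that the accumulated load is divided as evenly as possible:

```go
package main

import "fmt"

// pickSplitKey returns the inner bucket boundary that divides the region's
// accumulated load most evenly. keys are the inner boundaries; load[i] is the
// stat of bucket i, so len(load) == len(keys)+1.
func pickSplitKey(keys []string, load []uint64) string {
	var total uint64
	for _, l := range load {
		total += l
	}
	best, bestDiff := keys[0], total+1
	var left uint64
	for i, k := range keys {
		left += load[i]
		// diff = |load left of k - load right of k|
		var diff uint64
		if 2*left > total {
			diff = 2*left - total
		} else {
			diff = total - 2*left
		}
		if diff < bestDiff {
			best, bestDiff = k, diff
		}
	}
	return best
}

func main() {
	keys := []string{"b", "c", "d"}       // boundaries between 4 buckets
	load := []uint64{10, 80, 20, 10}      // per-bucket read stats
	fmt.Println(pickSplitKey(keys, load)) // "c": splits the load 90 vs 30
}
```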
text/0082-dynamic-size-region.md (outdated)

### Bucket

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
Currently the query stats are collected at the region level. For example, a range scan [start_key, end_key] may cover buckets 1 to 10 in this region; how do we split it by buckets?
If the scan is divided into sub-scans per bucket and we only know the per-bucket stats, we cannot know the split cost. For example, after a split the scan request in this region may become two scan requests or stay as one (region-level) request, depending on whether the split key falls inside the scanned range.
Scan requests will be split by buckets, and access statistics are collected at the bucket level.
I'm not sure what "split cost" means.
Splitting will increase the number of RPCs; in some scenarios we need to make a trade-off.
I don't think we need to consider the RPC count for now. But I agree there may be many choices for split keys. I don't think we can find the best solution without a lot of experiments. The RFC avoids describing a concrete algorithm on purpose, so as not to make false assumptions.
- GC will still wake up all regions and cause periodical high usage.

There are two source of region creations: size/keys split and table split. Table split is disabled by
default in TiKV. Even if it splits a lot of regions, small one will be merged later.
We may need more details about merge, e.g. should we try to move a smaller region instead of a larger region.
It depends on the urgency. If it's a hotspot, it should be split first and then moved. Otherwise, there is no need to split unless size balance cannot be achieved without splitting.
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
@@ -61,6 +61,20 @@ balance.

![dynamic size buckets](../media/dynamic-size-buckets.png)

A new bucket metadata will be added to kvproto:
```
message Buckets {
Should we allocate an ID for each bucket to reduce the size of requests from clients?
The ID could consist of the region's term and version plus a number.
...reduce the size of requests from clients?

I don't get it. The client should send the range it wants to scan, so it has to be a precise start key and end key. How can an ID reduce the size?
Buckets are supposed to change independently from regions. For example, a client can reuse the last overlapping buckets even after a region split.
It can optimize for queries that need to scan the whole bucket.
Because bucket keys are generated by an even split, the range of a query will probably differ from the bucket edges.
### Compatibility

As buckets don't modify existing metadata, so it's backward compatible. When upgrading from small
regions, PD may trigger a lot of merge to get a large region size. This procedure should be made
If we have some regions with heavy requests on different stores, we don't need to merge them until they cool down. But the way bucket statistics are calculated may differ, which may cause problems.
As regions get bigger, the access pattern within one region becomes more complicated, especially mixed read-write patterns, which will give hot regions read-write conflicts. I think how to split better and help scheduling resolve read-write conflicts is a more complicated and challenging problem that we will encounter in the follow-up.
than original region heartbeats, which is 60s. The new stream is expected to be reported every
10 seconds when have changes.

PD will maintain top N hot spot buckets and split all of those buckets that exceed P of
I think:
- The bucket needs a unique identifier, and it should work like region split, using right-derive; otherwise PD cannot tell whether it is a stable hot bucket or just a traffic peak.
- In most cases, the split can use a boundary key of a bucket, but a load-based split-key strategy or another strategy decided by PD should still be acceptable.
...it should work like region split, using right-derive

This can be very complicated. How about handling the load trace on the PD side? For example, if the buckets were [a, d), [d, f) in the past and the new version is [a, c), [c, e), [e, f), then history H[a, d) should be inherited by both [a, c) and [c, e), and history H[d, f) should be inherited by both [c, e) and [e, f). The overlapping histories are summed up, and the inherited history statistics can optionally be multiplied by a factor like 0.8 to smooth the process.

...a load-based split-key strategy or another strategy decided by PD...

It's up to PD to split at any key; choosing a bucket boundary is recommended.
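A Go sketch of the inheritance described above, using a hypothetical single history value per bucket and the optional smoothing factor; this is not an actual PD implementation:

```go
package main

import "fmt"

type bucket struct {
	start, end string
	history    float64
}

// inherit gives every new bucket the summed history of all old buckets whose
// range overlaps it, scaled by factor (e.g. 0.8) to smooth the transition.
func inherit(old, cur []bucket, factor float64) []bucket {
	for i := range cur {
		var h float64
		for _, o := range old {
			if o.start < cur[i].end && cur[i].start < o.end { // ranges overlap
				h += o.history
			}
		}
		cur[i].history = h * factor
	}
	return cur
}

func main() {
	old := []bucket{{"a", "d", 100}, {"d", "f", 40}}
	cur := []bucket{{"a", "c", 0}, {"c", "e", 0}, {"e", "f", 0}}
	// [a,c) inherits H[a,d); [c,e) inherits H[a,d)+H[d,f); [e,f) inherits H[d,f).
	fmt.Println(inherit(old, cur, 0.8)) // [{a c 80} {c e 112} {e f 32}]
}
```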
I think this method is also complicated. Do you mean that the bucket boundaries may be different every time? [a, d), [d, f) change to [a, c), [c, e), [e, f); d is gone.
It can change, but it only changes when there are many updates.
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
lgtm
uint64 region_id = 1;
uint64 version = 2; // A hint indicate if keys have changed.
repeated bytes keys = 3;
repeated uint64 read = 4;
How about using a separate message here? We may add other stats later.
repeated BucketStats stats = 3
The goal is to reduce the traffic. If using a nested message, the overhead becomes 1 * field count * bucket count + actual number length + 2 * bucket count. If using separate arrays, the overhead becomes 2 * field count + actual number length.
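To put rough numbers on those formulas (illustrative figures, not from the discussion): assuming 4 stat fields and 100 buckets per report, the nested-message layout adds roughly 1 × 4 × 100 field tags plus 2 × 100 length-prefix bytes = 600 bytes of overhead on top of the encoded numbers, while the parallel-array layout adds only about 2 × 4 = 8 bytes.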
LGTM
LGTM
text/0082-dynamic-size-region.md (outdated)

### Dynamic size

The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions. So there are two types of split, hotspot split and general size split. Hotspot
Suggestion: "hotspot split" can be renamed to the existing term "load-based split". Avoid introducing new terms whenever possible.
text/0082-dynamic-size-region.md (outdated)

### Dynamic size

The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions. So there are two types of split, hotspot split and general size split. Hotspot
At this point, since we haven't done extensive tests comparing region sizes, I think numbers such as 512MiB and 10GiB are not finalized.
Yes, it's mentioned that they are configurable.
text/0082-dynamic-size-region.md (outdated)

### Bucket

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate
"buckets are split" is confusing; we never split a bucket. I guess "buckets are created" would be better.
A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate
range size. Every bucket should have the size about 128MiB. Their ranges and stats are all reported to PD. PD
- Why isn't it 96MB?
- To recap the offline discussion here: there is another option that can achieve a similar function but with better scalability and less complexity for PD.
TiKV handles the bucket traffic details, and PD is only aware of the bucket key ranges, not the traffic stats. TiKV does the split itself with its local data, using a threshold that PD generates periodically from the global view of traffic by collecting machine-level traffic stats.
The concern I have with the current solution is that, for a 1PB database, PD needs to collect and process about 8 million entries at sub-second intervals. Today PD is essentially a single-node component (followers do not take requests). I'm not sure about the impact on PD.
text/0082-dynamic-size-region.md (outdated)

For read hotspots, split should be triggered by PD, which can utilize the global statics from all
regions and nodes. For normal read requests, TiKV will need to split its range into smaller buckets
according to statics to increase concurrency. When TiDB wants to do a scan, it sends the RPC once,
and TiKV will split the requests into smaller concurrent jobs.
For a large region with stale read enabled, I think it might be better to distribute the read load among followers. It's up to TiDB to make the choice.
and TiKV will split the requests into smaller concurrent jobs.

In the past design, follower read has also been discussed to offload works from leader. But TiDB
can’t predict the progress of a follower, so latency can also become unpredictable. It’s more
For follower read, can TiDB skip peers that are behind the leader by more than a threshold?
I think it can, but it doesn't.
A new bucket metadata will be added to kvproto:
```
message Buckets {
Should it be renamed to BucketStats?
Also, where's the key range definition?
It's defined as the third field, keys.
may be complicated. And we rely on further designs like separating LSM tree to optimize the cost
for all.

### Compatibility
If we implement the feature such that buckets are just an optional optimization for big region sizes, then by its nature it's backward compatible.
And there's nothing preventing users from keeping small regions, so that should be well supported in the long term as well. For small to medium-sized clusters, users will likely keep the small region size.
10GiB is just an example, it's allowed to change to a bigger or smaller value.

### Bucket
We measure the read/write load at the bucket level, which means at about 128MiB granularity. How can we further split a hot bucket? Today we have load-based split, which may cut a small region according to the access pattern.
I don't have a clear idea now. I will leave it as an unresolved question. Maybe we can split buckets further. For example, besides controlling the size of a bucket, we could also require a minimal bucket count per region, so even when a region becomes small it can still be split further.
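For illustration, a Go sketch (hypothetical names and constants) of combining a per-bucket size cap with a minimal per-region bucket count:

```go
package main

import "fmt"

const (
	maxBucketSize  = 128 << 20 // 128MiB cap per bucket
	minBucketCount = 4         // keep at least this many buckets per region
)

// bucketCount decides how many buckets a region of the given size gets, so
// even a small hot region keeps enough bucket boundaries to be split further.
func bucketCount(regionSize uint64) uint64 {
	n := (regionSize + maxBucketSize - 1) / maxBucketSize // ceil(size / cap)
	if n < minBucketCount {
		n = minBucketCount
	}
	return n
}

func main() {
	fmt.Println(bucketCount(10 << 30))  // 10GiB cold region -> 80 buckets
	fmt.Println(bucketCount(512 << 20)) // 512MiB hot region -> 4 buckets
}
```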
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>