make region size dynamic #82

Merged
12 commits merged into tikv:master on Dec 9, 2021

Conversation

@BusyJay (Member) commented Nov 29, 2021

No description provided.

@BusyJay added the Initial Comment Period label on Nov 29, 2021
### Dynamic size

The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions. So there are two types of split, hotspot split and general size split. Hotspot
@feitian124 Nov 30, 2021

To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions.

Does "simplified" mean simplifying the description of the idea in this document, or simplifying the implementation?
Is 512MiB fixed?

So does "simplified" here mean the hot region size is 1/2 of the normal one? Is it fixed or configurable?
How much of a performance increase is expected from reducing the hot region size to 1/2?

@BusyJay (Member Author)

Simplify the implementation. I don't think it's 1/2, but rather 5% (512MiB out of 10GiB). It's configurable. Using a small size makes hot regions easy to schedule around and quick to balance.

Sorry, I misread it. 5% and configurable sounds reasonable.

### Bucket

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate

Does splitting into buckets mean splitting the region? What do you mean by "scan"? Could you make the bucket concept clearer?

@BusyJay (Member Author)

I'm not sure why you think it's related to region split. As described in the text and the image, buckets are logical segments of a region's range.

As per your description, a bucket is a logical concept.
No matter how you split buckets, they are still in the same region. Unless the owning region splits, I'm not sure how splitting a physical region into logical buckets can improve performance.

@BusyJay (Member Author)

No matter how you split buckets, they are still in the same region.

Yes, that's exactly what it means.

Buckets serve mainly two purposes: 1. collecting access statistics, which PD can use to detect hotspots and decide whether to split a region; 2. TiKV-side concurrency: a full scan request will be divided into smaller concurrent scans, as sketched below.
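To illustrate the second purpose, here is a minimal sketch of how a scan range could be divided along bucket boundaries into sub-ranges that can be scanned concurrently on the same snapshot. The function name and signature are hypothetical, not actual TiKV code:

```rust
/// Divide a requested scan range into per-bucket sub-ranges so they can be
/// scanned concurrently. `bucket_keys` are the sorted boundary keys of the
/// region's buckets. Illustrative only.
fn split_scan_by_buckets(
    start: &[u8],
    end: &[u8],
    bucket_keys: &[Vec<u8>],
) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut ranges = Vec::new();
    let mut cur = start.to_vec();
    for key in bucket_keys {
        // Only boundaries strictly inside (cur, end) cut the scan further.
        if key.as_slice() > cur.as_slice() && key.as_slice() < end {
            ranges.push((cur, key.clone()));
            cur = key.clone();
        }
    }
    ranges.push((cur, end.to_vec()));
    ranges // each sub-range becomes one concurrent scan task
}
```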

Member

When a region becomes hot, does the bucket need to be re-split since it uses a different split policy?

@BusyJay (Member Author)

Yes, it works much like the current region scan split.


The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions. So there are two types of split, hotspot split and general size split. Hotspot
split is triggered by PD. General size split is triggered by TiKV itself.
Contributor

Will it include both hotspot and region size?

@BusyJay (Member Author)

What does "it" mean?

Contributor

it means PD.

@BusyJay (Member Author)

No, PD only needs to do hotspot split.

Member

Does it mean the current load-based split logic will move to PD?

@BusyJay (Member Author)

Yes, except the key and statistics collection.


### Flow report

Buckets statistics are reported in a standalone stream. They will be reported more frequently
Contributor

In the past, TiKV separated read and write into different channels. Should they be unified into one channel?

@BusyJay (Member Author)

Either way is OK. It depends on the implementation difficulty.

controllable to let PD split hotspots and schedule to get balance.

Now that a region can be in 10GiB, a full scan can make the response exceed the limit of gRPC,
which is 4GiB. So instead of unary, we need to use server streaming RPC to return the response as
@hicqu (Contributor) Dec 1, 2021

we need to use server streaming RPC

Is this confirmed, or just a candidate solution? An alternative is to make clients send ranges instead of regions; maybe we can discuss which is better.

@BusyJay (Member Author)

It's a candidate solution. I prefer that the client send a unary request with the whole range: it's simple and efficient. Otherwise, TiDB may waste CPU on batching RPCs. A similar problem can be found here: tikv/client-go#342.

But the streaming API is deprecated in TiDB.

It would be much easier to retry if TiDB split the request by buckets; each bucket could be sent concurrently.

If there is a leader change, a single bucket request can be sent to the new leader, and the bucket requests that already succeeded don't need to be retried.

@BusyJay (Member Author)

If the request is split on the TiKV side, the snapshot is retrieved once, so there is no case where only some of the buckets need to be retried due to a leader change.

If TiKV is slow and TiDB frequently encounters timeouts, retrying the whole region request would make the situation much worse.

@BusyJay (Member Author)

Then TiDB can change its retry behavior; it can easily know which ranges are missing.

@BusyJay (Member Author)

I left unary vs. streaming as an unresolved question and will make a decision after evaluating maintainability and performance.

@BusyJay (Member Author) commented Dec 1, 2021

/cc @solotzg you may also be interested in this RFC.

...and improvement on replication

@solotzg commented Dec 1, 2021

What's the conclusion on tikv/pd#4326? Will PD split regions by placement rules or not? @disksing

@BusyJay (Member Author) commented Dec 1, 2021

In the long term, I think the answer is yes, especially when all systems without TiFlash have fewer regions and perform better. But in the short term, we can hold off on the task to avoid too much work.


A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate
range size. Every bucket should have the size about 128MiB. Their ranges and stats are all reported to PD. PD
If all the bucket stats are reported to PD, it will consume a lot of memory.
We only need to report bucket stats when the read/write workload exceeds a threshold.

So for most regions, we only need to report the bucket ranges.
The total memory on PD would be greatly reduced.

@BusyJay (Member Author)

The bucket size is about 128MiB, which means buckets can consume at most as much memory as current small regions use. As far as I know, that's not an issue yet.

@coocood Good point. Do you know roughly what the memory consumption per region in PD is today?
With 128MiB buckets and 1PiB of total database size, that's about 8 million buckets.
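For reference, the arithmetic behind that estimate (using binary units): 1 PiB / 128 MiB = 2^50 / 2^27 = 2^23 ≈ 8.4 million buckets.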

@BusyJay For bucket stats reporting, will both the leader and followers need to report their buckets? I guess it's very likely that, for the same region, the bucket ranges will differ. If only the leader reports bucket stats, then PD will have to add the follower bucket stats when calculating machine-level stats, or machine-level stats will have to be reported separately.

@nolouch (Contributor) Dec 9, 2021

Currently, small regions report statistics on both the leader and followers. Only the top-N (1000) hot regions' stats of a node are reported, not all of them. Maybe buckets can keep the same behavior.

@BusyJay (Member Author)

...the bucket ranges will differ.

It doesn't matter. Only the leader's buckets are used, and followers' stats will be merged into the leader's using an algorithm similar to the one described in the flow report section.

Only the top-N (1000) hot regions' stats of a node are reported, not all of them.

I'm worried about accuracy and message layout. I prefer to report all metrics and filter on the PD side. Let's see how it performs first.


### Bucket

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket

Do we need to introduce a bucket version field in metapb::Region?
So each time we update the buckets, the bucket version increases.

@BusyJay (Member Author)

No, buckets are not shared between regions; they are a peer's own property, so metapb::Region should not be modified. I'm not sure whether a version is useful, as eventual consistency is acceptable.

If the buckets are not included in metapb::Region, then there should be a new type such as metapb::RegionBuckets.
TiDB can query this info from PD and use it to build requests.

apply to make single region apply logs faster. Since unorder apply is a standalone feature, I’m not
going into details here.

For read hotspots, split should be triggered by PD, which can utilize the global statics from all

s/statics/stats


### Dynamic size

The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
Contributor

I wonder whether 512MiB is too big if the hotspot is a small table. Will the split size have a lower limit?

@BusyJay (Member Author)

512MiB is too big if the hotspot is a small table

It's configurable. As long as only a small number of hotspots need to be maintained, the size of a hotspot doesn't matter.

Will the split size have a lower limit?

Yes, and it's configured as 512MiB for now.

Member

Do we use the same heartbeat interval for both cold and hot regions?

@BusyJay (Member Author)

It's unspecified. Either hibernating or staying awake seems fine to me.

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate
range size. Every bucket should have the size about 128MiB. Their ranges and stats are all reported to PD. PD
can detect hot buckets from all regions and decide whether need to split new hotspots and schedules them to
@nolouch (Contributor) Dec 2, 2021

I think this part needs more detail. cc @bufferflies.

BTW, is it recommended to only split the region at bucket boundaries? The split key is more reasonably determined by query statistics; the key may come from a query.

@BusyJay (Member Author)

I think this part needs more detail

It's covered at https://github.com/tikv/rfcs/pull/82/files#diff-ec76198abf750cd69e94e0e14564aea21f6f2de06d0bf85829d6d68969f36d20R107-R110.

...only split the region at bucket boundaries

Yes, the split key is chosen from the bucket boundaries. As long as a hotspot is not accidentally split in two, I think it's no worse than using query statistics. Query statistics are unpredictable and change frequently, which would make buckets hard to maintain.


### Bucket

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
@nolouch (Contributor) Dec 2, 2021

Currently, query stats are collected at the region level. For example, a range scan [start_key, end_key] may cover buckets 1 to 10 in this region; how is it split by buckets?
If the scan is divided into sub-scans per bucket and we only know the per-bucket stats, we cannot know the split cost. For example, a scan request in this region may be split into two scan requests or remain a single (region-level) request, depending on whether the split key falls inside the scanned range.

@BusyJay (Member Author)

Scan requests will be split by buckets, and access statistics are collected at the bucket level.

I'm not sure what "split cost" means.

Contributor

Splitting will increase the number of RPCs; in some scenarios we need to make a trade-off.

@BusyJay (Member Author)

I don't think we need to consider the RPC count for now. But I agree there may be many choices for split keys, and I don't think we can get the best solution without a lot of experiments. The RFC avoids describing a concrete algorithm on purpose, so as not to make false assumptions.


- GC will still wake up all regions and cause periodical high usage.

There are two source of region creations: size/keys split and table split. Table split is disabled by
default in TiKV. Even if it splits a lot of regions, small one will be merged later.
Member

We may need more details about merge, e.g. should we try to move a smaller region instead of a larger region.

@BusyJay (Member Author)

It depends on the urgency. If it's a hotspot, it should be split first and then moved. Otherwise, there is no need to split unless size balance is impossible to achieve.

@@ -61,6 +61,20 @@ balance.

![dynamic size buckets](../media/dynamic-size-buckets.png)

A new bucket metadata will be added to kvproto:
```
message Buckets {
Member

Should we allocate an ID for each bucket to reduce the size of requests from clients?

Member

The ID could consist of term and version of the region and a number.

@BusyJay (Member Author)

...reduce the size of requests from clients?

I don't get it. The client should send the range it wants to scan, so it has to be a precise start key and end key. How can an ID reduce the size?

Buckets are supposed to change independently of the region. For example, the client can reuse the last overlapping buckets even after a region split.

Member

It can optimize for queries that need to scan the whole bucket.

@BusyJay (Member Author)

Because bucket keys are generated by an even split, the range of a query will probably differ from the bucket edges, as sketched below.
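A rough sketch of that size-based even split, assuming hypothetical (key, cumulative size) samples such as those that could be derived from SST properties; this is not the actual TiKV implementation:

```rust
/// Emit a bucket boundary roughly every `bucket_size` bytes based on
/// (key, cumulative size) samples. Purely illustrative.
fn even_split_keys(samples: &[(Vec<u8>, u64)], bucket_size: u64) -> Vec<Vec<u8>> {
    let mut keys = Vec::new();
    let mut next_cut = bucket_size;
    for (key, cumulative) in samples {
        if *cumulative >= next_cut {
            keys.push(key.clone());
            next_cut += bucket_size;
        }
    }
    // Boundaries fall at size offsets, so they rarely line up with the
    // start/end keys of any particular query.
    keys
}
```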

### Compatibility

As buckets don't modify existing metadata, so it's backward compatible. When upgrading from small
regions, PD may trigger a lot of merge to get a large region size. This procedure should be made
@rleungx (Member) Dec 7, 2021

If we have some regions with heavy requests on different stores, we don't need to merge them until they cool down. But the way bucket statistics are calculated may differ, which may cause problems.

@nolouch (Contributor) left a comment

As regions get bigger, the access pattern within one region becomes more complicated, especially the read-write pattern, which will cause hot regions to have read-write conflicts. I think how to split better and help scheduling solve the read-write conflict problem is a more complicated and challenging problem that we will encounter in the follow-up.


than original region heartbeats, which is 60s. The new stream is expected to be reported every
10 seconds when have changes.

PD will maintain top N hot spot buckets and split all of those buckets that exceed P of
Contributor

I think:

  • The bucket needs a unique identifier, and it should work like region split and use right-derive; otherwise PD cannot tell whether it is a stable hot bucket or just a traffic peak.
  • In most cases, the split can use a bucket boundary key, but a load-based split-key strategy (or another strategy decided by PD) still needs to be acceptable.

@BusyJay (Member Author)

...it should work like region split and use right-derive

This can be very complicated. How about handling the load trace on the PD side? For example, if the buckets were [a, d), [d, f) in the past and the new version is [a, c), [c, e), [e, f), then history H[a, d) should be inherited by both [a, c) and [c, e), and history H[d, f) by both [c, e) and [e, f). The overlapping histories should be summed up, and the inherited history statistics can optionally be multiplied by a factor like 0.8 to smooth the process. A rough sketch is given below.
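A minimal sketch of that overlap-based inheritance, using hypothetical types and names rather than PD code:

```rust
/// Give every new bucket the summed, damped history of all old buckets whose
/// ranges overlap it. Ranges are [start, end) key pairs; purely illustrative.
fn inherit_history(
    old: &[((Vec<u8>, Vec<u8>), u64)], // (range, accumulated load)
    new_ranges: &[(Vec<u8>, Vec<u8>)],
    decay: f64,                        // e.g. 0.8 to smooth the transition
) -> Vec<u64> {
    new_ranges
        .iter()
        .map(|(ns, ne)| {
            let sum: u64 = old
                .iter()
                .filter(|((os, oe), _)| os < ne && ns < oe) // [os, oe) overlaps [ns, ne)
                .map(|(_, load)| *load)
                .sum();
            (sum as f64 * decay) as u64
        })
        .collect()
}
```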

...a load-based split-key strategy (or another strategy decided by PD)...

It's up to PD to split by any key; choosing a bucket boundary is only a recommendation.

Contributor

I think this method is also complicated. Do you mean that the bucket boundaries may be different every time? [a, d), [d, f) change to [a, c), [c, e), [e, f); d is gone.

@BusyJay (Member Author)

It can change, but it only changes when there are many updates.

@nolouch (Contributor) left a comment

lgtm

uint64 region_id = 1;
uint64 version = 2; // A hint indicate if keys have changed.
repeated bytes keys = 3;
repeated uint64 read = 4;
Contributor

How about using a separate message here? We may want to add other statistics.

repeated BucketStats stats = 3

@BusyJay (Member Author)

The goal is to reduce the traffic. Using a nested message, the overhead becomes 1 * field count * bucket count + actual number length + 2 * bucket count. Using separate arrays, the overhead becomes 2 * field count + actual number length.
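To make the comparison concrete (a rough illustration, assuming one-byte tags and packed repeated fields): with 4 stat fields and 80 buckets, nested messages add roughly 1 × 4 × 80 + 2 × 80 = 480 bytes of framing on top of the encoded numbers, while flat parallel arrays add only about 2 × 4 = 8 bytes.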

@5kbpers (Member) left a comment

LGTM

@tonyxuqqi

LGTM

### Dynamic size

The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions. So there are two types of split, hotspot split and general size split. Hotspot

Suggestion: "hotspot split" could be renamed to the existing term "load-based split". Avoid introducing new terms whenever possible.

### Dynamic size

The hotter a region is, the smaller its size becomes. To make it simplified, we choose 512MiB for hot regions
and 10GiB for cold regions. So there are two types of split, hotspot split and general size split. Hotspot

At this point, since we haven't done extensive tests comparing region sizes, I think the numbers such as 512MiB and 10GiB are not finalized.

@BusyJay (Member Author)

Yes, it's mentioned that they are configurable.

### Bucket

A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate

"Buckets are split" is confusing; we never split a bucket. I guess "buckets are created" would be better.


A region is split into several buckets logically. We will collect query stats by buckets and report the bucket
to PD. For hotspot regions, buckets are split by scan. For cold regions, buckets are split by approximate
range size. Every bucket should have the size about 128MiB. Their ranges and stats are all reported to PD. PD

  1. Why isn't it 96MB?
  2. To recap the offline discussion: there is another option that can achieve a similar function but scales better and adds less complexity to PD.
    TiKV handles the bucket traffic details, and PD is only aware of bucket key ranges, not traffic stats. TiKV does the split itself with its local data, using thresholds that PD generates periodically from a global view of traffic by collecting machine-level traffic stats.
    My concern with the current solution is that for a 1PB database, PD needs to collect and process 8 million entries at a sub-second-level interval. Today PD is essentially a single-node component (followers do not take requests), so I'm not sure about the impact on PD.

For read hotspots, split should be triggered by PD, which can utilize the global statics from all
regions and nodes. For normal read requests, TiKV will need to split its range into smaller buckets
according to statics to increase concurrency. When TiDB wants to do a scan, it sends the RPC once,
and TiKV will split the requests into smaller concurrent jobs.

For a large region with stale read enabled, I think it might be better to distribute the read load among followers. It's up to TiDB to make the choice.

and TiKV will split the requests into smaller concurrent jobs.

In the past design, follower read has also been discussed to offload works from leader. But TiDB
can’t predict the progress of a follower, so latency can also become unpredictable. It’s more

For follower read, can TiDB skip peers that are obviously behind the leader by more than a threshold?

@BusyJay (Member Author)

I think it can, but it doesn't.


A new bucket metadata will be added to kvproto:
```
message Buckets {


Should it be renamed to BucketStats?
Also, where is the key range definition?

@BusyJay (Member Author)

The key range is defined by the third field, keys.

may be complicated. And we rely on further designs like separating LSM tree to optimize the cost
for all.

### Compatibility

If we implement the feature so that buckets are just an optional optimization for large region sizes, then by its nature it's backward compatible.
And there's nothing preventing users from using small regions, so that should be well supported in the long term as well. For small to medium sized clusters, users will likely keep the small region size.


10GiB is just an example, it's allowed to change to a bigger or smaller value.

### Bucket

Since we measure the read/write load at the bucket level, the granularity is about 128MiB. How can we further split a hot bucket? Today we have load-based split, which may cut a small region according to the access pattern.

@BusyJay (Member Author)

I don't have a clear idea now, so I will leave it as an unresolved question. Maybe we can split buckets further. For example, besides controlling the size of a bucket, we could also require a minimal bucket count per region, so even when a region becomes small it can still be split further (see the sketch below).
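One possible way to realize that idea, as a hypothetical sketch rather than a committed design: derive the bucket count from both a target size and a minimum count.

```rust
/// Pick how many buckets a region gets: at least `min_count`, otherwise
/// roughly one bucket per `target_bucket_size` bytes. Illustrative only.
fn bucket_count(region_size: u64, target_bucket_size: u64, min_count: u64) -> u64 {
    (region_size / target_bucket_size).max(min_count).max(1)
}

// e.g. a 256 MiB hot region with a 128 MiB target and min_count = 4
// still gets 4 buckets, so PD can keep splitting it if needed.
```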

BusyJay and others added 2 commits December 9, 2021 18:56
@BusyJay added the Final Comment Period label and removed the Initial Comment Period label on Dec 9, 2021
@BusyJay merged commit 8bd15f2 into tikv:master on Dec 9, 2021
@BusyJay deleted the dynamic-size branch on December 9, 2021
pingyu pushed a commit to pingyu/tikv-rfcs that referenced this pull request Nov 4, 2022
pingyu added a commit that referenced this pull request Nov 8, 2022
Labels: Final Comment Period

10 participants