docs: Initial version of TTL Table RFC #22763
Conversation
Signed-off-by: Xiaoguang Sun <sunxiaoguang@zhihu.com>
/cc @Connor1996
@morgo would you like to take a look?
A few questions from me. Thank you for working on this!
## Proposal
Introduce new `TTL` and `TTL_GRANULARITY` table options when creating or altering tables. By specifying `TTL`, users can expect expired data to be removed automatically. Additionally, users can choose either `ROW` or `PARTITION` for the `TTL_GRANULARITY` option to trade off collection granularity against the cost of running garbage collection. `ROW` mode evaluates the expiry time against each row and reclaims space on a per-row basis; this gives the finest granularity and the most accurate expiration timing. `PARTITION` mode, on the other hand, partitions data according to its last update time; a background timer rolls over partitions and truncates all expired data in the oldest partition at once.
Can you show a full `SHOW CREATE TABLE` example so it becomes more clear? I assume this requires a timestamp to be collected; will this appear in the `SHOW CREATE TABLE`? It could be useful to see which rows are about to be purged.
It could be something like:
CREATE TABLE ttl_table (
  id varchar(255),
  author varchar(255),
  content varchar(65535),
  PRIMARY KEY(id)
) TTL='10m' TTL_GRANULARITY='ROW';
CREATE TABLE ttl_table (
  id varchar(255),
  author varchar(255),
  content varchar(65535),
  PRIMARY KEY(id)
) TTL='10m' TTL_GRANULARITY='PARTITION';
Oh I see! So it uses partitioning internally, but not the partition syntax. Is there any restriction, such as TTL tables cannot be partitioned?
Hm, as far as I know there is no restriction when granularity is set to `PARTITION`, as long as the table itself is not already a partitioned table. If people want to use row granularity, it works for partitioned tables as well.
Sorry, that is what I meant with my question. Please document that partitioned tables do not support `TTL_GRANULARITY='PARTITION'`.
Oh, I see what you mean now. It is possible to achieve this by manually managing and truncating a partitioned table, but it's error-prone and tedious. I guess nobody would want to do that if it can be done automatically instead.
For non-partitioned tables, do we support the `PARTITION` `TTL_GRANULARITY`? BTW, could you add these examples to the proposal and explain their meanings?
It could be something like:
CREATE TABLE ttl_table (
  id varchar(255),
  author varchar(255),
  content varchar(65535),
  PRIMARY KEY(id)
) TTL='10m' TTL_GRANULARITY='ROW';
CREATE TABLE ttl_table (
  id varchar(255),
  author varchar(255),
  content varchar(65535),
  PRIMARY KEY(id)
) TTL='10m' TTL_GRANULARITY='PARTITION';
- Could you add the syntax to this doc?
- Please consider compatibility with MySQL (for example, if the table needs to be replicated to MySQL, it may be necessary to support the TiDB-specific comment style so the statement won't be broken in MySQL).
(Just a reminder) I noticed that MyRocks specifies the TTL in a different way, like:
CREATE TABLE t1 (
a bigint(20) NOT NULL,
b int NOT NULL,
ts bigint(20) UNSIGNED NOT NULL,
PRIMARY KEY (a),
KEY kb (b)
) ENGINE=rocksdb
COMMENT='ttl_duration=1;ttl_col=ts;'
However, I believe this is because MyRocks is only one of the storage engines for MySQL, so 'TTL' cannot be specified as a table option there. So I would say it is better to keep it as a table option in TiDB, as you are suggesting now.
docs/design/2021-02-09-ttl-table.md
Outdated
## Open issues (if applicable)
TBD
What is the expected behavior when reading data that has expired, but garbage collection has not run yet? I assume there are no guarantees.
I assume that touching the row (`UPDATE` / `INSERT ON DUPLICATE KEY UPDATE`) will refresh the TTL?
Yeah, there is no strict guarantee in this proposal, but it could be done by filtering data during reads at some cost if that is important. And yes, any update will refresh the expiry time.
I think this behavior is fine provided the semantics are stated.
### `PARTITION`
The TiDB DDL master maintains a new periodic task to roll over the partitions of a TTL table: it allocates a new partition as the current writing partition and truncates the oldest partition that has passed its lifetime. Truncating a partition is an `O(1)` operation independent of the number of records in the partition, so it can be done without blocking DDL for a noticeable amount of time and with a negligible amount of background work.
How is validation enforced that a partition will not contain rows that are not yet eligible for purge? If it is by requiring the TTL column (assuming a column is used) to be included in the partition expression, then I think `TTL_GRANULARITY` is redundant info? It should be able to determine this itself and save the users from specifying.
Actually, the user doesn't get a chance to choose which partition to use; by keeping some metadata in the schema we can be sure that the oldest partition only contains expired data.
It was implemented as a hackathon project called T4. Unfortunately there is no English version of the slides at this time, but you can get some idea from the code itself.
Here is the slides download link. There are lots of diagrams, so maybe you can still get some ideas without understanding the Chinese text.
Signed-off-by: Xiaoguang Sun <sunxiaoguang@zhihu.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@zhihu.com>
## Proposal
Introduce new `TTL` and `TTL_GRANULARITY` table options when creating or altering tables. By specifying `TTL`, users can expect expired data to be removed automatically. Additionally, users can choose either `ROW` or `PARTITION` for the `TTL_GRANULARITY` option to trade off collection granularity against the cost of running garbage collection. `ROW` mode evaluates the expiry time against each row and reclaims space on a per-row basis; this gives the finest granularity and the most accurate expiration timing. `PARTITION` mode, on the other hand, partitions data according to its last update time; a background timer rolls over partitions and truncates all expired data in the oldest partition at once.
For non-partitioned tables, do we support the `PARTITION` `TTL_GRANULARITY`? BTW, could you add these examples to the proposal and explain their meanings?
## Implementation
### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect data that has expired. To avoid inconsistency caused by differences in reclaiming progress between the index and the actual record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookups.
could you elaborate more about:
- How to store TTL definitions in the table schema, what needs to be changed.
- How does TiDB communicate with PD, the API/protobuf design
- How does PD communicate with TiKV, the API/protobuf design
- How to ensure stale rows are purged in TiKV, the internal design of TiKV, or the API to use in TiKV.
Sure, let me add more about these.
Changes to model.TableInfo:
@@ -298,6 +298,11 @@ type TableInfo struct {
	// TiFlashReplica means the TiFlash replica info.
	TiFlashReplica *TiFlashReplicaInfo `json:"tiflash_replica"`
+
+	// TTL options
+	TTL                 time.Duration // retention after the last update
+	TTLByRow            bool          // true for ROW granularity, false for PARTITION
+	NextTTLTruncateTime time.Time     // next scheduled partition rollover
}
This is what we had in the hackathon to quickly make it work. We should definitely make it better.
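As a hedged illustration of how the parsed table options might populate these fields during DDL, consider the sketch below; `applyTTLOptions` and its call site are hypothetical, and only the three `TableInfo` fields shown in the diff above come from the hackathon code.

package ddl

import (
	"strings"
	"time"

	// TableInfo with the hackathon TTL fields is assumed to live here.
	"github.com/pingcap/parser/model"
)

// applyTTLOptions fills the TTL fields on a TableInfo from the TTL and
// TTL_GRANULARITY table options, e.g. TTL='10m' TTL_GRANULARITY='ROW'.
func applyTTLOptions(tbl *model.TableInfo, ttl, granularity string) error {
	d, err := time.ParseDuration(ttl) // "10m", "10h", ...
	if err != nil {
		return err
	}
	tbl.TTL = d
	tbl.TTLByRow = strings.EqualFold(granularity, "ROW")
	if !tbl.TTLByRow {
		// PARTITION granularity: schedule the first rollover one TTL from
		// now; the DDL master refines this on each subsequent rollover.
		tbl.NextTTLTruncateTime = time.Now().Add(d)
	}
	return nil
}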
Changes to pdpb.proto to specify TTL configuration for ranges
message RangeTTL {
bytes start_key = 1;
bytes end_key = 2;
uint64 TTL = 3;
bytes user_data = 4;
bool add_gc_interval = 5; // delay TTL by adding GC interval
}
service PD {
rpc AddRangeTTL(AddRangeTTLRequest) returns (AddRangeTTLResponse) {}
rpc DeleteRangeTTL(DeleteRangeTTLRequest) returns (DeleteRangeTTLResponse) {}
rpc GetRangeTTL(GetRangeTTLRequest) returns (GetRangeTTLResponse) {}
rpc GetAllRangeTTL(GetAllRangeTTLRequest) returns (GetAllRangeTTLResponse) {}
}
message AddRangeTTLRequest {
RequestHeader header = 1;
repeated RangeTTL TTL = 2;
}
message AddRangeTTLResponse {
ResponseHeader header = 1;
}
message DeleteRangeTTLRequest {
RequestHeader header = 1;
bytes start_key = 2;
bytes end_key = 3;
}
message DeleteRangeTTLResponse {
ResponseHeader header = 1;
}
message GetRangeTTLRequest {
RequestHeader header = 1;
bytes start_key = 2;
bytes end_key = 3;
}
message GetRangeTTLResponse {
ResponseHeader header = 1;
RangeTTL TTL = 2;
}
message GetAllRangeTTLRequest {
RequestHeader header = 1;
}
message GetAllRangeTTLResponse {
ResponseHeader header = 1;
repeated RangeTTL TTL = 2;
}
message GetGCSafePointResponse {
ResponseHeader header = 1;
uint64 safe_point = 2;
repeated RangeTTL range_TTL = 3;
uint64 now = 4;
}
This is what we had in the hackathon to quickly make it work. We should definitely make it better.
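As a usage sketch of the messages above, TiDB could register a TTL table's index and record ranges with PD roughly like this. It assumes a Go client generated from the extended pdpb.proto (AddRangeTTL, RangeTTL, and AddGcInterval are part of that hypothetical extension, not the current kvproto API), and the key-range arguments are computed elsewhere.

package ttl

import (
	"context"

	// Hypothetical Go code generated from the extended pdpb.proto above.
	"github.com/pingcap/kvproto/pkg/pdpb"
)

// registerTableTTL tells PD about the TTL of a table's key ranges. The record
// range is registered with add_gc_interval=true so it outlives the index
// range by one GC interval, as described earlier for the ROW design.
func registerTableTTL(ctx context.Context, pd pdpb.PDClient,
	indexStart, indexEnd, recordStart, recordEnd []byte, ttlSeconds uint64) error {
	_, err := pd.AddRangeTTL(ctx, &pdpb.AddRangeTTLRequest{
		TTL: []*pdpb.RangeTTL{
			{StartKey: indexStart, EndKey: indexEnd, TTL: ttlSeconds},
			{StartKey: recordStart, EndKey: recordEnd, TTL: ttlSeconds, AddGcInterval: true},
		},
	})
	return err
}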
There are two new fields in GetGCSafePointResponse, which TiKV already uses to get the safe_point for every run of GC. These two new fields tell TiKV the current time on PD and the ranges that have TTL enabled. During the GC process, if a KV pair is older than the safe_point and has expired, it is collected unconditionally to reclaim the space.
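A minimal sketch of that check, written in Go for brevity even though TiKV's GC is implemented in Rust; every name below is illustrative rather than actual TiKV code.

package ttl

import "bytes"

// rangeTTL mirrors the RangeTTL message: a key range, its TTL in seconds, and
// whether the TTL is extended by one GC interval (used for record data).
type rangeTTL struct {
	startKey, endKey []byte
	ttlSecs          uint64
	addGCInterval    bool
}

// tsoSeconds extracts wall-clock seconds from a TSO (the physical part is in
// milliseconds above the 18-bit logical counter).
func tsoSeconds(ts uint64) uint64 { return (ts >> 18) / 1000 }

// ttlExpired reports whether TTL forces a version to be collected: it must be
// below the GC safe point, and its key must fall in a TTL range whose
// lifetime has passed relative to PD's current time. Keys outside any TTL
// range are left to the normal MVCC GC rules.
func ttlExpired(key []byte, commitTS, safePoint, pdNowSecs, gcIntervalSecs uint64,
	ranges []rangeTTL) bool {
	if commitTS > safePoint {
		return false
	}
	for _, r := range ranges {
		if bytes.Compare(key, r.startKey) >= 0 && bytes.Compare(key, r.endKey) < 0 {
			ttl := r.ttlSecs
			if r.addGCInterval {
				ttl += gcIntervalSecs // record range outlives the index range
			}
			return pdNowSecs > tsoSeconds(commitTS)+ttl
		}
	}
	return false
}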
The TiDB DDL master maintains a new periodic task to roll over the partitions of a TTL table: it allocates a new partition as the current writing partition and truncates the oldest partition that has passed its lifetime. Truncating a partition is an `O(1)` operation independent of the number of records in the partition, so it can be done without blocking DDL for a noticeable amount of time and with a negligible amount of background work.
## Testing Plan
TBD
could you elaborate more about:
- compatibility tests
- compatibility with other features, like MVCC GC, partition table
- compatibility with other internal components, like Parser, DDL, Privilege, Statistics
- compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling
- upgrade compatibility
- downgrade compatibility
- functional tests
- to ensure the basic feature works as expected
- scenario tests
- to ensure this feature works as expected in some common scenarios
- benchmark tests
- to measure the timeliness of the TTL mechanism
- to measure the influence on the online workload when TTL is triggered
Sure, let me add more about these.
## Implementation
### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect data that has expired. To avoid inconsistency caused by differences in reclaiming progress between the index and the actual record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookups.
The lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookups.
My concern is that the index is still defined in the schema, so we may see inconsistent results between queries like:
SELECT * FROM some_ttl_table;
and
SELECT * FROM some_ttl_table USE INDEX(some_index);
Is that right?
I'm also concerned about this. This breaks the snapshot isolation of transactions.
Yes, we considered two other options during the Hackathon but didn't have time to finish them due to complexity. Let's see if either of these options could be used instead.
- Run filtering on the TiKV side for KV pairs within ranges that have an expiry time. Whether the KV pairs belong to the record or the index, they are all filtered out based on the same expiry time against the TSO (see the sketch after this list).
- Change TiDB to ignore data-not-found errors when reading data for handles of TTL tables, and remove the data range before the record range on TiKV.
These two options are not ideal either; maybe we can find a better way to solve the consistency issue.
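To illustrate the first option, here is a rough sketch of a read-time visibility check, again in Go only for readability; the real filter would live in TiKV's read path, and all names are made up.

package ttl

// tsoWallSeconds extracts wall-clock seconds from a TSO (physical part in
// milliseconds above the 18-bit logical counter).
func tsoWallSeconds(ts uint64) uint64 { return (ts >> 18) / 1000 }

// visibleUnderTTL decides at read time whether a version inside a TTL range
// is still visible to a read at readTS. Record and index keys of the same
// table share the same rule, so a table scan and an index lookup agree on
// which rows have expired, avoiding the inconsistency discussed above.
func visibleUnderTTL(commitTS, readTS, ttlSecs uint64) bool {
	return tsoWallSeconds(readTS) <= tsoWallSeconds(commitTS)+ttlSecs
}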
## Implementation
### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect data that has expired. To avoid inconsistency caused by differences in reclaiming progress between the index and the actual record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookups.
By associating the TTL configuration with the key range of a TTL table in PD
Could you please give more details of the data structure in PD, as well as the way it is distributed to TiKV?
And is it possible to store all configurations in TiKV without PD? I bring this up because the 'TTL configuration' is actually part of the table schema, and maintaining it in both PD and TiKV is not easy; IMHO we should try to avoid it.
What we used in the Hackathon project was pretty simple; we can start from this if we decide to use PD to distribute the configuration.
message RangeTTL {
bytes start_key = 1;
bytes end_key = 2;
uint64 TTL = 3;
bytes user_data = 4;
bool add_gc_interval = 5; // delay TTL by adding GC interval
}
service PD {
rpc AddRangeTTL(AddRangeTTLRequest) returns (AddRangeTTLResponse) {}
rpc DeleteRangeTTL(DeleteRangeTTLRequest) returns (DeleteRangeTTLResponse) {}
rpc GetRangeTTL(GetRangeTTLRequest) returns (GetRangeTTLResponse) {}
rpc GetAllRangeTTL(GetAllRangeTTLRequest) returns (GetAllRangeTTLResponse) {}
}
message AddRangeTTLRequest {
RequestHeader header = 1;
repeated RangeTTL TTL = 2;
}
message AddRangeTTLResponse {
ResponseHeader header = 1;
}
message DeleteRangeTTLRequest {
RequestHeader header = 1;
bytes start_key = 2;
bytes end_key = 3;
}
message DeleteRangeTTLResponse {
ResponseHeader header = 1;
}
message GetRangeTTLRequest {
RequestHeader header = 1;
bytes start_key = 2;
bytes end_key = 3;
}
message GetRangeTTLResponse {
ResponseHeader header = 1;
RangeTTL TTL = 2;
}
message GetAllRangeTTLRequest {
RequestHeader header = 1;
}
message GetAllRangeTTLResponse {
ResponseHeader header = 1;
repeated RangeTTL TTL = 2;
}
message GetGCSafePointResponse {
ResponseHeader header = 1;
uint64 safe_point = 2;
repeated RangeTTL range_TTL = 3;
uint64 now = 4;
}
The initial design was trying to keep the TTL concept neutral with respect to TiDB: it is simply a range of data whose lifecycle is defined by an expiry time. If we store the TTL information in the table schema, it becomes tightly coupled with TiDB and cannot be shared with users who use TiKV only. Even if we chose to store this information in TiKV, reading it out and getting notified of changes from other TiKV instances would be troublesome.
## Implementation
### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect data that has expired. To avoid inconsistency caused by differences in reclaiming progress between the index and the actual record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookups.
I'm not quite clear about the 'GC' here. For the expired data, is it possible to read it using `tidb_snapshot` (https://docs.pingcap.com/tidb/stable/read-historical-data)?
The `GC` process mentioned here is the MVCC GC process. In this case, once the data is considered expired and reclaimed, users cannot read it even with `tidb_snapshot`, because the data has been physically deleted by GC.
Another way to quickly remove expired data is to manage all data in a carefully designed partition table. Bucketing data according to its update time makes it possible to quickly delete all expired records within a partition with a simple `TRUNCATE`. Compared to the TTL table approach, these two options are error-prone and suboptimal.
## Compatibility and Migration Plan
Because TiDB itself prohibits converting an ordinary table to a partitioned table, or converting between different partition types, an existing table can only be converted to a TTL table with `ROW` granularity. A TTL table with `PARTITION` granularity is implemented as a special partitioned table and therefore conflicts with any other partition type. Newly created tables can choose either `ROW` or `PARTITION` as a trade-off between the accuracy of reclaiming time and the efficiency of garbage collecting expired data.
Could you show the DDL syntax to:
- alter a normal table to a TTL table
- alter a TTL table to a normal table
- alter the TTL of a TTL table
Besides, what will happen when the garbage collector is collecting the table during the DDL?
Hm, we didn't consider the case where people want to change a TTL table back to a normal table. What I have right now looks strange; let's try to find a better way to do so.
- Alter a normal table to a TTL table:
  ALTER TABLE tbl TTL='10m' TTL_GRANULARITY='ROW';
- Alter a TTL table to a normal table:
  ALTER TABLE ttl_tbl TTL='NONE';
- Alter the TTL of a TTL table:
  ALTER TABLE ttl_tbl TTL='10h';
## Implementation
### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect data that has expired. To avoid inconsistency caused by differences in reclaiming progress between the index and the actual record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookups.
I'm also concerned about this. This breaks the snapshot isolation of transactions.
There are application scenarios where data is only valuable for a certain period of time after ingestion and can be deleted permanently after expiration. Tracing, audit logs, or push notifications with expiration are examples of such applications. With the help of TTL tables, users are relieved from managing the data life cycle of such tables, making people more willing to use TiDB as general storage for these scenarios.
## Proposal
Introduce new `TTL` and `TTL_GRANULARITY` table options when creating or altering tables. By specifying `TTL`, users can expect expired data to be removed automatically. Additionally, users can choose either `ROW` or `PARTITION` for the `TTL_GRANULARITY` option to trade off collection granularity against the cost of running garbage collection. `ROW` mode evaluates the expiry time against each row and reclaims space on a per-row basis; this gives the finest granularity and the most accurate expiration timing. `PARTITION` mode, on the other hand, partitions data according to its last update time; a background timer rolls over partitions and truncates all expired data in the oldest partition at once.
How is the expiry time calculated? IMO, it's `current timestamp - update timestamp` rather than `current timestamp - insert timestamp`, right?
What if an update statement only updates the row record but not the index, or only updates some indexes but not others? Some of the indexes will be collected earlier, which breaks snapshot isolation.
Yes, you are right, we didn't consider the case where only the row data gets updated, and it breaks many things. To fix this, maybe we should update both the data and the indexes even in cases where it would not be necessary for a non-TTL table.
Code Coverage Details: https://codecov.io/github/pingcap/tidb/commit/7327272941f621309c3ae1ad46dd4bfaa33d95fb
The TTL feature was released with TiDB v6.5: https://docs.pingcap.com/tidb/stable/time-to-live#periodically-delete-expired-data-using-ttl-time-to-live
Signed-off-by: Xiaoguang Sun sunxiaoguang@zhihu.com
What problem does this PR solve?
Issue Number: close #22762
Problem Summary:
Add TTL Table support to automatically reclaim data with given retention policy and garbage collect granularity.
What is changed and how it works?
RFC document
What's Changed:
Design document
How it Works:
Documentation
Related changes
RFC document
Check List
Tests
Side effects
Release note