
docs: Initial version of TTL Table RFC #22763

Closed
wants to merge 6 commits into from

Conversation

sunxiaoguang
Contributor

Signed-off-by: Xiaoguang Sun sunxiaoguang@zhihu.com

What problem does this PR solve?

Issue Number: close #22762

Problem Summary:
Add TTL Table support to automatically reclaim data according to a given retention policy and garbage collection granularity.

What is changed and how it works?

RFC document

What's Changed:
Design document

How it Works:
Documentation

Related changes

RFC document

Check List

Tests

  • No code

Side effects

Release note

  • No release note

Signed-off-by: Xiaoguang Sun <sunxiaoguang@zhihu.com>
@andylokandy
Contributor

/cc @Connor1996

Member

@zz-jason left a comment

@morgo would you like to take a look?

Contributor

@morgo left a comment

A few questions from me. Thank you for working on this!

Comment on lines +13 to +14
## Proposal
Introduce new `TTL` and `TTL_GRANULARITY` table options for creating or altering tables. By specifying `TTL`, users can expect expired data to be removed automatically. Additionally, users can choose either `ROW` or `PARTITION` for the `TTL_GRANULARITY` option to trade off collection granularity against the cost of running garbage collection. `ROW` mode evaluates the expiry time of each row and reclaims space on a per-row basis; this gives the finest granularity and the most accurate expiration timing. `PARTITION` mode, on the other hand, partitions data according to its last update time; a background timer rolls over partitions and truncates all expired data in the oldest partition at once.
Contributor

Can you show a full SHOW CREATE TABLE example so it becomes clearer? I assume this requires a timestamp to be collected; will it appear in the SHOW CREATE TABLE output? It could be useful to see which rows are about to be purged.

Contributor Author

It could be something like:

CREATE TABLE ttl_table (
    id varchar(255),
    author varchar(255),
    content varchar(65535),
    PRIMARY KEY(id)
) TTL='10m' TTL_GRANULARITY='ROW';

CREATE TABLE ttl_table (
    id varchar(255),
    author varchar(255),
    content varchar(65535),
    PRIMARY KEY(id)
) TTL='10m' TTL_GRANULARITY='PARTITION';
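For reference, a hypothetical sketch (not part of the RFC) of how SHOW CREATE TABLE could render these options if they are stored as ordinary table options:

SHOW CREATE TABLE ttl_table;
-- Create Table: CREATE TABLE `ttl_table` (
--   `id` varchar(255) NOT NULL,
--   `author` varchar(255) DEFAULT NULL,
--   `content` varchar(65535) DEFAULT NULL,
--   PRIMARY KEY (`id`)
-- ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 TTL='10m' TTL_GRANULARITY='ROW'

Per the discussion below, expiry appears to be evaluated from the row's last update time during GC rather than from a user-visible column, so no extra timestamp column shows up in this sketch.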

Contributor

Oh I see! So it uses partitioning internally, but not the partition syntax. Is there any restriction, such as that TTL tables cannot be partitioned?

Contributor Author

Hm, as far as I know there is no restriction when the granularity is set to `PARTITION`, as long as the table itself is not already a partitioned table. If people want to use row granularity, it works for partitioned tables as well.
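For illustration, a hypothetical sketch (assumed syntax, not taken from the RFC) of `ROW` granularity on an already partitioned table:

CREATE TABLE ttl_on_partitioned (
    id bigint NOT NULL,
    created_at datetime NOT NULL,
    payload text,
    PRIMARY KEY (id, created_at)
) TTL='10m' TTL_GRANULARITY='ROW'
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022)
);
-- TTL_GRANULARITY='PARTITION' would be rejected here, since the table already
-- defines its own partitioning scheme.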

Contributor

Sorry, that is what I meant by my question. Please document that partitioned tables do not support TTL_GRANULARITY='PARTITION'.

Contributor Author

Oh, I see what you mean now. It is possible to achieve this by manually managing and truncating a partitioned table, but it's error-prone and tedious. I guess nobody would want to do that if it can be done automatically instead.

Member

For non-partitioned tables, do we support the `PARTITION` TTL_GRANULARITY? BTW, could you add these examples to the proposal and explain their meanings?

Member

It could be something like:

CREATE TABLE ttl_table (
    id varchar(255),
    author varchar(255),
    content varchar(65535),
    PRIMARY KEY(id)
) TTL='10m' TTL_GRANULARITY='ROW';

CREATE TABLE ttl_table (
    id varchar(255),
    author varchar(255),
    content varchar(65535),
    PRIMARY KEY(id)
) TTL='10m' TTL_GRANULARITY='PARTITION';
  1. Could you add the syntax to this doc?
  2. If we consider compatibility with MySQL (say, for example, the table needs to be replicated to MySQL), maybe it's necessary to support the TiDB-specific comment style so the statement won't be broken in MySQL (see the sketch below).
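For example, a hypothetical wrapping using TiDB's executable-comment syntax (the `ttl` feature ID here is made up for illustration; the RFC does not define one):

CREATE TABLE ttl_table (
    id varchar(255),
    author varchar(255),
    content varchar(65535),
    PRIMARY KEY(id)
) /*T![ttl] TTL='10m' TTL_GRANULARITY='ROW' */;
-- MySQL treats /*T![...] ... */ as an ordinary comment and ignores it,
-- so the statement can still be replayed on MySQL without error.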

Member

@bb7133 Mar 2, 2021

(Just a reminder) I noticed that MyRocks specifies the TTL in a different way, like:

CREATE TABLE t1 (
  a bigint(20) NOT NULL,
  b int NOT NULL,
  ts bigint(20) UNSIGNED NOT NULL,
  PRIMARY KEY (a),
  KEY kb (b)
) ENGINE=rocksdb
COMMENT='ttl_duration=1;ttl_col=ts;'

However, I believe this is because MyRocks is only one of the storage engines for MySQL, so 'TTL' cannot be specified as a table option there. I would say it is better to keep it as a table option in TiDB, as you are suggesting now.


## Open issues (if applicable)
TBD
Contributor

What is the expected behavior when reading data that has expired, but garbage collection has not run yet? I assume there are no guarantees.

I assume that touching the row (UPDATE / INSERT ON DUPLICATE KEY UPDATE) will refresh the TTL?

Contributor Author

Yeah, there is no strict guarantee in this proposal, but it could be done by filtering data during reads at some cost if that is important. And yes, any update refreshes the expiry time.
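For illustration, a sketch of what such a refresh would look like, reusing the ttl_table example above and assuming last-update-based expiry as discussed:

-- Touching an existing row resets its expiry clock
INSERT INTO ttl_table (id, author, content)
VALUES ('42', 'alice', 'updated content')
ON DUPLICATE KEY UPDATE content = VALUES(content);
-- The row now expires a full TTL ('10m') after this write, not after its original insert.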

Contributor

I think this behavior is fine provided the semantics are stated.

Comment on lines +28 to +29
### `PARTITION`
The TiDB DDL master maintains a new periodic task that rolls over the partitions of a TTL table by allocating a new partition as the current writing partition and truncating the oldest partition that has already passed its lifetime. Truncating a partition is an `O(1)` operation independent of the actual number of records in that partition, so it can be done without blocking DDL for a noticeable amount of time and with a negligible amount of background work.
Contributor

How is it validated that a partition will not contain rows that are not yet eligible for purge? If it is by requiring the TTL column (assuming a column is used) to be included in the partition expression, then I think TTL_GRANULARITY is redundant information: TiDB should be able to determine this itself and save users from having to specify it.

Contributor Author

Actually, the user doesn't get a chance to choose which partition to use; by keeping some metadata in the schema, we can be sure that the oldest partition only contains expired data.

Contributor Author

@sunxiaoguang Feb 9, 2021

It was implemented as a hackathon project called T4. Unfortunately there is no English version of the slides at this time, but you can get some idea from the code itself.

Contributor Author

Here is the slides download link.
There are lots of diagrams; maybe you can still get some ideas without understanding the Chinese text.

sunxiaoguang and others added 4 commits February 11, 2021 08:00
Signed-off-by: Xiaoguang Sun <sunxiaoguang@zhihu.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@zhihu.com>
@ti-chi-bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Feb 22, 2021
Comment on lines +13 to +14
## Proposal
Introduce new `TTL` and `TTL_GRANULARITY` table options for creating or altering tables. By specifying `TTL`, users can expect expired data to be removed automatically. Additionally, users can choose either `ROW` or `PARTITION` for the `TTL_GRANULARITY` option to trade off collection granularity against the cost of running garbage collection. `ROW` mode evaluates the expiry time of each row and reclaims space on a per-row basis; this gives the finest granularity and the most accurate expiration timing. `PARTITION` mode, on the other hand, partitions data according to its last update time; a background timer rolls over partitions and truncates all expired data in the oldest partition at once.
Member

For non-partitioned tables, do we support the `PARTITION` TTL_GRANULARITY? BTW, could you add these examples to the proposal and explain their meanings?

## Implementation

### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect expired data. To avoid inconsistency caused by differences in reclaiming progress between index and record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookup.
Member

could you elaborate more on:

  1. How to store TTL definitions in the table schema, and what needs to be changed.
  2. How TiDB communicates with PD, and the API/protobuf design.
  3. How PD communicates with TiKV, and the API/protobuf design.
  4. How to ensure stale rows are purged in TiKV: the internal design of TiKV, or the API to use in TiKV.

Contributor Author

Sure, let me add more about these.

Contributor Author

@sunxiaoguang Mar 7, 2021

Changes to model.TableInfo

@@ -298,6 +298,11 @@ type TableInfo struct {

        // TiFlashReplica means the TiFlash replica info.
        TiFlashReplica *TiFlashReplicaInfo `json:"tiflash_replica"`
+
+       // TTL
+       TTL                 time.Duration
+       TTLByRow            bool
+       NextTTLTruncateTime time.Time
 }

This is what we had in the hackathon to make it work quickly. We should definitely make it better.

Contributor Author

Changes to pdpb.proto to specify TTL configuration for ranges

message RangeTTL {
    bytes start_key = 1;
    bytes end_key = 2;
    uint64 TTL = 3;
    bytes user_data = 4;
    bool add_gc_interval = 5; // delay TTL by adding GC interval
}
service PD {
    rpc AddRangeTTL(AddRangeTTLRequest) returns (AddRangeTTLResponse) {}
    rpc DeleteRangeTTL(DeleteRangeTTLRequest) returns (DeleteRangeTTLResponse) {}
    rpc GetRangeTTL(GetRangeTTLRequest) returns (GetRangeTTLResponse) {}
    rpc GetAllRangeTTL(GetAllRangeTTLRequest) returns (GetAllRangeTTLResponse) {}
}
message AddRangeTTLRequest {
    RequestHeader header = 1;
    repeated RangeTTL TTL = 2;
}
message AddRangeTTLResponse {
    ResponseHeader header = 1;
}
message DeleteRangeTTLRequest {
    RequestHeader header = 1;
    bytes start_key = 2;
    bytes end_key = 3;
}
message DeleteRangeTTLResponse {
    ResponseHeader header = 1;
}
message GetRangeTTLRequest {
    RequestHeader header = 1;
    bytes start_key = 2;
    bytes end_key = 3;
}
message GetRangeTTLResponse {
    ResponseHeader header = 1;
    RangeTTL TTL = 2;
}
message GetAllRangeTTLRequest {
    RequestHeader header = 1;
}
message GetAllRangeTTLResponse {
    ResponseHeader header = 1;
    repeated RangeTTL TTL = 2;
}
message GetGCSafePointResponse {
    ResponseHeader header = 1;
    uint64 safe_point = 2;
    repeated RangeTTL range_TTL = 3;
    uint64 now = 4;
}

This is what we had in the hackathon to make it work quickly. We should definitely make it better.

Contributor Author

@sunxiaoguang Mar 7, 2021

There are two new fields in GetGCSafePointResponse, which TiKV calls to get the safe_point for every run of GC. These two new fields tell TiKV the current time on PD and the ranges that have TTL enabled. During the GC process, if a KV pair is older than the safe_point and has expired, it is collected unconditionally to reclaim the space.


The TiDB DDL master maintains a new periodic task that rolls over the partitions of a TTL table by allocating a new partition as the current writing partition and truncating the oldest partition that has already passed its lifetime. Truncating a partition is an `O(1)` operation independent of the actual number of records in that partition, so it can be done without blocking DDL for a noticeable amount of time and with a negligible amount of background work.

## Testing Plan
TBD
Member

could you elaborate more on:

  1. compatibility tests
    • compatibility with other features, like MVCC GC, partition table
    • compatibility with other internal components, like Parser, DDL, Privilege, Statistics
    • compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling
    • upgrade compatibility
    • downgrade compatibility
  2. functional tests
    • to ensure the basic feature functions work as expected
  3. scenario tests
    • to ensure this feature works as expected in some common scenarios
  4. benchmark tests
    • to measure the timeliness of the TTL mechanism
    • to measure the influence on the online workload when TTL is triggered

Contributor Author

Sure, let me add more about these.

## Implementation

### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect expired data. To avoid inconsistency caused by differences in reclaiming progress between index and record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookup.
Member

The lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookup.

My concern is that the index is still defined in the schema, so we may see inconsistent results between queries like:

SELECT * FROM some_ttl_table;

and

SELECT * FROM some_ttl_table USE INDEX(some_index)

Is that right?

Contributor

I'm also concerned about this. This breaks the snapshot isolation of transactions.

Contributor Author

Yes, we considered two other options during the hackathon but didn't have time to finish them due to the complexity. Let's see if either of these could be used instead.

  1. Run filtering on the TiKV side for KV pairs within ranges that have an expiry time. Whether the KV pairs belong to records or indexes, they are all filtered out based on the same expiry time evaluated against TSO.
  2. Change TiDB to ignore the data-not-found error when reading data for handles of TTL tables, and remove the data range before the record range on TiKV.

These two options are not ideal either; maybe we can find a better way to solve the consistency issue.

## Implementation

### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect expired data. To avoid inconsistency caused by differences in reclaiming progress between index and record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookup.
Member

By associating the TTL configuration with the key range of a TTL table in PD

Could you please give more details of the data structure in PD, as well as the way it is distributed to TiKV?

And is it possible to store all configurations in TiKV without PD? I bring this up because the 'TTL configuration' is actually part of the table schema; maintaining it in both PD and TiKV is not easy, and IMHO we should try to avoid that.

Contributor Author

What we used in the hackathon project was pretty simple; we can start from this if we decide to use PD to distribute the configuration.

message RangeTTL {
    bytes start_key = 1;
    bytes end_key = 2;
    uint64 TTL = 3;
    bytes user_data = 4;
    bool add_gc_interval = 5; // delay TTL by adding GC interval
}
service PD {
    rpc AddRangeTTL(AddRangeTTLRequest) returns (AddRangeTTLResponse) {}
    rpc DeleteRangeTTL(DeleteRangeTTLRequest) returns (DeleteRangeTTLResponse) {}
    rpc GetRangeTTL(GetRangeTTLRequest) returns (GetRangeTTLResponse) {}
    rpc GetAllRangeTTL(GetAllRangeTTLRequest) returns (GetAllRangeTTLResponse) {}
}
message AddRangeTTLRequest {
    RequestHeader header = 1;
    repeated RangeTTL TTL = 2;
}
message AddRangeTTLResponse {
    ResponseHeader header = 1;
}
message DeleteRangeTTLRequest {
    RequestHeader header = 1;
    bytes start_key = 2;
    bytes end_key = 3;
}
message DeleteRangeTTLResponse {
    ResponseHeader header = 1;
}
message GetRangeTTLRequest {
    RequestHeader header = 1;
    bytes start_key = 2;
    bytes end_key = 3;
}
message GetRangeTTLResponse {
    ResponseHeader header = 1;
    RangeTTL TTL = 2;
}
message GetAllRangeTTLRequest {
    RequestHeader header = 1;
}
message GetAllRangeTTLResponse {
    ResponseHeader header = 1;
    repeated RangeTTL TTL = 2;
}
message GetGCSafePointResponse {
    ResponseHeader header = 1;
    uint64 safe_point = 2;
    repeated RangeTTL range_TTL = 3;
    uint64 now = 4;
}

Contributor Author

The initial design tried to keep the TTL concept neutral with respect to TiDB: it is simply a range of data whose lifecycle is defined by an expiry time. If we store the TTL information in the table schema, it becomes tightly coupled with TiDB and cannot be shared with users who use TiKV only. Even if we choose to store this information in TiKV, reading it out and getting notified of changes from other TiKV instances is troublesome.

## Implementation

### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect expired data. To avoid inconsistency caused by differences in reclaiming progress between index and record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookup.
Member

I'm not quite clear about the 'GC' here: for expired data, is it possible to read it using tidb_snapshot (https://docs.pingcap.com/tidb/stable/read-historical-data)?

Contributor Author

The GC process mentioned here is the MVCC GC process. In this case, once the data is considered expired and has been reclaimed, users cannot read it even with tidb_snapshot, because the data has been physically deleted by GC.
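For illustration, a sketch using the documented tidb_snapshot session variable (the table and timestamp are hypothetical):

-- Attempt a historical read at a time before the row expired
SET @@tidb_snapshot = '2021-03-01 10:00:00';
SELECT * FROM ttl_table WHERE id = '42';
-- Once GC has physically reclaimed the expired versions, this returns nothing,
-- even though the snapshot predates the expiry.
SET @@tidb_snapshot = '';  -- back to reading the latest data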

Another way to quickly remove expired data is to manage all data in a carefully designed partitioned table. Bucketing data according to its update time makes it possible to quickly delete all expired records within a partition with a simple `TRUNCATE`. Compared to the TTL Table approach, these two options are error-prone and suboptimal.
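For illustration, a sketch of that manual pattern (the table name and schema are hypothetical):

-- Bucket rows by day and reclaim whole buckets
CREATE TABLE event_log (
    id bigint NOT NULL,
    created_at datetime NOT NULL,
    payload text,
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p20210301 VALUES LESS THAN (TO_DAYS('2021-03-02')),
    PARTITION p20210302 VALUES LESS THAN (TO_DAYS('2021-03-03')),
    PARTITION p20210303 VALUES LESS THAN (TO_DAYS('2021-03-04'))
);

-- A scheduled job then has to rotate partitions by hand:
ALTER TABLE event_log ADD PARTITION (PARTITION p20210304 VALUES LESS THAN (TO_DAYS('2021-03-05')));
ALTER TABLE event_log TRUNCATE PARTITION p20210301;  -- discard the oldest day's data at once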

## Compatibility and Migration Plan
Because TiDB itself prohibits converting an ordinary table into a partitioned table, or converting between different partition types, an existing table can only be converted into a TTL table with `ROW` granularity. A TTL table with `PARTITION` granularity is implemented as a special partitioned table and therefore conflicts with any other partition type. Newly created tables can choose either `ROW` or `PARTITION` as a trade-off between the accuracy of reclaiming time and the efficiency of garbage collecting expired data.
Contributor

Could you show the DDL syntax to:

  • alter a normal table to a TTL table
  • alter a TTL table to a normal table
  • alter the TTL of a TTL table

Besides, what will happen when the garbage collector is collecting the table during the DDL?

Contributor Author

@sunxiaoguang Mar 7, 2021

Hm, we didn't consider the case where people want to change a TTL table back to a normal table. What I have right now looks strange; let's try to find a better way to do it.

  • Alter a normal table to a TTL table
    ALTER TABLE tbl TTL='10m' TTL_GRANULARITY='ROW';

  • Alter a TTL table to a normal table
    ALTER TABLE ttl_tbl TTL='NONE';

  • Alter the TTL of a TTL table
    ALTER TABLE ttl_tbl TTL='10h';

## Implementation

### `ROW`
By associating the TTL configuration with the key range of a TTL table in PD and distributing that configuration to all TiKV instances in the cluster, TiKV can use the TTL settings during the GC process to collect expired data. To avoid inconsistency caused by differences in reclaiming progress between index and record data, the lifetime of the record range is one GC interval longer than that of the index range, so TiDB will not see missing record data during table lookup.
Contributor

I'm also concerned about this. This breaks the snapshot isolation of transactions.

There are application scenarios in which data is only valuable for a certain period of time after ingestion and can be deleted permanently after expiration. Tracing, audit logs, and push notifications with expiration are examples of such applications. With the help of TTL tables, users are relieved from managing the data life cycle of such tables themselves, which makes TiDB more attractive as general storage for these scenarios.

## Proposal
Introduce new `TTL` and `TTL_GRANULARITY` table options for creating or altering tables. By specifying `TTL`, users can expect expired data to be removed automatically. Additionally, users can choose either `ROW` or `PARTITION` for the `TTL_GRANULARITY` option to trade off collection granularity against the cost of running garbage collection. `ROW` mode evaluates the expiry time of each row and reclaims space on a per-row basis; this gives the finest granularity and the most accurate expiration timing. `PARTITION` mode, on the other hand, partitions data according to its last update time; a background timer rolls over partitions and truncates all expired data in the oldest partition at once.
Contributor

How is the expiry time calculated? IMO, it's current timestamp - update timestamp rather than current timestamp - insert timestamp, right?
What if an update statement only updates the row record but not the index, or only updates some indexes but not others? Some of the indexes would be collected earlier, which breaks snapshot isolation.

Contributor Author

Yes, you are right, we didn't consider the case where only the record data gets updated, and that breaks many things. To fix this, maybe we should update both the record and the indexes, even in cases where a non-TTL table would not need to.

@sre-bot
Contributor

sre-bot commented Jun 11, 2022

@lcwangchao mentioned this pull request Dec 1, 2022
@SunRunAway (Contributor) closed this Jan 11, 2023
Labels
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.