Update RFC of API v2 according to latest codes (#79)
* RFC: RawKV Batch Export (#76)

Signed-off-by: pingyu <yuping@pingcap.com>

* rawkv bulk load: add description for pause merge (#74)

* rawkv bulk load: add description for pause merge

Signed-off-by: Peng Guanwen <pg999w@outlook.com>

* Update text/0072-online-bulk-load-for-rawkv.md

Co-authored-by: Liangliang Gu <marsishandsome@gmail.com>
Signed-off-by: Peng Guanwen <pg999w@outlook.com>

* Add future improvements

Signed-off-by: Peng Guanwen <pg999w@outlook.com>

Co-authored-by: Liangliang Gu <marsishandsome@gmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* ref pd#4112: implementation detail of PD

Signed-off-by: pingyu <yuping@pingcap.com>

* ref pd#4112: implementation detail of PD

Signed-off-by: pingyu <yuping@pingcap.com>

* remove raw cf

Signed-off-by: Andy Lok <andylokandy@hotmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* update

Signed-off-by: Andy Lok <andylokandy@hotmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* update pd design

Signed-off-by: andylokandy <andylokandy@hotmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* revert to keyspace_next_id

Signed-off-by: andylokandy <andylokandy@hotmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* RFC: Improve the Scalability of TSO Service (#78)

Signed-off-by: pingyu <yuping@pingcap.com>

* make region size dynamic (#82)

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* update pd url

Signed-off-by: andylokandy <andylokandy@hotmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* address comment

Signed-off-by: andylokandy <andylokandy@hotmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* resolve pd flashback problem

Signed-off-by: andylokandy <andylokandy@hotmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* update rfcs

Signed-off-by: Andy Lok <andylokandy@hotmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* RFC: In-memory Pessimistic Locks (#77)

* RFC: In-memory Pessimistic Locks

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

* clarify where to delete memory locks after writing a lock CF KV

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

* Elaborate transfer leader handlings and add correctness section

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

* add an addition step of proposing pessimistic locks before transferring leader

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

* clarify about new leaders of region split

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

* Add tracking issue link

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

* update design and correctness analysis of lock migration

Signed-off-by: Yilin Chen <sticnarf@gmail.com>

* add configurations

Signed-off-by: Yilin Chen <sticnarf@gmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* propose online unsafe recovery (#91)

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* physical isolation between region (#93)

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* wip

Signed-off-by: pingyu <yuping@pingcap.com>

* update

Signed-off-by: pingyu <yuping@pingcap.com>

* update

Signed-off-by: pingyu <yuping@pingcap.com>

* Apply suggestions from code review

Co-authored-by: Xiaoguang Sun <sunxiaoguang@users.noreply.github.com>
Signed-off-by: pingyu <yuping@pingcap.com>

* fix case

Signed-off-by: pingyu <yuping@pingcap.com>

Signed-off-by: pingyu <yuping@pingcap.com>
Signed-off-by: Andy Lok <andylokandy@hotmail.com>
Signed-off-by: andylokandy <andylokandy@hotmail.com>
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
Signed-off-by: Yilin Chen <sticnarf@gmail.com>
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Co-authored-by: Liangliang Gu <marsishandsome@gmail.com>
Co-authored-by: Peng Guanwen <pg999w@outlook.com>
Co-authored-by: Andy Lok <andylokandy@hotmail.com>
Co-authored-by: JmPotato <ghzpotato@gmail.com>
Co-authored-by: Jay <BusyJay@users.noreply.github.com>
Co-authored-by: Yilin Chen <sticnarf@gmail.com>
Co-authored-by: Connor <zbk602423539@gmail.com>
Co-authored-by: Xiaoguang Sun <sunxiaoguang@users.noreply.github.com>
9 people authored Nov 8, 2022
1 parent fe50384 commit b1a22d7
Showing 1 changed file with 131 additions and 138 deletions.
269 changes: 131 additions & 138 deletions text/0069-api-v2.md
@@ -2,216 +2,200 @@

## Motivation

`API V2` is a set of breaking changes that aim to solve several issues with current RawKV (hereafter referred to as `API V1`):
`API V2` is a set of breaking changes that aims to solve several issues with current RawKV (hereafter referred to as `API V1`):

1. RawKV is not safe to use along with TxnKV. By solving this, TiDB will be able to support RawKV as the table's storage engine, which will enrich TiDB's use cases.
2. RawKV TTL is controlled by Store configuration. Switching the configuration will cause data corruption in silence.
3. RawKV TTL is encoded into the value by appending 8-bytes UNIX timestamp to the end of the value, therefore it's hard to introduce other encode afterward.
4. It could be nice if we can deploy multiple applications on one TiKV cluster.
1. RawKV is not safe to use along with TxnKV & TiDB. If this is solved, the three modes can be used in the same cluster, reducing resource and maintenance costs. It would even make it possible for TiDB to support RawKV as a table's storage engine, enriching TiDB's use cases.
2. RawKV TTL is controlled by TiKV configuration. Switching the configuration silently corrupts data.
3. RawKV keys and values are just raw bytes, so it is hard to add metadata later to support more features, such as a keyspace for multi-tenancy or a timestamp for [Change Data Capture].

## Detailed Design

### New key-value codec

This RFC introduces a new key encode to RawKV and TxnKV, and a new value encode to RawKV, which will allow the RawKV to be used along with TxnKV and also allow TiKV to flexibly add meta, e.g. the TTL, to a RawKV value.
This RFC introduces a new key encoding for RawKV and TxnKV, and a new value encoding for RawKV, which allow RawKV to be used alongside TxnKV and make it easy to add more fields to the value.

In addition, keys will be contained in keyspaces, where the keys in different keyspaces are totally independent. If keyspace is not specified, the keyspace 'default' will be used.
Since API V2 changes the storage encoding, it is not compatible to switch between `API V1` and `API V2` while there is non-TiDB data in TiKV. TiDB data is treated specially so that it is not affected by this change.

The `API V2` is enabled by a switch on PD. Since it changes the storage encode, it will be not compatible to switch between `API V1` and `API V2` while there are non-TiDB data in TiKV. TiDB data is specially treated in order to not be affected by the change.
#### Key Encoding

#### Key Encode
Once `API V2` is enabled, keys will start with either:

Once the `API V2` is enabled, the key will be starting with either:
1. `m` or `t`: TiDB keys.
2. `x`: TxnKV keys.
3. `r`: RawKV keys.

1. `m` or `t`: TxnKV key. Used by TiDB.
2. `k{keyspace prefix id}x`: TxnKV key.
3. `k{keyspace prefix id}r`: RawKV key.
`x` and `r` are mode prefixes that indicate which mode a key belongs to. The mode prefix is followed by 3 bytes for the keyspace. So in API V2, RawKV & TxnKV keys are encoded as `MCE(mode-prefix + keyspace + user-key) + timestamp`. `MCE` is an abbreviation of [Memory Comparable Encoding], which is necessary to keep the encoded keys in the same order as the user keys.

The `{keyspace prefix id}` is the [keyspace](https://github.com/tikv/rfcs/pull/39) prefix for separating keys of different keyspaces. It should be a variable-length integer in which the highest bit of every byte denotes whether the next byte is still part of the integer. The client will fetch the prefix from PD by the keyspace name specified by the user when initializing the client, which means that the keyspace prefix is valid for the whole session; in other words, changes to a keyspace on PD will not take effect on running sessions.
`Keyspace` is a fixed-length, 3-byte field in network byte order; it will be introduced in another RFC.

Note that in TxnKV, the key will be encoded by `Memory Comparable Encoding`. But since the `Memory Comparable Encoding` will not change the starting bytes but only add paddings, there won't be any overlap between RawKV and TxnKV.
`Timestamp` in RawKV entries is necessary to implement the [Change Data Capture] feature, since it indicates what data changed and when.
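
The key layout `MCE(mode-prefix + keyspace + user-key) + timestamp` can be sketched as follows. This is only an illustration, not TiKV's implementation: the function names are hypothetical, the memcomparable encoding follows the 8-byte-group scheme described in the linked MyRocks document, and storing the timestamp as a bitwise-inverted big-endian `u64` (so that newer versions sort first) is an assumption borrowed from how TiKV encodes MVCC keys.

```python
import struct
from typing import Optional

ENC_GROUP_SIZE = 8   # bytes of payload per encoded group
ENC_MARKER = 0xFF    # marker byte of a full group


def encode_bytes_memcomparable(data: bytes) -> bytes:
    """Encode bytes so that byte-wise order of the output matches the input order.

    Every group holds 8 payload bytes (zero padded) plus one marker byte equal
    to 0xFF minus the number of padding bytes. A final, possibly empty, group
    always terminates the encoding.
    """
    out = bytearray()
    for i in range(0, len(data) + 1, ENC_GROUP_SIZE):
        group = data[i:i + ENC_GROUP_SIZE]
        pad = ENC_GROUP_SIZE - len(group)
        out += group + b"\x00" * pad
        out.append(ENC_MARKER - pad)
    return bytes(out)


def encode_v2_key(mode: bytes, keyspace_id: int, user_key: bytes,
                  ts: Optional[int] = None) -> bytes:
    """Build an API V2 key: MCE(mode-prefix + keyspace + user-key) + timestamp."""
    if mode not in (b"r", b"x"):
        raise ValueError("mode prefix must be b'r' (RawKV) or b'x' (TxnKV)")
    if not 0 <= keyspace_id < (1 << 24):
        raise ValueError("keyspace must fit in 3 bytes")
    # The keyspace is 3 bytes in network (big-endian) byte order.
    prefix = mode + keyspace_id.to_bytes(3, "big")
    encoded = encode_bytes_memcomparable(prefix + user_key)
    if ts is None:
        return encoded
    # Assumption: the timestamp is appended in descending order (bitwise NOT).
    return encoded + struct.pack(">Q", ~ts & 0xFFFFFFFFFFFFFFFF)
```

Because the mode prefix is the first byte of the encoded key, RawKV (`r`) and TxnKV (`x`) keys occupy disjoint ranges, and neither overlaps the TiDB ranges starting with `m` or `t`.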

#### RawKV Value Encode
##### Timestamp requirement

If the key has RawKV prefix, which is `k{keyspace id prefix}r`, then the value can be either:
Among requests for a single key, timestamps must increase monotonically in the order in which the data is flushed to disk in TiKV.

1. `{0x0}{data}`
2. `{0x1}{TTL expire timestamp}{data}`
In general, if request `A` [Happened Before] `B`, then `Timestamp(A)` < `Timestamp(B)`. For RawKV, we provide [Causal Consistency] by keeping the timestamp order the same as the order in which data is flushed to disk.

### Keyspace Management
At the same time, as RawKV doesn't provide cross-row transactions or snapshot isolation, we allow concurrent updates to different keys to improve efficiency. This means that the timestamp order of two different keys may not be consistent with the order of their data flushes.

Add a new HTTP interface to PD for adding, renaming, deleting and querying the mapping from keyspace name to prefix id:
##### Timestamp Generation

```javascript
// list all keyspaces
GET /keyspaces
[
{
name: "default",
id: "0",
properties: {
"description": "this is default keyspace",
"default-config": {
"raw-client": {
"ttl-secs": 30000000
},
"txn-client": {
"enable-async-commit": true
}
}
}
},
{
name: "redis",
id: "1"
}
]
Timestamps are generated by PD (i.e. TSO), the same as for TiDB & TxnKV. Unlike them, however, the TSO is acquired by TiKV internally, to get better overall performance and client compatibility.

// add new keyspace
POST /keyspaces
{
name: "foo",
}
To reduce latency and improve availability, TiKV prefetches and caches a batch of TSO timestamps locally. Users can specify how long the TSO cache must be able to tolerate a PD fault, and TiKV then calculates the batch size according to recent QPS.

// delete a keyspace
DELETE /keyspaces/{keyspace_name}
{}
Note that the TSO cache brings another issue. If subsequent writes to a single key happen on another store of the TiKV cluster (because of a leader transfer), the TSO cache of that store must be renewed to ensure that its timestamps are larger than those of the original store. TiKV observes `leader transfer` events and then flushes the cache. Another event that should be observed is `region merge`, as the leader of the merged region may be on a different store from the leader of the region being merged from.

// recover the latest deleted keyspace
POST /keyspaces?action=flashback
{
new_name: "bar",
}
```
In the current implementation, the flush of the TSO cache is asynchronous to avoid blocking leader transfer and region merge. Clients will get a `MaxTimestampNotSynced` error until the flush is done.

The keyspaces are stored in etcd and has the no limitation on the id number.
*An alternative for timestamp generation is `HLC` ([Hybrid Logical Clock]). The advantage of `HLC` is that it is independent of the availability of PD; the disadvantages are that it depends on the local clock and [NTP], and that it is not easy to get right (refer to [strict monotonicity](https://github.com/cockroachdb/cockroach/blob/13c5a25238ce75cfb7ff151d620e82aa44c72e27/pkg/util/hlc/doc.go#L150) in CockroachDB). All things considered, as PD is designed to be highly available, and a PD fault affects not only TSO but also other critical components (e.g., region metadata), we prefer to use TSO as the timestamp.*
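
The TSO caching described in this subsection can be illustrated with a toy model. Everything here is hypothetical (`FakePD`, `TsoCache`, and the plain integer timestamps are stand-ins, not PD or TiKV APIs), and the batch size formula `ceil(recent QPS × tolerated seconds)` is an assumed reading of "calculate the size of batch according to recent QPS". The sketch only shows why a store must flush its cache after taking over a leader.

```python
import math


class FakePD:
    """Stand-in for PD's TSO allocator: hands out strictly increasing ranges."""

    def __init__(self) -> None:
        self._next = 1

    def alloc(self, count: int):
        first = self._next
        self._next += count
        return first, self._next  # half-open range [first, end)


class TsoCache:
    """Per-store cache of prefetched timestamps (hypothetical sketch)."""

    def __init__(self, pd: FakePD, tolerate_secs: float) -> None:
        self.pd = pd
        self.tolerate_secs = tolerate_secs  # how long a PD fault must be tolerated
        self._cur = 0
        self._end = 0

    def batch_size(self, recent_qps: float) -> int:
        # Assumed formula: enough timestamps to serve recent load for the
        # configured tolerance window, and at least one.
        return max(1, math.ceil(recent_qps * self.tolerate_secs))

    def renew(self, recent_qps: float) -> None:
        self._cur, self._end = self.pd.alloc(self.batch_size(recent_qps))

    def flush(self, recent_qps: float) -> None:
        # Discard whatever is cached and fetch a fresh batch. A store would do
        # this on leader transfer or region merge.
        self.renew(recent_qps)

    def next_ts(self, recent_qps: float) -> int:
        if self._cur >= self._end:
            self.renew(recent_qps)
        ts = self._cur
        self._cur += 1
        return ts
```

In this model, if store B prefetched its batch before store A did, every timestamp still cached on B is smaller than the ones A has already used. Calling `flush` on B after it becomes leader forces a new batch from PD, which is larger than anything A could have handed out.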

1. Adding keyspace: newly added keyspaces can only be viable to new clients.
2. Deleting keyspace:
2.1. Deleting keyspace only marks the metadata to inviable in PD, the data in the keyspace and metadata in PD will not be deleted automatically. Garbage collecting the data in deleted keyspaces may be introduced in the future since it increases the complexity of this RFC. To ensure no data is left, the user should clean up the keyspace before deleting the keyspace at present.
2.2. Every client syncs the keyspace information with the PD leader every 5 minutes. When a keyspace is deleted, the client should be aware of the deletion in 5 minutes. The keyspace can not be accessed by any living client after 5 minutes.
3. Flashbacking keyspace:
3.1. Flashbacking keyspace only affects the metadata, turning the keyspace name and id mapping to viable by clients.
3.2. If there are multiple deleted keyspaces with the same name, only the last deleted keyspace is flashbacked. Users can flashback all these deleted keyspaces by calling the flashback API multiple times with different new keyspace names.
#### RawKV Value Encoding

#### pd-ctl
A RawKV value in `API V2` currently takes one of these forms:

To enable API V2, which also enables the keyspace API:
1. `{data}{0x0}`, for values without TTL
2. `{data}{TTL expire timestamp}{0x1}`, for values with TTL
3. `{0x2}`, for deleted values

```bash
>> config set api-version 2
```
The last byte of the value is used as meta flags. Bit `0` of the flags is for TTL: if it is set, the `8` bytes just before the meta flags are the TTL expire timestamp. Bit `1` is the deleted mark: if it is set, the entry is logically deleted (used for [Change Data Capture]).

To manage keyspace:
Extra fields added in the future can use the other bits of the meta flags, and will be inserted between the user value & meta flags in reverse order. The most significant bit of the meta flags is reserved for extended meta flags, in case there are more than 7 fields.
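
A minimal sketch of this value layout is shown below. The helper names are hypothetical, and encoding the TTL expire timestamp as a big-endian `u64` is an assumption (the text above only fixes its width at 8 bytes). Bit `0` marks TTL and bit `1` marks deletion, as described above.

```python
import struct
from typing import Optional, Tuple

FLAG_TTL = 0b0000_0001      # bit 0: an 8-byte expire timestamp precedes the flags
FLAG_DELETED = 0b0000_0010  # bit 1: the entry is logically deleted


def encode_v2_value(data: bytes, expire_ts: Optional[int] = None,
                    deleted: bool = False) -> bytes:
    """Lay out a value as {data}{optional TTL expire timestamp}{meta flags}."""
    flags = 0
    body = bytearray(data)
    if expire_ts is not None:
        flags |= FLAG_TTL
        body += struct.pack(">Q", expire_ts)  # assumed big-endian u64
    if deleted:
        flags |= FLAG_DELETED
    body.append(flags)
    return bytes(body)


def decode_v2_value(raw: bytes) -> Tuple[bytes, Optional[int], bool]:
    """Return (user data, expire timestamp or None, deleted flag)."""
    if not raw:
        raise ValueError("an API V2 value has at least the meta flags byte")
    flags = raw[-1]
    rest = raw[:-1]
    expire_ts = None
    if flags & FLAG_TTL:
        if len(rest) < 8:
            raise ValueError("TTL flag set but expire timestamp is missing")
        expire_ts = struct.unpack(">Q", rest[-8:])[0]
        rest = rest[:-8]
    return rest, expire_ts, bool(flags & FLAG_DELETED)
```

Reading the flags from the end means the decoder never has to scan the user data, and a future field can be placed in front of the TTL field without moving anything that existing decoders rely on.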

```bash
>> config keyspaces show
>> config keyspaces create <keyspace name>
>> config keyspaces delete <keyspace name>
`{user value}{field of bit n}...{extended meta flags}...{field of bit 2}{field of bit 0 (TTL)}{meta flags}`

# example: config keyspaces set-property foo default-config.raw-client.ttl-secs 100000
>> config keyspaces set-property <keyspace name> <property-path> <property-value>
### How to safely enable API V2

>> config keyspaces delete-property <keyspace name> <property-path>
>> config keyspaces flashback <new keyspace name>
```
#### Upgrade from `API V1` to `API V2`

#### Keyspace metadata
1. Upgrade TiKV, TiDB, and PD to the version that supports `API V2`.
2. Ensure that all the keys in TiKV are written by TiDB, which are prefixed with `m` or `t`. Any other data should be migrated out; otherwise step 3 will fail.
3. Enable `API V2` in the TiKV config file and restart TiKV (users are also responsible for enabling `API V2` on all TiKV clients excluding TiDB).

```json
{
name: string,
id: int64,
created_at: timestamp,
deleted_at: timestamp, // if set, the keyspace is not visible to users
flashbacked_at: timestamp,
properties: object
}
```
#### Downgrade from `API V2` to `API V1`

### How to safely enable API V2
1. Ensure that all the keys in TiKV are written by TiDB, which are prefixed with `m` or `t`. Any other data should be migrated out.
2. Disable `API V2` in the TiKV config file and restart TiKV (users are also responsible for enabling `API V1` on all TiKV clients excluding TiDB).

#### Upgrade
#### Data migration

Upgrade from `API V1` to `API V2` is a simple process:
A backup and restore tool will be provided to export data from a TiKV cluster running `API V1` and convert it to the `API V2` encoding. The converted backup can then be imported into another TiKV cluster running `API V2`.

1. Update TiKV, TiDB, and PD to the version that supports `API V2`.
2. Ensure that all the keys in TiKV are written by TiDB, which are prefixed with `m` or `t`. Delete if any. Or else the step 4 will fail.
3. Use `pd-ctl` to enable `API V2`.
4. Enable `API V2` in TiKV config file and restart TiKV (User should take the responsibility to offline all tikv clients excluding TiDB. Or set by online config change API (Not proposed in this RFC, but is good to have).
## Implementation Details

#### Downgrade
### kvproto

Downgrade from `API V2` to `API V1` is also simple:
```proto
// kvrpcpb.proto
1. Ensure that all the keys in TiKV are written by TiDB, which are prefixed with `m` or `t`. Delete if any.
2. Use `pd-ctl` to disable `API V2`.
3. Disable `API V2` in TiKV config file and restart TiKV (User should take the responsibility to offline all tikv clients excluding TiDB). Or set by online config change API (Not proposed in this RFC, but is good to have).
message Context {
// ... omitted other fields
#### Data migration
// API version implies the encode of the key and value.
APIVersion api_version = 21;
}
It's reasonable to provide a way to import and export non-TiDB data in TiKV during the upgrade or downgrade. On TiKV before 4.0, the only way to do that is `scan` and `batch_put` on the client. After 4.0, TiKV start to support importing SST file into TxnKV, and after 5.1, importing on RawKV is also supported. You can find more information in [`RFC: Online Bulk Load for RawKV`](https://github.com/tikv/rfcs/pull/72).
// The API version the server and the client are using.
// See more details in https://github.com/tikv/rfcs/blob/master/text/0069-api-v2.md.
enum APIVersion {
V1 = 0;
V1TTL = 1;
V2 = 2;
}
```

### Implementation Details
```proto
// raft_serverpb.proto
#### PD
message StoreIdent {
// ... omitted other fields
kvrpcpb.APIVersion api_version = 3;
}
```

Add the new APIs described [above](#Keyspace-Management).
```proto
// brpb.proto
#### TiKV Server
message BackupMeta {
// ... omitted other fields
In TiKV config file, add a new configuration `storage.api_version`. When enabled, `storage.enable_ttl` must also be enabled.
kvrpcpb.APIVersion api_version = 18;
}
In kvdb, add a store meta `api_version`. When the store meta mismatches the config `storage.enable_ttl`, it means that the user is switching the API version, then check no non-TiDB exist, and then save the new api version in store meta.
message BackupResponse {
// ... omitted other fields
kvrpcpb.APIVersion api_version = 5;
}
```

In kvproto message `SSTMeta`, add `api_version`. Reject the SST file if the version is mismatched.
### TiKV Server

In TiKV gRPC's context, add a field `api_version`.
In TiKV config file, add a new configuration `storage.api-version`.

If `storage.api_version=2`:
If the API version in `StoreIdent` mismatches with `storage.api-version` in the config, it means that the user is switching the API version, therefore TiKV will check if there is any non-TiDB data in storage, and eventually save the new API version in `StoreIdent`.

- Run TTL compaction filter only on the keys that start with RawKV prefix.
If `storage.api-version=2`:

- Use the `API V2` Value encode in `RawStore`, `TTLStore` and `sst_importer`.
- Use the new value encoding in `RawStore` and `sst_importer`.

- If the request's context has `api_version=1`:
- Only allow RawKV to access `default` CF.

- If the request's context has `api-version=1`:
- Reject the request unless it's a TxnKV request and the keys starting with `m` or `t`.

- If the request's context has `api_version=2`:
- If the request's context has `api-version=2`:
- Only allow the key that has RawKV prefix for RawKV requests.
- Only allow the key that has TxnKV prefix for TxnKV requests.

If `storage.api_version=1`:
If `storage.api-version=1` & `storage.enable-ttl=true`:

- Reject all requests with `api-version=2` in the context.
- Reject all transactional requests; otherwise the raw TTL encoding in V1TTL will corrupt transaction data.

If `storage.api-version=1` & `storage.enable-ttl=false`:

- Reject all requests with `api_version=2` in the context.
- Reject all requests with `api-version=2` in the context.
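
The admission rules in this list can be summarized in a small decision function. This is a sketch under assumptions: `check_request` and its parameters are hypothetical names, the version numbers follow the prose (`1` and `2`) rather than the `APIVersion` enum values, and keys are matched only on their leading mode prefix.

```python
from typing import Iterable

TIDB_PREFIXES = (b"m", b"t")  # keys written by TiDB
TXN_PREFIX = b"x"             # API V2 TxnKV mode prefix
RAW_PREFIX = b"r"             # API V2 RawKV mode prefix


def check_request(storage_api_version: int, enable_ttl: bool,
                  req_api_version: int, is_txn: bool,
                  keys: Iterable[bytes]) -> bool:
    """Return True if the server should accept the request, False to reject."""
    if storage_api_version == 2:
        if req_api_version == 1:
            # Only TiDB traffic is still allowed through the V1 path.
            return is_txn and all(k.startswith(TIDB_PREFIXES) for k in keys)
        if req_api_version == 2:
            prefix = TXN_PREFIX if is_txn else RAW_PREFIX
            return all(k.startswith(prefix) for k in keys)
        return False

    # storage.api-version=1 (with or without TTL): V2 requests are rejected.
    if req_api_version != 1:
        return False
    # With TTL enabled, the raw TTL encoding would corrupt transaction data.
    if enable_ttl and is_txn:
        return False
    return True
```

For example, `check_request(2, False, 1, True, [b"t\x80\x00"])` accepts a TiDB request on a V2 server, while the same call with `is_txn=False` is rejected.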

#### TiKV Client
### TiKV Client

Provide two modes for users:

- V2:
- Fetch keyspace prefix by keyspace name from PD and then prepend `k{keyspace prefix}x` on TxnKV keys or prepend `k{keyspace prefix}r` on RawKV keys.
- Set `api_version=2` in TiKV gRPC's `Context`.
- Disallow specify CF in `RawClient`.
- Allow user to specify a keyspace for a session of `RawClient` or `TxnClient`. Default keyspace is named `default`.
- Fetch keyspace information from PD every 5 mins. Destroy the client session if the keyspace it's using is deleted.
- Prepend `x{keyspace}` before TxnKV keys or prepend `r{keyspace}` before RawKV keys.
- `Keyspace` is optional and defaults to `0` for backward compatibility.
- Set `api_version=2` in `kvrpcpb.Context`.
- Disallow specifying `cf` in `RawClient`.

- V1:
- Behaves jusk like current client.
- Set `api_version=1` in TiKV gRPC's `Context`.
- Set `api_version=1` in `kvrpcpb.Context`.
- Besides above, behaves just like current client.

Listed below is the compatibility matrix:

| | V1 Server | V2 Server |
| ------------ | --------- | --------- |
| V1 RawClient | Raw Data | Forbidden |
| V1 TxnClient | Txn Data | TiDB Data |
| V2 RawClient | Forbidden | Raw Data |
| V2 TxnClient | Forbidden | Txn Data |
| | V1 Server | V1TTL Server | V2 Server |
| --------------------- | --------- | ------------ | --------- |
| V1 RawClient | Raw | Raw | Error |
| V1 RawClient with TTL | Error | Raw | Error |
| V1 TxnClient | Txn | Error | Error |
| V1 TiDB | TiDB Data | Error | TiDB Data |
| V2 RawClient | Error | Error | Raw |
| V2 TxnClient | Error | Error | Txn |

### CDC / BR
### Garbage Collection

Since all access to TiDB is unchanged during the upgrade, CDC and BR should work the same after upgrade/downgrade.
*To be supplemented in another PR*

### Backup and Restore

*To be supplemented in another PR*

### Change Data Capture

The details of CDC will be introduced in another RFC.

*TODO: add link here.*

### tikv-ctl

@@ -222,3 +206,12 @@ Read `api_version` in kvdb and decode data using the corresponding version.
Upgrade to the latest TiKV Go Client and use `V1` mode.

## Unresolved questions

*TBD*

[Change Data Capture]: https://en.wikipedia.org/wiki/Change_data_capture
[Memory Comparable Encoding]: https://github.com/facebook/mysql-5.6/wiki/MyRocks-record-format#memcomparable-format
[Causal Consistency]: https://en.wikipedia.org/wiki/Causal_consistency
[Happened Before]: https://en.wikipedia.org/wiki/Happened-before
[Hybrid Logical Clock]: https://cse.buffalo.edu/tech-reports/2014-04.pdf
[NTP]: https://en.wikipedia.org/wiki/Network_Time_Protocol
