cdc: the e2e checksum integrity check functionality #13258
Merged
Commits (15, all by Oreoxmt):

- af6aa2b cdc: the e2e checksum integrity check functionality
- f038066 Apply suggestions from code review
- 15b1eca make ci happy
- b1c3e17 Apply suggestions from code review
- 606a0cb add TIDB_ROW_CHECKSUM and implementation details
- 3ddcb04 Apply suggestions from code review
- 2673b5c Update tidb-functions.md
- 4fab76a Apply suggestions from code review
- f89bd1b Apply suggestions from code review
- 702413d Merge branch 'master' into translate/docs-cn-13664
- b2e8d56 Apply suggestions from code review
- ce16d91 fix ci
- 099685b refine wording
- 1c5f07e update code example
- efe84bd Merge branch 'master' into translate/docs-cn-13664
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
---
title: TiCDC Data Integrity Validation for Single-Row Data
summary: Introduce the implementation principle and usage of the TiCDC data integrity validation feature.
---

# TiCDC Data Integrity Validation for Single-Row Data

Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a checksum algorithm to validate the integrity of single-row data. This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. The data integrity validation feature only supports changefeeds that use Kafka as the downstream, and currently supports only the Avro protocol.
## Implementation principles

After you enable the checksum integrity validation feature for single-row data, TiDB uses the CRC32 algorithm to calculate the checksum of each row and writes it to TiKV along with the data. TiCDC reads the data from TiKV and recalculates the checksum using the same algorithm. If the two checksums are equal, the data is consistent during the transmission from TiDB to TiCDC.

TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka consumer reads the data, it calculates a new checksum using the same algorithm as TiDB. If the new checksum equals the checksum carried in the data, the data is consistent during the transmission from TiCDC to the Kafka consumer.

For more information about the checksum algorithm, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation).
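The following Go sketch illustrates the core compare-recomputed-checksum idea of this design. It is a conceptual example only: the stand-in byte slice and the IEEE CRC32 variant are assumptions for illustration, not TiCDC's actual encoding or wire format.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

func main() {
	// Stand-in for the encoded column values of one row. In the real
	// pipeline, TiDB encodes the row and computes the checksum before
	// writing it to TiKV.
	encodedRow := []byte("example-encoded-row")

	// Checksum computed by the writer; it travels alongside the row data.
	sent := crc32.ChecksumIEEE(encodedRow)

	// The reader (TiCDC, and later the Kafka consumer) re-encodes the row
	// it received and recomputes the checksum with the same algorithm.
	recomputed := crc32.ChecksumIEEE(encodedRow)

	if recomputed != sent {
		fmt.Println("checksum mismatch: the row was corrupted in transit")
	} else {
		fmt.Println("checksums match: the row is intact")
	}
}
```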
## Enable the feature

TiCDC disables data integrity validation by default. To enable it, perform the following steps:

1. Enable the checksum integrity validation feature for single-row data in the upstream TiDB cluster by setting the [`tidb_enable_row_level_checksum`](/system-variables.md#tidb_enable_row_level_checksum-new-in-v710) system variable:

    ```sql
    SET GLOBAL tidb_enable_row_level_checksum = ON;
    ```

    This configuration only takes effect for newly created sessions, so you need to reconnect to TiDB.

2. In the [configuration file](/ticdc/ticdc-changefeed-config.md#changefeed-configuration-parameters) specified by the `--config` parameter when you create a changefeed, add the following configuration:

    ```toml
    [integrity]
    integrity-check-level = "correctness"
    corruption-handle-level = "warn"
    ```

3. When using Avro as the data encoding format, set [`enable-tidb-extension=true`](/ticdc/ticdc-sink-to-kafka.md#configure-sink-uri-for-kafka) in the [`sink-uri`](/ticdc/ticdc-sink-to-kafka.md#configure-sink-uri-for-kafka). To prevent numerical precision loss during network transmission, which can cause checksum validation failures, also set [`avro-decimal-handling-mode=string`](/ticdc/ticdc-sink-to-kafka.md#configure-sink-uri-for-kafka) and [`avro-bigint-unsigned-handling-mode=string`](/ticdc/ticdc-sink-to-kafka.md#configure-sink-uri-for-kafka). The following is an example:

    ```shell
    cdc cli changefeed create --server=http://127.0.0.1:8300 --changefeed-id="kafka-avro-checksum" --sink-uri="kafka://127.0.0.1:9092/topic-name?protocol=avro&enable-tidb-extension=true&avro-decimal-handling-mode=string&avro-bigint-unsigned-handling-mode=string" --schema-registry=http://127.0.0.1:8081 --config changefeed_config.toml
    ```

    With the preceding configuration, each message that the changefeed writes to Kafka includes the checksum of the corresponding data. You can verify data consistency based on these checksum values, as sketched after these steps.

    > **Note:**
    >
    > For existing changefeeds, if `avro-decimal-handling-mode` and `avro-bigint-unsigned-handling-mode` are not set, enabling the checksum validation feature might cause schema compatibility issues. To resolve this issue, you can change the compatibility type of the Schema Registry to `NONE`. For more details, see [Schema Registry](https://docs.confluent.io/platform/current/schema-registry/fundamentals/avro.html#no-compatibility-checking).
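To give a rough idea of the consumer side, the following Go sketch pulls the checksum out of an already-decoded Avro value (for example, the map that goavro's `codec.NativeFromBinary` returns). The field name `_tidb_row_level_checksum` and its string encoding are assumptions about the schema that TiCDC registers; verify them against the schema in your Schema Registry before relying on this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// extractChecksum reads the row-level checksum that TiCDC attaches to each
// Avro value when enable-tidb-extension=true and checksum validation is on.
// ASSUMPTION: the checksum is carried as a decimal string in a field named
// "_tidb_row_level_checksum"; confirm this against your registered schema.
func extractChecksum(decoded map[string]interface{}) (uint32, error) {
	raw, ok := decoded["_tidb_row_level_checksum"].(string)
	if !ok {
		return 0, fmt.Errorf("checksum field missing or not a string")
	}
	v, err := strconv.ParseUint(raw, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("malformed checksum %q: %w", raw, err)
	}
	return uint32(v), nil
}

func main() {
	// A hand-built stand-in for a decoded Avro value.
	decoded := map[string]interface{}{
		"id":                        int64(1),
		"_tidb_row_level_checksum": "305419896",
	}
	if sum, err := extractChecksum(decoded); err == nil {
		fmt.Printf("expected checksum: %d\n", sum)
	}
}
```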
## Disable the feature

TiCDC disables data integrity validation by default. To disable this feature after enabling it, perform the following steps:

1. Follow the `Pause Task -> Modify Configuration -> Resume Task` process described in [Update task configuration](/ticdc/ticdc-manage-changefeed.md#update-task-configuration), and remove all `[integrity]` configurations from the configuration file specified by the `--config` parameter of the changefeed, or explicitly set the check level to `"none"` as follows:

    ```toml
    [integrity]
    integrity-check-level = "none"
    corruption-handle-level = "warn"
    ```

2. Execute the following SQL statement in the upstream TiDB cluster to disable the checksum integrity validation feature ([`tidb_enable_row_level_checksum`](/system-variables.md#tidb_enable_row_level_checksum-new-in-v710)):

    ```sql
    SET GLOBAL tidb_enable_row_level_checksum = OFF;
    ```

    This configuration only takes effect for newly created sessions. After all clients writing to TiDB have reconnected, the messages that the changefeed writes to Kafka no longer include the checksum of the corresponding data.
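Because the variable only affects new sessions, a quick way to confirm what a client actually sees is to reconnect and read the variable back. The following Go sketch does this with `database/sql`; the driver choice and the DSN (`127.0.0.1:4000`) are assumptions for illustration.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // example driver choice
)

func main() {
	// A new connection picks up the new GLOBAL value; sessions opened
	// before the SET GLOBAL statement do not.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var name, value string
	err = db.QueryRow("SHOW VARIABLES LIKE 'tidb_enable_row_level_checksum'").Scan(&name, &value)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", name, value) // expect OFF after reconnecting
}
```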
## Algorithm for checksum calculation

The pseudocode for the checksum calculation algorithm is as follows:

```
fn checksum(columns) {
    let result = 0
    for column in sort_by_schema_order(columns) {
        result = crc32.update(result, encode(column))
    }
    return result
}
```
* `columns` should be sorted by column ID. In the Avro schema, fields are already sorted by column ID, so you can directly use the order in `columns`.

* The `encode(column)` function encodes the column value into bytes. The encoding rules vary based on the data type of the column, as follows:

    * TINYINT, SMALLINT, INT, BIGINT, MEDIUMINT, and YEAR types are converted to UINT64 and encoded in little-endian. For example, the number `0x0123456789abcdef` is encoded as the byte sequence `hex'efcdab8967452301'`.
    * FLOAT and DOUBLE types are converted to DOUBLE and then encoded as UINT64 in the IEEE 754 format.
    * BIT, ENUM, and SET types are converted to UINT64.

        * The BIT type is converted to UINT64 in binary format.
        * ENUM and SET types are converted to their corresponding INT values in UINT64. For example, if the data value of a `SET('a','b','c')` column is `'a,c'`, the value is encoded as `0b101`, which is `5` in UINT64.

    * TIMESTAMP, DATE, DURATION, DATETIME, JSON, and DECIMAL types are converted to STRING and then encoded as UTF-8 bytes.
    * VARBINARY, BINARY, and BLOB types (including TINY, MEDIUM, and LONG) are directly encoded as bytes.
    * VARCHAR, CHAR, and TEXT types (including TINY, MEDIUM, and LONG) are encoded as UTF-8 bytes.
    * NULL and GEOMETRY types are excluded from the checksum calculation, and `encode` returns empty bytes for them.
> **Note:**
>
> After you enable the checksum validation feature, data of the DECIMAL and UNSIGNED BIGINT types is converted to strings. Therefore, in the downstream consumer code, you need to convert such strings back to their corresponding numerical values before calculating checksums.
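Putting the pseudocode and encoding rules together, the following Go sketch computes a checksum over a simplified column representation. It is a minimal illustration under stated assumptions (the standard IEEE CRC32 table, a hypothetical `Column` type, and handling for only a few of the types listed above), not TiCDC's actual implementation; see the decoder linked below for the real code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"math"
	"strconv"
)

// Column is a hypothetical, simplified representation of one decoded field.
// Columns are assumed to be already sorted by column ID, as Avro fields are.
type Column struct {
	Type  string      // simplified type tag: "bigint", "double", "varchar", ...
	Value interface{} // decoded value
}

var table = crc32.MakeTable(crc32.IEEE) // assuming the standard IEEE polynomial

// encode maps a column value to bytes following the rules described above.
// Only a few representative types are shown; type assertions are unchecked
// because this is a sketch.
func encode(col Column) []byte {
	switch col.Type {
	case "bigint", "int", "year": // integers: UINT64, little-endian
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], uint64(col.Value.(int64)))
		return buf[:]
	case "unsigned bigint": // delivered as a string; convert back first (see the note above)
		v, _ := strconv.ParseUint(col.Value.(string), 10, 64)
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], v)
		return buf[:]
	case "double": // IEEE 754 bits as UINT64, little-endian
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], math.Float64bits(col.Value.(float64)))
		return buf[:]
	case "varchar", "char", "text", "decimal", "datetime": // UTF-8 bytes
		return []byte(col.Value.(string))
	default: // NULL, GEOMETRY, and anything unhandled: empty bytes
		return nil
	}
}

// checksum folds the encoded columns into one CRC32 value, mirroring the
// pseudocode above.
func checksum(columns []Column) uint32 {
	var result uint32
	for _, col := range columns {
		result = crc32.Update(result, table, encode(col))
	}
	return result
}

func main() {
	cols := []Column{
		{Type: "bigint", Value: int64(1)},
		{Type: "varchar", Value: "hello"},
		{Type: "double", Value: 3.14},
	}
	fmt.Printf("checksum: %d\n", checksum(cols))
}
```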
The consumer code written in Golang implements steps such as decoding data read from Kafka, sorting fields by schema order, and calculating the checksum value. For more information, see [`avro/decoder.go`](https://github.com/pingcap/tiflow/blob/master/pkg/sink/codec/avro/decoder.go).