From 4447bc9480cf8cae65781bcfb749631ad81344d9 Mon Sep 17 00:00:00 2001 From: zhangyangyu Date: Thu, 12 May 2022 23:46:26 +0800 Subject: [PATCH 1/5] add avro refactor design --- ...2022-05-12-ticdc-avro-protocol-refactor.md | 205 ++++++++++++++++++ 1 file changed, 205 insertions(+) create mode 100644 docs/design/2022-05-12-ticdc-avro-protocol-refactor.md diff --git a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md new file mode 100644 index 00000000000..2885fc962d3 --- /dev/null +++ b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md @@ -0,0 +1,205 @@ +# TiCDC Design Documents + +- Author(s): [Zhao Yilin](http://github.com/leoppro), [Zhang Xiang](http://github.com/zhangyangyu) +- Tracking Issue: https://github.com/pingcap/tiflow/issues/5338 + +## Table of Contents + +- [TiCDC Design Documents](#ticdc-design-documents) + - [Table of Contents](#table-of-contents) + - [Introduction](#introduction) + - [Motivation or Background](#motivation-or-background) + - [Detailed Design](#detailed-design) + - [New Config Items](#new-config-items) + - [flat-avro Schema Definition](#flat-avro-schema-definition) + - [Key Schema](#key-schema) + - [Value Schema](#value-schema) + - [DML Events](#dml-events) + - [Schema Change](#schema-change) + - [Subject Name Strategy](#subject-name-strategy) + - [ColumnValueBlock and Data Mapping](#columnvalueblock-and-data-mapping) + - [Test Design](#test-design) + - [Functional Tests](#functional-tests) + - [CLI Tests](#cli-tests) + - [Data Mapping Tests](#data-mapping-tests) + - [DML Tests](#dml-tests) + - [Schema Tests](#schema-tests) + - [SubjectNameStrategy Tests](#subjectnamestrategy-tests) + - [Compatibility Tests](#compatibility-tests) + - [Impacts & Risks](#impacts--risks) + - [Investigation & Alternatives](#investigation--alternatives) + - [Unresolved Questions](#unresolved-questions) + +## Introduction + +This document provides a complete design on refactoring the 
existing Avro protocol implementation. A common Avro data format is defined in order to build data pathways to various streaming systems.

## Motivation or Background

Apache Avro™ is a data serialization system with rich data structures and a compact binary data format. Avro relies on schemas, and schemas are managed by a schema registry. Avro is a common data format in streaming systems, supported by Confluent, Flink, Debezium, etc.

## Detailed Design

### New Config Items

| Config item | Option values | Default | Explain
|------------------------------------|------------------------|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| protocol | canal-json / flat-avro | - | Specify the message format which output to the kafka.
The `flat-avro` option means using the Avro format designed in this document.
| enable-tidb-extension | true / false | false | Append TiDB extension fields into the Avro message or not.
| schema-registry | - | - | Specifies the schema registry endpoint.
| avro-decimal-handling-mode | precise / string | precise | Specifies how TiCDC should handle values for DECIMAL columns:
The `precise` option encodes decimals as precise bytes.
The `string` option encodes values as formatted strings, which is easy to consume, but semantic information about the real type is lost.
| avro-bigint-unsigned-handling-mode | long / string | long | Specifies how TiCDC should handle values for UNSIGNED BIGINT columns:
The `long` option represents values using an Avro long (64-bit signed integer), which might overflow but is easy for consumers to use.
`string` represents values by string which is precision but which is need to parse by consumers.

### flat-avro Schema Definition

`flat-avro` is an alias of the `avro` protocol. It means all column values are placed directly inside the message with no nesting. This structure is compatible with most Confluent sink connectors, but it cannot handle `old-value`. `rich-avro` is the opposite and is reserved for future needs.

#### Key Schema

```
{
    "name":"{{RecordName}}",
    "namespace":"{{Namespace}}",
    "type":"record",
    "fields":[
        {{ColumnValueBlock}},
        {{ColumnValueBlock}},
    ]
}
```

- `{{RecordName}}` represents the fully qualified table name.
- `{{ColumnValueBlock}}` represents a JSON block, which defines a column value of the key.
- The key only includes the valid index fields.

#### Value Schema

```
{
    "name":"{{RecordName}}",
    "namespace":"{{Namespace}}",
    "type":"record",
    "fields":[
        {{ColumnValueBlock}},
        {{ColumnValueBlock}},
        {
            "name":"_tidb_op",
            "type":"string"
        },
        {
            "name":"_tidb_commit_ts",
            "type":"long"
        },
        {
            "name":"_tidb_commit_physical_time",
            "type":"long"
        }
    ]
}
```
- `{{RecordName}}` represents the fully qualified table name.
- `{{ColumnValueBlock}}` represents a JSON block, which defines a column value of the row.
- `_tidb_op` used to distinguish between INSERT or UPDATE events, optional values are "c" / "u".
- `_tidb_commit_ts` represents a CommitTS of a transaction.
- `_tidb_commit_physical_time` represents a physical timestamp of a transaction.

When `enable-tidb-extension` is `true`, `_tidb_op`, `_tidb_commit_ts`, `_tidb_commit_physical_time` will be appended to every Kafka value. When `enable-tidb-extension` is `false`, no extension fields will be appended to Kafka values.

### DML Events

If `enable-tidb-extension` is `true`, the `_tidb_op` field is "c" for an INSERT event and "u" for an UPDATE event.
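To make these event semantics concrete, here is a small consumer-side sketch in Python. It is illustrative only: the decoded `record` dict and the helper names are invented, and it assumes the usual TiDB TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits); only the `_tidb_op` / `_tidb_commit_ts` field names and the "c" / "u" values come from this design.

```python
# Illustrative consumer-side handling of the TiDB extension fields after a
# Kafka value has been Avro-decoded into a dict. Helper names are made up.


def classify(value):
    """Map the _tidb_op extension field to an event kind."""
    # A DELETE event arrives as a null Kafka value; _tidb_op is only
    # present when enable-tidb-extension is true.
    if value is None:
        return "DELETE"
    return {"c": "INSERT", "u": "UPDATE"}.get(value.get("_tidb_op"), "UNKNOWN")


def commit_physical_ms(commit_ts):
    """Recover the physical timestamp (ms) from a TiDB CommitTS.

    Assumes the standard TSO layout: (physical_ms << 18) | logical.
    """
    return commit_ts >> 18


# A made-up decoded value for a table with columns (id, name):
record = {"id": 1, "name": "a", "_tidb_op": "c", "_tidb_commit_ts": 1 << 18}
print(classify(record))  # INSERT
print(commit_physical_ms(record["_tidb_commit_ts"]))  # 1
```

The same `classify` helper also covers the DELETE case described below, where TiCDC ships only the key and a `null` value.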

If `enable-tidb-extension` is `false`, the `_tidb_op` field will not be appended to the Kafka value, so there is no difference between INSERT and UPDATE events.

For the DELETE event, TiCDC will send the primary key value as the Kafka key, and the Kafka value will be `null`.

### Schema Change

Avro detects schema changes on every DML event rather than on DDL events. Whenever there is a schema change, the Avro codec tries to register a new schema version under the corresponding subject in the schema registry. Whether it succeeds or not depends on the schema evolution compatibility. The Avro codec does not address any compatibility issues and simply propagates errors.

### Subject Name Strategy

The Avro codec only supports the default `TopicNameStrategy`. This means a Kafka topic can only accept a single schema. With the multi-topic ability in TiCDC, events from multiple tables could all be dispatched to one topic, which is not allowed under `TopicNameStrategy`. So we require that, for the avro protocol, the topic rule in the dispatcher rules must contain both `{schema}` and `{table}` placeholders, which means each table occupies its own Kafka topic.

### ColumnValueBlock and Data Mapping

A `ColumnValueBlock` has the following schema:

```
{
    "name":"{{ColumnName}}",
    "type":{
        "connect.parameters":{
            "tidb_type":"{{TIDB_TYPE}}"
        },
        "type":"{{AVRO_TYPE}}"
    }
}
```

| SQL TYPE | TIDB_TYPE | AVRO_TYPE | Description |
|----------------------------------------------------|------------------------------|-----------|--------------------------------------------------------------------------------------------------------------------------------|
| TINYINT/BOOL/SMALLINT/MEDIUMINT/INT | INT | int | When it's unsigned, TIDB_TYPE is INT UNSIGNED. For SQL TYPE INT UNSIGNED, its AVRO_TYPE is long. |
| BIGINT | BIGINT | long | When it's unsigned, TIDB_TYPE is BIGINT UNSIGNED. If `avro-bigint-unsigned-handling-mode` is string, AVRO_TYPE is string. 
|
| TINYBLOB/BLOB/MEDIUMBLOB/LONGBLOB/BINARY/VARBINARY | BLOB | bytes | |
| TINYTEXT/TEXT/MEDIUMTEXT/LONGTEXT/CHAR/VARCHAR | TEXT | string | |
| FLOAT/DOUBLE | FLOAT/DOUBLE | double | |
| DATE/DATETIME/TIMESTAMP/TIME | DATE/DATETIME/TIMESTAMP/TIME | string | |
| YEAR | YEAR | int | |
| BIT | BIT | bytes | BIT has another `connect.parameters` entry `"length":"64"`. |
| JSON | JSON | string | |
| ENUM/SET | ENUM/SET | string | ENUM/SET has another `connect.parameters` entry `"allowed":"a,b,c"`. |
| DECIMAL | DECIMAL | bytes | This is an Avro logical type having `scale` and `precision`. When `avro-decimal-handling-mode` is string, AVRO_TYPE is string. |


## Test Design

### Functional Tests

#### CLI Tests

- avro/flat-avro protocol
- avro/flat-avro protocol & true/false/invalid enable-tidb-extension
- avro/flat-avro protocol & precise/string/invalid avro-decimal-handling-mode
- avro/flat-avro protocol & long/string/invalid avro-bigint-unsigned-handling-mode
- avro/flat-avro protocol & valid/invalid schema-registry

#### Data Mapping Tests

- With protocol=avro&enable-tidb-extension=false&avro-decimal-handling-mode=precise&avro-bigint-unsigned-handling-mode=long, all generated schema and data are correct.
- With enable-tidb-extension=true, schema and value will have _tidb_op, _tidb_commit_ts, _tidb_commit_physical_time fields.
- With avro-decimal-handling-mode=string, decimal field generates string schema and data.
- With avro-bigint-unsigned-handling-mode=string, bigint unsigned generates string schema and data.

#### DML Tests

- Insert a row and check the row in the downstream database.
- Update a row and check the row in the downstream database.
- Delete a row and check the row in the downstream database.

#### Schema Tests

- When the schema is not in schema registry, a fresh new schema is created.
- When the schema is in schema registry and pass compatibility check, a new version is created. 
+- When the schema is in schema registry and not pass compatibility check, reports error. + +#### SubjectNameStrategy Tests + +- When there is only default topic, a changefeed could only replicate one table. +- When there is invalid topic rule, report error. + +### Compatibility Tests + +N/A + +## Impacts & Risks + +N/A + +## Investigation & Alternatives + +N/A + +## Unresolved Questions + +N/A From 15c8e960ff0694267a37d5690f0d1f8e44d3f45c Mon Sep 17 00:00:00 2001 From: zhangyangyu Date: Fri, 13 May 2022 00:10:59 +0800 Subject: [PATCH 2/5] lint --- ...2022-05-12-ticdc-avro-protocol-refactor.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md index 2885fc962d3..2382e897144 100644 --- a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md +++ b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md @@ -42,13 +42,13 @@ Apache Avro™ is a data serialization system with rich data structures and a co ### New Config Items -| Config item | Option values | Default | Explain -|------------------------------------|------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -| protocol | canal-json / flat-avro | - | Specify the message format which output to the kafka.
The `flat-avro` option means using the Avro format designed in this document.
| enable-tidb-extension | true / false | false | Append TiDB extension fields into the Avro message or not.
| schema-registry | - | - | Specifies the schema registry endpoint.
| avro-decimal-handling-mode | precise / string | precise | Specifies how TiCDC should handle values for DECIMAL columns:
The `precise` option encodes decimals as precise bytes.
The `string` option encodes values as formatted strings, which is easy to consume, but semantic information about the real type is lost.
| avro-bigint-unsigned-handling-mode | long / string | long | Specifies how TiCDC should handle values for UNSIGNED BIGINT columns:
The `long` option represents values using an Avro long (64-bit signed integer), which might overflow but is easy for consumers to use.
`string` represents values by string which is precision but which is need to parse by consumers. +| Config item | Option values | Default | Explain | +| ---------------------------------- | ---------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| protocol | canal-json / flat-avro | - | Specify the message format which output to the kafka.
The `flat-avro` option means using the Avro format designed in this document. |
| enable-tidb-extension | true / false | false | Append TiDB extension fields into the Avro message or not. |
| schema-registry | - | - | Specifies the schema registry endpoint. |
| avro-decimal-handling-mode | precise / string | precise | Specifies how TiCDC should handle values for DECIMAL columns:
The `precise` option encodes decimals as precise bytes.
The `string` option encodes values as formatted strings, which is easy to consume, but semantic information about the real type is lost. |
| avro-bigint-unsigned-handling-mode | long / string | long | Specifies how TiCDC should handle values for UNSIGNED BIGINT columns:
The `long` option represents values using an Avro long (64-bit signed integer), which might overflow but is easy for consumers to use.
`string` represents values by string which is precision but which is need to parse by consumers. | ### flat-avro Schema Definition @@ -97,6 +97,7 @@ Apache Avro™ is a data serialization system with rich data structures and a co ] } ``` + - `{{RecordName}}` represents full qualified table name. - `{{ColumnValueBlock}}` represents a JSON block, which defines a column value of a key. - `_tidb_op` used to distinguish between INSERT or UPDATE events, optional values are "c" / "u". @@ -138,7 +139,7 @@ A `ColumnValueBlock` has the following schema: ``` | SQL TYPE | TIDB_TYPE | AVRO_TYPE | Description | -|----------------------------------------------------|------------------------------|-----------|--------------------------------------------------------------------------------------------------------------------------------| +| -------------------------------------------------- | ---------------------------- | --------- | ------------------------------------------------------------------------------------------------------------------------------ | | TINYINT/BOOL/SMALLINT/MEDIUMINT/INT | INT | int | When it's unsigned, TIDB_TYPE is INT UNSIGNED. For SQL TYPE INT UNSIGNED, its AVRO_TYPE is long. | | BIGINT | BIGINT | long | When it's unsigned, TIDB_TYPE is BIGINT UNSIGNED. If `avro-bigint-unsigned-handling-mode` is string, AVRO_TYPE is string. | | TINYBLOB/BLOB/MEDIUMBLOB/LONGBLOB/BINARY/VARBINARY | BLOB | bytes | | @@ -151,7 +152,6 @@ A `ColumnValueBlock` has the following schema: | ENUM/SET | ENUM/SET | string | BIT has another `connector.parameters` entry `"allowed":"a,b,c"`. | | DECIMAL | DECIMAL | bytes | This is an avro logical type having `scale` and `precision`. When `avro-decimal-handling-mode` is string, AVRO_TYPE is string. 
| - ## Test Design ### Functional Tests @@ -167,7 +167,7 @@ A `ColumnValueBlock` has the following schema: #### Data Mapping Tests - With protocol=avro&enable-tidb-extension=false&avro-decimal-handling-mode=precise&avro-bigint-unsigned-handling-mode=long, all generated schema and data are correct. -- With enable-tidb-extension=true, schema and value will have _tidb_op, _tidb_commit_ts, _tidb_commit_physical_time fields. +- With enable-tidb-extension=true, schema and value will have \_tidb_op, \_tidb_commit_ts, \_tidb_commit_physical_time fields. - With avro-decimal-handling-mode=string,decimal field generates string schema and data. - With avro-bigint-unsigned-handling-mode=string, bigint unsigned generates string schema and data. From b00f5e2379790947911b6ffa153432e2f588e10a Mon Sep 17 00:00:00 2001 From: Xiang Zhang Date: Tue, 17 May 2022 17:37:58 +0800 Subject: [PATCH 3/5] Apply suggestions from code review Co-authored-by: zhaoxinyu --- docs/design/2022-05-12-ticdc-avro-protocol-refactor.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md index 2382e897144..787cd1b9f94 100644 --- a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md +++ b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md @@ -44,11 +44,11 @@ Apache Avro™ is a data serialization system with rich data structures and a co | Config item | Option values | Default | Explain | | ---------------------------------- | ---------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| protocol | canal-json / flat-avro | - | Specify the message format which output to the kafka.
The `flat-avro` option means using the Avro format designed in this document. |
| protocol | canal-json / flat-avro | - | Specify the format of messages written to Kafka.
The `flat-avro` option means using the Avro format designed in this document. |
| enable-tidb-extension | true / false | false | Append TiDB extension fields into the Avro message or not. |
| schema-registry | - | - | Specifies the schema registry endpoint. |
| avro-decimal-handling-mode | precise / string | precise | Specifies how TiCDC should handle values for DECIMAL columns:
The `precise` option encodes decimals as precise bytes.
The `string` option encodes values as formatted strings, which is easy to consume, but semantic information about the real type is lost. |
| avro-bigint-unsigned-handling-mode | long / string | long | Specifies how TiCDC should handle values for UNSIGNED BIGINT columns:
The `long` option represents values using an Avro long (64-bit signed integer), which might overflow but is easy for consumers to use.
`string` represents values by string which is precision but which is need to parse by consumers. |
| avro-bigint-unsigned-handling-mode | long / string | long | Specifies how TiCDC should handle values for UNSIGNED BIGINT columns:
The `long` option represents values using an Avro long (64-bit signed integer), which might overflow but is easy for consumers to use.
`string` represents values by string which is precise but which needs to be parsed by consumers. | ### flat-avro Schema Definition From a14d26da303b707a71f15a3c62a2ee784ea7890c Mon Sep 17 00:00:00 2001 From: Xiang Zhang Date: Tue, 17 May 2022 17:50:21 +0800 Subject: [PATCH 4/5] Apply suggestions from code review Co-authored-by: zhaoxinyu --- docs/design/2022-05-12-ticdc-avro-protocol-refactor.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md index 787cd1b9f94..2556227d27b 100644 --- a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md +++ b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md @@ -100,7 +100,7 @@ Apache Avro™ is a data serialization system with rich data structures and a co - `{{RecordName}}` represents full qualified table name. - `{{ColumnValueBlock}}` represents a JSON block, which defines a column value of a key. -- `_tidb_op` used to distinguish between INSERT or UPDATE events, optional values are "c" / "u". +- `_tidb_op` is used to distinguish between INSERT or UPDATE events, optional values are "c" / "u". - `_tidb_commit_ts` represents a CommitTS of a transaction. - `_tidb_commit_physical_time` represents a physical timestamp of a transaction. @@ -181,7 +181,7 @@ A `ColumnValueBlock` has the following schema: - When the schema is not in schema registry, a fresh new schema is created. - When the schema is in schema registry and pass compatibility check, a new version is created. -- When the schema is in schema registry and not pass compatibility check, reports error. +- When the schema is in schema registry and cannot pass compatibility check, reports error. 
#### SubjectNameStrategy Tests From ae30b8ce5f2d23dc4b735c83d880bb146c0387e4 Mon Sep 17 00:00:00 2001 From: zhangyangyu Date: Tue, 17 May 2022 20:21:20 +0800 Subject: [PATCH 5/5] fix lint --- docs/design/2022-05-12-ticdc-avro-protocol-refactor.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md index 2556227d27b..68c2298874e 100644 --- a/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md +++ b/docs/design/2022-05-12-ticdc-avro-protocol-refactor.md @@ -44,7 +44,7 @@ Apache Avro™ is a data serialization system with rich data structures and a co | Config item | Option values | Default | Explain | | ---------------------------------- | ---------------------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| protocol | canal-json / flat-avro | - | Specify the format of messages written to Kafka.
The `flat-avro` option means using the Avro format designed in this document. |
| protocol | canal-json / flat-avro | - | Specify the format of messages written to Kafka.
The `flat-avro` option means using the Avro format designed in this document. |
| enable-tidb-extension | true / false | false | Append TiDB extension fields into the Avro message or not. |
| schema-registry | - | - | Specifies the schema registry endpoint. |
| avro-decimal-handling-mode | precise / string | precise | Specifies how TiCDC should handle values for DECIMAL columns:
The `precise` option encodes decimals as precise bytes.
The `string` option encodes values as formatted strings, which is easy to consume, but semantic information about the real type is lost. |