Skip to content

Latest commit

 

History

History
545 lines (459 loc) · 20.8 KB

ticdc-canal-json.md

File metadata and controls

545 lines (459 loc) · 20.8 KB
title summary
TiCDC Canal-JSON Protocol
Learn the concept of TiCDC Canal-JSON Protocol and how to use it.

TiCDC Canal-JSON Protocol

Canal-JSON is a data exchange format protocol defined by Alibaba Canal. In this document, you can learn how Canal-JSON data formats are implemented in TiCDC, including the TiDB extension field, the definitions of the Canal-JSON data formats, and comparison with the official Canal.

Use Canal-JSON

When using Message Queue (MQ) as the downstream Sink, you can specify Canal-JSON in sink-uri. TiCDC wraps and constructs Canal-JSON messages with Event as the basic unit, and sends TiDB data change Events to the downstream.

There are three types of Events:

  • DDL Event: Represents a DDL change record. It is sent after an upstream DDL statement is successfully executed. The DDL Event is sent to the MQ Partition with the index being 0.
  • DML Event: Represents a row data change record. This type of Event is sent when a row change occurs. It contains the information about the row after the change occurs.
  • WATERMARK Event: Represents a special time point. It indicates that the Events received before this point is complete. It applies only to the TiDB extension field and takes effect when you set enable-tidb-extension to true in sink-uri.

The following is an example of using Canal-JSON:

{{< copyable "shell-regular" >}}

cdc cli changefeed create --server=http://127.0.0.1:8300 --changefeed-id="kafka-canal-json" --sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&protocol=canal-json"

TiDB extension field

The Canal-JSON protocol is originally designed for MySQL. It does not contain important fields such as the TiDB-specific unique identifier for the CommitTS transaction. To solve this problem, TiCDC appends a TiDB extension field to the Canal-JSON protocol format. After you set enable-tidb-extension to true (false by default) in sink-uri, TiCDC behaves as follows when generating Canal-JSON messages:

  • TiCDC sends DML Event and DDL Event messages that contain a field named _tidb.
  • TiCDC sends WATERMARK Event messages.

The following is an example:

{{< copyable "shell-regular" >}}

cdc cli changefeed create --server=http://127.0.0.1:8300 --changefeed-id="kafka-canal-json-enable-tidb-extension" --sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&protocol=canal-json&enable-tidb-extension=true"

Definitions of message formats

This section describes the formats of DDL Event, DML Event and WATERMARK Event, and how the data is resolved on the consumer side.

DDL Event

TiCDC encodes a DDL Event into the following Canal-JSON format.

{
    "id": 0,
    "database": "test",
    "table": "",
    "pkNames": null,
    "isDdl": true,
    "type": "QUERY",
    "es": 1639633094670,
    "ts": 1639633095489,
    "sql": "drop database if exists test",
    "sqlType": null,
    "mysqlType": null,
    "data": null,
    "old": null,
    "_tidb": {     // TiDB extension field
        "commitTs": 429918007904436226  // A TiDB TSO timestamp
    }
}

The fields are explained as follows.

Field Type Description
id Number The default value is 0 in TiCDC.
database String The name of the database where the row is located
table String The name of the table where the row is located
pkNames Array The names of all the columns that make up the primary key
isDdl Bool Whether the message is a DDL event
type String Event types defined by Canal-JSON
es Number 13-bit (millisecond) timestamp when the event that generated the message happened
ts Number 13-bit (millisecond) timestamp when TiCDC generated the message
sql String When isDdl is true, records the corresponding DDL statement
sqlType Object When isDdl is false, records how the data type of each column is represented in Java
mysqlType object When isDdl is false, records how the data type of each column is represented in MySQL
data Object When isDdl is false, records the name of each column and its data value
old Object Only if the message is generated by an update Event, records the name of each column and the data value before the update
_tidb Object TiDB extension field. It exists only if you set enable-tidb-extension to true. The value of commitTs is the TSO of the transaction that caused the row to change.

DML Event

TiCDC encodes a row of DML data change event as follows:

{
    "id": 0,
    "database": "test",
    "table": "tp_int",
    "pkNames": [
        "id"
    ],
    "isDdl": false,
    "type": "INSERT",
    "es": 1639633141221,
    "ts": 1639633142960,
    "sql": "",
    "sqlType": {
        "c_bigint": -5,
        "c_int": 4,
        "c_mediumint": 4,
        "c_smallint": 5,
        "c_tinyint": -6,
        "id": 4
    },
    "mysqlType": {
        "c_bigint": "bigint",
        "c_int": "int",
        "c_mediumint": "mediumint",
        "c_smallint": "smallint",
        "c_tinyint": "tinyint",
        "id": "int"
    },
    "data": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "2147483647",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "127",
            "id": "2"
        }
    ],
    "old": null,
    "_tidb": {     // TiDB extension field
        "commitTs": 429918007904436226  // A TiDB TSO timestamp
    }
}

WATERMARK Event

TiCDC sends a WATERMARK Event only when you set enable-tidb-extension to true. The value of the type field is TIDB_WATERMARK. The Event contains the _tidb field, and the field contains only one parameter watermarkTs. The value of watermarkTs is the TSO recorded when the Event is sent.

When you receive an Event of this type, all Events with commitTs less than watermarkTs have been sent. Because TiCDC provides the "At Least Once" semantics, data might be sent repeatedly. If a subsequent Event with commitTs less than watermarkTs is received, you can safely ignore this Event.

The following is an example of the WATERMARK Event.

{
    "id": 0,
    "database": "",
    "table": "",
    "pkNames": null,
    "isDdl": false,
    "type": "TIDB_WATERMARK",
    "es": 1640007049196,
    "ts": 1640007050284,
    "sql": "",
    "sqlType": null,
    "mysqlType": null,
    "data": null,
    "old": null,
    "_tidb": {     // TiDB extension field
        "watermarkTs": 429918007904436226  // A TiDB TSO timestamp
    }
}

Data resolution on the consumer side

As you can see from the example above, Canal-JSON has a uniform data format, with different field filling rules for different Event types. You can use a uniform method to resolve this JSON format data, and then determine the Event type by checking the field values.

  • When isDdl is true, the message contains a DDL Event.
  • When isDdl is false, you need to further check the type field. If type is TIDB_WATERMARK, it is a WATERMARK Event; otherwise, it is a DML Event.

Field descriptions

The Canal-JSON format records the corresponding data type in the mysqlType field and the sqlType field.

MySQL Type field

In the mysqlType field, the Canal-JSON format records the string of MySQL Type in each column. For more information, see TiDB Data Types.

SQL Type field

In the sqlType field, the Canal-JSON format records Java SQL Type of each column, which is the data type corresponding to the data in JDBC. Its value can be calculated by MySQL Type and the specific data value. The mapping is as follows:

MySQL Type Java SQL Type Code
Boolean -6
Float 7
Double 8
Decimal 3
Char 1
Varchar 12
Binary 2004
Varbinary 2004
Tinytext 2005
Text 2005
Mediumtext 2005
Longtext 2005
Tinyblob 2004
Blob 2004
Mediumblob 2004
Longblob 2004
Date 91
Datetime 93
Timestamp 93
Time 92
Year 12
Enum 4
Set -7
Bit -7
JSON 12

Integer types

You need to consider whether integer types have the Unsigned constraint and the value size, which corresponds to different Java SQL Type Codes respectively, as shown in the following table.

MySQL Type String Value Range Java SQL Type Code
tinyint [-128, 127] -6
tinyint unsigned [0, 127] -6
tinyint unsigned [128, 255] 5
smallint [-32768, 32767] 5
smallint unsigned [0, 32767] 5
smallint unsigned [32768, 65535] 4
mediumint [-8388608, 8388607] 4
mediumint unsigned [0, 8388607] 4
mediumint unsigned [8388608, 16777215] 4
int [-2147483648, 2147483647] 4
int unsigned [0, 2147483647] 4
int unsigned [2147483648, 4294967295] -5
bigint [-9223372036854775808, 9223372036854775807] -5
bigint unsigned [0, 9223372036854775807] -5
bigint unsigned [9223372036854775808, 18446744073709551615] 3

The following table shows the mapping relationships between Java SQL Types in TiCDC and their codes.

Java SQL Type Java SQL Type Code
CHAR 1
DECIMAL 3
INTEGER 4
SMALLINT 5
REAL 7
DOUBLE 8
VARCHAR 12
DATE 91
TIME 92
TIMESTAMP 93
BLOB 2004
CLOB 2005
BIGINT -5
TINYINT -6
Bit -7

For more information about Java SQL Types, see Java SQL Class Types.

Binary and Blob types

TiCDC encodes binary types in the Canal-JSON format by converting each byte to its character representation as follows:

  • Printable characters are represented using the ISO/IEC 8859-1 character encodings.
  • Non-printable characters and certain characters with special meaning in HTML are represented using their UTF-8 escape sequence.

The following table shows the detailed representation information.

Character type Value range Character representation
Control characters [0, 31] UTF-8 escape (such as \u0000 through \u001F)
Horizontal tab [9] \t
Line feed [10] \n
Carriage return [13] \r
Printable characters [32, 127] Literal character (such as A)
Ampersand [38] \u0026
Less-than sign [60] \u0038
Greater-than sign [62] \u003E
Extended control characters [128, 159] Literal character
ISO 8859-1 (Latin-1) [160, 255] Literal character

Example of the encoding

For example, the following 16 bytes [5 7 10 15 36 50 43 99 120 60 38 255 254 45 55 70] stored in a VARBINARY column called c_varbinary are encoded in a Canal-JSON Update event as follows:

{
    ...
    "data": [
        {
            ...
            "c_varbinary": "\u0005\u0007\n\u000f$2+cx\u003c\u0026ÿþ-7F"
        }
    ]
    ...
}

Comparison of TiCDC Canal-JSON and the official Canal

The way that TiCDC implements the Canal-JSON data format, including the Update Event and the mysqlType field, differs from the official Canal. The following table shows the main differences.

Item TiCDC Canal-JSON Canal
Event of Update Type By default, the old field contains all the column data. When only_output_updated_columns is true, the old field contains only the modified column data. The old field contains only the modified column data
mysqlType field For types with parameters, it does not contain the information of the type parameter For types with parameters, it contains the full information of the type parameter

Compatibility with the official Canal

Starting from v6.5.6, v7.1.3, and v7.6.0, TiCDC Canal-JSON supports compatibility with the data format of the official Canal. When creating a changefeed, you can set content-compatible=true in sink-uri to enable this feature. In this mode, TiCDC outputs Canal-JSON format data that is compatible with the official Canal. The specific changes are as follows:

  • The mysqlType field contains the full information of the type parameter for each type.
  • An Event of Update Type only outputs the modified column data.

Event of Update Type

For an Event of Update Type:

  • In TiCDC, the old field contains all the column data
  • In the official Canal, the old field contains only the modified column data

Assume that the following SQL statements are executed sequentially in the upstream TiDB:

create table tp_int
(
    id          int auto_increment,
    c_tinyint   tinyint   null,
    c_smallint  smallint  null,
    c_mediumint mediumint null,
    c_int       int       null,
    c_bigint    bigint    null,
    constraint pk
        primary key (id)
);

insert into tp_int(c_tinyint, c_smallint, c_mediumint, c_int, c_bigint)
values (127, 32767, 8388607, 2147483647, 9223372036854775807);

update tp_int set c_int = 0, c_tinyint = 0 where c_smallint = 32767;

For the update statement, TiCDC outputs an Event message with type as UPDATE, as shown below. The update statement only modifies the c_int and c_tinyint columns. The old field in the output event message contains all the column data.

{
    "id": 0,
    ...
    "type": "UPDATE",
    ...
    "sqlType": {
        ...
    },
    "mysqlType": {
        ...
    },
    "data": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "0",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "0",
            "id": "2"
        }
    ],
    "old": [                              // In TiCDC, this field contains all the column data.
        {
            "c_bigint": "9223372036854775807",
            "c_int": "2147483647",        // Modified column
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "127",           // Modified column
            "id": "2"
        }
    ]
}

For the official Canal, the old field in the output event message contains only the modified column data, as shown below.

{
    "id": 0,
    ...
    "type": "UPDATE",
    ...
    "sqlType": {
        ...
    },
    "mysqlType": {
        ...
    },
    "data": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "0",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "0",
            "id": "2"
        }
    ],
    "old": [                              // In Canal, this field contains only the modified column data.
        {
            "c_int": "2147483647",        // Modified column
            "c_tinyint": "127",           // Modified column
        }
    ]
}

mysqlType field

For the mysqlType field, if a type contains parameters, the official Canal contains the full information of the type parameter. TiCDC does not contain such information.

In the following example, the table-defining SQL statement contains a parameter for each column, such as the ones for decimal, char, varchar and enum. By comparing the Canal-JSON formats generated by TiCDC and the official Canal, you can see that TiCDC only contains the basic MySQL information in the mysqlType field. If you need the full information of the type parameter, you need to implement it by other means.

Assume that the following SQL statements are executed sequentially in the upstream TiDB:

create table t (
    id     int auto_increment,
    c_decimal    decimal(10, 4) null,
    c_char       char(16)      null,
    c_varchar    varchar(16)   null,
    c_binary     binary(16)    null,
    c_varbinary  varbinary(16) null,
    c_enum enum('a','b','c') null,
    c_set  set('a','b','c')  null,
    c_bit  bit(64)            null,
    constraint pk
        primary key (id)
);

insert into t (c_decimal, c_char, c_varchar, c_binary, c_varbinary, c_enum, c_set, c_bit)
values (123.456, "abc", "abc", "abc", "abc", 'a', 'a,b', b'1000001');

The output of TiCDC is as follows:

{
    "id": 0,
    ...
    "isDdl": false,
    "sqlType": {
        ...
    },
    "mysqlType": {
        "c_binary": "binary",
        "c_bit": "bit",
        "c_char": "char",
        "c_decimal": "decimal",
        "c_enum": "enum",
        "c_set": "set",
        "c_varbinary": "varbinary",
        "c_varchar": "varchar",
        "id": "int"
    },
    "data": [
        {
            ...
        }
    ],
    "old": null,
}

The output of the official Canal is as follows:

{
    "id": 0,
    ...
    "isDdl": false,
    "sqlType": {
        ...
    },
    "mysqlType": {
        "c_binary": "binary(16)",
        "c_bit": "bit(64)",
        "c_char": "char(16)",
        "c_decimal": "decimal(10, 4)",
        "c_enum": "enum('a','b','c')",
        "c_set": "set('a','b','c')",
        "c_varbinary": "varbinary(16)",
        "c_varchar": "varchar(16)",
        "id": "int"
    },
    "data": [
        {
            ...
        }
    ],
    "old": null,
}

Changes in TiCDC Canal-JSON

Changes in the Old field of the Delete events

From v5.4.0, the old field of the Delete events has changed.

The following is a Delete event message. Before v5.4.0, the old field contains the same content as the "data" field. In v5.4.0 and later versions, the old field is set to null. You can get the deleted data by using the "data" field.

{
    "id": 0,
    "database": "test",
    ...
    "type": "DELETE",
    ...
    "sqlType": {
        ...
    },
    "mysqlType": {
        ...
    },
    "data": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "0",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "0",
            "id": "2"
        }
    ],
    "old": null,
    // The following is an example before v5.4.0. The `old` field contains the same content as the "data" field.
    "old": [
        {
            "c_bigint": "9223372036854775807",
            "c_int": "0",
            "c_mediumint": "8388607",
            "c_smallint": "32767",
            "c_tinyint": "0",
            "id": "2"
        }
    ]
}