[FLINK-36611][pipeline-connector][kafka] Add schema info to output of Kafka sink #3791

Draft · wants to merge 7 commits into master
Conversation

@MOBIN-F (Contributor) commented Dec 11, 2024

Currently, the output of the Kafka sink in debezium-json format looks like this:

{
  "before": {
    "id": 4,
    "name": "John",
    "address": "New York",
    "phone_number": "2222",
    "age": 12
  },
  "after": {
    "id": 4,
    "name": "John",
    "address": "New York",
    "phone_number": "1234",
    "age": 12
  },
  "op": "u",
  "source": {
    "db": null,
    "table": "customers"
  }
} 

It contains the full before/after record data and database info, but no schema info.

However, in some scenarios we need this information to determine the data types. For example, Paimon's Kafka CDC source requires the type information; otherwise every column is treated as String (see https://paimon.apache.org/docs/0.9/flink/cdc-ingestion/kafka-cdc/#supported-formats).

Since including the schema increases the payload size, I suggest adding a configuration option to enable it.
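For illustration, with such an option enabled the record above could carry an extra top-level "schema" node describing the column types. The layout below is only a sketch following the Kafka Connect JSON schema conventions (field/type/optional), not the exact output of this PR:

{
  "before": { "id": 4, "name": "John" },
  "after": { "id": 4, "name": "John" },
  "op": "u",
  "source": { "db": null, "table": "customers" },
  "schema": {
    "type": "struct",
    "optional": false,
    "fields": [
      {
        "field": "before",
        "type": "struct",
        "optional": true,
        "fields": [
          { "field": "id", "type": "int32", "optional": false },
          { "field": "name", "type": "string", "optional": true }
        ]
      },
      {
        "field": "after",
        "type": "struct",
        "optional": true,
        "fields": [
          { "field": "id", "type": "int32", "optional": false },
          { "field": "name", "type": "string", "optional": true }
        ]
      }
    ]
  }
}

A downstream consumer such as Paimon's Kafka CDC source could then derive column types from this node instead of defaulting everything to String.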

        after,
        meta,
        extractBeforeAndAfterSchema(
                jsonConverter.asJsonSchema(valueSchema))))
@MOBIN-F (Contributor, Author) commented:
Through the jsonConverter.asJsonSchema(valueSchema) method, we can easily obtain the complete debezium-json schema as JSON.
At first I wanted to convert the schema JSON to GenericRowData, but the Debezium schema structure is too complex to map that way.
In the end I chose to pass the schema JSON downstream as a string, which avoids some serialization and deserialization overhead.
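As a minimal, self-contained sketch of that approach (assuming Kafka Connect's connect-json artifact is on the classpath and that its JsonConverter still exposes the public asJsonSchema method used above; the class name and toy field layout here are illustrative, not from this PR):

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.json.JsonConverter;

import com.fasterxml.jackson.databind.node.ObjectNode;

import java.util.Collections;

public class SchemaJsonSketch {
    public static void main(String[] args) {
        JsonConverter jsonConverter = new JsonConverter();
        // Configure as a value converter; schemas.enable makes it schema-aware.
        jsonConverter.configure(Collections.singletonMap("schemas.enable", "true"), false);

        // A toy value schema standing in for the Debezium envelope schema.
        Schema valueSchema = SchemaBuilder.struct()
                .field("id", Schema.INT32_SCHEMA)
                .field("name", Schema.OPTIONAL_STRING_SCHEMA)
                .build();

        // Render the Connect schema as a JSON tree, e.g.
        // {"type":"struct","fields":[{"field":"id","type":"int32",...}],...}
        ObjectNode schemaJson = jsonConverter.asJsonSchema(valueSchema);
        System.out.println(schemaJson);
    }
}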

// the schema arrives as an escaped JSON string (e.g. \"), so re-parse it into a JsonNode
String schemaValue = node.get("schema").asText();
JsonNode schemaNode = mapper.readTree(schemaValue);
node.set("schema", schemaNode);
@MOBIN-F (Contributor, Author) commented:
Because the schema is passed downstream as a string that itself contains nested JSON, putting the raw string into a JsonNode would leave escaped quotes (\") in the serialized output. JsonNode.asText() returns the unescaped string, which can then be re-parsed into a JsonNode, solving this problem cleanly.
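A small Jackson-only illustration of the problem and the fix (class name and schema content here are made up for the demo):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class SchemaUnescapeSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode node = mapper.createObjectNode();

        // The schema arrives downstream as a plain string of JSON.
        node.put("schema", "{\"type\":\"struct\",\"optional\":false}");

        // Serialized as a string value, the quotes are escaped:
        // {"schema":"{\"type\":\"struct\",\"optional\":false}"}
        System.out.println(mapper.writeValueAsString(node));

        // asText() returns the raw string; re-parsing it and setting it back
        // turns the value into real nested JSON.
        JsonNode schemaNode = mapper.readTree(node.get("schema").asText());
        node.set("schema", schemaNode);

        // Now: {"schema":{"type":"struct","optional":false}}
        System.out.println(mapper.writeValueAsString(node));
    }
}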

…ezium-json-include-schema

# Conflicts:
#	flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/test/java/org/apache/flink/cdc/connectors/base/MySqlSourceMetricsTest.java
#	flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/test/java/org/apache/flink/cdc/connectors/base/source/MySqlEventDeserializer.java