
Support sinking to relational databases via Debezium #235

Merged: 9 commits from debezium_fixes into master on Aug 9, 2023

Conversation

@mwylde (Member) commented Aug 8, 2023

While investigating #233, I noticed that although we have support for outputting Debezium-formatted JSON, a few issues prevented us from actually writing to a Postgres (or other RDBMS) sink via Debezium:

  • Debezium requires a Kafka Connect schema, but we don't support any of the ways of setting one (via Avro, the Confluent Schema Registry, or as an "embedded" schema in the message).
  • We don't have a way to emit timestamps that can be ingested by Debezium.

This PR addresses both issues: it allows users to specify that output JSON should have an embedded schema (currently only possible in SQL, by setting 'json.include_schema' = true on the connection table), and it properly emits millisecond Unix timestamps when the format is 'debezium_json'. This behavior can be overridden by setting the json.timestamp_format option.
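
For example, a connection table that emits Debezium-ready JSON with an embedded schema might look like the following. This is an illustrative sketch, not taken from the PR: the table name, topic, and server address are invented, and the 'unix_millis' value for json.timestamp_format is an assumption.

CREATE TABLE postgres_sink (
  id bigint,
  count bigint
) WITH (
  connector = 'kafka',
  bootstrap_servers = 'localhost:9092',
  type = 'sink',
  topic = 'postgres_sink',
  format = 'debezium_json',
  -- embed the Kafka Connect schema in each message
  'json.include_schema' = true,
  -- optional override of the timestamp encoding (value name assumed)
  'json.timestamp_format' = 'unix_millis'
);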

As part of addressing this, I also began a redesign of the format system, as the existing "serialization_mode" spec was too limited to express things like "embed_schema". This will let us emit more formats (like Avro and Protobuf) in a structured way in the future.

@mwylde mwylde requested a review from jacksonrnewhouse August 8, 2023 03:25
arroyo-api/src/connection_tables.rs
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct OperatorConfig {
Contributor:

This seems to be here so that it can be shared between arroyo-worker and arroyo-connectors. Is that right? I've put things in arroyo-types for similar reasons; should we standardize on one crate for that?

Member Author:

It's in arroyo-rpc instead of arroyo-types because it needs serde_json::Value which arroyo-types doesn't currently have, and ideally we don't add additional crate dependencies to arroyo-types.

Generally I think arroyo-rpc makes sense for values that are used in communication between the various parts of our system.

@@ -30,7 +30,6 @@ pub struct SqlSink {
#[derive(Clone, Debug)]
pub enum SinkUpdateType {
Allow,
Disallow,
Contributor:

The point of this was to reject updating SQL queries if the sink did not support updates. With this change, the following query writes Debezium-formatted records to Kafka:

CREATE TABLE kafka_raw_sink (
  sum bigint
) WITH (
  connector = 'kafka',
  bootstrap_servers = 'localhost:9092',
  type = 'sink',
  topic = 'raw_sink',
  format = 'json'
);
INSERT INTO kafka_raw_sink
SELECT Count(*) FROM nexmark;

Member Author:

I've re-added this logic so that we reject queries that insert updates into non-updating sinks, and added a test.
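
For illustration (a sketch, not part of the PR; the sink table name is invented), the updating query above would now be rejected, and would instead need to target a Debezium-formatted sink:

CREATE TABLE kafka_debezium_sink (
  sum bigint
) WITH (
  connector = 'kafka',
  bootstrap_servers = 'localhost:9092',
  type = 'sink',
  topic = 'raw_sink',
  -- 'debezium_json' makes this an updating sink, so the INSERT below is accepted
  format = 'debezium_json'
);
INSERT INTO kafka_debezium_sink
SELECT Count(*) FROM nexmark;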

@@ -341,11 +341,14 @@ pub fn parse_and_get_program_sync(
 operator: "GrpcSink::<#in_k, #in_t>".to_string(),
 config: "{}".to_string(),
 description: "WebSink".to_string(),
-serialization_mode: if insert.is_updating() {
-    arroyo_datastream::SerializationMode::DebeziumJson
+format: Some(if insert.is_updating() {
Contributor:

This would be cleaner as just

Some(Format::Json(JsonFormat {
    debezium: insert.is_updating(),
    ..Default::default()
}))

I think

@@ -150,7 +151,7 @@ impl StructDef {
 StructDef { name: None, fields }
 }

-pub fn generate_record_batch_builder(&self) -> TokenStream {
+pub fn generate_serializer_items(&self) -> TokenStream {
Contributor:

I think in the near future we should standardize how we decide which code needs to be generated in each place. Right now we get there through a number of different pieces of logic.

Member Author:

yeah

@@ -256,6 +278,7 @@ pub struct StructField {
 pub renamed_from: Option<String>,
 pub original_type: Option<String>,
 pub expression: Option<Box<Expression>>,
+pub format: Option<Arc<Format>>,
Contributor:

I don't like adding this just for the one serialization case, but it does mean that we have different struct hashes, so differently named structs.

Could we put it on the StructDef instead?

Member Author:

moved to StructDef

"protobuf" => return Err("protobuf is not yet supported".to_string()),
"avro" => return Err("avro is not yet supported".to_string()),
"raw_string" => return Err("raw_string is not yet supported".to_string()),
"parquet" => Format::Parquet(ParquetFormat {}),
Contributor:

It isn't clear to me where the line is between "format" settings and other settings. Should the compression used by parquet be a format setting?

Member Author:

Yes, I think the Parquet options should be moved in here, but I punted on that for now.

@mwylde mwylde force-pushed the debezium_fixes branch 2 times, most recently from eada29a to d58303a on August 8, 2023 22:39
@mwylde mwylde marked this pull request as ready for review August 8, 2023 22:39
@mwylde (Member Author) commented Aug 8, 2023

This is now ready for review

@mwylde mwylde enabled auto-merge (squash) August 9, 2023 00:12
@mwylde mwylde merged commit b114097 into master Aug 9, 2023