Support sinking to relational databases via Debezium #235
Conversation
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct OperatorConfig {
This seems to be here so that it can be shared between arroyo-worker and arroyo-connectors. Is that right? I've put things in arroyo-types for similar reasons; should we standardize on one crate for that?
It's in arroyo-rpc instead of arroyo-types because it needs serde_json::Value, which arroyo-types doesn't currently have, and ideally we wouldn't add additional crate dependencies to arroyo-types. Generally, I think arroyo-rpc makes sense for values that are used in communication between the various parts of our system.
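For context, a minimal sketch of the shape under discussion (the derive line comes from the diff above; the field names are illustrative assumptions, with the point being the serde_json::Value dependency):

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct OperatorConfig {
    // Connector- and table-level settings carried as free-form JSON;
    // these fields are why the struct needs serde_json::Value, and
    // thus why it lives in arroyo-rpc rather than arroyo-types.
    pub connection: serde_json::Value,
    pub table: serde_json::Value,
}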
@@ -30,7 +30,6 @@ pub struct SqlSink {
 #[derive(Clone, Debug)]
 pub enum SinkUpdateType {
     Allow,
-    Disallow,
The point of this was to reject updating SQL queries if the sink did not support updates. With this change, the following query writes Debezium-formatted records to Kafka:
CREATE TABLE kafka_raw_sink (
sum bigint
) WITH (
connector = 'kafka',
bootstrap_servers = 'localhost:9092',
type = 'sink',
topic = 'raw_sink',
format = 'json'
);
INSERT INTO kafka_raw_sink
SELECT Count(*) FROM nexmark;
I've re-added this logic so that we reject queries that insert updates into non-updating sinks, and added a test.
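Roughly along these lines, as a sketch (the function signature and field names here are illustrative, not the actual code):

fn validate_sink(sink: &SqlSink, insert_is_updating: bool) -> Result<(), String> {
    // Updating queries (e.g. aggregates over unbounded input) may only
    // be written to sinks that accept updates.
    if insert_is_updating && matches!(sink.update_type, SinkUpdateType::Disallow) {
        return Err("cannot write an updating query to a non-updating sink".to_string());
    }
    Ok(())
}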
arroyo-sql/src/lib.rs
@@ -341,11 +341,14 @@ pub fn parse_and_get_program_sync(
         operator: "GrpcSink::<#in_k, #in_t>".to_string(),
         config: "{}".to_string(),
         description: "WebSink".to_string(),
-        serialization_mode: if insert.is_updating() {
-            arroyo_datastream::SerializationMode::DebeziumJson
+        format: Some(if insert.is_updating() {
I think this would be cleaner as just:

Some(Format::Json(JsonFormat {
    debezium: insert.is_updating(),
    ..Default::default()
}))
@@ -150,7 +151,7 @@ impl StructDef {
         StructDef { name: None, fields }
     }

-    pub fn generate_record_batch_builder(&self) -> TokenStream {
+    pub fn generate_serializer_items(&self) -> TokenStream {
I think in the near future we should standardize how we decide which code needs to be generated in each place. Right now we get there through a number of different pieces of logic.
yeah
arroyo-sql/src/types.rs
@@ -256,6 +278,7 @@ pub struct StructField {
     pub renamed_from: Option<String>,
     pub original_type: Option<String>,
     pub expression: Option<Box<Expression>>,
+    pub format: Option<Arc<Format>>,
I don't like adding this just for the one serialization case, but it does mean we end up with different struct hashes, and therefore differently named structs. Could we put it on the StructDef instead?
moved to StructDef
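So the field now hangs off the struct definition, roughly like this (a sketch inferred from the surrounding diffs; other fields elided):

use std::sync::Arc;

pub struct StructDef {
    pub name: Option<String>,
    pub fields: Vec<StructField>,
    // Moved here from StructField (see the discussion above):
    pub format: Option<Arc<Format>>,
}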
"protobuf" => return Err("protobuf is not yet supported".to_string()), | ||
"avro" => return Err("avro is not yet supported".to_string()), | ||
"raw_string" => return Err("raw_string is not yet supported".to_string()), | ||
"parquet" => Format::Parquet(ParquetFormat {}), |
It isn't clear to me where the line is between "format" settings and other settings. Should the compression used by parquet be a format setting?
Yes, I think the parquet settings should be moved in here, but I punted on that for now.
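For illustration, that might eventually look something like this (purely hypothetical; neither the field nor the enum exists in this PR):

// Hypothetical sketch: folding parquet-specific settings into the
// format system rather than keeping them as separate options.
pub enum ParquetCompression {
    None,
    Snappy,
    Zstd,
}

pub struct ParquetFormat {
    pub compression: ParquetCompression,
}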
Force-pushed from eada29a to d58303a
This is now ready for review.
While investigating #233, I noticed that although we have support for outputting Debezium-formatted JSON, two issues prevented us from actually writing to a Postgres (or other RDBMS) sink via Debezium: the output JSON could not embed its schema, and datetimes were not emitted as millisecond unix timestamps.
This PR addresses both issues: it allows users to specify that output JSON should have an embedded schema (currently only possible in SQL, by setting 'json.include_schema' = true on the connection table), and it properly emits datetimes as millisecond unix timestamps when the format is "debezium_json". This behavior can be overridden by setting the 'json.timestamp_format' option as well.

As part of addressing this, I also began a redesign of the format system, as the existing "serialization_mode" spec was too limiting to express things like "embed_schema". This redesign will let us emit more formats (like Avro and Protobuf) in a structured way in the future.
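For example, mirroring the Kafka sink from the review thread above (the topic name and schema are illustrative):

CREATE TABLE kafka_debezium_sink (
    sum bigint
) WITH (
    connector = 'kafka',
    bootstrap_servers = 'localhost:9092',
    type = 'sink',
    topic = 'debezium_sink',
    format = 'debezium_json',
    'json.include_schema' = true
);
INSERT INTO kafka_debezium_sink
SELECT count(*) FROM nexmark;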