Kafka Connect: Initial project setup and event data structures #8701
Conversation
Very happy to see this contribution 👍 I just skimmed the files. Will definitely go through design docs and review again.
Force-pushed from f6ffd07 to a1354cc (compare)
Force-pushed from a1354cc to ef0aa7f (compare)
@@ -200,3 +202,6 @@ if (JavaVersion.current() == JavaVersion.VERSION_1_8) {
}
}

include ":iceberg-kafka-connect:kafka-connect-events"
Spark/Flink/Hive require that we support multiple versions. Is the KC API stable enough that we don't have to worry about supporting different major/minor versions? I see the 3.x line goes back to 2021, but there are six minor releases. Just wondering if we should structure the project with versioning in mind from the start.
For the most part, the API has been very stable, so I was thinking of not doing this to start; it might be overkill.
Do we need so many modules? If we're just creating a single jar in the end, I wonder if it is helpful to break it up.
The main reason I have events as a separate module is so that it can be used independently from the sink to read and deserialize events from the control topic. This can be used, for example, to trigger workflows from the control topic rather than having to poll the table metadata. With that, the plan was to have two modules (events, core) plus a runtime module.
private Long vtts;
private final Schema avroSchema;

private static final Schema AVRO_SCHEMA =
We typically prefer to define Iceberg schemas with field IDs and convert them to Avro schemas. I think those schemas are easier to read and it ensures that we assign field IDs.
+1 for using the Iceberg schema definition with field IDs and converting them to Avro.
I do agree it is more readable. However, this will require changes to the Avro encoder. For example, both GenericDataFile and GenericDeleteFile use the same Iceberg schema. The current schema converter only allows a one-to-one mapping of struct type to class name, so you can't include both data files and delete files in the same event.
Also, it becomes cumbersome to redefine the same struct-to-class mapping for every schema, e.g. the TopicPartitionOffset mapping must be defined for both payloads that use it as well as for the event container, when converting the schema to Avro.
Another downside is the extra overhead of doing the conversion. Schemas are sometimes constructed for each event, as the schema changes depending on the table and partitioning. Though that is more minor and could be solved with caching.
I did update the field IDs so they aren't all -1.
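For illustration, a minimal sketch of the approach being discussed: define the payload schema as an Iceberg schema with explicit field IDs and convert it to Avro via AvroSchemaUtil, rather than hand-writing the Avro schema. The field names, IDs, and record name below are placeholders, not the connector's actual definitions.

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.avro.AvroSchemaUtil;
import org.apache.iceberg.types.Types;

public class PayloadSchemaSketch {
  // Iceberg schema with explicit field IDs (illustrative IDs/names only)
  static final Schema ICEBERG_SCHEMA =
      new Schema(
          Types.NestedField.required(10_100, "commit_id", Types.UUIDType.get()),
          Types.NestedField.optional(10_101, "vtts", Types.TimestampType.withZone()));

  // Converted once to an Avro schema for serialization on the control topic
  static final org.apache.avro.Schema AVRO_SCHEMA =
      AvroSchemaUtil.convert(ICEBERG_SCHEMA, "commit_complete_payload");

  private PayloadSchemaSketch() {}
}
```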
import org.apache.avro.specific.SpecificData.SchemaConstructable;
import org.apache.iceberg.avro.AvroSchemaUtil;

public interface Element extends IndexedRecord, SchemaConstructable {
Rather than using SchemaConstructable, Iceberg just checks for a Schema constructor first.
Thanks, I removed SchemaConstructable.
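As a rough illustration of that pattern (the class and field names here are hypothetical, not the actual connector classes), a payload type can expose a constructor that takes the Avro Schema, which Iceberg's Avro readers can look for when instantiating records:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;

public class ExamplePayload implements IndexedRecord {
  private final Schema avroSchema;
  private Long value;

  // Schema constructor that a reader can discover reflectively
  public ExamplePayload(Schema avroSchema) {
    this.avroSchema = avroSchema;
  }

  @Override
  public void put(int i, Object v) {
    if (i == 0) {
      this.value = (Long) v;
    } else {
      throw new UnsupportedOperationException("Unknown field index: " + i);
    }
  }

  @Override
  public Object get(int i) {
    if (i == 0) {
      return value;
    }
    throw new UnsupportedOperationException("Unknown field index: " + i);
  }

  @Override
  public Schema getSchema() {
    return avroSchema;
  }
}
```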
public class EventTestUtil {
public static DataFile createDataFile() {
Ctor<DataFile> ctor =
DynConstructors.builder(DataFile.class)
Can't this use DataFiles to build a data file instead of using reflection?
this is done
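For reference, a minimal sketch of what a builder-based test helper could look like; the class name, path, format, and counts below are placeholder values, not the actual test code.

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionSpec;

public class EventTestUtilSketch {
  // Build a DataFile via the public builder instead of reflection
  public static DataFile createDataFile() {
    return DataFiles.builder(PartitionSpec.unpartitioned())
        .withPath("/tmp/data-file.parquet") // placeholder path
        .withFormat(FileFormat.PARQUET)
        .withFileSizeInBytes(100L)
        .withRecordCount(5L)
        .build();
  }

  private EventTestUtilSketch() {}
}
```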
return commitId;
}

public Long vtts() {
Can you please add a comment explaining this field? I can understand it is some timestamp, but it's not clear from the abbreviation.
I think we can also add an events.md doc with this PR now.
https://github.com/tabular-io/iceberg-kafka-connect/blob/main/docs/events.md
We can add this info as javadoc:
VTTS (valid-through timestamp) property indicating through what timestamp records have been fully processed, i.e. all records processed from then on will have a timestamp greater than the VTTS. This is calculated by taking the maximum timestamp of records processed from each topic partition, and taking the minimum of these. If any partitions were not processed as part of the commit, then the VTTS is not set.
I added a javadoc for this.
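To make the definition above concrete, here is a small, hypothetical sketch of how a valid-through timestamp could be derived; the method and variable names are illustrative, not the connector's actual coordinator logic.

```java
import java.util.Collection;

final class VttsSketch {
  /**
   * Returns the valid-through timestamp: the minimum, across topic partitions, of the maximum
   * record timestamp processed for each partition. Returns null (unset) if any partition was
   * not processed as part of the commit.
   */
  static Long validThroughTs(Collection<Long> maxTimestampPerPartition, boolean allPartitionsProcessed) {
    if (!allPartitionsProcessed || maxTimestampPerPartition.isEmpty()) {
      return null;
    }
    return maxTimestampPerPartition.stream().min(Long::compareTo).orElse(null);
  }

  private VttsSketch() {}
}
```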
import org.apache.iceberg.types.Types.StructType;
import org.junit.jupiter.api.Test;

public class EventSerializationTest {
nit: The majority of test cases in Iceberg start with a Test prefix, so maybe we can rename it.
It is pretty mixed actually, and I much prefer the specificity of putting the class name first.
ByteBuffer.wrap(new byte[] {0}));
}

private EventTestUtil() {}
nit: Can we move the constructor up?
done
Force-pushed from ef0aa7f to b4d413d (compare)
/**
 * A control event payload for events sent by a coordinator that indicates it has completed a commit
* cycle. Events with this payload are not consumed by the sink, they * are informational and can be |
Suggested change:
- * cycle. Events with this payload are not consumed by the sink, they * are informational and can be
+ * cycle. Events with this payload are not consumed by the sink, they are informational and can be
Thanks for catching this, I fixed it.
public class CommitCompletePayload implements Payload {

private UUID commitId;
private Long vtts;
The name is somewhat cryptic; would it make sense to rename this to validThroughTimestamp?
Sure, I updated this to validThroughTs.
Force-pushed from 5e711b1 to 8b5d469 (compare)
Force-pushed from a1daf94 to ff5d269 (compare)
I'm a +1 on moving forward with this. I think there might still be an open question about Iceberg/Avro Schema definitions, but I'm fine with either resolution.
Force-pushed from 1a3fde1 to 9fed273 (compare)
LGTM.
Can we take this forward?
super((id, struct) -> names.get(struct));
}

Map<Type, Schema> getConversionMap() {
Suggested change:
- Map<Type, Schema> getConversionMap() {
+ Map<Type, Schema> conversionMap() {
Force-pushed from f6e7bbd to f33aa77 (compare)
Since there are no further comments, I'll go ahead and merge this. I would like to express my gratitude to @bryanck for working on this, since it will help so many people in the Kafka community get their data into Iceberg in a fast and reliable way! 🙏 Thanks @ajantha-bhat, @danielcweeks, @rdblue, @jbonofre, and @nastra for the review 🚀
@Fokko awesome, thanks!
Awesome! Thanks all for the feedback and guidance. I'll follow up with PRs for the actual sink portion.
We (Tabular) would like to submit our Iceberg Kafka Connect sink connector to the Iceberg project. Kafka Connect is a popular, efficient, and easy-to-use framework for reading from and writing to Kafka. This sink gives Iceberg users another option for landing data from Kafka into Iceberg tables. Having the backing of the Iceberg community will help it evolve and improve over time.
The sink codebase is on the larger side, so the thought was to break the submission into different PRs to make it easier to review. This initial PR includes the starting build setup and the project for the Avro event data structures.
The original repo can be found at https://github.com/tabular-io/iceberg-kafka-connect. Some design docs can be found in the docs directory, including an explanation of what the events are used for and why Avro was chosen for serialization. The events were put in a separate project so the library can be used independently to read messages from the control topic outside of the connector, for debugging or notification purposes.
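As an illustration of that standalone use case, here is a minimal sketch of a plain Kafka consumer reading the control topic. The topic name is a placeholder, and the commented-out Event.decode call stands in for whatever deserialization entry point the events library exposes; it is an assumption for illustration, not a confirmed API.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ControlTopicWatcher {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "control-topic-watcher");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("iceberg-control")); // placeholder topic name
      while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<byte[], byte[]> record : records) {
          // Deserialize with the events library, e.g. a hypothetical Event.decode(record.value()),
          // then trigger downstream workflows (notifications, table maintenance, etc.).
          System.out.printf("control event: %d bytes at offset %d%n",
              record.value().length, record.offset());
        }
      }
    }
  }
}
```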