Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: validate airbyte metadata added to iceberg schema #48604

Merged
merged 6 commits into from
Nov 22, 2024

Conversation

jdpgrailsdev
Copy link
Contributor

@jdpgrailsdev jdpgrailsdev commented Nov 21, 2024

What

  • Validate that Airbyte metadata columns are added to the Iceberg schema
  • Actually detect insert vs update operation

How

  • Add tests to verify that Airbyte metadata columns are added to the Iceberg schema
  • Enable CDK tests to validate metadata is added to a schema when withMetadata() is called
  • Convert IcebergUtil to a Micronaut singleton to better facilitate testing
  • Detect insert/update when converting to Iceberg record

Review guide

  1. IcebergUtil.kt
  2. IcebergUtilTest.kt
  3. IcebergV2WriterTest.kt
  4. AirbyteTypeToAirbyteTypeWithMetaTest.kt

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@jdpgrailsdev jdpgrailsdev requested a review from a team as a code owner November 21, 2024 19:27
Copy link

vercel bot commented Nov 21, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 22, 2024 6:03pm

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/destination/iceberg-v2 labels Nov 21, 2024
@edgao
Copy link
Contributor

edgao commented Nov 21, 2024

}
}

private fun createAirbyteMetadataFields(): List<NestedField> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these columns are being added to the schema using the mapper architecture. if you take a look at

val schema = pipeline.finalSchema.withAirbyteMeta(true).toIcebergSchema(primaryKeys)

In IcebergV2Writer. The withAirbyteMeta method is adding a bunch of em. If we want to add more like the generation id , etc. we should use the same design.

@jdpgrailsdev
Copy link
Contributor Author

this looks a lot like https://github.com/airbytehq/airbyte/blob/master/airbyte-cdk/bulk/core/load/src/main/kotlin/io/airbyte/cdk/load/data/AirbyteTypeToAirbyteTypeWithMeta.kt ? which we can maybe reuse in some way

@edgao It does and I wasn't sure how to reuse that given that it returns Airbyte types and not Iceberg types. I suspect what is really being pointed out here is some refactor to this in the CDK to make the common parts more common.

@jdpgrailsdev jdpgrailsdev changed the title feat: iceberg schema airbyte metadata feat: validate airbyte metadata added to iceberg schema Nov 22, 2024
@jdpgrailsdev
Copy link
Contributor Author

PR has been updated to validate that the metadata is added to the schema. I have removed the code that I originally added to do this, as the existing logic already adds the fields. This is confirmed via new tests added to this PR.

Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just a small question about the deleted_at detection

// the deletion status. This should be revisited when there is a cheaper way to do
// this, such
// as after a protocol change that explicitly states the operation.
// if (record.data.toJson().get(AIRBYTE_CDC_DELETE_COLUMN) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do something like this to dodge the serialize?

(record.data as ObjectValue).properties[DELETE_COLUMN] != null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edgao Sure? I'll test it out and update the PR if it works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edgao That appeared to work. I will update the PR to include the DELETE operation detection based on your suggestion.

@jdpgrailsdev jdpgrailsdev merged commit 5db1bb1 into master Nov 22, 2024
35 checks passed
@jdpgrailsdev jdpgrailsdev deleted the jonathan/iceberg-schema-airbyte-metadata branch November 22, 2024 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues CDK Connector Development Kit connectors/destination/iceberg-v2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants