
docs(samples): Dataflow to Apache Iceberg with dynamic destinations #9645

Open
wants to merge 2 commits into base: main

Conversation

VeronicaWasson
Contributor

Description

Add snippet for Iceberg dynamic destinations

Relevant doc bug: b/371047621

Checklist

  • I have followed the Sample Format Guide
  • pom.xml parent set to latest shared-configuration
  • Appropriate changes to README are included in PR
  • These samples need a new API enabled in testing projects to pass (let us know which ones)
  • These samples need new/updated env vars in testing projects set to pass (let us know which ones)
  • Tests pass: mvn clean verify required
  • Lint passes: mvn -P lint checkstyle:check required
  • Static Analysis: mvn -P lint clean compile pmd:cpd-check spotbugs:check advisory only
  • This sample adds a new sample directory, and I updated the CODEOWNERS file with the codeowners for this sample
  • This sample adds a new Product API, and I updated the Blunderbuss issue/PR auto-assigner with the codeowners for this sample
  • Please merge this PR for me once it is approved

@VeronicaWasson VeronicaWasson requested review from yoshi-approver and a team as code owners November 4, 2024 23:06
@product-auto-label product-auto-label bot added the samples (Issues that are directly related to samples) and api: dataflow (Issues related to the Dataflow API) labels on Nov 4, 2024

snippet-bot bot commented Nov 4, 2024

Here is the summary of changes.

You are about to add 2 region tags.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add the snippet-bot:force-run label.

@iennae
Contributor

iennae commented Jan 16, 2025

@code-review-assist review

Contributor

@code-review-assist code-review-assist bot left a comment

This pull request adds a snippet demonstrating dynamic destinations with Apache Iceberg in Dataflow and refactors the integration tests. The changes are well-structured and address a relevant documentation bug. Here are a few suggestions based on the Google Java Style Guide, illustrated in the sketch after this list:

  • Use diamond operators for type inference.
  • Declare Map and ImmutableMap using their interfaces.
  • Consider using List.of() for creating immutable lists when possible.
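
For illustration, here is a hypothetical before/after sketch of those three suggestions; the variable names are invented for this sketch and are not taken from the PR's diff.

```java
import com.google.common.collect.ImmutableMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StyleSuggestionsSketch {

  void before() {
    // Repeated type arguments and concrete types on the left-hand side.
    Map<String, Object> catalogConfig = new HashMap<String, Object>();
    ImmutableMap<String, Object> config = ImmutableMap.of("table", "flights");
    List<String> airports = new ArrayList<String>(Arrays.asList("ORD", "SYD"));
  }

  void after() {
    // Diamond operator, interface-typed declarations, and List.of() for immutable lists.
    Map<String, Object> catalogConfig = new HashMap<>();
    Map<String, Object> config = ImmutableMap.of("table", "flights");
    List<String> airports = List.of("ORD", "SYD");
  }
}
```

Declaring against Map keeps call sites decoupled from the Guava implementation, and List.of() states the immutability intent directly.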

@iennae
Contributor

iennae commented Jan 16, 2025

@code-review-assist walkthrough

code-review-assist bot left a comment

Dataflow to Apache Iceberg Samples Walkthrough

This repository contains Java code samples demonstrating how to interact with Apache Iceberg tables using Apache Beam and Google Cloud Dataflow. The primary goal is to provide clear, concise examples for developers to integrate Dataflow pipelines with Iceberg for data ingestion and processing.

Repository Structure

The repository is structured as follows:

  • dataflow/snippets/: This directory contains the core Dataflow pipeline code samples. This is the main focus of this walkthrough and the pull request.
    • pom.xml: This file manages project dependencies. The pull request updates the Apache Beam version to 2.60.0, enabling the use of dynamic destinations, which is a key feature of this PR. This is a crucial change as it unlocks new functionality.
    • src/main/java/: Contains the Java source code for the Dataflow pipelines.
      • com.example.dataflow.*: This package contains the pipeline implementations. The ApacheIcebergDynamicDestinations class is the primary addition in this PR, demonstrating the use of dynamic destinations in Iceberg.
    • src/test/java/: Contains the integration tests for the pipeline code. The pull request refactors these tests to handle multiple destination tables, making them more robust and adaptable to the dynamic destination feature.

Code Walkthrough

Let's trace the execution flow of the ApacheIcebergDynamicDestinations pipeline (the main addition in this PR); a condensed sketch follows the list:

  1. Pipeline Creation: The createPipeline method in ApacheIcebergDynamicDestinations.java creates a new Apache Beam pipeline. The pipeline options (warehouse location and catalog name) are passed in from the command line.
  2. Iceberg Configuration: The pipeline configures the Iceberg I/O connector using Managed.write(Managed.ICEBERG). Crucially, it sets the table property to "flights-{airport}". This utilizes the dynamic destination feature (introduced in Beam 2.60) allowing the pipeline to write to different Iceberg tables based on the airport field in the input data. This is the core functionality added by this PR.
  3. Data Ingestion: The pipeline ingests sample data using Create.of(TABLE_ROWS). This data is in JSON format and represents flight information (id, name, airport).
  4. Schema Conversion: JsonToRow.withSchema(SCHEMA) converts the JSON data into Beam Row objects, aligning with the Iceberg schema.
  5. Data Writing: Finally, Managed.write(Managed.ICEBERG).withConfig(config) writes the data to the specified Iceberg tables. The dynamic destination configuration ensures that records are routed to the appropriate table based on the airport value.
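
As a reading aid, here is a condensed, hypothetical sketch of that flow, modeled on the documented Managed Iceberg write pattern. The sample rows, schema fields, catalog name, and local Hadoop warehouse settings are assumptions for illustration, not the PR's actual source.

```java
import com.google.common.collect.ImmutableMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.JsonToRow;

public class IcebergDynamicDestinationsSketch {

  // Sample flight records as JSON strings (invented values).
  static final List<String> TABLE_ROWS = Arrays.asList(
      "{\"id\": 0, \"name\": \"flight-0\", \"airport\": \"ORD\"}",
      "{\"id\": 1, \"name\": \"flight-1\", \"airport\": \"SYD\"}",
      "{\"id\": 2, \"name\": \"flight-2\", \"airport\": \"ORD\"}");

  // Beam schema matching the JSON fields.
  static final Schema SCHEMA = Schema.builder()
      .addInt64Field("id")
      .addStringField("name")
      .addStringField("airport")
      .build();

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Catalog settings; in the real sample these come from pipeline options.
    Map<String, Object> catalogProps = ImmutableMap.of(
        "type", "hadoop",
        "warehouse", "file:///tmp/warehouse");

    // The "table" template routes each record by its "airport" field value,
    // e.g. to tables named flights-ORD and flights-SYD.
    Map<String, Object> config = ImmutableMap.of(
        "catalog_name", "local",
        "catalog_properties", catalogProps,
        "table", "flights-{airport}");

    pipeline
        .apply("CreateJson", Create.of(TABLE_ROWS))             // step 3: ingest sample data
        .apply("JsonToRow", JsonToRow.withSchema(SCHEMA))       // step 4: convert to Beam Rows
        .apply("WriteIceberg",
            Managed.write(Managed.ICEBERG).withConfig(config)); // steps 2 and 5: dynamic Iceberg write

    pipeline.run().waitUntilFinish();
  }
}
```

With the table template set to flights-{airport}, records for ORD and SYD land in separate Iceberg tables, which is exactly what the dynamic destinations feature provides.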

Integration Tests

The integration tests in ApacheIcebergIT.java verify the functionality of the pipelines. The pull request enhances these tests in the following ways (a hedged verification sketch appears after the list):

  • Handling Multiple Tables: The tests now create and verify multiple Iceberg tables, reflecting the dynamic destination capability. The previous version assumed only one table, which is no longer the case with the new dynamic destination feature.
  • Improved Assertions: The tests use more precise assertions to check for the presence of specific records in the correct tables, ensuring comprehensive validation.
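
To make the multiple-table verification concrete, here is a hypothetical check using the Iceberg data API. The warehouse path, namespace, table names, and expected values are invented for this sketch; it is not the PR's test code.

```java
import static com.google.common.truth.Truth.assertThat;

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.CloseableIterable;

public class DynamicDestinationsVerificationSketch {

  // Scan one per-airport table and assert that an expected "name" value was written to it.
  static void assertTableContainsName(String warehouse, String tableName, String expectedName)
      throws Exception {
    HadoopCatalog catalog = new HadoopCatalog(new Configuration(), warehouse);
    // Namespace "default" is an assumption; the real test may use a different identifier.
    Table table = catalog.loadTable(TableIdentifier.of("default", tableName));

    List<String> names = new ArrayList<>();
    try (CloseableIterable<Record> records = IcebergGenerics.read(table).build()) {
      for (Record record : records) {
        names.add((String) record.getField("name"));
      }
    }
    assertThat(names).contains(expectedName);
  }

  public static void main(String[] args) throws Exception {
    // One check per dynamic destination, e.g. flights-ORD and flights-SYD.
    assertTableContainsName("file:///tmp/warehouse", "flights-ORD", "flight-0");
    assertTableContainsName("file:///tmp/warehouse", "flights-SYD", "flight-1");
  }
}
```

Running one assertion per expected destination table is what distinguishes these tests from the earlier single-table version.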

Pull Request Impact

This pull request introduces a significant enhancement by adding support for dynamic destinations in Apache Iceberg. The updated pom.xml ensures compatibility with the required Beam version. The new ApacheIcebergDynamicDestinations sample demonstrates this feature, and the improved integration tests validate its correctness and robustness. The changes are well-integrated and enhance the overall functionality and usability of the samples.

Potential Improvements

While the PR is well-structured, adding more comprehensive error handling and logging could further improve its robustness. Consider adding checks for null or empty values in the input data and handling potential exceptions during table creation or data writing. More detailed logging would aid in debugging and troubleshooting.
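
As one possible shape for that suggestion, the following hypothetical sketch drops null or blank input records before schema conversion and adds a log statement; the transform names and the field check are invented here and are not part of the PR.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class InputValidationSketch {

  private static final Logger LOG = LoggerFactory.getLogger(InputValidationSketch.class);

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    PCollection<String> rawRows = pipeline.apply("CreateJson",
        Create.of(Arrays.asList(
            "{\"id\": 0, \"name\": \"flight-0\", \"airport\": \"ORD\"}",
            "",                                   // empty record: should be dropped
            "{\"id\": 1, \"name\": \"flight-1\", \"airport\": \"SYD\"}")));

    // Keep only non-blank records that carry the routing field; the downstream
    // JsonToRow and Managed Iceberg write stages are omitted from this sketch.
    rawRows.apply("DropInvalid",
        Filter.by((String json) ->
            json != null && !json.trim().isEmpty() && json.contains("\"airport\"")));

    LOG.info("Validation stage attached ahead of the Iceberg write.");
    pipeline.run().waitUntilFinish();
  }
}
```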
