Support Iceberg ingestion from REST based catalogs #17124
Conversation
Looks good overall, thanks! A few suggestions.
@@ -31,7 +31,7 @@ Iceberg refers to these metastores as catalogs. The Iceberg extension lets you c
* Hive metastore catalog
* Local catalog
Suggested change:
* Local catalog
* REST-based catalog
    @JacksonInject @HiveConf Configuration configuration
)
{
  this.catalogUri = Preconditions.checkNotNull(catalogUri, "catalogUri cannot be null");
consider using InvalidInput.exception()
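A minimal sketch of that suggestion, assuming Druid's `org.apache.druid.error.InvalidInput` helper; the class name is hypothetical and mirrors the constructor under review:

import org.apache.druid.error.InvalidInput;

public class RestIcebergCatalog
{
  private final String catalogUri;

  public RestIcebergCatalog(String catalogUri)
  {
    if (catalogUri == null) {
      // Surfaces a user-facing DruidException instead of the bare
      // NullPointerException thrown by Preconditions.checkNotNull.
      throw InvalidInput.exception("catalogUri cannot be null");
    }
    this.catalogUri = catalogUri;
  }
}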
docs/ingestion/input-sources.md (Outdated)
### Iceberg filter object

This input source provides the following filters: `and`, `equals`, `interval`, and `or`. You can use these filters to filter out data files from a snapshot, reducing the number of files Druid has to ingest.
If the filter column is not an Iceberg partition column, it is highly recommended to define an additional filter in the [`transformSpec`](./ingestion-spec.md#transformspec). This is because for non-partition columns, Iceberg filters may return rows that do not match the expression.
This filtering behavior applies to Delta Lake as well. I think typically, in the Lakehouse world, filtering is performed on table partition columns. Filtering on non-partitioned columns is best-effort.
I'm not sure if `transformSpec` would fully guarantee additional filtering in all scenarios. Perhaps for these docs, we can:
- Highly recommend filtering on Iceberg partitioned columns
- If filtering on non-partitioned columns, call out that it's best-effort and recommend using additional filters by defining them in a `transformSpec` if applicable, as in the sketch below
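For illustration, a minimal sketch of pairing the two filters; the column names `event_time` and `__time` and the interval values are hypothetical, and the `icebergFilter` property name follows the existing input source docs:

{
  "ioConfig": {
    "inputSource": {
      "type": "iceberg",
      "icebergFilter": {
        "type": "interval",
        "filterColumn": "event_time",
        "intervals": ["2023-01-01T00:00:00Z/2023-02-01T00:00:00Z"]
      }
    }
  },
  "dataSchema": {
    "transformSpec": {
      "filter": {
        "type": "interval",
        "dimension": "__time",
        "intervals": ["2023-01-01/2023-02-01"]
      }
    }
  }
}

If `event_time` is not an Iceberg partition column, the `icebergFilter` prunes data files best-effort, and the `transformSpec` filter then drops any leftover non-matching rows at ingestion time.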
Thanks for the comments.
Curious, have you encountered any cases where the `transformSpec` doesn't fully guarantee additional filtering on top of the Delta Lake filters? Transform spec filters have worked out well for us in this case, so I'm keen to understand if there are any gotchas.
Ah, good to know. I was just imagining a scenario where the `transformSpec` filters in Druid don't map 1:1 to a native filter that Delta Lake supports. I did a quick search, and it seems they can be mapped to the ones we already support at least, so I think we should be good.
docs/ingestion/input-sources.md (Outdated)
@@ -1063,7 +1063,7 @@ The following is a sample spec for a S3 warehouse source:

### Catalog Object

-The catalog object supports `local` and `hive` catalog types.
+The catalog object supports `local`, `hive`, and `rest` catalog types.
Nit: maybe reorder this and the catalog sections by relevance/usage — rest, hive, and local?
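For context, a sketch of what the new `rest` catalog object could look like in a spec; the `catalogUri` field matches the constructor in this PR, while the endpoint URL is a placeholder:

"icebergCatalog": {
  "type": "rest",
  "catalogUri": "http://localhost:8181"
}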
HttpServer server = null;
ServerSocket serverSocket = null;
try {
  // Probe for a free ephemeral port, then bind the stub catalog server to it.
  serverSocket = new ServerSocket(0);
  int port = serverSocket.getLocalPort();
  serverSocket.close();
  server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
  server.createContext(
      "/v1/config", // API for catalog fetchConfig which is invoked on catalog initialization
      (httpExchange) -> {
        String payload = "{}";
        byte[] outputBytes = payload.getBytes(StandardCharsets.UTF_8);
        // Headers must be set before sendResponseHeaders(), which writes them out;
        // sendResponseHeaders() also sets Content-Length from the response length.
        httpExchange.getResponseHeaders().set(HttpHeaders.CONTENT_TYPE, "application/octet-stream");
        httpExchange.getResponseHeaders().set(HttpHeaders.CONTENT_RANGE, "bytes 0");
        httpExchange.sendResponseHeaders(200, outputBytes.length);
        OutputStream os = httpExchange.getResponseBody();
        os.write(outputBytes);
        os.close();
      }
  );
  server.start();
This should probably be in test setup()?
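For instance, a minimal sketch of that refactor, assuming JUnit 4's @Before/@After lifecycle; binding HttpServer directly to port 0 also avoids the ServerSocket port-probing race:

private HttpServer server;

@Before
public void setUp() throws IOException
{
  // Bind straight to an ephemeral port; no need to probe with ServerSocket first.
  server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
  server.createContext("/v1/config", httpExchange -> {
    byte[] outputBytes = "{}".getBytes(StandardCharsets.UTF_8);
    httpExchange.getResponseHeaders().set(HttpHeaders.CONTENT_TYPE, "application/json");
    httpExchange.sendResponseHeaders(200, outputBytes.length);
    try (OutputStream os = httpExchange.getResponseBody()) {
      os.write(outputBytes);
    }
  });
  server.start();
}

@After
public void tearDown()
{
  if (server != null) {
    server.stop(0);
  }
}

Tests can then read the bound port from server.getAddress().getPort().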
Adds support to the `iceberg` input source to read from Iceberg REST Catalogs.

Description

Adds support to the `iceberg` input source to read from Iceberg REST catalogs.
1.6.1

Release note

Adds support to the `iceberg` input source to read from Iceberg REST Catalogs.

This PR has: