From 56fd9e6931ef2feb97e4b4bc897794c4dab6f9ea Mon Sep 17 00:00:00 2001
From: Steve Zhang
Date: Thu, 27 Apr 2023 16:01:43 -0700
Subject: [PATCH 1/3] Update documentation to reflect new catalog features

---
 docs/spark-configuration.md | 18 +++++++++++++++---
 .../org/apache/iceberg/spark/SparkCatalog.java | 13 ++++++++++---
 .../org/apache/iceberg/spark/SparkCatalog.java | 13 ++++++++++---
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/docs/spark-configuration.md b/docs/spark-configuration.md
index 926ec0207dad..8a14794d352e 100644
--- a/docs/spark-configuration.md
+++ b/docs/spark-configuration.md
@@ -40,6 +40,14 @@ spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port
 # omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml
 ```
 
+Below is an example for a REST catalog named `rest_prod` that loads tables from REST URL `http://localhost:8080`:
+
+```plain
+spark.sql.catalog.rest_prod = org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.rest_prod.type = rest
+spark.sql.catalog.rest_prod.uri = http://localhost:8080
+```
+
 Iceberg also supports a directory-based catalog in HDFS that can be configured using `type=hadoop`:
 
 ```plain
@@ -66,12 +74,16 @@ Both catalogs are configured using properties nested under the catalog name. Com
 | Property | Values | Description |
 | -------------------------------------------------- | ----------------------------- | -------------------------------------------------------------------- |
 | spark.sql.catalog._catalog-name_.type | `hive`, `hadoop` or `rest` | The underlying Iceberg catalog implementation, `HiveCatalog`, `HadoopCatalog`, `RESTCatalog` or left unset if using a custom catalog |
-| spark.sql.catalog._catalog-name_.catalog-impl | | The underlying Iceberg catalog implementation.|
+| spark.sql.catalog._catalog-name_.catalog-impl | | The custom Iceberg catalog implementation. if `type` is null, `catalog-impl` must not be null. |
+| spark.sql.catalog._catalog-name_.io-impl | | The custom FileIO implementation. |
+| spark.sql.catalog._catalog-name_.metrics-reporter-impl | | The custom MetricsReporter implementation. |
 | spark.sql.catalog._catalog-name_.default-namespace | default | The default current namespace for the catalog |
-| spark.sql.catalog._catalog-name_.uri | thrift://host:port | Metastore connect URI; default from `hive-site.xml` |
+| spark.sql.catalog._catalog-name_.uri | thrift://host:port | Hive metastore URL for hive typed catalog, REST URL for REST typed catalog |
 | spark.sql.catalog._catalog-name_.warehouse | hdfs://nn:8020/warehouse/path | Base path for the warehouse directory |
 | spark.sql.catalog._catalog-name_.cache-enabled | `true` or `false` | Whether to enable catalog cache, default value is `true` |
-| spark.sql.catalog._catalog-name_.cache.expiration-interval-ms | `30000` (30 seconds) | Duration after which cached catalog entries are expired; Only effective if `cache-enabled` is `true`. `-1` disables cache expiration and `0` disables caching entirely, irrespective of `cache-enabled`. Default is `30000` (30 seconds) | |
+| spark.sql.catalog._catalog-name_.cache.expiration-interval-ms | `30000` (30 seconds) | Duration after which cached catalog entries are expired; Only effective if `cache-enabled` is `true`. `-1` disables cache expiration and `0` disables caching entirely, irrespective of `cache-enabled`. Default is `30000` (30 seconds) |
+| spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Iceberg table property _propertyKey_ default at catalog level. A different table property value can be overridden by user |
+| spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Iceberg table property _propertyKey_ enforced at catalog level. Cannot be overridden by user |
 
 Additional properties can be found in common [catalog configuration](../configuration#catalog-properties).
 
diff --git a/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java b/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
index 3ad3f5d0ee2a..cae62486ca36 100644
--- a/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
+++ b/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
@@ -89,16 +89,23 @@ *

This supports the following catalog configuration options: * *

* *

diff --git a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java b/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
index 3ad3f5d0ee2a..cae62486ca36 100644
--- a/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
+++ b/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
@@ -89,16 +89,23 @@ *

This supports the following catalog configuration options: * *

* *

From eba2115386b737714a406f2cb140397480bc8258 Mon Sep 17 00:00:00 2001
From: Hongyue/Steve Zhang
Date: Fri, 28 Apr 2023 12:02:30 -0700
Subject: [PATCH 2/3] Apply suggestions from code review

Co-authored-by: Eduard Tudenhoefner
---
 docs/spark-configuration.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/spark-configuration.md b/docs/spark-configuration.md
index 8a14794d352e..ba2fe10cff82 100644
--- a/docs/spark-configuration.md
+++ b/docs/spark-configuration.md
@@ -74,16 +74,16 @@ Both catalogs are configured using properties nested under the catalog name. Com
 | Property | Values | Description |
 | -------------------------------------------------- | ----------------------------- | -------------------------------------------------------------------- |
 | spark.sql.catalog._catalog-name_.type | `hive`, `hadoop` or `rest` | The underlying Iceberg catalog implementation, `HiveCatalog`, `HadoopCatalog`, `RESTCatalog` or left unset if using a custom catalog |
-| spark.sql.catalog._catalog-name_.catalog-impl | | The custom Iceberg catalog implementation. if `type` is null, `catalog-impl` must not be null. |
-| spark.sql.catalog._catalog-name_.io-impl | | The custom FileIO implementation. |
-| spark.sql.catalog._catalog-name_.metrics-reporter-impl | | The custom MetricsReporter implementation. |
+| spark.sql.catalog._catalog-name_.catalog-impl | | The custom Iceberg catalog implementation. If `type` is null, `catalog-impl` must not be null. |
+| spark.sql.catalog._catalog-name_.io-impl | | The custom FileIO implementation. |
+| spark.sql.catalog._catalog-name_.metrics-reporter-impl | | The custom MetricsReporter implementation. |
 | spark.sql.catalog._catalog-name_.default-namespace | default | The default current namespace for the catalog |
 | spark.sql.catalog._catalog-name_.uri | thrift://host:port | Hive metastore URL for hive typed catalog, REST URL for REST typed catalog |
 | spark.sql.catalog._catalog-name_.warehouse | hdfs://nn:8020/warehouse/path | Base path for the warehouse directory |
 | spark.sql.catalog._catalog-name_.cache-enabled | `true` or `false` | Whether to enable catalog cache, default value is `true` |
 | spark.sql.catalog._catalog-name_.cache.expiration-interval-ms | `30000` (30 seconds) | Duration after which cached catalog entries are expired; Only effective if `cache-enabled` is `true`. `-1` disables cache expiration and `0` disables caching entirely, irrespective of `cache-enabled`. Default is `30000` (30 seconds) |
-| spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Iceberg table property _propertyKey_ default at catalog level. A different table property value can be overridden by user |
-| spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Iceberg table property _propertyKey_ enforced at catalog level. Cannot be overridden by user |
+| spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Iceberg table property _propertyKey_ default at catalog level. A different table property value can be overridden by the user |
+| spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Iceberg table property _propertyKey_ enforced at catalog level. Cannot be overridden by the user |
 
 Additional properties can be found in common [catalog configuration](../configuration#catalog-properties).
 
From 7051920194377e5658adf9bcace21ca9d8f6bb38 Mon Sep 17 00:00:00 2001
From: Steve Zhang
Date: Tue, 2 May 2023 15:24:32 -0700
Subject: [PATCH 3/3] Address Szehon Feedback

---
 docs/spark-configuration.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/spark-configuration.md b/docs/spark-configuration.md
index ba2fe10cff82..4dbe527aee5a 100644
--- a/docs/spark-configuration.md
+++ b/docs/spark-configuration.md
@@ -82,8 +82,8 @@ Both catalogs are configured using properties nested under the catalog name. Com
 | spark.sql.catalog._catalog-name_.warehouse | hdfs://nn:8020/warehouse/path | Base path for the warehouse directory |
 | spark.sql.catalog._catalog-name_.cache-enabled | `true` or `false` | Whether to enable catalog cache, default value is `true` |
 | spark.sql.catalog._catalog-name_.cache.expiration-interval-ms | `30000` (30 seconds) | Duration after which cached catalog entries are expired; Only effective if `cache-enabled` is `true`. `-1` disables cache expiration and `0` disables caching entirely, irrespective of `cache-enabled`. Default is `30000` (30 seconds) |
-| spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Iceberg table property _propertyKey_ default at catalog level. A different table property value can be overridden by the user |
-| spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Iceberg table property _propertyKey_ enforced at catalog level. Cannot be overridden by the user |
+| spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Default Iceberg table property value for property key _propertyKey_, which will be set on tables created by this catalog if not overridden |
+| spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Enforced Iceberg table property value for property key _propertyKey_, which cannot be overridden by user |
 
 Additional properties can be found in common [catalog configuration](../configuration#catalog-properties).
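
To illustrate the `table-default` and `table-override` rows documented in this series, here is a minimal sketch of a catalog configuration, not a recommended setup. It reuses the `rest_prod` REST catalog from the first patch. The two table property keys (`write.format.default` and `write.metadata.delete-after-commit.enabled`) and their values are assumptions picked for the example, not taken from the patch.

```plain
spark.sql.catalog.rest_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.rest_prod.type = rest
spark.sql.catalog.rest_prod.uri = http://localhost:8080
# assumed example key/value: default set on tables created by this catalog unless overridden
spark.sql.catalog.rest_prod.table-default.write.format.default = orc
# assumed example key/value: enforced by the catalog, cannot be overridden by the user
spark.sql.catalog.rest_prod.table-override.write.metadata.delete-after-commit.enabled = true
```

Going by the descriptions in the table, a table created through `rest_prod` without its own `write.format.default` would pick up `orc`, while `write.metadata.delete-after-commit.enabled` would stay `true` whatever the user supplies at creation time.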