You can also preload the catalog by setting the configurations above in `hive-site.xml`.
157
157
158
-
## Glue Catalog
158
+
## Catalogs
159
+
160
+
Users can choose among several options for building an Iceberg catalog with AWS.
161
+
162
+
### Glue Catalog
159
163
160
164
Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation.
161
165
When used, an Iceberg namespace is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html),
@@ -165,22 +169,22 @@ You can start using Glue catalog by specifying the `catalog-impl` as `org.apache
165
169
just like what is shown in the [enabling AWS integration](#enabling-aws-integration) section above.
166
170
More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark-configuration/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).
167
171
168
-
### Glue Catalog ID
172
+
#### Glue Catalog ID
169
173
There is a unique Glue metastore in each AWS account and each AWS region.
170
174
By default, `GlueCatalog` chooses the Glue metastore to use based on the user's default AWS client credential and region setup.
171
175
You can specify the Glue catalog ID through `glue.id` catalog property to point to a Glue catalog in a different AWS account.
172
176
The Glue catalog ID is your numeric AWS account ID.
173
177
If the Glue catalog is in a different region, you should configure your AWS client to point to the correct region,
174
178
see more details in [AWS client customization](#aws-client-customization).
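
As a minimal sketch of the cross-account case, the helper below assembles the one extra property that points a catalog at another account's Glue metastore. The function name `glue_id_property`, the catalog name `my_catalog`, and the account ID are hypothetical, and the `spark.sql.catalog.<name>.` prefix assumes the property is passed through Spark; other engines pass catalog properties differently.

```python
def glue_id_property(catalog_name, glue_account_id):
    """Return the Spark-style property that selects a Glue catalog by ID.

    Assumption: catalog properties are supplied with the
    `spark.sql.catalog.<catalog_name>.` prefix.
    """
    # The Glue catalog ID is the numeric AWS account ID, so reject anything else.
    if not glue_account_id.isdigit():
        raise ValueError(f"Glue catalog ID must be numeric: {glue_account_id!r}")
    return {f"spark.sql.catalog.{catalog_name}.glue.id": glue_account_id}


# Hypothetical account ID, for illustration only.
props = glue_id_property("my_catalog", "123456789012")
```

The region is deliberately not part of this helper, because it is controlled by the AWS client configuration rather than by `glue.id`.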
175
179
176
-
### Skip Archive
180
+
#### Skip Archive
177
181
178
182
By default, Glue stores all the table versions created, and users can roll back a table to any historical version if needed.
179
183
However, if you are streaming data to Iceberg, frequent commits will quickly create a large number of Glue table versions.
180
184
Therefore, it is recommended to turn off the archive feature in Glue by setting `glue.skip-archive` to `true`.
181
185
For more details, please read [Glue Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the [UpdateTable API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html).
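
The following sketch shows one way the setting could be turned into command-line arguments for a streaming job. The helper names and the catalog name `my_catalog` are hypothetical, and the `spark.sql.catalog.<name>.` prefix and `--conf` syntax assume a Spark launcher.

```python
def skip_archive_property(catalog_name):
    """Property that turns off Glue table version archiving (Spark-style key)."""
    return {f"spark.sql.catalog.{catalog_name}.glue.skip-archive": "true"}


def to_conf_args(props):
    """Render a property map as `--conf key=value` argument pairs."""
    args = []
    for key, value in sorted(props.items()):
        args += ["--conf", f"{key}={value}"]
    return args


args = to_conf_args(skip_archive_property("my_catalog"))
```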
182
186
183
-
### DynamoDB for Commit Locking
187
+
#### DynamoDB for Commit Locking
184
188
185
189
Glue does not have a strong guarantee over concurrent updates to a table.
186
190
Although it throws `ConcurrentModificationException` when detecting two processes updating a table at the same time,
@@ -196,14 +200,14 @@ This feature requires the following lock related catalog properties:
196
200
Other lock related catalog properties can also be used to adjust locking behaviors such as heartbeat interval.
197
201
For more details, please refer to [Lock catalog properties](../configuration/#lock-catalog-properties).
198
202
199
-
### Warehouse Location
203
+
#### Warehouse Location
200
204
201
205
Similar to all other catalog implementations, `warehouse` is a required catalog property to determine the root path of the data warehouse in storage.
202
206
By default, Glue only allows a warehouse location in S3 because of the use of `S3FileIO`.
203
207
To store data in a different local or cloud store, the Glue catalog can switch to `HadoopFileIO` or any custom FileIO by setting the `io-impl` catalog property.
204
208
Details about this feature can be found in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section.
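
A small sketch of the two cases, written as plain catalog property maps. The helper name `warehouse_properties` and the example paths are hypothetical; `org.apache.iceberg.hadoop.HadoopFileIO` is the fully qualified class name assumed for `HadoopFileIO`.

```python
HADOOP_FILE_IO = "org.apache.iceberg.hadoop.HadoopFileIO"  # assumed class name


def warehouse_properties(warehouse, io_impl=None):
    """Build the warehouse-related catalog properties.

    `warehouse` is required; `io-impl` is only set when a non-default
    FileIO is requested, otherwise the S3 default stays in effect.
    """
    if not warehouse:
        raise ValueError("warehouse is a required catalog property")
    props = {"warehouse": warehouse}
    if io_impl is not None:
        props["io-impl"] = io_impl
    return props


# Default case: warehouse in S3, no io-impl override.
s3_props = warehouse_properties("s3://my-bucket/my-warehouse")
# Non-S3 case: switch the FileIO so the warehouse can live elsewhere.
hdfs_props = warehouse_properties("hdfs://namenode/warehouse", HADOOP_FILE_IO)
```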
205
209
206
-
### Table Location
210
+
#### Table Location
207
211
208
212
By default, the root location for a table `my_table` of namespace `my_ns` is at `my-warehouse-location/my_ns.db/my_table`.
209
213
This default root location can be changed at both namespace and table level.
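
The default layout can be sketched as a simple path join. This is an illustration of the naming pattern only, assuming the namespace and table names are used verbatim in the path; the function name is hypothetical and namespace-level or table-level overrides are not modeled.

```python
def default_table_location(warehouse, namespace, table):
    """Compose `<warehouse>/<namespace>.db/<table>` for a table."""
    # Strip a trailing slash so the result never contains `//`.
    return f"{warehouse.rstrip('/')}/{namespace}.db/{table}"


location = default_table_location("my-warehouse-location", "my_ns", "my_table")
```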
| identifier | partition key | string | table identifier such as `db1.table1`, or string `NAMESPACE` for namespaces |
265
+
| namespace | sort key | string | namespace name. A [global secondary index (GSI)](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html) is created with namespace as partition key, identifier as sort key, no other projected columns |
266
+
| v | | string | row version, used for optimistic locking |
267
+
| updated_at | | number | timestamp (millis) of the last update |
268
+
| created_at | | number | timestamp (millis) of the table creation |
269
+
| p.<property_key\> | | string | Iceberg-defined table properties including `table_type`, `metadata_location` and `previous_metadata_location` or namespace properties |
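
To make the row layout concrete, here is a sketch that builds the items as plain Python dictionaries. It is not the catalog's actual code: the function names are hypothetical, and only the column names and the `NAMESPACE` marker are taken from the table.

```python
import time
import uuid

NAMESPACE_MARKER = "NAMESPACE"  # identifier value used for namespace rows


def table_item(namespace, table, properties):
    """Build a table row with properties flattened into `p.<key>` columns."""
    now_millis = int(time.time() * 1000)
    item = {
        "identifier": f"{namespace}.{table}",  # partition key
        "namespace": namespace,                # sort key
        "v": str(uuid.uuid4()),                # row version for optimistic locking
        "updated_at": now_millis,
        "created_at": now_millis,
    }
    for key, value in properties.items():
        item[f"p.{key}"] = value
    return item


def namespace_key(namespace):
    """Key of a namespace row: all namespaces share one partition."""
    return {"identifier": NAMESPACE_MARKER, "namespace": namespace}


item = table_item("db1", "table1", {"table_type": "ICEBERG", "owner": "alice"})
```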
270
+
271
+
This design has the following benefits:
272
+
273
+
1. it avoids a potential [hot partition issue](https://aws.amazon.com/premiumsupport/knowledge-center/dynamodb-table-throttled/) if there is heavy write traffic to the tables within the same namespace, because the partition key is at the table level
274
+
2. namespace operations are clustered in a single partition to avoid affecting table commit operations
275
+
3. a reverse GSI (sort key to partition key) is used for the list tables operation, and all other operations are single-row operations or single-partition queries. No full table scan is needed for any operation in the catalog.
276
+
4. a string UUID version field `v` is used instead of `updated_at`, so that two processes committing in the same millisecond are still detected as conflicting
277
+
5. multi-row transaction is used for `catalog.renameTable` to ensure idempotency
278
+
6. properties are flattened as top level columns so that users can add a custom GSI on any property field to customize the catalog. For example, users can store owner information as table property `owner`, and search tables by owner by adding a GSI on the `p.owner` column.
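
The role of the version field `v` in point 4 can be illustrated with an in-memory simulation of optimistic locking. This is a sketch under stated assumptions, not the catalog's implementation: the class and exception names are hypothetical, and a Python dictionary stands in for the DynamoDB table and its conditional writes.

```python
import uuid


class CommitConflict(Exception):
    """Raised when the row changed since the writer last read it."""


class InMemoryCatalogTable:
    """Dictionary-backed stand-in for the catalog table (simulation only)."""

    def __init__(self):
        self._rows = {}

    def create(self, identifier, metadata_location):
        if identifier in self._rows:
            raise CommitConflict(f"{identifier} already exists")
        self._rows[identifier] = {
            "v": str(uuid.uuid4()),
            "p.metadata_location": metadata_location,
        }

    def read(self, identifier):
        return dict(self._rows[identifier])

    def commit(self, identifier, expected_v, new_metadata_location):
        row = self._rows[identifier]
        # Conditional write: succeed only if nobody committed in between.
        if row["v"] != expected_v:
            raise CommitConflict(f"stale version for {identifier}")
        self._rows[identifier] = {
            "v": str(uuid.uuid4()),  # fresh UUID, independent of the clock
            "p.metadata_location": new_metadata_location,
            "p.previous_metadata_location": row["p.metadata_location"],
        }


catalog = InMemoryCatalogTable()
catalog.create("db1.table1", "s3://bucket/m0.json")
seen_v = catalog.read("db1.table1")["v"]       # both writers read this version
catalog.commit("db1.table1", seen_v, "s3://bucket/m1.json")  # first writer wins
try:
    catalog.commit("db1.table1", seen_v, "s3://bucket/m2.json")
    second_commit_failed = False
except CommitConflict:
    second_commit_failed = True                 # second writer must retry
```

Because each commit writes a new random UUID, the check does not depend on timestamps, which is why two commits in the same millisecond cannot be mistaken for one.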
279
+
280
+
### RDS JDBC Catalog
281
+
282
+
Iceberg also supports the JDBC catalog, which uses a table in a relational database to manage Iceberg tables.
283
+
You can configure the JDBC catalog to use relational database services like [AWS RDS](https://aws.amazon.com/rds).
284
+
Read [the JDBC integration page](../jdbc/#jdbc-catalog) for guides and examples about using the JDBC catalog.
285
+
Read [this AWS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.IAMDBAuth.Connecting.Java.html) for more details about configuring JDBC catalog with IAM authentication.
286
+
287
+
### Which catalog to choose?
288
+
289
+
With all the available options, we offer the following guidance when choosing the right catalog to use for your application:
290
+
291
+
1. if your organization has an existing Glue metastore or plans to use the AWS analytics ecosystem including Glue, [Athena](https://aws.amazon.com/athena), [EMR](https://aws.amazon.com/emr), [Redshift](https://aws.amazon.com/redshift) and [LakeFormation](https://aws.amazon.com/lake-formation), Glue catalog provides the easiest integration.
292
+
2. if your application requires frequent updates to tables or high read and write throughput (e.g. streaming writes), DynamoDB catalog provides the best performance through optimistic locking.
293
+
3. if you would like to enforce access control for tables in a catalog, Glue tables can be managed as an [IAM resource](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awsglue.html), whereas DynamoDB catalog tables can only be managed through [item-level permission](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/specifying-conditions.html), which is much more complicated.
294
+
4. if you would like to query tables based on table property information without the need to scan the entire catalog, DynamoDB catalog allows you to build secondary indexes for any arbitrary property field and provide efficient query performance.
295
+
5. if you would like to have the benefits of DynamoDB catalog while also connecting to Glue, you can enable [DynamoDB stream with Lambda trigger](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.Tutorial.html) to asynchronously update your Glue metastore with table information in the DynamoDB catalog.
296
+
6. if your organization already maintains an existing relational database in RDS or uses [serverless Aurora](https://aws.amazon.com/rds/aurora/serverless/) to manage tables, JDBC catalog provides the easiest integration.
297
+
241
298
## S3 FileIO
242
299
243
300
Iceberg allows users to write data to S3 through `S3FileIO`.