Merged
10 changes: 10 additions & 0 deletions docs/docs/configuration.md
@@ -90,6 +90,15 @@ Iceberg tables support table properties to configure table behavior, like the de
| write.merge.isolation-level | serializable | Isolation level for merge commands: serializable or snapshot |
| write.delete.granularity | partition | Controls the granularity of generated delete files: partition or file |

### Encryption properties

| Property | Default | Description |
| --------------------------------- | ------------------ | ------------------------------------------------------------------------------------- |
| encryption.key-id | (not set) | ID of the master key of the table |
| encryption.data-key-length | 16 (bytes) | Length of keys used for encryption of table files. Valid values are 16, 24, or 32 bytes |

See the [Encryption](encryption.md) document for additional details.

### Table behavior properties

| Property | Default | Description |
@@ -137,6 +146,7 @@ Iceberg catalogs support using catalog properties to configure catalog behaviors
| cache-enabled | true | Whether to cache catalog entries |
| cache.expiration-interval-ms | 30000 | How long catalog entries are locally cached, in milliseconds; 0 disables caching, negative values disable expiration |
| metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. See the [Metrics reporting](metrics-reporting.md) section for additional details |
| encryption.kms-impl | null | Custom `KeyManagementClient` implementation to use in a catalog for interactions with a KMS (key management service). See the [Encryption](encryption.md) document for additional details |

`HadoopCatalog` and `HiveCatalog` can access the properties in their constructors.
Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`.
2 changes: 2 additions & 0 deletions docs/docs/custom-catalog.md
@@ -28,6 +28,8 @@ It's possible to read an iceberg table either from an hdfs path or from a hive t
- [Custom LocationProvider](#custom-location-provider-implementation)
- [Custom IcebergSource](#custom-icebergsource)

Note: To work with encrypted tables, custom catalogs must address a number of security [requirements](encryption.md#catalog-security-requirements).

### Custom table operations implementation
Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata

153 changes: 153 additions & 0 deletions docs/docs/encryption.md
@@ -0,0 +1,153 @@
---
title: "Encryption"
---
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-->

# Encryption

Iceberg table encryption protects the confidentiality and integrity of table data in untrusted storage. The `data`, `delete`, `manifest` and `manifest list` files are encrypted and tamper-proofed before being sent to the storage backend.

The `metadata.json` file does not contain data or stats, and is therefore not encrypted.
Contributor

I believe this is not true. Please check this PR on stats contained in metadata.json, #14502 (comment), which includes thoughts on an exploit.

cc @ggershinsky

Contributor Author

Thanks @singhpk234, I'll have a look.

Contributor Author

Thanks again @singhpk234, can you please clarify a few points:

  1. Is the following true? The partition summary is not part of the snapshot summary in the Iceberg spec (https://iceberg.apache.org/spec/#optional-snapshot-summary-fields), but the implementation sometimes adds it to snapshots, where it can contain data stats.
  2. If yes, when are partition summaries enabled, and when do they have stats? Is any of this under the writing user's control?
  3. Can you give an example of such a use case (one that writes stats to metadata.json)?

Contributor

Thank you for taking a look, @ggershinsky!

  1. The partition summary, and summary properties in general, are optional and not defined in the spec. With the Iceberg Java implementation, they are collected when write.summary.partition-limit is enabled (default off). Please check this for details: https://iceberg.apache.org/docs/nightly/configuration/#write-properties
  2. Yes, they are enabled by the writer: the user sets the table property write.summary.partition-limit, and engines such as Spark collect the summaries when it is enabled.
  3. Please check https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L202
    At a high level, it can reveal column values and stats on file counts, etc.
    https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L163

Please let me know what you think about it.

Contributor Author

Are partition summary fields encapsulated in the SnapshotSummary.UpdateMetrics class?
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L223

Looks like a set of counters. Are there stats (col min/max) too?


Currently, encryption is supported in the Hive and REST catalogs for tables with Parquet and Avro data formats.

Two parameters are required to activate encryption of a table:
1. Catalog property `encryption.kms-impl`, which specifies the fully qualified class name of a client for a KMS ("key management service").
2. Table property `encryption.key-id`, which specifies the ID of a master key used to encrypt and decrypt the table. Master keys are stored and managed in the KMS.

For more details on table encryption, see the "Appendix: Internals Overview" [subsection](#appendix-internals-overview).

## Example

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }} \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hive \
--conf spark.sql.catalog.local.encryption.kms-impl=org.apache.iceberg.aws.AwsKeyManagementClient
```

```sql
CREATE TABLE local.db.table (id bigint, data string) USING iceberg
TBLPROPERTIES ('encryption.key-id'='{{ master key id }}');
```

Inserted data will be automatically encrypted:

```sql
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
```

To verify encryption, the contents of data, manifest and manifest list files can be dumped on the command line with:

```sh
hexdump -C {{ /path/to/file }} | more
```

The Parquet files must start with the "PARE" magic string (PARquet Encrypted footer mode), and the manifest and manifest list files must start with the "AGS1" magic string (Aes Gcm Stream version 1).
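The same check can be scripted instead of reading a hex dump. The following is a minimal sketch, not part of Iceberg: the `MagicCheck` class and its method names are hypothetical, and it only inspects the first four bytes of a local file against the magic strings named above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not an Iceberg API): classifies a file by its first four bytes. */
final class MagicCheck {
  private MagicCheck() {}

  /** Returns the first four bytes as an ASCII string, or "" if the file is shorter than that. */
  static String readMagic(Path file) throws IOException {
    try (InputStream in = Files.newInputStream(file)) {
      byte[] magic = in.readNBytes(4);
      return magic.length == 4 ? new String(magic, StandardCharsets.US_ASCII) : "";
    }
  }

  /** True if the file starts with "PARE" (Parquet, encrypted footer mode). */
  static boolean isEncryptedParquet(Path file) throws IOException {
    return "PARE".equals(readMagic(file));
  }

  /** True if the file starts with "AGS1" (AES GCM Stream version 1). */
  static boolean isGcmStream(Path file) throws IOException {
    return "AGS1".equals(readMagic(file));
  }
}
```

An unencrypted Parquet file starts with "PAR1" instead, so `isEncryptedParquet` returning `false` for a data file is a sign that encryption was not applied.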

Queried data will be automatically decrypted:

```sql
SELECT * FROM local.db.table;
```

## Catalog security requirements

1. Catalogs must ensure the `encryption.key-id` property is not modified or removed during table lifetime.

2. To function properly, Iceberg table encryption requires the catalog implementations not to retrieve the metadata
directly from metadata.json files, if these files are kept unprotected in a storage vulnerable to tampering.

* Catalogs may keep the metadata in a trusted independent object store.
* Catalogs may work with metadata.json files in a tamper-proof storage.
* Catalogs may use checksum techniques to verify integrity of metadata.json files in a storage vulnerable to tampering
(the checksums must be kept in a separate trusted storage).
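The checksum option in the last bullet can be sketched as follows. This is an illustration only, under the assumption that the catalog records a SHA-256 digest of each metadata.json file in trusted storage at commit time; the `MetadataChecksum` class and its methods are hypothetical and not an Iceberg API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not an Iceberg API): verifies a metadata file against a trusted digest. */
final class MetadataChecksum {
  private MetadataChecksum() {}

  /** Computes the SHA-256 digest of the file contents. */
  static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
    return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(file));
  }

  /**
   * Returns true only if the file's digest equals the digest kept in trusted storage.
   * The trusted digest must come from a store the attacker cannot modify.
   */
  static boolean matches(Path metadataFile, byte[] trustedChecksum)
      throws IOException, NoSuchAlgorithmException {
    return MessageDigest.isEqual(sha256(metadataFile), trustedChecksum);
  }
}
```

A catalog following this approach would refuse to load a table when `matches` returns `false`, instead of trusting the metadata.json content.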

## Key Management Clients

Currently, Iceberg has clients for the AWS, GCP and Azure KMS systems. A custom client can be built for other key management systems by implementing the `org.apache.iceberg.encryption.KeyManagementClient` interface.

This interface has the following main methods:

```java
/**
 * Initialize the KMS client with given properties.
 *
 * @param properties kms client properties (taken from catalog properties)
 */
void initialize(Map<String, String> properties);

/**
 * Wrap a secret key, using a wrapping/master key which is stored in KMS and referenced by an ID.
 * Wrapping means encryption of the secret key with the master key, and adding optional
 * KMS-specific metadata that allows the KMS to decrypt the secret key in an unwrapping call.
 *
 * @param key a secret key being wrapped
 * @param wrappingKeyId a key ID that represents a wrapping key stored in KMS
 * @return wrapped key material
 */
ByteBuffer wrapKey(ByteBuffer key, String wrappingKeyId);

/**
 * Unwrap a secret key, using a wrapping/master key which is stored in KMS and referenced by an
 * ID.
 *
 * @param wrappedKey wrapped key material (encrypted key and optional KMS metadata, returned by
 *     the wrapKey method)
 * @param wrappingKeyId a key ID that represents a wrapping key stored in KMS
 * @return raw key bytes
 */
ByteBuffer unwrapKey(ByteBuffer wrappedKey, String wrappingKeyId);
```
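To make the wrap/unwrap contract concrete, here is a toy sketch that keeps master keys in memory and wraps with AES-GCM from the JDK. It is an assumption-laden illustration, not a real KMS client: the `InMemoryKmsSketch` class, its `addMasterKey` method and the `nonce || ciphertext` layout are all hypothetical, and the class deliberately does not implement the Iceberg interface (nor `initialize`) so that it compiles without Iceberg on the classpath. It must not be used in production, since master keys should never leave the KMS.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Toy stand-in for a KMS client; mirrors the wrapKey/unwrapKey method shapes only. */
final class InMemoryKmsSketch {
  private static final int NONCE_LENGTH = 12; // bytes, the usual AES-GCM nonce size
  private static final int TAG_BITS = 128;

  private final Map<String, byte[]> masterKeys = new HashMap<>();
  private final SecureRandom random = new SecureRandom();

  /** Hypothetical setup hook: a real KMS holds the master keys itself. */
  void addMasterKey(String keyId, byte[] keyBytes) {
    masterKeys.put(keyId, keyBytes.clone());
  }

  /** Encrypts the secret key with the master key; output layout is nonce || ciphertext+tag. */
  ByteBuffer wrapKey(ByteBuffer key, String wrappingKeyId) {
    byte[] plain = new byte[key.remaining()];
    key.duplicate().get(plain); // duplicate() leaves the caller's buffer position untouched
    byte[] nonce = new byte[NONCE_LENGTH];
    random.nextBytes(nonce);
    byte[] cipherText = crypt(Cipher.ENCRYPT_MODE, wrappingKeyId, nonce, plain);
    ByteBuffer wrapped = ByteBuffer.allocate(nonce.length + cipherText.length);
    wrapped.put(nonce).put(cipherText);
    wrapped.flip();
    return wrapped;
  }

  /** Reverses wrapKey; fails if the wrapped material was modified or the key ID is wrong. */
  ByteBuffer unwrapKey(ByteBuffer wrappedKey, String wrappingKeyId) {
    ByteBuffer in = wrappedKey.duplicate();
    byte[] nonce = new byte[NONCE_LENGTH];
    in.get(nonce);
    byte[] cipherText = new byte[in.remaining()];
    in.get(cipherText);
    return ByteBuffer.wrap(crypt(Cipher.DECRYPT_MODE, wrappingKeyId, nonce, cipherText));
  }

  private byte[] crypt(int mode, String keyId, byte[] nonce, byte[] input) {
    byte[] masterKey = masterKeys.get(keyId);
    if (masterKey == null) {
      throw new IllegalArgumentException("Unknown master key: " + keyId);
    }
    try {
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(mode, new SecretKeySpec(masterKey, "AES"), new GCMParameterSpec(TAG_BITS, nonce));
      return cipher.doFinal(input);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Key wrapping operation failed", e);
    }
  }
}
```

A real client would forward `wrapKey` and `unwrapKey` to the KMS service, so the master key bytes never reach the Iceberg process.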

## Appendix: Internals Overview

The standard Iceberg encryption manager generates an encryption key and a unique file ID ("AAD prefix")
for each data and delete file. The generation is performed in the worker nodes, by using a secure random
number generator. For Parquet data files, these parameters are passed to the native Parquet Modular
Encryption [mechanism](https://parquet.apache.org/docs/file-format/data-pages/encryption). For Avro data files,
these parameters are passed to the AES GCM Stream encryption [mechanism](../../format/gcm-stream-spec.md).
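The per-file generation step can be illustrated with the JDK's `SecureRandom`. This is a sketch only: the `FileKeySketch` class is hypothetical, the accepted key lengths follow the `encryption.data-key-length` table property (16, 24 or 32 bytes), and the AAD prefix length is left to the caller because the text above does not state it.

```java
import java.security.SecureRandom;

/** Hypothetical helper (not an Iceberg API): generates per-file encryption parameters. */
final class FileKeySketch {
  private static final SecureRandom RANDOM = new SecureRandom();

  private FileKeySketch() {}

  /** Generates a random AES key; only the AES key sizes 16, 24 and 32 bytes are accepted. */
  static byte[] newDataKey(int lengthBytes) {
    if (lengthBytes != 16 && lengthBytes != 24 && lengthBytes != 32) {
      throw new IllegalArgumentException("Invalid data key length: " + lengthBytes);
    }
    byte[] key = new byte[lengthBytes];
    RANDOM.nextBytes(key);
    return key;
  }

  /** Generates a random file ID to use as the AAD prefix; the length is the caller's choice. */
  static byte[] newAadPrefix(int lengthBytes) {
    byte[] prefix = new byte[lengthBytes];
    RANDOM.nextBytes(prefix);
    return prefix;
  }
}
```

Because every file gets its own key and AAD prefix, compromising one data file's key does not expose other files, and the AAD prefix lets a reader detect a file swapped in from elsewhere.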

The parent manifest file stores the encryption key and AAD prefix for each data and delete file in the
`key_metadata` [field](../../format/spec.md#data-file-fields). For Avro data tables, the data file length
is also added to the `key_metadata`.
The manifest file is encrypted by the AES GCM Stream encryption mechanism, using an encryption key and an
AAD prefix generated by the standard encryption manager. The generation is performed in the driver nodes,
by using a secure random number generator.

The parent manifest list file stores the encryption key, AAD prefix and file length for each manifest file
in the `key_metadata` [field](../../format/spec.md#manifest-lists). The manifest list file is encrypted by
the AES GCM Stream encryption mechanism,
using an encryption key and an AAD prefix generated by the standard encryption manager.

The manifest list encryption key, AAD prefix and file length are packed in a key metadata object. This object
is serialized and encrypted with a "key encryption key" (KEK), using the KEK creation timestamp as the AES
GCM AAD. A KEK and its unique KEK_ID are generated by using a secure random number generator. For each
snapshot, the KEK_ID of the encryption key that encrypts the manifest list key metadata is kept in the
`key-id` field in the table metadata snapshot [structure](../../format/spec.md#snapshots). The encrypted
manifest list key metadata is kept in the `encryption-keys` list in the table metadata
[structure](../../format/spec.md#table-metadata-fields).
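The role of the KEK creation timestamp as AAD can be shown with plain JDK AES-GCM. The sketch below is an illustration under stated assumptions, not Iceberg's serialization: the `KekAadSketch` class, the `nonce || ciphertext` layout and the encoding of the timestamp as a decimal UTF-8 string are all hypothetical choices made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper (not an Iceberg API): AES-GCM with the KEK timestamp bound as AAD. */
final class KekAadSketch {
  private static final int NONCE_LENGTH = 12;
  private static final int TAG_BITS = 128;
  private static final SecureRandom RANDOM = new SecureRandom();

  private KekAadSketch() {}

  /** Encrypts key metadata with the KEK; returns nonce || ciphertext+tag. */
  static byte[] encrypt(byte[] kek, long kekTimestampMs, byte[] keyMetadata)
      throws GeneralSecurityException {
    byte[] nonce = new byte[NONCE_LENGTH];
    RANDOM.nextBytes(nonce);
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(kek, "AES"), new GCMParameterSpec(TAG_BITS, nonce));
    cipher.updateAAD(aad(kekTimestampMs)); // authenticated but not encrypted
    byte[] cipherText = cipher.doFinal(keyMetadata);
    byte[] out = Arrays.copyOf(nonce, nonce.length + cipherText.length);
    System.arraycopy(cipherText, 0, out, nonce.length, cipherText.length);
    return out;
  }

  /** Decrypts; throws AEADBadTagException if the data or the timestamp was altered. */
  static byte[] decrypt(byte[] kek, long kekTimestampMs, byte[] encrypted)
      throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(
        Cipher.DECRYPT_MODE,
        new SecretKeySpec(kek, "AES"),
        new GCMParameterSpec(TAG_BITS, encrypted, 0, NONCE_LENGTH));
    cipher.updateAAD(aad(kekTimestampMs));
    return cipher.doFinal(encrypted, NONCE_LENGTH, encrypted.length - NONCE_LENGTH);
  }

  // Assumed encoding for the example: decimal string in UTF-8.
  private static byte[] aad(long timestampMs) {
    return Long.toString(timestampMs).getBytes(StandardCharsets.UTF_8);
  }
}
```

Binding the timestamp as AAD means that an attacker who edits the stored KEK timestamp, for example to make an expired KEK look fresh, causes decryption of the key metadata to fail.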

The KEK is encrypted by the table master key via the KMS client. The result is kept in the `encryption-keys`
list in the table metadata structure. The KEK is re-used for a period allowed by the NIST SP 800-57
specification. Then it is rotated: a new KEK and KEK_ID are generated for encryption of new manifest list
key metadata objects. The new KEK is encrypted by the table master key and stored in the `encryption-keys`
list in the table metadata structure. The previous KEKs are retained for the existing table snapshots.
1 change: 1 addition & 0 deletions docs/mkdocs.yml
@@ -26,6 +26,7 @@ nav:
- Tables:
- branching.md
- configuration.md
- encryption.md
- evolution.md
- maintenance.md
- metrics-reporting.md