| s3.retry.min-wait-ms | 2s | Minimum wait time to retry a S3 operation. |
| s3.retry.max-wait-ms | 20s | Maximum wait time to retry a S3 read operation. |
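The retry waits above, like other `S3FileIO` settings, are supplied as catalog properties. As a minimal sketch for Spark, assuming a hypothetical catalog named `my_catalog` (the warehouse path and the plain-millisecond values are illustrative placeholders, not recommendations):

```shell
# Hypothetical Spark SQL launch tuning the S3FileIO retry waits shown above.
# `my_catalog` and the warehouse path are placeholders; the `-ms` suffix on the
# property names suggests plain millisecond values (an assumption here).
spark-sql \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-warehouse \
  --conf spark.sql.catalog.my_catalog.s3.retry.min-wait-ms=500 \
  --conf spark.sql.catalog.my_catalog.s3.retry.max-wait-ms=10000
```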

### Hadoop S3A FileSystem

Before `S3FileIO` was introduced, many Iceberg users chose to use `HadoopFileIO` to write data to S3 through the [S3A FileSystem](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java).
As introduced in the previous sections, `S3FileIO` adopts the latest AWS clients and S3 features for optimized security and performance with Apache Iceberg,
and is thus recommended for S3 use cases rather than the S3A FileSystem.

In contrast, the Apache Hadoop S3A connector:

* Uses the most recent AWS v2 SDK client that the Hadoop project has qualified across a broad set of applications, including Apache Spark, Apache Hive, Apache HBase and more. This may lag behind the Iceberg artifacts.
* Supports Amazon S3 Express One Zone storage.
* Contains detection and recovery for S3 failures beyond that in the AWS SDK, with recovery added "one support call at a time".
* Supports OpenSSL as an optional TLS transport layer, for tangible performance improvements over the JDK implementation.
* Supports scatter/gather "vector IO" and other features for high-performance Parquet reads.
* Has an explicit "FIPS mode" which uses the FIPS endpoints available in some AWS regions.
* Includes [auditing via the S3 Server Logs](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/auditing.html), which can be used to answer important questions such as "who deleted all the files?" and "which job is triggering throttling?".
* Collects [client-side statistics](https://apachecon.com/acasia2022/sessions/bigdata-1191.html) to help identify performance and connectivity issues.

`S3FileIO` writes data with the `s3://` URI scheme, but it is also compatible with schemes written by the S3A FileSystem.
This means that for any table manifests containing `s3a://` or `s3n://` file paths, `S3FileIO` is still able to read them.
This feature allows users to easily switch from S3A to `S3FileIO`.

To use S3A, here are the instructions:

1. To store data using S3A, specify the `warehouse` catalog property to be an S3A path, e.g. `s3a://my-bucket/my-warehouse`
2. For `HiveCatalog`, to also store metadata using S3A, specify the Hadoop config property `hive.metastore.warehouse.dir` to be an S3A path.
3. Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency of your compute engine. The version of this module must exactly match the version of the other Hadoop binaries on the classpath.
4. For the latest features, bug fixes and best performance, use the latest versions of the Hadoop and `hadoop-aws` libraries.
5. Use the exact same shaded version of the AWS SDK that the `hadoop-aws` module was built and tested with. Older versions are unlikely to work; newer versions are unqualified and lack fixes for problems identified during the qualification process.
6. Configure AWS settings based on [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) (make sure you check the version, S3A configuration varies a lot based on the version you use).
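The steps above can be sketched for Spark as follows. The catalog name, bucket, and Hadoop version shown are placeholder assumptions, not recommendations; `hadoop-aws` must match the exact Hadoop version on your classpath, and `--packages` should resolve the matching shaded AWS SDK bundle transitively:

```shell
# Hypothetical Spark SQL launch writing an Iceberg warehouse through S3A.
# Names and versions are illustrative: replace 3.3.6 with the exact Hadoop
# version on the classpath, per steps 3-5 above.
spark-sql \
  --packages org.apache.hadoop:hadoop-aws:3.3.6 \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.type=hadoop \
  --conf spark.sql.catalog.my_catalog.warehouse=s3a://my-bucket/my-warehouse
```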

#### S3A: Maximizing Parquet and Iceberg Performance

For best performance, the S3A connector should be configured for Parquet and Iceberg workloads
based on the recommendations of [Maximizing Performance when working with the S3A Connector](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html).

For applications reading Parquet data, including Iceberg manifests, specific settings to use are listed below.
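As an illustration of the kind of tuning that guide describes (a sketch, not an authoritative list; the property names assume Hadoop 3.3.5 or later), columnar Parquet reads generally benefit from random-access fadvise and vectored IO:

```shell
# Illustrative S3A read-tuning flags, passed through Spark's Hadoop config.
# Property names assume Hadoop 3.3.5+; the values are placeholders, not
# recommendations from this document.
spark-sql \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  --conf spark.hadoop.fs.s3a.vectored.read.min.seek.size=4K \
  --conf spark.hadoop.fs.s3a.vectored.read.max.merged.size=1M
```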
0 commit comments