
Commit b352b04

HADOOP-18679. Updating aws.md documentation
1 parent fb05fa1 commit b352b04


docs/docs/aws.md

Lines changed: 46 additions & 9 deletions
@@ -410,27 +410,64 @@ workloads with exceptionally high throughput against tables that S3 has not yet
| s3.retry.min-wait-ms | 2s | Minimum wait time to retry a S3 operation. |
| s3.retry.max-wait-ms | 20s | Maximum wait time to retry a S3 read operation. |

-### S3 Strong Consistency
-
-In November 2020, S3 announced [strong consistency](https://aws.amazon.com/s3/consistency/) for all read operations, and Iceberg is updated to fully leverage this feature.
-There is no redundant consistency wait and check which might negatively impact performance during IO operations.

### Hadoop S3A FileSystem

Before `S3FileIO` was introduced, many Iceberg users chose to use `HadoopFileIO` to write data to S3 through the [S3A FileSystem](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java).
-As introduced in the previous sections, `S3FileIO` adopts the latest AWS clients and S3 features for optimized security and performance
-and is thus recommended for S3 use cases rather than the S3A FileSystem.
+As introduced in the previous sections, `S3FileIO` adopts the latest AWS clients and S3 features for optimized security and performance with Apache Iceberg
+and is thus recommended for S3 use cases rather than the S3A FileSystem.
+
+In contrast, the Apache Hadoop S3A connector:
+
+* Uses the most recent AWS v2 SDK client that the Hadoop project has qualified across a broad set of applications, including Apache Spark, Apache Hive, Apache HBase and more. This may lag behind the SDK version shipped with the Iceberg artifacts.
+* Supports Amazon S3 Express One Zone storage.
+* Contains detection and recovery for S3 failures beyond that in the AWS SDK, with recovery added "one support call at a time".
+* Supports OpenSSL as an optional TLS transport layer, for tangible performance improvements over the JDK implementation.
+* Supports scatter/gather IO ("Vector IO") and other features for high-performance Parquet reads.
+* Has an explicit "FIPS mode" which uses the FIPS endpoints available in some AWS regions.
+* Includes [auditing via the S3 Server Logs](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/auditing.html), which can be used to answer important questions such as "who deleted all the files?" and "which job is triggering throttling?".
+* Collects [client-side statistics](https://apachecon.com/acasia2022/sessions/bigdata-1191.html) for identification of performance and connectivity issues.
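+
+For illustration only (these are plain `fs.s3a.*` Hadoop options rather than anything defined by Iceberg, and their availability depends on the Hadoop release deployed), optional features such as the OpenSSL TLS layer and the FIPS endpoints mentioned above are switched on through Hadoop configuration, e.g.:
+
+```shell
+# Sketch, assuming Hadoop 3.4.x S3A option names; check the hadoop-aws
+# documentation for the release you actually run before enabling either setting.
+spark-sql --conf spark.hadoop.fs.s3a.ssl.channel.mode=openssl \
+    --conf spark.hadoop.fs.s3a.endpoint.fips=true
+```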

`S3FileIO` writes data with the `s3://` URI scheme, but it is also compatible with schemes written by the S3A FileSystem.
This means that for any table manifests containing `s3a://` or `s3n://` file paths, `S3FileIO` is still able to read them.
This feature allows people to easily switch from S3A to `S3FileIO`.
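+
+For example (a minimal sketch; the catalog name, Glue catalog type and bucket are placeholders), a catalog whose existing manifests contain `s3a://` paths can be pointed at `S3FileIO` without rewriting them:
+
+```shell
+# Sketch only: S3FileIO writes new files with the s3:// scheme, while manifests
+# that still reference s3a:// or s3n:// paths remain readable.
+spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.type=glue \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my-warehouse \
+    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
+```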

-If for any reason you have to use S3A, here are the instructions:
+To use S3A, follow these instructions:

1. To store data using S3A, specify the `warehouse` catalog property to be an S3A path, e.g. `s3a://my-bucket/my-warehouse`
2. For `HiveCatalog`, to also store metadata using S3A, specify the Hadoop config property `hive.metastore.warehouse.dir` to be an S3A path.
-3. Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency of your compute engine.
-4. Configure AWS settings based on [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) (make sure you check the version, S3A configuration varies a lot based on the version you use).
+3. Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency of your compute engine. The version of this module must exactly match the version of the other Hadoop binaries on the classpath (see the sketch after this list).
+4. For the latest features, bug fixes and best performance, use the latest versions of the Hadoop and hadoop-aws libraries.
+5. Use the exact same shaded AWS SDK version that the `hadoop-aws` module was built and tested with. Older versions are unlikely to work; newer versions are unqualified and lack fixes for problems identified during the qualification process.
+6. Configure AWS settings based on the [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) (make sure you check the version; S3A configuration varies a lot based on the version you use).
+
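+As a rough illustration of steps 3 to 5 (the version below is an example only, not a qualified combination; always match the Hadoop release already on your classpath), the module can be supplied at submission time:
+
+```shell
+# Sketch: hadoop-aws must match the cluster's Hadoop version exactly; its POM is
+# assumed to pull in the matching shaded software.amazon.awssdk:bundle transitively.
+spark-sql --packages org.apache.hadoop:hadoop-aws:3.4.1 \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.type=glue \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3a://my-bucket/my-warehouse \
+    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.hadoop.HadoopFileIO
+```
+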
+#### S3A: Maximizing Parquet and Iceberg Performance
+
+For best performance, the S3A connector should be configured for Parquet and Iceberg workloads
+based on the recommendations of [Maximizing Performance when working with the S3A Connector](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html).
+
+For applications reading Parquet data, including Iceberg manifests, the specific settings to use are listed below.
+
+| Property                           | Recommended value | Description                                                                             |
+|------------------------------------|-------------------|-----------------------------------------------------------------------------------------|
+| fs.s3a.experimental.input.fadvise  | parquet, random   | Optimizes file reading for random IO or, if explicitly supported, for Parquet files      |
+| fs.s3a.vectored.read.min.seek.size | 128M              | Threshold below which "nearby" Vector IO ranges are coalesced into single GET requests   |
+| parquet.hadoop.vectored.io.enabled | true              | Flag to enable Vector IO in Parquet                                                      |
+| iceberg.hadoop.bulk.delete.enabled | true              | Have Iceberg use the Hadoop 3.4.1+ bulk delete API, where available                      |
+
+These optional properties can be set as Spark catalog options, for example:
+
+```shell
+spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=s3a://my-bucket/my/key/prefix \
+    --conf spark.sql.catalog.my_catalog.type=glue \
+    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
+    --conf "spark.sql.catalog.my_catalog.fs.s3a.experimental.input.fadvise=parquet, random" \
+    --conf spark.sql.catalog.my_catalog.fs.s3a.vectored.read.min.seek.size=128M \
+    --conf spark.sql.catalog.my_catalog.parquet.hadoop.vectored.io.enabled=true \
+    --conf spark.sql.catalog.my_catalog.iceberg.hadoop.bulk.delete.enabled=true
+```
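+
+The same S3A read tuning can also be applied job-wide rather than per catalog. A sketch, assuming the standard Spark behaviour of copying `spark.hadoop.*` options into the Hadoop configuration used by S3A and Parquet:
+
+```shell
+# Sketch: applies the tuning to every S3A client in the job, not just the Iceberg catalog.
+spark-sql --conf "spark.hadoop.fs.s3a.experimental.input.fadvise=parquet, random" \
+    --conf spark.hadoop.fs.s3a.vectored.read.min.seek.size=128M \
+    --conf spark.hadoop.parquet.hadoop.vectored.io.enabled=true
+```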

### S3 Write Checksum Verification
