Commit 2ecfac3

HADOOP-19388. paste in content cut from iceberg patch

Forms initial iceberg page. To add:
* analytics stream
* using s3 as the URLs

1 parent 3a3e6f3 commit 2ecfac3

3 files changed: +86 -7 lines changed

hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/contract/ContractTestUtils.java

Lines changed: 12 additions & 7 deletions
```diff
@@ -46,6 +46,7 @@
 import java.io.IOException;
 import java.io.InputStream;
 import java.io.OutputStream;
+import java.io.UncheckedIOException;
 import java.nio.ByteBuffer;
 import java.nio.charset.StandardCharsets;
 import java.util.ArrayList;
@@ -56,13 +57,16 @@
 import java.util.Locale;
 import java.util.Map;
 import java.util.NoSuchElementException;
+import java.util.Optional;
 import java.util.Properties;
 import java.util.Set;
 import java.util.UUID;
 import java.util.concurrent.CompletableFuture;
 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.TimeoutException;
 
+import static java.util.Optional.empty;
+import static java.util.Optional.of;
 import static org.apache.hadoop.fs.CommonConfigurationKeysPublic.IO_FILE_BUFFER_SIZE_DEFAULT;
 import static org.apache.hadoop.fs.CommonConfigurationKeysPublic.IO_FILE_BUFFER_SIZE_KEY;
 import static org.apache.hadoop.util.functional.RemoteIterators.foreach;
@@ -1888,17 +1892,18 @@ public static void assertSuccessfulBulkDelete(List<Map.Entry<Path, String>> entr
    * Get a file status value or, if the path doesn't exist, return null.
    * @param fs filesystem
    * @param path path
-   * @return status or null
-   * @throws IOException Any IO Failure other than file not found.
+   * @return status or empty
+   * @throws UncheckedIOException Any IO Failure other than file not found.
    */
-  public static final FileStatus getFileStatusOrNull(
+  public static final Optional<FileStatus> getFileStatusIfPresent(
       final FileSystem fs,
-      final Path path)
-      throws IOException {
+      final Path path) {
     try {
-      return fs.getFileStatus(path);
+      return of(fs.getFileStatus(path));
     } catch (FileNotFoundException e) {
-      return null;
+      return empty();
+    } catch (IOException ioe) {
+      throw new UncheckedIOException(ioe);
     }
   }
 
```
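To show what the new contract looks like to callers, here is a small hypothetical user of the patched probe; the class, helper name and fallback value are illustrative only, not part of the patch:

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import static org.apache.hadoop.fs.contract.ContractTestUtils.getFileStatusIfPresent;

/** Hypothetical caller, sketching the migration from the old null-returning probe. */
public class StatusProbeExample {

  /** Length of the file at {@code path}, or -1 if it does not exist. */
  static long lengthOrMinusOne(FileSystem fs, Path path) {
    // Before: FileStatus st = getFileStatusOrNull(fs, path); return st == null ? -1 : st.getLen();
    // After: no checked IOException and no null check; IO failures other than
    // FileNotFoundException now surface as UncheckedIOException.
    return getFileStatusIfPresent(fs, path)
        .map(FileStatus::getLen)
        .orElse(-1L);
  }
}
```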

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/iceberg.md (new file)

Lines changed: 73 additions & 0 deletions
<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

## Using the Hadoop S3A Connector with Apache Iceberg

The Apache Hadoop S3A Connector can be used to access data on S3 stores through Apache Iceberg.

It:
* Uses an AWS SDK V2 client qualified across a broad set of applications, including Apache Spark, Apache Hive, Apache HBase and more. This may lag the Iceberg artifacts.
* Is qualified with third-party stores on every release.
* Contains detection of and recovery from S3 failures beyond that in the AWS SDK, with recovery added "one support call at a time".
* Supports scatter/gather IO ("Vector IO") for high-performance Parquet reads.
* Supports Amazon S3 Express One Zone storage, FIPS endpoints, Client-Side Encryption, S3 Access Points and S3 Access Grants.
* Supports OpenSSL as an optional TLS transport layer, for tangible performance improvements over the JDK implementation.
* Includes [auditing via the S3 Server Logs](./auditing.html), which can be used to answer important questions such as "who deleted all the files?" and "which job is triggering throttling?".
* Collects [client-side statistics](https://apachecon.com/acasia2022/sessions/bigdata-1191.html) for identification of performance and connectivity issues.
* Note: it does not support S3 Dual Stack, S3 Acceleration or S3 Tags.

To use the S3A Connector, follow these instructions:

1. To store data using S3A, set the `warehouse` catalog property to an S3A path, e.g. `s3a://my-bucket/my-warehouse`.
2. For `HiveCatalog` to also store metadata using S3A, set the Hadoop configuration property `hive.metastore.warehouse.dir` to an S3A path; see the sketch after this list.
3. Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency of your compute engine. The version of this module must exactly match that of all other Hadoop binaries on the classpath.
4. For the latest features, bug fixes and best performance, use the latest versions of all hadoop and hadoop-aws libraries.
5. Use the same shaded AWS SDK `bundle.jar` that the `hadoop-aws` module was built and tested with. Older versions are unlikely to work; newer versions are unqualified and may cause regressions.
6. Configure AWS settings based on the [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html).
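
To make steps 1 and 2 concrete, here is a minimal, hypothetical Spark job configuring a Hive-backed Iceberg catalog over S3A; the catalog name `hive_catalog`, the bucket and the table are placeholders, not values from this patch:

```java
import org.apache.spark.sql.SparkSession;

/**
 * Hypothetical sketch of steps 1 and 2: an Iceberg HiveCatalog storing
 * both table data and metastore warehouse data under an S3A path.
 */
public class IcebergS3ASketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-s3a-sketch")
        // Step 1: the warehouse catalog property is an S3A path.
        .config("spark.sql.catalog.hive_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.hive_catalog.type", "hive")
        .config("spark.sql.catalog.hive_catalog.warehouse", "s3a://my-bucket/my-warehouse")
        // Step 2: HiveCatalog metadata also goes to S3A, passed as Hadoop configuration.
        .config("spark.hadoop.hive.metastore.warehouse.dir", "s3a://my-bucket/my-warehouse")
        .getOrCreate();

    spark.sql("CREATE NAMESPACE IF NOT EXISTS hive_catalog.db");
    spark.sql("CREATE TABLE IF NOT EXISTS hive_catalog.db.events (id BIGINT) USING iceberg");
    spark.stop();
  }
}
```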

### Maximizing Parquet and Iceberg Performance through the S3A Connector

For best performance:
* The S3A connector should be configured for Parquet and Iceberg workloads
based on the recommendations of [Maximizing Performance when working with the S3A Connector](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html).
* Iceberg should be configured to use the Hadoop Bulk Delete API when deleting files.
* Parquet should be configured to use the Vector IO API for parallelized data retrieval.

The recommended settings are listed below:

| Property                             | Recommended value                | Description                                                                            |
|--------------------------------------|----------------------------------|----------------------------------------------------------------------------------------|
| `iceberg.hadoop.bulk.delete.enabled` | `true`                           | Iceberg to use the Hadoop bulk delete API, where available                             |
| `parquet.hadoop.vectored.io.enabled` | `true`                           | Use Vector IO in Parquet reads, where available                                        |
| `fs.s3a.vectored.read.min.seek.size` | `128K`                           | Threshold below which adjacent Vector IO ranges are coalesced into single GET requests |
| `fs.s3a.experimental.input.fadvise`  | `parquet,vector,random,adaptive` | Preferred read policy when opening files                                               |
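
For context, a sketch of the Hadoop bulk delete API (`FileSystem.createBulkDelete()`, added in recent Hadoop releases) that the first setting lets Iceberg call; the bucket and file names are placeholders, and this is an illustration of the API rather than Iceberg's own code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BulkDelete;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of the Hadoop bulk delete API over an S3A store. */
public class BulkDeleteSketch {
  public static void main(String[] args) throws Exception {
    Path base = new Path("s3a://my-bucket/my/key/prefix");
    FileSystem fs = base.getFileSystem(new Configuration());
    try (BulkDelete deleter = fs.createBulkDelete(base)) {
      // Up to pageSize() paths per call, all under the base path;
      // S3A maps each page to a single bulk DeleteObjects request.
      List<Map.Entry<Path, String>> failures = deleter.bulkDelete(Arrays.asList(
          new Path(base, "data/file-0001.parquet"),
          new Path(base, "data/file-0002.parquet")));
      // Any entries returned are paths which could not be deleted, with the reason.
      failures.forEach(e -> System.err.println(e.getKey() + ": " + e.getValue()));
    }
  }
}
```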

For example, in a Spark deployment using a Glue catalog:

```shell
spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.warehouse=s3a://my-bucket/my/key/prefix \
  --conf spark.sql.catalog.my_catalog.type=glue \
  --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
  --conf spark.sql.catalog.my_catalog.iceberg.hadoop.bulk.delete.enabled=true \
  --conf spark.sql.catalog.my_catalog.parquet.hadoop.vectored.io.enabled=true \
  --conf spark.hadoop.fs.s3a.vectored.read.min.seek.size=128K \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=parquet,vector,random,adaptive
```

The property `fs.s3a.vectored.read.min.seek.size` sets the threshold below which adjacent requests are coalesced into single GET requests.
This can compensate for S3 request latency by combining requests and discarding the data between them.
The recommended value is based on the [Facebook Velox paper](https://research.facebook.com/publications/velox-metas-unified-execution-engine/):

> IO reads for nearby columns are typically coalesced (merged) if the gap between them is small enough (currently about 20K for SSD and 500K for disaggregated storage), aiming to serve neighboring reads in as few IO reads as possible.
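
To illustrate what the coalescing acts on, here is a minimal sketch of Hadoop's Vector IO API (`readVectored()` on the input stream), which Parquet invokes when vectored IO is enabled; the path, offsets and lengths are hypothetical, and the coalescing itself happens inside the S3A connector, not in caller code:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of a vectored read of two nearby column ranges. */
public class VectorReadSketch {
  public static void main(String[] args) throws Exception {
    Path path = new Path("s3a://my-bucket/data/file.parquet");
    FileSystem fs = path.getFileSystem(new Configuration());
    // Two 32K ranges with a 64K gap between them: below the 128K
    // fs.s3a.vectored.read.min.seek.size threshold, so the S3A connector
    // may serve both with a single GET, discarding the bytes in between.
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(0, 32_768),
        FileRange.createFileRange(98_304, 32_768));
    try (FSDataInputStream in = fs.open(path)) {
      in.readVectored(ranges, ByteBuffer::allocate);
      for (FileRange r : ranges) {
        ByteBuffer data = r.getData().get(); // future completes when the range is read
        System.out.println("read " + data.remaining() + " bytes @ " + r.getOffset());
      }
    }
  }
}
```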

hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -43,6 +43,7 @@ full details.
 * [S3A and Directory Markers](directory_markers.html).
 * [Auditing](./auditing.html).
 * [Committing work to S3 with the "S3A Committers"](./committers.html)
+* [Apache Iceberg Integration](iceberg.html)
 * [S3A Committers Architecture](./committer_architecture.html)
 * [Working with IAM Assumed Roles](./assumed_roles.html)
 * [S3A Delegation Token Support](./delegation_tokens.html)
```
