diff --git a/docs/lakehouse/catalogs/hive-catalog.mdx b/docs/lakehouse/catalogs/hive-catalog.mdx index 46205b43285db..bd2d396f6730d 100644 --- a/docs/lakehouse/catalogs/hive-catalog.mdx +++ b/docs/lakehouse/catalogs/hive-catalog.mdx @@ -474,6 +474,7 @@ Hive transactional tables are supported from version 3.x onwards. For details, r 'glue.secret_key' = '' ); ``` + When Glue service authentication information differs from S3 authentication information, you can specify S3 authentication information separately in the following way. ```sql CREATE CATALOG hive_glue_on_s3_catalog PROPERTIES ( @@ -489,6 +490,16 @@ Hive transactional tables are supported from version 3.x onwards. For details, r 's3.secret_key' = '' ); ``` + + Using IAM Assumed Role to obtain S3 access credentials (Since 3.1.2+) + ```sql + CREATE CATALOG `glue_hive_iamrole` PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); diff --git a/docs/lakehouse/catalogs/iceberg-catalog.mdx b/docs/lakehouse/catalogs/iceberg-catalog.mdx index e80dc1f510137..0f1b5bd0b6aee 100644 --- a/docs/lakehouse/catalogs/iceberg-catalog.mdx +++ b/docs/lakehouse/catalogs/iceberg-catalog.mdx @@ -137,6 +137,43 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher > > You can check whether the source type has timezone information in the Extra column of the `DESCRIBE table_name` statement. If it shows `WITH_TIMEZONE`, it indicates that the source type is a timezone-aware type. (Supported since 3.1.0). +## Namespace Mapping + +Iceberg's metadata hierarchy is Catalog -> Namespace -> Table. Namespace can have multiple levels (Nested Namespace). + +``` + ┌─────────┐ + │ Catalog │ + └────┬────┘ + │ + ┌─────┴─────┐ + ┌──▼──┐ ┌──▼──┐ + │ NS1 │ │ NS2 │ + └──┬──┘ └──┬──┘ + │ │ +┌────▼───┐ ┌──▼──┐ +│ Table1 │ │ NS3 │ +└────────┘ └──┬──┘ + │ + ┌──────┴───────┐ + ┌────▼───┐ ┌────▼───┐ + │ Table2 │ │ Table3 │ + └────────┘ └────────┘ +``` + + +Starting from version 3.1.2, for Iceberg Rest Catalog, Doris supports mapping of Nested Namespace. + +In the above example, tables will be mapped to Doris metadata according to the following logic: + +| Catalog | Database | Table | +| --- | --- | --- | +| Catalog | NS1 | Table1 | +| Catalog | NS2.NS3 | Table2 | +| Catalog | NS2.NS3 | Table3 | + +Support for Nested Namespace needs to be explicitly enabled. For details, please refer to [Iceberg Rest Catalog](../metastores/iceberg-rest.md) + ## Examples ### Hive Metastore @@ -469,6 +506,7 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher 'glue.secret_key' = '' ); ``` + When Glue service authentication credentials differ from S3 authentication credentials, you can specify S3 authentication credentials separately using the following method. 
```sql CREATE CATALOG `iceberg_glue_on_s3_catalog_` PROPERTIES ( @@ -485,6 +523,18 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher 's3.secret_key' = '' ); ``` + + Using IAM Assumed Role to obtain S3 access credentials (Since 3.1.2+) + ```sql + CREATE CATALOG `glue_iceberg_iamrole` PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'glue', + 'warehouse' = 's3://bucket/warehouse', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git a/docs/lakehouse/metastores/aws-glue.md b/docs/lakehouse/metastores/aws-glue.md index be5f6fa868e6a..fba9332b51f64 100644 --- a/docs/lakehouse/metastores/aws-glue.md +++ b/docs/lakehouse/metastores/aws-glue.md @@ -11,31 +11,79 @@ This document describes the parameter configuration when using **AWS Glue Catalo AWS Glue Catalog currently supports three types of Catalogs: -| Catalog Type | Type Identifier (`type`) | Description | -|--------------|-------------------------|----------------------------------------------------| -| Hive | glue | Catalog for connecting to Hive Metastore | -| Iceberg | glue | Catalog for connecting to Iceberg table format | -| Iceberg | rest | Catalog for connecting to Iceberg via Glue Rest | +| Catalog Type | Type Identifier (`type`) | Description | +|-------------|-------------------------|------------------------------------------------| +| Hive | glue | Catalog for connecting to Hive Metastore | +| Iceberg | glue | Catalog for connecting to Iceberg table format | +| Iceberg | rest | Catalog for connecting to Iceberg table format via Glue Rest Catalog | -This document provides detailed descriptions of the parameters for each type to help users with configuration. +This documentation provides detailed parameter descriptions for each type to facilitate user configuration. -## Hive Glue Catalog +## Common Parameters Overview +| Parameter Name | Description | Required | Default Value | +|--------------------------|---------------------------------------------------------------|----------|---------------| +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | Yes | Empty | +| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue (supported since 3.1.2+) | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue (supported since 3.1.2+) | No | Empty | -Hive Glue Catalog is used to access Hive tables through AWS Glue's Hive Metastore compatible interface. 
Configuration parameters are as follows: +### Authentication Parameters -| Parameter Name | Description | Required | Default Value | -|---------------------------|----------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `hms` | Yes | None | -| `hive.metastore.type` | Fixed value `glue` | Yes | None | -| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | -| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | -| `glue.access_key` | AWS Access Key ID | Yes | Empty | -| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | -| `glue.catalog_id` | Glue Catalog ID (not yet supported) | No | Empty | -| `glue.role_arn` | IAM Role ARN for accessing Glue (not yet supported) | No | Empty | -| `glue.external_id` | IAM External ID for accessing Glue (not yet supported) | No | Empty | +Accessing Glue requires authentication information, supporting the following two methods: -### Example +1. Access Key Authentication + + Authenticate access to Glue through Access Key provided by `glue.access_key` and `glue.secret_key`. + +2. IAM Role Authentication (supported since 3.1.2+) + + Authenticate access to Glue through IAM Role provided by `glue.role_arn`. + + This method requires Doris to be deployed on AWS EC2, and the EC2 instance needs to be bound to an IAM Role that has permission to access Glue. + + If access through External ID is required, you need to configure `glue.external_id` as well. + +Notes: + +- At least one of the two methods must be configured. If both methods are configured, Access Key authentication takes priority. + +Example: + + ```sql + CREATE CATALOG hive_glue_catalog PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + -- Using Access Key authentication + 'glue.access_key' = '', + 'glue.secret_key' = '' + -- Or using IAM Role authentication + -- 'glue.role_arn' = '', + -- 'glue.external_id' = '' + ); + ``` + +### Hive Glue Catalog + +Hive Glue Catalog is used to access Hive tables through AWS Glue's Hive Metastore compatible interface. Configuration as follows: + +| Parameter Name | Description | Required | Default Value | +|--------------------------|---------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `hms` | Yes | None | +| `hive.metastore.type` | Fixed as `glue` | Yes | None | +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | No | Empty | +| `glue.secret_key` | AWS Secret Access Key | No | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue | No | Empty | + +#### Example ```sql CREATE CATALOG hive_glue_catalog PROPERTIES ( @@ -48,24 +96,24 @@ CREATE CATALOG hive_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Catalog +### Iceberg Glue Catalog -Iceberg Glue Catalog accesses Glue through the Glue Client. Configuration parameters are as follows: +Iceberg Glue Catalog accesses Glue through Glue Client. 
Configuration as follows: -| Parameter Name | Description | Required | Default Value | -|-------------------------|-----------------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `iceberg` | Yes | None | -| `iceberg.catalog.type` | Fixed value `glue` | Yes | None | -| `warehouse` | Iceberg warehouse path, e.g., `s3://my-bucket/iceberg-warehouse/` | Yes | s3://doris | -| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | -| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | -| `glue.access_key` | AWS Access Key ID | Yes | Empty | -| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | -| `glue.catalog_id` | Glue Catalog ID (not yet supported) | No | Empty | -| `glue.role_arn` | IAM Role ARN for accessing Glue (not yet supported) | No | Empty | -| `glue.external_id` | IAM External ID for accessing Glue (not yet supported) | No | Empty | +| Parameter Name | Description | Required | Default Value | +|------------------------|------------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `iceberg` | Yes | None | +| `iceberg.catalog.type` | Fixed as `glue` | Yes | None | +| `warehouse` | Iceberg data warehouse path, e.g., `s3://my-bucket/iceberg-warehouse/` | Yes | s3://doris | +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | No | Empty | +| `glue.secret_key` | AWS Secret Access Key | No | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue (not supported yet) | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue (not supported yet) | No | Empty | -### Example +#### Example ```sql CREATE CATALOG iceberg_glue_catalog PROPERTIES ( @@ -78,23 +126,23 @@ CREATE CATALOG iceberg_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Rest Catalog +### Iceberg Glue Rest Catalog -Iceberg Glue Rest Catalog accesses Glue through the Glue Rest Catalog interface. Currently only supports Iceberg tables stored in AWS S3 Table Bucket. Configuration parameters are as follows: +Iceberg Glue Rest Catalog accesses Glue through Glue Rest Catalog interface. Currently only supports Iceberg tables stored in AWS S3 Table Bucket. 
Configuration as follows: -| Parameter Name | Description | Required | Default Value | -|----------------------------------|---------------------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `iceberg` | Yes | None | -| `iceberg.catalog.type` | Fixed value `rest` | Yes | None | +| Parameter Name | Description | Required | Default Value | +|----------------------------------|-------------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `iceberg` | Yes | None | +| `iceberg.catalog.type` | Fixed as `rest` | Yes | None | | `iceberg.rest.uri` | Glue Rest service endpoint, e.g., `https://glue.ap-east-1.amazonaws.com/iceberg` | Yes | None | -| `warehouse` | Iceberg warehouse path, e.g., `:s3tablescatalog/` | Yes | None | -| `iceberg.rest.sigv4-enabled` | Enable V4 signature format, fixed value `true` | Yes | None | -| `iceberg.rest.signing-name` | Signature type, fixed value `glue` | Yes | Empty | -| `iceberg.rest.access-key-id` | Access Key for accessing Glue (also used for S3 Bucket access) | Yes | Empty | -| `iceberg.rest.secret-access-key` | Secret Key for accessing Glue (also used for S3 Bucket access) | Yes | Empty | -| `iceberg.rest.signing-region` | AWS Glue region, e.g., `us-east-1` | Yes | Empty | +| `warehouse` | Iceberg data warehouse path, e.g., `:s3tablescatalog/` | Yes | None | +| `iceberg.rest.sigv4-enabled` | Enable V4 signature format, fixed as `true` | Yes | None | +| `iceberg.rest.signing-name` | Signature type, fixed as `glue` | Yes | Empty | +| `iceberg.rest.access-key-id` | Access Key for accessing Glue (also used for accessing S3 Bucket) | Yes | Empty | +| `iceberg.rest.secret-access-key` | Secret Key for accessing Glue (also used for accessing S3 Bucket) | Yes | Empty | +| `iceberg.rest.signing-region` | AWS Glue region, e.g., `us-east-1` | Yes | Empty | -### Example +#### Example ```sql CREATE CATALOG glue_s3 PROPERTIES ( @@ -109,3 +157,89 @@ CREATE CATALOG glue_s3 PROPERTIES ( 'iceberg.rest.signing-region' = '' ); ``` + + +## Permission Policies + +Depending on usage scenarios, they can be divided into **read-only** and **read-write** policies. + +### 1. Read-Only Permissions + +Only allows reading database and table information from Glue Catalog. + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadOnly", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 2. Read-Write Permissions + +Based on read-only permissions, allows creating/modifying/deleting databases and tables. + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadWrite", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions", + "glue:CreateDatabase", + "glue:UpdateDatabase", + "glue:DeleteDatabase", + "glue:CreateTable", + "glue:UpdateTable", + "glue:DeleteTable" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### Notes + +1. Placeholder Replacement + + - `` → Your AWS region (e.g., `us-east-1`). + - `` → Your AWS account ID (12-digit number). + +2. 
Principle of Least Privilege
+
+   - If only querying, do not grant write permissions.
+   - You can replace `*` with specific database/table ARNs to further restrict permissions.
+
+3. S3 Permissions
+
+   - The above policies only involve the Glue Catalog.
+   - If you need to read data files, additional S3 permissions are required (such as `s3:GetObject`, `s3:ListBucket`, etc.).
\ No newline at end of file
diff --git a/docs/lakehouse/metastores/iceberg-rest.md b/docs/lakehouse/metastores/iceberg-rest.md
index be03e029efc57..1d89566ad31f9 100644
--- a/docs/lakehouse/metastores/iceberg-rest.md
+++ b/docs/lakehouse/metastores/iceberg-rest.md
@@ -19,6 +19,7 @@ This document describes the supported parameters when connecting to and accessin
 | iceberg.rest.oauth2.credential | | `oauth2` credentials used to access `server-uri` to obtain token | - | No |
 | iceberg.rest.oauth2.server-uri | | URI address for obtaining `oauth2` token, used in conjunction with `iceberg.rest.oauth2.credential` | - | No |
 | iceberg.rest.vended-credentials-enabled | | Whether to enable `vended-credentials` functionality. When enabled, it will obtain storage system access credentials such as `access-key` and `secret-key` from the rest server, eliminating the need for manual specification. Requires rest server support for this capability. | `false` | No |
+| iceberg.rest.nested-namespace-enabled | | (Supported since version 3.1.2+) Whether to enable support for Nested Namespace. If `true`, Nested Namespaces will be flattened and displayed as Database names, such as `parent_ns.child_ns`. Some Rest Catalog services do not support Nested Namespace, such as AWS Glue, so this parameter should be set to `false`. | `false` | No |
 
 > Note:
 >
@@ -28,6 +29,27 @@ This document describes the supported parameters when connecting to and accessin
 >
 > 3.
For AWS Glue Rest Catalog, please refer to the [AWS Glue documentation](./aws-glue.md) +## Nested Namespace + +Since 3.1.2, to fully access Nested Namespace, in addition to setting `iceberg.rest.nested-namespace-enabled` to `true` in the Catalog properties, you also need to enable the following global parameter: + +``` +SET GLOBAL enable_nested_namespace=true; +``` + +Assuming the Catalog is "ice", Namespace is "ns1.ns2", and Table is "tbl1", you can access Nested Namespace in the following ways: + +```sql +mysql> USE ice.ns1.ns2; +mysql> SELECT k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT tbl1.k1 FROM `ns1.ns2`.tbl1; +mysql> SELECT `ns1.ns2`.tbl1.k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT ice.`ns1.ns2`.tbl1.k1 FROM tbl1; +mysql> REFRESH CATALOG ice; +mysql> REFRESH DATABASE ice.`ns1.ns2`; +mysql> REFRESH TABLE ice.`ns1.ns2`.tbl1; +``` + ## Example Configurations - Rest Catalog service without authentication @@ -111,6 +133,43 @@ This document describes the supported parameters when connecting to and accessin ); ``` +- Connecting to Snowflake Open Catalog (Since 3.1.2) + + ```sql + -- Enable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 'iceberg.rest.vended-credentials-enabled' = 'true', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + + ```sql + -- Disable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 's3.access_key' = '', + 's3.secret_key' = '', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + - Connecting to Apache Gravitino Rest Catalog ```sql diff --git a/docs/lakehouse/storages/s3.md b/docs/lakehouse/storages/s3.md index 90e671f866646..8c5d8473940f4 100644 --- a/docs/lakehouse/storages/s3.md +++ b/docs/lakehouse/storages/s3.md @@ -15,18 +15,18 @@ This document describes the parameters required for accessing AWS S3. These para ## Parameter Overview -| Property Name | Legacy Name | Description | Default Value | Required | -|------------------------------|-------------|-------------------------------------------------|---------------|----------| -| s3.endpoint | | S3 service access endpoint, e.g., s3.us-east-1.amazonaws.com | None | No | -| s3.access_key | | AWS Access Key for authentication | None | No | -| s3.secret_key | | AWS Secret Key for authentication | None | No | -| s3.region | | S3 region, e.g., us-east-1. 
Highly recommended to configure | None | Yes |
-| s3.use_path_style | | Whether to use path-style access | FALSE | No |
-| s3.connection.maximum | | Maximum number of connections for high concurrency scenarios | 50 | No |
-| s3.connection.request.timeout| | Request timeout in milliseconds for connection acquisition | 3000 | No |
-| s3.connection.timeout | | Connection establishment timeout in milliseconds | 1000 | No |
-| s3.role_arn | | Role ARN when using Assume Role mode | None | No |
-| s3.external_id | | External ID used with s3.role_arn | None | No |
+| Property Name | Legacy Name | Description | Default | Required |
+|------------------------------|-------------|--------------------------------------------------|---------|----------|
+| s3.endpoint | | S3 service access endpoint, e.g., s3.us-east-1.amazonaws.com | None | No |
+| s3.access_key | | AWS Access Key for authentication | None | No |
+| s3.secret_key | | AWS Secret Key for authentication | None | No |
+| s3.region | | S3 region, e.g., us-east-1. Strongly recommended | None | Yes |
+| s3.use_path_style | | Whether to use path-style access | FALSE | No |
+| s3.connection.maximum | | Maximum number of connections for high concurrency scenarios | 50 | No |
+| s3.connection.request.timeout| | Request timeout (milliseconds), controls connection acquisition timeout | 3000 | No |
+| s3.connection.timeout | | Connection establishment timeout (milliseconds) | 1000 | No |
+| s3.role_arn | | Role ARN specified when using Assume Role mode | None | No |
+| s3.external_id | | External ID used with s3.role_arn | None | No |
 
 ## Authentication Configuration
 
@@ -41,7 +41,7 @@ Doris supports the following two methods to access S3:
    "s3.region"="us-east-1"
    ```
 
-2. Assume Role
+2. Assume Role Mode
 
    Suitable for cross-account and temporary authorization access. Automatically obtains temporary credentials through role authorization.
 
@@ -52,13 +52,13 @@ Doris supports the following two methods to access S3:
    "s3.region"="us-east-1"
    ```
 
-> If both Access Key and Role ARN are configured, Access Key mode takes priority.
+> If both Access Key and Role ARN are configured, Access Key mode takes precedence.
 
 ## Accessing S3 Directory Bucket
 
 > This feature is supported since version 3.1.0.
 
-Amazon S3 Express One Zone (also known as Directory Bucket) provides higher performance but has a different endpoint format.
+Amazon S3 Express One Zone (also known as Directory Bucket) provides higher performance, but has a different endpoint format.
 
 * Regular bucket: s3.us-east-1.amazonaws.com
 * Directory Bucket: s3express-usw2-az1.us-west-2.amazonaws.com
 
@@ -71,5 +71,84 @@ Example:
 
 "s3.access_key"="ak",
 "s3.secret_key"="sk",
 "s3.endpoint"="s3express-usw2-az1.us-west-2.amazonaws.com",
-"s3.region"="us-west
+"s3.region"="us-west-2"
 ```
+
+## Permission Policies
+
+Depending on the use case, permissions can be categorized into **read-only** and **read-write** policies.
+
+### 1. Read-only Permissions
+
+Only allows reading objects from S3. Suitable for LOAD, TVF, querying EXTERNAL CATALOG, and other scenarios.
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "s3:GetObject",
+        "s3:GetObjectVersion"
+      ],
+      "Resource": "arn:aws:s3:::/your-prefix/*"
+    },
+    {
+      "Effect": "Allow",
+      "Action": [
+        "s3:ListBucket",
+        "s3:GetBucketLocation"
+      ],
+      "Resource": "arn:aws:s3:::"
+    }
+  ]
+}
+```
+
+### 2. Read-write Permissions
+
+Based on read-only permissions, additionally allows deleting, creating, and modifying objects.
Suitable for EXPORT, OUTFILE, and EXTERNAL CATALOG write-back scenarios. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:PutObject", + "s3:GetObject", + "s3:GetObjectVersion", + "s3:DeleteObject", + "s3:DeleteObjectVersion", + "s3:AbortMultipartUpload", + "s3:ListMultipartUploadParts" + ], + "Resource": "arn:aws:s3::://*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation", + "s3:GetBucketVersioning", + "s3:GetLifecycleConfiguration" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### Notes + +1. Placeholder Replacement + + - `` → Your S3 Bucket name. + - `` → Your AWS account ID (12-digit number). + +2. Principle of Least Privilege + + - If only querying, do not grant write permissions. + \ No newline at end of file diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md index 786205b08454a..c811e75b1d436 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md @@ -74,7 +74,7 @@ The `EXPORT` command is used to export data from a specified table to files at a - `timeout`: Timeout for export job, default is 2 hours, unit is seconds. - - `compress_type`: (Supported since 2.1.5) When specifying the export file format as Parquet / ORC files, you can specify the compression method used by Parquet / ORC files. Parquet file format can specify compression methods as SNAPPY, GZIP, BROTLI, ZSTD, LZ4, and PLAIN, with default value SNAPPY. ORC file format can specify compression methods as PLAIN, SNAPPY, ZLIB, and ZSTD, with default value ZLIB. This parameter is supported starting from version 2.1.5. (PLAIN means no compression) + - `compress_type`: (Supported since 2.1.5) When specifying the export file format as Parquet / ORC files, you can specify the compression method used by Parquet / ORC files. Parquet file format can specify compression methods as SNAPPY, GZIP, BROTLI, ZSTD, LZ4, and PLAIN, with default value SNAPPY. ORC file format can specify compression methods as PLAIN, SNAPPY, ZLIB, and ZSTD, with default value ZLIB. This parameter is supported starting from version 2.1.5. (PLAIN means no compression). Starting from version 3.1.1, supports specifying compression algorithms for CSV format, currently supports "plain", "gz", "bz2", "snappyblock", "lz4block", "zstd". :::caution Note To use the delete_existing_files parameter, you also need to add the configuration `enable_delete_existing_files = true` in fe.conf and restart fe, then delete_existing_files will take effect. delete_existing_files = true is a dangerous operation, it's recommended to use only in test environments. diff --git a/docs/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md b/docs/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md index f0d59be7cee43..7aafd98add28d 100644 --- a/docs/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md +++ b/docs/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md @@ -1,16 +1,15 @@ --- { - "title": "OUTFILE", - "language": "en" + "title": "OUTFILE", + "language": "en" } - --- ## Description -This statement is used to export query results to a file using the `SELECT INTO OUTFILE` command. 
Currently, it supports exporting to remote storage, such as HDFS, S3, BOS, COS (Tencent Cloud), through the Broker process, S3 protocol, or HDFS protocol. +The `SELECT INTO OUTFILE` command is used to export query results to files. Currently supports exporting to remote storage such as HDFS, S3, BOS, COS (Tencent Cloud) through Broker process, S3 protocol or HDFS protocol. -## Syntax +## Syntax: ```sql @@ -21,33 +20,32 @@ INTO OUTFILE "" ## Required Parameters -**1. ``** +**1. ``** - The query statement must be a valid SQL statement. Please refer to the [query statement documentation](../../data-query/SELECT.md). +Query statement, must be a valid SQL, refer to [query statement documentation](../../data-query/SELECT.md). **2. ``** - file_path points to the path where the file is stored and the file prefix. Such as `hdfs://path/to/my_file_`. - - The final filename will consist of `my_file_`, the file number and the file format suffix. The file serial number starts from 0, and the number is the number of files to be divided. Such as: - - my_file_abcdefg_0.csv - - my_file_abcdefg_1.csv - - my_file_abcdegf_2.csv +File storage path and file prefix. Points to the file storage path and file prefix. For example `hdfs://path/to/my_file_`. +The final filename will consist of `my_file_`, file sequence number, and file format suffix. The file sequence number starts from 0, and the quantity is the number of files split. For example: +- my_file_abcdefg_0.csv +- my_file_abcdefg_1.csv +- my_file_abcdegf_2.csv - You can also omit the file prefix and specify only the file directory, such as: `hdfs://path/to/` +You can also omit the file prefix and only specify the file directory, such as `hdfs://path/to/` ## Optional Parameters **1. ``** - Specifies the export format. Supported formats include : - - `CSV` (Default) + Specify export format. Currently supports the following formats: + - `CSV` (default) - `PARQUET` - `CSV_WITH_NAMES` - `CSV_WITH_NAMES_AND_TYPES` - `ORC` - > Note: PARQUET, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC are supported starting in version 1.2 . + > Note: PARQUET, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, ORC are supported starting from version 1.2. **2. ``** @@ -55,73 +53,73 @@ INTO OUTFILE "" [ PROPERTIES (""="" [, ... ]) ] ``` -Specify related properties. Currently exporting via the Broker process, S3 protocol, or HDFS protocol is supported. +Currently supports export through Broker process, or through S3/HDFS protocol. -**File properties** -- `column_separator`: column separator,is only for CSV format. mulit-bytes is supported starting in version 1.2, such as: "\\x01", "abc". -- `line_delimiter`: line delimiter,is only for CSV format. mulit-bytes supported starting in version 1.2, such as: "\\x01", "abc". -- `max_file_size`: the size limit of a single file, if the result exceeds this value, it will be cut into multiple files, the value range of max_file_size is [5MB, 2GB] and the default is 1GB. (When specified that the file format is ORC, the size of the actual division file will be a multiples of 64MB, such as: specify max_file_size = 5MB, and actually use 64MB as the division; specify max_file_size = 65MB, and will actually use 128MB as cut division points.) -- `delete_existing_files`: default `false`. 
If it is specified as true, you will first delete all files specified in the directory specified by the file_path, and then export the data to the directory.For example: "file_path" = "/user/tmp", then delete all files and directory under "/user/"; "file_path" = "/user/tmp/", then delete all files and directory under "/user/tmp/" -- `file_suffix`: Specify the suffix of the export file. If this parameter is not specified, the default suffix for the file format will be used. +**Properties related to export file itself** +- `column_separator`: Column separator, only used for CSV related formats. Starting from version 1.2, supports multi-byte separators, such as: "\\x01", "abc". +- `line_delimiter`: Line delimiter, only used for CSV related formats. Starting from version 1.2, supports multi-byte separators, such as: "\\x01", "abc". +- `max_file_size`: Single file size limit, if the result exceeds this value, it will be split into multiple files, `max_file_size` value range is [5MB, 2GB], default is `1GB`. (When specifying export as ORC file format, the actual split file size will be a multiple of 64MB, for example: if `max_file_size = 5MB` is specified, it will actually be split by 64 MB; if `max_file_size = 65MB` is specified, it will actually be split by 128 MB) +- `delete_existing_files`: Default is `false`, if specified as `true`, it will first delete all files under the directory specified by `file_path`, then export data to that directory. For example: "file_path" = "/user/tmp", will delete all files and directories under "/user/"; "file_path" = "/user/tmp/", will delete all files and directories under "/user/tmp/". +- `file_suffix`: Specify the suffix of the exported file, if this parameter is not specified, the default suffix of the file format will be used. +- `compress_type`: When specifying the exported file format as Parquet / ORC file, you can specify the compression method used by Parquet / ORC file. Parquet file format can specify compression methods as SNAPPY, GZIP, BROTLI, ZSTD, LZ4 and PLAIN, default value is SNAPPY. ORC file format can specify compression methods as PLAIN, SNAPPY, ZLIB and ZSTD, default value is ZLIB. This parameter is supported starting from version 2.1.5. (PLAIN means no compression). Starting from version 3.1.1, supports specifying compression algorithms for CSV format, currently supports "plain", "gz", "bz2", "snappyblock", "lz4block", "zstd". -**Broker properties** _(need to be prefixed with `broker`)_ -- `broker.name: broker`: broker name -- `broker.hadoop.security.authentication`: specify the authentication method as kerberos -- `broker.kerberos_principal`: specifies the principal of kerberos -- `broker.kerberos_keytab`: specifies the path to the keytab file of kerberos. The file must be the absolute path to the file on the server where the broker process is located. and can be accessed by the Broker process +**Broker related properties** _(need to add prefix `broker.`)_ +- `broker.name: broker`: name +- `broker.hadoop.security.authentication`: Specify authentication method as kerberos +- `broker.kerberos_principal`: Specify kerberos principal +- `broker.kerberos_keytab`: Specify kerberos keytab file path. This file must be an absolute path of a file on the server where the Broker process is located. And it can be accessed by the Broker process -**HDFS properties** +**HDFS related properties** - `fs.defaultFS`: namenode address and port - `hadoop.username`: hdfs username -- `dfs.nameservices`: if hadoop enable HA, please set fs nameservice. 
See hdfs-site.xml -- `dfs.ha.namenodes.[nameservice ID]`: unique identifiers for each NameNode in the nameservice. See hdfs-site.xml -- `dfs.namenode.rpc-address.[nameservice ID].[name node ID]`: the fully-qualified RPC address for each NameNode to listen on. See hdfs-site.xml -- `dfs.client.failover.proxy.provider.[nameservice ID]`: the Java class that HDFS clients use to contact the Active NameNode, usually it is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider - -**For a kerberos-authentication enabled Hadoop cluster, additional properties need to be set:** -- `dfs.namenode.kerberos.principal`: HDFS namenode service principal -- `hadoop.security.authentication`: kerberos -- `hadoop.kerberos.principal`: the Kerberos pincipal that Doris will use when connectiong to HDFS. -- `hadoop.kerberos.keytab`: HDFS client keytab location. - -For the S3 protocol, you can directly execute the S3 protocol configuration: -- `s3.endpoint` -- `s3.access_key` -- `s3.secret_key` -- `s3.region` -- `use_path_style`: (optional) default false . The S3 SDK uses the virtual-hosted style by default. However, some object storage systems may not be enabled or support virtual-hosted style access. At this time, we can add the use_path_style parameter to force the use of path style access method. - -> Note that to use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` to the fe.conf file and restart the FE. Only then will the `delete_existing_files` parameter take effect. Setting `delete_existing_files = true` is a dangerous operation and it is recommended to only use it in a testing environment. +- `dfs.nameservices`: name service name, consistent with hdfs-site.xml +- `dfs.ha.namenodes.[nameservice ID]`: namenode id list, consistent with hdfs-site.xml +- `dfs.namenode.rpc-address.[nameservice ID].[name node ID]`: Name node rpc address, same number as namenode count, consistent with hdfs-site.xml +- `dfs.client.failover.proxy.provider.[nameservice ID]`: Java class for HDFS client to connect to active namenode, usually "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" + +**For Hadoop clusters with kerberos authentication enabled, additional PROPERTIES attributes need to be set:** +- `dfs.namenode.kerberos.principal`: Principal name of HDFS namenode service +- `hadoop.security.authentication`: Set authentication method to kerberos +- `hadoop.kerberos.principal`: Set the Kerberos principal used when Doris connects to HDFS +- `hadoop.kerberos.keytab`: Set keytab local file path + +For S3 protocol, directly configure S3 protocol settings: + - `s3.endpoint` + - `s3.access_key` + - `s3.secret_key` + - `s3.region` + - `use_path_style`: (Optional) Default is `false`. S3 SDK uses Virtual-hosted Style by default. But some object storage systems may not have enabled or support Virtual-hosted Style access, in this case you can add the `use_path_style` parameter to force the use of Path Style access. + +> Note: To use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` in `fe.conf` and restart fe, then delete_existing_files will take effect. delete_existing_files = true is a dangerous operation, it is recommended to use only in test environments. 
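As an illustration of how these properties fit together, the following is a minimal sketch of an export over the S3 protocol that also uses the CSV compression option described above (available since 3.1.1); the table name, bucket, path, and credentials are placeholders rather than values taken from a real environment:

```sql
SELECT k1, k2 FROM tbl
INTO OUTFILE "s3://my-bucket/export/result_"
FORMAT AS CSV
PROPERTIES (
    -- S3 protocol connection properties
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    -- Compress the CSV output with gzip (CSV compression is supported since 3.1.1)
    "compress_type" = "gz",
    "max_file_size" = "1024MB"
);
```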
## Return Value -The results returned by the `Outfile` statement are explained as follows: - -| Column | DataType | Note | -|------------------|--------------|----------------------------------------------------------------------------------------------------------------| -| FileNumber | int | The total number of files generated. | -| TotalRows | int | The number of rows in the result set. | -| FileSize | int | The total size of the exported files, in bytes. | -| URL | string | The prefix of the exported file paths. Multiple files are numbered sequentially with suffixes like `_0`, `_1`. | +The result returned by the Outfile statement, the meaning of each column is as follows: -## Access Control Requirements +| Column Name | Type | Description | +|-------------|----------|-------------------------------------------------| +| FileNumber | int | Number of files finally generated | +| TotalRows | int | Number of rows in result set | +| FileSize | int | Total size of exported files. Unit: bytes. | +| URL | string | Prefix of exported file path, multiple files will be numbered with suffixes `_0`,`_1` sequentially. | -The user executing this SQL command must have at least the following privileges: +## Permission Control -| Privilege | Object | Notes | -|:-----------------|:-----------|:------------------------------------------------| -| SELECT_PRIV | Database | Requires read access to the database and table. | +Users executing this SQL command must have at least the following permissions: +| Permission | Object | Description | +|:------------|:-------------|:-------------------------------| +| SELECT_PRIV | Database | Requires read permissions on database and table. | -## Usage Notes +## Notes -### DataType Mapping +### Data Type Mapping -- All file formats support the export of basic data types, while only csv/orc/csv_with_names/csv_with_names_and_types currently support the export of complex data types (ARRAY/MAP/STRUCT). Nested complex data types are not supported. +- All file types support exporting basic data types, while for complex data types (ARRAY/MAP/STRUCT), currently only `csv`, `orc`, `csv_with_names` and `csv_with_names_and_types` support exporting complex types, and nested complex types are not supported. -- Parquet and ORC file formats have their own data types. The export function of Doris can automatically export the Doris data types to the corresponding data types of the Parquet/ORC file format. The following are the data type mapping relationship of the Doris data types and the Parquet/ORC file format data types: +- Parquet and ORC file formats have their own data types, Doris's export function can automatically export Doris data types to corresponding data types in Parquet/ORC file formats. The following are the data type mapping tables between Apache Doris data types and Parquet/ORC file formats: -1. The mapping relationship between the Doris data types to the ORC data types is: +1. **Doris to ORC file format data type mapping table:** | Doris Type | Orc Type | |-------------------------|-----------| | boolean | boolean | @@ -142,7 +140,9 @@ The user executing this SQL command must have at least the following privileges: | map | map | | array | array | -2. When Doris exports data to the Parquet file format, the Doris memory data will be converted to Arrow memory data format first, and then the paraquet file format is written by Arrow. The mapping relationship between the Doris data types to the ARROW data types is: +2. 
**Doris to Parquet file format data type mapping table:** + + When Doris exports to Parquet file format, it first converts Doris memory data to Arrow memory data format, then Arrow writes to Parquet file format. The mapping relationship between Doris data types and Arrow data types is: | Doris Type | Arrow Type | |-------------------------|------------| | boolean | boolean | @@ -163,40 +163,36 @@ The user executing this SQL command must have at least the following privileges: | map | map | | array | list | +### Export Data Volume and Export Efficiency -### Export data volume and export efficiency - - This function essentially executes an SQL query command. The final result is a single-threaded output. Therefore, the time-consuming of the entire export includes the time-consuming of the query itself and the time-consuming of writing the final result set. If the query is large, you need to set the session variable `query_timeout` to appropriately extend the query timeout. - -### Management of export files + This function essentially executes a SQL query command. The final result is output in a single thread. So the total export time includes the query execution time and the final result set write time. If the query is large, you need to set the session variable `query_timeout` to appropriately extend the query timeout. - Doris does not manage exported files. Including the successful export, or the remaining files after the export fails, all need to be handled by the user. +### Exported File Management -### Export to local file - To export to a local file, you need configure `enable_outfile_to_local=true` in fe.conf. + Doris does not manage exported files. Including successfully exported files or residual files after export failure, all need to be handled by users themselves. +### Export to Local Files + To export to local files, you need to first configure `enable_outfile_to_local=true` in `fe.conf` ```sql - select * from tbl1 limit 10 + select * from tbl1 limit 10 INTO OUTFILE "file:///home/work/path/result_"; ``` -The ability to export to a local file is not available for public cloud users, only for private deployments. And the default user has full control over the cluster nodes. Doris will not check the validity of the export path filled in by the user. If the process user of Doris does not have write permission to the path, or the path does not exist, an error will be reported. At the same time, for security reasons, if a file with the same name already exists in this path, the export will also fail. + The function of exporting to local files is not suitable for public cloud users, only for users with private deployment. And it defaults that users have complete control over cluster nodes. Doris does not perform validity checks on the export path filled by users. If the Doris process user does not have write permission to the path, or the path does not exist, an error will be reported. Also for security considerations, if a file with the same name already exists at the path, the export will also fail. -Doris does not manage files exported locally, nor does it check disk space, etc. These files need to be managed by the user, such as cleaning and so on. + Doris does not manage files exported locally, nor does it check disk space, etc. These files need to be managed by users themselves, such as cleanup. 
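To make the notes above concrete, here is a minimal sketch that combines them: it extends the query timeout for a large result set and writes to a local path. It assumes `enable_outfile_to_local=true` is already set in `fe.conf`; the table name and directory are placeholders.

```sql
-- Extend the query timeout (in seconds) for a large export,
-- as discussed in "Export Data Volume and Export Efficiency" above.
SET query_timeout = 7200;

-- The Doris process user must have write permission on this directory,
-- and a file with the same name must not already exist there.
SELECT * FROM tbl1
INTO OUTFILE "file:///home/work/export/result_"
FORMAT AS CSV
PROPERTIES (
    "column_separator" = ",",
    "line_delimiter" = "\n",
    "max_file_size" = "1024MB"
);
```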
-### Results Integrity Guarantee - -This command is a synchronous command, so it is possible that the task connection is disconnected during the execution process, so that it is impossible to live the exported data whether it ends normally, or whether it is complete. At this point, you can use the `success_file_name` parameter to request that a successful file identifier be generated in the directory after the task is successful. Users can use this file to determine whether the export ends normally. +### Result Integrity Guarantee + This command is a synchronous command, so it's possible that the task connection is disconnected during execution, making it impossible to know whether the exported data ended normally or is complete. In this case, you can use the `success_file_name` parameter to require the task to generate a success file identifier in the directory after successful completion. Users can use this file to determine whether the export ended normally. ### Concurrent Export -Setting the session variable `set enable_parallel_outfile = true;` enables concurrent export using outfile. For detailed usage, see [Export Query Result](../../../../data-operate/export/outfile). - + Set Session variable `set enable_parallel_outfile = true;` to enable Outfile concurrent export. ## Examples -- Use the broker method to export, and export the simple query results to the file `hdfs://path/to/result.txt`. Specifies that the export format is CSV. Use `my_broker` and set kerberos authentication information. Specify the column separator as `,` and the row separator as `\n`. +- Export using Broker method, export simple query results to file `hdfs://path/to/result.txt`. Specify export format as CSV. Use `my_broker` and set kerberos authentication information. Specify column separator as `,`, line delimiter as `\n`. ```sql SELECT * FROM tbl @@ -214,10 +210,10 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ); ``` - If the final generated file is not larger than 100MB, it will be: `result_0.csv`. + The final generated file will be: `result_0.csv` if not larger than 100MB. If larger than 100MB, it may be `result_0.csv, result_1.csv, ...`. -- Export the simple query results to the file `hdfs://path/to/result.parquet`. Specify the export format as PARQUET. Use `my_broker` and set kerberos authentication information. +- Export simple query results to file `hdfs://path/to/result.parquet`. Specify export format as PARQUET. Use `my_broker` and set kerberos authentication information. ```sql SELECT c1, c2, c3 FROM tbl @@ -232,7 +228,7 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ); ``` -- Export the query result of the CTE statement to the file `hdfs://path/to/result.txt`. The default export format is CSV. Use `my_broker` and set hdfs high availability information. Use the default row and column separators. +- Export CTE statement query results to file `hdfs://path/to/result.txt`. Default export format is CSV. Use `my_broker` and set HDFS high availability information. Use default row and column separators. ```sql WITH @@ -255,11 +251,11 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ); ``` - If the final generated file is not larger than 1GB, it will be: `result_0.csv`. + The final generated file will be: `result_0.csv` if not larger than 1GB. If larger than 1GB, it may be `result_0.csv, result_1.csv, ...`. -- Export the query result of the UNION statement to the file `bos://bucket/result.txt`. 
Specify the export format as PARQUET. Use `my_broker` and set hdfs high availability information. The PARQUET format does not require a column delimiter to be specified. - After the export is complete, an identity file is generated. +- Export UNION statement query results to file `bos://bucket/result.txt`. Specify export format as PARQUET. Use `my_broker` and set HDFS high availability information. PARQUET format does not need to specify column separator. + After export completion, generate an identifier file. ```sql SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1 @@ -274,8 +270,8 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ); ``` -- Export the query result of the select statement to the file `s3a://${bucket_name}/path/result.txt`. Specify the export format as csv. - After the export is complete, an identity file is generated. +- Export Select statement query results to file `s3a://${bucket_name}/path/result.txt`. Specify export format as CSV. + After export completion, generate an identifier file. ```sql select k1,k2,v1 from tbl1 limit 100000 @@ -294,14 +290,14 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ) ``` - If the final generated file is not larger than 1GB, it will be: `my_file_0.csv`. + The final generated file will be: `my_file_0.csv` if not larger than 1GB. If larger than 1GB, it may be `my_file_0.csv, result_1.csv, ...`. - Verify on cos + Verification on cos: - 1. A path that does not exist will be automatically created - 2. Access.key/secret.key/endpoint needs to be confirmed with students of cos. Especially the value of endpoint does not need to fill in bucket_name. + 1. Non-existing paths will be automatically created + 2. access.key/secret.key/endpoint need to be confirmed with cos colleagues. Especially the endpoint value, no need to fill in bucket_name. -- Use the s3 protocol to export to bos, and enable concurrent export. +- Export to bos using S3 protocol, with concurrent export enabled. ```sql set enable_parallel_outfile = true; @@ -317,10 +313,10 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ) ``` - The resulting file is prefixed with `my_file_{fragment_instance_id}_`. + The final generated file prefix will be `my_file_{fragment_instance_id}_`. -- Use the s3 protocol to export to bos, and enable concurrent export of session variables. - Note: However, since the query statement has a top-level sorting node, even if the concurrently exported session variable is enabled for this query, it cannot be exported concurrently. +- Export to bos using S3 protocol, with concurrent export Session variable enabled. + Note: But because the query statement has a top-level sort node, this query cannot use concurrent export even if the concurrent export Session variable is enabled. ```sql set enable_parallel_outfile = true; @@ -336,10 +332,10 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ) ``` -- Use hdfs export to export simple query results to the file `hdfs://${host}:${fileSystem_port}/path/to/result.txt`. Specify the export format as CSV and the user name as work. Specify the column separator as `,` and the row separator as `\n`. +- Export using HDFS method, export simple query results to file `hdfs://${host}:${fileSystem_port}/path/to/result.txt`. Specify export format as CSV, username as work. Specify column separator as `,`, line delimiter as `\n`. 
```sql - -- fileSystem_port 默认值为 9000 + -- fileSystem_port default value is 9000 SELECT * FROM tbl INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_" FORMAT AS CSV @@ -350,7 +346,7 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ); ``` - If the Hadoop cluster is highly available and Kerberos authentication is enabled, you can refer to the following SQL statement: + If Hadoop cluster has high availability enabled and uses Kerberos authentication, you can refer to the following SQL statement: ```sql SELECT * FROM tbl @@ -371,11 +367,11 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu ); ``` - If the final generated file is not larger than 100MB, it will be: `result_0.csv`. - If larger than 100MB, it may be `result_0.csv, result_1.csv, ...`. + The final generated file will be: `result_0.csv` if not larger than 100 MB. + If larger than 100 MB, it may be `result_0.csv, result_1.csv, ...`. -- Export the query result of the select statement to the file `cosn://${bucket_name}/path/result.txt` on Tencent Cloud Object Storage (COS). Specify the export format as csv. - After the export is complete, an identity file is generated. +- Export Select statement query results to Tencent Cloud cos file `cosn://${bucket_name}/path/result.txt`. Specify export format as CSV. + After export completion, generate an identifier file. ```sql select k1,k2,v1 from tbl1 limit 100000 @@ -392,6 +388,4 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu "max_file_size" = "1024MB", "success_file_name" = "SUCCESS" ) - ``` - - + ``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.mdx index b62672f038cbb..1417dac05a32f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.mdx +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.mdx @@ -220,6 +220,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+) ```sql CREATE CATALOG hive_hms_on_s3_iamrole PROPERTIES ( @@ -483,6 +484,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'glue.secret_key' = '' ); ``` + Glue 服务的认证信息和 S3 的认证信息不一致时,可以通过以下方式单独指定 S3 的认证信息。 ```sql CREATE CATALOG hive_glue_on_s3_catalog PROPERTIES ( @@ -498,6 +500,17 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+ 支持) + ```sql + CREATE CATALOG `glue_hive_iamrole` PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/iceberg-catalog.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/iceberg-catalog.mdx index 8cdfae5db59d4..eeb1e3c4e1658 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/iceberg-catalog.mdx +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/iceberg-catalog.mdx @@ -149,6 +149,42 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( > > 可以在 `DESCRIBE table_name` 语句中的 Extra 列查看源类型是否带时区信息。如显示 `WITH_TIMEZONE`,则表示源类型是带时区的类型。(该功能自 3.1.0 版本支持)。 +## Namespace 映射 + +Iceberg 的元数层级关系是 Catalog 
-> Namespace -> Table。其中 Namespace 可以有多级(Nested Namespace)。 + +``` + ┌─────────┐ + │ Catalog │ + └────┬────┘ + │ + ┌─────┴─────┐ + ┌──▼──┐ ┌──▼──┐ + │ NS1 │ │ NS2 │ + └──┬──┘ └──┬──┘ + │ │ +┌────▼───┐ ┌──▼──┐ +│ Table1 │ │ NS3 │ +└────────┘ └──┬──┘ + │ + ┌──────┴───────┐ + ┌────▼───┐ ┌────▼───┐ + │ Table2 │ │ Table3 │ + └────────┘ └────────┘ +``` + +自 3.1.2 版本开始,对于 Iceberg Rest Catalog,Doris 支持对 Nested Namespace 的映射。 + +在上述示例中表,会按照如下逻辑映射为 Doris 的元数据: + +| Catalog | Database | Table | +| --- | --- | --- | +| Catalog | NS1 | Table1 | +| Catalog | NS2.NS3 | Table2 | +| Catalog | NS1.NS3 | Table3 | + +对 Nested Namespace 的支持需要显式开启,具体请参阅 [Iceberg Rest Catalog](../metastores/iceberg-rest.md) + ## 基础示例 ### Hive Metastore @@ -481,6 +517,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'glue.secret_key' = '' ); ``` + Glue 服务的认证信息和 S3 的认证信息不一致时,可以通过以下方式单独指定 S3 的认证信息。 ```sql CREATE CATALOG `iceberg_glue_on_s3_catalog_` PROPERTIES ( @@ -497,6 +534,18 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+ 支持) + ```sql + CREATE CATALOG `glue_iceberg_iamrole` PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'glue', + 'warehouse' = 's3://bucket/warehouse', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/metastores/aws-glue.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/metastores/aws-glue.md index 5ecbb0fe7e586..33f5df619c9d6 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/metastores/aws-glue.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/metastores/aws-glue.md @@ -19,23 +19,71 @@ AWS Glue Catalog 当前支持三种类型的 Catalog: 本说明文档分别对这写类型的参数进行详细介绍,便于用户配置。 -## Hive Glue Catalog +## 通用参数总览 +| 参数名称 | 描述 | 是否必须 | 默认值 | +|---------------------------|-------------------------------------------------------------|------|--------| +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 是 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue(自 3.1.2+ 支持) | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue(自 3.1.2+ 支持) | 否 | 空 | + +### 认证参数 + +访问 Glue 需要认证信息,支持以下两种方式: + +1. Access Key 认证 + + 通过 `glue.access_key` 和 `glue.secret_key` 提供的 Access Key 认证访问 Glue。 + +2. 
IAM Role 认证(自 3.1.2+ 起支持) + + 通过 `glue.role_arn` 提供的 IAM Role 认证访问 Glue。 + + 该方式需要 Doris 部署在 AWS EC2 上,并且 EC2 实例需要绑定一个 IAM Role,且该 Role 需要有访问 Glue 的权限。 + + 如果需要通过 External ID 进行访问,需要同时配置 `glue.external_id`。 + +注意事项: + +- 两种方式必须至少配置一种,如果同时配置了两种方式,则优先使用 AccessKey 认证。 + +示例: + + ```sql + CREATE CATALOG hive_glue_catalog PRPPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + -- 使用 Access Key 认证 + 'glue.access_key' = '', + 'glue.secret_key' = '' + -- 或者使用 IAM Role 认证 + -- 'glue.role_arn' = '', + -- 'glue.external_id' = '' + ); + ``` + +### Hive Glue Catalog Hive Glue Catalog 用于访问 Hive 表,通过 AWS Glue 的 Hive Metastore 兼容接口访问 Glue。配置如下: | 参数名称 | 描述 | 是否必须 | 默认值 | -|---------------------------|-----------------------------------------------------------|----------|--------| -| `type` | 固定为 `hms` | 是 | 无 | -| `hive.metastore.type` | 固定为 `glue` | 是 | 无 | -| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | -| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | -| `glue.access_key` | AWS Access Key ID | 是 | 空 | -| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | -| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | -| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | -| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | - -### 示例 +|---------------------------|-----------------------------------------------------------|------|--------| +| `type` | 固定为 `hms` | 是 | 无 | +| `hive.metastore.type` | 固定为 `glue` | 是 | 无 | +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 否 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 否 | 空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue | 否 | 空 | + +#### 示例 ```sql CREATE CATALOG hive_glue_catalog PROPERTIES ( @@ -48,24 +96,24 @@ CREATE CATALOG hive_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Catalog +### Iceberg Glue Catalog Iceberg Glue Catalog 通过 Glue Client 访问 Glue。配置如下: | 参数名称 | 描述 | 是否必须 | 默认值 | -|-------------------------|--------------------------------------------------------------|----------|------------| -| `type` | 固定为 `iceberg` | 是 | 无 | -| `iceberg.catalog.type` | 固定为 `glue` | 是 | 无 | -| `warehouse` | Iceberg 数据仓库路径,例如:`s3://my-bucket/iceberg-warehouse/` | 是 | s3://doris | -| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | -| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | -| `glue.access_key` | AWS Access Key ID | 是 | 空 | -| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | -| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | -| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | -| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | - -### 示例 +|-------------------------|--------------------------------------------------------------|------|------------| +| `type` | 固定为 `iceberg` | 是 | 无 | +| `iceberg.catalog.type` | 固定为 `glue` | 是 | 无 | +| `warehouse` | Iceberg 数据仓库路径,例如:`s3://my-bucket/iceberg-warehouse/` | 是 | s3://doris | +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 否 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 否 | 
空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | + +#### 示例 ```sql CREATE CATALOG iceberg_glue_catalog PROPERTIES ( @@ -78,7 +126,7 @@ CREATE CATALOG iceberg_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Rest Catalog +### Iceberg Glue Rest Catalog Iceberg Glue Rest Catalog 通过 Glue Rest Catalog 接口访问 Glue。目前仅支持存储在 AWS S3 Table Bucket 中的 Iceberg 表。配置如下: @@ -94,7 +142,7 @@ Iceberg Glue Rest Catalog 通过 Glue Rest Catalog 接口访问 Glue。目前仅 | `iceberg.rest.secret-access-key` | 访问 Glue 的 Secret Key(同时也用于访问 S3 Bucket) | 是 | 空 | | `iceberg.rest.signing-region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 空 | -### 示例 +#### 示例 ```sql CREATE CATALOG glue_s3 PROPERTIES ( @@ -109,3 +157,89 @@ CREATE CATALOG glue_s3 PROPERTIES ( 'iceberg.rest.signing-region' = '' ); ``` + + +## 权限策略 + +根据使用场景不同,可以分为 **只读** 和 **读写** 两类策略。 + +### 1. 只读权限 + +只允许读取 Glue Catalog 的数据库和表信息。 + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadOnly", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 2. 读写权限 + +在只读的基础上,允许创建 / 修改 / 删除数据库和表。 + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadWrite", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions", + "glue:CreateDatabase", + "glue:UpdateDatabase", + "glue:DeleteDatabase", + "glue:CreateTable", + "glue:UpdateTable", + "glue:DeleteTable" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 注意事项 + +1. 占位符替换 + + - `` → 你的 AWS 区域(如 `us-east-1`)。 + - `` → 你的 AWS 账号 ID(12 位数字)。 + +2. 最小权限原则 + + - 如果只做查询,不要授予写权限。 + - 可以替换 `*` 为具体数据库、表 ARN,进一步收紧权限。 + +3. S3 权限 + + - 上述策略只涉及 Glue Catalog + - 如果需要读取数据文件,还需额外授予 S3 权限(如 `s3:GetObject`, `s3:ListBucket` 等)。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/metastores/iceberg-rest.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/metastores/iceberg-rest.md index 446a0720558b1..ec66765829af0 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/metastores/iceberg-rest.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/metastores/iceberg-rest.md @@ -19,6 +19,7 @@ | iceberg.rest.oauth2.credential | | `oauth2` 凭证,用于访问 `server-uri` 获取 token | - | 否 | | iceberg.rest.oauth2.server-uri | | 用于获取 `oauth2` token 的 uri 地址,配合 `iceberg.rest.oauth2.credential` 使用 | - | 否 | | iceberg.rest.vended-credentials-enabled | | 是否启用 `vended-credentials` 功能。启用后,会同 rest 服务端获取访问存储系统的凭证信息,如 `access-key` 和 `secret-key`,不再需要手动指定。需要 rest 服务端本身支持该能力。| `false` | 否 | +| iceberg.rest.nested-namespace-enabled | | (自 3.1.2+ 版本支持)是否启用对 Nested Namespace 的支持。默认为 `false`。如果为 `true`,则 Nested Namespace 会被打平作为 Database 名称显示,如 `parent_ns.child_ns`。某些 Rest Catalog 服务不支持 Nested Namespace,如 AWS Glue,责改参数需设置为 `false` | 否 | > 注: > @@ -28,6 +29,27 @@ > > 3. 
AWS Glue Rest Catalog 请参阅 [AWS Glue 文档](./aws-glue.md) +## Nested Namespace + +在 3.1.2 及后续版本中,如需完整访问 Nested Namespace,除了在 Catalog 属性中将 `iceberg.rest.nested-namespace-enabled` 设置为 `true` 外,还需开启如下全局参数: + +``` +SET GLOBAL enable_nested_namespace=true; +``` + +假设 Catalog 为 "ice",Namespace 为 "ns1.ns2",Table 为 "tbl1",可参考如下方式访问 Nested Namespace: + +```sql +mysql> USE ice.ns1.ns2; +mysql> SELECT k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT tbl1.k1 FROM `ns1.ns2`.tbl1; +mysql> SELECT `ns1.ns2`.tbl1.k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT ice.`ns1.ns2`.tbl1.k1 FROM tbl1; +mysql> REFRESH CATALOG ice; +mysql> REFRESH DATABASE ice.`ns1.ns2`; +mysql> REFRESH TABLE ice.`ns1.ns2`.tbl1; +``` + ## 示例配置 - 无认证的 Rest Catalog 服务 @@ -111,6 +133,43 @@ ); ``` +- 连接 Snowflake Open Catalog (自 3.1.2 版本支持) + + ```sql + -- Enable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 'iceberg.rest.vended-credentials-enabled' = 'true', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + + ```sql + -- Disable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 's3.access_key' = '', + 's3.secret_key' = '', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + - 连接 Apache Gravitino Rest Catalog ```sql diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/storages/s3.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/storages/s3.md index b812f0d5f0e72..ceb081d5129eb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/storages/s3.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/storages/s3.md @@ -73,3 +73,82 @@ Amazon S3 Express One Zone(又名 Directory Bucket)提供更高性能,但 "s3.endpoint"="s3express-usw2-az1.us-west-2.amazonaws.com", "s3.region"="us-west-2" ``` + +## 权限策略 + +根据使用场景不同,可以分为 **只读** 和 **读写** 两类策略。 + +### 1. 只读权限 + +只允许读取 S3 中的对象。适用于 LOAD、TVF、查询 EXTERNAL CATALOG 等场景。 + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:GetObjectVersion", + ], + "Resource": "arn:aws:s3:::/your-prefix/*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### 2. 
读写权限 + +在只读的基础上,允许删除、创建、修改对象。适用于 EXPORT、OUTFILE 以及 EXTERNAL CATALOG 回写等场景。 + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:PutObject", + "s3:GetObject", + "s3:GetObjectVersion", + "s3:DeleteObject", + "s3:DeleteObjectVersion", + "s3:AbortMultipartUpload", + "s3:ListMultipartUploadParts" + ], + "Resource": "arn:aws:s3::://*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation", + "s3:GetBucketVersioning", + "s3:GetLifecycleConfiguration" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### 注意事项 + +1. 占位符替换 + + - `` → 你的 S3 Bucket 名称。 + - `` → 你的 AWS 账号 ID(12 位数字)。 + +2. 最小权限原则 + + - 如果只做查询,不要授予写权限。 + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md index 406eebc7d6706..a700c369846ca 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md @@ -74,7 +74,7 @@ - `timeout`:导出作业的超时时间,默认为 2 小时,单位是秒。 - - `compress_type`:(自 2.1.5 支持) 当指定导出的文件格式为 Parquet / ORC 文件时,可以指定 Parquet / ORC 文件使用的压缩方式。Parquet 文件格式可指定压缩方式为 SNAPPY,GZIP,BROTLI,ZSTD,LZ4 及 PLAIN,默认值为 SNAPPY。ORC 文件格式可指定压缩方式为 PLAIN,SNAPPY,ZLIB 以及 ZSTD,默认值为 ZLIB。该参数自 2.1.5 版本开始支持。(PLAIN 就是不采用压缩) + - `compress_type`:(自 2.1.5 支持) 当指定导出的文件格式为 Parquet / ORC 文件时,可以指定 Parquet / ORC 文件使用的压缩方式。Parquet 文件格式可指定压缩方式为 SNAPPY,GZIP,BROTLI,ZSTD,LZ4 及 PLAIN,默认值为 SNAPPY。ORC 文件格式可指定压缩方式为 PLAIN,SNAPPY,ZLIB 以及 ZSTD,默认值为 ZLIB。该参数自 2.1.5 版本开始支持。(PLAIN 就是不采用压缩)。自 3.1.1 版本开始,支持对 CSV 格式指定压缩算法,目前支持 "plain", "gz", "bz2", "snappyblock", "lz4block", "zstd"。 :::caution 注意 要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md index 093a6b5bafc6c..e134c05f88a5c 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md @@ -61,7 +61,7 @@ INTO OUTFILE "" - `max_file_size`: 单个文件大小限制,如果结果超过这个值,将切割成多个文件,`max_file_size` 取值范围是[5MB, 2GB], 默认为 `1GB`。(当指定导出为 OCR 文件格式时,实际切分文件的大小将是 64MB 的倍数,如:指定 `max_file_size = 5MB`, 实际将以 64 MB 为切分;指定 `max_file_size = 65MB`, 实际将以 128 MB 为切分) - `delete_existing_files`: 默认为 `false`,若指定为 `true`,则会先删除 `file_path` 指定的目录下的所有文件,然后导出数据到该目录下。例如:"file_path" = "/user/tmp", 则会删除"/user/"下所有文件及目录;"file_path" = "/user/tmp/", 则会删除"/user/tmp/"下所有文件及目录。 - `file_suffix`: 指定导出文件的后缀,若不指定该参数,将使用文件格式的默认后缀。 -- `compress_type`:当指定导出的文件格式为 Parquet / ORC 文件时,可以指定 Parquet / ORC 文件使用的压缩方式。Parquet 文件格式可指定压缩方式为 SNAPPY,GZIP,BROTLI,ZSTD,LZ4 及 PLAIN,默认值为 SNAPPY。ORC 文件格式可指定压缩方式为 PLAIN,SNAPPY,ZLIB 以及 ZSTD,默认值为 ZLIB。该参数自 2.1.5 版本开始支持。(PLAIN 就是不采用压缩) +- `compress_type`:当指定导出的文件格式为 Parquet / ORC 文件时,可以指定 Parquet / ORC 文件使用的压缩方式。Parquet 文件格式可指定压缩方式为 SNAPPY,GZIP,BROTLI,ZSTD,LZ4 及 PLAIN,默认值为 SNAPPY。ORC 文件格式可指定压缩方式为 PLAIN,SNAPPY,ZLIB 以及 
ZSTD,默认值为 ZLIB。该参数自 2.1.5 版本开始支持。(PLAIN 就是不采用压缩)。自 3.1.1 版本开始,支持对 CSV 格式指定压缩算法,目前支持 "plain", "gz", "bz2", "snappyblock", "lz4block", "zstd"。 **Broker 相关属性** _(需加前缀 `broker.`)_ - `broker.name: broker`: 名称 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/catalogs/hive-catalog.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/catalogs/hive-catalog.mdx index b62672f038cbb..1417dac05a32f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/catalogs/hive-catalog.mdx +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/catalogs/hive-catalog.mdx @@ -220,6 +220,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+) ```sql CREATE CATALOG hive_hms_on_s3_iamrole PROPERTIES ( @@ -483,6 +484,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'glue.secret_key' = '' ); ``` + Glue 服务的认证信息和 S3 的认证信息不一致时,可以通过以下方式单独指定 S3 的认证信息。 ```sql CREATE CATALOG hive_glue_on_s3_catalog PROPERTIES ( @@ -498,6 +500,17 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+ 支持) + ```sql + CREATE CATALOG `glue_hive_iamrole` PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/catalogs/iceberg-catalog.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/catalogs/iceberg-catalog.mdx index 8cdfae5db59d4..eeb1e3c4e1658 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/catalogs/iceberg-catalog.mdx +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/catalogs/iceberg-catalog.mdx @@ -149,6 +149,42 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( > > 可以在 `DESCRIBE table_name` 语句中的 Extra 列查看源类型是否带时区信息。如显示 `WITH_TIMEZONE`,则表示源类型是带时区的类型。(该功能自 3.1.0 版本支持)。 +## Namespace 映射 + +Iceberg 的元数层级关系是 Catalog -> Namespace -> Table。其中 Namespace 可以有多级(Nested Namespace)。 + +``` + ┌─────────┐ + │ Catalog │ + └────┬────┘ + │ + ┌─────┴─────┐ + ┌──▼──┐ ┌──▼──┐ + │ NS1 │ │ NS2 │ + └──┬──┘ └──┬──┘ + │ │ +┌────▼───┐ ┌──▼──┐ +│ Table1 │ │ NS3 │ +└────────┘ └──┬──┘ + │ + ┌──────┴───────┐ + ┌────▼───┐ ┌────▼───┐ + │ Table2 │ │ Table3 │ + └────────┘ └────────┘ +``` + +自 3.1.2 版本开始,对于 Iceberg Rest Catalog,Doris 支持对 Nested Namespace 的映射。 + +在上述示例中表,会按照如下逻辑映射为 Doris 的元数据: + +| Catalog | Database | Table | +| --- | --- | --- | +| Catalog | NS1 | Table1 | +| Catalog | NS2.NS3 | Table2 | +| Catalog | NS1.NS3 | Table3 | + +对 Nested Namespace 的支持需要显式开启,具体请参阅 [Iceberg Rest Catalog](../metastores/iceberg-rest.md) + ## 基础示例 ### Hive Metastore @@ -481,6 +517,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'glue.secret_key' = '' ); ``` + Glue 服务的认证信息和 S3 的认证信息不一致时,可以通过以下方式单独指定 S3 的认证信息。 ```sql CREATE CATALOG `iceberg_glue_on_s3_catalog_` PROPERTIES ( @@ -497,6 +534,18 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+ 支持) + ```sql + CREATE CATALOG `glue_iceberg_iamrole` PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'glue', + 'warehouse' = 's3://bucket/warehouse', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/metastores/aws-glue.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/metastores/aws-glue.md index 5ecbb0fe7e586..33f5df619c9d6 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/metastores/aws-glue.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/metastores/aws-glue.md @@ -19,23 +19,71 @@ AWS Glue Catalog 当前支持三种类型的 Catalog: 本说明文档分别对这写类型的参数进行详细介绍,便于用户配置。 -## Hive Glue Catalog +## 通用参数总览 +| 参数名称 | 描述 | 是否必须 | 默认值 | +|---------------------------|-------------------------------------------------------------|------|--------| +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 是 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue(自 3.1.2+ 支持) | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue(自 3.1.2+ 支持) | 否 | 空 | + +### 认证参数 + +访问 Glue 需要认证信息,支持以下两种方式: + +1. Access Key 认证 + + 通过 `glue.access_key` 和 `glue.secret_key` 提供的 Access Key 认证访问 Glue。 + +2. IAM Role 认证(自 3.1.2+ 起支持) + + 通过 `glue.role_arn` 提供的 IAM Role 认证访问 Glue。 + + 该方式需要 Doris 部署在 AWS EC2 上,并且 EC2 实例需要绑定一个 IAM Role,且该 Role 需要有访问 Glue 的权限。 + + 如果需要通过 External ID 进行访问,需要同时配置 `glue.external_id`。 + +注意事项: + +- 两种方式必须至少配置一种,如果同时配置了两种方式,则优先使用 AccessKey 认证。 + +示例: + + ```sql + CREATE CATALOG hive_glue_catalog PRPPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + -- 使用 Access Key 认证 + 'glue.access_key' = '', + 'glue.secret_key' = '' + -- 或者使用 IAM Role 认证 + -- 'glue.role_arn' = '', + -- 'glue.external_id' = '' + ); + ``` + +### Hive Glue Catalog Hive Glue Catalog 用于访问 Hive 表,通过 AWS Glue 的 Hive Metastore 兼容接口访问 Glue。配置如下: | 参数名称 | 描述 | 是否必须 | 默认值 | -|---------------------------|-----------------------------------------------------------|----------|--------| -| `type` | 固定为 `hms` | 是 | 无 | -| `hive.metastore.type` | 固定为 `glue` | 是 | 无 | -| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | -| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | -| `glue.access_key` | AWS Access Key ID | 是 | 空 | -| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | -| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | -| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | -| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | - -### 示例 +|---------------------------|-----------------------------------------------------------|------|--------| +| `type` | 固定为 `hms` | 是 | 无 | +| `hive.metastore.type` | 固定为 `glue` | 是 | 无 | +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 否 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 否 | 空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue | 否 | 空 | + +#### 示例 ```sql CREATE CATALOG hive_glue_catalog PROPERTIES ( @@ -48,24 +96,24 @@ CREATE CATALOG hive_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Catalog +### Iceberg Glue Catalog Iceberg Glue Catalog 通过 Glue Client 访问 Glue。配置如下: | 参数名称 | 描述 | 是否必须 | 默认值 | 
-|-------------------------|--------------------------------------------------------------|----------|------------| -| `type` | 固定为 `iceberg` | 是 | 无 | -| `iceberg.catalog.type` | 固定为 `glue` | 是 | 无 | -| `warehouse` | Iceberg 数据仓库路径,例如:`s3://my-bucket/iceberg-warehouse/` | 是 | s3://doris | -| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | -| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | -| `glue.access_key` | AWS Access Key ID | 是 | 空 | -| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | -| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | -| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | -| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | - -### 示例 +|-------------------------|--------------------------------------------------------------|------|------------| +| `type` | 固定为 `iceberg` | 是 | 无 | +| `iceberg.catalog.type` | 固定为 `glue` | 是 | 无 | +| `warehouse` | Iceberg 数据仓库路径,例如:`s3://my-bucket/iceberg-warehouse/` | 是 | s3://doris | +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 否 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 否 | 空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | + +#### 示例 ```sql CREATE CATALOG iceberg_glue_catalog PROPERTIES ( @@ -78,7 +126,7 @@ CREATE CATALOG iceberg_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Rest Catalog +### Iceberg Glue Rest Catalog Iceberg Glue Rest Catalog 通过 Glue Rest Catalog 接口访问 Glue。目前仅支持存储在 AWS S3 Table Bucket 中的 Iceberg 表。配置如下: @@ -94,7 +142,7 @@ Iceberg Glue Rest Catalog 通过 Glue Rest Catalog 接口访问 Glue。目前仅 | `iceberg.rest.secret-access-key` | 访问 Glue 的 Secret Key(同时也用于访问 S3 Bucket) | 是 | 空 | | `iceberg.rest.signing-region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 空 | -### 示例 +#### 示例 ```sql CREATE CATALOG glue_s3 PROPERTIES ( @@ -109,3 +157,89 @@ CREATE CATALOG glue_s3 PROPERTIES ( 'iceberg.rest.signing-region' = '' ); ``` + + +## 权限策略 + +根据使用场景不同,可以分为 **只读** 和 **读写** 两类策略。 + +### 1. 只读权限 + +只允许读取 Glue Catalog 的数据库和表信息。 + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadOnly", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 2. 读写权限 + +在只读的基础上,允许创建 / 修改 / 删除数据库和表。 + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadWrite", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions", + "glue:CreateDatabase", + "glue:UpdateDatabase", + "glue:DeleteDatabase", + "glue:CreateTable", + "glue:UpdateTable", + "glue:DeleteTable" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 注意事项 + +1. 占位符替换 + + - `` → 你的 AWS 区域(如 `us-east-1`)。 + - `` → 你的 AWS 账号 ID(12 位数字)。 + +2. 最小权限原则 + + - 如果只做查询,不要授予写权限。 + - 可以替换 `*` 为具体数据库、表 ARN,进一步收紧权限。 + +3. 
S3 权限 + + - 上述策略只涉及 Glue Catalog + - 如果需要读取数据文件,还需额外授予 S3 权限(如 `s3:GetObject`, `s3:ListBucket` 等)。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/metastores/iceberg-rest.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/metastores/iceberg-rest.md index 446a0720558b1..ec66765829af0 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/metastores/iceberg-rest.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/metastores/iceberg-rest.md @@ -19,6 +19,7 @@ | iceberg.rest.oauth2.credential | | `oauth2` 凭证,用于访问 `server-uri` 获取 token | - | 否 | | iceberg.rest.oauth2.server-uri | | 用于获取 `oauth2` token 的 uri 地址,配合 `iceberg.rest.oauth2.credential` 使用 | - | 否 | | iceberg.rest.vended-credentials-enabled | | 是否启用 `vended-credentials` 功能。启用后,会同 rest 服务端获取访问存储系统的凭证信息,如 `access-key` 和 `secret-key`,不再需要手动指定。需要 rest 服务端本身支持该能力。| `false` | 否 | +| iceberg.rest.nested-namespace-enabled | | (自 3.1.2+ 版本支持)是否启用对 Nested Namespace 的支持。默认为 `false`。如果为 `true`,则 Nested Namespace 会被打平作为 Database 名称显示,如 `parent_ns.child_ns`。某些 Rest Catalog 服务不支持 Nested Namespace,如 AWS Glue,责改参数需设置为 `false` | 否 | > 注: > @@ -28,6 +29,27 @@ > > 3. AWS Glue Rest Catalog 请参阅 [AWS Glue 文档](./aws-glue.md) +## Nested Namespace + +在 3.1.2 及后续版本中,如需完整访问 Nested Namespace,除了在 Catalog 属性中将 `iceberg.rest.nested-namespace-enabled` 设置为 `true` 外,还需开启如下全局参数: + +``` +SET GLOBAL enable_nested_namespace=true; +``` + +假设 Catalog 为 "ice",Namespace 为 "ns1.ns2",Table 为 "tbl1",可参考如下方式访问 Nested Namespace: + +```sql +mysql> USE ice.ns1.ns2; +mysql> SELECT k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT tbl1.k1 FROM `ns1.ns2`.tbl1; +mysql> SELECT `ns1.ns2`.tbl1.k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT ice.`ns1.ns2`.tbl1.k1 FROM tbl1; +mysql> REFRESH CATALOG ice; +mysql> REFRESH DATABASE ice.`ns1.ns2`; +mysql> REFRESH TABLE ice.`ns1.ns2`.tbl1; +``` + ## 示例配置 - 无认证的 Rest Catalog 服务 @@ -111,6 +133,43 @@ ); ``` +- 连接 Snowflake Open Catalog (自 3.1.2 版本支持) + + ```sql + -- Enable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 'iceberg.rest.vended-credentials-enabled' = 'true', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + + ```sql + -- Disable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 's3.access_key' = '', + 's3.secret_key' = '', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + - 连接 Apache Gravitino Rest Catalog ```sql diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/storages/s3.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/storages/s3.md index b812f0d5f0e72..ceb081d5129eb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/storages/s3.md +++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/lakehouse/storages/s3.md @@ -73,3 +73,82 @@ Amazon S3 Express One Zone(又名 Directory Bucket)提供更高性能,但 "s3.endpoint"="s3express-usw2-az1.us-west-2.amazonaws.com", "s3.region"="us-west-2" ``` + +## 权限策略 + +根据使用场景不同,可以分为 **只读** 和 **读写** 两类策略。 + +### 1. 只读权限 + +只允许读取 S3 中的对象。适用于 LOAD、TVF、查询 EXTERNAL CATALOG 等场景。 + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:GetObjectVersion", + ], + "Resource": "arn:aws:s3:::/your-prefix/*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### 2. 读写权限 + +在只读的基础上,允许删除、创建、修改对象。适用于 EXPORT、OUTFILE 以及 EXTERNAL CATALOG 回写等场景。 + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:PutObject", + "s3:GetObject", + "s3:GetObjectVersion", + "s3:DeleteObject", + "s3:DeleteObjectVersion", + "s3:AbortMultipartUpload", + "s3:ListMultipartUploadParts" + ], + "Resource": "arn:aws:s3::://*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation", + "s3:GetBucketVersioning", + "s3:GetLifecycleConfiguration" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### 注意事项 + +1. 占位符替换 + + - `` → 你的 S3 Bucket 名称。 + - `` → 你的 AWS 账号 ID(12 位数字)。 + +2. 最小权限原则 + + - 如果只做查询,不要授予写权限。 + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/catalogs/hive-catalog.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/catalogs/hive-catalog.mdx index b62672f038cbb..1417dac05a32f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/catalogs/hive-catalog.mdx +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/catalogs/hive-catalog.mdx @@ -220,6 +220,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+) ```sql CREATE CATALOG hive_hms_on_s3_iamrole PROPERTIES ( @@ -483,6 +484,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'glue.secret_key' = '' ); ``` + Glue 服务的认证信息和 S3 的认证信息不一致时,可以通过以下方式单独指定 S3 的认证信息。 ```sql CREATE CATALOG hive_glue_on_s3_catalog PROPERTIES ( @@ -498,6 +500,17 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+ 支持) + ```sql + CREATE CATALOG `glue_hive_iamrole` PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/catalogs/iceberg-catalog.mdx b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/catalogs/iceberg-catalog.mdx index 8cdfae5db59d4..eeb1e3c4e1658 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/catalogs/iceberg-catalog.mdx +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/catalogs/iceberg-catalog.mdx @@ -149,6 +149,42 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( > > 可以在 `DESCRIBE table_name` 语句中的 Extra 列查看源类型是否带时区信息。如显示 `WITH_TIMEZONE`,则表示源类型是带时区的类型。(该功能自 3.1.0 版本支持)。 +## Namespace 映射 + +Iceberg 的元数层级关系是 Catalog -> Namespace -> Table。其中 Namespace 可以有多级(Nested Namespace)。 + +``` + ┌─────────┐ + │ Catalog │ + └────┬────┘ + │ + ┌─────┴─────┐ + ┌──▼──┐ ┌──▼──┐ + │ NS1 │ │ NS2 │ + └──┬──┘ └──┬──┘ + │ │ +┌────▼───┐ ┌──▼──┐ +│ Table1 │ │ NS3 │ 
+└────────┘ └──┬──┘ + │ + ┌──────┴───────┐ + ┌────▼───┐ ┌────▼───┐ + │ Table2 │ │ Table3 │ + └────────┘ └────────┘ +``` + +自 3.1.2 版本开始,对于 Iceberg Rest Catalog,Doris 支持对 Nested Namespace 的映射。 + +在上述示例中表,会按照如下逻辑映射为 Doris 的元数据: + +| Catalog | Database | Table | +| --- | --- | --- | +| Catalog | NS1 | Table1 | +| Catalog | NS2.NS3 | Table2 | +| Catalog | NS1.NS3 | Table3 | + +对 Nested Namespace 的支持需要显式开启,具体请参阅 [Iceberg Rest Catalog](../metastores/iceberg-rest.md) + ## 基础示例 ### Hive Metastore @@ -481,6 +517,7 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 'glue.secret_key' = '' ); ``` + Glue 服务的认证信息和 S3 的认证信息不一致时,可以通过以下方式单独指定 S3 的认证信息。 ```sql CREATE CATALOG `iceberg_glue_on_s3_catalog_` PROPERTIES ( @@ -497,6 +534,18 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES ( 's3.secret_key' = '' ); ``` + + 使用 IAM Assumed Role 的方式获取 S3 访问凭证 (3.1.2+ 支持) + ```sql + CREATE CATALOG `glue_iceberg_iamrole` PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'glue', + 'warehouse' = 's3://bucket/warehouse', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/metastores/aws-glue.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/metastores/aws-glue.md index 5ecbb0fe7e586..33f5df619c9d6 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/metastores/aws-glue.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/metastores/aws-glue.md @@ -19,23 +19,71 @@ AWS Glue Catalog 当前支持三种类型的 Catalog: 本说明文档分别对这写类型的参数进行详细介绍,便于用户配置。 -## Hive Glue Catalog +## 通用参数总览 +| 参数名称 | 描述 | 是否必须 | 默认值 | +|---------------------------|-------------------------------------------------------------|------|--------| +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 是 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue(自 3.1.2+ 支持) | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue(自 3.1.2+ 支持) | 否 | 空 | + +### 认证参数 + +访问 Glue 需要认证信息,支持以下两种方式: + +1. Access Key 认证 + + 通过 `glue.access_key` 和 `glue.secret_key` 提供的 Access Key 认证访问 Glue。 + +2. 
IAM Role 认证(自 3.1.2+ 起支持) + + 通过 `glue.role_arn` 提供的 IAM Role 认证访问 Glue。 + + 该方式需要 Doris 部署在 AWS EC2 上,并且 EC2 实例需要绑定一个 IAM Role,且该 Role 需要有访问 Glue 的权限。 + + 如果需要通过 External ID 进行访问,需要同时配置 `glue.external_id`。 + +注意事项: + +- 两种方式必须至少配置一种,如果同时配置了两种方式,则优先使用 AccessKey 认证。 + +示例: + + ```sql + CREATE CATALOG hive_glue_catalog PRPPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + -- 使用 Access Key 认证 + 'glue.access_key' = '', + 'glue.secret_key' = '' + -- 或者使用 IAM Role 认证 + -- 'glue.role_arn' = '', + -- 'glue.external_id' = '' + ); + ``` + +### Hive Glue Catalog Hive Glue Catalog 用于访问 Hive 表,通过 AWS Glue 的 Hive Metastore 兼容接口访问 Glue。配置如下: | 参数名称 | 描述 | 是否必须 | 默认值 | -|---------------------------|-----------------------------------------------------------|----------|--------| -| `type` | 固定为 `hms` | 是 | 无 | -| `hive.metastore.type` | 固定为 `glue` | 是 | 无 | -| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | -| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | -| `glue.access_key` | AWS Access Key ID | 是 | 空 | -| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | -| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | -| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | -| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | - -### 示例 +|---------------------------|-----------------------------------------------------------|------|--------| +| `type` | 固定为 `hms` | 是 | 无 | +| `hive.metastore.type` | 固定为 `glue` | 是 | 无 | +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 否 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 否 | 空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue | 否 | 空 | + +#### 示例 ```sql CREATE CATALOG hive_glue_catalog PROPERTIES ( @@ -48,24 +96,24 @@ CREATE CATALOG hive_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Catalog +### Iceberg Glue Catalog Iceberg Glue Catalog 通过 Glue Client 访问 Glue。配置如下: | 参数名称 | 描述 | 是否必须 | 默认值 | -|-------------------------|--------------------------------------------------------------|----------|------------| -| `type` | 固定为 `iceberg` | 是 | 无 | -| `iceberg.catalog.type` | 固定为 `glue` | 是 | 无 | -| `warehouse` | Iceberg 数据仓库路径,例如:`s3://my-bucket/iceberg-warehouse/` | 是 | s3://doris | -| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | -| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | -| `glue.access_key` | AWS Access Key ID | 是 | 空 | -| `glue.secret_key` | AWS Secret Access Key | 是 | 空 | -| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | -| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | -| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | - -### 示例 +|-------------------------|--------------------------------------------------------------|------|------------| +| `type` | 固定为 `iceberg` | 是 | 无 | +| `iceberg.catalog.type` | 固定为 `glue` | 是 | 无 | +| `warehouse` | Iceberg 数据仓库路径,例如:`s3://my-bucket/iceberg-warehouse/` | 是 | s3://doris | +| `glue.region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 无 | +| `glue.endpoint` | AWS Glue endpoint,例如:`https://glue.us-east-1.amazonaws.com` | 是 | 无 | +| `glue.access_key` | AWS Access Key ID | 否 | 空 | +| `glue.secret_key` | AWS Secret Access Key | 否 | 
空 | +| `glue.catalog_id` | Glue Catalog ID(暂未支持) | 否 | 空 | +| `glue.role_arn` | IAM Role ARN,用于访问 Glue(暂未支持) | 否 | 空 | +| `glue.external_id` | IAM External ID,用于访问 Glue(暂未支持) | 否 | 空 | + +#### 示例 ```sql CREATE CATALOG iceberg_glue_catalog PROPERTIES ( @@ -78,7 +126,7 @@ CREATE CATALOG iceberg_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Rest Catalog +### Iceberg Glue Rest Catalog Iceberg Glue Rest Catalog 通过 Glue Rest Catalog 接口访问 Glue。目前仅支持存储在 AWS S3 Table Bucket 中的 Iceberg 表。配置如下: @@ -94,7 +142,7 @@ Iceberg Glue Rest Catalog 通过 Glue Rest Catalog 接口访问 Glue。目前仅 | `iceberg.rest.secret-access-key` | 访问 Glue 的 Secret Key(同时也用于访问 S3 Bucket) | 是 | 空 | | `iceberg.rest.signing-region` | AWS Glue 所在区域,例如:`us-east-1` | 是 | 空 | -### 示例 +#### 示例 ```sql CREATE CATALOG glue_s3 PROPERTIES ( @@ -109,3 +157,89 @@ CREATE CATALOG glue_s3 PROPERTIES ( 'iceberg.rest.signing-region' = '' ); ``` + + +## 权限策略 + +根据使用场景不同,可以分为 **只读** 和 **读写** 两类策略。 + +### 1. 只读权限 + +只允许读取 Glue Catalog 的数据库和表信息。 + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadOnly", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 2. 读写权限 + +在只读的基础上,允许创建 / 修改 / 删除数据库和表。 + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadWrite", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions", + "glue:CreateDatabase", + "glue:UpdateDatabase", + "glue:DeleteDatabase", + "glue:CreateTable", + "glue:UpdateTable", + "glue:DeleteTable" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 注意事项 + +1. 占位符替换 + + - `` → 你的 AWS 区域(如 `us-east-1`)。 + - `` → 你的 AWS 账号 ID(12 位数字)。 + +2. 最小权限原则 + + - 如果只做查询,不要授予写权限。 + - 可以替换 `*` 为具体数据库、表 ARN,进一步收紧权限。 + +3. S3 权限 + + - 上述策略只涉及 Glue Catalog + - 如果需要读取数据文件,还需额外授予 S3 权限(如 `s3:GetObject`, `s3:ListBucket` 等)。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/metastores/iceberg-rest.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/metastores/iceberg-rest.md index 446a0720558b1..ec66765829af0 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/metastores/iceberg-rest.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/metastores/iceberg-rest.md @@ -19,6 +19,7 @@ | iceberg.rest.oauth2.credential | | `oauth2` 凭证,用于访问 `server-uri` 获取 token | - | 否 | | iceberg.rest.oauth2.server-uri | | 用于获取 `oauth2` token 的 uri 地址,配合 `iceberg.rest.oauth2.credential` 使用 | - | 否 | | iceberg.rest.vended-credentials-enabled | | 是否启用 `vended-credentials` 功能。启用后,会同 rest 服务端获取访问存储系统的凭证信息,如 `access-key` 和 `secret-key`,不再需要手动指定。需要 rest 服务端本身支持该能力。| `false` | 否 | +| iceberg.rest.nested-namespace-enabled | | (自 3.1.2+ 版本支持)是否启用对 Nested Namespace 的支持。默认为 `false`。如果为 `true`,则 Nested Namespace 会被打平作为 Database 名称显示,如 `parent_ns.child_ns`。某些 Rest Catalog 服务不支持 Nested Namespace,如 AWS Glue,责改参数需设置为 `false` | 否 | > 注: > @@ -28,6 +29,27 @@ > > 3. 
AWS Glue Rest Catalog 请参阅 [AWS Glue 文档](./aws-glue.md) +## Nested Namespace + +在 3.1.2 及后续版本中,如需完整访问 Nested Namespace,除了在 Catalog 属性中将 `iceberg.rest.nested-namespace-enabled` 设置为 `true` 外,还需开启如下全局参数: + +``` +SET GLOBAL enable_nested_namespace=true; +``` + +假设 Catalog 为 "ice",Namespace 为 "ns1.ns2",Table 为 "tbl1",可参考如下方式访问 Nested Namespace: + +```sql +mysql> USE ice.ns1.ns2; +mysql> SELECT k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT tbl1.k1 FROM `ns1.ns2`.tbl1; +mysql> SELECT `ns1.ns2`.tbl1.k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT ice.`ns1.ns2`.tbl1.k1 FROM tbl1; +mysql> REFRESH CATALOG ice; +mysql> REFRESH DATABASE ice.`ns1.ns2`; +mysql> REFRESH TABLE ice.`ns1.ns2`.tbl1; +``` + ## 示例配置 - 无认证的 Rest Catalog 服务 @@ -111,6 +133,43 @@ ); ``` +- 连接 Snowflake Open Catalog (自 3.1.2 版本支持) + + ```sql + -- Enable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 'iceberg.rest.vended-credentials-enabled' = 'true', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + + ```sql + -- Disable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 's3.access_key' = '', + 's3.secret_key' = '', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + - 连接 Apache Gravitino Rest Catalog ```sql diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/storages/s3.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/storages/s3.md index b812f0d5f0e72..ceb081d5129eb 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/storages/s3.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/lakehouse/storages/s3.md @@ -73,3 +73,82 @@ Amazon S3 Express One Zone(又名 Directory Bucket)提供更高性能,但 "s3.endpoint"="s3express-usw2-az1.us-west-2.amazonaws.com", "s3.region"="us-west-2" ``` + +## 权限策略 + +根据使用场景不同,可以分为 **只读** 和 **读写** 两类策略。 + +### 1. 只读权限 + +只允许读取 S3 中的对象。适用于 LOAD、TVF、查询 EXTERNAL CATALOG 等场景。 + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:GetObjectVersion", + ], + "Resource": "arn:aws:s3:::/your-prefix/*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### 2. 
读写权限 + +在只读的基础上,允许删除、创建、修改对象。适用于 EXPORT、OUTFILE 以及 EXTERNAL CATALOG 回写等场景。 + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:PutObject", + "s3:GetObject", + "s3:GetObjectVersion", + "s3:DeleteObject", + "s3:DeleteObjectVersion", + "s3:AbortMultipartUpload", + "s3:ListMultipartUploadParts" + ], + "Resource": "arn:aws:s3::://*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation", + "s3:GetBucketVersioning", + "s3:GetLifecycleConfiguration" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### 注意事项 + +1. 占位符替换 + + - `` → 你的 S3 Bucket 名称。 + - `` → 你的 AWS 账号 ID(12 位数字)。 + +2. 最小权限原则 + + - 如果只做查询,不要授予写权限。 + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md index 406eebc7d6706..a700c369846ca 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md @@ -74,7 +74,7 @@ - `timeout`:导出作业的超时时间,默认为 2 小时,单位是秒。 - - `compress_type`:(自 2.1.5 支持) 当指定导出的文件格式为 Parquet / ORC 文件时,可以指定 Parquet / ORC 文件使用的压缩方式。Parquet 文件格式可指定压缩方式为 SNAPPY,GZIP,BROTLI,ZSTD,LZ4 及 PLAIN,默认值为 SNAPPY。ORC 文件格式可指定压缩方式为 PLAIN,SNAPPY,ZLIB 以及 ZSTD,默认值为 ZLIB。该参数自 2.1.5 版本开始支持。(PLAIN 就是不采用压缩) + - `compress_type`:(自 2.1.5 支持) 当指定导出的文件格式为 Parquet / ORC 文件时,可以指定 Parquet / ORC 文件使用的压缩方式。Parquet 文件格式可指定压缩方式为 SNAPPY,GZIP,BROTLI,ZSTD,LZ4 及 PLAIN,默认值为 SNAPPY。ORC 文件格式可指定压缩方式为 PLAIN,SNAPPY,ZLIB 以及 ZSTD,默认值为 ZLIB。该参数自 2.1.5 版本开始支持。(PLAIN 就是不采用压缩)。自 3.1.1 版本开始,支持对 CSV 格式指定压缩算法,目前支持 "plain", "gz", "bz2", "snappyblock", "lz4block", "zstd"。 :::caution 注意 要使用 delete_existing_files 参数,还需要在 fe.conf 中添加配置`enable_delete_existing_files = true`并重启 fe,此时 delete_existing_files 才会生效。delete_existing_files = true 是一个危险的操作,建议只在测试环境中使用。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md index 093a6b5bafc6c..e134c05f88a5c 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md @@ -61,7 +61,7 @@ INTO OUTFILE "" - `max_file_size`: 单个文件大小限制,如果结果超过这个值,将切割成多个文件,`max_file_size` 取值范围是[5MB, 2GB], 默认为 `1GB`。(当指定导出为 OCR 文件格式时,实际切分文件的大小将是 64MB 的倍数,如:指定 `max_file_size = 5MB`, 实际将以 64 MB 为切分;指定 `max_file_size = 65MB`, 实际将以 128 MB 为切分) - `delete_existing_files`: 默认为 `false`,若指定为 `true`,则会先删除 `file_path` 指定的目录下的所有文件,然后导出数据到该目录下。例如:"file_path" = "/user/tmp", 则会删除"/user/"下所有文件及目录;"file_path" = "/user/tmp/", 则会删除"/user/tmp/"下所有文件及目录。 - `file_suffix`: 指定导出文件的后缀,若不指定该参数,将使用文件格式的默认后缀。 -- `compress_type`:当指定导出的文件格式为 Parquet / ORC 文件时,可以指定 Parquet / ORC 文件使用的压缩方式。Parquet 文件格式可指定压缩方式为 SNAPPY,GZIP,BROTLI,ZSTD,LZ4 及 PLAIN,默认值为 SNAPPY。ORC 文件格式可指定压缩方式为 PLAIN,SNAPPY,ZLIB 以及 ZSTD,默认值为 ZLIB。该参数自 2.1.5 版本开始支持。(PLAIN 就是不采用压缩) +- `compress_type`:当指定导出的文件格式为 Parquet / ORC 文件时,可以指定 Parquet / ORC 文件使用的压缩方式。Parquet 文件格式可指定压缩方式为 SNAPPY,GZIP,BROTLI,ZSTD,LZ4 及 PLAIN,默认值为 SNAPPY。ORC 
文件格式可指定压缩方式为 PLAIN,SNAPPY,ZLIB 以及 ZSTD,默认值为 ZLIB。该参数自 2.1.5 版本开始支持。(PLAIN 就是不采用压缩)。自 3.1.1 版本开始,支持对 CSV 格式指定压缩算法,目前支持 "plain", "gz", "bz2", "snappyblock", "lz4block", "zstd"。 **Broker 相关属性** _(需加前缀 `broker.`)_ - `broker.name: broker`: 名称 diff --git a/versioned_docs/version-2.1/lakehouse/catalogs/hive-catalog.mdx b/versioned_docs/version-2.1/lakehouse/catalogs/hive-catalog.mdx index 46205b43285db..bd2d396f6730d 100644 --- a/versioned_docs/version-2.1/lakehouse/catalogs/hive-catalog.mdx +++ b/versioned_docs/version-2.1/lakehouse/catalogs/hive-catalog.mdx @@ -474,6 +474,7 @@ Hive transactional tables are supported from version 3.x onwards. For details, r 'glue.secret_key' = '' ); ``` + When Glue service authentication information differs from S3 authentication information, you can specify S3 authentication information separately in the following way. ```sql CREATE CATALOG hive_glue_on_s3_catalog PROPERTIES ( @@ -489,6 +490,16 @@ Hive transactional tables are supported from version 3.x onwards. For details, r 's3.secret_key' = '' ); ``` + + Using IAM Assumed Role to obtain S3 access credentials (Since 3.1.2+) + ```sql + CREATE CATALOG `glue_hive_iamrole` PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); diff --git a/versioned_docs/version-2.1/lakehouse/catalogs/iceberg-catalog.mdx b/versioned_docs/version-2.1/lakehouse/catalogs/iceberg-catalog.mdx index e80dc1f510137..0f1b5bd0b6aee 100644 --- a/versioned_docs/version-2.1/lakehouse/catalogs/iceberg-catalog.mdx +++ b/versioned_docs/version-2.1/lakehouse/catalogs/iceberg-catalog.mdx @@ -137,6 +137,43 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher > > You can check whether the source type has timezone information in the Extra column of the `DESCRIBE table_name` statement. If it shows `WITH_TIMEZONE`, it indicates that the source type is a timezone-aware type. (Supported since 3.1.0). +## Namespace Mapping + +Iceberg's metadata hierarchy is Catalog -> Namespace -> Table. Namespace can have multiple levels (Nested Namespace). + +``` + ┌─────────┐ + │ Catalog │ + └────┬────┘ + │ + ┌─────┴─────┐ + ┌──▼──┐ ┌──▼──┐ + │ NS1 │ │ NS2 │ + └──┬──┘ └──┬──┘ + │ │ +┌────▼───┐ ┌──▼──┐ +│ Table1 │ │ NS3 │ +└────────┘ └──┬──┘ + │ + ┌──────┴───────┐ + ┌────▼───┐ ┌────▼───┐ + │ Table2 │ │ Table3 │ + └────────┘ └────────┘ +``` + + +Starting from version 3.1.2, for Iceberg Rest Catalog, Doris supports mapping of Nested Namespace. + +In the above example, tables will be mapped to Doris metadata according to the following logic: + +| Catalog | Database | Table | +| --- | --- | --- | +| Catalog | NS1 | Table1 | +| Catalog | NS2.NS3 | Table2 | +| Catalog | NS2.NS3 | Table3 | + +Support for Nested Namespace needs to be explicitly enabled. For details, please refer to [Iceberg Rest Catalog](../metastores/iceberg-rest.md) + ## Examples ### Hive Metastore @@ -469,6 +506,7 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher 'glue.secret_key' = '' ); ``` + When Glue service authentication credentials differ from S3 authentication credentials, you can specify S3 authentication credentials separately using the following method. 
```sql CREATE CATALOG `iceberg_glue_on_s3_catalog_` PROPERTIES ( @@ -485,6 +523,18 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher 's3.secret_key' = '' ); ``` + + Using IAM Assumed Role to obtain S3 access credentials (Since 3.1.2+) + ```sql + CREATE CATALOG `glue_iceberg_iamrole` PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'glue', + 'warehouse' = 's3://bucket/warehouse', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git a/versioned_docs/version-2.1/lakehouse/metastores/aws-glue.md b/versioned_docs/version-2.1/lakehouse/metastores/aws-glue.md index be5f6fa868e6a..fba9332b51f64 100644 --- a/versioned_docs/version-2.1/lakehouse/metastores/aws-glue.md +++ b/versioned_docs/version-2.1/lakehouse/metastores/aws-glue.md @@ -11,31 +11,79 @@ This document describes the parameter configuration when using **AWS Glue Catalo AWS Glue Catalog currently supports three types of Catalogs: -| Catalog Type | Type Identifier (`type`) | Description | -|--------------|-------------------------|----------------------------------------------------| -| Hive | glue | Catalog for connecting to Hive Metastore | -| Iceberg | glue | Catalog for connecting to Iceberg table format | -| Iceberg | rest | Catalog for connecting to Iceberg via Glue Rest | +| Catalog Type | Type Identifier (`type`) | Description | +|-------------|-------------------------|------------------------------------------------| +| Hive | glue | Catalog for connecting to Hive Metastore | +| Iceberg | glue | Catalog for connecting to Iceberg table format | +| Iceberg | rest | Catalog for connecting to Iceberg table format via Glue Rest Catalog | -This document provides detailed descriptions of the parameters for each type to help users with configuration. +This documentation provides detailed parameter descriptions for each type to facilitate user configuration. -## Hive Glue Catalog +## Common Parameters Overview +| Parameter Name | Description | Required | Default Value | +|--------------------------|---------------------------------------------------------------|----------|---------------| +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | Yes | Empty | +| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue (supported since 3.1.2+) | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue (supported since 3.1.2+) | No | Empty | -Hive Glue Catalog is used to access Hive tables through AWS Glue's Hive Metastore compatible interface. 
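Regardless of which of the three catalog types described below is used, a created catalog is queried with ordinary Doris statements. The following is a minimal usage sketch, assuming a catalog named `hive_glue_catalog` that already contains a database `sales_db` with a table `orders` (these names are illustrative only, not part of the configuration above):

```sql
-- Switch into the external catalog and browse its databases
SWITCH hive_glue_catalog;
SHOW DATABASES;

-- Or query with a fully qualified name, without switching
SELECT count(*) FROM hive_glue_catalog.sales_db.orders;

-- Pick up metadata changes made on the Glue side
REFRESH CATALOG hive_glue_catalog;
```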
Configuration parameters are as follows: +### Authentication Parameters -| Parameter Name | Description | Required | Default Value | -|---------------------------|----------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `hms` | Yes | None | -| `hive.metastore.type` | Fixed value `glue` | Yes | None | -| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | -| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | -| `glue.access_key` | AWS Access Key ID | Yes | Empty | -| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | -| `glue.catalog_id` | Glue Catalog ID (not yet supported) | No | Empty | -| `glue.role_arn` | IAM Role ARN for accessing Glue (not yet supported) | No | Empty | -| `glue.external_id` | IAM External ID for accessing Glue (not yet supported) | No | Empty | +Accessing Glue requires authentication information, supporting the following two methods: -### Example +1. Access Key Authentication + + Authenticate access to Glue through Access Key provided by `glue.access_key` and `glue.secret_key`. + +2. IAM Role Authentication (supported since 3.1.2+) + + Authenticate access to Glue through IAM Role provided by `glue.role_arn`. + + This method requires Doris to be deployed on AWS EC2, and the EC2 instance needs to be bound to an IAM Role that has permission to access Glue. + + If access through External ID is required, you need to configure `glue.external_id` as well. + +Notes: + +- At least one of the two methods must be configured. If both methods are configured, Access Key authentication takes priority. + +Example: + + ```sql + CREATE CATALOG hive_glue_catalog PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + -- Using Access Key authentication + 'glue.access_key' = '', + 'glue.secret_key' = '' + -- Or using IAM Role authentication + -- 'glue.role_arn' = '', + -- 'glue.external_id' = '' + ); + ``` + +### Hive Glue Catalog + +Hive Glue Catalog is used to access Hive tables through AWS Glue's Hive Metastore compatible interface. Configuration as follows: + +| Parameter Name | Description | Required | Default Value | +|--------------------------|---------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `hms` | Yes | None | +| `hive.metastore.type` | Fixed as `glue` | Yes | None | +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | No | Empty | +| `glue.secret_key` | AWS Secret Access Key | No | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue | No | Empty | + +#### Example ```sql CREATE CATALOG hive_glue_catalog PROPERTIES ( @@ -48,24 +96,24 @@ CREATE CATALOG hive_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Catalog +### Iceberg Glue Catalog -Iceberg Glue Catalog accesses Glue through the Glue Client. Configuration parameters are as follows: +Iceberg Glue Catalog accesses Glue through Glue Client. 
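Tables exposed through an Iceberg Glue Catalog behave like other Iceberg tables in Doris, so in addition to plain queries they can generally be read at an earlier snapshot. A brief sketch, assuming a catalog created as in the example below and an existing table `db1.tbl1`; the database, table, timestamp, and snapshot id here are placeholders:

```sql
-- Read the current state of the table
SELECT * FROM iceberg_glue_catalog.db1.tbl1 LIMIT 10;

-- Time travel, as documented for Iceberg tables in Doris
SELECT * FROM iceberg_glue_catalog.db1.tbl1 FOR TIME AS OF '2024-01-01 00:00:00';
SELECT * FROM iceberg_glue_catalog.db1.tbl1 FOR VERSION AS OF 123456789;
```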
Configuration as follows: -| Parameter Name | Description | Required | Default Value | -|-------------------------|-----------------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `iceberg` | Yes | None | -| `iceberg.catalog.type` | Fixed value `glue` | Yes | None | -| `warehouse` | Iceberg warehouse path, e.g., `s3://my-bucket/iceberg-warehouse/` | Yes | s3://doris | -| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | -| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | -| `glue.access_key` | AWS Access Key ID | Yes | Empty | -| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | -| `glue.catalog_id` | Glue Catalog ID (not yet supported) | No | Empty | -| `glue.role_arn` | IAM Role ARN for accessing Glue (not yet supported) | No | Empty | -| `glue.external_id` | IAM External ID for accessing Glue (not yet supported) | No | Empty | +| Parameter Name | Description | Required | Default Value | +|------------------------|------------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `iceberg` | Yes | None | +| `iceberg.catalog.type` | Fixed as `glue` | Yes | None | +| `warehouse` | Iceberg data warehouse path, e.g., `s3://my-bucket/iceberg-warehouse/` | Yes | s3://doris | +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | No | Empty | +| `glue.secret_key` | AWS Secret Access Key | No | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue (not supported yet) | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue (not supported yet) | No | Empty | -### Example +#### Example ```sql CREATE CATALOG iceberg_glue_catalog PROPERTIES ( @@ -78,23 +126,23 @@ CREATE CATALOG iceberg_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Rest Catalog +### Iceberg Glue Rest Catalog -Iceberg Glue Rest Catalog accesses Glue through the Glue Rest Catalog interface. Currently only supports Iceberg tables stored in AWS S3 Table Bucket. Configuration parameters are as follows: +Iceberg Glue Rest Catalog accesses Glue through Glue Rest Catalog interface. Currently only supports Iceberg tables stored in AWS S3 Table Bucket. 
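After such a catalog is created (see the configuration and example below), the namespaces in the S3 Table Bucket appear as databases and can be queried directly. A minimal sketch, assuming the catalog is named `glue_s3` and the bucket already contains a namespace `ns1` with a table `events` (illustrative names):

```sql
-- List the namespaces exposed by the S3 Table Bucket
SHOW DATABASES FROM glue_s3;

-- Query a table inside a namespace
SELECT * FROM glue_s3.ns1.events LIMIT 10;

-- Re-sync metadata after tables are added or changed on the AWS side
REFRESH CATALOG glue_s3;
```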
Configuration as follows: -| Parameter Name | Description | Required | Default Value | -|----------------------------------|---------------------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `iceberg` | Yes | None | -| `iceberg.catalog.type` | Fixed value `rest` | Yes | None | +| Parameter Name | Description | Required | Default Value | +|----------------------------------|-------------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `iceberg` | Yes | None | +| `iceberg.catalog.type` | Fixed as `rest` | Yes | None | | `iceberg.rest.uri` | Glue Rest service endpoint, e.g., `https://glue.ap-east-1.amazonaws.com/iceberg` | Yes | None | -| `warehouse` | Iceberg warehouse path, e.g., `:s3tablescatalog/` | Yes | None | -| `iceberg.rest.sigv4-enabled` | Enable V4 signature format, fixed value `true` | Yes | None | -| `iceberg.rest.signing-name` | Signature type, fixed value `glue` | Yes | Empty | -| `iceberg.rest.access-key-id` | Access Key for accessing Glue (also used for S3 Bucket access) | Yes | Empty | -| `iceberg.rest.secret-access-key` | Secret Key for accessing Glue (also used for S3 Bucket access) | Yes | Empty | -| `iceberg.rest.signing-region` | AWS Glue region, e.g., `us-east-1` | Yes | Empty | +| `warehouse` | Iceberg data warehouse path, e.g., `:s3tablescatalog/` | Yes | None | +| `iceberg.rest.sigv4-enabled` | Enable V4 signature format, fixed as `true` | Yes | None | +| `iceberg.rest.signing-name` | Signature type, fixed as `glue` | Yes | Empty | +| `iceberg.rest.access-key-id` | Access Key for accessing Glue (also used for accessing S3 Bucket) | Yes | Empty | +| `iceberg.rest.secret-access-key` | Secret Key for accessing Glue (also used for accessing S3 Bucket) | Yes | Empty | +| `iceberg.rest.signing-region` | AWS Glue region, e.g., `us-east-1` | Yes | Empty | -### Example +#### Example ```sql CREATE CATALOG glue_s3 PROPERTIES ( @@ -109,3 +157,89 @@ CREATE CATALOG glue_s3 PROPERTIES ( 'iceberg.rest.signing-region' = '' ); ``` + + +## Permission Policies + +Depending on usage scenarios, they can be divided into **read-only** and **read-write** policies. + +### 1. Read-Only Permissions + +Only allows reading database and table information from Glue Catalog. + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadOnly", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 2. Read-Write Permissions + +Based on read-only permissions, allows creating/modifying/deleting databases and tables. + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadWrite", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions", + "glue:CreateDatabase", + "glue:UpdateDatabase", + "glue:DeleteDatabase", + "glue:CreateTable", + "glue:UpdateTable", + "glue:DeleteTable" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### Notes + +1. Placeholder Replacement + + - `` → Your AWS region (e.g., `us-east-1`). + - `` → Your AWS account ID (12-digit number). + +2. 
Principle of Least Privilege
+
+   - If only querying, do not grant write permissions.
+   - You can replace `*` with specific database/table ARNs to further restrict permissions.
+
+3. S3 Permissions
+
+   - The above policies cover only the Glue Catalog.
+   - If you need to read data files, additional S3 permissions are required (such as `s3:GetObject` and `s3:ListBucket`).
\ No newline at end of file
diff --git a/versioned_docs/version-2.1/lakehouse/metastores/iceberg-rest.md b/versioned_docs/version-2.1/lakehouse/metastores/iceberg-rest.md
index be03e029efc57..1d89566ad31f9 100644
--- a/versioned_docs/version-2.1/lakehouse/metastores/iceberg-rest.md
+++ b/versioned_docs/version-2.1/lakehouse/metastores/iceberg-rest.md
@@ -19,6 +19,7 @@ This document describes the supported parameters when connecting to and accessin
 | iceberg.rest.oauth2.credential | | `oauth2` credentials used to access `server-uri` to obtain token | - | No |
 | iceberg.rest.oauth2.server-uri | | URI address for obtaining `oauth2` token, used in conjunction with `iceberg.rest.oauth2.credential` | - | No |
 | iceberg.rest.vended-credentials-enabled | | Whether to enable `vended-credentials` functionality. When enabled, it will obtain storage system access credentials such as `access-key` and `secret-key` from the rest server, eliminating the need for manual specification. Requires rest server support for this capability. | `false` | No |
+| iceberg.rest.nested-namespace-enabled | | (Supported since version 3.1.2+) Whether to enable support for Nested Namespace. If `true`, Nested Namespaces will be flattened and displayed as Database names, such as `parent_ns.child_ns`. Some Rest Catalog services, such as AWS Glue, do not support Nested Namespace, so this parameter should be left `false` for them. | `false` | No |
 
 > Note:
 >
@@ -28,6 +29,27 @@ This document describes the supported parameters when connecting to and accessin
 >
 > 3.
For AWS Glue Rest Catalog, please refer to the [AWS Glue documentation](./aws-glue.md) +## Nested Namespace + +Since 3.1.2, to fully access Nested Namespace, in addition to setting `iceberg.rest.nested-namespace-enabled` to `true` in the Catalog properties, you also need to enable the following global parameter: + +``` +SET GLOBAL enable_nested_namespace=true; +``` + +Assuming the Catalog is "ice", Namespace is "ns1.ns2", and Table is "tbl1", you can access Nested Namespace in the following ways: + +```sql +mysql> USE ice.ns1.ns2; +mysql> SELECT k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT tbl1.k1 FROM `ns1.ns2`.tbl1; +mysql> SELECT `ns1.ns2`.tbl1.k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT ice.`ns1.ns2`.tbl1.k1 FROM tbl1; +mysql> REFRESH CATALOG ice; +mysql> REFRESH DATABASE ice.`ns1.ns2`; +mysql> REFRESH TABLE ice.`ns1.ns2`.tbl1; +``` + ## Example Configurations - Rest Catalog service without authentication @@ -111,6 +133,43 @@ This document describes the supported parameters when connecting to and accessin ); ``` +- Connecting to Snowflake Open Catalog (Since 3.1.2) + + ```sql + -- Enable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 'iceberg.rest.vended-credentials-enabled' = 'true', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + + ```sql + -- Disable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 's3.access_key' = '', + 's3.secret_key' = '', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + - Connecting to Apache Gravitino Rest Catalog ```sql diff --git a/versioned_docs/version-2.1/lakehouse/storages/s3.md b/versioned_docs/version-2.1/lakehouse/storages/s3.md index 90e671f866646..8c5d8473940f4 100644 --- a/versioned_docs/version-2.1/lakehouse/storages/s3.md +++ b/versioned_docs/version-2.1/lakehouse/storages/s3.md @@ -15,18 +15,18 @@ This document describes the parameters required for accessing AWS S3. These para ## Parameter Overview -| Property Name | Legacy Name | Description | Default Value | Required | -|------------------------------|-------------|-------------------------------------------------|---------------|----------| -| s3.endpoint | | S3 service access endpoint, e.g., s3.us-east-1.amazonaws.com | None | No | -| s3.access_key | | AWS Access Key for authentication | None | No | -| s3.secret_key | | AWS Secret Key for authentication | None | No | -| s3.region | | S3 region, e.g., us-east-1. 
Highly recommended to configure | None | Yes | -| s3.use_path_style | | Whether to use path-style access | FALSE | No | -| s3.connection.maximum | | Maximum number of connections for high concurrency scenarios | 50 | No | -| s3.connection.request.timeout| | Request timeout in milliseconds for connection acquisition | 3000 | No | -| s3.connection.timeout | | Connection establishment timeout in milliseconds | 1000 | No | -| s3.role_arn | | Role ARN when using Assume Role mode | None | No | -| s3.external_id | | External ID used with s3.role_arn | None | No | +| Property Name | Legacy Name | Description | Default | Required | +|------------------------------|-------------|--------------------------------------------------|---------|----------| +| s3.endpoint | | S3 service access endpoint, e.g., s3.us-east-1.amazonaws.com | None | No | +| s3.access_key | | AWS Access Key for authentication | None | No | +| s3.secret_key | | AWS Secret Key for authentication | None | No | +| s3.region | | S3 region, e.g., us-east-1. Strongly recommended | None | Yes | +| s3.use_path_style | | Whether to use path-style access | FALSE | No | +| s3.connection.maximum | | Maximum number of connections for high concurrency scenarios | 50 | No | +| s3.connection.request.timeout| | Request timeout (milliseconds), controls connection acquisition timeout | 3000 | No | +| s3.connection.timeout | | Connection establishment timeout (milliseconds) | 1000 | No | +| s3.role_arn | | Role ARN specified when using Assume Role mode | None | No | +| s3.external_id | | External ID used with s3.role_arn | None | No | ## Authentication Configuration @@ -41,7 +41,7 @@ Doris supports the following two methods to access S3: "s3.region"="us-east-1" ``` -2. Assume Role +2. Assume Role Mode Suitable for cross-account and temporary authorization access. Automatically obtains temporary credentials through role authorization. @@ -52,13 +52,13 @@ Doris supports the following two methods to access S3: "s3.region"="us-east-1" ``` -> If both Access Key and Role ARN are configured, Access Key mode takes priority. +> If both Access Key and Role ARN are configured, Access Key mode takes precedence. ## Accessing S3 Directory Bucket > This feature is supported since version 3.1.0. -Amazon S3 Express One Zone (also known as Directory Bucket) provides higher performance but has a different endpoint format. +Amazon S3 Express One Zone (also known as Directory Bucket) provides higher performance, but has a different endpoint format. * Regular bucket: s3.us-east-1.amazonaws.com * Directory Bucket: s3express-usw2-az1.us-west-2.amazonaws.com @@ -71,5 +71,84 @@ Example: "s3.access_key"="ak", "s3.secret_key"="sk", "s3.endpoint"="s3express-usw2-az1.us-west-2.amazonaws.com", -"s3.region"="us-west +"s3.region"="us-west-2" ``` + +## Permission Policies + +Depending on the use case, permissions can be categorized into **read-only** and **read-write** policies. + +### 1. Read-only Permissions + +Only allows reading objects from S3. Suitable for LOAD, TVF, querying EXTERNAL CATALOG, and other scenarios. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:GetObjectVersion", + ], + "Resource": "arn:aws:s3:::/your-prefix/*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### 2. Read-write Permissions + +Based on read-only permissions, additionally allows deleting, creating, and modifying objects. 
Suitable for EXPORT, OUTFILE, and EXTERNAL CATALOG write-back scenarios. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:PutObject", + "s3:GetObject", + "s3:GetObjectVersion", + "s3:DeleteObject", + "s3:DeleteObjectVersion", + "s3:AbortMultipartUpload", + "s3:ListMultipartUploadParts" + ], + "Resource": "arn:aws:s3::://*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation", + "s3:GetBucketVersioning", + "s3:GetLifecycleConfiguration" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### Notes + +1. Placeholder Replacement + + - `` → Your S3 Bucket name. + - `` → Your AWS account ID (12-digit number). + +2. Principle of Least Privilege + + - If only querying, do not grant write permissions. + \ No newline at end of file diff --git a/versioned_docs/version-3.0/lakehouse/catalogs/hive-catalog.mdx b/versioned_docs/version-3.0/lakehouse/catalogs/hive-catalog.mdx index 46205b43285db..bd2d396f6730d 100644 --- a/versioned_docs/version-3.0/lakehouse/catalogs/hive-catalog.mdx +++ b/versioned_docs/version-3.0/lakehouse/catalogs/hive-catalog.mdx @@ -474,6 +474,7 @@ Hive transactional tables are supported from version 3.x onwards. For details, r 'glue.secret_key' = '' ); ``` + When Glue service authentication information differs from S3 authentication information, you can specify S3 authentication information separately in the following way. ```sql CREATE CATALOG hive_glue_on_s3_catalog PROPERTIES ( @@ -489,6 +490,16 @@ Hive transactional tables are supported from version 3.x onwards. For details, r 's3.secret_key' = '' ); ``` + + Using IAM Assumed Role to obtain S3 access credentials (Since 3.1.2+) + ```sql + CREATE CATALOG `glue_hive_iamrole` PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); diff --git a/versioned_docs/version-3.0/lakehouse/catalogs/iceberg-catalog.mdx b/versioned_docs/version-3.0/lakehouse/catalogs/iceberg-catalog.mdx index e80dc1f510137..0f1b5bd0b6aee 100644 --- a/versioned_docs/version-3.0/lakehouse/catalogs/iceberg-catalog.mdx +++ b/versioned_docs/version-3.0/lakehouse/catalogs/iceberg-catalog.mdx @@ -137,6 +137,43 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher > > You can check whether the source type has timezone information in the Extra column of the `DESCRIBE table_name` statement. If it shows `WITH_TIMEZONE`, it indicates that the source type is a timezone-aware type. (Supported since 3.1.0). +## Namespace Mapping + +Iceberg's metadata hierarchy is Catalog -> Namespace -> Table. Namespace can have multiple levels (Nested Namespace). + +``` + ┌─────────┐ + │ Catalog │ + └────┬────┘ + │ + ┌─────┴─────┐ + ┌──▼──┐ ┌──▼──┐ + │ NS1 │ │ NS2 │ + └──┬──┘ └──┬──┘ + │ │ +┌────▼───┐ ┌──▼──┐ +│ Table1 │ │ NS3 │ +└────────┘ └──┬──┘ + │ + ┌──────┴───────┐ + ┌────▼───┐ ┌────▼───┐ + │ Table2 │ │ Table3 │ + └────────┘ └────────┘ +``` + + +Starting from version 3.1.2, for Iceberg Rest Catalog, Doris supports mapping of Nested Namespace. + +In the above example, tables will be mapped to Doris metadata according to the following logic: + +| Catalog | Database | Table | +| --- | --- | --- | +| Catalog | NS1 | Table1 | +| Catalog | NS2.NS3 | Table2 | +| Catalog | NS2.NS3 | Table3 | + +Support for Nested Namespace needs to be explicitly enabled. 
For details, please refer to [Iceberg Rest Catalog](../metastores/iceberg-rest.md) + ## Examples ### Hive Metastore @@ -469,6 +506,7 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher 'glue.secret_key' = '' ); ``` + When Glue service authentication credentials differ from S3 authentication credentials, you can specify S3 authentication credentials separately using the following method. ```sql CREATE CATALOG `iceberg_glue_on_s3_catalog_` PROPERTIES ( @@ -485,6 +523,18 @@ The current Iceberg dependency is version 1.6.1, which is compatible with higher 's3.secret_key' = '' ); ``` + + Using IAM Assumed Role to obtain S3 access credentials (Since 3.1.2+) + ```sql + CREATE CATALOG `glue_iceberg_iamrole` PROPERTIES ( + 'type' = 'iceberg', + 'iceberg.catalog.type' = 'glue', + 'warehouse' = 's3://bucket/warehouse', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + 'glue.role_arn' = '' + ); + ``` diff --git a/versioned_docs/version-3.0/lakehouse/metastores/aws-glue.md b/versioned_docs/version-3.0/lakehouse/metastores/aws-glue.md index be5f6fa868e6a..fba9332b51f64 100644 --- a/versioned_docs/version-3.0/lakehouse/metastores/aws-glue.md +++ b/versioned_docs/version-3.0/lakehouse/metastores/aws-glue.md @@ -11,31 +11,79 @@ This document describes the parameter configuration when using **AWS Glue Catalo AWS Glue Catalog currently supports three types of Catalogs: -| Catalog Type | Type Identifier (`type`) | Description | -|--------------|-------------------------|----------------------------------------------------| -| Hive | glue | Catalog for connecting to Hive Metastore | -| Iceberg | glue | Catalog for connecting to Iceberg table format | -| Iceberg | rest | Catalog for connecting to Iceberg via Glue Rest | +| Catalog Type | Type Identifier (`type`) | Description | +|-------------|-------------------------|------------------------------------------------| +| Hive | glue | Catalog for connecting to Hive Metastore | +| Iceberg | glue | Catalog for connecting to Iceberg table format | +| Iceberg | rest | Catalog for connecting to Iceberg table format via Glue Rest Catalog | -This document provides detailed descriptions of the parameters for each type to help users with configuration. +This documentation provides detailed parameter descriptions for each type to facilitate user configuration. -## Hive Glue Catalog +## Common Parameters Overview +| Parameter Name | Description | Required | Default Value | +|--------------------------|---------------------------------------------------------------|----------|---------------| +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | Yes | Empty | +| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue (supported since 3.1.2+) | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue (supported since 3.1.2+) | No | Empty | -Hive Glue Catalog is used to access Hive tables through AWS Glue's Hive Metastore compatible interface. 
Configuration parameters are as follows: +### Authentication Parameters -| Parameter Name | Description | Required | Default Value | -|---------------------------|----------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `hms` | Yes | None | -| `hive.metastore.type` | Fixed value `glue` | Yes | None | -| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | -| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | -| `glue.access_key` | AWS Access Key ID | Yes | Empty | -| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | -| `glue.catalog_id` | Glue Catalog ID (not yet supported) | No | Empty | -| `glue.role_arn` | IAM Role ARN for accessing Glue (not yet supported) | No | Empty | -| `glue.external_id` | IAM External ID for accessing Glue (not yet supported) | No | Empty | +Accessing Glue requires authentication information, supporting the following two methods: -### Example +1. Access Key Authentication + + Authenticate access to Glue through Access Key provided by `glue.access_key` and `glue.secret_key`. + +2. IAM Role Authentication (supported since 3.1.2+) + + Authenticate access to Glue through IAM Role provided by `glue.role_arn`. + + This method requires Doris to be deployed on AWS EC2, and the EC2 instance needs to be bound to an IAM Role that has permission to access Glue. + + If access through External ID is required, you need to configure `glue.external_id` as well. + +Notes: + +- At least one of the two methods must be configured. If both methods are configured, Access Key authentication takes priority. + +Example: + + ```sql + CREATE CATALOG hive_glue_catalog PROPERTIES ( + 'type' = 'hms', + 'hive.metastore.type' = 'glue', + 'glue.region' = 'us-east-1', + 'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com', + -- Using Access Key authentication + 'glue.access_key' = '', + 'glue.secret_key' = '' + -- Or using IAM Role authentication + -- 'glue.role_arn' = '', + -- 'glue.external_id' = '' + ); + ``` + +### Hive Glue Catalog + +Hive Glue Catalog is used to access Hive tables through AWS Glue's Hive Metastore compatible interface. Configuration as follows: + +| Parameter Name | Description | Required | Default Value | +|--------------------------|---------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `hms` | Yes | None | +| `hive.metastore.type` | Fixed as `glue` | Yes | None | +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | No | Empty | +| `glue.secret_key` | AWS Secret Access Key | No | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue | No | Empty | + +#### Example ```sql CREATE CATALOG hive_glue_catalog PROPERTIES ( @@ -48,24 +96,24 @@ CREATE CATALOG hive_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Catalog +### Iceberg Glue Catalog -Iceberg Glue Catalog accesses Glue through the Glue Client. Configuration parameters are as follows: +Iceberg Glue Catalog accesses Glue through Glue Client. 
Configuration as follows: -| Parameter Name | Description | Required | Default Value | -|-------------------------|-----------------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `iceberg` | Yes | None | -| `iceberg.catalog.type` | Fixed value `glue` | Yes | None | -| `warehouse` | Iceberg warehouse path, e.g., `s3://my-bucket/iceberg-warehouse/` | Yes | s3://doris | -| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | -| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | -| `glue.access_key` | AWS Access Key ID | Yes | Empty | -| `glue.secret_key` | AWS Secret Access Key | Yes | Empty | -| `glue.catalog_id` | Glue Catalog ID (not yet supported) | No | Empty | -| `glue.role_arn` | IAM Role ARN for accessing Glue (not yet supported) | No | Empty | -| `glue.external_id` | IAM External ID for accessing Glue (not yet supported) | No | Empty | +| Parameter Name | Description | Required | Default Value | +|------------------------|------------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `iceberg` | Yes | None | +| `iceberg.catalog.type` | Fixed as `glue` | Yes | None | +| `warehouse` | Iceberg data warehouse path, e.g., `s3://my-bucket/iceberg-warehouse/` | Yes | s3://doris | +| `glue.region` | AWS Glue region, e.g., `us-east-1` | Yes | None | +| `glue.endpoint` | AWS Glue endpoint, e.g., `https://glue.us-east-1.amazonaws.com` | Yes | None | +| `glue.access_key` | AWS Access Key ID | No | Empty | +| `glue.secret_key` | AWS Secret Access Key | No | Empty | +| `glue.catalog_id` | Glue Catalog ID (not supported yet) | No | Empty | +| `glue.role_arn` | IAM Role ARN for accessing Glue (not supported yet) | No | Empty | +| `glue.external_id` | IAM External ID for accessing Glue (not supported yet) | No | Empty | -### Example +#### Example ```sql CREATE CATALOG iceberg_glue_catalog PROPERTIES ( @@ -78,23 +126,23 @@ CREATE CATALOG iceberg_glue_catalog PROPERTIES ( ); ``` -## Iceberg Glue Rest Catalog +### Iceberg Glue Rest Catalog -Iceberg Glue Rest Catalog accesses Glue through the Glue Rest Catalog interface. Currently only supports Iceberg tables stored in AWS S3 Table Bucket. Configuration parameters are as follows: +Iceberg Glue Rest Catalog accesses Glue through Glue Rest Catalog interface. Currently only supports Iceberg tables stored in AWS S3 Table Bucket. 
Configuration as follows: -| Parameter Name | Description | Required | Default Value | -|----------------------------------|---------------------------------------------------------------------------------|----------|---------------| -| `type` | Fixed value `iceberg` | Yes | None | -| `iceberg.catalog.type` | Fixed value `rest` | Yes | None | +| Parameter Name | Description | Required | Default Value | +|----------------------------------|-------------------------------------------------------------------|----------|---------------| +| `type` | Fixed as `iceberg` | Yes | None | +| `iceberg.catalog.type` | Fixed as `rest` | Yes | None | | `iceberg.rest.uri` | Glue Rest service endpoint, e.g., `https://glue.ap-east-1.amazonaws.com/iceberg` | Yes | None | -| `warehouse` | Iceberg warehouse path, e.g., `:s3tablescatalog/` | Yes | None | -| `iceberg.rest.sigv4-enabled` | Enable V4 signature format, fixed value `true` | Yes | None | -| `iceberg.rest.signing-name` | Signature type, fixed value `glue` | Yes | Empty | -| `iceberg.rest.access-key-id` | Access Key for accessing Glue (also used for S3 Bucket access) | Yes | Empty | -| `iceberg.rest.secret-access-key` | Secret Key for accessing Glue (also used for S3 Bucket access) | Yes | Empty | -| `iceberg.rest.signing-region` | AWS Glue region, e.g., `us-east-1` | Yes | Empty | +| `warehouse` | Iceberg data warehouse path, e.g., `:s3tablescatalog/` | Yes | None | +| `iceberg.rest.sigv4-enabled` | Enable V4 signature format, fixed as `true` | Yes | None | +| `iceberg.rest.signing-name` | Signature type, fixed as `glue` | Yes | Empty | +| `iceberg.rest.access-key-id` | Access Key for accessing Glue (also used for accessing S3 Bucket) | Yes | Empty | +| `iceberg.rest.secret-access-key` | Secret Key for accessing Glue (also used for accessing S3 Bucket) | Yes | Empty | +| `iceberg.rest.signing-region` | AWS Glue region, e.g., `us-east-1` | Yes | Empty | -### Example +#### Example ```sql CREATE CATALOG glue_s3 PROPERTIES ( @@ -109,3 +157,89 @@ CREATE CATALOG glue_s3 PROPERTIES ( 'iceberg.rest.signing-region' = '' ); ``` + + +## Permission Policies + +Depending on usage scenarios, they can be divided into **read-only** and **read-write** policies. + +### 1. Read-Only Permissions + +Only allows reading database and table information from Glue Catalog. + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadOnly", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### 2. Read-Write Permissions + +Based on read-only permissions, allows creating/modifying/deleting databases and tables. + +``` json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "GlueCatalogReadWrite", + "Effect": "Allow", + "Action": [ + "glue:GetCatalog", + "glue:GetDatabase", + "glue:GetDatabases", + "glue:GetTable", + "glue:GetTables", + "glue:GetPartitions", + "glue:CreateDatabase", + "glue:UpdateDatabase", + "glue:DeleteDatabase", + "glue:CreateTable", + "glue:UpdateTable", + "glue:DeleteTable" + ], + "Resource": [ + "arn:aws:glue:::catalog", + "arn:aws:glue:::database/*", + "arn:aws:glue:::table/*/*" + ] + } + ] +} +``` + +### Notes + +1. Placeholder Replacement + + - `` → Your AWS region (e.g., `us-east-1`). + - `` → Your AWS account ID (12-digit number). + +2. 
Principle of Least Privilege
+
+   - If only querying, do not grant write permissions.
+   - You can replace `*` with specific database/table ARNs to further restrict permissions.
+
+3. S3 Permissions
+
+   - The above policies cover only the Glue Catalog.
+   - If you need to read data files, additional S3 permissions are required (such as `s3:GetObject` and `s3:ListBucket`).
\ No newline at end of file
diff --git a/versioned_docs/version-3.0/lakehouse/metastores/iceberg-rest.md b/versioned_docs/version-3.0/lakehouse/metastores/iceberg-rest.md
index be03e029efc57..1d89566ad31f9 100644
--- a/versioned_docs/version-3.0/lakehouse/metastores/iceberg-rest.md
+++ b/versioned_docs/version-3.0/lakehouse/metastores/iceberg-rest.md
@@ -19,6 +19,7 @@ This document describes the supported parameters when connecting to and accessin
 | iceberg.rest.oauth2.credential | | `oauth2` credentials used to access `server-uri` to obtain token | - | No |
 | iceberg.rest.oauth2.server-uri | | URI address for obtaining `oauth2` token, used in conjunction with `iceberg.rest.oauth2.credential` | - | No |
 | iceberg.rest.vended-credentials-enabled | | Whether to enable `vended-credentials` functionality. When enabled, it will obtain storage system access credentials such as `access-key` and `secret-key` from the rest server, eliminating the need for manual specification. Requires rest server support for this capability. | `false` | No |
+| iceberg.rest.nested-namespace-enabled | | (Supported since version 3.1.2+) Whether to enable support for Nested Namespace. If `true`, Nested Namespaces will be flattened and displayed as Database names, such as `parent_ns.child_ns`. Some Rest Catalog services, such as AWS Glue, do not support Nested Namespace, so this parameter should be left `false` for them. | `false` | No |
 
 > Note:
 >
@@ -28,6 +29,27 @@ This document describes the supported parameters when connecting to and accessin
 >
 > 3.
For AWS Glue Rest Catalog, please refer to the [AWS Glue documentation](./aws-glue.md) +## Nested Namespace + +Since 3.1.2, to fully access Nested Namespace, in addition to setting `iceberg.rest.nested-namespace-enabled` to `true` in the Catalog properties, you also need to enable the following global parameter: + +``` +SET GLOBAL enable_nested_namespace=true; +``` + +Assuming the Catalog is "ice", Namespace is "ns1.ns2", and Table is "tbl1", you can access Nested Namespace in the following ways: + +```sql +mysql> USE ice.ns1.ns2; +mysql> SELECT k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT tbl1.k1 FROM `ns1.ns2`.tbl1; +mysql> SELECT `ns1.ns2`.tbl1.k1 FROM ice.`ns1.ns2`.tbl1; +mysql> SELECT ice.`ns1.ns2`.tbl1.k1 FROM tbl1; +mysql> REFRESH CATALOG ice; +mysql> REFRESH DATABASE ice.`ns1.ns2`; +mysql> REFRESH TABLE ice.`ns1.ns2`.tbl1; +``` + ## Example Configurations - Rest Catalog service without authentication @@ -111,6 +133,43 @@ This document describes the supported parameters when connecting to and accessin ); ``` +- Connecting to Snowflake Open Catalog (Since 3.1.2) + + ```sql + -- Enable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 'iceberg.rest.vended-credentials-enabled' = 'true', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + + ```sql + -- Disable vended-credentials + CREATE CATALOG snowflake_open_catalog PROPERTIES ( + 'type' = 'iceberg', + 'warehouse' = '', + 'iceberg.catalog.type' = 'rest', + 'iceberg.rest.uri' = 'https://.snowflakecomputing.com/polaris/api/catalog', + 'iceberg.rest.security.type' = 'oauth2', + 'iceberg.rest.oauth2.credential' = ':', + 'iceberg.rest.oauth2.scope' = 'PRINCIPAL_ROLE:', + 's3.access_key' = '', + 's3.secret_key' = '', + 's3.endpoint' = 'https://s3.us-west-2.amazonaws.com', + 's3.region' = 'us-west-2', + 'iceberg.rest.nested-namespace-enabled' = 'true' + ); + ``` + - Connecting to Apache Gravitino Rest Catalog ```sql diff --git a/versioned_docs/version-3.0/lakehouse/storages/s3.md b/versioned_docs/version-3.0/lakehouse/storages/s3.md index 90e671f866646..8c5d8473940f4 100644 --- a/versioned_docs/version-3.0/lakehouse/storages/s3.md +++ b/versioned_docs/version-3.0/lakehouse/storages/s3.md @@ -15,18 +15,18 @@ This document describes the parameters required for accessing AWS S3. These para ## Parameter Overview -| Property Name | Legacy Name | Description | Default Value | Required | -|------------------------------|-------------|-------------------------------------------------|---------------|----------| -| s3.endpoint | | S3 service access endpoint, e.g., s3.us-east-1.amazonaws.com | None | No | -| s3.access_key | | AWS Access Key for authentication | None | No | -| s3.secret_key | | AWS Secret Key for authentication | None | No | -| s3.region | | S3 region, e.g., us-east-1. 
Highly recommended to configure | None | Yes | -| s3.use_path_style | | Whether to use path-style access | FALSE | No | -| s3.connection.maximum | | Maximum number of connections for high concurrency scenarios | 50 | No | -| s3.connection.request.timeout| | Request timeout in milliseconds for connection acquisition | 3000 | No | -| s3.connection.timeout | | Connection establishment timeout in milliseconds | 1000 | No | -| s3.role_arn | | Role ARN when using Assume Role mode | None | No | -| s3.external_id | | External ID used with s3.role_arn | None | No | +| Property Name | Legacy Name | Description | Default | Required | +|------------------------------|-------------|--------------------------------------------------|---------|----------| +| s3.endpoint | | S3 service access endpoint, e.g., s3.us-east-1.amazonaws.com | None | No | +| s3.access_key | | AWS Access Key for authentication | None | No | +| s3.secret_key | | AWS Secret Key for authentication | None | No | +| s3.region | | S3 region, e.g., us-east-1. Strongly recommended | None | Yes | +| s3.use_path_style | | Whether to use path-style access | FALSE | No | +| s3.connection.maximum | | Maximum number of connections for high concurrency scenarios | 50 | No | +| s3.connection.request.timeout| | Request timeout (milliseconds), controls connection acquisition timeout | 3000 | No | +| s3.connection.timeout | | Connection establishment timeout (milliseconds) | 1000 | No | +| s3.role_arn | | Role ARN specified when using Assume Role mode | None | No | +| s3.external_id | | External ID used with s3.role_arn | None | No | ## Authentication Configuration @@ -41,7 +41,7 @@ Doris supports the following two methods to access S3: "s3.region"="us-east-1" ``` -2. Assume Role +2. Assume Role Mode Suitable for cross-account and temporary authorization access. Automatically obtains temporary credentials through role authorization. @@ -52,13 +52,13 @@ Doris supports the following two methods to access S3: "s3.region"="us-east-1" ``` -> If both Access Key and Role ARN are configured, Access Key mode takes priority. +> If both Access Key and Role ARN are configured, Access Key mode takes precedence. ## Accessing S3 Directory Bucket > This feature is supported since version 3.1.0. -Amazon S3 Express One Zone (also known as Directory Bucket) provides higher performance but has a different endpoint format. +Amazon S3 Express One Zone (also known as Directory Bucket) provides higher performance, but has a different endpoint format. * Regular bucket: s3.us-east-1.amazonaws.com * Directory Bucket: s3express-usw2-az1.us-west-2.amazonaws.com @@ -71,5 +71,84 @@ Example: "s3.access_key"="ak", "s3.secret_key"="sk", "s3.endpoint"="s3express-usw2-az1.us-west-2.amazonaws.com", -"s3.region"="us-west +"s3.region"="us-west-2" ``` + +## Permission Policies + +Depending on the use case, permissions can be categorized into **read-only** and **read-write** policies. + +### 1. Read-only Permissions + +Only allows reading objects from S3. Suitable for LOAD, TVF, querying EXTERNAL CATALOG, and other scenarios. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:GetObjectVersion", + ], + "Resource": "arn:aws:s3:::/your-prefix/*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### 2. Read-write Permissions + +Based on read-only permissions, additionally allows deleting, creating, and modifying objects. 
Suitable for EXPORT, OUTFILE, and EXTERNAL CATALOG write-back scenarios. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:PutObject", + "s3:GetObject", + "s3:GetObjectVersion", + "s3:DeleteObject", + "s3:DeleteObjectVersion", + "s3:AbortMultipartUpload", + "s3:ListMultipartUploadParts" + ], + "Resource": "arn:aws:s3::://*" + }, + { + "Effect": "Allow", + "Action": [ + "s3:ListBucket", + "s3:GetBucketLocation", + "s3:GetBucketVersioning", + "s3:GetLifecycleConfiguration" + ], + "Resource": "arn:aws:s3:::" + } + ] +} +``` + +### Notes + +1. Placeholder Replacement + + - `` → Your S3 Bucket name. + - `` → Your AWS account ID (12-digit number). + +2. Principle of Least Privilege + + - If only querying, do not grant write permissions. + \ No newline at end of file diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md index 786205b08454a..c811e75b1d436 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/EXPORT.md @@ -74,7 +74,7 @@ The `EXPORT` command is used to export data from a specified table to files at a - `timeout`: Timeout for export job, default is 2 hours, unit is seconds. - - `compress_type`: (Supported since 2.1.5) When specifying the export file format as Parquet / ORC files, you can specify the compression method used by Parquet / ORC files. Parquet file format can specify compression methods as SNAPPY, GZIP, BROTLI, ZSTD, LZ4, and PLAIN, with default value SNAPPY. ORC file format can specify compression methods as PLAIN, SNAPPY, ZLIB, and ZSTD, with default value ZLIB. This parameter is supported starting from version 2.1.5. (PLAIN means no compression) + - `compress_type`: (Supported since 2.1.5) When specifying the export file format as Parquet / ORC files, you can specify the compression method used by Parquet / ORC files. Parquet file format can specify compression methods as SNAPPY, GZIP, BROTLI, ZSTD, LZ4, and PLAIN, with default value SNAPPY. ORC file format can specify compression methods as PLAIN, SNAPPY, ZLIB, and ZSTD, with default value ZLIB. This parameter is supported starting from version 2.1.5. (PLAIN means no compression). Starting from version 3.1.1, supports specifying compression algorithms for CSV format, currently supports "plain", "gz", "bz2", "snappyblock", "lz4block", "zstd". :::caution Note To use the delete_existing_files parameter, you also need to add the configuration `enable_delete_existing_files = true` in fe.conf and restart fe, then delete_existing_files will take effect. delete_existing_files = true is a dangerous operation, it's recommended to use only in test environments. 
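
To illustrate the CSV compression support documented in the `compress_type` change above (since 3.1.1), a minimal sketch of an `EXPORT` statement writing gzip-compressed CSV to S3 follows. This is an illustrative example only: the table name, export path, endpoint, and credentials are placeholders; only the `format` and `compress_type` properties come from the parameter list documented in this statement.

```sql
-- Illustrative sketch only: table, path, endpoint, and credentials are placeholders.
EXPORT TABLE sales_db.sales_tbl
TO "s3://my-bucket/export/sales_"
PROPERTIES (
    "format" = "csv",
    "compress_type" = "gz"   -- CSV compression codec, one of the values listed above
)
WITH S3 (
    "s3.endpoint" = "s3.us-east-1.amazonaws.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk"
);
```

Any of the other codecs listed above ("plain", "bz2", "snappyblock", "lz4block", "zstd") can be substituted in the same way.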
diff --git a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md index f0d59be7cee43..7aafd98add28d 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md +++ b/versioned_docs/version-3.0/sql-manual/sql-statements/data-modification/load-and-export/OUTFILE.md @@ -1,16 +1,15 @@ --- { - "title": "OUTFILE", - "language": "en" + "title": "OUTFILE", + "language": "en" } - --- ## Description -This statement is used to export query results to a file using the `SELECT INTO OUTFILE` command. Currently, it supports exporting to remote storage, such as HDFS, S3, BOS, COS (Tencent Cloud), through the Broker process, S3 protocol, or HDFS protocol. +The `SELECT INTO OUTFILE` command is used to export query results to files. Currently supports exporting to remote storage such as HDFS, S3, BOS, COS (Tencent Cloud) through Broker process, S3 protocol or HDFS protocol. -## Syntax +## Syntax: ```sql @@ -21,33 +20,32 @@ INTO OUTFILE "" ## Required Parameters -**1. ``** +**1. ``** - The query statement must be a valid SQL statement. Please refer to the [query statement documentation](../../data-query/SELECT.md). +Query statement, must be a valid SQL, refer to [query statement documentation](../../data-query/SELECT.md). **2. ``** - file_path points to the path where the file is stored and the file prefix. Such as `hdfs://path/to/my_file_`. - - The final filename will consist of `my_file_`, the file number and the file format suffix. The file serial number starts from 0, and the number is the number of files to be divided. Such as: - - my_file_abcdefg_0.csv - - my_file_abcdefg_1.csv - - my_file_abcdegf_2.csv +File storage path and file prefix. Points to the file storage path and file prefix. For example `hdfs://path/to/my_file_`. +The final filename will consist of `my_file_`, file sequence number, and file format suffix. The file sequence number starts from 0, and the quantity is the number of files split. For example: +- my_file_abcdefg_0.csv +- my_file_abcdefg_1.csv +- my_file_abcdegf_2.csv - You can also omit the file prefix and specify only the file directory, such as: `hdfs://path/to/` +You can also omit the file prefix and only specify the file directory, such as `hdfs://path/to/` ## Optional Parameters **1. ``** - Specifies the export format. Supported formats include : - - `CSV` (Default) + Specify export format. Currently supports the following formats: + - `CSV` (default) - `PARQUET` - `CSV_WITH_NAMES` - `CSV_WITH_NAMES_AND_TYPES` - `ORC` - > Note: PARQUET, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC are supported starting in version 1.2 . + > Note: PARQUET, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, ORC are supported starting from version 1.2. **2. ``** @@ -55,73 +53,73 @@ INTO OUTFILE "" [ PROPERTIES (""="" [, ... ]) ] ``` -Specify related properties. Currently exporting via the Broker process, S3 protocol, or HDFS protocol is supported. +Currently supports export through Broker process, or through S3/HDFS protocol. -**File properties** -- `column_separator`: column separator,is only for CSV format. mulit-bytes is supported starting in version 1.2, such as: "\\x01", "abc". -- `line_delimiter`: line delimiter,is only for CSV format. mulit-bytes supported starting in version 1.2, such as: "\\x01", "abc". 
-- `max_file_size`: the size limit of a single file, if the result exceeds this value, it will be cut into multiple files, the value range of max_file_size is [5MB, 2GB] and the default is 1GB. (When specified that the file format is ORC, the size of the actual division file will be a multiples of 64MB, such as: specify max_file_size = 5MB, and actually use 64MB as the division; specify max_file_size = 65MB, and will actually use 128MB as cut division points.) -- `delete_existing_files`: default `false`. If it is specified as true, you will first delete all files specified in the directory specified by the file_path, and then export the data to the directory.For example: "file_path" = "/user/tmp", then delete all files and directory under "/user/"; "file_path" = "/user/tmp/", then delete all files and directory under "/user/tmp/" -- `file_suffix`: Specify the suffix of the export file. If this parameter is not specified, the default suffix for the file format will be used. +**Properties related to export file itself** +- `column_separator`: Column separator, only used for CSV related formats. Starting from version 1.2, supports multi-byte separators, such as: "\\x01", "abc". +- `line_delimiter`: Line delimiter, only used for CSV related formats. Starting from version 1.2, supports multi-byte separators, such as: "\\x01", "abc". +- `max_file_size`: Single file size limit, if the result exceeds this value, it will be split into multiple files, `max_file_size` value range is [5MB, 2GB], default is `1GB`. (When specifying export as ORC file format, the actual split file size will be a multiple of 64MB, for example: if `max_file_size = 5MB` is specified, it will actually be split by 64 MB; if `max_file_size = 65MB` is specified, it will actually be split by 128 MB) +- `delete_existing_files`: Default is `false`, if specified as `true`, it will first delete all files under the directory specified by `file_path`, then export data to that directory. For example: "file_path" = "/user/tmp", will delete all files and directories under "/user/"; "file_path" = "/user/tmp/", will delete all files and directories under "/user/tmp/". +- `file_suffix`: Specify the suffix of the exported file, if this parameter is not specified, the default suffix of the file format will be used. +- `compress_type`: When specifying the exported file format as Parquet / ORC file, you can specify the compression method used by Parquet / ORC file. Parquet file format can specify compression methods as SNAPPY, GZIP, BROTLI, ZSTD, LZ4 and PLAIN, default value is SNAPPY. ORC file format can specify compression methods as PLAIN, SNAPPY, ZLIB and ZSTD, default value is ZLIB. This parameter is supported starting from version 2.1.5. (PLAIN means no compression). Starting from version 3.1.1, supports specifying compression algorithms for CSV format, currently supports "plain", "gz", "bz2", "snappyblock", "lz4block", "zstd". -**Broker properties** _(need to be prefixed with `broker`)_ -- `broker.name: broker`: broker name -- `broker.hadoop.security.authentication`: specify the authentication method as kerberos -- `broker.kerberos_principal`: specifies the principal of kerberos -- `broker.kerberos_keytab`: specifies the path to the keytab file of kerberos. The file must be the absolute path to the file on the server where the broker process is located. 
and can be accessed by the Broker process +**Broker related properties** _(need to add prefix `broker.`)_ +- `broker.name: broker`: name +- `broker.hadoop.security.authentication`: Specify authentication method as kerberos +- `broker.kerberos_principal`: Specify kerberos principal +- `broker.kerberos_keytab`: Specify kerberos keytab file path. This file must be an absolute path of a file on the server where the Broker process is located. And it can be accessed by the Broker process -**HDFS properties** +**HDFS related properties** - `fs.defaultFS`: namenode address and port - `hadoop.username`: hdfs username -- `dfs.nameservices`: if hadoop enable HA, please set fs nameservice. See hdfs-site.xml -- `dfs.ha.namenodes.[nameservice ID]`: unique identifiers for each NameNode in the nameservice. See hdfs-site.xml -- `dfs.namenode.rpc-address.[nameservice ID].[name node ID]`: the fully-qualified RPC address for each NameNode to listen on. See hdfs-site.xml -- `dfs.client.failover.proxy.provider.[nameservice ID]`: the Java class that HDFS clients use to contact the Active NameNode, usually it is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider - -**For a kerberos-authentication enabled Hadoop cluster, additional properties need to be set:** -- `dfs.namenode.kerberos.principal`: HDFS namenode service principal -- `hadoop.security.authentication`: kerberos -- `hadoop.kerberos.principal`: the Kerberos pincipal that Doris will use when connectiong to HDFS. -- `hadoop.kerberos.keytab`: HDFS client keytab location. - -For the S3 protocol, you can directly execute the S3 protocol configuration: -- `s3.endpoint` -- `s3.access_key` -- `s3.secret_key` -- `s3.region` -- `use_path_style`: (optional) default false . The S3 SDK uses the virtual-hosted style by default. However, some object storage systems may not be enabled or support virtual-hosted style access. At this time, we can add the use_path_style parameter to force the use of path style access method. - -> Note that to use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` to the fe.conf file and restart the FE. Only then will the `delete_existing_files` parameter take effect. Setting `delete_existing_files = true` is a dangerous operation and it is recommended to only use it in a testing environment. +- `dfs.nameservices`: name service name, consistent with hdfs-site.xml +- `dfs.ha.namenodes.[nameservice ID]`: namenode id list, consistent with hdfs-site.xml +- `dfs.namenode.rpc-address.[nameservice ID].[name node ID]`: Name node rpc address, same number as namenode count, consistent with hdfs-site.xml +- `dfs.client.failover.proxy.provider.[nameservice ID]`: Java class for HDFS client to connect to active namenode, usually "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" + +**For Hadoop clusters with kerberos authentication enabled, additional PROPERTIES attributes need to be set:** +- `dfs.namenode.kerberos.principal`: Principal name of HDFS namenode service +- `hadoop.security.authentication`: Set authentication method to kerberos +- `hadoop.kerberos.principal`: Set the Kerberos principal used when Doris connects to HDFS +- `hadoop.kerberos.keytab`: Set keytab local file path + +For S3 protocol, directly configure S3 protocol settings: + - `s3.endpoint` + - `s3.access_key` + - `s3.secret_key` + - `s3.region` + - `use_path_style`: (Optional) Default is `false`. S3 SDK uses Virtual-hosted Style by default. 
But some object storage systems may not have enabled or support Virtual-hosted Style access, in this case you can add the `use_path_style` parameter to force the use of Path Style access. + +> Note: To use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` in `fe.conf` and restart fe, then delete_existing_files will take effect. delete_existing_files = true is a dangerous operation, it is recommended to use only in test environments. ## Return Value -The results returned by the `Outfile` statement are explained as follows: - -| Column | DataType | Note | -|------------------|--------------|----------------------------------------------------------------------------------------------------------------| -| FileNumber | int | The total number of files generated. | -| TotalRows | int | The number of rows in the result set. | -| FileSize | int | The total size of the exported files, in bytes. | -| URL | string | The prefix of the exported file paths. Multiple files are numbered sequentially with suffixes like `_0`, `_1`. | +The result returned by the Outfile statement, the meaning of each column is as follows: -## Access Control Requirements +| Column Name | Type | Description | +|-------------|----------|-------------------------------------------------| +| FileNumber | int | Number of files finally generated | +| TotalRows | int | Number of rows in result set | +| FileSize | int | Total size of exported files. Unit: bytes. | +| URL | string | Prefix of exported file path, multiple files will be numbered with suffixes `_0`,`_1` sequentially. | -The user executing this SQL command must have at least the following privileges: +## Permission Control -| Privilege | Object | Notes | -|:-----------------|:-----------|:------------------------------------------------| -| SELECT_PRIV | Database | Requires read access to the database and table. | +Users executing this SQL command must have at least the following permissions: +| Permission | Object | Description | +|:------------|:-------------|:-------------------------------| +| SELECT_PRIV | Database | Requires read permissions on database and table. | -## Usage Notes +## Notes -### DataType Mapping +### Data Type Mapping -- All file formats support the export of basic data types, while only csv/orc/csv_with_names/csv_with_names_and_types currently support the export of complex data types (ARRAY/MAP/STRUCT). Nested complex data types are not supported. +- All file types support exporting basic data types, while for complex data types (ARRAY/MAP/STRUCT), currently only `csv`, `orc`, `csv_with_names` and `csv_with_names_and_types` support exporting complex types, and nested complex types are not supported. -- Parquet and ORC file formats have their own data types. The export function of Doris can automatically export the Doris data types to the corresponding data types of the Parquet/ORC file format. The following are the data type mapping relationship of the Doris data types and the Parquet/ORC file format data types: +- Parquet and ORC file formats have their own data types, Doris's export function can automatically export Doris data types to corresponding data types in Parquet/ORC file formats. The following are the data type mapping tables between Apache Doris data types and Parquet/ORC file formats: -1. The mapping relationship between the Doris data types to the ORC data types is: +1. 
**Doris to ORC file format data type mapping table:** | Doris Type | Orc Type | |-------------------------|-----------| | boolean | boolean | @@ -142,7 +140,9 @@ The user executing this SQL command must have at least the following privileges: | map | map | | array | array | -2. When Doris exports data to the Parquet file format, the Doris memory data will be converted to Arrow memory data format first, and then the paraquet file format is written by Arrow. The mapping relationship between the Doris data types to the ARROW data types is: +2. **Doris to Parquet file format data type mapping table:** + + When Doris exports to Parquet file format, it first converts Doris memory data to Arrow memory data format, then Arrow writes to Parquet file format. The mapping relationship between Doris data types and Arrow data types is: | Doris Type | Arrow Type | |-------------------------|------------| | boolean | boolean | @@ -163,40 +163,36 @@ The user executing this SQL command must have at least the following privileges: | map | map | | array | list | +### Export Data Volume and Export Efficiency -### Export data volume and export efficiency - - This function essentially executes an SQL query command. The final result is a single-threaded output. Therefore, the time-consuming of the entire export includes the time-consuming of the query itself and the time-consuming of writing the final result set. If the query is large, you need to set the session variable `query_timeout` to appropriately extend the query timeout. - -### Management of export files + This function essentially executes a SQL query command. The final result is output in a single thread. So the total export time includes the query execution time and the final result set write time. If the query is large, you need to set the session variable `query_timeout` to appropriately extend the query timeout. - Doris does not manage exported files. Including the successful export, or the remaining files after the export fails, all need to be handled by the user. +### Exported File Management -### Export to local file - To export to a local file, you need configure `enable_outfile_to_local=true` in fe.conf. + Doris does not manage exported files. Including successfully exported files or residual files after export failure, all need to be handled by users themselves. +### Export to Local Files + To export to local files, you need to first configure `enable_outfile_to_local=true` in `fe.conf` ```sql - select * from tbl1 limit 10 + select * from tbl1 limit 10 INTO OUTFILE "file:///home/work/path/result_"; ``` -The ability to export to a local file is not available for public cloud users, only for private deployments. And the default user has full control over the cluster nodes. Doris will not check the validity of the export path filled in by the user. If the process user of Doris does not have write permission to the path, or the path does not exist, an error will be reported. At the same time, for security reasons, if a file with the same name already exists in this path, the export will also fail. + The function of exporting to local files is not suitable for public cloud users, only for users with private deployment. And it defaults that users have complete control over cluster nodes. Doris does not perform validity checks on the export path filled by users. If the Doris process user does not have write permission to the path, or the path does not exist, an error will be reported. 
-### Management of export files
+### Exported File Management

- Doris does not manage exported files. Including the successful export, or the remaining files after the export fails, all need to be handled by the user.
+ Doris does not manage exported files. Both successfully exported files and residual files left after a failed export need to be handled by the user.

-### Export to local file
- To export to a local file, you need configure `enable_outfile_to_local=true` in fe.conf.
+### Export to Local Files
+ To export to local files, you need to first configure `enable_outfile_to_local=true` in `fe.conf`.

    ```sql
-   select * from tbl1 limit 10
+   select * from tbl1 limit 10
    INTO OUTFILE "file:///home/work/path/result_";
    ```

-The ability to export to a local file is not available for public cloud users, only for private deployments. And the default user has full control over the cluster nodes. Doris will not check the validity of the export path filled in by the user. If the process user of Doris does not have write permission to the path, or the path does not exist, an error will be reported. At the same time, for security reasons, if a file with the same name already exists in this path, the export will also fail.
+ The ability to export to local files is not available to public cloud users; it is only for privately deployed clusters, and it assumes that users have full control over the cluster nodes. Doris does not validate the export path provided by the user. If the Doris process user does not have write permission to the path, or the path does not exist, an error is reported. For security reasons, if a file with the same name already exists at the path, the export also fails.

-Doris does not manage files exported locally, nor does it check disk space, etc. These files need to be managed by the user, such as cleaning and so on.
+ Doris does not manage files exported locally, nor does it check disk space, etc. Such files need to be managed by the user, for example cleaned up when no longer needed.

-### Results Integrity Guarantee
-
-This command is a synchronous command, so it is possible that the task connection is disconnected during the execution process, so that it is impossible to live the exported data whether it ends normally, or whether it is complete. At this point, you can use the `success_file_name` parameter to request that a successful file identifier be generated in the directory after the task is successful. Users can use this file to determine whether the export ends normally.
+### Result Integrity Guarantee
+ This command is a synchronous command, so the connection may be dropped while it is still executing, making it impossible to know whether the export finished normally or whether the exported data is complete. In this case, you can use the `success_file_name` parameter to have the task generate a success marker file in the output directory after it completes successfully. Users can use this file to determine whether the export ended normally.
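+
+ A minimal sketch of this parameter (the output path, HDFS properties, and marker file name below are illustrative):
+
+ ```sql
+ SELECT * FROM tbl
+ INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_"
+ FORMAT AS CSV
+ PROPERTIES
+ (
+     "fs.defaultFS" = "hdfs://${host}:${fileSystem_port}",
+     "hadoop.username" = "work",
+     -- a SUCCESS marker file is created in the output directory only after the export succeeds
+     "success_file_name" = "SUCCESS"
+ );
+ ```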
### Concurrent Export

-Setting the session variable `set enable_parallel_outfile = true;` enables concurrent export using outfile. For detailed usage, see [Export Query Result](../../../../data-operate/export/outfile).
-
+ Set the session variable `set enable_parallel_outfile = true;` to enable concurrent Outfile export.

## Examples

-- Use the broker method to export, and export the simple query results to the file `hdfs://path/to/result.txt`. Specifies that the export format is CSV. Use `my_broker` and set kerberos authentication information. Specify the column separator as `,` and the row separator as `\n`.
+- Export using the Broker method: export simple query results to the file `hdfs://path/to/result.txt`. Specify the export format as CSV. Use `my_broker` and set Kerberos authentication information. Specify the column separator as `,` and the line delimiter as `\n`.

    ```sql
    SELECT * FROM tbl
@@ -214,10 +210,10 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    );
    ```

-  If the final generated file is not larger than 100MB, it will be: `result_0.csv`.
+  The final generated file will be `result_0.csv` if it is not larger than 100MB.
   If larger than 100MB, it may be `result_0.csv, result_1.csv, ...`.

-- Export the simple query results to the file `hdfs://path/to/result.parquet`. Specify the export format as PARQUET. Use `my_broker` and set kerberos authentication information.
+- Export simple query results to the file `hdfs://path/to/result.parquet`. Specify the export format as PARQUET. Use `my_broker` and set Kerberos authentication information.

    ```sql
    SELECT c1, c2, c3 FROM tbl
@@ -232,7 +228,7 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    );
    ```

-- Export the query result of the CTE statement to the file `hdfs://path/to/result.txt`. The default export format is CSV. Use `my_broker` and set hdfs high availability information. Use the default row and column separators.
+- Export the query result of a CTE statement to the file `hdfs://path/to/result.txt`. The default export format is CSV. Use `my_broker` and set HDFS high availability information. Use the default row and column separators.

    ```sql
    WITH
@@ -255,11 +251,11 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    );
    ```

-  If the final generated file is not larger than 1GB, it will be: `result_0.csv`.
+  The final generated file will be `result_0.csv` if it is not larger than 1GB.
   If larger than 1GB, it may be `result_0.csv, result_1.csv, ...`.

-- Export the query result of the UNION statement to the file `bos://bucket/result.txt`. Specify the export format as PARQUET. Use `my_broker` and set hdfs high availability information. The PARQUET format does not require a column delimiter to be specified.
-  After the export is complete, an identity file is generated.
+- Export the query result of a UNION statement to the file `bos://bucket/result.txt`. Specify the export format as PARQUET. Use `my_broker` and set HDFS high availability information. The PARQUET format does not require a column separator to be specified.
+  After the export completes, an identifier file is generated.

    ```sql
    SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
@@ -274,8 +270,8 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    );
    ```

-- Export the query result of the select statement to the file `s3a://${bucket_name}/path/result.txt`. Specify the export format as csv.
-  After the export is complete, an identity file is generated.
+- Export the query result of a SELECT statement to the file `s3a://${bucket_name}/path/result.txt`. Specify the export format as CSV.
+  After the export completes, an identifier file is generated.

    ```sql
    select k1,k2,v1 from tbl1 limit 100000
@@ -294,14 +290,14 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    )
    ```

-  If the final generated file is not larger than 1GB, it will be: `my_file_0.csv`.
+  The final generated file will be `my_file_0.csv` if it is not larger than 1GB.
   If larger than 1GB, it may be `my_file_0.csv, my_file_1.csv, ...`.

-  Verify on cos
+  Verification on COS:

-  1. A path that does not exist will be automatically created
-  2. Access.key/secret.key/endpoint needs to be confirmed with students of cos. Especially the value of endpoint does not need to fill in bucket_name.
+  1. Paths that do not exist are created automatically.
+  2. `access.key`/`secret.key`/`endpoint` need to be confirmed with the COS administrator. In particular, the endpoint value should not include the bucket_name.

-- Use the s3 protocol to export to bos, and enable concurrent export.
+- Export to BOS using the S3 protocol, with concurrent export enabled.

    ```sql
    set enable_parallel_outfile = true;
@@ -317,10 +313,10 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    )
    ```

-  The resulting file is prefixed with `my_file_{fragment_instance_id}_`.
+  The final generated files are prefixed with `my_file_{fragment_instance_id}_`.

-- Use the s3 protocol to export to bos, and enable concurrent export of session variables.
-  Note: However, since the query statement has a top-level sorting node, even if the concurrently exported session variable is enabled for this query, it cannot be exported concurrently.
+- Export to BOS using the S3 protocol, with the concurrent-export session variable enabled.
+  Note: Because the query statement has a top-level sort node, this query cannot be exported concurrently even though the concurrent-export session variable is enabled.

    ```sql
    set enable_parallel_outfile = true;
@@ -336,10 +332,10 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    )
    ```
-- Use hdfs export to export simple query results to the file `hdfs://${host}:${fileSystem_port}/path/to/result.txt`. Specify the export format as CSV and the user name as work. Specify the column separator as `,` and the row separator as `\n`.
+- Export using the HDFS method: export simple query results to the file `hdfs://${host}:${fileSystem_port}/path/to/result.txt`. Specify the export format as CSV and the username as work. Specify the column separator as `,` and the line delimiter as `\n`.

    ```sql
-   -- fileSystem_port 默认值为 9000
+   -- the default value of fileSystem_port is 9000
    SELECT * FROM tbl
    INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_"
    FORMAT AS CSV
@@ -350,7 +346,7 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    );
    ```

-  If the Hadoop cluster is highly available and Kerberos authentication is enabled, you can refer to the following SQL statement:
+  If the Hadoop cluster has high availability enabled and uses Kerberos authentication, you can refer to the following SQL statement:

    ```sql
    SELECT * FROM tbl
@@ -371,11 +367,11 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
    );
    ```

-  If the final generated file is not larger than 100MB, it will be: `result_0.csv`.
-  If larger than 100MB, it may be `result_0.csv, result_1.csv, ...`.
+  The final generated file will be `result_0.csv` if it is not larger than 100 MB.
+  If larger than 100 MB, it may be `result_0.csv, result_1.csv, ...`.

-- Export the query result of the select statement to the file `cosn://${bucket_name}/path/result.txt` on Tencent Cloud Object Storage (COS). Specify the export format as csv.
-  After the export is complete, an identity file is generated.
+- Export the query result of a SELECT statement to the file `cosn://${bucket_name}/path/result.txt` on Tencent Cloud Object Storage (COS). Specify the export format as CSV.
+  After the export completes, an identifier file is generated.

    ```sql
    select k1,k2,v1 from tbl1 limit 100000
@@ -392,6 +388,4 @@ Setting the session variable `set enable_parallel_outfile = true;` enables concu
        "max_file_size" = "1024MB",
        "success_file_name" = "SUCCESS"
    )
-   ```
-
-
+   ```
\ No newline at end of file