From 13bbea3dfe7eb0d921f1107760c7d03b4bf5dd77 Mon Sep 17 00:00:00 2001 From: daidai <2017501503@qq.com> Date: Fri, 11 Apr 2025 23:21:30 +0800 Subject: [PATCH 1/3] [feat](hive) add hive_parquet_use_column_names description --- .../current/lakehouse/catalogs/hive-catalog.md | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md index b5be52478b88d..0697e6affc7b0 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md @@ -357,10 +357,18 @@ AS SELECT col1,pt1 as col2,pt2 as pt1 FROM test_ctas.part_ctas_src WHERE col1>0; ### 相关参数 +* fe + | 参数名称 | 描述 | 默认值 | + | ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- | + | `hive_parquet_use_column_names` | Doris 在读取 Hive 表 Parquet 数据类型时,默认会根据 Hive 表的列名从 Parquet 文件中找同名的列来读取数据。当该变量为 `false` 时,Doris 会根据 Hive 表中的列顺序从 Parquet 文件中读取数据,与列名无关。类似于 Hive 中的 `parquet.column.index.access` 变量。该参数只适用于顶层列名,对 Struct 内部无效。 | `true` | + | `hive_orc_use_column_names` | 与 `hive_parquet_use_column_names` 类似,针对的是 Hive 表 Parquet 数据类型。类似于 Hive 中的 `orc.force.positional.evolution` 变量。 | `true` | + + + * BE - | 参数名称 | 默认值 | 描述 | - | ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- | + | 参数名称 | 描述 | 默认值 | + | ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----- | | `hive_sink_max_file_size` | 最大的数据文件大小。当写入数据量超过该大小后会关闭当前文件,滚动产生一个新文件继续写入。 | 1GB | | `table_sink_partition_write_max_partition_nums_per_writer` | BE 节点上每个 Instance 最大写入的分区数目。 | 128 | | `table_sink_non_partition_write_scaling_data_processed_threshold` | 非分区表开始 scaling-write 的数据量阈值。每增加 `table_sink_non_partition_write_scaling_data_processed_threshold` 数据就会发送给一个新的 writer(instance) 进行写入。scaling-write 机制主要是为了根据数据量来使用不同数目的 writer(instance) 来进行写入,会随着数据量的增加而增大写入的 writer(instance) 数目,从而提高并发写入的吞吐。当数据量比较少的时候也会节省资源,并且尽可能地减少产生的文件数目。 | 25MB | From fed67998bede49667826dab3c04ac230f4aa5cf7 Mon Sep 17 00:00:00 2001 From: daidai <2017501503@qq.com> Date: Mon, 14 Apr 2025 10:34:31 +0800 Subject: [PATCH 2/3] fix parquet -> orc --- .../current/lakehouse/catalogs/hive-catalog.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md index 0697e6affc7b0..7e386ed87110a 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md +++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md @@ -358,10 +358,11 @@ AS SELECT col1,pt1 as col2,pt2 as pt1 FROM test_ctas.part_ctas_src WHERE col1>0; ### 相关参数 * fe + | 参数名称 | 描述 | 默认值 | | ------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- | | `hive_parquet_use_column_names` | Doris 在读取 Hive 表 Parquet 数据类型时,默认会根据 Hive 表的列名从 Parquet 文件中找同名的列来读取数据。当该变量为 `false` 时,Doris 会根据 Hive 表中的列顺序从 Parquet 文件中读取数据,与列名无关。类似于 Hive 中的 `parquet.column.index.access` 变量。该参数只适用于顶层列名,对 Struct 内部无效。 | `true` | - | `hive_orc_use_column_names` | 与 `hive_parquet_use_column_names` 类似,针对的是 Hive 表 Parquet 数据类型。类似于 Hive 中的 `orc.force.positional.evolution` 变量。 | `true` | + | `hive_orc_use_column_names` | 与 `hive_parquet_use_column_names` 类似,针对的是 Hive 表 ORC 数据类型。类似于 Hive 中的 `orc.force.positional.evolution` 变量。 | `true` | From b6214a7221ddbef7593ea1b2c70cccb0ddfcd7b1 Mon Sep 17 00:00:00 2001 From: morningman Date: Mon, 14 Apr 2025 16:26:37 -0700 Subject: [PATCH 3/3] 2 --- docs/faq/lakehouse-faq.md | 26 ++++++++++++------- docs/lakehouse/catalogs/hive-catalog.md | 16 +++++++++++- .../current/faq/lakehouse-faq.md | 20 +++++++++----- .../lakehouse/catalogs/hive-catalog.md | 23 +++++++++------- .../version-2.1/faq/lakehouse-faq.md | 20 +++++++++----- .../version-3.0/faq/lakehouse-faq.md | 20 +++++++++----- .../version-2.1/faq/lakehouse-faq.md | 26 ++++++++++++------- .../version-3.0/faq/lakehouse-faq.md | 26 ++++++++++++------- 8 files changed, 116 insertions(+), 61 deletions(-) diff --git a/docs/faq/lakehouse-faq.md b/docs/faq/lakehouse-faq.md index 951c1b9daf1b0..96a59eb266e8b 100644 --- a/docs/faq/lakehouse-faq.md +++ b/docs/faq/lakehouse-faq.md @@ -126,17 +126,23 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ## Hive Catalog -1. Error accessing Iceberg table via Hive Metastore: `failed to get schema` or `Storage schema reading not supported` +1. Accessing Iceberg or Hive table through Hive Catalog reports an error: `failed to get schema` or `Storage schema reading not supported` - Place the relevant `iceberg` runtime jar files in Hive's lib/ directory. - - Configure in `hive-site.xml`: - - ``` - metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - ``` - - After configuration, restart the Hive Metastore. + You can try the following methods: + + * Put the `iceberg` runtime-related jar package in the lib/ directory of Hive. + + * Configure in `hive-site.xml`: + + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` + + After the configuration is completed, you need to restart the Hive Metastore. + + * Add `"get_schema_from_table" = "true"` in the Catalog properties + + This parameter is supported since versions 2.1.10 and 3.0.6. 2. 
Error connecting to Hive Catalog: `Caused by: java.lang.NullPointerException`

diff --git a/docs/lakehouse/catalogs/hive-catalog.md b/docs/lakehouse/catalogs/hive-catalog.md
index 78a000b92806f..aaa6ae10cf706 100644
--- a/docs/lakehouse/catalogs/hive-catalog.md
+++ b/docs/lakehouse/catalogs/hive-catalog.md
@@ -48,7 +48,8 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
     'fs.defaultFS' = '',  -- optional
     {MetaStoreProperties},
     {StorageProperties},
-    {CommonProperties}
+    {CommonProperties},
+    {OtherProperties}
 );
 ```
 
@@ -78,6 +79,12 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
 
   The CommonProperties section is for entering common attributes. Please see the "Common Properties" section in the [Catalog Overview](../catalog-overview.md).
 
+* `{OtherProperties}`
+
+  The OtherProperties section is for entering other properties related to the Hive Catalog.
+
+  * `get_schema_from_table`: The default value is `false`. By default, Doris obtains the table schema information from the Hive Metastore. However, in some cases compatibility issues may occur, such as the error `Storage schema reading not supported`. In that case, you can set this parameter to `true`, and the table schema will be obtained directly from the Table object. Note that this method causes the default value information of the columns to be ignored. This property is supported since versions 2.1.10 and 3.0.6.
+
 ### Supported Hive Versions
 
 Supports Hive 1.x, 2.x, 3.x, and 4.x.
@@ -348,6 +355,13 @@ AS SELECT col1, pt1 AS col2, pt2 AS pt1 FROM test_ctas.part_ctas_src WHERE col1
 
 ### Related Parameters
 
+* Session variables
+
+| Parameter name | Default value | Description | Since version |
+| ----------| ---- | ---- | --- |
+| `hive_parquet_use_column_names` | `true` | When Doris reads Parquet data of a Hive table, by default it locates the column with the same name in the Parquet file according to the column names of the Hive table. When this variable is `false`, Doris reads data from the Parquet file according to the column order of the Hive table, regardless of column names. Similar to the `parquet.column.index.access` variable in Hive. This parameter only applies to top-level column names and does not take effect inside Struct columns. | 2.1.6+, 3.0.3+ |
+| `hive_orc_use_column_names` | `true` | Similar to `hive_parquet_use_column_names`, but for the ORC data type of Hive tables. Similar to the `orc.force.positional.evolution` variable in Hive. | 2.1.6+, 3.0.3+ |
+
 * BE
 
 | Parameter Name | Default Value | Description |
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
index 67559e4c92a6b..f84d0cffc3714 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/faq/lakehouse-faq.md
@@ -128,17 +128,23 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca-
 
 ## Hive Catalog
 
-1. 通过 Hive Metastore 访问 Iceberg 表报错:`failed to get schema` 或 `Storage schema reading not supported`
+1. 通过 Hive Catalog 访问 Iceberg 或 Hive 表报错:`failed to get schema` 或 `Storage schema reading not supported`
 
-   在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。
+   可以尝试以下方法:
 
-   在 `hive-site.xml` 配置:
+   * 在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。
 
-   ```
-   metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
-   ```
+   * 在 `hive-site.xml` 配置:
+
+     ```
+     metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
+     ```
+
+     配置完成后需要重启 Hive Metastore。
+
+   * 在 Catalog 属性中添加 `"get_schema_from_table" = "true"`
 
-   配置完成后需要重启 Hive Metastore。
+     该参数自 2.1.10 和 3.0.6 版本支持。
 
 2. 连接 Hive Catalog 报错:`Caused by: java.lang.NullPointerException`
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md
index 7e386ed87110a..723bbdea8c946 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/catalogs/hive-catalog.md
@@ -48,7 +48,8 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
     'fs.defaultFS' = '',  -- optional
     {MetaStoreProperties},
     {StorageProperties},
-    {CommonProperties}
+    {CommonProperties},
+    {OtherProperties}
 );
 ```
 
@@ -80,6 +81,12 @@ CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
 
   CommonProperties 部分用于填写通用属性。请参阅[ 数据目录概述 ](../catalog-overview.md)中【通用属性】部分。
 
+* `{OtherProperties}`
+
+  OtherProperties 部分用于填写和 Hive Catalog 相关的其他参数。
+
+  * `get_schema_from_table`:默认为 `false`。默认情况下,Doris 会从 Hive Metastore 中获取表的 Schema 信息。但某些情况下可能出现兼容问题,如错误 `Storage schema reading not supported`。此时可以将这个参数设置为 `true`,则会从 Table 对象中直接获取表 Schema。但注意,该方式会导致列的默认值信息被忽略。该参数自 2.1.10 和 3.0.6 版本支持。
+
 ### 支持的 Hive 版本
 
 支持 Hive 1.x,2.x,3.x,4.x。
@@ -357,16 +364,14 @@ AS SELECT col1,pt1 as col2,pt2 as pt1 FROM test_ctas.part_ctas_src WHERE col1>0;
 
 ### 相关参数
 
-* fe
-
-  | 参数名称 | 描述 | 默认值 |
-  | ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---- |
-  | `hive_parquet_use_column_names` | Doris 在读取 Hive 表 Parquet 数据类型时,默认会根据 Hive 表的列名从 Parquet 文件中找同名的列来读取数据。当该变量为 `false` 时,Doris 会根据 Hive 表中的列顺序从 Parquet 文件中读取数据,与列名无关。类似于 Hive 中的 `parquet.column.index.access` 变量。该参数只适用于顶层列名,对 Struct 内部无效。 | `true` |
-  | `hive_orc_use_column_names` | 与 `hive_parquet_use_column_names` 类似,针对的是 Hive 表 ORC 数据类型。类似于 Hive 中的 `orc.force.positional.evolution` 变量。 | `true` |
-
+* Session 变量
+  | 参数名称 | 默认值 | 描述 | 版本 |
+  | ----------| ---- | ---- | --- |
+  | `hive_parquet_use_column_names` | `true` | Doris 在读取 Hive 表 Parquet 数据类型时,默认会根据 Hive 表的列名从 Parquet 文件中找同名的列来读取数据。当该变量为 `false` 时,Doris 会根据 Hive 表中的列顺序从 Parquet 文件中读取数据,与列名无关。类似于 Hive 中的 `parquet.column.index.access` 变量。该参数只适用于顶层列名,对 Struct 内部无效。 | 2.1.6+, 3.0.3+ |
+  | `hive_orc_use_column_names` | `true` | 与 `hive_parquet_use_column_names` 类似,针对的是 Hive 表 ORC 数据类型。类似于 Hive 中的 `orc.force.positional.evolution` 变量。 | 2.1.6+, 3.0.3+ |
 
-* BE
+* BE 配置
 
 | 参数名称 | 描述 | 默认值 |
 | ----------------------------------------------------------------------------- | 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----- | diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/faq/lakehouse-faq.md index 67559e4c92a6b..f84d0cffc3714 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/faq/lakehouse-faq.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/faq/lakehouse-faq.md @@ -128,17 +128,23 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ## Hive Catalog -1. 通过 Hive Metastore 访问 Iceberg 表报错:`failed to get schema` 或 `Storage schema reading not supported` +1. 通过 Hive Catalog 访问 Iceberg 或 Hive 表报错:`failed to get schema` 或 `Storage schema reading not supported` - 在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。 + 可以尝试以下方法: - 在 `hive-site.xml` 配置: + * 在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。 - ``` - metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - ``` + * 在 `hive-site.xml` 配置: + + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` + + 配置完成后需要重启 Hive Metastore。 + + * 在 Catalog 属性中添加 `"get_schema_from_table" = "true"` - 配置完成后需要重启 Hive Metastore。 + 该参数自 2.1.10 和 3.0.6 版本支持。 2. 连接 Hive Catalog 报错:`Caused by: java.lang.NullPointerException` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/faq/lakehouse-faq.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/faq/lakehouse-faq.md index 67559e4c92a6b..f84d0cffc3714 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/faq/lakehouse-faq.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/faq/lakehouse-faq.md @@ -128,17 +128,23 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ## Hive Catalog -1. 通过 Hive Metastore 访问 Iceberg 表报错:`failed to get schema` 或 `Storage schema reading not supported` +1. 通过 Hive Catalog 访问 Iceberg 或 Hive 表报错:`failed to get schema` 或 `Storage schema reading not supported` - 在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。 + 可以尝试以下方法: - 在 `hive-site.xml` 配置: + * 在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。 - ``` - metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - ``` + * 在 `hive-site.xml` 配置: + + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` + + 配置完成后需要重启 Hive Metastore。 + + * 在 Catalog 属性中添加 `"get_schema_from_table" = "true"` - 配置完成后需要重启 Hive Metastore。 + 该参数自 2.1.10 和 3.0.6 版本支持。 2. 连接 Hive Catalog 报错:`Caused by: java.lang.NullPointerException` diff --git a/versioned_docs/version-2.1/faq/lakehouse-faq.md b/versioned_docs/version-2.1/faq/lakehouse-faq.md index 951c1b9daf1b0..96a59eb266e8b 100644 --- a/versioned_docs/version-2.1/faq/lakehouse-faq.md +++ b/versioned_docs/version-2.1/faq/lakehouse-faq.md @@ -126,17 +126,23 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ## Hive Catalog -1. Error accessing Iceberg table via Hive Metastore: `failed to get schema` or `Storage schema reading not supported` +1. Accessing Iceberg or Hive table through Hive Catalog reports an error: `failed to get schema` or `Storage schema reading not supported` - Place the relevant `iceberg` runtime jar files in Hive's lib/ directory. 
- - Configure in `hive-site.xml`: - - ``` - metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - ``` - - After configuration, restart the Hive Metastore. + You can try the following methods: + + * Put the `iceberg` runtime-related jar package in the lib/ directory of Hive. + + * Configure in `hive-site.xml`: + + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` + + After the configuration is completed, you need to restart the Hive Metastore. + + * Add `"get_schema_from_table" = "true"` in the Catalog properties + + This parameter is supported since versions 2.1.10 and 3.0.6. 2. Error connecting to Hive Catalog: `Caused by: java.lang.NullPointerException` diff --git a/versioned_docs/version-3.0/faq/lakehouse-faq.md b/versioned_docs/version-3.0/faq/lakehouse-faq.md index 951c1b9daf1b0..96a59eb266e8b 100644 --- a/versioned_docs/version-3.0/faq/lakehouse-faq.md +++ b/versioned_docs/version-3.0/faq/lakehouse-faq.md @@ -126,17 +126,23 @@ ln -s /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt /etc/ssl/certs/ca- ## Hive Catalog -1. Error accessing Iceberg table via Hive Metastore: `failed to get schema` or `Storage schema reading not supported` +1. Accessing Iceberg or Hive table through Hive Catalog reports an error: `failed to get schema` or `Storage schema reading not supported` - Place the relevant `iceberg` runtime jar files in Hive's lib/ directory. - - Configure in `hive-site.xml`: - - ``` - metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - ``` - - After configuration, restart the Hive Metastore. + You can try the following methods: + + * Put the `iceberg` runtime-related jar package in the lib/ directory of Hive. + + * Configure in `hive-site.xml`: + + ``` + metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader + ``` + + After the configuration is completed, you need to restart the Hive Metastore. + + * Add `"get_schema_from_table" = "true"` in the Catalog properties + + This parameter is supported since versions 2.1.10 and 3.0.6. 2. Error connecting to Hive Catalog: `Caused by: java.lang.NullPointerException`
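Taken together, the patches above document one catalog property (`get_schema_from_table`) and two session variables (`hive_parquet_use_column_names`, `hive_orc_use_column_names`). Below is a minimal usage sketch based on that documentation; the catalog name, Metastore URI, and table name are placeholders, and the `'type' = 'hms'` line follows the standard Hive Catalog creation examples rather than anything added in this patch.

```sql
-- Hypothetical catalog: read table schemas directly from the Table object to
-- work around "Storage schema reading not supported" (available since 2.1.10 / 3.0.6).
CREATE CATALOG IF NOT EXISTS hive_demo PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://127.0.0.1:9083',
    'get_schema_from_table' = 'true'
);

-- Match Parquet/ORC columns by position instead of by name for this session,
-- mirroring parquet.column.index.access / orc.force.positional.evolution in Hive.
SET hive_parquet_use_column_names = false;
SET hive_orc_use_column_names = false;

SELECT * FROM hive_demo.db1.tbl1;
```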