1 change: 1 addition & 0 deletions docs/data-operate/import/import-way/routine-load-manual.md
@@ -407,6 +407,7 @@ Here are the available parameters for the job_properties clause:
| send_batch_parallelism | Used to set the parallelism of sending batch data. If the parallelism value exceeds the `max_send_batch_parallelism_per_job` in BE configuration, the coordinating BE will use the value of `max_send_batch_parallelism_per_job`. |
| load_to_single_tablet | Supports importing data to only one tablet in the corresponding partition per task. Default value is false. This parameter can only be set when importing data to OLAP tables with random bucketing. |
| partial_columns | Specifies whether to enable partial column update feature. Default value is false. This parameter can only be set when the table model is Unique and uses Merge on Write. Multi-table streaming does not support this parameter. For details, refer to [Partial Column Update](../../../data-operate/update/update-of-unique-model) |
| partial_update_new_key_behavior | When performing partial column updates on a Unique Merge-on-Write table, this parameter controls how newly inserted rows are handled. There are two types: `APPEND` and `ERROR`.<br/>- `APPEND`: Allows inserting new rows<br/>- `ERROR`: Fails and reports an error when inserting new rows |
| max_filter_ratio | The maximum allowed filter ratio within the sampling window. Must be between 0 and 1 inclusive. Default value is 1.0, indicating any error rows can be tolerated. The sampling window is `max_batch_rows * 10`. If the ratio of error rows to total rows within the sampling window exceeds `max_filter_ratio`, the routine job will be suspended and require manual intervention to check data quality issues. Rows filtered by WHERE conditions are not counted as error rows. |
| enclose | Specifies the enclosing character. When CSV data fields contain line or column separators, a single-byte character can be specified as an enclosing character for protection to prevent accidental truncation. For example, if the column separator is "," and the enclosing character is "'", the data "a,'b,c'" will have "b,c" parsed as one field. |
| escape | Specifies the escape character. Used to escape characters in fields that are identical to the enclosing character. For example, if the data is "a,'b,'c'", the enclosing character is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as "\", and modify the data to "a,'b,\'c'". |
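As an illustrative sketch (not from the original docs), the following shows how `partial_columns` and `partial_update_new_key_behavior` might be combined in a routine load job; the job name, table, column list, and Kafka endpoints are hypothetical:

```sql
-- Hypothetical job: partial column update that rejects rows whose keys are new
CREATE ROUTINE LOAD example_db.update_status_job ON order_tbl
COLUMNS(order_id, order_status)
PROPERTIES
(
    "partial_columns" = "true",                    -- enable partial column update
    "partial_update_new_key_behavior" = "ERROR",   -- fail if a row's key does not exist yet
    "max_filter_ratio" = "0.1"                     -- suspend the job if >10% of sampled rows error out
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic" = "order_updates"
);
```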
2 changes: 0 additions & 2 deletions docs/data-operate/update/partial-column-update.md
@@ -78,8 +78,6 @@ SET enable_unique_key_partial_update=true;
INSERT INTO order_tbl (order_id, order_status) VALUES (1, 'Pending Shipment');
```

Note that the session variable `enable_insert_strict` defaults to true, enabling strict mode by default. In strict mode, partial column updates do not allow updating non-existent keys. To insert non-existent keys using the insert statement for partial column updates, set `enable_unique_key_partial_update` to true and `enable_insert_strict` to false.

#### Flink Connector

If using Flink Connector, add the following configuration:
@@ -418,6 +418,7 @@ Here are the available parameters for the job_properties clause:
| send_batch_parallelism | Used to set the parallelism of sending batch data. If the parallelism value exceeds the `max_send_batch_parallelism_per_job` in the BE configuration, the coordinating BE will use the value of `max_send_batch_parallelism_per_job`. |
| load_to_single_tablet | Supports importing data to only one tablet in the corresponding partition per task. Default value is false. This parameter can only be set when importing data to OLAP tables with random bucketing. |
| partial_columns | Specifies whether to enable the partial column update feature. Default value is false. This parameter can only be set when the table model is Unique and uses Merge on Write. Multi-table streaming does not support this parameter. For details, refer to [Partial Column Update](../../../data-operate/update/partial-column-update.md) |
| partial_update_new_key_behavior | When performing partial column updates on a Unique Merge-on-Write table, controls how newly inserted rows are handled. There are two types: `APPEND` and `ERROR`.<br/>- `APPEND`: Allows inserting new rows<br/>- `ERROR`: The load fails and reports an error when inserting new rows |
| max_filter_ratio | The maximum allowed filter ratio within the sampling window. Must be between 0 and 1 inclusive. Default value is 1.0, indicating any error rows can be tolerated. The sampling window is `max_batch_rows * 10`. If the ratio of error rows to total rows within the sampling window exceeds `max_filter_ratio`, the routine job will be suspended and require manual intervention to check data quality issues. Rows filtered out by WHERE conditions are not counted as error rows. |
| enclose | Specifies the enclosing character. When CSV data fields contain line or column separators, a single-byte character can be specified as an enclosing character for protection to prevent accidental truncation. For example, if the column separator is "," and the enclosing character is "'", the data "a,'b,c'" will have "b,c" parsed as one field. |
| escape | Specifies the escape character. Used to escape characters in fields that are identical to the enclosing character. For example, if the data is "a,'b,'c'", the enclosing character is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as "\", and modify the data to "a,'b,\'c'". |
@@ -80,8 +80,6 @@ SET enable_unique_key_partial_update=true;
INSERT INTO order_tbl (order_id, order_status) VALUES (1, 'Pending Shipment');
```

Note that the session variable `enable_insert_strict`, which controls whether insert statements run in strict mode, defaults to true, enabling strict mode by default. In strict mode, partial column updates do not allow updating non-existent keys. Therefore, to insert non-existent keys using the insert statement for partial column updates, set `enable_unique_key_partial_update` to true and also set `enable_insert_strict` to false.

#### Flink Connector

If using Flink Connector, add the following configuration:
@@ -17,7 +17,7 @@

Strict mode serves two primary purposes:
1. Filtering out data rows where column type conversion fails during load.
2. Restricting partial column updates to rows whose keys already exist.
2. Restricting partial column updates to rows whose keys already exist (in 3.0.x and earlier; since 3.1.0, this behavior is controlled by the load property/session variable `partial_update_new_key_behavior`).

### Filtering Strategy for Column Type Conversion Failures

@@ -64,6 +64,10 @@

### Restricting Partial Column Updates to Existing Keys Only

:::tip
This restriction applies in 3.0.x and earlier. Since 3.1.0, this behavior is controlled by the load property/session variable `partial_update_new_key_behavior`.
:::

In strict mode, each row inserted by a partial column update must have its Key already exist in the table. In non-strict mode, a partial column update can both update rows whose Key already exists and insert new rows whose Key does not exist.

For example, given a table structure as follows:
@@ -412,6 +412,7 @@ Here are the available parameters for the job_properties clause:
| send_batch_parallelism | Used to set the parallelism of sending batch data. If the parallelism value exceeds the `max_send_batch_parallelism_per_job` in the BE configuration, the coordinating BE will use the value of `max_send_batch_parallelism_per_job`. |
| load_to_single_tablet | Supports importing data to only one tablet in the corresponding partition per task. Default value is false. This parameter can only be set when importing data to OLAP tables with random bucketing. |
| partial_columns | Specifies whether to enable the partial column update feature. Default value is false. This parameter can only be set when the table model is Unique and uses Merge on Write. Multi-table streaming does not support this parameter. For details, refer to [Partial Column Update](../../../data-operate/update/update-of-unique-model) |
| partial_update_new_key_behavior<br/>(since 3.1.0) | When performing partial column updates on a Unique Merge-on-Write table, controls how newly inserted rows are handled. There are two types: `APPEND` and `ERROR`.<br/>- `APPEND`: Allows inserting new rows<br/>- `ERROR`: The load fails and reports an error when inserting new rows |
| max_filter_ratio | The maximum allowed filter ratio within the sampling window. Must be between 0 and 1 inclusive. Default value is 1.0, indicating any error rows can be tolerated. The sampling window is `max_batch_rows * 10`. If the ratio of error rows to total rows within the sampling window exceeds `max_filter_ratio`, the routine job will be suspended and require manual intervention to check data quality issues. Rows filtered out by WHERE conditions are not counted as error rows. |
| enclose | Specifies the enclosing character. When CSV data fields contain line or column separators, a single-byte character can be specified as an enclosing character for protection to prevent accidental truncation. For example, if the column separator is "," and the enclosing character is "'", the data "a,'b,c'" will have "b,c" parsed as one field. |
| escape | Specifies the escape character. Used to escape characters in fields that are identical to the enclosing character. For example, if the data is "a,'b,'c'", the enclosing character is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as "\", and modify the data to "a,'b,\'c'". |
@@ -308,6 +308,7 @@ Stream Load supports HTTP chunked import (HTTP chunked) and HTTP non-chunked
| escape | Specifies the escape character. Used to escape characters in fields that are identical to the enclosing character. For example, if the data is "a,'b,'c'", the enclosing character is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as "\", and modify the data to "a,'b,\'c'". |
| memtable_on_sink_node | Whether to enable MemTable on the DataSink node when loading data. Default is false. |
| unique_key_update_mode | The update mode on Unique tables, currently only effective for Merge-On-Write Unique tables. Three types are supported: `UPSERT`, `UPDATE_FIXED_COLUMNS`, `UPDATE_FLEXIBLE_COLUMNS`. `UPSERT`: load data with upsert semantics; `UPDATE_FIXED_COLUMNS`: load data as a [partial column update](../../../data-operate/update/update-of-unique-model); `UPDATE_FLEXIBLE_COLUMNS`: load data as a [flexible partial column update](../../../data-operate/update/update-of-unique-model) |
| partial_update_new_key_behavior<br/>(since 3.1.0) | When performing partial column updates or flexible column updates on Unique tables, controls how newly inserted rows are handled. There are two types: `APPEND` and `ERROR`.<br/>- `APPEND`: Allows inserting new rows<br/>- `ERROR`: The load fails and reports an error when inserting new rows |

### Load return value

@@ -80,7 +80,9 @@ SET enable_unique_key_partial_update=true;
INSERT INTO order_tbl (order_id, order_status) VALUES (1, 'Pending Shipment');
```

Note that the session variable `enable_insert_strict`, which controls whether insert statements run in strict mode, defaults to true, enabling strict mode by default. In strict mode, partial column updates do not allow updating non-existent keys. Therefore, to insert non-existent keys using the insert statement for partial column updates, set `enable_unique_key_partial_update` to true and also set `enable_insert_strict` to false.
:::caution Note
Note that the session variable `enable_insert_strict`, which controls whether insert statements run in strict mode, defaults to true, enabling strict mode by default. In 3.0.x and earlier versions, partial column updates in strict mode do not allow updating non-existent keys. Therefore, to insert non-existent keys using the insert statement for partial column updates, set `enable_unique_key_partial_update` to true and also set `enable_insert_strict` to false.
:::
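A minimal sketch of the 3.0.x combination described above, reusing the `order_tbl` example (the key value 100 is hypothetical):

```sql
SET enable_unique_key_partial_update=true;
-- Disable strict mode so keys that do not yet exist may be inserted (3.0.x and earlier)
SET enable_insert_strict=false;
-- order_id 100 is assumed absent; with strict mode off it is appended as a new row
INSERT INTO order_tbl (order_id, order_status) VALUES (100, 'Pending Shipment');
```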

#### Flink Connector

@@ -267,7 +269,8 @@ MySQL root@127.1:d1> select * from t1;

### Handling of Newly Inserted Rows in Partial Column Update / Flexible Column Update

The session variable or load property `partial_update_new_key_behavior` is used to control the behavior of new rows inserted during partial column updates and flexible column updates.
In 3.0.x, whether strict mode is enabled for a load controls the behavior of new rows inserted during partial column updates; see the [strict mode](../import/handling-messy-data.md#限定部分列更新只能更新已有的列) documentation for details.
Since 3.1.0, the session variable or load property `partial_update_new_key_behavior` is used to control the behavior of new rows inserted during partial column updates and flexible column updates.

When `partial_update_new_key_behavior=ERROR`, each inserted row must have its Key already exist in the table. When `partial_update_new_key_behavior=APPEND`, a partial column update or flexible column update can both update rows whose Key already exists and insert new rows whose Key does not exist.
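As a hedged sketch of the 3.1.0+ behavior described above (reusing the `order_tbl` example; the exact SET syntax for this variable is assumed here):

```sql
SET enable_unique_key_partial_update=true;
SET partial_update_new_key_behavior='APPEND';  -- or 'ERROR' to reject rows with new keys
-- With APPEND, a key not yet present in the table is inserted as a new row
INSERT INTO order_tbl (order_id, order_status) VALUES (2, 'Pending Shipment');
```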

@@ -418,6 +418,7 @@ Here are the available parameters for the job_properties clause:
| send_batch_parallelism | Used to set the parallelism of sending batch data. If the parallelism value exceeds the `max_send_batch_parallelism_per_job` in the BE configuration, the coordinating BE will use the value of `max_send_batch_parallelism_per_job`. |
| load_to_single_tablet | Supports importing data to only one tablet in the corresponding partition per task. Default value is false. This parameter can only be set when importing data to OLAP tables with random bucketing. |
| partial_columns | Specifies whether to enable the partial column update feature. Default value is false. This parameter can only be set when the table model is Unique and uses Merge on Write. Multi-table streaming does not support this parameter. For details, refer to [Partial Column Update](../../../data-operate/update/update-of-unique-model) |
| partial_update_new_key_behavior | When performing partial column updates on a Unique Merge-on-Write table, controls how newly inserted rows are handled. There are two types: `APPEND` and `ERROR`.<br/>- `APPEND`: Allows inserting new rows<br/>- `ERROR`: The load fails and reports an error when inserting new rows |
| max_filter_ratio | The maximum allowed filter ratio within the sampling window. Must be between 0 and 1 inclusive. Default value is 1.0, indicating any error rows can be tolerated. The sampling window is `max_batch_rows * 10`. If the ratio of error rows to total rows within the sampling window exceeds `max_filter_ratio`, the routine job will be suspended and require manual intervention to check data quality issues. Rows filtered out by WHERE conditions are not counted as error rows. |
| enclose | Specifies the enclosing character. When CSV data fields contain line or column separators, a single-byte character can be specified as an enclosing character for protection to prevent accidental truncation. For example, if the column separator is "," and the enclosing character is "'", the data "a,'b,c'" will have "b,c" parsed as one field. |
| escape | Specifies the escape character. Used to escape characters in fields that are identical to the enclosing character. For example, if the data is "a,'b,'c'", the enclosing character is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as "\", and modify the data to "a,'b,\'c'". |
@@ -80,8 +80,6 @@ SET enable_unique_key_partial_update=true;
INSERT INTO order_tbl (order_id, order_status) VALUES (1, 'Pending Shipment');
```

Note that the session variable `enable_insert_strict`, which controls whether insert statements run in strict mode, defaults to true, enabling strict mode by default. In strict mode, partial column updates do not allow updating non-existent keys. Therefore, to insert non-existent keys using the insert statement for partial column updates, set `enable_unique_key_partial_update` to true and also set `enable_insert_strict` to false.

#### Flink Connector

If using Flink Connector, add the following configuration:
@@ -19,7 +19,7 @@ This makes it easier to handle data loading problems and keeps data management s

Strict mode serves two primary purposes:
1. Filtering out data rows where column type conversion fails during load
2. Restricting updates to rows with existing keys only in partial column update scenarios
2. Restricting updates to rows with existing keys only in partial column update scenarios (in 3.0.x and earlier; since 3.1.0, this behavior is controlled by the load property/session variable `partial_update_new_key_behavior`)

### Filtering Strategy for Column Type Conversion Failures

@@ -65,6 +65,10 @@ The system employs different strategies based on the strict mode setting:

### Restricting Partial Column Updates to Existing Keys Only

:::tip
This restriction applies in 3.0.x and earlier. Since 3.1.0, this behavior is controlled by the load property/session variable `partial_update_new_key_behavior`.
:::

In strict mode, each row in a partial column update must have its Key already exist in the table. In non-strict mode, partial column updates can both update existing rows (where Key exists) and insert new rows (where Key doesn't exist).

For example, given a table structure as follows:
@@ -399,6 +399,7 @@ Here are the available parameters for the job_properties clause:
| send_batch_parallelism | Used to set the parallelism of sending batch data. If the parallelism value exceeds the `max_send_batch_parallelism_per_job` in BE configuration, the coordinating BE will use the value of `max_send_batch_parallelism_per_job`. |
| load_to_single_tablet | Supports importing data to only one tablet in the corresponding partition per task. Default value is false. This parameter can only be set when importing data to OLAP tables with random bucketing. |
| partial_columns | Specifies whether to enable partial column update feature. Default value is false. This parameter can only be set when the table model is Unique and uses Merge on Write. Multi-table streaming does not support this parameter. For details, refer to [Partial Column Update](../../../data-operate/update/update-of-unique-model) |
| partial_update_new_key_behavior<br/>(since 3.1.0) | When performing partial column updates on a Unique Merge-on-Write table, this parameter controls how newly inserted rows are handled. There are two types: `APPEND` and `ERROR`.<br/>- `APPEND`: Allows inserting new rows<br/>- `ERROR`: Fails and reports an error when inserting new rows |
| max_filter_ratio | The maximum allowed filter ratio within the sampling window. Must be between 0 and 1 inclusive. Default value is 1.0, indicating any error rows can be tolerated. The sampling window is `max_batch_rows * 10`. If the ratio of error rows to total rows within the sampling window exceeds `max_filter_ratio`, the routine job will be suspended and require manual intervention to check data quality issues. Rows filtered by WHERE conditions are not counted as error rows. |
| enclose | Specifies the enclosing character. When CSV data fields contain line or column separators, a single-byte character can be specified as an enclosing character for protection to prevent accidental truncation. For example, if the column separator is "," and the enclosing character is "'", the data "a,'b,c'" will have "b,c" parsed as one field. |
| escape | Specifies the escape character. Used to escape characters in fields that are identical to the enclosing character. For example, if the data is "a,'b,'c'", the enclosing character is "'", and you want "b,'c" to be parsed as one field, you need to specify a single-byte escape character, such as "\", and modify the data to "a,'b,\'c'". |
@@ -310,6 +310,7 @@ Parameter Description: The default timeout for Stream Load. The load job will be
| escape | Specify the escape character. It is used to escape characters that are the same as the enclosure character within a field. For example, if the data is "a,'b,'c'", the enclosure is "'", and you want "b,'c" to be parsed as a single field, you need to specify a single-byte escape character, such as "\", and modify the data to "a,'b,\'c'". |
| memtable_on_sink_node | Whether to enable MemTable on DataSink node when loading data, default is false. |
| unique_key_update_mode | The update mode on Unique tables, currently only effective for Merge-On-Write Unique tables. Three types are supported: `UPSERT`, `UPDATE_FIXED_COLUMNS`, and `UPDATE_FLEXIBLE_COLUMNS`. `UPSERT`: data is loaded with upsert semantics; `UPDATE_FIXED_COLUMNS`: data is loaded through partial updates; `UPDATE_FLEXIBLE_COLUMNS`: data is loaded through flexible partial updates. |
| partial_update_new_key_behavior<br/>(since 3.1.0) | When performing partial column updates or flexible column updates on Unique tables, this parameter controls how new rows are handled. There are two types: `APPEND` and `ERROR`.<br/>- `APPEND`: Allows inserting new row data<br/>- `ERROR`: Fails and reports an error when inserting new rows |

### Load return value
