Conflict data resolution enhancement during data import #16720

Merged: 14 commits, Mar 26, 2024
29 changes: 11 additions & 18 deletions tidb-lightning/tidb-lightning-configuration.md
Expand Up @@ -125,16 +125,20 @@ driver = "file"
# keep-after-success = false

[conflict]
# Starting from v7.3.0, a new version of strategy is introduced to handle conflicting data. The default value is "".
# - "": TiDB Lightning does not detect or handle conflicting data. If the source file contains conflicting primary or unique key records, the subsequent step reports an error.
# Starting from v7.3.0, a new version of strategy is introduced to handle conflicting data. The default value is "". Starting from v8.0.0, TiDB Lightning optimizes the conflict strategy for both physical and logical import modes (experimental).
# - "": in the physical import mode, TiDB Lightning does not detect or handle conflicting data. If the source file contains conflicting primary or unique key records, the subsequent step reports an error. In the logical import mode, TiDB Lightning converts the "" strategy to the "error" strategy for processing.
# - "error": when detecting conflicting primary or unique key records in the imported data, TiDB Lightning terminates the import and reports an error.
# - "replace": when encountering conflicting primary or unique key records, TiDB Lightning retains the new data and overwrites the old data.
# - "ignore": when encountering conflicting primary or unique key records, TiDB Lightning retains the old data and ignores the new data.
# The new version strategy cannot be used together with tikv-importer.duplicate-resolution (the old version of conflict detection).
# - "replace": when encountering conflicting primary or unique key records, TiDB Lightning retains the latest data and overwrites the old data.
# The conflict data are recorded in the `lightning_task_info.conflict_error_v2` table (recording conflict data detected by conflict post checks in the physical import mode) and the `conflict_records` table (recording conflict data detected by conflict prechecks in both logical and physical import modes) of the target TiDB cluster.
# You can manually insert the correct records into the target table based on your business requirements.
# Note that the target TiKV must be v5.2.0 or later versions.
# - "ignore": when encountering conflicting primary or unique key records, TiDB Lightning retains the old data and ignores the new data. This option can only be used in the logical import mode.
strategy = ""
# Controls the upper limit of the conflicting data that can be handled when strategy is "replace" or "ignore". You can set it only when strategy is "replace" or "ignore". The default value is 9223372036854775807, which means that almost all errors are tolerant.
# Controls whether to enable conflict prechecks, which check conflicts in the data before importing it to TiDB. In scenarios where the ratio of conflict records is greater than or equal to 1%, it is recommended to enable conflict prechecks for better performance in conflict detection. In other scenarios, it is recommended to disable it. The default value is false, indicating that TiDB Lightning only checks conflicts after the import. If you set it to true, TiDB Lightning checks conflicts both before and after the import. This parameter is experimental, and it can be used only in the physical import mode.
# precheck-conflict-before-import = false
# Controls the maximum number of conflict errors that can be handled when strategy is "replace" or "ignore". You can set it only when strategy is "replace" or "ignore". The default value is 9223372036854775807, which means that almost all errors are tolerated.
# threshold = 9223372036854775807
# Controls the maximum number of records in the conflict_records table. The default value is 100. If the strategy is "ignore", the conflict records that are ignored are recorded; if the strategy is "replace", the conflict records that are overwritten are recorded. However, the "replace" strategy cannot record the conflict records in the logical import mode.
# Controls the maximum number of records in the `conflict_records` table. The default value is 100. In the physical import mode, if the strategy is "replace", the conflict records that are overwritten are recorded. In the logical import mode, if the strategy is "ignore", the conflict records that are ignored are recorded; if the strategy is "replace", the conflict records are recorded, but the conflict records that are overwritten are not recorded.
# max-record-rows = 100

[tikv-importer]
Expand All @@ -150,17 +154,6 @@ strategy = ""
# Note that this parameter is only used in scenarios where the target table is empty.
# parallel-import = false

# Whether to detect and resolve duplicate records (unique key conflict) in the physical import mode.
# The following resolution algorithms are supported:
# - none: does not detect duplicate records, which has the best performance of the two algorithms.
# But if there are duplicate records in the data source, it might lead to inconsistent data in the target TiDB.
# - remove: if there are primary key or unique key conflicts between the inserting data A and B,
# A and B will be removed from the target table and recorded
# in the `lightning_task_info.conflict_error_v1` table in the target TiDB.
# You can manually insert the correct records into the target table based on your business requirements.
# Note that the target TiKV must be v5.2.0 or later versions; otherwise it falls back to 'none'.
# The default value is 'none'.
# duplicate-resolution = 'none'
# The maximum number of KV pairs in one request when sending data to TiKV in physical import mode.
# Starting from v7.2.0, this parameter is deprecated and no longer takes effect after it is set.
# If you want to adjust the amount of data sent to TiKV in one request, use the `send-kv-size` parameter instead.
6 changes: 3 additions & 3 deletions tidb-lightning/tidb-lightning-error-resolution.md
Expand Up @@ -11,7 +11,7 @@ This document introduces TiDB Lightning error types, how to query the errors, an

- `lightning.max-error`: the tolerance threshold for type errors
- `conflict.strategy`, `conflict.threshold`, and `conflict.max-record-rows`: configurations related to conflicting data
- `tikv-importer.duplicate-resolution`: the conflict handling configuration that can only be used in the physical import mode
- `tikv-importer.duplicate-resolution` (deprecated in v8.0.0): the conflict handling configuration that can only be used in the physical import mode
- `lightning.task-info-schema-name`: the database where conflicting data is stored when TiDB Lightning detects conflicts

For more information, see [TiDB Lightning (Task)](/tidb-lightning/tidb-lightning-configuration.md#tidb-lightning-task).
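
As a rough illustration, the settings listed above might appear together in a TiDB Lightning task file as follows. The numeric values are illustrative assumptions rather than recommendations, and the schema name `lightning_task_info` is taken from the table names used elsewhere in this change.

```toml
[lightning]
# Tolerance threshold for type errors (illustrative value).
max-error = 0
# Database in which TiDB Lightning stores the conflicting data it detects.
task-info-schema-name = "lightning_task_info"

[conflict]
# How conflicting primary or unique key records are handled.
strategy = "replace"
# Maximum number of conflict errors to handle (illustrative value).
threshold = 10000
# Maximum number of rows kept in the `conflict_records` table.
max-record-rows = 100
```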
Expand Down Expand Up @@ -119,9 +119,9 @@ CREATE TABLE conflict_records (

`type_error_v1` records all [type errors](#type-error) managed by `lightning.max-error`. Each error corresponds to one row.

`conflict_error_v1` records all unique and primary key conflicts managed by `tikv-importer.duplicate-resolution` in the physical import mode. Each pair of conflicts corresponds to two rows.
`conflict_error_v2` records conflicts managed by the `conflict` configuration group in the physical import mode. Each pair of conflicts corresponds to two rows.

`conflict_records` records all unique and primary key conflicts managed by the `conflict` configuration group in logical import mode and physical import mode. Each error corresponds to one row.
`conflict_records` records conflicts managed by the `conflict` configuration group in both the logical import mode and physical import mode. Each error corresponds to one row.

| Column | Syntax | Type | Conflict | Description |
| ------------ | ------ | ---- | -------- | ----------------------------------------------------------------------------------------------------------------------------------- |
4 changes: 2 additions & 2 deletions tidb-lightning/tidb-lightning-logical-import-mode-usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,8 @@ Conflicting data refers to two or more records with the same data in the PK or U
| :-- | :-- | :-- |
| `"replace"` | Replacing existing data with new data. | `REPLACE INTO ...` |
| `"ignore"` | Keeping existing data and ignoring new data. | `INSERT IGNORE INTO ...` |
| `"error"` | Pausing the import and reporting an error. | `INSERT INTO ...` |
| `""` | TiDB Lightning does not detect or handle conflicting data. If data with primary and unique key conflicts exists, the subsequent step reports an error. | None |
| `"error"` | Pausing the import when conflicting data is detected. | `INSERT INTO ...` |
| `""` | Converted to `"error"`, which means that the import is paused when conflicting data is detected. | None |

When the strategy is `"error"`, errors caused by conflicting data directly terminate the import task. When the strategy is `"replace"` or `"ignore"`, you can control the maximum number of tolerated conflicts by configuring [`conflict.threshold`](/tidb-lightning/tidb-lightning-configuration.md#tidb-lightning-task). The default value is `9223372036854775807`, which means that almost all errors are tolerated.
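
A minimal sketch of the logical import mode settings described above, assuming `backend = "tidb"` is what selects the logical import mode and using an illustrative threshold:

```toml
[tikv-importer]
# Assumption: "tidb" selects the logical import mode.
backend = "tidb"

[conflict]
# Keep existing rows and skip the conflicting new rows.
strategy = "ignore"
# Tolerate at most 10000 conflicts (illustrative value; the default is 9223372036854775807).
threshold = 10000
# Record at most 100 of the ignored rows in the `conflict_records` table.
max-record-rows = 100
```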

35 changes: 19 additions & 16 deletions tidb-lightning/tidb-lightning-physical-import-mode-usage.md
Expand Up @@ -30,23 +30,25 @@ check-requirements = true
data-source-dir = "/data/my_database"

[conflict]
# Starting from v7.3.0, a new version of strategy is introduced to handle conflicting data. The default value is "".
# Starting from v7.3.0, a new version of strategy is introduced to handle conflicting data. The default value is "". Starting from v8.0.0, TiDB Lightning optimizes the conflict strategy for both physical and logical import modes (experimental).
# - "": TiDB Lightning does not detect or handle conflicting data. If the source file contains conflicting primary or unique key records, the subsequent step reports an error.
# - "error": when detecting conflicting primary or unique key records in the imported data, TiDB Lightning terminates the import and reports an error.
# - "replace": when encountering conflicting primary or unique key records, TiDB Lightning retains the new data and overwrites the old data.
# - "replace": when encountering conflicting primary or unique key records, TiDB Lightning retains the latest data and overwrites the old data.
# The conflict data are recorded in the `lightning_task_info.conflict_error_v2` table (recording conflict data detected by conflict post checks) and the `conflict_records` table (recording conflict data detected by conflict prechecks) of the target TiDB cluster.
# You can manually insert the correct records into the target table based on your business requirements.
# Note that the target TiKV must be v5.2.0 or later versions.
# - "ignore": when encountering conflicting primary or unique key records, TiDB Lightning retains the old data and ignores the new data.
# The new version strategy cannot be used together with tikv-importer.duplicate-resolution (the old version of conflict detection).
strategy = ""
# Controls whether to enable conflict prechecks, which check conflicts in the data before importing it to TiDB. In scenarios where the ratio of conflict records is greater than or equal to 1%, it is recommended to enable conflict prechecks for better performance in conflict detection. In other scenarios, it is recommended to disable it. The default value is false, indicating that TiDB Lightning only checks conflicts after the import. If you set it to true, TiDB Lightning checks conflicts both before and after the import. This parameter is experimental.
Review comment (Contributor): ditto
# precheck-conflict-before-import = false
# threshold = 9223372036854775807
# max-record-rows = 100

[tikv-importer]
# Import mode. "local" means using the physical import mode.
backend = "local"

# The method to resolve the conflicting data.
duplicate-resolution = 'remove'

# The directory of local KV sorting.
sorted-kv-dir = "./some-dir"

Expand Down Expand Up @@ -101,36 +103,37 @@ Conflicting data refers to two or more records with the same primary key or uniq
There are two versions for conflict detection:

- The new version of conflict detection, controlled by the `conflict` configuration item.
- The old version of conflict detection, controlled by the `tikv-importer.duplicate-resolution` configuration item.
- The old version of conflict detection (deprecated in v8.0.0), controlled by the `tikv-importer.duplicate-resolution` configuration item.

### The new version of conflict detection

The meaning of configuration values are as follows:
The meanings of configuration values are as follows:

| Strategy | Default behavior of conflicting data | The corresponding SQL statement |
| :-- | :-- | :-- |
| `"replace"` | Replacing existing data with new data. | `REPLACE INTO ...` |
| `"ignore"` | Keeping existing data and ignoring new data. | `INSERT IGNORE INTO ...` |
| `"error"` | Pausing the import and reporting an error. | `INSERT INTO ...` |
| `""` | TiDB Lightning does not detect or handle conflicting data. If data with primary and unique key conflicts exists, the subsequent step reports an error. | None |
| `"replace"` | Retaining the latest data and overwriting the old data. | `REPLACE INTO ...` |
| `"error"` | Terminating the import and reporting an error. | `INSERT INTO ...` |
| `""` | TiDB Lightning does not detect or handle conflicting data. If data with primary and unique key conflicts exists, the subsequent checksum step reports an error. | None |

> **Note:**
>
> The conflict detection result in the physical import mode might differ from that of an SQL-based import due to the internal implementation and limitations of TiDB Lightning.

When the strategy is `"replace"` or `"ignore"`, conflicting data is treated as [conflict errors](/tidb-lightning/tidb-lightning-error-resolution.md#conflict-errors). If the [`conflict.threshold`](/tidb-lightning/tidb-lightning-configuration.md#tidb-lightning-task) value is greater than `0`, TiDB Lightning tolerates the specified number of conflict errors. The default value is `9223372036854775807`, which means that almost all errors are tolerant. For more information, see [error resolution](/tidb-lightning/tidb-lightning-error-resolution.md).
When the strategy is `"error"` and conflicting data is detected, TiDB Lightning reports an error and terminates the import. When the strategy is `"replace"`, conflicting data is treated as [conflict errors](/tidb-lightning/tidb-lightning-error-resolution.md#conflict-errors). If the [`conflict.threshold`](/tidb-lightning/tidb-lightning-configuration.md#tidb-lightning-task) value is greater than `0`, TiDB Lightning tolerates the specified number of conflict errors. The default value is `9223372036854775807`, which means that almost all errors are tolerated. For more information, see [error resolution](/tidb-lightning/tidb-lightning-error-resolution.md).
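
For example, a physical import that tolerates a bounded number of conflicts might be configured as in the following sketch. The threshold value is an illustrative assumption; the parameter names come from the configuration shown earlier in this change.

```toml
[tikv-importer]
# "local" means using the physical import mode.
backend = "local"

[conflict]
# Retain the latest data and overwrite the old data on conflict.
strategy = "replace"
# Tolerate at most 10000 conflict errors (illustrative value).
threshold = 10000
```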

The new version of conflict detection has the following limitations:

- Before importing, TiDB Lightning prechecks potential conflicting data by reading all data and encoding it. During the detection process, TiDB Lightning uses `tikv-importer.sorted-kv-dir` to store temporary files. After the detection is complete, TiDB Lightning retains the results for the import phase. This introduces additional overhead in time, disk space usage, and API requests to read the data.
- The new version of conflict detection works only on a single node, and does not apply to parallel imports or to scenarios where the `disk-quota` parameter is enabled.
- The new version (`conflict`) and old version (`tikv-importer.duplicate-resolution`) conflict detection cannot be used at the same time. The new version of conflict detection is enabled when the configuration [`conflict.strategy`](/tidb-lightning/tidb-lightning-configuration.md#tidb-lightning-task) is set.

Compared with the old version of conflict detection, the new version takes less time when the imported data contains a large amount of conflicting data. It is recommended that you use the new version of conflict detection in non-parallel import tasks when the data contains conflicting data and there is sufficient local disk space.
The new version of conflict detection uses the `precheck-conflict-before-import` parameter to control whether prechecks are enabled. When the original data contains a lot of conflicting data, the total time consumed by conflict detection before and after the import is less than with the old version. Therefore, it is recommended to enable prechecks in scenarios where the ratio of conflict records is greater than or equal to 1% and there is sufficient local disk space.
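
The following sketch shows what enabling prechecks could look like for a data set with a high conflict ratio. The directory value is a placeholder reused from the example configuration above.

```toml
[tikv-importer]
backend = "local"
# Prechecks store their temporary files under this directory,
# so it needs sufficient free disk space.
sorted-kv-dir = "./some-dir"

[conflict]
strategy = "replace"
# Experimental: check conflicts before the import as well as after it.
precheck-conflict-before-import = true
```

According to the limitations listed above, this setup is not intended for parallel imports or for tasks that enable `disk-quota`.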

### The old version of conflict detection (deprecated in v8.0.0)

### The old version of conflict detection
Starting from v8.0.0, the old version of conflict detection (`tikv-importer.duplicate-resolution`) is deprecated. If `tikv-importer.duplicate-resolution` is `remove` and `conflict.strategy` is not configured, TiDB Lightning automatically enables the new version of conflict detection by setting `conflict.strategy` to `"replace"`. Note that `tikv-importer.duplicate-resolution` and `conflict.strategy` cannot be configured at the same time; doing so results in an error.

The old version of conflict detection is enabled when `tikv-importer.duplicate-resolution` is not an empty string. In v7.2.0 and earlier versions, TiDB Lightning only supports this conflict detection method.
- For versions between v7.3.0 and v7.6.0, TiDB Lightning enables the old version of conflict detection when `tikv-importer.duplicate-resolution` is not an empty string.
- For v7.2.0 and earlier versions, TiDB Lightning only supports the old version of conflict detection.
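
The v8.0.0 migration described above can be sketched as a before/after pair. This is an illustration of the documented mapping from `duplicate-resolution = 'remove'` to `conflict.strategy = "replace"`, not a complete task file.

```toml
# Before (old version, deprecated in v8.0.0):
# [tikv-importer]
# backend = "local"
# duplicate-resolution = 'remove'

# After (new version): set the strategy explicitly and drop
# duplicate-resolution, because configuring both results in an error.
[tikv-importer]
backend = "local"

[conflict]
strategy = "replace"
```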

In the old version of conflict detection, TiDB Lightning offers two strategies:
