[Bug]: restore 2T data from snapshot report 'table does not exist'. #19408

Closed
1 task done
Ariznawlll opened this issue Oct 17, 2024 · 22 comments
Labels
kind/bug Something isn't working phase/testing severity/s0 Extreme impact: Cause the application to break down and seriously affect the use

@Ariznawlll
Contributor

Ariznawlll commented Oct 17, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

main

Commit ID

cf5296b

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

After about 5 hours of restoring, the restore failed with "table does not exist":

mysql> restore account sys from snapshot sp01;


ERROR 1064 (HY000): SQL parser error: table "table_with_pk_index_for_write_1b" does not exist

Logs: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%223wf%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241016%5C%22%7D%20%7C%3D%20%60sp01%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221729137600000%22,%22to%22:%221729155600000%22%7D%7D%7D&schemaVersion=1&orgId=1

A snapshot read can still read the data:
image

Expected Behavior

No response

Steps to Reproduce

Steps:
create snapshot sp01 for account sys ;
drop database big_data_test;
restore account sys from snapshot sp01;

big_data_test contains 28 tables holding about 2 TB of data. The schema of the table that could not be found, table_with_pk_index_for_write_1b, is:
create table if not exists big_data_test.table_with_pk_index_for_write_1B( id bigint primary key, col1 tinyint, col2 smallint, col3 int, col4 bigint, col5 tinyint unsigned, col6 smallint unsigned, col7 int unsigned, col8 bigint unsigned, col9 float, col10 double, col11 varchar(255), col12 Date, col13 DateTime, col14 timestamp, col15 bool, col16 decimal(16,6), col17 text, col18 json, col19 blob, col20 binary(255), col21 varbinary(255), col22 vecf32(3), col23 vecf32(3), col24 vecf64(3), col25 vecf64(3));

Additional information

No response

@Ariznawlll Ariznawlll added kind/bug Something isn't working severity/s-1 labels Oct 17, 2024
@Ariznawlll Ariznawlll added this to the 2.0.0 milestone Oct 17, 2024
@sukki37 sukki37 assigned LeftHandCold and unassigned matrix-meow Oct 17, 2024
@YANGGMM
Contributor

YANGGMM commented Oct 17, 2024

image

@triump2020
Contributor

The cause has been roughly identified.

@Ariznawlll
Contributor Author

Ariznawlll commented Oct 18, 2024

@triump2020
Contributor

PR is on the way!

@triump2020 triump2020 mentioned this issue Oct 21, 2024
6 tasks
@triump2020
Contributor

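The PR may address only one of the causes of this problem. If the failure still occurs with fairly high probability, there may be other causes, so we need to add logging and reproduce it again.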

@triump2020
Contributor

Improved the logging again; reproduction is running both online and offline. Some other cause is probably behind this problem.

@triump2020
Contributor

triump2020 commented Oct 24, 2024

Steps to reproduce:

  1. Modify the following configuration:
    [tn.Ckp]
    flush-interval = "5s"
    min-count = 1
    scan-interval = "5s"
    incremental-interval = "10s"
    global-min-count = 3

  2. Modify the program:
    gcPartitionStateTicker = 5 * time.Second
    gcPartitionStateTimer = 90 * time.Second

1729753427198

  3. Run mo-service

  4. Run the following SQL:
    1>create table tpcc_1000.bmsql_order_line

    2>load data url s3option {'endpoint'='http://cos.ap-guangzhou.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-guangzhou-1308875761','filepath'='tpcc_1000/order-line.csv', 'compression'=''} into table tpcc_1000.bmsql_order_line fields terminated by ',' lines terminated by '\n' parallel 'true';

    3>create snapshot sp01 for account sys;
    4>drop database tpcc_1000;
    5>restore account sys from snapshot sp01;

@triump2020
Contributor

triump2020 commented Oct 24, 2024

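The cause has been identified; waiting for the fix.
A "txn is stale" error is what caused the "table not found" error to be reported.
"txn is stale" occurs because the minTs, start, and end values of the partition state are inconsistent with each other.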

@triump2020
Contributor

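Improved the logging again; reproduction is running both online and offline. Some other cause is probably behind this problem.

After reproducing it offline: the cause is exactly the one the first PR was meant to fix; that fix simply did not work.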

@triump2020
Contributor

The "table not found" problem caused by "txn is stale" should now be fixed; it has been tested offline many times. @Ariznawlll please test.

@triump2020
Contributor

Waiting for the PR to be merged.

sukki37 pushed a commit that referenced this issue Oct 27, 2024
@XuPeng-SH XuPeng-SH assigned Ariznawlll and unassigned triump2020 Oct 27, 2024
@XuPeng-SH
Contributor

fixed

@Ariznawlll
Contributor Author

Testing in progress.

@Ariznawlll
Contributor Author

Ariznawlll commented Oct 28, 2024

The problem did not occur with Fei-ge's (飞哥) reproduction method. Test steps:

Reproduction steps:
mysql> select git_version();
+---------------+
| git_version() |
+---------------+
| 700ee56cf     |
+---------------+
1 row in set (0.00 sec)

Modify the following configuration:
[tn.Ckp]
flush-interval = "5s"
min-count = 1
scan-interval = "5s"
incremental-interval = "10s"
global-min-count = 3

Modify the program:
gcPartitionStateTicker = 5 * time.Second
gcPartitionStateTimer = 90 * time.Second
image
mysql> create table bmsql_order_line (
    ->   ol_w_id         integer   not null,
    ->   ol_d_id         integer   not null,
    ->   ol_o_id         integer   not null,
    ->   ol_number       integer   not null,
    ->   ol_i_id         integer   not null,
    ->   ol_delivery_d   timestamp,
    ->   ol_amount       decimal(6,2),
    ->   ol_supply_w_id  integer,
    ->   ol_quantity     integer,
    ->   ol_dist_info    char(24),
    ->   primary key (ol_w_id, ol_d_id, ol_o_id, ol_number)
    -> ) ;
Query OK, 0 rows affected (0.07 sec)

mysql> load data url s3option {'endpoint'='http://cos.ap-guangzhou.myqcloud.com','access_key_id'='***','secret_access_key'='***','bucket'='mo-load-guangzhou-1308875761','filepath'='tpcc_1000/order-line.csv', 'compression'=''} into table tpcc_1000.bmsql_order_line fields terminated by ',' lines terminated by '\n' parallel 'true';
Query OK, 300014864 rows affected (17 min 14.86 sec)

mysql> create snapshot sp01 for account sys;
Query OK, 0 rows affected (0.06 sec)

mysql> drop database tpcc_1000;
Query OK, 1 row affected (0.11 sec)

mysql> restore account sys from snapshot sp01;
Query OK, 0 rows affected (1 min 55.28 sec)
image

@Ariznawlll
Contributor Author

Ariznawlll commented Oct 29, 2024

Account-level restore; the account holds about 2 TB of data.

mysql> select git_version();
+---------------+
| git_version() |
+---------------+
| 5c6909f       |
+---------------+
1 row in set (0.00 sec)

image (WeCom screenshot 215120b5-f2a2-4cf1-be93-e63155772535)

Restore succeeded.

@Ariznawlll
Contributor Author

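Test conclusions:

  1. Cluster-level restore failed; the whole cluster holds about 2 TB of data.
  2. Account-level restore succeeded; the account holds about 2 TB of data.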

@Ariznawlll Ariznawlll assigned triump2020 and unassigned Ariznawlll Oct 29, 2024
@Ariznawlll Ariznawlll added severity/s0 Extreme impact: Cause the application to break down and seriously affect the use and removed severity/s-1 labels Oct 29, 2024
@Ariznawlll Ariznawlll modified the milestones: 2.0.0, 2.0.1 Oct 29, 2024
@triump2020
Contributor

@Ariznawlll Pls test!

@triump2020 triump2020 assigned Ariznawlll and unassigned triump2020 Oct 30, 2024
mergify bot pushed a commit that referenced this issue Oct 30, 2024
1.  Cherry-pick 2.0-dev

Approved by: @XuPeng-SH, @sukki37
@Ariznawlll
Contributor Author

Testing in progress.

@Ariznawlll
Contributor Author

Restore of the 1.66 TB cluster succeeded.

@Ariznawlll
Contributor Author

Restore of the cluster with 2.37 TB of data succeeded.


6 participants