[Bug]: Double leader results in an abnormal table status #3150

Closed
rfyu opened this issue Aug 31, 2024 · 0 comments · Fixed by #3151
Labels
type:bug Something isn't working

Comments

@rfyu
Contributor

rfyu commented Aug 31, 2024

What happened?

When two AMS instances both consider themselves the leader, there is a high likelihood that the table optimization status will be overridden.
Using "ods_fuxi_ubt_server_data_di" as a case:

  1. ams1 assigns two tasks to the optimizer
  2. task1 is completed, and ams1 receives the status report
  3. ams1 loses its connection to ZooKeeper, but does not lose leadership
  4. ams2 acquires leadership and initializes its state from MySQL
  5. task2 is completed, ams1 receives the status report and changes the status of ods_fuxi_ubt_server_data_di to committing
  6. ams1 starts committing a new snapshot
  7. ams1 loses leadership
  8. the ods_fuxi_ubt_server_data_di commit fails, and its status is changed back to idle
  9. during its refresh, ams2 discovers the newly generated snapshot and mistakenly replaces the idle status with running, leaving the table in an abnormal state
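
A fencing token is one common guard against this class of stale-leader write: each elected leader records a monotonically increasing epoch, and every status write is made conditional on that epoch. The sketch below is only illustrative; the table_runtime table, leader_epoch column, and class name are hypothetical assumptions, not Amoro's actual schema or code.

```java
// Hypothetical fencing sketch: status writes carry the writer's leader epoch,
// so a stale leader's UPDATE silently matches zero rows after a failover.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class FencedStatusUpdater {

  private final long myEpoch; // epoch obtained when this AMS won the election

  public FencedStatusUpdater(long myEpoch) {
    this.myEpoch = myEpoch;
  }

  /** Returns false when a newer leader has taken over, so the write is dropped. */
  public boolean updateStatus(Connection conn, long tableId, String newStatus)
      throws SQLException {
    String sql =
        "UPDATE table_runtime SET status = ? WHERE table_id = ? AND leader_epoch = ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setString(1, newStatus);
      ps.setLong(2, tableId);
      ps.setLong(3, myEpoch);
      return ps.executeUpdate() == 1; // 0 rows => fenced out by a newer epoch
    }
  }
}
```

Under such a scheme, ams1's writes in steps 5 and 8 would match zero rows once ams2 had bumped the epoch, so ams2's MySQL-initialized view would stay consistent.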

Root cause:
https://issues.apache.org/jira/browse/CURATOR-696 (the Curator leader recipe can keep reporting leadership after its ZooKeeper connection is disrupted, which is what happens to ams1 in step 3)
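
A defensive workaround on the AMS side, independent of the Curator fix, is to stop acting as leader as soon as the ZooKeeper connection is even suspected to be gone, instead of trusting the latch while disconnected. Below is a minimal sketch using Curator's LeaderLatch; the stopLeaderWork hook and the ZooKeeper address are hypothetical, this is not AMS's actual HighAvailabilityContainer code, and depending on the exact failure mode in CURATOR-696 the listener may not fire at all, so this narrows the window rather than closing it.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class DefensiveLeaderElection {

  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk-host:2181", new ExponentialBackoffRetry(1000, 3)); // hypothetical address
    client.start();

    LeaderLatch latch = new LeaderLatch(client, "/ams/leader");

    // Step down eagerly on SUSPENDED/LOST instead of continuing leader work
    // while the session state is unknown.
    client.getConnectionStateListenable().addListener((c, newState) -> {
      if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
        stopLeaderWork();
      }
    });

    latch.start();
    latch.await(); // blocks until this instance is elected

    // ... leader work; re-check latch.hasLeadership() before each critical
    // step (plan persistence, status update, snapshot commit).
  }

  private static void stopLeaderWork() {
    // Hypothetical hook: halt optimizing executors, stop accepting task
    // reports, and abandon any in-flight commit.
  }
}
```

Even with eager step-down, a local hasLeadership() check cannot fully close the race; only a fencing check at the storage layer (as sketched above) makes stale writes harmless.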

Affects Versions

master

What table formats are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

AMS

How to reproduce

No response

Relevant log output

ams1 log


2024-08-25 22:31:16,414 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) plan new task OptimizingTaskId(processId:1724596273281, taskId:1)
2024-08-25 22:31:16,414 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) plan new task OptimizingTaskId(processId:1724596273281, taskId:2)
2024-08-25 22:31:16,414 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) beginAndPersistProcess
2024-08-25 22:31:16,448 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) completes the persistence
2024-08-25 22:31:16,448 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - Completed planning on table hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) with 2 tasks with a total cost of 3569 ms, skipping tables...


2024-08-25 22:31:16,458 INFO [thrift-server-optimize-manager-888] [com.netease.arctic.server.DefaultOptimizingService] [] - OptimizerThread OptimizerThread{threadId=90, optimizer=OptimizerInstance{token=7100ca2c-1144-4bce-b373-b3dbadd8ac38, startTime=1720683350682, touchTime=1724596271458}} polled task OptimizingTaskId(processId:1724596273281, taskId:1)
2024-08-25 22:31:16,458 INFO [thrift-server-optimize-manager-441] [com.netease.arctic.server.DefaultOptimizingService] [] - OptimizerThread OptimizerThread{threadId=99, optimizer=OptimizerInstance{token=7100ca2c-1144-4bce-b373-b3dbadd8ac38, startTime=1720683350682, touchTime=1724596271458}} polled task OptimizingTaskId(processId:1724596273281, taskId:2)
2024-08-25 22:31:16,467 INFO [thrift-server-optimize-manager-441] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1724596273281, taskId:2) by optimizer 7100ca2c-1144-4bce-b373-b3dbadd8ac38 (threadId 99)
2024-08-25 22:31:16,467 INFO [thrift-server-optimize-manager-888] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1724596273281, taskId:1) by optimizer 7100ca2c-1144-4bce-b373-b3dbadd8ac38 (threadId 90)
2024-08-25 22:31:17,909 INFO [thrift-server-optimize-manager-805] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 7100ca2c-1144-4bce-b373-b3dbadd8ac38 (threadId 99) complete task OptimizingTaskId(processId:1724596273281, taskId:2)

2024-08-25 22:32:21,065 INFO [thrift-server-optimize-manager-892] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 7100ca2c-1144-4bce-b373-b3dbadd8ac38 (threadId 90) complete task OptimizingTaskId(processId:1724596273281, taskId:1)


2024-08-25 22:32:21,158 INFO [main] [com.netease.arctic.server.HighAvailabilityContainer] [] - Became the follower of AMS

2024-08-25 22:32:21,283 INFO [async-optimizing-commit-executor-0] [org.apache.iceberg.BaseMetastoreTableOperations] [] - Refreshing table metadata from new version: hdfs://ns/user/hive/warehouse/dwice_ctripdi.db/ods_fuxi_ubt_server_data_di/metadata/106145-ea768a06-8377-4439-9127-64b79b92a2b1.metadata.json
2024-08-25 22:32:21,283 WARN [async-optimizing-commit-executor-0] [org.apache.hadoop.ipc.Client] [] - interrupted waiting to send rpc request to server
2024-08-25 22:32:21,285 WARN [async-optimizing-commit-executor-0] [org.apache.iceberg.util.Tasks] [] - Retrying task after failure: Failed to open input stream for file: hdfs://ns/user/hive/warehouse/dwice_ctripdi.db/ods_fuxi_ubt_server_data_di/metadata/106145-ea768a06-8377-4439-9127-64b79b92a2b1.metadata.json
2024-08-25 22:32:21,287 ERROR [async-optimizing-commit-executor-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) Commit optimizing failed
java.lang.RuntimeException: java.lang.InterruptedException: sleep interrupted
        at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:462) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:208) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:185) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:180) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:178) ~[iceberg-hive-metastore-1.4.3.jar:?]
        at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) ~[iceberg-core-1.4.3.jar:?]
        at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) ~[iceberg-core-1.4.3.jar:?]
        at com.netease.arctic.formats.iceberg.IcebergCatalog.loadTable(IcebergCatalog.java:95) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.CommonUnifiedCatalog.lambda$loadTable$1(CommonUnifiedCatalog.java:117) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_121]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_121]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_121]
        at java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958) ~[?:1.8.0_121]
        at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126) ~[?:1.8.0_121]
        at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498) ~[?:1.8.0_121]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485) ~[?:1.8.0_121]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_121]
        at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) ~[?:1.8.0_121]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_121]
        at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464) ~[?:1.8.0_121]
        at com.netease.arctic.CommonUnifiedCatalog.loadTable(CommonUnifiedCatalog.java:123) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.catalog.ExternalCatalog.lambda$loadTable$11(ExternalCatalog.java:137) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.table.TableMetaStore.call(TableMetaStore.java:234) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.table.TableMetaStore.lambda$doAs$0(TableMetaStore.java:209) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_121]
        at javax.security.auth.Subject.doAs(Subject.java:360) ~[?:1.8.0_121]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873) ~[hadoop-common-2.9.2.jar:?]
        at com.netease.arctic.table.TableMetaStore.doAs(TableMetaStore.java:209) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.catalog.ExternalCatalog.doAs(ExternalCatalog.java:161) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.catalog.ExternalCatalog.loadTable(ExternalCatalog.java:137) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.table.DefaultTableService.loadTable(DefaultTableService.java:262) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.buildCommit(OptimizingQueue.java:589) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.commit(OptimizingQueue.java:546) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.table.executor.OptimizingCommitExecutor.execute(OptimizingCommitExecutor.java:53) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.table.executor.BaseTableExecutor.executeTask(BaseTableExecutor.java:84) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at com.netease.arctic.server.table.executor.BaseTableExecutor.lambda$scheduleIfNecessary$4(BaseTableExecutor.java:95) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_121]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_121]
        at java.lang.Thread.sleep(Thread.java:340) ~[?:1.8.0_121]
        at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) ~[?:1.8.0_121]
        at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:459) ~[iceberg-core-1.4.3.jar:?]
        ... 46 more
2024-08-25 22:32:21,290 INFO [async-optimizing-commit-executor-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - Bind queue ctripdi-default success with table hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208)

------------------------------------------------------------------------------------------------
ams2 log

2024-08-25 22:32:20,010 INFO [main] [com.netease.arctic.server.HighAvailabilityContainer] [] - Became the leader of AMS
2024-08-25 22:32:20,014 INFO [main] [com.netease.arctic.server.ArcticServiceContainer] [] - Setting up AMS table executors...
[ams2 loads table state from MySQL here]
2024-08-25 22:32:27,363 INFO [main] [com.netease.arctic.server.ArcticServiceContainer] [] - AMS table service have been initialized

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct