What happened?
When two AMS instances both consider themselves the leader, the table optimizing status is very likely to be overridden.
Using the table "ods_fuxi_ubt_server_data_di" as an example:
1. ams1 assigns two tasks to the optimizer.
2. task1 completes, and ams1 receives the status report.
3. ams1 loses its connection to ZooKeeper but does not lose leadership.
4. ams2 gains leadership and initializes from the information in MySQL.
5. task2 completes, ams1 receives the status report and changes the status of ods_fuxi_ubt_server_data_di to committing.
6. ams1 starts committing a new snapshot.
7. ams1 loses leadership.
8. The commit on ods_fuxi_ubt_server_data_di fails, and its status is changed to idle.
9. During its refresh process, ams2 discovers the newly generated snapshot and mistakenly replaces the idle status with running, leaving the table in an abnormal state.
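The override in the steps above happens because a write from the old leader (ams1) is still accepted after a new leader (ams2) has taken over. A common way to rule this class of race out is a fencing token: a monotonically increasing leader epoch stored alongside the table status, so that a write carrying an older epoch is rejected. This is a minimal sketch of the idea in plain Java, not the Amoro implementation; the class and method names are hypothetical.

```java
// Hypothetical fencing-token sketch: the shared store (think: the MySQL
// row holding the table's optimizing status) records the epoch of the
// leader that last wrote it, and rejects writes from any older epoch.
class FencedStatusStore {
    private long ownerEpoch = 0;      // epoch of the leader that last wrote
    private String status = "idle";   // table optimizing status

    // Accept the write only if the caller's epoch is at least as new as
    // the last successful writer's; otherwise the stale leader is fenced.
    synchronized boolean updateStatus(long callerEpoch, String newStatus) {
        if (callerEpoch < ownerEpoch) {
            return false; // stale leader: reject the override
        }
        ownerEpoch = callerEpoch;
        status = newStatus;
        return true;
    }

    synchronized String status() { return status; }
}

public class FencingDemo {
    public static void main(String[] args) {
        FencedStatusStore store = new FencedStatusStore();
        long ams1Epoch = 1, ams2Epoch = 2;

        // ams1 (old leader) moves the table to committing.
        store.updateStatus(ams1Epoch, "committing");
        // ams2 takes over with a newer epoch and resets the table to idle.
        store.updateStatus(ams2Epoch, "idle");
        // ams1's late write after losing leadership is now rejected.
        boolean accepted = store.updateStatus(ams1Epoch, "running");

        System.out.println(accepted);       // false
        System.out.println(store.status()); // idle
    }
}
```

With this guard, step 9 above cannot occur: ams1's epoch-1 writes are refused once ams2 has written with epoch 2, regardless of how late ams1 notices it lost leadership.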
ams1 log
2024-08-25 22:31:16,414 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) plan new task OptimizingTaskId(processId:1724596273281, taskId:1)
2024-08-25 22:31:16,414 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) plan new task OptimizingTaskId(processId:1724596273281, taskId:2)
2024-08-25 22:31:16,414 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) beginAndPersistProcess
2024-08-25 22:31:16,448 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) completes the persistence
2024-08-25 22:31:16,448 INFO [plan-executor-thread-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - Completed planning on table hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) with 2 tasks with a total cost of 3569 ms, skipping tables...
2024-08-25 22:31:16,458 INFO [thrift-server-optimize-manager-888] [com.netease.arctic.server.DefaultOptimizingService] [] - OptimizerThread OptimizerThread{threadId=90, optimizer=OptimizerInstance{token=7100ca2c-1144-4bce-b373-b3dbadd8ac38, startTime=1720683350682, touchTime=1724596271458}} polled task OptimizingTaskId(processId:1724596273281, taskId:1)
2024-08-25 22:31:16,458 INFO [thrift-server-optimize-manager-441] [com.netease.arctic.server.DefaultOptimizingService] [] - OptimizerThread OptimizerThread{threadId=99, optimizer=OptimizerInstance{token=7100ca2c-1144-4bce-b373-b3dbadd8ac38, startTime=1720683350682, touchTime=1724596271458}} polled task OptimizingTaskId(processId:1724596273281, taskId:2)
2024-08-25 22:31:16,467 INFO [thrift-server-optimize-manager-441] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1724596273281, taskId:2) by optimizer 7100ca2c-1144-4bce-b373-b3dbadd8ac38 (threadId 99)
2024-08-25 22:31:16,467 INFO [thrift-server-optimize-manager-888] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1724596273281, taskId:1) by optimizer 7100ca2c-1144-4bce-b373-b3dbadd8ac38 (threadId 90)
2024-08-25 22:31:17,909 INFO [thrift-server-optimize-manager-805] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 7100ca2c-1144-4bce-b373-b3dbadd8ac38 (threadId 99) complete task OptimizingTaskId(processId:1724596273281, taskId:2)
2024-08-25 22:32:21,065 INFO [thrift-server-optimize-manager-892] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 7100ca2c-1144-4bce-b373-b3dbadd8ac38 (threadId 90) complete task OptimizingTaskId(processId:1724596273281, taskId:1)
2024-08-25 22:32:21,158 INFO [main] [com.netease.arctic.server.HighAvailabilityContainer] [] - Became the follower of AMS
2024-08-25 22:32:21,283 INFO [async-optimizing-commit-executor-0] [org.apache.iceberg.BaseMetastoreTableOperations] [] - Refreshing table metadata from new version: hdfs://ns/user/hive/warehouse/dwice_ctripdi.db/ods_fuxi_ubt_server_data_di/metadata/106145-ea768a06-8377-4439-9127-64b79b92a2b1.metadata.json
2024-08-25 22:32:21,283 WARN [async-optimizing-commit-executor-0] [org.apache.hadoop.ipc.Client] [] - interrupted waiting to send rpc request to server
2024-08-25 22:32:21,285 WARN [async-optimizing-commit-executor-0] [org.apache.iceberg.util.Tasks] [] - Retrying task after failure: Failed to open input stream for file: hdfs://ns/user/hive/warehouse/dwice_ctripdi.db/ods_fuxi_ubt_server_data_di/metadata/106145-ea768a06-8377-4439-9127-64b79b92a2b1.metadata.json
2024-08-25 22:32:21,287 ERROR [async-optimizing-commit-executor-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208) Commit optimizing failed
java.lang.RuntimeException: java.lang.InterruptedException: sleep interrupted
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:462) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:208) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:185) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:180) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:178) ~[iceberg-hive-metastore-1.4.3.jar:?]
at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80) ~[iceberg-core-1.4.3.jar:?]
at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47) ~[iceberg-core-1.4.3.jar:?]
at com.netease.arctic.formats.iceberg.IcebergCatalog.loadTable(IcebergCatalog.java:95) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.CommonUnifiedCatalog.lambda$loadTable$1(CommonUnifiedCatalog.java:117) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_121]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_121]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_121]
at java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958) ~[?:1.8.0_121]
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126) ~[?:1.8.0_121]
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498) ~[?:1.8.0_121]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485) ~[?:1.8.0_121]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_121]
at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) ~[?:1.8.0_121]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_121]
at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:464) ~[?:1.8.0_121]
at com.netease.arctic.CommonUnifiedCatalog.loadTable(CommonUnifiedCatalog.java:123) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.catalog.ExternalCatalog.lambda$loadTable$11(ExternalCatalog.java:137) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.table.TableMetaStore.call(TableMetaStore.java:234) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.table.TableMetaStore.lambda$doAs$0(TableMetaStore.java:209) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_121]
at javax.security.auth.Subject.doAs(Subject.java:360) ~[?:1.8.0_121]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873) ~[hadoop-common-2.9.2.jar:?]
at com.netease.arctic.table.TableMetaStore.doAs(TableMetaStore.java:209) ~[amoro-core-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.catalog.ExternalCatalog.doAs(ExternalCatalog.java:161) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.catalog.ExternalCatalog.loadTable(ExternalCatalog.java:137) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.table.DefaultTableService.loadTable(DefaultTableService.java:262) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.buildCommit(OptimizingQueue.java:589) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.commit(OptimizingQueue.java:546) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.table.executor.OptimizingCommitExecutor.execute(OptimizingCommitExecutor.java:53) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.table.executor.BaseTableExecutor.executeTask(BaseTableExecutor.java:84) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.table.executor.BaseTableExecutor.lambda$scheduleIfNecessary$4(BaseTableExecutor.java:95) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method) ~[?:1.8.0_121]
at java.lang.Thread.sleep(Thread.java:340) ~[?:1.8.0_121]
at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) ~[?:1.8.0_121]
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:459) ~[iceberg-core-1.4.3.jar:?]
... 46 more
2024-08-25 22:32:21,290 INFO [async-optimizing-commit-executor-0] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - Bind queue ctripdi-default success with table hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_server_data_di(tableId=8897208)
------------------------------------------------------------------------------------------------
ams2 log
2024-08-25 22:32:20,010 INFO [main] [com.netease.arctic.server.HighAvailabilityContainer] [] - Became the leader of AMS
2024-08-25 22:32:20,014 INFO [main] [com.netease.arctic.server.ArcticServiceContainer] [] - Setting up AMS table executors...
(ams2 loads its state from MySQL here)
2024-08-25 22:32:27,363 INFO [main] [com.netease.arctic.server.ArcticServiceContainer] [] - AMS table service have been initialized
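The last step of the failure, visible between these two logs, is ams2's refresh promoting the table from idle back to running because it saw a snapshot left behind by ams1's failed commit. One possible guard, sketched below in plain Java with hypothetical names (this is not the Amoro code), is to only allow the idle→running transition during refresh when the optimizing process that produced the snapshot is still tracked as live.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical transition guard: a snapshot seen during refresh may only
// flip the table to running if the optimizing process that produced it
// is still live. A snapshot orphaned by a failed commit (the process was
// already closed and the status set to idle) leaves the status untouched.
class RefreshGuard {
    private final Set<Long> liveProcessIds = new HashSet<>();
    private String status = "idle";

    void trackProcess(long processId) { liveProcessIds.add(processId); }

    // Called when a commit fails or the process finishes: the process is
    // no longer allowed to influence the table status.
    void closeProcess(long processId) { liveProcessIds.remove(processId); }

    String onRefreshSawSnapshot(long producingProcessId) {
        if (liveProcessIds.contains(producingProcessId)) {
            status = "running";
        }
        return status;
    }
}

public class RefreshGuardDemo {
    public static void main(String[] args) {
        RefreshGuard guard = new RefreshGuard();
        guard.trackProcess(1724596273281L);
        // ams1's commit failed; the process is closed and the table is idle.
        guard.closeProcess(1724596273281L);
        // ams2's refresh later sees the snapshot that process wrote, but
        // since the process is no longer live, the table stays idle.
        System.out.println(guard.onRefreshSawSnapshot(1724596273281L)); // idle
    }
}
```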
Anything else
No response
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Code of Conduct
I agree to follow this project's Code of Conduct
Root cause:
https://issues.apache.org/jira/browse/CURATOR-696
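The linked Curator issue concerns leadership not being relinquished promptly when the ZooKeeper connection degrades, which is what lets ams1 keep acting as leader in step 3. A common defensive pattern, sketched here in plain Java without the Curator API (all names are illustrative), is to treat a SUSPENDED connection the same as LOST and to re-check leadership at the last moment before any state-changing write.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical mitigation sketch: drop leadership eagerly on connection
// trouble instead of optimistically keeping it, and refuse to persist
// status changes unless leadership still holds at write time.
class LeadershipGuard {
    private final AtomicBoolean leader = new AtomicBoolean(false);

    void onElected() { leader.set(true); }

    // Treat SUSPENDED the same as LOST: assume leadership is gone until
    // re-elected. This closes the window in which two AMS instances can
    // both believe they are the leader.
    void onConnectionStateChange(String state) {
        if (state.equals("SUSPENDED") || state.equals("LOST")) {
            leader.set(false);
        }
    }

    boolean isLeader() { return leader.get(); }

    // Re-check leadership immediately before the side-effecting write.
    boolean tryPersistStatus(Runnable write) {
        if (!isLeader()) {
            return false; // refuse to write as a (possibly) stale leader
        }
        write.run();
        return true;
    }
}

public class StepDownDemo {
    public static void main(String[] args) {
        LeadershipGuard ams1 = new LeadershipGuard();
        ams1.onElected();
        // The ZooKeeper connection is suspended: step down immediately.
        ams1.onConnectionStateChange("SUSPENDED");
        boolean wrote = ams1.tryPersistStatus(() -> {});
        System.out.println(wrote); // false
    }
}
```

Even with eager step-down, a small race window remains between the leadership check and the write, which is why this pattern is usually combined with a fencing token in the shared store.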
Affects Versions
master
What table formats are you seeing the problem on?
Iceberg
What engines are you seeing the problem on?
AMS
How to reproduce
No response
Relevant log output
See the ams1 and ams2 logs above.