Closed
Description
Describe the bug
An NPE during edit log replay crashed 3 FEs, and they cannot recover.
2021-07-02 04:22:36,862 ERROR (replayer|83) [EditLog.loadJournal():816] Operation Type 29
java.lang.NullPointerException: null
at org.apache.doris.consistency.ConsistencyChecker.replayFinishConsistencyCheck(ConsistencyChecker.java:368) ~[palo-fe.jar:3.4.0]
at org.apache.doris.persist.EditLog.loadJournal(EditLog.java:339) [palo-fe.jar:3.4.0]
at org.apache.doris.catalog.Catalog.replayJournal(Catalog.java:2560) [palo-fe.jar:3.4.0]
at org.apache.doris.catalog.Catalog$3.runOneCycle(Catalog.java:2344) [palo-fe.jar:3.4.0]
at org.apache.doris.common.util.Daemon.run(Daemon.java:116) [palo-fe.jar:3.4.0]
https://github.com/apache/incubator-doris/blob/d6e6c7815b452d0e262b5c5a7a52fce0880c6117/fe/fe-core/src/main/java/org/apache/doris/consistency/ConsistencyChecker.java#L365-L370
The previous version of this file did not guard against the NPE either, but it never triggered one.
https://github.com/apache/incubator-doris/blob/94a81e52c796150333c54838a889be01934983a4/fe/fe-core/src/main/java/org/apache/doris/consistency/ConsistencyChecker.java#L366-L371
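As an illustration of what a defensive replay path could look like, here is a minimal sketch. It does not reproduce the actual ConsistencyChecker code: the class `ReplaySketch`, the `CheckInfo` type, and the map-based catalog are hypothetical stand-ins, and we only assume that replay looks up a db and a table by id before updating them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not the real Doris classes.
class ReplaySketch {
    /** Minimal stand-in for the payload of a "finish consistency check" edit log entry. */
    static final class CheckInfo {
        final long dbId;
        final long tableId;
        final long checkTimeMs;

        CheckInfo(long dbId, long tableId, long checkTimeMs) {
            this.dbId = dbId;
            this.tableId = tableId;
            this.checkTimeMs = checkTimeMs;
        }
    }

    /**
     * Applies the entry to a catalog modeled as dbId -> (tableId -> last check time).
     * Returns true if the entry was applied, false if it was skipped because the
     * db or table no longer exists (for example, it was dropped before this entry).
     */
    static boolean replay(Map<Long, Map<Long, Long>> catalog, CheckInfo info) {
        Map<Long, Long> db = catalog.get(info.dbId);
        if (db == null) {
            // Assumption: skipping is acceptable because the metadata is already gone.
            return false;
        }
        if (!db.containsKey(info.tableId)) {
            return false;
        }
        db.put(info.tableId, info.checkTimeMs);
        return true;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Long>> catalog = new HashMap<>();
        catalog.put(1L, new HashMap<>());
        // Table 10 is not in the catalog, so the entry is skipped instead of throwing.
        System.out.println(replay(catalog, new CheckInfo(1L, 10L, 123L)));
    }
}
```

Whether silently skipping such an entry is safe depends on what other state the real replay method updates, so this is only a starting point for discussion.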
We infer that this NPE is caused by a change in the write order of the editlog. We don't have enough logs to prove what’s really going on, but one possible explanation is:
- CheckConsistencyJob.tryFinishJob has already got the table and tries to lock it.
- The table is dropped just after tryFinishJob got the table.
- The op succeeds on the dropped table and writes an editlog.
- A follower replays this editlog, crashes, and never recovers.
https://github.com/apache/incubator-doris/blob/d6e6c7815b452d0e262b5c5a7a52fce0880c6117/fe/fe-core/src/main/java/org/apache/doris/consistency/CheckConsistencyJob.java#L244-L270
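The suspected sequence can be sketched on the writer side as follows. This is a simplified model under our own assumptions, not the real CheckConsistencyJob: `FinishJobSketch`, its `Table` type, the per-table `ReentrantLock`, and the string-based edit log are all hypothetical. The point it shows is that re-checking, under the lock, that the table is still registered prevents writing an editlog entry for a table that was dropped in between.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the race, not the real Doris classes.
class FinishJobSketch {
    static final class Table {
        final long id;
        final ReentrantLock lock = new ReentrantLock();

        Table(long id) {
            this.id = id;
        }
    }

    final Map<Long, Table> tables = new ConcurrentHashMap<>();
    // Stand-in for the edit log: entries in the order they were written.
    final List<String> editLog = new ArrayList<>();

    /** Drops the table under its lock and records the drop. */
    void dropTable(long id) {
        Table t = tables.get(id);
        if (t == null) {
            return;
        }
        t.lock.lock();
        try {
            tables.remove(id);
            editLog.add("DROP_TABLE " + id);
        } finally {
            t.lock.unlock();
        }
    }

    /**
     * Finishes the job for a table reference obtained earlier.
     * Returns false without logging if the table was dropped in the meantime.
     */
    boolean tryFinishJob(Table table) {
        table.lock.lock();
        try {
            // Re-check under the lock: the reference may be stale.
            if (tables.get(table.id) != table) {
                return false;
            }
            editLog.add("FINISH_CONSISTENCY_CHECK " + table.id);
            return true;
        } finally {
            table.lock.unlock();
        }
    }
}
```

Without the re-check in `tryFinishJob`, the model would append the finish entry after the drop entry, which is the ordering we suspect a follower then fails to replay.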