-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HDFS-17488. DN can fail IBRs with NPE when a volume is removed #6759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
Hi @ZanderXu @Hexiaoqiao, can you help me take a look whenever you're free? Thanks! |
ZanderXu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kokonguyen191 for your report. It's a nice catch.
Leave some comments, thanks
| final ReceivedDeletedBlockInfo[] rdbi = perStorage.removeAll(); | ||
| if (rdbi != null) { | ||
| reports.add(new StorageReceivedDeletedBlocks(entry.getKey(), rdbi)); | ||
| if (rdbi == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (rdbi != null) {
// Null storage, should not happen
if (entry.getKey() == null) {
dnMetrics.incrNullStorageBlockReports();
continue;
}
reports.add(new StorageReceivedDeletedBlocks(entry.getKey(), rdbi));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally put it like this to reduce nesting and make the logic a bit easier to read
| return pendingIBRs.size(); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can rollback this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Actually took me quite a while to find the setting to disable it in IDE
| // Block is not finalized - ignore the difference | ||
| return; | ||
| } | ||
| if (!storageMap.containsKey(storageUuid)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving this logic to line 2750?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the block up
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
ZanderXu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kokonguyen191 . LGTM
...project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
Show resolved
Hide resolved
|
@kokonguyen191 Thanks for your report and contributions. Sorry didn't get this issue completely. Another side, you mentioned that this issue is triggered by volume removed, so how the logic between volume removed to NPE. Thanks again. |
|
Hi @Hexiaoqiao, thanks for the review. For your first question, |
|
🎊 +1 overall
This message was automatically generated. |
@Hexiaoqiao Hi, sir. please take a look at the description of #6730 . I found the same problem and explained in that PR's description. |
...adoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/metrics/DataNodeMetrics.java
Show resolved
Hide resolved
|
LGTM. Hi @Hexiaoqiao @hfutatzhanghb any other comments? |
|
💔 -1 overall
This message was automatically generated. |
|
Merged. Thanks @kokonguyen191 for your contribution and thanks @Hexiaoqiao @haiyang1987 @hfutatzhanghb for your review. |
|
💔 -1 overall
This message was automatically generated. |
Description of PR
Error logs
The root cause is in
BPOfferService#notifyNamenodeBlock, happens when it's called on a block belonging to a volume already removed prior. Because the volume was already removedso IBRs with a null storage are now pending.
The reason why notifyNamenodeBlock can trigger on such blocks is up in DirectoryScanner#reconcile
Inside
checkAndUpdate,memBlockInfois null because all the block meta in memory is removed during the volume removal, butdiskFilestill exists. ThenDataNode#notifyNamenodeDeletedBlock(and further down the line,notifyNamenodeBlock) is called on this block.This patch effectively contains 2 parts:
FsDatasetImpl#checkAndUpdatefrom running on a block belonging to a removed storageHow was this patch tested?
Unit tests. Partially on production cluster.
For code changes: