HDDS-11331. Fix Datanode unable to report for a long time #7090
Conversation
CI: https://github.com/jianghuazhu/ozone/actions/runs/10433754194
@jianghuazhu , thanks for working on this! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13070970/7090_review.patch
final Map<PipelineKey, PipelineAction> actionsForEndpoint =
    pipelineActions.get(endpoint);
synchronized (actionsForEndpoint) {
  if (actionsForEndpoint.values().stream().noneMatch(
Since it is a map, we don't need to call stream().
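To illustrate the point: once the structure is keyed by (pipelineID, action), membership becomes a direct `containsKey` lookup instead of a linear `stream().noneMatch(...)` scan. A minimal sketch, with `String` standing in for the real `PipelineKey`/`PipelineAction` types (a hypothetical simplification):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapLookupSketch {
    public static void main(String[] args) {
        // Hypothetical stand-in: the real map is LinkedHashMap<PipelineKey, PipelineAction>.
        Map<String, String> actions = new LinkedHashMap<>();
        actions.put("pipeline-1/CLOSE", "closeAction");

        // The queue-based code had to scan every element via stream().noneMatch(...).
        // With a map keyed by (pipelineID, action), it is a constant-time lookup:
        boolean alreadyQueued = actions.containsKey("pipeline-1/CLOSE");
        System.out.println(alreadyQueued); // true
    }
}
```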
@@ -112,7 +115,7 @@ public class StateContext {
   private final Map<InetSocketAddress, List<Message>>
       incrementalReportsQueue;
   private final Map<InetSocketAddress, Queue<ContainerAction>> containerActions;
-  private final Map<InetSocketAddress, Queue<PipelineAction>> pipelineActions;
+  private final Map<InetSocketAddress, LinkedHashMap<PipelineKey, PipelineAction>> pipelineActions;
Let's also add a new class for the inner map; it makes the synchronization easier to see.
static class ActionMap {
  private final LinkedHashMap<PipelineKey, PipelineAction> map = new LinkedHashMap<>();

  synchronized int size() {
    return map.size();
  }

  synchronized void putIfAbsent(PipelineKey key, PipelineAction pipelineAction) {
    map.putIfAbsent(key, pipelineAction);
  }

  synchronized List<PipelineAction> getActions(List<PipelineReport> reports, int max) {
    if (map.isEmpty()) {
      return Collections.emptyList();
    }
    final List<PipelineAction> pipelineActionList = new ArrayList<>();
    final int limit = Math.min(map.size(), max);
    final Iterator<Map.Entry<PipelineKey, PipelineAction>> i = map.entrySet().iterator();
    for (int count = 0; count < limit && i.hasNext(); count++) {
      final Map.Entry<PipelineKey, PipelineAction> entry = i.next();
      final PipelineAction action = entry.getValue();
      // Add closePipeline back to the pipelineAction queue until
      // the pipeline is closed and removed from the DN.
      if (action.hasClosePipeline()) {
        if (reports.stream().noneMatch(entry.getKey()::equalsId)) {
          // The pipeline has been removed from the DN; this action is no longer needed.
          i.remove();
          continue;
        }
        // The pipeline is closed but not yet removed from the DN.
      } else {
        i.remove();
      }
      pipelineActionList.add(action);
    }
    return pipelineActionList;
  }
}
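The retention rule in `getActions` (close actions are re-sent until the DN stops reporting the pipeline; everything else is consumed once) can be modeled with a minimal, self-contained sketch. `String` keys and a `boolean` stand in for the real `PipelineKey` and `hasClosePipeline()` (hypothetical simplifications):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RetentionSketch {
    public static void main(String[] args) {
        // key -> isClose; mirrors LinkedHashMap<PipelineKey, PipelineAction>.
        LinkedHashMap<String, Boolean> map = new LinkedHashMap<>();
        map.put("p1/CLOSE", true);
        map.put("p2/OTHER", false);
        Set<String> reportedPipelines = Set.of("p1"); // p1 is still on the DN

        List<String> drained = new ArrayList<>();
        Iterator<Map.Entry<String, Boolean>> i = map.entrySet().iterator();
        while (i.hasNext()) {
            Map.Entry<String, Boolean> e = i.next();
            String pipeline = e.getKey().split("/")[0];
            if (e.getValue()) {                          // close action
                if (!reportedPipelines.contains(pipeline)) {
                    i.remove();                          // pipeline gone: drop the action
                    continue;
                }
                // pipeline still reported: keep the entry so it is re-sent next round
            } else {
                i.remove();                              // non-close actions consumed once
            }
            drained.add(e.getKey());
        }
        System.out.println(drained);       // both actions go out this round
        System.out.println(map.keySet());  // only the close action is retained
    }
}
```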
private final HddsProtos.PipelineID pipelineID;
private final PipelineAction.Action action;

PipelineKey(HddsProtos.PipelineID pipelineID, PipelineAction.Action action) {
We may pass PipelineAction instead.

PipelineKey(PipelineAction p) {
  this.pipelineID = p.getClosePipeline().getPipelineID();
  this.action = p.getAction();
}
public HddsProtos.PipelineID getPipelineID() {
  return pipelineID;
}

public PipelineAction.Action getAction() {
  return action;
}
We may have an equalsId method instead.

boolean equalsId(PipelineReport report) {
  return pipelineID.equals(report.getPipelineID());
}
Thanks @szetszwo .
I have updated them all.
+1 the change looks good.
@jianghuazhu @szetszwo Thanks for the contribution! I don't have any issues with this PR, but I don't understand why this change resolves the blocking issue. Why did the original code cause reporting to block? Sorry, I missed some of the JIRA discussion, but I have a general understanding of the PR changes. LGTM +1.
@@ -112,7 +119,7 @@ public class StateContext {
   private final Map<InetSocketAddress, List<Message>>
       incrementalReportsQueue;
   private final Map<InetSocketAddress, Queue<ContainerAction>> containerActions;
@jianghuazhu Will this issue also occur with container reporting? cc: @szetszwo
Probably yes? The code looks similar.
For this incident, we did not find any exceptions related to containerActions. Therefore, I did not improve containerActions.
Do you think it is necessary to improve it together? @szetszwo .
Agree. If there is a need for fixing containerActions, let's do it separately.
Thank you for your contributions! We have updated the code and its readability has improved, which is a great thing. The code for container reporting and pipeline reporting is similar, but we have not encountered issues with the container code. Given that the number of containers is much larger than the number of pipelines, does this imply that the container code is free of this problem?
Upon careful consideration of the differences between container and pipeline reporting, I personally suspect the issue might be related to the Ratis state management in the pipeline. We have identified some details and will be submitting an issue. I hope to continue discussing this with you. @szetszwo
Regarding lifeline reporting, I understand that this is a standard operation in HDFS. However, I have concerns about the current implementation of this feature.
For example, if a pipeline causes a DataNode (DN) to become unavailable—meaning the DN cannot provide data services and the client cannot retrieve data from it—then marking the DN as DEAD is reasonable.
However, if there is a lifeline, the DN may appear to be healthy even though it is actually not, which can prevent maintenance personnel from detecting the issue.
Lifeline reporting is more suitable for scenarios where heavy operations impact the heartbeat, but the heartbeat can recover once the heavy operation is complete. Both pipeline and container reporting are lightweight, and from my perspective, I haven't observed these reports causing any significant load on the SCM.
cc: @ChenSammi
@@ -988,4 +947,79 @@ public DatanodeQueueMetrics getQueueMetrics() {
   public String getThreadNamePrefix() {
     return threadNamePrefix;
   }
+
+  static class ActionMap {
ActionMap -> PipelineActionMap? Wouldn't it be better if we named it PipelineActionMap?
Thanks @slfan1989 .
I have updated it.
@szetszwo @slfan1989 @jianghuazhu Here is some information about Ratis, and details of the problem. Perhaps the cause of this problem is Ratis: https://issues.apache.org/jira/browse/RATIS-2143
@slfan1989 , @weimingdiit , thanks for filing RATIS-2143. Let's continue the discussion there.
@slfan1989 , thanks also for reviewing this!
* master: (50 commits)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  HDDS-11340. Avoid extra PubBlock call when a full block is closed (apache#7094)
  HDDS-11155. Improve Volumes page UI (apache#7048)
  HDDS-11324. Negative value preOpLatencyMs in DN audit log (apache#7093)
  HDDS-11246. [Recon] Use optional chaining instead of explicit undefined check for Objects in Container and Pipeline Module. (apache#7037)
  HDDS-11323. Mark TestLeaseRecovery as flaky
  HDDS-11338. Bump zstd-jni to 1.5.6-4 (apache#7085)
  HDDS-11337. Bump Spring Framework to 5.3.39 (apache#7084)
  HDDS-11327. [hsync] Revert config default ozone.fs.hsync.enabled to false (apache#7079)
  HDDS-11325. Mark testWriteMoreThanMaxFlushSize as flaky
  HDDS-11336. Bump slf4j to 2.0.16 (apache#7086)
  HDDS-11335. Bump exec-maven-plugin to 3.4.1 (apache#7087)
  ...
Conflicts:
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
…an-on-error
* HDDS-10239-container-reconciliation: (428 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...
Conflicts:
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
…rrupt-files
* HDDS-10239-container-reconciliation: (61 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...
Conflicts:
  hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
(cherry picked from commit fb43023)
What changes were proposed in this pull request?
In some cases, Datanodes were unable to report to the SCM for a long time because StateContext#pipelineActions became stuck. This PR attempts to fix that.
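As the diff in the review shows, the per-endpoint Queue&lt;PipelineAction&gt; is replaced by a LinkedHashMap keyed by (pipelineID, action). A minimal sketch (with `String` as a hypothetical stand-in for the real key and action types) of why this bounds the structure: repeated close actions for the same pipeline collapse into a single entry via `putIfAbsent`, instead of accumulating without limit in a queue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DedupSketch {
    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        // Simulate the same close action being raised many times for one pipeline.
        for (int i = 0; i < 1000; i++) {
            map.putIfAbsent("pipeline-1/CLOSE", "close#" + i); // only the first insert wins
        }
        System.out.println(map.size());                  // stays at 1, not 1000
        System.out.println(map.get("pipeline-1/CLOSE")); // the original action is kept
    }
}
```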
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-11331
How was this patch tested?
Need to ensure CI execution is successful.