
HDDS-11331. Fix Datanode unable to report for a long time #7090

Merged
merged 4 commits into apache:master on Aug 21, 2024

Conversation

jianghuazhu
Contributor

What changes were proposed in this pull request?

In some cases, Datanodes were unable to report to the SCM for a long time because StateContext#pipelineActions became stuck. This PR fixes that.
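
At a high level, the review patch below replaces the per-endpoint Queue<PipelineAction> with a LinkedHashMap keyed by (pipelineID, action): putIfAbsent deduplicates repeated actions, and membership checks no longer need a linear scan over a queue. The following is only a minimal, self-contained sketch of that idea, using simplified stand-in types rather than the actual Ozone classes:

  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.Objects;

  // Sketch only: Key and the String payload stand in for the real
  // PipelineKey and PipelineAction types.
  public class PipelineActionDedupSketch {
    enum Action { CLOSE }

    static final class Key {
      private final String pipelineId;
      private final Action action;

      Key(String pipelineId, Action action) {
        this.pipelineId = pipelineId;
        this.action = action;
      }

      @Override
      public boolean equals(Object obj) {
        if (!(obj instanceof Key)) {
          return false;
        }
        final Key that = (Key) obj;
        return action == that.action && pipelineId.equals(that.pipelineId);
      }

      @Override
      public int hashCode() {
        return Objects.hash(pipelineId, action);
      }
    }

    private final Map<Key, String> actions = new LinkedHashMap<>();

    synchronized void add(String pipelineId, Action action, String payload) {
      // putIfAbsent keeps the first occurrence; re-adding the same close
      // action for the same pipeline is a no-op, so the map stays bounded.
      actions.putIfAbsent(new Key(pipelineId, action), payload);
    }

    synchronized int size() {
      return actions.size();
    }

    public static void main(String[] args) {
      final PipelineActionDedupSketch sketch = new PipelineActionDedupSketch();
      for (int i = 0; i < 1000; i++) {
        sketch.add("pipeline-1", Action.CLOSE, "close");
      }
      System.out.println(sketch.size()); // prints 1, not 1000
    }
  }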

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11331

How was this patch tested?

Need to ensure CI execution is successful.

@jianghuazhu
Contributor Author

CI: https://github.com/jianghuazhu/ozone/actions/runs/10433754194
@szetszwo, can you help review this PR?
Thanks.

@adoroszlai requested a review from szetszwo August 18, 2024 07:35
Contributor

@szetszwo left a comment


@jianghuazhu, thanks for working on this! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13070970/7090_review.patch

final Map<PipelineKey, PipelineAction> actionsForEndpoint =
    pipelineActions.get(endpoint);
synchronized (actionsForEndpoint) {
  if (actionsForEndpoint.values().stream().noneMatch(
Contributor


Since it is a map, we don't need to call stream().
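
For illustration, with the keyed map the check-then-add can collapse into a single call; a hedged sketch, assuming the PipelineKey type and the actionsForEndpoint map from the review patch:

  final PipelineKey key = new PipelineKey(pipelineAction);
  synchronized (actionsForEndpoint) {
    // A direct keyed lookup replaces the linear
    // values().stream().noneMatch(...) scan: putIfAbsent both checks
    // whether the key is present and inserts the action when absent.
    actionsForEndpoint.putIfAbsent(key, pipelineAction);
  }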

@@ -112,7 +115,7 @@ public class StateContext {
  private final Map<InetSocketAddress, List<Message>>
      incrementalReportsQueue;
  private final Map<InetSocketAddress, Queue<ContainerAction>> containerActions;
- private final Map<InetSocketAddress, Queue<PipelineAction>> pipelineActions;
+ private final Map<InetSocketAddress, LinkedHashMap<PipelineKey, PipelineAction>> pipelineActions;
Contributor


Let's also add a new class for the inner map, so the synchronization is easier to see.

  static class ActionMap {
    private final LinkedHashMap<PipelineKey, PipelineAction> map = new LinkedHashMap<>();

    synchronized int size() {
      return map.size();
    }

    synchronized void putIfAbsent(PipelineKey key, PipelineAction pipelineAction) {
      map.putIfAbsent(key, pipelineAction);
    }

    synchronized List<PipelineAction> getActions(List<PipelineReport> reports, int max) {
      if (map.isEmpty()) {
        return Collections.emptyList();
      }
      final List<PipelineAction> pipelineActionList = new ArrayList<>();
      final int limit = Math.min(map.size(), max);
      final Iterator<Map.Entry<PipelineKey, PipelineAction>> i = map.entrySet().iterator();
      for (int count = 0; count < limit && i.hasNext(); count++) {
        final Map.Entry<PipelineKey, PipelineAction> entry = i.next();
        final PipelineAction action = entry.getValue();

        // Keep the closePipeline action in the map until the pipeline
        // is closed and removed from the DN.
        if (action.hasClosePipeline()) {
          if (reports.stream().noneMatch(entry.getKey()::equalsId)) {
            // pipeline is removed from the DN, this action is no longer needed.
            i.remove();
            continue;
          }
          // pipeline is closed but not yet removed from the DN.
        } else {
          i.remove();
        }
        pipelineActionList.add(action);
      }
      return pipelineActionList;
    }
  }
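
A hypothetical caller in StateContext might then use the class like this (field and method names here are illustrative, not necessarily those in the final patch; assumes java.util.concurrent.ConcurrentHashMap and java.util.Collections):

  private final Map<InetSocketAddress, ActionMap> pipelineActions =
      new ConcurrentHashMap<>();

  void addPipelineActionIfAbsent(InetSocketAddress endpoint, PipelineAction action) {
    // Each endpoint gets its own ActionMap; the map's synchronized
    // methods guard the per-endpoint state.
    pipelineActions.computeIfAbsent(endpoint, e -> new ActionMap())
        .putIfAbsent(new PipelineKey(action), action);
  }

  List<PipelineAction> getPendingPipelineActions(InetSocketAddress endpoint,
      List<PipelineReport> reports, int maxLimit) {
    final ActionMap actionsForEndpoint = pipelineActions.get(endpoint);
    return actionsForEndpoint == null ? Collections.emptyList()
        : actionsForEndpoint.getActions(reports, maxLimit);
  }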

private final HddsProtos.PipelineID pipelineID;
private final PipelineAction.Action action;

PipelineKey(HddsProtos.PipelineID pipelineID, PipelineAction.Action action) {
Contributor


We may pass PipelineAction instead.

    PipelineKey(PipelineAction p) {
      this.pipelineID = p.getClosePipeline().getPipelineID();
      this.action = p.getAction();
    }
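
For PipelineKey to work correctly as a map key, it also needs matching equals and hashCode implementations; a plausible sketch (assuming java.util.Objects; the actual patch may differ):

    @Override
    public boolean equals(Object obj) {
      if (this == obj) {
        return true;
      }
      if (!(obj instanceof PipelineKey)) {
        return false;
      }
      final PipelineKey that = (PipelineKey) obj;
      return this.action == that.action
          && this.pipelineID.equals(that.pipelineID);
    }

    @Override
    public int hashCode() {
      return Objects.hash(pipelineID, action);
    }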

Comment on lines 1012 to 1018
    public HddsProtos.PipelineID getPipelineID() {
      return pipelineID;
    }

    public PipelineAction.Action getAction() {
      return action;
    }
Contributor


We may have an equalsId method instead.

    boolean equalsId(PipelineReport report) {
      return pipelineID.equals(report.getPipelineID());
    }

Contributor Author


Thanks @szetszwo.
I have updated them all.

Contributor

@szetszwo left a comment


+1 the change looks good.

@slfan1989
Contributor

slfan1989 commented Aug 20, 2024

@jianghuazhu @szetszwo Thanks for the contribution! I don't have any issues with this PR, but I don't understand why this change resolves the blocking issue. Why did the original code cause reporting to block?

Sorry, I missed some of the JIRA discussion, but I have a general understanding of the PR changes.

LGTM +1.

@@ -112,7 +119,7 @@ public class StateContext {
  private final Map<InetSocketAddress, List<Message>>
      incrementalReportsQueue;
  private final Map<InetSocketAddress, Queue<ContainerAction>> containerActions;
Contributor


@jianghuazhu Will this issue also occur with container reporting? cc: @szetszwo

Contributor


Probably yes? The code looks similar.

Contributor Author


For this incident, we did not find any problems related to containerActions, so I did not change it. Do you think it needs to be improved as well? @szetszwo

Contributor


Agree. If there is a need for fixing containerActions, let's do it separately.

Contributor

@slfan1989 Aug 21, 2024


Thank you for your contributions! The updated code is more readable, which is great. The container-reporting code is similar to the pipeline-reporting code, yet we have not encountered issues with it, even though the number of containers is much larger than the number of pipelines. Does that imply this segment of the code has no problems?

Upon careful consideration of the differences between container and pipeline reporting, I personally suspect that the issue might be related to the Ratis state management in the pipeline. We have identified some details and will be submitting an issue. I hope to continue discussing this with you. @szetszwo

Contributor

@slfan1989 Aug 21, 2024


Regarding lifeline reporting, I understand that this is a standard operation in HDFS. However, I have concerns about the current implementation of this feature.

For example, if a pipeline causes a DataNode (DN) to become unavailable, meaning the DN cannot serve data and clients cannot read from it, then marking the DN as DEAD is reasonable.
However, with a lifeline the DN may appear healthy even though it is not, which can prevent maintenance personnel from detecting the issue.

Lifeline reporting is more suitable for scenarios where heavy operations impact the heartbeat, but the heartbeat can recover once the heavy operation is complete. Both pipeline and container reporting are lightweight, and from my perspective, I haven't observed these reports causing any significant load on the SCM.

cc: @ChenSammi

@@ -988,4 +947,79 @@ public DatanodeQueueMetrics getQueueMetrics() {
  public String getThreadNamePrefix() {
    return threadNamePrefix;
  }

static class ActionMap {
Contributor

@slfan1989 Aug 20, 2024


ActionMap -> PipelineActionMap? Wouldn't it be better to name it PipelineActionMap?

Contributor Author


Thanks @slfan1989.
I have updated it.

@weimingdiit
Contributor

weimingdiit commented Aug 21, 2024

@szetszwo @slfan1989 @jianghuazhu Here is some information about Ratis and details of the problem; the cause may be in Ratis: https://issues.apache.org/jira/browse/RATIS-2143

@szetszwo
Contributor

@slfan1989, @weimingdiit, thanks for filing RATIS-2143. Let's continue the discussion there.

@szetszwo szetszwo merged commit fb43023 into apache:master Aug 21, 2024
39 checks passed
@szetszwo
Contributor

@slfan1989, thanks also for reviewing this!

errose28 added a commit to errose28/ozone that referenced this pull request Aug 21, 2024
* master: (50 commits)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  HDDS-11340. Avoid extra PubBlock call when a full block is closed (apache#7094)
  HDDS-11155. Improve Volumes page UI (apache#7048)
  HDDS-11324. Negative value preOpLatencyMs in DN audit log (apache#7093)
  HDDS-11246. [Recon] Use optional chaining instead of explicit undefined check for Objects in Container and Pipeline Module. (apache#7037)
  HDDS-11323. Mark TestLeaseRecovery as flaky
  HDDS-11338. Bump zstd-jni to 1.5.6-4 (apache#7085)
  HDDS-11337. Bump Spring Framework to 5.3.39 (apache#7084)
  HDDS-11327. [hsync] Revert config default ozone.fs.hsync.enabled to false (apache#7079)
  HDDS-11325. Mark testWriteMoreThanMaxFlushSize as flaky
  HDDS-11336. Bump slf4j to 2.0.16 (apache#7086)
  HDDS-11335. Bump exec-maven-plugin to 3.4.1 (apache#7087)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
errose28 added a commit to errose28/ozone that referenced this pull request Aug 26, 2024
…an-on-error

* HDDS-10239-container-reconciliation: (428 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
errose28 added a commit to errose28/ozone that referenced this pull request Aug 28, 2024
…rrupt-files

* HDDS-10239-container-reconciliation: (61 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 16, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 18, 2024