
HDDS-11331. Fix Datanode unable to report for a long time #7090

Merged
merged 4 commits into apache:master on Aug 21, 2024

Conversation

jianghuazhu
Contributor

What changes were proposed in this pull request?

In some cases, Datanodes were unable to report to the SCM for a long time because StateContext#pipelineActions became stuck. This PR fixes that.
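
At a high level, the review patch below replaces the per-endpoint Queue<PipelineAction> with a LinkedHashMap keyed by (pipelineID, action): putIfAbsent deduplicates repeated actions, and membership checks no longer need a linear scan over a queue. The following is only a minimal, self-contained sketch of that idea, using simplified stand-in types rather than the actual Ozone classes:

  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.Objects;

  // Sketch only: Key and the String payload stand in for the real
  // PipelineKey and PipelineAction types.
  public class PipelineActionDedupSketch {
    enum Action { CLOSE }

    static final class Key {
      private final String pipelineId;
      private final Action action;

      Key(String pipelineId, Action action) {
        this.pipelineId = pipelineId;
        this.action = action;
      }

      @Override
      public boolean equals(Object obj) {
        if (!(obj instanceof Key)) {
          return false;
        }
        final Key that = (Key) obj;
        return action == that.action && pipelineId.equals(that.pipelineId);
      }

      @Override
      public int hashCode() {
        return Objects.hash(pipelineId, action);
      }
    }

    private final Map<Key, String> actions = new LinkedHashMap<>();

    synchronized void add(String pipelineId, Action action, String payload) {
      // putIfAbsent keeps the first occurrence; re-adding the same close
      // action for the same pipeline is a no-op, so the map stays bounded.
      actions.putIfAbsent(new Key(pipelineId, action), payload);
    }

    synchronized int size() {
      return actions.size();
    }

    public static void main(String[] args) {
      final PipelineActionDedupSketch sketch = new PipelineActionDedupSketch();
      for (int i = 0; i < 1000; i++) {
        sketch.add("pipeline-1", Action.CLOSE, "close");
      }
      System.out.println(sketch.size()); // prints 1, not 1000
    }
  }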

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11331

How was this patch tested?

Need to ensure CI execution is successful.

@jianghuazhu
Contributor Author

CI: https://github.com/jianghuazhu/ozone/actions/runs/10433754194
@szetszwo, can you help review this PR?
Thanks.

@adoroszlai requested a review from szetszwo August 18, 2024 07:35
Contributor

@szetszwo left a comment


@jianghuazhu, thanks for working on this! Please see the comments inlined and also https://issues.apache.org/jira/secure/attachment/13070970/7090_review.patch

final Map<PipelineKey, PipelineAction> actionsForEndpoint =
    pipelineActions.get(endpoint);
synchronized (actionsForEndpoint) {
  if (actionsForEndpoint.values().stream().noneMatch(
Contributor


Since it is a map, we don't need to call stream().
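
For illustration, with the keyed map the check-then-add can collapse into a single call; a hedged sketch, assuming the PipelineKey type and the actionsForEndpoint map from the review patch:

  final PipelineKey key = new PipelineKey(pipelineAction);
  synchronized (actionsForEndpoint) {
    // A direct keyed lookup replaces the linear
    // values().stream().noneMatch(...) scan: putIfAbsent both checks
    // whether the key is present and inserts the action when absent.
    actionsForEndpoint.putIfAbsent(key, pipelineAction);
  }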

@@ -112,7 +115,7 @@ public class StateContext {
  private final Map<InetSocketAddress, List<Message>>
      incrementalReportsQueue;
  private final Map<InetSocketAddress, Queue<ContainerAction>> containerActions;
- private final Map<InetSocketAddress, Queue<PipelineAction>> pipelineActions;
+ private final Map<InetSocketAddress, LinkedHashMap<PipelineKey, PipelineAction>> pipelineActions;
Contributor


Let's also add a new class for the inner map, so the synchronization is easier to see.

  static class ActionMap {
    private final LinkedHashMap<PipelineKey, PipelineAction> map = new LinkedHashMap<>();

    synchronized int size() {
      return map.size();
    }

    synchronized void putIfAbsent(PipelineKey key, PipelineAction pipelineAction) {
      map.putIfAbsent(key, pipelineAction);
    }

    synchronized List<PipelineAction> getActions(List<PipelineReport> reports, int max) {
      if (map.isEmpty()) {
        return Collections.emptyList();
      }
      final List<PipelineAction> pipelineActionList = new ArrayList<>();
      final int limit = Math.min(map.size(), max);
      final Iterator<Map.Entry<PipelineKey, PipelineAction>> i = map.entrySet().iterator();
      for (int count = 0; count < limit && i.hasNext(); count++) {
        final Map.Entry<PipelineKey, PipelineAction> entry = i.next();
        final PipelineAction action = entry.getValue();

        // Keep the closePipeline action in the map until the pipeline
        // is closed and removed from the DN.
        if (action.hasClosePipeline()) {
          if (reports.stream().noneMatch(entry.getKey()::equalsId)) {
            // pipeline is removed from the DN, this action is no longer needed.
            i.remove();
            continue;
          }
          // pipeline is closed but not yet removed from the DN.
        } else {
          i.remove();
        }
        pipelineActionList.add(action);
      }
      return pipelineActionList;
    }
  }
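
A hypothetical caller in StateContext might then use the class like this (field and method names here are illustrative, not necessarily those in the final patch; assumes java.util.concurrent.ConcurrentHashMap and java.util.Collections):

  private final Map<InetSocketAddress, ActionMap> pipelineActions =
      new ConcurrentHashMap<>();

  void addPipelineActionIfAbsent(InetSocketAddress endpoint, PipelineAction action) {
    // Each endpoint gets its own ActionMap; the map's synchronized
    // methods guard the per-endpoint state.
    pipelineActions.computeIfAbsent(endpoint, e -> new ActionMap())
        .putIfAbsent(new PipelineKey(action), action);
  }

  List<PipelineAction> getPendingPipelineActions(InetSocketAddress endpoint,
      List<PipelineReport> reports, int maxLimit) {
    final ActionMap actionsForEndpoint = pipelineActions.get(endpoint);
    return actionsForEndpoint == null ? Collections.emptyList()
        : actionsForEndpoint.getActions(reports, maxLimit);
  }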

private final HddsProtos.PipelineID pipelineID;
private final PipelineAction.Action action;

PipelineKey(HddsProtos.PipelineID pipelineID, PipelineAction.Action action) {
Contributor


We may pass PipelineAction instead.

    PipelineKey(PipelineAction p) {
      this.pipelineID = p.getClosePipeline().getPipelineID();
      this.action = p.getAction();
    }
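
For PipelineKey to work correctly as a map key, it also needs matching equals and hashCode implementations; a plausible sketch (assuming java.util.Objects; the actual patch may differ):

    @Override
    public boolean equals(Object obj) {
      if (this == obj) {
        return true;
      }
      if (!(obj instanceof PipelineKey)) {
        return false;
      }
      final PipelineKey that = (PipelineKey) obj;
      return this.action == that.action
          && this.pipelineID.equals(that.pipelineID);
    }

    @Override
    public int hashCode() {
      return Objects.hash(pipelineID, action);
    }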

Comment on lines 1012 to 1018
    public HddsProtos.PipelineID getPipelineID() {
      return pipelineID;
    }

    public PipelineAction.Action getAction() {
      return action;
    }
Contributor


We may have an equalsId method instead.

    boolean equalsId(PipelineReport report) {
      return pipelineID.equals(report.getPipelineID());
    }

Contributor Author


Thanks @szetszwo.
I have updated them all.

Contributor

@szetszwo left a comment


+1 the change looks good.

@slfan1989
Contributor

slfan1989 commented Aug 20, 2024

@jianghuazhu @szetszwo Thanks for the contribution! I don't have any issues with this PR, but I don't understand why this change resolves the blocking issue. Why did the original code cause reporting to block?

Sorry, I missed some of the JIRA discussion, but I have a general understanding of the PR changes.

LGTM +1.

@@ -112,7 +119,7 @@ public class StateContext {
  private final Map<InetSocketAddress, List<Message>>
      incrementalReportsQueue;
  private final Map<InetSocketAddress, Queue<ContainerAction>> containerActions;
Contributor


@jianghuazhu Will this issue also occur with container reporting? cc: @szetszwo

Contributor


Probably yes? The code looks similar.

Contributor Author


For this incident, we did not find any problems related to containerActions, so I did not change it. Do you think it needs to be improved as well? @szetszwo

Contributor


Agree. If there is a need for fixing containerActions, let's do it separately.

Contributor

@slfan1989 Aug 21, 2024


Thank you for your contributions! The updated code is more readable, which is great. The container-reporting code is similar to the pipeline-reporting code, yet we have not encountered issues with it, even though the number of containers is much larger than the number of pipelines. Does that imply this segment of the code has no problems?

Upon careful consideration of the differences between container and pipeline reporting, I personally suspect that the issue might be related to the Ratis state management in the pipeline. We have identified some details and will be submitting an issue. I hope to continue discussing this with you. @szetszwo

Contributor

@slfan1989 Aug 21, 2024


Regarding lifeline reporting, I understand that this is a standard operation in HDFS. However, I have concerns about the current implementation of this feature.

For example, if a pipeline causes a DataNode (DN) to become unavailable, meaning the DN cannot serve data and clients cannot read from it, then marking the DN as DEAD is reasonable.
However, with a lifeline the DN may appear healthy even though it is not, which can prevent maintenance personnel from detecting the issue.

Lifeline reporting is more suitable for scenarios where heavy operations impact the heartbeat, but the heartbeat can recover once the heavy operation is complete. Both pipeline and container reporting are lightweight, and from my perspective, I haven't observed these reports causing any significant load on the SCM.

cc: @ChenSammi

@@ -988,4 +947,79 @@ public DatanodeQueueMetrics getQueueMetrics() {
  public String getThreadNamePrefix() {
    return threadNamePrefix;
  }

static class ActionMap {
Contributor

@slfan1989 Aug 20, 2024


ActionMap -> PipelineActionMap? Wouldn't it be better to name it PipelineActionMap?

Contributor Author


Thanks @slfan1989.
I have updated it.

@weimingdiit
Contributor

weimingdiit commented Aug 21, 2024

@szetszwo @slfan1989 @jianghuazhu Here is some information about Ratis and details of the problem; the cause may be in Ratis: https://issues.apache.org/jira/browse/RATIS-2143

@szetszwo
Contributor

@slfan1989, @weimingdiit, thanks for filing RATIS-2143. Let's continue the discussion there.

@szetszwo szetszwo merged commit fb43023 into apache:master Aug 21, 2024
39 checks passed
@szetszwo
Contributor

@slfan1989, thanks also for reviewing this!

errose28 added a commit to errose28/ozone that referenced this pull request Aug 21, 2024
* master: (50 commits)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  HDDS-11340. Avoid extra PubBlock call when a full block is closed (apache#7094)
  HDDS-11155. Improve Volumes page UI (apache#7048)
  HDDS-11324. Negative value preOpLatencyMs in DN audit log (apache#7093)
  HDDS-11246. [Recon] Use optional chaining instead of explicit undefined check for Objects in Container and Pipeline Module. (apache#7037)
  HDDS-11323. Mark TestLeaseRecovery as flaky
  HDDS-11338. Bump zstd-jni to 1.5.6-4 (apache#7085)
  HDDS-11337. Bump Spring Framework to 5.3.39 (apache#7084)
  HDDS-11327. [hsync] Revert config default ozone.fs.hsync.enabled to false (apache#7079)
  HDDS-11325. Mark testWriteMoreThanMaxFlushSize as flaky
  HDDS-11336. Bump slf4j to 2.0.16 (apache#7086)
  HDDS-11335. Bump exec-maven-plugin to 3.4.1 (apache#7087)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
errose28 added a commit to errose28/ozone that referenced this pull request Aug 26, 2024
…an-on-error

* HDDS-10239-container-reconciliation: (428 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
errose28 added a commit to errose28/ozone that referenced this pull request Aug 28, 2024
…rrupt-files

* HDDS-10239-container-reconciliation: (61 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 16, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Sep 18, 2024