HDDS-10614. Avoid decreasing cached space usage below zero #6508

ArafatKhan2198 · 2024-04-10T10:16:44Z

What changes were proposed in this pull request?

The root cause appears to be an error during the refresh operation of CachingSpaceUsageSource. The underlying SpaceUsageSource (likely an instance of DU, which runs the Unix du command to calculate disk usage) fails with a permission error when it tries to read the /data3/lost+found directory. This failure may cause getUsedSpace() to return an incorrect value (possibly zero), which becomes negative when it is later decremented.

This PR adds error handling and validation to the CachingSpaceUsageSource class. It validates new values before updating the cache, so that used space never becomes negative, and it handles exceptions, including UncheckedIOException, by keeping the last known good value and logging the error. As a result, temporary problems such as permission errors do not lead to invalid state transitions or data corruption.

We catch UncheckedIOException because it signals a failure while reading or writing data, and it is the exception we observed while calculating disk space usage. It wraps the lower-level IOException, so it is a clear sign that an I/O operation, on which accurate space tracking depends, went wrong.
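To make the intended behaviour concrete, here is a minimal sketch of the two safeguards described above. It is not the code in this PR: the class name CachedUsageSketch, the LongSupplier standing in for the du-backed SpaceUsageSource, and the log wording are all assumptions made for illustration.

```java
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for CachingSpaceUsageSource; not the Ozone class. */
class CachedUsageSketch {
  // Assumed measurement source, e.g. something that shells out to du.
  private final LongSupplier source;
  private final AtomicLong cachedUsed = new AtomicLong();

  CachedUsageSketch(LongSupplier source, long initialUsed) {
    this.source = source;
    this.cachedUsed.set(Math.max(0L, initialUsed));
  }

  long getUsedSpace() {
    return cachedUsed.get();
  }

  void incrementUsedSpace(long bytes) {
    cachedUsed.addAndGet(bytes);
  }

  /** Subtracts from the cached value, but never lets it drop below zero. */
  void decrementUsedSpace(long bytes) {
    cachedUsed.updateAndGet(current -> Math.max(0L, current - bytes));
  }

  /** Re-measures usage; keeps the last known good value if measuring fails. */
  void refresh() {
    try {
      long fresh = source.getAsLong();
      if (fresh >= 0) {
        cachedUsed.set(fresh);
      } else {
        System.err.println("Ignoring negative used space: " + fresh);
      }
    } catch (UncheckedIOException e) {
      // For example, "Permission denied" while du walks lost+found.
      System.err.println("Refresh failed, keeping cached value: "
          + e.getMessage());
    }
  }
}
```

With this shape, a failed refresh leaves getUsedSpace() unchanged, and a decrement larger than the cached value yields 0 instead of a negative number.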

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10614

How was this patch tested?

CI ran green: https://github.com/ArafatKhan2198/ozone/actions/runs/8627744703
I will add unit tests if the approach is correct.

ArafatKhan2198 · 2024-04-10T10:17:18Z

@devmadhuu @adoroszlai Could you please take a look?

adoroszlai

Thanks @ArafatKhan2198 for working on this.

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java

devmadhuu

Thanks @ArafatKhan2198 for working on this patch. Pls check and handle the comments.

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java

hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/fs/TestCachingSpaceUsageSource.java

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java

adoroszlai · 2024-04-18T16:36:28Z

@ArafatKhan2198 @devmadhuu This PR hopefully fixes the source of negative space on the Datanode. Does Recon need any additional fix? Is space usage stored in the Recon DB, which might prevent startup if Recon has already saved an invalid value?

adoroszlai · 2024-04-18T19:23:19Z

Thanks @ArafatKhan2198 for the fix, @devmadhuu for the review.

ArafatKhan2198 · 2024-04-19T11:23:59Z

@adoroszlai

When ClusterStateEndpoint in Recon requests the cluster-wide storage statistics, ReconNodeManager aggregates the individual Datanode statistics to calculate the total capacity, used space, and remaining space for the entire cluster.
ReconNodeManager exposes methods like getStats(), which returns an object (e.g., SCMNodeStat) containing the aggregated storage statistics. ClusterStateEndpoint then uses these aggregated statistics to create the DatanodeStorageReport and includes it in the ClusterStateResponse.

In summary, Recon gets the storage information from the periodic heartbeat messages sent by the individual Datanodes in the cluster. ReconNodeManager aggregates these per-Datanode statistics to provide the cluster-wide storage information to other components such as ClusterStateEndpoint.

This is my understanding from the code. We do not store the space usage in any of Recon's DBs (Derby or RocksDB).
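As a rough illustration of the aggregation described above, the sketch below sums per-Datanode reports in memory on each request. The names ClusterStatsSketch and NodeStat are hypothetical; they are not the real ReconNodeManager or SCMNodeStat APIs, and the fields are only assumed to mirror what a heartbeat carries.

```java
import java.util.List;

/** Hypothetical in-memory aggregation; not the Recon implementation. */
class ClusterStatsSketch {

  /** Assumed shape of one Datanode's storage report, in bytes. */
  static final class NodeStat {
    final long capacity;
    final long used;
    final long remaining;

    NodeStat(long capacity, long used, long remaining) {
      this.capacity = capacity;
      this.used = used;
      this.remaining = remaining;
    }
  }

  /** Sums the latest report of every Datanode into cluster-wide totals. */
  static NodeStat aggregate(List<NodeStat> latestReports) {
    long capacity = 0;
    long used = 0;
    long remaining = 0;
    for (NodeStat report : latestReports) {
      capacity += report.capacity;
      used += report.used;
      remaining += report.remaining;
    }
    return new NodeStat(capacity, used, remaining);
  }
}
```

Because the totals in this sketch are recomputed from the latest reports and nothing is persisted, a corrected value from a Datanode would replace an earlier invalid one in the next aggregation.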

@devmadhuu @dombizita please correct me if I am making a wrong assumption anywhere.

(cherry picked from commit cc023e7)

HDDS-10614. Recon fails to start with used space cannot be negative.

9936b6b

adoroszlai requested changes Apr 10, 2024

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java (outdated)

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java (outdated)

ArafatKhan2198 added 3 commits April 16, 2024 16:48

Moved the negative check logic to decrementUsedSpace

6771216

Removed unused import

25e632c

Added Unit test

5cc86d1

ivandika3 added recon and removed recon labels Apr 16, 2024

adoroszlai reviewed Apr 16, 2024

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java (outdated)

devmadhuu reviewed Apr 16, 2024

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java (outdated)

Made review comment changes

d0ff3f5

adoroszlai reviewed Apr 17, 2024

hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/fs/TestCachingSpaceUsageSource.java (outdated)

Fixed the failing UT

b2fc96d

ArafatKhan2198 requested review from adoroszlai and devmadhuu April 17, 2024 06:00

adoroszlai reviewed Apr 17, 2024

hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/fs/TestCachingSpaceUsageSource.java (outdated)

hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/fs/TestCachingSpaceUsageSource.java (outdated)

ArafatKhan2198 added 2 commits April 17, 2024 20:08

Made changes to the assertions and did some refactoring

fc06d39

Fixed Checkstyle

d341102

adoroszlai reviewed Apr 17, 2024

hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/fs/TestCachingSpaceUsageSource.java (outdated)

ArafatKhan2198 added 2 commits April 17, 2024 21:49

Fixed bug

05081d6

Removed unnecessary change

f58f3d2

ArafatKhan2198 requested a review from adoroszlai April 18, 2024 05:05

adoroszlai reviewed Apr 18, 2024

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java (outdated)

ArafatKhan2198 added 2 commits April 18, 2024 19:40

Added condition for logging

8f56158

Added the data volume also to the log message

04e9dcf

adoroszlai reviewed Apr 18, 2024

hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/CachingSpaceUsageSource.java (outdated)

Remove unnecessary toString

32c3809

adoroszlai approved these changes Apr 18, 2024

adoroszlai merged commit cc023e7 into apache:master Apr 18, 2024
39 checks passed

adoroszlai changed the title ~~HDDS-10614. Recon fails to start with used space cannot be negative.~~ HDDS-10614. Avoid decreasing cached space usage below zero Apr 18, 2024

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024

HDDS-10614. Avoid decreasing cached space usage below zero (apache#6508)

9b66caf

(cherry picked from commit cc023e7)

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024

HDDS-10614. Avoid decreasing cached space usage below zero (apache#6508)

a267593

(cherry picked from commit cc023e7)

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024

HDDS-10614. Avoid decreasing cached space usage below zero (apache#6508)

ed8979d

(cherry picked from commit cc023e7)

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024

HDDS-10614. Avoid decreasing cached space usage below zero (apache#6508)

815856b

(cherry picked from commit cc023e7)

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024

HDDS-10614. Avoid decreasing cached space usage below zero (apache#6508)

ae5daa8

(cherry picked from commit cc023e7)

xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024

HDDS-10614. Avoid decreasing cached space usage below zero (apache#6508)

392dc66

(cherry picked from commit cc023e7)

xichen01 mentioned this pull request Jul 18, 2024

[DO NOT MERGE] Backport some fixes, performance optimizations from master to ozone-1.4 #6929 #6964

Merged
