HDDS-10614. Avoid decreasing cached space usage below zero #6508
@devmadhuu @adoroszlai Could you please take a look?
Thanks @ArafatKhan2198 for working on this.
Thanks @ArafatKhan2198 for working on this patch. Please check and address the comments.
@ArafatKhan2198 @devmadhuu This PR hopefully fixes the source of the negative space values at the Datanode. Does Recon need any additional fix? Is space usage stored in the Recon DB, which might prevent startup if Recon has already saved an invalid value?
Thanks @ArafatKhan2198 for the fix, @devmadhuu for the review.
In summary, Recon gets its storage information from the periodic heartbeat messages sent by the individual Datanodes in the cluster. The ReconNodeManager aggregates these per-Datanode statistics to provide cluster-wide storage information to other components such as the ClusterStateEndpoint. This is my understanding from the code. We do not store the space usage in any of the Recon DBs (Derby or RocksDB). @devmadhuu @dombizita please correct me if I am making a wrong assumption anywhere.
(cherry picked from commit cc023e7)
What changes were proposed in this pull request?
The root cause seems to be an error during the refresh operation of the CachingSpaceUsageSource. Specifically, the underlying SpaceUsageSource (likely an instance of DU, which uses the Unix du command to calculate disk usage) is failing due to a permission issue when trying to read the /data3/lost+found directory. This failure might cause the getUsedSpace() method to return an incorrect value (possibly zero), which, when decremented, results in a negative value.

This PR introduces error handling and validation in the CachingSpaceUsageSource class to ensure data integrity. Specifically, it prevents negative values for used space by validating new values before updating the cache, and it handles exceptions, including UncheckedIOException, by maintaining the last known good value and logging errors. These changes ensure that temporary issues, such as permission errors, do not result in invalid state transitions or data corruption.

We catch UncheckedIOException because it indicates that a problem occurred when the program tried to read or write data, and we saw it during operations such as calculating disk space usage. This exception wraps lower-level I/O errors, making it a clear sign that something went wrong with the I/O operations that accurate disk space tracking depends on.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10614
How was this patch tested?
CI ran green: https://github.com/ArafatKhan2198/ozone/actions/runs/8627744703
Unit tests will be added if the approach is confirmed to be correct.
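
As a starting point for those tests, the behaviour described under the proposed changes could be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class name CachedUsageSketch, the use of a LongSupplier as the underlying source, and the clamping in decUsedSpace are all assumptions made for the example (the clamping is inferred from the PR title only).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for illustration; not the real CachingSpaceUsageSource. */
class CachedUsageSketch {
  // Assumed source of fresh measurements, e.g. something wrapping a du-style scan.
  private final LongSupplier source;
  // Last known good value of used space, in bytes.
  private final AtomicLong cachedValue = new AtomicLong();

  CachedUsageSketch(LongSupplier source) {
    this.source = source;
  }

  long getUsedSpace() {
    return cachedValue.get();
  }

  /** Subtracts reclaimed bytes from the cache, clamping the result at zero. */
  void decUsedSpace(long reclaimedBytes) {
    cachedValue.updateAndGet(current -> Math.max(0L, current - reclaimedBytes));
  }

  /** Re-reads the source; keeps the previous value if the read fails or is invalid. */
  void refresh() {
    try {
      long fresh = source.getAsLong();
      if (fresh >= 0) {
        cachedValue.set(fresh);
      } else {
        System.err.println("Ignoring negative used space: " + fresh);
      }
    } catch (UncheckedIOException e) {
      // Temporary I/O problems (e.g. permission denied) must not change the cache.
      System.err.println("Refresh failed, keeping cached value: " + e.getMessage());
    }
  }
}
```

A unit test along these lines would refresh once successfully, make the source fail or return a negative value, and then check that getUsedSpace() still reports the earlier value and never drops below zero after a decrement.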