Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-10614. Avoid decreasing cached space usage below zero #6508

Merged
merged 13 commits into from
Apr 18, 2024

Conversation

ArafatKhan2198
Copy link
Contributor

@ArafatKhan2198 ArafatKhan2198 commented Apr 10, 2024

What changes were proposed in this pull request?

The root cause seems to be an error during the refresh operation of the CachingSpaceUsageSource. Specifically, the underlying SpaceUsageSource (likely an instance of DU, which uses the Unix du command to calculate disk usage) is failing due to a permission issue when trying to read the /data3/lost+found directory. This failure might cause the getUsedSpace() method to return an incorrect value (possibly zero), which, when decremented, results in a negative value.

This PR introduces error handling and validation in the CachingSpaceUsageSource class to ensure data integrity. Specifically, it prevents negative values for used space by validating new values before updating the cache and handles exceptions, including UncheckedIOException, by maintaining the last known good value and logging errors. These changes ensure that temporary issues, such as permission errors, do not result in invalid state transitions or data corruption.

We catch UncheckedIOException because it indicates a problem occurred when the program tried to read or write data, and we saw it during operations like calculating disk space usage. This specific exception wraps lower-level errors, making it a clear sign that something went wrong with I/O operations, which are crucial for accurately tracking disk space.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10614

How was this patch tested?

CI ran green :- https://github.com/ArafatKhan2198/ozone/actions/runs/8627744703
Will be adding Unit tests for it if the approach is correct

@ArafatKhan2198
Copy link
Contributor Author

@devmadhuu @adoroszlai Could you place take a look

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 for working on this.

@ivandika3 ivandika3 added recon and removed recon labels Apr 16, 2024
Copy link
Contributor

@devmadhuu devmadhuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArafatKhan2198 for working on this patch. Pls check and handle the comments.

@adoroszlai
Copy link
Contributor

@ArafatKhan2198 @devmadhuu This PR hopefully fixes the source of negative space at Datanode. Does Recon need any additional fix? Is space usage stored in Recon DB, which might prevent startup if Recon already saved invalid value?

@adoroszlai adoroszlai merged commit cc023e7 into apache:master Apr 18, 2024
39 checks passed
@adoroszlai
Copy link
Contributor

Thanks @ArafatKhan2198 for the fix, @devmadhuu for the review.

@adoroszlai adoroszlai changed the title HDDS-10614. Recon fails to start with used space cannot be negative. HDDS-10614. Avoid decreasing cached space usage below zero Apr 18, 2024
@ArafatKhan2198
Copy link
Contributor Author

ArafatKhan2198 commented Apr 19, 2024

Thanks @ArafatKhan2198 for the fix, @devmadhuu for the review.

@adoroszlai

ClusterStateEndpoint in Recon requests the cluster-wide storage statistics, the ReconNodeManager aggregates the individual Datanode statistics to calculate the total capacity, used space, and remaining space for the entire cluster.
The ReconNodeManager exposes methods like getStats(), which returns an object (e.g., SCMNodeStat) containing the aggregated storage statistics. The ClusterStateEndpoint then uses these aggregated statistics to create the DatanodeStorageReport and include it in the ClusterStateResponse.

So, in summary, the Recon component gets the storage information from the periodic heartbeat messages sent by the individual Datanodes in the cluster. The ReconNodeManager aggregates these individual Datanode statistics to provide the cluster-wide storage information to other components like the ClusterStateEndpoint.

This is my understanding from the code. We do not store the space in any of the DB's in recon (Derby or Rocks)

@devmadhuu @dombizita please correct me if I am making a wrong assumption anywhere.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 17, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024
xichen01 pushed a commit to xichen01/ozone that referenced this pull request Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants