- 
                Notifications
    
You must be signed in to change notification settings  - Fork 9.1k
 
HDFS-17801. EC: Reading support retryCurrentNode to avoid transient errors cause application level failures. #7762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b91d1b9    to
    6c144bb      
    Compare
  
    | 
           💔 -1 overall 
 
 This message was automatically generated.  | 
    
| 
           💔 -1 overall 
 
 This message was automatically generated.  | 
    
6c144bb    to
    f52be54      
    Compare
  
    | 
           💔 -1 overall 
 
 
 This message was automatically generated.  | 
    
f52be54    to
    bd73e6c      
    Compare
  
    | 
           🎊 +1 overall 
 
 This message was automatically generated.  | 
    
| 
           Hi, @Hexiaoqiao @KeeProMise @zhangshuyan0 . Could you please help review this PR when you are free ? Thanks a lot.  | 
    
bd73e6c    to
    b89b1d0      
    Compare
  
    | 
           🎊 +1 overall 
 
 This message was automatically generated.  | 
    
b89b1d0    to
    a44f887      
    Compare
  
    …rrors cause application level failures.
| 
           🎊 +1 overall 
 
 This message was automatically generated.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces transient-error resilience to EC read paths by adding a retry mechanism modeled after the 3-replication read implementation. Key changes include:
- Adding unit tests to simulate long idle periods and validate the retry behavior.
 - Updating StripeReader logic to conditionally reset chunk states and use additional reader info checks.
 - Introducing and integrating a retryCurrentReaderFlags mechanism in DFSStripedInputStream to control reader retries.
 
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description | 
|---|---|
| hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSStripedInputStream.java | Added new tests for EC read retry behavior. | 
| hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java | Enhanced handling of missing blocks and integrated logic to account for transient errors. | 
| hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java | Introduced retryCurrentReaderFlags for controlling reader retry behavior in case of transient errors. | 
Comments suppressed due to low confidence (2)
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java:179
- Consider adding a comment explaining the rationale behind comparing countOfNullReaderInfos with parityBlkNum, clarifying how this check differentiates between transient errors and genuine missing blocks.
 
      if (countOfNullReaderInfos(readerInfos) < parityBlkNum) {
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java:233
- Add an inline comment to explain why a reader is skipped only when its corresponding retry flag is false, which will help improve code clarity and maintainability.
 
      if (!retryCurrentReaderFlags[readerIndex]) {
a44f887    to
    299330d      
    Compare
  
    | 
           🎊 +1 overall 
 
 This message was automatically generated.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for your contribution. cc @zhangshuyan0 Would you mind to review again?
| 
           Thanks @hfutatzhanghb report.  | 
    
          
 @haiyang1987 Thank you for reminding this. I will check it.  | 
    
| 
           Hi, @Hexiaoqiao @haiyang1987 @zhangshuyan0 . Sorry for disturbing you . I will close this PR since #5829 have fixed this problem and already used in production envirnment. We can try to push #5829 forward and merge that pr into trunk. Thank you all!  | 
    
| 
           🎊 +1 overall 
 
 This message was automatically generated.  | 
    
Description of PR
Refer to HDFS-17801.
Under the 3-replication read implementation, when an IOException occurs, there is the retryCurrentNode mechanism.
This is very useful to avoid application level failures due to transient errors (e.g. Datanode could have closed the connection because the client is idle for too long). Please refer to below codes :
hadoop/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
Lines 824 to 828 in 6eae158
We should make EC read also support this mechanism.
BTW, this issue is motivated by the failure of our cluster's applications failure when we change the data from 3-rep to EC policy.
How was this patch tested?
Add an unit test.