HDFS-15413. add dfs.client.read.striped.datanode.max.attempts to fix read ecfile timeout #5829
💔 -1 overall
This message was automatically generated.
    
cc @zhangshuyan0 Would you mind taking a look?
    
We should add tests that reproduce these issues, perhaps by setting a lower socket timeout value.
The tests should cover these scenarios:
- A connection to a DN containing a data block is established.
- A connection to a DN containing a parity block is established.
- There are missing or lost nodes in the pipeline.
 
        
          
        
Please also check the checkstyle and blanks issues reported by Yetus. Thanks. @Neilxzn
    
          
I fixed the checkstyle issues and added a unit test. Please review it again. Thanks.
    
💔 -1 overall
This message was automatically generated.
    
@ayushtkn @zhangshuyan0 It looks like the remaining failing checks are unrelated, and the feedback was addressed. Any chance for another look?
    
I am not sure this test reproduces the issue: I reverted the changes in StripeReader, ran the test, and it still passed.
Once that is sorted out, we should add a test where one DN is dead: the same test, but with a DN killed.
        
          
        
Hi @Neilxzn, any progress here? Thanks.
    
Hi @Neilxzn, any chance you have time to finish this up?
    
          
Sorry for the late reply. I have been busy with other things recently. I will try to submit a new unit test tomorrow.
    
💔 -1 overall
This message was automatically generated.

💔 -1 overall
This message was automatically generated.
    
The unit test hadoop.hdfs.TestDFSStripedInputStreamWithTimeout passes in my local development environment, but it fails on GitHub Jenkins.
    
@Neilxzn I tried it, and it fails locally. To reproduce:
    
          
Thank you. I will check it again soon.
    
Hi @Neilxzn, any progress here? Thanks. This PR is still needed; we are seeing similar problems in our environment.
    
        
          
        
💔 -1 overall
This message was automatically generated.
    
Please fix the checkstyle issues, thanks.
    
Should we suppress this checkstyle warning? Or are there any better suggestions?
    
🎊 +1 overall
This message was automatically generated.
    
I believe we've started encountering this issue as well. It would be great to get this in.
    
        
          
        
Hi @Neilxzn. Thanks for reporting this problem. Can we push it forward?
    
        
          
        
Rebase it.
    
🎊 +1 overall
This message was automatically generated.

🎊 +1 overall
This message was automatically generated.
    
We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
    

Description of PR
https://issues.apache.org/jira/browse/HDFS-15413
This PR offers a working patch to fix HDFS-15413. It adds the dfs.client.read.striped.datanode.max.attempts config, which lets users adjust the number of DN retries, to solve the DataNode timeout problem when reading EC files.
How was this patch tested?
No test was added; the patch was only tested in our cluster.
For code changes:
Add the dfs.client.read.striped.datanode.max.attempts config.
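
For readers unfamiliar with the idea behind a "max attempts" setting, here is a minimal, self-contained sketch of a bounded retry around a read that can time out. It is an illustration under assumptions, not the code in this patch: the class `BoundedAttemptRead`, the method `readWithMaxAttempts`, and the simulated flaky read are all hypothetical, and the real change in StripeReader may be structured quite differently (for example, it may need to recreate the block reader between attempts).

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; this is not code from StripeReader. */
final class BoundedAttemptRead {

  /**
   * Calls readOnce up to maxAttempts times, retrying only on socket timeouts.
   * In a real client, maxAttempts would come from a config value such as
   * dfs.client.read.striped.datanode.max.attempts (assumed usage).
   */
  static byte[] readWithMaxAttempts(Callable<byte[]> readOnce, int maxAttempts)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    SocketTimeoutException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return readOnce.call();
      } catch (SocketTimeoutException e) {
        // Remember the timeout and try again until attempts are exhausted.
        last = e;
      }
    }
    // Every attempt timed out: surface the last timeout to the caller.
    throw last;
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger calls = new AtomicInteger();
    // Simulated DataNode read that times out on its first two calls.
    Callable<byte[]> flakyRead = () -> {
      if (calls.incrementAndGet() < 3) {
        throw new SocketTimeoutException("simulated timeout");
      }
      return new byte[] {1, 2, 3};
    };
    byte[] data = readWithMaxAttempts(flakyRead, 3);
    System.out.println("read " + data.length + " bytes after "
        + calls.get() + " attempts");
  }
}
```

With a limit of 1 the sketch behaves like a client that gives up on the first timeout; raising the limit trades extra latency on a truly dead DN for tolerance of transient timeouts.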