HDDS-2347. XCeiverClientGrpc's parallel use leads to NPE #81
What changes were proposed in this pull request?
We found this issue during Hive TPCDS tests. The root of the problem is that Hive starts an arbitrary number of threads that work on, and read from, the same file concurrently.
In this case the same XCeiverClientGrpc instance is called from multiple threads, and in certain scenarios the client's state is not synchronized properly. This PR adds the necessary synchronization around the internal closed boolean state and around the channels and asyncStubs structures.
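For illustration, here is a minimal, self-contained sketch of the kind of synchronization described above. The `Channel` and `AsyncStub` types and all method bodies are hypothetical stand-ins, not the actual Ozone classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-ins for the real gRPC channel and stub types.
class Channel {
  private volatile boolean shutdown = false;
  boolean isShutdown() { return shutdown; }
  void shutdown() { shutdown = true; }
}

class AsyncStub { }

public class SynchronizedClientSketch {
  // All three fields are guarded by the instance lock ("this").
  private final Map<UUID, Channel> channels = new HashMap<>();
  private final Map<UUID, AsyncStub> asyncStubs = new HashMap<>();
  private boolean closed = false;

  // Mutation of the maps and of the closed flag happens only while holding
  // the lock, so a concurrent reader can never observe a half-initialized
  // entry -- the kind of race that produces an NPE.
  public synchronized void connectToDatanode(UUID dnId) {
    if (closed) {
      throw new IllegalStateException("client is already closed");
    }
    channels.computeIfAbsent(dnId, id -> new Channel());
    asyncStubs.computeIfAbsent(dnId, id -> new AsyncStub());
  }

  public synchronized void close() {
    closed = true;
    channels.values().forEach(Channel::shutdown);
    channels.clear();
    asyncStubs.clear();
  }
}
```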
A fundamental change in behaviour is that XCeiverClientGrpc instances are now handed out by the XCeiverClientManager only after connecting to the first DN in a synchronized fashion; after that, before use, the client checks whether the DN is still connected properly, and if it is not, it reconnects inside a synchronized block.
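Continuing the sketch above, the reconnect-if-needed flow might look like the following; the check and the reconnect happen under the same lock, so only one thread rebuilds a stale connection. Again, the names are illustrative, not the actual patch:

```java
// Continuing SynchronizedClientSketch above: callers obtain the stub through
// this method, which verifies the connection under the lock and reconnects
// in place if the channel for the datanode is missing or shut down.
public synchronized AsyncStub getStubFor(UUID dnId) {
  Channel ch = channels.get(dnId);
  if (ch == null || ch.isShutdown()) {
    // Drop the stale entries so connectToDatanode rebuilds both of them;
    // because we still hold the lock, no other thread can slip in between.
    channels.remove(dnId);
    asyncStubs.remove(dnId);
    connectToDatanode(dnId);
  }
  return asyncStubs.get(dnId);
}
```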
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-2347
How was this patch tested?
As this issue comes up intermittently and reproduction depends on how the JVM schedules the different threads, I have not been able to write a reliable test so far.
Manually, the patch was tested on a 42-node cluster, running the 99 TPCDS queries against scale 2 and scale 3 data sets generated by the tools here: https://github.com/fapifta/hive-testbench
These tools come from https://github.com/hortonworks/hive-testbench, with some modifications that make it possible to use Ozone and HDFS as filesystems in parallel.
After applying the patch on top of current trunk, I have not seen the NPE in 3 runs of the 99 TPCDS queries; before the patch, 2-5 queries per run failed with the given NPE.