HDDS-2347. XCeiverClientGrpc's parallel use leads to NPE #81
What changes were proposed in this pull request?
We found this issue during Hive TPCDS tests. The root of the problem is that Hive starts an arbitrary number of threads that work on, and read from, the same file concurrently.
In this case the same XCeiverClientGrpc instance is called from multiple threads, and in certain scenarios the client's state is not synchronized properly. This PR adds the necessary synchronization around the internal closed boolean state and around the channels and asyncStubs structures.
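For illustration, here is a minimal, self-contained sketch of the kind of synchronization described above. The `Channel` and `AsyncStub` types and all method bodies are hypothetical stand-ins, not the actual Ozone classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-ins for the real gRPC channel and stub types.
class Channel {
  private volatile boolean shutdown = false;
  boolean isShutdown() { return shutdown; }
  void shutdown() { shutdown = true; }
}

class AsyncStub { }

public class SynchronizedClientSketch {
  // All three fields are guarded by the instance lock ("this").
  private final Map<UUID, Channel> channels = new HashMap<>();
  private final Map<UUID, AsyncStub> asyncStubs = new HashMap<>();
  private boolean closed = false;

  // Mutation of the maps and of the closed flag happens only while holding
  // the lock, so a concurrent reader can never observe a half-initialized
  // entry -- the kind of race that produces an NPE.
  public synchronized void connectToDatanode(UUID dnId) {
    if (closed) {
      throw new IllegalStateException("client is already closed");
    }
    channels.computeIfAbsent(dnId, id -> new Channel());
    asyncStubs.computeIfAbsent(dnId, id -> new AsyncStub());
  }

  public synchronized void close() {
    closed = true;
    channels.values().forEach(Channel::shutdown);
    channels.clear();
    asyncStubs.clear();
  }
}
```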
A fundamental change in behaviour is that XCeiverClientGrpc instances are now handed out by the XCeiverClientManager only after connecting to the first DN in a synchronized fashion; after that, before use, the client checks whether the DN is still connected properly, and if it is not, it reconnects inside a synchronized block.
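Continuing the sketch above, the reconnect-if-needed flow might look like the following; the check and the reconnect happen under the same lock, so only one thread rebuilds a stale connection. Again, the names are illustrative, not the actual patch:

```java
// Continuing SynchronizedClientSketch above: callers obtain the stub through
// this method, which verifies the connection under the lock and reconnects
// in place if the channel for the datanode is missing or shut down.
public synchronized AsyncStub getStubFor(UUID dnId) {
  Channel ch = channels.get(dnId);
  if (ch == null || ch.isShutdown()) {
    // Drop the stale entries so connectToDatanode rebuilds both of them;
    // because we still hold the lock, no other thread can slip in between.
    channels.remove(dnId);
    asyncStubs.remove(dnId);
    connectToDatanode(dnId);
  }
  return asyncStubs.get(dnId);
}
```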
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-2347
How was this patch tested?
As this issue comes up intermittently and reproduction depends on how the JVM schedules the different threads, I have not been able to write a reliable test so far.
Manually, the patch was tested on a 42-node cluster, running the 99 TPCDS queries against scale 2 and scale 3 data sets generated by the tools here: https://github.com/fapifta/hive-testbench
These tools come from https://github.com/hortonworks/hive-testbench, with some modifications that make it possible to use Ozone and HDFS as filesystems in parallel.
After applying the patch on top of current trunk, I have not seen the NPE in 3 runs of the 99 TPCDS queries; before the patch, 2-5 queries per run failed with the given NPE.