HBASE-27926 DBB release too early for replication #5288
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Mind explaining more? It is strange that releasing a general RPC resource would only affect replication-related calls.
Any updates here? Thanks.
It seems like this will leak? What will run the cleanup?
It would be good to get a bit more detail on your analysis/thinking here, both regarding the problem for replication and how this plays with other non-replication cases.
Thanks, @Apache9, @bbeaudreault.
I'd need to dig into the code to know for sure, but I think the appropriate thing to do might be to retain() in the replication endpoint rather than remove the release/done call in the encoder.
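To make the retain() suggestion concrete, here is a minimal sketch of the ownership pattern. It assumes a simplified reference-counted buffer; `RefCountedBuffer`, `RetainInEndpointSketch`, and `deferWork` are hypothetical names for illustration, not HBase or Netty classes, and whether the real endpoint defers work this way is exactly what would need checking in the code.

```java
/** Hypothetical stand-in for a pooled, reference-counted buffer; not an HBase or Netty class. */
final class RefCountedBuffer {
    private int refCnt = 1; // the creator (the RPC response path) holds one reference
    private final byte[] data;

    RefCountedBuffer(byte[] data) {
        this.data = data;
    }

    synchronized void retain() {
        if (refCnt == 0) throw new IllegalStateException("retain() on a freed buffer");
        refCnt++;
    }

    synchronized void release() {
        if (refCnt == 0) throw new IllegalStateException("release() on a freed buffer");
        refCnt--;
    }

    synchronized byte[] read() {
        if (refCnt == 0) throw new IllegalStateException("read after free");
        return data;
    }

    synchronized int refCnt() {
        return refCnt;
    }
}

final class RetainInEndpointSketch {
    /** Hypothetical endpoint: takes its own reference before deferring work on the buffer. */
    static Runnable deferWork(RefCountedBuffer buf, int[] bytesSeen) {
        buf.retain();
        return () -> {
            try {
                bytesSeen[0] = buf.read().length; // stands in for applying the replicated edits
            } finally {
                buf.release(); // drops only the endpoint's own reference
            }
        };
    }

    public static void main(String[] args) {
        RefCountedBuffer buf = new RefCountedBuffer(new byte[] {1, 2, 3});
        int[] bytesSeen = new int[1];
        Runnable pending = deferWork(buf, bytesSeen);
        buf.release(); // the response path's release/done call stays in place
        pending.run(); // still safe: the endpoint's reference keeps the buffer alive
        System.out.println("bytes=" + bytesSeen[0] + " refCnt=" + buf.refCnt());
    }
}
```

The point of the sketch is that the existing release stays where it is and the later consumer balances its own retain() with its own release(), so no cleanup is removed.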
I think the issue is not at the replication endpoint. For example, with source cluster RS A -> dest cluster RS B -> dest cluster RS C, when B acts as a client to send batches to C and a send fails, the DBB will be released before the next retry. @bbeaudreault
Yea so we are talking about RSRpcServices.replicateWALEntry endpoint, correct? The NettyRpcServerResponseEncoder calls
Using retain() before ReplicationSink.batch is reasonable. But I think there is another circumstance in which the DBB will be released unexpectedly: source cluster RS S1 -> dest cluster RS D1, where D1 redirects the entries to more than one other RS. These can be D2, D3, D4, ..., but can also be D1 itself. I think this is the central problem.
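The fan-out case described here can be sketched as one reference per pending destination, held across retries. This is an illustration under assumptions, not the HBase implementation: `FanOutSketch`, `SharedEntries`, and `PendingSend` are hypothetical names, and failures are simulated with a counter rather than real RPCs.

```java
import java.util.ArrayList;
import java.util.List;

final class FanOutSketch {
    /** Hypothetical shared payload with a plain reference count; not an HBase class. */
    static final class SharedEntries {
        private int refCnt = 1; // held by the RPC response path

        synchronized void retain() { refCnt++; }

        synchronized void release() { refCnt--; }

        synchronized void read() {
            if (refCnt <= 0) throw new IllegalStateException("read after free");
        }

        synchronized int refCnt() { return refCnt; }
    }

    /** One forward to one destination; owns one reference until it succeeds. */
    static final class PendingSend {
        private final SharedEntries entries;
        private int failuresLeft;

        PendingSend(SharedEntries entries, int failuresLeft) {
            this.entries = entries;
            this.failuresLeft = failuresLeft;
            entries.retain();
        }

        /** Returns true once the send succeeded and its reference was dropped. */
        boolean attempt() {
            entries.read(); // every attempt, retries included, touches the shared memory
            if (failuresLeft > 0) {
                failuresLeft--;
                return false; // simulated failure: keep the reference for the retry
            }
            entries.release();
            return true;
        }
    }

    /** Runs the scenario and returns the final reference count. */
    static int runDemo() {
        SharedEntries entries = new SharedEntries();
        List<PendingSend> sends = new ArrayList<>();
        sends.add(new PendingSend(entries, 1)); // to D2: fails once
        sends.add(new PendingSend(entries, 0)); // to D3: succeeds first time
        sends.add(new PendingSend(entries, 2)); // back to D1 itself: fails twice
        entries.release(); // response path is done; three references remain
        while (!sends.isEmpty()) {
            sends.removeIf(PendingSend::attempt);
        }
        return entries.refCnt();
    }

    public static void main(String[] args) {
        System.out.println("final refCnt=" + runDemo());
    }
}
```

If any retry ran after the count had reached zero, `read()` would throw, which is the sketch's analogue of the early release being discussed.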
Is this repeatable for you? If so can you add logging to try to get a more exact trace of when the problem occurs? Otherwise add a unit test? It’s really painful tracking down leaks, so I’m not excited to remove a release/cleanup without very clear evidence or other options ruled out.
I will close this PR and try to reproduce this issue with a UT in another one. Anyone interested in this issue can try to reproduce and fix it too. Thanks a lot.
No description provided.