Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HBASE-27926 DBB release too early for replication #5288

Closed
wants to merge 1 commit into from

Conversation

sunhelly
Copy link
Contributor

No description provided.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 22s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+1 💚 mvninstall 2m 50s master passed
+1 💚 compile 2m 26s master passed
+1 💚 checkstyle 0m 36s master passed
+1 💚 spotless 0m 43s branch has no errors when running spotless:check.
+1 💚 spotbugs 1m 29s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 32s the patch passed
+1 💚 compile 2m 23s the patch passed
+1 💚 javac 2m 23s the patch passed
+1 💚 checkstyle 0m 36s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 8m 59s Patch does not cause any errors with Hadoop 3.2.4 3.3.5.
+1 💚 spotless 0m 41s patch has no errors when running spotless:check.
+1 💚 spotbugs 1m 35s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 12s The patch does not generate ASF License warnings.
31m 16s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5288
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 89cd1979033e 5.4.0-148-generic #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / ddc6752
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 80 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 46s Docker mode activated.
-0 ⚠️ yetus 0m 2s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 2m 29s master passed
+1 💚 compile 0m 40s master passed
+1 💚 shadedjars 4m 30s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 26s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 14s the patch passed
+1 💚 compile 0m 39s the patch passed
+1 💚 javac 0m 39s the patch passed
+1 💚 shadedjars 4m 29s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 23s the patch passed
_ Other Tests _
+1 💚 unit 222m 20s hbase-server in the patch passed.
243m 36s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR #5288
Optional Tests javac javadoc unit shadedjars compile
uname Linux 724d56eaf01b 5.4.0-148-generic #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / ddc6752
Default Java Temurin-1.8.0_352-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/testReport/
Max. process+thread count 4057 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 50s Docker mode activated.
-0 ⚠️ yetus 0m 4s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 2m 48s master passed
+1 💚 compile 0m 46s master passed
+1 💚 shadedjars 4m 38s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 27s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 34s the patch passed
+1 💚 compile 0m 48s the patch passed
+1 💚 javac 0m 48s the patch passed
+1 💚 shadedjars 4m 32s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 25s the patch passed
_ Other Tests _
-1 ❌ unit 222m 52s hbase-server in the patch failed.
245m 27s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5288
Optional Tests javac javadoc unit shadedjars compile
uname Linux 0e6898c0071f 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / ddc6752
Default Java Eclipse Adoptium-11.0.17+8
unit https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/testReport/
Max. process+thread count 4426 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Copy link
Contributor

Apache9 commented Jun 14, 2023

Mind explaining more? It is strange that why a general rpc resource releasing will only affect replication related calls?

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 45s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+1 💚 mvninstall 2m 49s master passed
+1 💚 compile 2m 26s master passed
+1 💚 checkstyle 0m 37s master passed
+1 💚 spotless 0m 44s branch has no errors when running spotless:check.
+1 💚 spotbugs 1m 33s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 32s the patch passed
+1 💚 compile 2m 21s the patch passed
+1 💚 javac 2m 21s the patch passed
+1 💚 checkstyle 0m 33s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 8m 56s Patch does not cause any errors with Hadoop 3.2.4 3.3.5.
+1 💚 spotless 0m 41s patch has no errors when running spotless:check.
+1 💚 spotbugs 1m 36s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 12s The patch does not generate ASF License warnings.
31m 48s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/hbase-kustomize-github-pr/job/PR-5288/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5288
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 7a44bd2dc58d 5.4.0-148-generic #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 68da890
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 79 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/hbase-kustomize-github-pr/job/PR-5288/1/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Copy link
Contributor

Apache9 commented Jun 30, 2023

Any updates here?

Thanks.

Copy link
Contributor

@bbeaudreault bbeaudreault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this will leak? What will run the cleanup?

Would be good to get a bit more details into your analysis/thinking here, both regarding the problem for replication and how this plays with other non-replication cases

@sunhelly
Copy link
Contributor Author

sunhelly commented Jul 5, 2023

Thanks, @Apache9 , @bbeaudreault .
Under the current replication architecture, the replication requests are redirected by RSes. The dest cluster RSes act as hbase client to forward replicated edits to each other by calling ReplicationSink#batch. When using NettyRpcServer, the sink dest RS uses offheap byte buffer to keep the replicated edits and send RPCs to the RSes that the edits refer to, when encounters exception, the ServerCall run progress is not ended but the response.done() is called and the DBB will be released, then the next retries for the ReplicationSink#batch will use dirty DBB to replicate wrong edits.
I think ServerCall#cleanup called when the rpc call is completed or failed is enough. While another cleanup in NettyRpcServerResponseEncoder is redundant and will cause every RS client release the DBB before the ServerCall is really completed.

@bbeaudreault
Copy link
Contributor

I'd need to dig into the code to know for sure, but I think the appropriate thing to do might be to retain() in the replication endpoint rather than remove the release/done call in the encoder

@sunhelly
Copy link
Contributor Author

sunhelly commented Jul 5, 2023

I think the issue is not at the replication endpoint, e.g. source cluster RS A -> dest cluster RS B -> dest cluster RS C, when B act as client to send batches to C and fails, the DBB will be released before next retries. @bbeaudreault

@bbeaudreault
Copy link
Contributor

Yea so we are talking about RSRpcServices.replicateWALEntry endpoint, correct? The NettyRpcServerResponseEncoder calls done() after the response bytes are written to the channel (i.e. from ServerCall.sendResponseIfReady()). The response bytes in this case is an empty ReplicateWALEntryResponse proto. The replicateWALEntry method should only return a ReplicateWALEntryResponse once the batch calls have all succeeded or failed. So we should only trigger NettyRpcServerResponseEncoder once the batch calls have all succeeded or failed. So I'm not sure why this would cause a problem for retries within replicateWALEntry?

@sunhelly
Copy link
Contributor Author

sunhelly commented Jul 6, 2023

Using retain() before ReplicationSink.batch is reasonable. But I think there is another circumstance that the DBB will be released unexpectedly. That is, source cluster RS S1 -> dest cluster RS D1, dest cluster D1 redirects the entries to more than one other RSes, they can be D2,D3,D4..., but can also be D1 itself, I think this is the central problem.

@bbeaudreault
Copy link
Contributor

Is this repeatable for you? If so can you add logging to try to get a more exact trace of when the problem occurs? Otherwise add a unit test?

It’s really painful tracking down leaks, so I’m not excited to remove a release/cleanup without very clear evidence or other options ruled out.

@sunhelly
Copy link
Contributor Author

sunhelly commented Jul 6, 2023

I will close this PR and try UT reproduce this issue in another. Any one intresting in this issue can try reproduce and fix it too. Thanks a lot.

@sunhelly sunhelly closed this Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants