HBASE-27926 DBB release too early for replication #5288

sunhelly · 2023-06-14T08:38:50Z

No description provided.

Apache-HBase · 2023-06-14T09:21:40Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	0m 22s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	hbaseanti	0m 0s	Patch does not have any anti-patterns.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
		_ master Compile Tests _
+1 💚	mvninstall	2m 50s	master passed
+1 💚	compile	2m 26s	master passed
+1 💚	checkstyle	0m 36s	master passed
+1 💚	spotless	0m 43s	branch has no errors when running spotless:check.
+1 💚	spotbugs	1m 29s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	2m 32s	the patch passed
+1 💚	compile	2m 23s	the patch passed
+1 💚	javac	2m 23s	the patch passed
+1 💚	checkstyle	0m 36s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	hadoopcheck	8m 59s	Patch does not cause any errors with Hadoop 3.2.4 3.3.5.
+1 💚	spotless	0m 41s	patch has no errors when running spotless:check.
+1 💚	spotbugs	1m 35s	the patch passed
		_ Other Tests _
+1 💚	asflicense	0m 12s	The patch does not generate ASF License warnings.
		31m 16s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#5288
Optional Tests	dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname	Linux 89cd1979033e 5.4.0-148-generic #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `ddc6752`
Default Java	Eclipse Adoptium-11.0.17+8
Max. process+thread count	80 (vs. ulimit of 30000)
modules	C: hbase-server U: hbase-server
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/console
versions	git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2023-06-14T12:54:04Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	0m 46s	Docker mode activated.
-0 ⚠️	yetus	0m 2s	Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
		_ Prechecks _
		_ master Compile Tests _
+1 💚	mvninstall	2m 29s	master passed
+1 💚	compile	0m 40s	master passed
+1 💚	shadedjars	4m 30s	branch has no errors when building our shaded downstream artifacts.
+1 💚	javadoc	0m 26s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	2m 14s	the patch passed
+1 💚	compile	0m 39s	the patch passed
+1 💚	javac	0m 39s	the patch passed
+1 💚	shadedjars	4m 29s	patch has no errors when building our shaded downstream artifacts.
+1 💚	javadoc	0m 23s	the patch passed
		_ Other Tests _
+1 💚	unit	222m 20s	hbase-server in the patch passed.
		243m 36s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR	#5288
Optional Tests	javac javadoc unit shadedjars compile
uname	Linux 724d56eaf01b 5.4.0-148-generic #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `ddc6752`
Default Java	Temurin-1.8.0_352-b08
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/testReport/
Max. process+thread count	4057 (vs. ulimit of 30000)
modules	C: hbase-server U: hbase-server
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/console
versions	git=2.34.1 maven=3.8.6
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2023-06-14T12:56:06Z

💔 -1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	0m 50s	Docker mode activated.
-0 ⚠️	yetus	0m 4s	Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
		_ Prechecks _
		_ master Compile Tests _
+1 💚	mvninstall	2m 48s	master passed
+1 💚	compile	0m 46s	master passed
+1 💚	shadedjars	4m 38s	branch has no errors when building our shaded downstream artifacts.
+1 💚	javadoc	0m 27s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	2m 34s	the patch passed
+1 💚	compile	0m 48s	the patch passed
+1 💚	javac	0m 48s	the patch passed
+1 💚	shadedjars	4m 32s	patch has no errors when building our shaded downstream artifacts.
+1 💚	javadoc	0m 25s	the patch passed
		_ Other Tests _
-1 ❌	unit	222m 52s	hbase-server in the patch failed.
		245m 27s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR	#5288
Optional Tests	javac javadoc unit shadedjars compile
uname	Linux 0e6898c0071f 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `ddc6752`
Default Java	Eclipse Adoptium-11.0.17+8
unit	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/testReport/
Max. process+thread count	4426 (vs. ulimit of 30000)
modules	C: hbase-server U: hbase-server
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5288/1/console
versions	git=2.34.1 maven=3.8.6
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache9 · 2023-06-14T15:45:03Z

Mind explaining more? It is strange that a general RPC resource release would only affect replication-related calls.

Apache-HBase · 2023-06-15T12:31:30Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	0m 45s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	hbaseanti	0m 0s	Patch does not have any anti-patterns.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
		_ master Compile Tests _
+1 💚	mvninstall	2m 49s	master passed
+1 💚	compile	2m 26s	master passed
+1 💚	checkstyle	0m 37s	master passed
+1 💚	spotless	0m 44s	branch has no errors when running spotless:check.
+1 💚	spotbugs	1m 33s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	2m 32s	the patch passed
+1 💚	compile	2m 21s	the patch passed
+1 💚	javac	2m 21s	the patch passed
+1 💚	checkstyle	0m 33s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	hadoopcheck	8m 56s	Patch does not cause any errors with Hadoop 3.2.4 3.3.5.
+1 💚	spotless	0m 41s	patch has no errors when running spotless:check.
+1 💚	spotbugs	1m 36s	the patch passed
		_ Other Tests _
+1 💚	asflicense	0m 12s	The patch does not generate ASF License warnings.
		31m 48s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/hbase-kustomize-github-pr/job/PR-5288/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#5288
Optional Tests	dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname	Linux 7a44bd2dc58d 5.4.0-148-generic #165-Ubuntu SMP Tue Apr 18 08:53:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `68da890`
Default Java	Eclipse Adoptium-11.0.17+8
Max. process+thread count	79 (vs. ulimit of 30000)
modules	C: hbase-server U: hbase-server
Console output	https://ci-hbase.apache.org/job/hbase-kustomize-github-pr/job/PR-5288/1/console
versions	git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache9 · 2023-06-30T13:57:43Z

Any updates here?

Thanks.

bbeaudreault

It seems like this will leak? What will run the cleanup?

It would be good to get a bit more detail on your analysis here, both regarding the problem for replication and how this plays with other non-replication cases.

sunhelly · 2023-07-05T03:49:22Z

Thanks, @Apache9, @bbeaudreault.
Under the current replication architecture, replication requests are redirected by RSes: the destination cluster's RSes act as HBase clients and forward replicated edits to each other by calling ReplicationSink#batch. When NettyRpcServer is used, the destination sink RS keeps the replicated edits in an off-heap byte buffer and sends RPCs to the RSes that the edits belong to. When an exception occurs, the ServerCall run has not finished, but response.done() is called and the DBB is released, so the next retry of ReplicationSink#batch uses a dirty DBB and replicates wrong edits.
I think calling ServerCall#cleanup when the RPC call completes or fails is enough. The additional cleanup in NettyRpcServerResponseEncoder is redundant and causes every RS acting as a client to release the DBB before the ServerCall has really completed.
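To make the failure mode concrete, here is a minimal model of the reference-counting rule at stake. `RefCountedBuffer` and `PrematureReleaseDemo` are hypothetical names for illustration only, not HBase's `ByteBuff` or Netty's `ByteBuf`, and the "failed first attempt" is simulated by calling `release()` directly. This model fails fast when a released buffer is read; with a pooled allocator the freed memory may instead be handed to another request, which would be consistent with the "dirty DBB" symptom described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a ref-counted off-heap buffer; models only the refCnt rules. */
final class RefCountedBuffer {
  private final ByteBuffer data;
  private int refCnt = 1; // a freshly allocated buffer has exactly one owner

  RefCountedBuffer(byte[] payload) {
    data = ByteBuffer.allocateDirect(payload.length); // off-heap, like the request DBB
    data.put(payload);
    data.flip();
  }

  void release() {
    ensureAccessible();
    refCnt--;
  }

  /** Reads the payload; fails once every owner has released the buffer. */
  byte[] read() {
    ensureAccessible();
    byte[] out = new byte[data.remaining()];
    data.duplicate().get(out); // duplicate() so repeated reads see the same bytes
    return out;
  }

  private void ensureAccessible() {
    if (refCnt <= 0) {
      throw new IllegalStateException("buffer already released");
    }
  }
}

final class PrematureReleaseDemo {
  /** Returns true if the retry found the request buffer already released. */
  static boolean retryHitsReleasedBuffer() {
    RefCountedBuffer edits = new RefCountedBuffer("edit-1".getBytes(StandardCharsets.UTF_8));
    // Simulated failed first attempt: the error path releases the buffer
    // while the handler is still running and intends to retry.
    edits.release();
    try {
      edits.read(); // the retry reads the edits again
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  /** Returns the edits seen on retry when nothing released the buffer early. */
  static String retryWithoutEarlyRelease() {
    RefCountedBuffer edits = new RefCountedBuffer("edit-1".getBytes(StandardCharsets.UTF_8));
    String seen = new String(edits.read(), StandardCharsets.UTF_8); // retry reads the edits
    edits.release(); // the single owner releases once the call has really completed
    return seen;
  }
}
```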

bbeaudreault · 2023-07-05T10:53:51Z

I'd need to dig into the code to know for sure, but I think the appropriate thing to do might be to retain() in the replication endpoint rather than remove the release/done call in the encoder.
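A sketch of the retain()/release() pairing suggested here, assuming the sink can take its own reference on the request buffer before calling ReplicationSink#batch and drop it in a `finally` block once all retries are done. `Held` and `RetainAroundBatch` are hypothetical names; whether HBase's request buffer exposes such a hook at that point is an assumption not confirmed in this thread. Pairing every retain() with exactly one release() is also what keeps this approach from introducing the leak raised earlier.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical minimal ref-counted holder; not an HBase or Netty API. */
final class Held {
  final AtomicInteger refCnt = new AtomicInteger(1); // the RPC layer's reference

  void retain() {
    int prev = refCnt.getAndIncrement();
    if (prev <= 0) {
      refCnt.decrementAndGet(); // undo: cannot resurrect a released buffer
      throw new IllegalStateException("retain() after release");
    }
  }

  void release() {
    if (refCnt.decrementAndGet() < 0) {
      throw new IllegalStateException("released more times than retained");
    }
  }

  boolean accessible() {
    return refCnt.get() > 0;
  }
}

final class RetainAroundBatch {
  /**
   * Simulates: the sink retains before batching, the RPC layer releases early
   * during a failed attempt, the retry still has a live buffer, and the sink
   * releases exactly once at the end. Returns the final refCnt (0 = balanced).
   */
  static int run() {
    Held request = new Held();
    request.retain(); // sink takes its own reference before the batch
    try {
      request.release(); // RPC layer drops its reference early (the reported bug)
      if (!request.accessible()) {
        throw new IllegalStateException("retry would read freed memory");
      }
      // The retried batch would run here; the sink's reference keeps the buffer alive.
    } finally {
      request.release(); // sink drops its reference exactly once
    }
    return request.refCnt.get();
  }
}
```

The `finally` block is the design choice that matters: the sink's reference is dropped whether the retries end in success or failure, so the buffer is neither leaked nor released twice.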

sunhelly · 2023-07-05T14:48:12Z

I think the issue is not in the replication endpoint. For example, with source cluster RS A -> dest cluster RS B -> dest cluster RS C: when B acts as a client to send batches to C and the call fails, the DBB is released before the next retry. @bbeaudreault

bbeaudreault · 2023-07-05T15:09:21Z

Yea so we are talking about the RSRpcServices.replicateWALEntry endpoint, correct? The NettyRpcServerResponseEncoder calls done() after the response bytes are written to the channel (i.e. from ServerCall.sendResponseIfReady()). The response bytes in this case are an empty ReplicateWALEntryResponse proto. The replicateWALEntry method should only return a ReplicateWALEntryResponse once the batch calls have all succeeded or failed. So we should only trigger NettyRpcServerResponseEncoder once the batch calls have all succeeded or failed. So I'm not sure why this would cause a problem for retries within replicateWALEntry?
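The ordering argument above can be modeled as follows, assuming a synchronous handler that retries the batch in-line before returning. `CallOrdering.simulate` is a hypothetical name, and the event strings only label the steps described in the comment. Under this ordering the release is always the last event, so retries inside the handler never see a released buffer; the model does not cover the case, raised in the following reply, where a destination RS forwards entries to itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one replicateWALEntry call; not HBase code. */
final class CallOrdering {
  static List<String> simulate(int failedAttempts) {
    List<String> events = new ArrayList<>();
    // Handler body: the sink batch is retried in-line while the request buffer is still held.
    for (int attempt = 0; attempt <= failedAttempts; attempt++) {
      events.add(attempt < failedAttempts ? "batch-failed" : "batch-ok");
    }
    // The handler returns only after the batch calls have all succeeded or failed.
    events.add("handler-returned");
    // The (empty) ReplicateWALEntryResponse is written to the channel afterwards...
    events.add("response-written");
    // ...and only then does the encoder call done(), releasing the request buffer.
    events.add("done-released");
    return events;
  }
}
```

For example, `simulate(2)` records two `batch-failed` events, then `batch-ok`, and `done-released` comes last.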

sunhelly · 2023-07-06T02:37:47Z

Using retain() before ReplicationSink.batch is reasonable. But I think there is another circumstance in which the DBB will be released unexpectedly: source cluster RS S1 -> dest cluster RS D1, where D1 redirects the entries to one or more other RSes. These can be D2, D3, D4..., but can also be D1 itself. I think this is the central problem.

bbeaudreault · 2023-07-06T10:22:31Z

Is this repeatable for you? If so can you add logging to try to get a more exact trace of when the problem occurs? Otherwise add a unit test?

It’s really painful tracking down leaks, so I’m not excited to remove a release/cleanup without very clear evidence or other options ruled out.

sunhelly · 2023-07-06T11:26:31Z

I will close this PR and try to reproduce this issue with a unit test in another PR. Anyone interested in this issue is welcome to try to reproduce and fix it too. Thanks a lot.

HBASE-27926 DBB release too early for replication

fab137a

bbeaudreault reviewed Jun 30, 2023


sunhelly closed this Jul 6, 2023
