HBASE-24813 ReplicationSource should clear buffer usage on Replicatio… #2546

wchevreuil · 2020-10-14T18:09:24Z

…nSourceManager upon termination (rebased after HBASE-25117)

...server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

...src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java

esteban

Just a couple of comments.

esteban · 2020-11-09T17:49:32Z

...src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java

+      } catch (InterruptedException e) {
+        LOG.warn("{} Interrupted while waiting {} to stop on clearWALEntryBatch: {}",
+          this.source.getPeerId(), this.getName(), e);
+        Thread.currentThread().interrupt();


Shouldn't this be just INFO? Also, I think it might be better to handle those InterruptedExceptions inside ReplicationSource.terminate().

Left as WARN because it aborts the flow without effectively updating the buffer usage, which is the fundamental issue we are trying to solve here.
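For readers following the interrupt handling discussed here, a minimal sketch of the pattern in the diff: catching InterruptedException clears the thread's interrupt flag, so the catch block restores it and reports that the wait was cut short. The class and method names (`InterruptRestoreSketch`, `awaitStop`) are hypothetical and only illustrate the idea; this is not the code from the patch.

```java
/** Illustrative only: hypothetical names, not the HBase implementation. */
class InterruptRestoreSketch {

  /**
   * Waits up to timeoutMillis for the worker to stop.
   * Returns true only if the worker is confirmed dead, false if the wait
   * timed out or was interrupted (in which case no cleanup should be assumed).
   */
  static boolean awaitStop(Thread worker, long timeoutMillis) {
    try {
      worker.join(timeoutMillis);
      return !worker.isAlive();
    } catch (InterruptedException e) {
      // Throwing InterruptedException clears the interrupt flag, so restore it
      // for the caller (for example, a terminate() method) to observe.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(10_000); // simulated long-running work
      } catch (InterruptedException ignored) {
        // exit when interrupted
      }
    });
    worker.setDaemon(true);
    worker.start();

    // Simulate the terminating thread itself being interrupted before it waits.
    Thread.currentThread().interrupt();
    boolean stopped = awaitStop(worker, 1_000);
    // Thread.interrupted() reads and clears the flag restored by awaitStop.
    boolean flagRestored = Thread.interrupted();
    System.out.println("stopped=" + stopped + ", flagRestored=" + flagRestored);
    worker.interrupt();
  }
}
```

Returning a boolean rather than swallowing the exception lets the caller decide whether skipping the buffer cleanup deserves a WARN.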

ankitsinghal · 2020-11-10T06:58:24Z

...server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

    for (ReplicationSourceShipper worker : workers) {
+      worker.stopWorker();
+      if (worker.entryReader != null) {
+        worker.entryReader.setReaderRunning(false);


If a worker is doing some async work when it is asked to stop, stopping can take time. In that case I think we should keep the implementation as it was before: ask all workers to stop at once and then wait. Otherwise, if the number of workers gets large due to backlog and someone changes the wait time config to tens of seconds, the removePeer command/procedure has to wait a long time (number of workers * (sleep time + time for clearWALEntryBatch)) to terminate the replication source.

Sorry, I'm not following your concern here. I don't see how an extra loop in the same method, which just sets two flags (one in the shipper and another in the reader), can help with the contention scenario described. The terminate execution would be stuck in the second for loop anyway.

Sure, let me try to explain again.
I was referring to restoring this loop.

    for (ReplicationSourceShipper worker : workers) {
      worker.stopWorker();
      if (worker.entryReader != null) {
        worker.entryReader.setReaderRunning(false);
      }
    }

Your current flow stops the workers one at a time:

Stop a worker

wait for the worker thread to complete.

stop another worker

wait for it to finish

continue for the others...
So in the worst case, you would have to wait for (number of workers) * min(time taken by a worker to finish, timeout).

By restoring the old loop, however, you parallelize the stopping of the workers:

ask all worker threads to finish their work by setting their state.

then in the second loop, wait for each worker to finish, while you are waiting for 1 worker, others are also completing their work in parallel.

so when you are done with one worker it is possible that all other workers are also done.
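The difference between the two flows can be sketched as below. This is only an illustration under stated assumptions: `SketchWorker` is a hypothetical stand-in for a shipper thread whose in-flight work takes a fixed time to finish after the stop flag is set, and none of these names come from the HBase code base.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a shipper thread; not the HBase class. */
class SketchWorker extends Thread {
  private volatile boolean running = true;
  private final long cleanupMillis;

  SketchWorker(long cleanupMillis) {
    this.cleanupMillis = cleanupMillis;
    setDaemon(true);
  }

  /** Only flips a flag and returns immediately. */
  void stopWorker() {
    running = false;
  }

  @Override
  public void run() {
    try {
      while (running) {
        Thread.sleep(5); // simulated shipping work
      }
      Thread.sleep(cleanupMillis); // simulated in-flight work finishing after the stop request
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}

class ParallelStopSketch {
  static List<SketchWorker> startWorkers(int count, long cleanupMillis) {
    List<SketchWorker> workers = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      SketchWorker w = new SketchWorker(cleanupMillis);
      w.start();
      workers.add(w);
    }
    return workers;
  }

  /** Bounded join that restores the interrupt flag instead of throwing. */
  private static void joinQuietly(Thread t, long timeoutMillis) {
    try {
      t.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Stop a worker, wait for it, then move to the next; waits add up. */
  static long stopOneByOne(List<SketchWorker> workers, long timeoutMillis) {
    long start = System.nanoTime();
    for (SketchWorker w : workers) {
      w.stopWorker();
      joinQuietly(w, timeoutMillis);
    }
    return (System.nanoTime() - start) / 1_000_000;
  }

  /** Signal every worker first, then wait; the workers wind down concurrently. */
  static long stopAllThenJoin(List<SketchWorker> workers, long timeoutMillis) {
    long start = System.nanoTime();
    for (SketchWorker w : workers) {
      w.stopWorker();
    }
    for (SketchWorker w : workers) {
      joinQuietly(w, timeoutMillis);
    }
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) {
    long oneByOne = stopOneByOne(startWorkers(4, 200), 2_000);
    long allThenJoin = stopAllThenJoin(startWorkers(4, 200), 2_000);
    System.out.println("one by one: " + oneByOne + " ms, signal all then join: "
      + allThenJoin + " ms");
  }
}
```

With the one-by-one flow the per-worker waits accumulate, while in the second flow the total wait is bounded by roughly the slowest worker (or the timeout), which is the point being made above.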

Got you, thanks for explaining in more detail. Will address it in the next commit.

ankitsinghal · 2020-11-10T09:07:09Z

...src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java

+          LOG.warn("Interrupting source thread for peer {} without cleaning buffer usage "
+            + "because clearWALEntryBatch method timed out whilst waiting reader/shipper "
+            + "thread to stop.", this.source.getPeerId());
+          Thread.currentThread().interrupt();


Why do we need an additional interrupt here when ReplicationSource.terminate() has already interrupted the worker thread prior to the clearWALEntryBatch method call?

We are just interrupting if either shipper or reader thread is still alive. We can't guarantee that the caller will always have stopped these threads, therefore, the extra check here.

This method should only be called upon replication source termination.

So what will this interrupt do, and how is it handled in the source?

          LOG.warn("Interrupting source thread for peer {} without cleaning buffer usage "
            + "because clearWALEntryBatch method timed out whilst waiting reader/shipper "
            + "thread to stop.", this.source.getPeerId());

Don't we need to return here, since we timed out and are not clearing the batch?

Right, it's not being handled. Changing it to simply log the exceptional condition and return to the source.

If we return, then we do not clean the batch, so the replication quota will be leaked.
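The trade-off raised in this exchange can be sketched as follows. This is a simplified model, not the patch: `BufferQuotaSketch`, `clearBatch`, and the `AtomicLong` counter are hypothetical stand-ins for the shared buffer usage tracked during replication. It shows that returning early on a timeout avoids touching a batch a live thread may still be using, at the cost of leaving that batch's bytes counted against the quota.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: hypothetical names, not the HBase implementation. */
class BufferQuotaSketch {

  /**
   * Waits for the worker to stop and then releases the bytes held by its batch.
   * Returns true if the bytes were released, false if the wait timed out or
   * was interrupted, in which case the counter is left untouched (the "leak").
   */
  static boolean clearBatch(Thread worker, AtomicLong totalBufferUsed,
      long batchBytes, long timeoutMillis) {
    try {
      worker.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    if (worker.isAlive()) {
      // Timed out: the worker may still be using the batch, so do not release it.
      System.out.println("Timed out waiting for worker; buffer usage not cleaned");
      return false;
    }
    totalBufferUsed.addAndGet(-batchBytes);
    return true;
  }

  public static void main(String[] args) {
    AtomicLong used = new AtomicLong(100);

    Thread fast = new Thread(() -> { });
    fast.start();
    System.out.println("fast released: " + clearBatch(fast, used, 40, 1_000)
      + ", used=" + used.get());

    Thread slow = new Thread(() -> {
      try {
        Thread.sleep(5_000); // simulated worker that does not stop in time
      } catch (InterruptedException ignored) {
        // exit when interrupted
      }
    });
    slow.setDaemon(true);
    slow.start();
    System.out.println("slow released: " + clearBatch(slow, used, 60, 50)
      + ", used=" + used.get());
    slow.interrupt();
  }
}
```

In this model, whatever remains in the counter after a timed-out call is exactly the leaked quota the last comment refers to.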

Apache-HBase · 2020-11-10T19:51:06Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	0m 29s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	hbaseanti	0m 0s	Patch does not have any anti-patterns.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
		_ master Compile Tests _
+1 💚	mvninstall	3m 36s	master passed
+1 💚	checkstyle	1m 5s	master passed
+1 💚	spotbugs	1m 59s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	3m 31s	the patch passed
-0 ⚠️	checkstyle	1m 3s	hbase-server: The patch generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1)
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	hadoopcheck	19m 28s	Patch does not cause any errors with Hadoop 3.1.2 3.2.1 3.3.0.
+1 💚	spotbugs	2m 19s	the patch passed
		_ Other Tests _
+1 💚	asflicense	0m 12s	The patch does not generate ASF License warnings.
		41m 44s

Subsystem	Report/Notes
Docker	ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2546/3/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#2546
Optional Tests	dupname asflicense spotbugs hadoopcheck hbaseanti checkstyle
uname	Linux a35191384864 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `f0c430a`
checkstyle	https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2546/3/artifact/yetus-general-check/output/diff-checkstyle-hbase-server.txt
Max. process+thread count	94 (vs. ulimit of 30000)
modules	C: hbase-server U: hbase-server
Console output	https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-2546/3/console
versions	git=2.17.1 maven=3.6.3 spotbugs=3.1.12
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org