HBASE-24877 Add option to avoid aborting RS process upon uncaught exceptions happen on replication source #2399
…eptions happen on replication source
Please create the addendum first and then backport it to branch-2.
And while reading the current code on master: the loop in the startup method does not have a sleep? And yes, the retryStartup flag has to be an AtomicBoolean, as it will be set in another thread, though it is only used in the startup method... And now it seems we will block in the startup method forever? I think a better solution is to do the retry in the initThread so that we do not block the startup call. I think the reason we use an initThread is to avoid blocking the startup call; if this is not the case, why not just move the initialize call into the startup method... Thanks.
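A minimal sketch of the suggestion above, assuming a simplified stand-in class rather than the real HBase replication source: the retry loop lives in the init thread, it sleeps between attempts, and the flag is an AtomicBoolean so that another thread can clear it. `RetryingSource`, `Initializer`, `sleepMillis`, `terminate`, `awaitInitialized` and `attempts` are all hypothetical names chosen for illustration; only `startup`, `initialize`, `initThread` and `retryStartup` echo names used in this conversation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a replication source; not the HBase class. */
class RetryingSource {
  /** Hypothetical hook for the initialization work that may fail. */
  interface Initializer {
    void initialize() throws Exception;
  }

  private final Initializer initializer;
  private final long sleepMillis;
  // Cleared from the init thread or from terminate(), hence atomic.
  private final AtomicBoolean retryStartup = new AtomicBoolean(true);
  private final AtomicInteger attempts = new AtomicInteger();
  private final CountDownLatch initialized = new CountDownLatch(1);
  private volatile Thread initThread;

  RetryingSource(Initializer initializer, long sleepMillis) {
    this.initializer = initializer;
    this.sleepMillis = sleepMillis;
  }

  /** Returns immediately; all retrying happens in the init thread. */
  void startup() {
    Thread t = new Thread(this::initializeWithRetries, "source-init");
    t.setDaemon(true);
    initThread = t;
    t.start();
  }

  private void initializeWithRetries() {
    while (retryStartup.get()) {
      attempts.incrementAndGet();
      try {
        initializer.initialize();
        retryStartup.set(false); // success: stop retrying
        initialized.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      } catch (Exception e) {
        try {
          Thread.sleep(sleepMillis); // pace the loop instead of spinning
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }
  }

  /** Stops further retries and wakes the init thread if it is sleeping. */
  void terminate() {
    retryStartup.set(false);
    Thread t = initThread;
    if (t != null) {
      t.interrupt();
    }
  }

  /** Waits for a successful initialize; false on timeout or interrupt. */
  boolean awaitInitialized(long timeout, TimeUnit unit) {
    try {
      return initialized.await(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  int attempts() {
    return attempts.get();
  }
}
```

With this shape the caller of startup() is never blocked by a failing initialize, and a persistent failure costs one attempt per sleep interval rather than a busy loop. Whether the real code should cap the number of retries is a separate decision that this sketch does not address.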
🎊 +1 overall
This message was automatically generated.
Sure, addendum is created here: #2400
Holy cow, you're right, @Apache9. That's not the original intention; we should not block startup here. I was so eager to fix the race condition that I didn't notice the side effect.
Definitely, yeah. Let me do this.
…ace control in the initialize loop (apache#2400) Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Josh Elser <elserj@apache.org>
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Ping @Apache9: I tried cherry-picking the changes from the master addendum here. On the first pre-commit run the UTs never finished (which could be related to this PR), but the patch-unit-hbase-server.txt artifact was not available. I then submitted the build again and got this Docker issue:
I was talking to @busbey, and he mentioned that you had done some recent work on queuing up builds on the new Jenkins nodes. Would you know whether this should avoid this kind of error?
The pre-commit job is still on the old Hadoop nodes, so it should not be affected by the new nodes. Maybe the problem is that our replication-related tests have all been hanging recently and generating too many logs, which causes the surefire plugin to cache a lot of stdout and stderr files under /tmp and makes the nodes unavailable. The new Jenkins nodes have a small root disk, so it fills up easily, but in the end we also blew up the H nodes?
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
…eptions happen on replication source (apache#2399) Signed-off-by: stack <stack@apache.org>
PR for branch-2, after cherry-picking and resolving conflicts. I found one extra possible race condition while testing this, so I think we need an addendum for the master one.