-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-23597 Give high priority for meta assign procedure and ServerCr… #955
base: master
Are you sure you want to change the base?
Conversation
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging in. Can you do a bit of a high-level on how this all is supposed to work? There is new priority entity type and the (complicated) scheduling system is refactored to service it first ahead of any other table, server, peer, etc. entity. Any fear that this will disturb our scheduling equilibrium such as it is? Any chance we'll deadlock scheduling? (Probably not given the new mechanism is trying to run ahead of anything else). Thanks.
@@ -174,7 +188,7 @@ public Procedure poll(final long nanos) { | |||
return null; | |||
} | |||
} | |||
final Procedure pollResult = dequeue(); | |||
final Procedure pollResult = dequeue(highPriority); | |||
|
|||
pollCalls++; | |||
nullPollCalls += (pollResult == null) ? 1 : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we use this new functionality? We pass flag once set true for high priority and then later false for all the rest?
IIRC, there is a mechanism for putting procedures at front of the queue already... so we process children before their parent. We could not exploit that mechanism here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a pity we have to add the boolean priority flag as it makes this queue Interface no longer look like the usual queue Interface (they don't usually take a flag like this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dequeue(highPriority=true) only pull highPriority procedure from the queue.
And dequeue(highPriority=false) will pull procedure from the queue like before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes the new method makes the queue Interface no longer look like the usual.
@@ -2029,7 +2053,7 @@ protected boolean keepAlive(long lastUpdate) { | |||
private final class WorkerMonitor extends InlineChore { | |||
public static final String WORKER_MONITOR_INTERVAL_CONF_KEY = | |||
"hbase.procedure.worker.monitor.interval.msec"; | |||
private static final int DEFAULT_WORKER_MONITOR_INTERVAL = 5000; // 5sec | |||
private static final int DEFAULT_WORKER_MONITOR_INTERVAL = 1000; // 1sec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mean we will get logging from procedure in the Master log more frequently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
* @param unit a TimeUnit determining how to interpret the timeout parameter | ||
* @return the Procedure to execute, or null if nothing present. | ||
*/ | ||
Procedure pollHighPriority(long timeout, TimeUnit unit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is high priority? We have priority on rpcs with a gradient where 200 is high priority. Here there only gradient is high or not high? And only the assign of meta is high priority?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
high priority procedure is meta procedure/ServerCrashProcedure(carry meta)
@@ -135,7 +139,11 @@ protected void enqueue(final Procedure proc, final boolean addFront) { | |||
doAdd(tableRunQueue, getTableQueue(getTableName(proc)), proc, addFront); | |||
} else if (isServerProcedure(proc)) { | |||
ServerProcedureInterface spi = (ServerProcedureInterface) proc; | |||
doAdd(serverRunQueue, getServerQueue(spi.getServerName(), spi), proc, addFront); | |||
if (spi.hasMetaTableRegion()) { | |||
doAdd(serverHighPriorityRunQueue, getServerQueue(proc), proc, addFront); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this make it so all procedures associated with this server are 'high' priority?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, only the server procedure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, there is a problem that, the sub procedures will inherit some resources from the parent procedure, so changing the logic here is easy to introduce dead lock...
I checked the code, actually we do have a priority for the current ServerQueue, you can see the code in MasterProcedureScheduler.getServerQueue. The one carries meta will be be pulled out first. So I think first we need to find the root cause. The priority is not set correctly? Or is it because that all the workers have been blocked by other procedures so there is no chance for the SCP which carries meta to execute? Thanks. |
And in general, I think it is reasonable to have a mechanism to execute the SCP which carries meta immediately. In the past this is done by checking if all workers are blocked and add new workers, but having a special worker maybe better. The only problem is that we need to have a better abstraction, as ProcedureExecutor is not in hbase-server... |
@Apache9 @saintstack This background is First A ServerCrashProcedure(carry meta)execute and assign meta to B, then ServerCrashProcedure A assign other region following, there are too much regions to assign and do not finish fast, meanwhile B stop and invoke a ServerCrashProcedure(carry meta), then B ServerCrashProcedure blocked by TransitRegionStateProcedure. {code} 2019-12-18 16:33:04,728 DEBUG [PEWorker-15] procedure.MasterProcedureScheduler: Remove ServerQueue(100.107.165.61,60020,1576553057082, xlock=false sharedLock=0 size=0) from run queue because: queue is empty after polling out pid=69619, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=100.107.165.61,60020,1576553057082, splitWal=true, meta=true The SCP in the queue wait for 18 mins. |
That is awkward @binlijin With this added level of priority, server B will be able to be processed ahead of server A? And if during processing of B -- assigning meta to server C, and if C crashes scheduling an SCP, will it be able to be processed ahead of B in turn? |
@saintstack And if during processing of B -- assigning meta to server C, and if C crashes scheduling an SCP, will it be able to be processed ahead of B in turn? |
Ok. You want more feedback? What you think of what has been put here already sir? |
What do you think about bringing some life back into this approach? It may be a reasonable solution to HBASE-24526/HBASE-24673. |
…ashProcedure which carry meta.
a6f5e62
to
123596a
Compare
rebase the code with the latest master. |
So if some one have a better method, you can take it over or give some advices. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
…ashProcedure which carry meta.
The design is simple, WorkerMonitor frequently check the ProcedureScheduler and poll high priority procedure such as meta assign procedure and ServerCrashProcedure(carry meta) and immediately new a thread and execute it. So we need a way to poll the high priority procedure such as meta assign procedure and ServerCrashProcedure(carry meta).