In #1342, we added a background thread to detect and cancel hanging MPP queries. The basic idea is:

- For each MPP task, we attach a `progress_callback` to its running streams, which updates `task_progress` in `MPPTask` whenever the task makes progress.
- An MPP task monitor checks whether each MPP task has made progress in the past `timeout` seconds. If a task has already started (i.e. it has made at least one progress update) but has made no progress in the past `timeout` seconds, the task is marked as hanging.
- If an MPP task hangs, the MPP task monitor cancels the whole MPP query.
However, since the MPP task monitors on different nodes do not coordinate with each other, hang detection relies only on local information, which is not enough and may produce false-positive results. Take the following SQL as an example:
select * from c join (select * from a join b on a.id1 = b.id1) d on c.id = d.id
Assume there are 2 TiFlash nodes, and the plan of the above query is:
Assume tasks 1, 3, 5, 7, 9 are running on TiFlash node 1, and tasks 2, 4, 6, 8, 10 are running on TiFlash node 2.
Consider the case that:
- Table c is small, so tasks 5 and 6 finish in a short time, which means the build stage of tasks 9 and 10 also finishes in a short time. After finishing the build, tasks 9 and 10 start the probe stage, which needs to read data from tasks 7 and 8.
- Table b is big, so the build stage of tasks 7 and 8 takes a very long time.

Because tasks 7 and 8 contain a join operator, they start sending data to tasks 9 and 10 only after their build stage finishes. So after the build stage of tasks 9 and 10 finishes, they have to wait a long time before starting the probe stage. However, according to the hang-detection rules described above, if the build stage of tasks 7 and 8 takes too long, tasks 9 and 10 will be treated as timed-out tasks.
An intuitive improvement is that the MPP task monitor only cancels a query if all the MPP tasks on a node are hanging. But this does not solve the problem completely. Take the query `select * from a join b on a.id1 = b.id1` for example, assuming the query plan is:
Tasks 1, 3, 5 are running on TiFlash node 1 and tasks 2, 4, 6 on TiFlash node 2. Table a is big, but its data is so skewed that:
- Almost all of the data is on TiFlash node 2; node 1 holds only a small part of the data.
- After hash partitioning, almost all of the data goes to node 2.
In this case, after the build stage is finished (tasks 3 and 4 are finished), task 1 finishes quickly, so on TiFlash node 1 only task 5 is alive. Since it gets almost no data from tasks 3 and 4, it will "hang", and the MPP monitor will try to cancel the query, because task 5 is the only alive task on TiFlash node 1.
The above analysis shows that if only local information is available, MPP query hang detection may produce false positives, and it is usually unacceptable to kill an MPP query that is actually running normally. I think we can disable hang detection on the TiFlash side; if users want to kill an MPP query manually, they can use the `kill` command.