MPP query hang detect may get false positive result since it only has local information #1921

Closed
windtalker opened this issue May 18, 2021 · 0 comments · Fixed by #1922
Labels
severity/major, type/bug (The issue is confirmed as a bug.)

Comments

windtalker (Contributor) commented May 18, 2021

In #1342, we added a background thread to detect and cancel hanging MPP queries. The basic idea is (a minimal sketch follows the list):

  1. For each MPP task, we set a progress_callback on its running streams, which updates task_progress in MPPTask whenever the task makes progress.
  2. An MPP task monitor checks whether each MPP task has made progress in the past timeout seconds. If a task has already started (i.e. it has made at least one progress update) but has not made any progress in the past timeout seconds, the task is marked as hanging.
  3. If an MPP task hangs, the MPP task monitor cancels the whole MPP query.
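
For concreteness, here is a minimal sketch of that mechanism. All names (TaskProgress, on_progress, is_hanging) and the structure are hypothetical and simplified for illustration; this is not the actual TiFlash implementation.

```cpp
// Minimal sketch of per-task progress tracking and the monitor's hang check.
// Hypothetical, simplified names; not the actual TiFlash implementation.
#include <atomic>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

struct TaskProgress
{
    std::atomic<uint64_t> progress{0};    // bumped by the progress callback
    std::atomic<bool> started{false};     // true once the first progress is made
    // Monitor-side bookkeeping (touched only by the monitor thread).
    uint64_t progress_at_last_check = 0;
    Clock::time_point last_progress_seen{Clock::now()};
};

// Installed on every running stream of a task: any progress bumps the counter.
void on_progress(TaskProgress & tp)
{
    tp.started.store(true, std::memory_order_relaxed);
    tp.progress.fetch_add(1, std::memory_order_relaxed);
}

// One monitor round for one task: a task is "hanging" if it has started but its
// progress counter has not moved for at least `timeout`.
bool is_hanging(TaskProgress & tp, Clock::duration timeout)
{
    if (!tp.started.load(std::memory_order_relaxed))
        return false;                        // never flag a task that has not started
    const uint64_t current = tp.progress.load(std::memory_order_relaxed);
    const auto now = Clock::now();
    if (current != tp.progress_at_last_check)
    {
        tp.progress_at_last_check = current; // progress was made, reset the window
        tp.last_progress_seen = now;
        return false;
    }
    return now - tp.last_progress_seen >= timeout;
}
// If any started task on this node is judged hanging, the monitor cancels the
// whole MPP query (step 3 above).
```

The important property for the discussion below is that the decision in is_hanging uses only the task's own progress counter, i.e. purely local information.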

However, since the MPP task monitors on different nodes do not coordinate with each other, hang detection relies only on local information, which is not enough and may produce false positives. Take the following SQL as an example:

select * from c join (select * from a join b on a.id1 = b.id1) d on c.id = d.id

Assuming there are 2 TiFlash nodes, the plan of the above query is as follows:
(query plan screenshot: Screen Shot 2021-05-18 at 4 27 31 PM)

Assume tasks 1, 3, 5, 7, 9 are running on TiFlash node 1, and tasks 2, 4, 6, 8, 10 are running on TiFlash node 2.

Consider the case that:

  1. Table c is a small table, so tasks 5 and 6 will finish in a short time, which means the build stage of tasks 9 and 10 will also finish in a short time. After the build finishes, tasks 9 and 10 will start the probe stage, which needs to read data from tasks 7 and 8.
  2. Table b is a big table, so the build stage of tasks 7 and 8 will take a very long time.

Because tasks 7 and 8 contain a join operator, they start sending data to tasks 9 and 10 only after their build stage finishes. So after the build stage of tasks 9 and 10 finishes, they have to wait a long time before they can start the probe stage. However, according to the hang detection rules described above, if the build stage of tasks 7 and 8 takes too long, tasks 9 and 10 will be treated as timed-out tasks.
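
To make the timing concrete, here is a small self-contained simulation of this case. The durations and the timeout are made-up, scaled-down values, and the tasks are modelled as plain threads rather than real MPP tasks: "task 9" makes one progress update when its own build finishes and then blocks on "task 7", whose build runs longer than the monitor timeout, so a monitor that only sees task 9's local progress counter flags a healthy query as hung.

```cpp
// Standalone simulation of the false positive: "task 9" (downstream) makes one
// progress update when its own build finishes, then waits for data from
// "task 7", whose build runs longer than the hang-detect timeout.
// All durations are scaled down and made up for illustration.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    using namespace std::chrono_literals;
    const auto hang_timeout = 2s;        // assumed monitor timeout (scaled down)
    const auto upstream_build_time = 5s; // build of task 7/8 on the big table b

    std::atomic<int> task9_progress{0};
    std::atomic<bool> upstream_done{false};

    // Task 9: its build side (fed by tasks 5/6 on small table c) finishes quickly,
    // producing one progress update; then the probe stage is blocked on task 7.
    std::thread task9([&] {
        std::this_thread::sleep_for(200ms);  // fast build
        task9_progress.fetch_add(1);         // progress callback fires once
        while (!upstream_done.load())        // waiting for probe data from task 7
            std::this_thread::sleep_for(50ms);
        task9_progress.fetch_add(1);         // probe data finally arrives
    });

    // Task 7: long build stage; it only sends data downstream after it finishes.
    std::thread task7([&] {
        std::this_thread::sleep_for(upstream_build_time);
        upstream_done.store(true);
    });

    // Local monitor on node 1: it only sees task 9's progress counter.
    int last_seen = 0;
    auto last_change = std::chrono::steady_clock::now();
    for (int i = 0; i < 80; ++i)
    {
        std::this_thread::sleep_for(100ms);
        const int current = task9_progress.load();
        const auto now = std::chrono::steady_clock::now();
        if (current != last_seen)
        {
            last_seen = current;
            last_change = now;
        }
        else if (current > 0 && now - last_change >= hang_timeout)
        {
            std::puts("monitor: task 9 looks hung -> would cancel the whole query (false positive)");
            break;
        }
    }
    task9.join();
    task7.join();
    return 0;
}
```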

An intuitive improvement is that the MPP task monitor only cancels a query if all of the local MPP tasks are hanging (a short sketch of this per-node check follows the example below). But this does not solve the problem completely. Take the query select * from a join b on a.id1 = b.id1 as an example, and assume the query plan is as follows:
(query plan screenshot: Screen Shot 2021-05-18 at 5 14 31 PM)

Tasks 1, 3, 5 are running on TiFlash node 1 and tasks 2, 4, 6 are running on TiFlash node 2. Table a is a big table, but its data is so skewed that:

  1. Almost all of the data is on TiFlash node 2; node 1 contains only a small part of the data.
  2. After hash partitioning, almost all of the data goes to node 2.

In this case, after the build stage is finished (tasks 3 and 4 are finished), task 1 finishes quickly, so on TiFlash node 1 only task 5 is alive. Since it receives (almost) no data from tasks 3 and 4, it looks like it "hangs", and the MPP task monitor will try to cancel the query because task 5 is the only alive task on TiFlash node 1.
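
A sketch of that per-node rule, with hypothetical names and a hard-coded view of node 1 in the skewed example above, shows why it still misfires:

```cpp
// Sketch of the improved per-node rule: cancel only when every task that is
// still alive on this node is hanging. Hypothetical names, not TiFlash code.
#include <cassert>
#include <vector>

struct LocalTaskView
{
    bool finished; // the task has already completed on this node
    bool hanging;  // no progress within the timeout, judged from local info only
};

bool should_cancel_query(const std::vector<LocalTaskView> & local_tasks)
{
    bool any_alive = false;
    for (const auto & t : local_tasks)
    {
        if (t.finished)
            continue;
        any_alive = true;
        if (!t.hanging)
            return false; // some alive task is still making progress
    }
    return any_alive;     // every alive task looks hung
}

int main()
{
    // Node 1 in the skewed example: tasks 1 and 3 have finished, task 5 is the
    // only alive task and receives (almost) no data, so it looks hung locally.
    std::vector<LocalTaskView> node1 = {{true, false}, {true, false}, {false, true}};
    assert(should_cancel_query(node1)); // the node would still cancel a healthy query
    return 0;
}
```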

The above analysis shows that with only local information, MPP query hang detection may produce false positives. It is usually unacceptable to kill an MPP query that is actually running normally, so I think we can disable hang detection on the TiFlash side; if users want to kill an MPP query manually, they can use the kill command to do so.
