MPP query hang detect may get false positive result since it only has local information #1921

Closed
windtalker opened this issue May 18, 2021 · 0 comments · Fixed by #1922
Labels
severity/major, type/bug (The issue is confirmed as a bug.)

Comments

windtalker (Contributor) commented May 18, 2021

In #1342, we added a background thread to detect and cancel hanging MPP queries. The basic idea is (a minimal sketch follows the list):

  1. For each MPP task, we set a progress_callback on its running streams, which updates task_progress in MPPTask whenever the task makes progress.
  2. An MPP task monitor checks whether each MPP task has made progress in the past timeout seconds. If a task has already started (i.e. it has made at least one progress update) but has not made any progress in the past timeout seconds, the task is marked as hanging.
  3. If an MPP task hangs, the MPP task monitor cancels the whole MPP query.
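
For concreteness, here is a minimal sketch of that mechanism. All names (TaskProgress, on_progress, is_hanging) and the structure are hypothetical and simplified for illustration; this is not the actual TiFlash implementation.

```cpp
// Minimal sketch of per-task progress tracking and the monitor's hang check.
// Hypothetical, simplified names; not the actual TiFlash implementation.
#include <atomic>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

struct TaskProgress
{
    std::atomic<uint64_t> progress{0};    // bumped by the progress callback
    std::atomic<bool> started{false};     // true once the first progress is made
    // Monitor-side bookkeeping (touched only by the monitor thread).
    uint64_t progress_at_last_check = 0;
    Clock::time_point last_progress_seen{Clock::now()};
};

// Installed on every running stream of a task: any progress bumps the counter.
void on_progress(TaskProgress & tp)
{
    tp.started.store(true, std::memory_order_relaxed);
    tp.progress.fetch_add(1, std::memory_order_relaxed);
}

// One monitor round for one task: a task is "hanging" if it has started but its
// progress counter has not moved for at least `timeout`.
bool is_hanging(TaskProgress & tp, Clock::duration timeout)
{
    if (!tp.started.load(std::memory_order_relaxed))
        return false;                        // never flag a task that has not started
    const uint64_t current = tp.progress.load(std::memory_order_relaxed);
    const auto now = Clock::now();
    if (current != tp.progress_at_last_check)
    {
        tp.progress_at_last_check = current; // progress was made, reset the window
        tp.last_progress_seen = now;
        return false;
    }
    return now - tp.last_progress_seen >= timeout;
}
// If any started task on this node is judged hanging, the monitor cancels the
// whole MPP query (step 3 above).
```

The important property for the discussion below is that the decision in is_hanging uses only the task's own progress counter, i.e. purely local information.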

However, since the MPP task monitors on different nodes do not coordinate with each other, hang detection relies only on local information, which is not enough and may produce false positives. Take the following SQL as an example:

select * from c join (select * from a join b on a.id1 = b.id1) d on c.id = d.id

Assuming there are 2 TiFlash nodes, the plan of the above query is as follows:
(query plan screenshot: Screen Shot 2021-05-18 at 4 27 31 PM)

Assume tasks 1, 3, 5, 7, 9 are running on TiFlash node 1, and tasks 2, 4, 6, 8, 10 are running on TiFlash node 2.

Consider the case that:

  1. Table c is a small table, so tasks 5 and 6 will finish in a short time, which means the build stage of tasks 9 and 10 will also finish in a short time. After the build finishes, tasks 9 and 10 will start the probe stage, which needs to read data from tasks 7 and 8.
  2. Table b is a big table, so the build stage of tasks 7 and 8 will take a very long time.

Because tasks 7 and 8 contain a join operator, they start sending data to tasks 9 and 10 only after their build stage finishes. So after the build stage of tasks 9 and 10 finishes, they have to wait a long time before they can start the probe stage. However, according to the hang detection rules described above, if the build stage of tasks 7 and 8 takes too long, tasks 9 and 10 will be treated as timed-out tasks.
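
To make the timing concrete, here is a small self-contained simulation of this case. The durations and the timeout are made-up, scaled-down values, and the tasks are modelled as plain threads rather than real MPP tasks: "task 9" makes one progress update when its own build finishes and then blocks on "task 7", whose build runs longer than the monitor timeout, so a monitor that only sees task 9's local progress counter flags a healthy query as hung.

```cpp
// Standalone simulation of the false positive: "task 9" (downstream) makes one
// progress update when its own build finishes, then waits for data from
// "task 7", whose build runs longer than the hang-detect timeout.
// All durations are scaled down and made up for illustration.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    using namespace std::chrono_literals;
    const auto hang_timeout = 2s;        // assumed monitor timeout (scaled down)
    const auto upstream_build_time = 5s; // build of task 7/8 on the big table b

    std::atomic<int> task9_progress{0};
    std::atomic<bool> upstream_done{false};

    // Task 9: its build side (fed by tasks 5/6 on small table c) finishes quickly,
    // producing one progress update; then the probe stage is blocked on task 7.
    std::thread task9([&] {
        std::this_thread::sleep_for(200ms);  // fast build
        task9_progress.fetch_add(1);         // progress callback fires once
        while (!upstream_done.load())        // waiting for probe data from task 7
            std::this_thread::sleep_for(50ms);
        task9_progress.fetch_add(1);         // probe data finally arrives
    });

    // Task 7: long build stage; it only sends data downstream after it finishes.
    std::thread task7([&] {
        std::this_thread::sleep_for(upstream_build_time);
        upstream_done.store(true);
    });

    // Local monitor on node 1: it only sees task 9's progress counter.
    int last_seen = 0;
    auto last_change = std::chrono::steady_clock::now();
    for (int i = 0; i < 80; ++i)
    {
        std::this_thread::sleep_for(100ms);
        const int current = task9_progress.load();
        const auto now = std::chrono::steady_clock::now();
        if (current != last_seen)
        {
            last_seen = current;
            last_change = now;
        }
        else if (current > 0 && now - last_change >= hang_timeout)
        {
            std::puts("monitor: task 9 looks hung -> would cancel the whole query (false positive)");
            break;
        }
    }
    task9.join();
    task7.join();
    return 0;
}
```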

An intuitive improvement is that the MPP task monitor only cancels a query if all of the local MPP tasks are hanging (a short sketch of this per-node check follows the example below). But this does not solve the problem completely. Take the query select * from a join b on a.id1 = b.id1 as an example, and assume the query plan is as follows:
(query plan screenshot: Screen Shot 2021-05-18 at 5 14 31 PM)

Tasks 1, 3, 5 are running on TiFlash node 1 and tasks 2, 4, 6 are running on TiFlash node 2. Table a is a big table, but its data is so skewed that:

  1. Almost all of the data is on TiFlash node 2; node 1 contains only a small part of the data.
  2. After hash partitioning, almost all of the data goes to node 2.

In this case, after the build stage is finished (tasks 3 and 4 are finished), task 1 finishes quickly, so on TiFlash node 1 only task 5 is alive. Since it receives (almost) no data from tasks 3 and 4, it looks like it "hangs", and the MPP task monitor will try to cancel the query because task 5 is the only alive task on TiFlash node 1.
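
A sketch of that per-node rule, with hypothetical names and a hard-coded view of node 1 in the skewed example above, shows why it still misfires:

```cpp
// Sketch of the improved per-node rule: cancel only when every task that is
// still alive on this node is hanging. Hypothetical names, not TiFlash code.
#include <cassert>
#include <vector>

struct LocalTaskView
{
    bool finished; // the task has already completed on this node
    bool hanging;  // no progress within the timeout, judged from local info only
};

bool should_cancel_query(const std::vector<LocalTaskView> & local_tasks)
{
    bool any_alive = false;
    for (const auto & t : local_tasks)
    {
        if (t.finished)
            continue;
        any_alive = true;
        if (!t.hanging)
            return false; // some alive task is still making progress
    }
    return any_alive;     // every alive task looks hung
}

int main()
{
    // Node 1 in the skewed example: tasks 1 and 3 have finished, task 5 is the
    // only alive task and receives (almost) no data, so it looks hung locally.
    std::vector<LocalTaskView> node1 = {{true, false}, {true, false}, {false, true}};
    assert(should_cancel_query(node1)); // the node would still cancel a healthy query
    return 0;
}
```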

The above analysis shows that with only local information, MPP query hang detection may produce false positives. It is usually unacceptable to kill an MPP query that is actually running normally, so I think we can disable hang detection on the TiFlash side; if users want to kill an MPP query manually, they can use the kill command to do so.
