
Scheduler can't restart until long-running local executor(s) finish #1389

Closed
xiaoliangsc opened this issue Apr 15, 2016 · 5 comments

Comments

@xiaoliangsc
Contributor

Dear Airflow Maintainers,

Before I tell you about my issue, let me describe my environment:

Environment

  • Version of Airflow (e.g. a release version, running your own fork, running off master -- provide a git log snippet): 1.7.0
  • Airflow components and configuration, if applicable (e.g. "Running a Scheduler with CeleryExecutor")
    LocalExecutor, set parallelism = 128 (see the config sketch after this list)
  • Example code to reproduce the bug (as a code snippet in markdown)
  • Screen shots of your DAG's graph and tree views:
  • Stack trace if applicable:
  • Operating System: (Windows Version or $ uname -a): Linux blizzard-flowrida 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux
  • Python Version: $ python --version: 2.7.9
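Roughly, the relevant configuration looks like the sketch below. Only executor = LocalExecutor and parallelism = 128 come from this list (plus the num_runs = 200 mentioned further down); everything else is an assumption.

```
# airflow.cfg (excerpt); only the two values below are taken from this report
[core]
executor = LocalExecutor
parallelism = 128
```

The scheduler is started with a bounded number of runs, e.g. `airflow scheduler --num_runs 200` (assuming the 1.7-era CLI flag).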

Now that you know a little about me, let me tell you about the issue I am having:

Description of Issue

  • What did you expect to happen?

When the scheduler exhausts its num_runs (we set it to 200), it should force the local executor's running tasks to stop (e.g. by sending a kill signal).

  • What happened instead?

The current implementation waits until the executor drains its queue. However, if there are long-running tasks, e.g. a sensor task waiting for a file to exist, the scheduler simply sits there doing nothing.

I'm not listing any code here because I feel it's pretty clear what the issue is. Can you please look into it or suggest a workaround? Thanks a lot!
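For concreteness, here is a minimal sketch of the kind of setup described above (hypothetical DAG, task names and file path, assuming Airflow 1.7-era module paths; this is not our actual production code):

```python
import os
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.sensors import BaseSensorOperator


class FileExistsSensor(BaseSensorOperator):
    """Pokes until a file appears; with a long timeout, one run can occupy a
    LocalExecutor worker for hours."""

    def __init__(self, filepath, *args, **kwargs):
        super(FileExistsSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath

    def poke(self, context):
        return os.path.exists(self.filepath)


dag = DAG(
    dag_id='long_running_sensor_example',
    start_date=datetime(2016, 4, 1),
    schedule_interval=timedelta(days=1),
)

wait_for_file = FileExistsSensor(
    task_id='wait_for_file',
    filepath='/data/input/ready.flag',  # hypothetical path
    poke_interval=60,
    timeout=60 * 60 * 24,               # poke for up to a day
    dag=dag,
)
```

Start the scheduler with a num_runs limit while the sensor is running: the scheduler finishes its runs but cannot exit until the file appears or the sensor times out.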

@bolkedebruin
Contributor

Can you clarify what is happening? I don't think your analysis is correct. The scheduler does not wait on the executor unless the SequentialExecutor is used. If you use SubDags heavily you can run out of executor slots, but what you are describing seems different.

Please provide a way to replicate the behavior you are observing.

@xiaoliangsc
Contributor Author

Well, I'm using LocalExecutor without many SubDags.

When the scheduler runs out of cycles, it calls the executor's end():
https://github.com/airbnb/airflow/blob/master/airflow/jobs.py#L788

which leads to https://github.com/airbnb/airflow/blob/master/airflow/executors/local_executor.py#L75

Per https://docs.python.org/2/library/multiprocessing.html#multiprocessing.JoinableQueue, the JoinableQueue used by LocalExecutor blocks in join() until every queued message has been marked done by the workers. If any worker process is running a long task, that call blocks.
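To illustrate (a standalone multiprocessing sketch, not Airflow code; names are mine): join() only returns once every queued item has been matched by a task_done() call, so one slow worker blocks the caller in the same way a long-running sensor blocks LocalExecutor.end().

```python
import time
from multiprocessing import JoinableQueue, Process


def worker(queue):
    while True:
        task = queue.get()
        if task is None:       # shutdown message, similar to the poison pill end() enqueues
            queue.task_done()
            break
        time.sleep(task)       # stands in for a long-running sensor task
        queue.task_done()


if __name__ == '__main__':
    queue = JoinableQueue()
    proc = Process(target=worker, args=(queue,))
    proc.start()

    queue.put(3600)            # a "task" that takes an hour
    queue.put(None)            # ask the worker to exit afterwards

    queue.join()               # blocks here until the hour-long task is done
    proc.join()
```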

Anyway, that's just my brief investigation of a production issue we hit using LocalExecutor with a long-running sensor task. I'll try to reproduce it locally and paste some code.

@bolkedebruin
Contributor

Ah ok. Now I understand the issue, and I agree that Airflow should behave differently, although I don't think Airflow should just terminate: your data could be left in limbo, so the question becomes how to shut down cleanly.

The current behavior does seem to defeat the purpose of num_runs, but I am not sure; maybe @mistercrunch or @r39132 want to comment on this.

@xiaoliangsc
Contributor Author

xiaoliangsc commented Apr 18, 2016

Yeah, the scheduler behavior is somewhat inconsistent between executors. With CeleryExecutor, the scheduler appears to finish its num_runs without waiting on any task execution, whereas with LocalExecutor it does wait. As for shutting down tasks, we've designed our operators to be idempotent using XCom, so a random task kill is fine for us. As a workaround, we gave the sensor task a short timeout with many retries so that it doesn't block the scheduler indefinitely (see the sketch below). It would be great to hear what you think about this issue. Thanks a lot!
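For reference, a sketch of that workaround (hypothetical numbers, reusing the FileExistsSensor and dag from the example sketch earlier in this thread):

```python
from datetime import timedelta

wait_for_file = FileExistsSensor(
    task_id='wait_for_file',
    filepath='/data/input/ready.flag',  # hypothetical path
    poke_interval=60,
    timeout=15 * 60,                    # give up after 15 minutes of poking...
    retries=96,                         # ...and retry, covering roughly a day in total
    retry_delay=timedelta(minutes=1),
    dag=dag,
)
```

When the sensor times out, the task instance fails and is retried, so no single run holds a LocalExecutor worker (and therefore a shutting-down scheduler) for more than about 15 minutes.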

@bolkedebruin
Contributor

@xiaoliangsc can you create a Jira issue for this? It will be a while before we can fix this. Thanks!

@bolkedebruin bolkedebruin reopened this May 8, 2016
@kaxil kaxil closed this as completed Mar 20, 2020
mobuchowski pushed a commit to mobuchowski/airflow that referenced this issue Jan 4, 2022
…and changed methods to rely on job events for access to job info (apache#1389)

Signed-off-by: Michael Collado <mike@datakin.com>