Scheduler can't restart until long-running local executor(s) finish #1389
Comments
Can you clarify what is happening? I don't think your analysis is correct: the scheduler does not wait on the executor except when the SequentialExecutor is used. However, if you are using SubDags heavily you can run out of executor slots, but what you are describing seems different. Please provide a way to replicate the behavior you are observing.
Well, I'm using LocalExecutor without many SubDags. When the scheduler runs out of cycles, it calls the executor's end(), which leads to https://github.com/airbnb/airflow/blob/master/airflow/executors/local_executor.py#L75. Per https://docs.python.org/2/library/multiprocessing.html#multiprocessing.JoinableQueue, that join() blocks until every queued item has been processed. Anyway, that's just my brief investigation of a production issue we hit while using LocalExecutor with a long-running sensor task. I'll try to reproduce it locally and post some code.
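To make the blocking behavior concrete, here is a minimal, self-contained sketch of the JoinableQueue pattern those lines rely on (illustrative code only, not the actual Airflow source; the shell command is a stand-in for a task):

```python
import subprocess
from multiprocessing import JoinableQueue, Process

def worker(queue):
    # Each worker pulls shell commands off the queue and only marks an item
    # done once the command has fully completed.
    while True:
        command = queue.get()
        if command is None:                        # poison pill: exit when seen
            queue.task_done()
            break
        subprocess.call(command, shell=True)       # a long-running sensor blocks here
        queue.task_done()

def end(queue, num_workers):
    # Ask every worker to stop once its current command finishes ...
    for _ in range(num_workers):
        queue.put(None)
    # ... then block until task_done() has been called for every queued item.
    # This join() is what keeps the scheduler from restarting while a
    # long-running task is still executing.
    queue.join()

if __name__ == "__main__":
    queue = JoinableQueue()
    workers = [Process(target=worker, args=(queue,)) for _ in range(2)]
    for w in workers:
        w.start()
    queue.put("sleep 60")                          # stands in for a long-running sensor
    end(queue, len(workers))                       # blocks for roughly 60 seconds
```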
Ah, ok. Now I understand the issue, and I agree that Airflow should behave differently, although I don't think Airflow should just terminate: your data could be left in limbo, and the question becomes how to shut down cleanly. Still, the current behavior seems to defeat the purpose of num_runs. I am not sure, though; maybe @mistercrunch or @r39132 wants to comment on this.
Yeah, the scheduler behavior is somewhat inconsistent between executors. E.g., with CeleryExecutor the scheduler appears to finish its num_runs without holding on to any task execution, whereas with LocalExecutor it does. In terms of shutting down tasks, we've designed our operators to be idempotent using XCom, so a random task kill is fine for us. As a workaround, we gave the sensor task a short timeout with many retries so that it doesn't always block the scheduler (see the sketch below). It would be great to know what you think about this issue. Thanks a lot!
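A rough sketch of that workaround, assuming an Airflow 1.x-style sensor import; the sensor class, DAG id, and file path are placeholders for whatever file-existence check is actually in use:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.sensors import HdfsSensor   # placeholder sensor, 1.x import path

dag = DAG(
    dag_id="sensor_timeout_workaround",             # hypothetical DAG id
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

wait_for_file = HdfsSensor(
    task_id="wait_for_file",
    filepath="/data/incoming/_SUCCESS",              # placeholder path
    poke_interval=60,                                # check once a minute
    timeout=10 * 60,                                 # fail each attempt after 10 minutes ...
    retries=50,                                      # ... but allow many retries, so no
    retry_delay=timedelta(minutes=1),                # single attempt blocks the scheduler for long
    dag=dag,
)
```

Each individual attempt now ends quickly, so the executor's queue drains within minutes even if the scheduler shuts down mid-wait.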
@xiaoliangsc can you create a Jira issue for this? It will be a while before we can fix this. Thanks!
Dear Airflow Maintainers,
Before I tell you about my issue, let me describe my environment:
Environment
LocalExecutor, set parallelism = 128
$ uname -a: Linux blizzard-flowrida 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux
$ python --version: 2.7.9

Now that you know a little about me, let me tell you about the issue I am having:
Description of Issue
When the scheduler runs out of num_runs (we set it to 200), it should force the local executor's running tasks to stop (e.g., by sending a kill signal).
The current implementation instead waits until the executor drains its queue. However, if we have long-running tasks, e.g. a sensor task waiting for a file to exist, the scheduler simply sits there waiting, doing nothing.
I'm not listing any code here because I feel the issue is pretty clear. Can you please look into it or suggest a workaround? Thanks a lot!
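Purely for illustration, a minimal sketch of the "force stop" idea (a hypothetical helper, not proposed Airflow code; workers stands for the executor's multiprocessing.Process workers):

```python
def end_forcefully(workers, grace_seconds=30):
    # Give each worker process a grace period to finish its current task ...
    for w in workers:
        w.join(timeout=grace_seconds)
    # ... then terminate (SIGTERM on Unix) anything still running instead of
    # blocking indefinitely on the queue.
    for w in workers:
        if w.is_alive():
            w.terminate()
            w.join()
```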