
Scheduler can't restart until long-running local executor(s) finish #1389

Closed
xiaoliangsc opened this issue Apr 15, 2016 · 5 comments

Comments

@xiaoliangsc
Contributor

Dear Airflow Maintainers,

Before I tell you about my issue, let me describe my environment:

Environment

  • Version of Airflow (e.g. a release version, running your own fork, running off master -- provide a git log snippet): 1.7.0
  • Airflow components and configuration, if applicable (e.g. "Running a Scheduler with CeleryExecutor")
    LocalExecutor, set parallelism = 128 (see the config sketch after this list)
  • Example code to reproduce the bug (as a code snippet in markdown)
  • Screen shots of your DAG's graph and tree views:
  • Stack trace if applicable:
  • Operating System: (Windows Version or $ uname -a): Linux blizzard-flowrida 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux
  • Python Version: $ python --version: 2.7.9
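Roughly, the relevant configuration looks like the sketch below. Only executor = LocalExecutor and parallelism = 128 come from this list (plus the num_runs = 200 mentioned further down); everything else is an assumption.

```
# airflow.cfg (excerpt); only the two values below are taken from this report
[core]
executor = LocalExecutor
parallelism = 128
```

The scheduler is started with a bounded number of runs, e.g. `airflow scheduler --num_runs 200` (assuming the 1.7-era CLI flag).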

Now that you know a little about me, let me tell you about the issue I am having:

Description of Issue

  • What did you expect to happen?

When the scheduler exhausts its num_runs (we set it to 200), it should force the local executor's running tasks to stop (e.g. by sending a kill signal).

  • What happened instead?

The current implementation waits until the executor drains its queue. However, if there are long-running tasks, e.g. a sensor task waiting for a file to exist, the scheduler simply sits there doing nothing.

I'm not listing any code here because I feel it's pretty clear what the issue is. Can you please look into it or suggest a workaround? Thanks a lot!
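For concreteness, here is a minimal sketch of the kind of setup described above (hypothetical DAG, task names and file path, assuming Airflow 1.7-era module paths; this is not our actual production code):

```python
import os
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.sensors import BaseSensorOperator


class FileExistsSensor(BaseSensorOperator):
    """Pokes until a file appears; with a long timeout, one run can occupy a
    LocalExecutor worker for hours."""

    def __init__(self, filepath, *args, **kwargs):
        super(FileExistsSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath

    def poke(self, context):
        return os.path.exists(self.filepath)


dag = DAG(
    dag_id='long_running_sensor_example',
    start_date=datetime(2016, 4, 1),
    schedule_interval=timedelta(days=1),
)

wait_for_file = FileExistsSensor(
    task_id='wait_for_file',
    filepath='/data/input/ready.flag',  # hypothetical path
    poke_interval=60,
    timeout=60 * 60 * 24,               # poke for up to a day
    dag=dag,
)
```

Start the scheduler with a num_runs limit while the sensor is running: the scheduler finishes its runs but cannot exit until the file appears or the sensor times out.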

@bolkedebruin
Contributor

Can you clarify what is happening? I don't think your analysis is correct. The scheduler does not wait on the executor unless the SequentialExecutor is used. If you use SubDags heavily you can run out of executor slots, but what you are describing seems different.

Please provide a way to replicate the behavior you are observing.

@xiaoliangsc
Contributor Author

Well, I'm using LocalExecutor without many SubDags.

When the scheduler runs out of cycles, it calls the executor's end():
https://github.com/airbnb/airflow/blob/master/airflow/jobs.py#L788

which leads to https://github.com/airbnb/airflow/blob/master/airflow/executors/local_executor.py#L75

Per https://docs.python.org/2/library/multiprocessing.html#multiprocessing.JoinableQueue, the JoinableQueue used by LocalExecutor blocks in join() until every queued message has been marked done by the workers. If any worker process is running a long task, that call blocks.
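To illustrate (a standalone multiprocessing sketch, not Airflow code; names are mine): join() only returns once every queued item has been matched by a task_done() call, so one slow worker blocks the caller in the same way a long-running sensor blocks LocalExecutor.end().

```python
import time
from multiprocessing import JoinableQueue, Process


def worker(queue):
    while True:
        task = queue.get()
        if task is None:       # shutdown message, similar to the poison pill end() enqueues
            queue.task_done()
            break
        time.sleep(task)       # stands in for a long-running sensor task
        queue.task_done()


if __name__ == '__main__':
    queue = JoinableQueue()
    proc = Process(target=worker, args=(queue,))
    proc.start()

    queue.put(3600)            # a "task" that takes an hour
    queue.put(None)            # ask the worker to exit afterwards

    queue.join()               # blocks here until the hour-long task is done
    proc.join()
```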

Anyway, that's just my brief investigation of a production issue we hit using LocalExecutor with a long-running sensor task. I'll try to reproduce it locally and paste some code.

@bolkedebruin
Contributor

Ah ok. Now I understand the issue, and I agree that Airflow should behave differently, although I don't think Airflow should just terminate: your data could be left in limbo, so the question becomes how to shut down cleanly.

The current behavior does seem to defeat the purpose of num_runs, but I am not sure; maybe @mistercrunch or @r39132 want to comment on this.

@xiaoliangsc
Contributor Author

xiaoliangsc commented Apr 18, 2016

Yeah, the scheduler behavior is somewhat inconsistent between executors. With CeleryExecutor, the scheduler appears to finish its num_runs without waiting on any task execution, whereas with LocalExecutor it does wait. As for shutting down tasks, we've designed our operators to be idempotent using XCom, so a random task kill is fine for us. As a workaround, we gave the sensor task a short timeout with many retries so that it doesn't block the scheduler indefinitely (see the sketch below). It would be great to hear what you think about this issue. Thanks a lot!
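For reference, a sketch of that workaround (hypothetical numbers, reusing the FileExistsSensor and dag from the example sketch earlier in this thread):

```python
from datetime import timedelta

wait_for_file = FileExistsSensor(
    task_id='wait_for_file',
    filepath='/data/input/ready.flag',  # hypothetical path
    poke_interval=60,
    timeout=15 * 60,                    # give up after 15 minutes of poking...
    retries=96,                         # ...and retry, covering roughly a day in total
    retry_delay=timedelta(minutes=1),
    dag=dag,
)
```

When the sensor times out, the task instance fails and is retried, so no single run holds a LocalExecutor worker (and therefore a shutting-down scheduler) for more than about 15 minutes.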

@bolkedebruin
Contributor

@xiaoliangsc can you create a Jira issue for this? It will be a while before we can fix this. Thanks!

@bolkedebruin bolkedebruin reopened this May 8, 2016
@kaxil kaxil closed this as completed Mar 20, 2020
mobuchowski pushed a commit to mobuchowski/airflow that referenced this issue Jan 4, 2022
…and changed methods to rely on job events for access to job info (apache#1389)

Signed-off-by: Michael Collado <mike@datakin.com>