Subdag operators consuming all celeryd worker processes. Tasks are hanging in queued or no state #1350
Comments
I've observed that restarting the worker processes sometimes unblocks a few tasks.
Tested with 5 subdags per level and it runs fine... just completed 3 runs, each covering a two-day period... a limit is being hit somewhere. This works (5 subdags per level):
Performed the same test with 10 subdags per level (2 days). Everything worked as expected.
So I think I'm on to something... everything grinds to a halt when 32 subdag operators are running. I have two workers, each with a celery_concurrency of 16 (32 Celery processes in total).
When they are all consumed, nothing else can run, including the sub-tasks that would allow the subdag operators to complete. I am going to boost the celery concurrency to confirm. Short of boosting the celery workers, I'm not sure what the best 'fix' for this would be.
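For reference, the 32-slot ceiling described above comes from the Celery worker concurrency setting. A minimal config sketch, assuming the 1.7-era key name (check the exact key in your version's airflow.cfg):

    # airflow.cfg on each of the two worker machines
    [celery]
    # 16 Celery worker processes per machine; 2 machines x 16 = 32 task slots total.
    # Once 32 SubDagOperator tasks occupy every slot, their sub-tasks can never
    # start, so the run deadlocks.
    celeryd_concurrency = 16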
@mistercrunch you mentioned using ...
Boosting the celery_concurrency worked... but that's not really a solution. I tried setting the sub-tasks to use the SequentialExecutor but hit the same issue once 32 subdag operators were running.
So I tried @mistercrunch's suggestion from https://groups.google.com/d/msg/airbnb_airflow/8NyLPHV1Fv8/VmgSx4EXBgAJ and this appears to resolve the issue... I'm a bit unclear on what the workflow actually is now, though, and am trying to figure it out from the logs. Are my sub-tasks still being distributed through Celery to the workers? Or is the SequentialExecutor in this case just responsible for queueing the sub-tasks for the CeleryExecutor to consume?
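For anyone landing here later, the workaround being referred to is passing a SequentialExecutor to the SubDagOperator, so the subdag's backfill runs its sub-tasks inside the parent task's own slot instead of submitting them back through Celery. A rough sketch assuming the 1.7-era API; the import paths and the build_subdag helper are illustrative, not the reporter's code:

    from airflow.executors import SequentialExecutor
    from airflow.operators.subdag_operator import SubDagOperator

    # build_subdag is a hypothetical helper returning a DAG whose dag_id is
    # '<parent_dag_id>.<task_id>', as SubDagOperator expects.
    section_1 = SubDagOperator(
        task_id='section_1',
        subdag=build_subdag(parent_dag.dag_id, 'section_1'),
        executor=SequentialExecutor(),  # sub-tasks run in this worker slot, not via Celery
        dag=parent_dag,
    )

Under this setup the SubDagOperator task itself should still be distributed through Celery, while its sub-tasks run sequentially on whichever worker picked it up rather than going back through the Celery queue.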
@jlowin @mistercrunch The issue is that subdag operators keep their slot and call "run", which uses a BackfillJob and then waits for that process to complete, holding the slot the entire time. IMHO the subdag should not call "run" but should schedule instead; that would let the subdag operator return right away and free its slot. Furthermore, I consider being able to specify the executor on the subdag operator an issue, as it allows someone to go beyond the resources that have been assigned by ops (i.e. this is known and marked as such in the docs). However, by specifying the executor you can essentially tie the dag to a node and make it possible to share data via, e.g., /tmp. I think that should be solved differently.
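To make the blocking behaviour concrete, the execute path being described works roughly like this (a simplified paraphrase of the SubDagOperator of that era, not the exact source):

    from airflow.models import BaseOperator

    class SubDagOperatorSketch(BaseOperator):
        def execute(self, context):
            # DAG.run() kicks off a BackfillJob for the subdag and blocks until it
            # finishes, so this task holds its Celery slot for the entire backfill.
            self.subdag.run(
                start_date=context['execution_date'],
                end_date=context['execution_date'],
                executor=self.executor,
            )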
See also work on AIRFLOW-20. @syvineckruyk Can you create a Jira on this and link it to AIRFLOW-20?
@bolkedebruin I cannot seem to create issues... do I need to be given permission?
Did you register? You should have access by default.
@bolkedebruin yeah, registered and logged in... I had to change whatever this option is to get it working.
Dear Airflow Maintainers,
Before I tell you about my issue, let me describe my environment:
Environment
airflow scheduler -n 5
I was using default settings but boosted concurrency in some runs to test. This test used the following:
Observed state (the original issue included screenshots of these views):
dag view
subdag view
dag runs
main dag page showing 32 running task instances (the subdag operators)
pools view showing 15 queued task instances (sub-tasks)
49 task instances (subdag operator tasks) with no state
1 successful task instance (not a subdag operator)
text view of a task instance log (97 total task instances)
NA
$ uname -a
$ python --version
$ pip freeze (or $ conda list)
Now that you know a little about me, let me tell you about the issue I am having:
Description of Issue
Reproduction Steps
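The original report did not include the test DAG itself; below is a minimal sketch of the fan-out shape described above (hypothetical names, 1.7-era imports), not the reporter's actual code:

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.subdag_operator import SubDagOperator

    START_DATE = datetime(2016, 4, 1)

    def build_subdag(parent_dag_id, child_id):
        # SubDagOperator expects the subdag's dag_id to be '<parent>.<task_id>'.
        subdag = DAG(dag_id='%s.%s' % (parent_dag_id, child_id),
                     start_date=START_DATE, schedule_interval='@daily')
        DummyOperator(task_id='subtask', dag=subdag)
        return subdag

    dag = DAG(dag_id='subdag_fanout', start_date=START_DATE,
              schedule_interval='@daily')

    # With the CeleryExecutor and 2 workers x celeryd_concurrency=16, 32 concurrently
    # running SubDagOperator tasks fill every slot and their sub-tasks stay queued.
    for i in range(32):
        SubDagOperator(
            task_id='section_%d' % i,
            subdag=build_subdag(dag.dag_id, 'section_%d' % i),
            dag=dag,
        )

Running this with the scheduler and two Celery workers at the concurrency above should reproduce the stall once all 32 subdag operators are running.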