-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SubDag Operator broken when used with pools - infinite retry of tasks #1225
Comments
@syvineckruyk Thanks so much for taking the time to fill out such a complete report. I believed I've identified the issue, but would you mind verifying it? The fix is in my branch, so you can either clone it or you can install it directly like this:
This branch includes the fixes from #1220, so we'd have to merge that first, but I think it's just about ready to go. I don't think #1220 directly impacts your issue, but it touches the same code so I'd like to avoid a messy conflict. Please let me know if you have any issues getting it running -- if this works for you I'll put through the PR. If you're curious, here's what I believe is going on:
|
@jlowin Awesome ... I'll give it a go ... will update later today |
@jlowin still getting some weird (but different) behavior ... the subdag with pool did what it should the first time but then actually succeeded which should be impossible as the task returns an exit code of 1. I need a bit more time for a better analysis. |
Strange! These subdags might be more trouble than they're worth :) @syvineckruyk instead of watching your DAG through the scheduler and airflow UI, could you just try kicking off a backfill and see if it completes or hangs (and if it properly sets the task states -- you can check those in the UI)? I want to try to minimize any other variables that could be affecting things.
(note with your settings this will kick off more than one run -- you might want to set your interval to 1 day for simplicity) It's a more controllable way to see what's happening than the Scheduler and in your case should give the same result because the subdag is ultimately just running that exact code. Sorry for the iterative debugging process -- if you post any modifications you make to your code I'll run it as well. |
If I run that backfill command (with a fresh That's Python 2.7.11 using the branch I linked above. Next I'll run it in the scheduler, and let's see if we can figure out what's going on. I modified your code slightly to make it run faster (literally, shortened the from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators import BashOperator, SubDagOperator
dag_name = 'test_18'
test_pool = 'test_pool'
start_time = datetime(2016, 3, 25)
default_args = {
'owner': 'Test',
'depends_on_past': True,
'start_date': start_time,
# 'email': [email_to],
# 'email_on_failure': True,
# 'email_on_retry': True,
# 'wait_for_downstream': False,
}
# Create the dag object
dag = DAG(dag_name,
default_args=default_args,
schedule_interval=timedelta(days=1))
def get_subdag(dag, sd_id, pool=None):
subdag = DAG(
dag_id='{parent_dag}.{sd_id}'.format(
parent_dag=dag.dag_id,
sd_id=sd_id),
params=dag.params,
default_args=dag.default_args,
template_searchpath=dag.template_searchpath,
user_defined_macros=dag.user_defined_macros,
)
t1 = BashOperator(
task_id='test_task',
bash_command='echo "hello" && sleep 1',
dag=subdag,
pool=pool
)
t2 = BashOperator(
task_id='test_task_2_with_exit_1',
bash_command='echo "hello" && sleep 1 && exit 1',
dag=subdag,
pool=pool
)
t2.set_upstream(t1)
sdo = SubDagOperator(
task_id=sd_id,
subdag=subdag,
retries=1,
retry_delay=timedelta(seconds=3),
dag=dag,
depends_on_past=True,
)
return sdo
sd1 = get_subdag(dag, sd_id='test_subdag')
sd2 = get_subdag(dag, sd_id='test_subdag_with_pool', pool=test_pool) |
@jlowin just got through a clean run ... (after resetdb) ... I want to perform a couple more tests/runs to make sure the behavior is consistent as I cannot explain my previous runs that had issues on your branch. |
@jlowin still some issues happening. So my current testing round is based on issuing backfill from the cli .. but i suspect the behavior is the same from the scheduler. I ran The tasks in retry (stuck there btw)... Have retries set to 0 ... so this should not be possible. Also as you can see the subdag operators that are parent to the sub-tasks are in success ... so i think we both saw clean runs due to reset db .. but something is still off if there are previous dag runs / task instances. Thanks for your help with all of this and let me know if you need anything from me. |
@syvineckruyk is |
@jlowin yeah .. im bumping up the version for clean logs. |
@syvineckruyk unfortunately I can't reproduce that weird case in your screenshot. I'm putting the exact I run:
The 3/26 backfill simply doesn't run because its dependencies haven't been met. That's the correct behavior, since (Something weird happens that I do need to fix but I don't think it's what you're seeing. When a task doesn't run because of dependencies, the command exits without failing. Therefore, the executor thinks it succeeded. But the task doesn't report a success, and so the executor concludes that something went wrong, puts up the "The airflow run command failed at reporting an error message" and, after three loops, fails the task. So you'll see the 3/26 task show up as failed with 0 try_number... that's wrong.) I've updated my branch to pull the latest fixes from master -- do you mind checking it out again? In my dags folder: from datetime import datetime, timedelta
from airflow.models import DAG, Pool
from airflow.operators import BashOperator, SubDagOperator
import airflow
session = airflow.settings.Session()
pool = (
session.query(Pool)
.filter(Pool.pool == 'test_pool')
.first())
if not pool:
session.add(Pool(pool='test_pool', slots=10))
session.commit()
dag_name = 'test_25'
test_pool = 'test_pool'
start_time = datetime(2016, 3, 25)
default_args = {
'owner': 'Test',
'depends_on_past': True,
'start_date': start_time,
# 'email': [email_to],
# 'email_on_failure': True,
# 'email_on_retry': True,
# 'wait_for_downstream': False,
}
# Create the dag object
dag = DAG(dag_name,
default_args=default_args,
schedule_interval=timedelta(days=1))
def get_subdag(dag, sd_id, pool=None):
subdag = DAG(
dag_id='{parent_dag}.{sd_id}'.format(
parent_dag=dag.dag_id,
sd_id=sd_id),
params=dag.params,
default_args=dag.default_args,
template_searchpath=dag.template_searchpath,
user_defined_macros=dag.user_defined_macros,
)
t1 = BashOperator(
task_id='test_task',
bash_command='echo "hello" && sleep 0.2',
dag=subdag,
pool=pool
)
t2 = BashOperator(
task_id='test_task_2_with_exit_1',
bash_command='echo "hello" && sleep 0.2 && exit 1',
dag=subdag,
pool=pool
)
t2.set_upstream(t1)
sdo = SubDagOperator(
task_id=sd_id,
subdag=subdag,
retries=0,
retry_delay=timedelta(seconds=0.2),
dag=dag,
depends_on_past=True,
)
return sdo
# sd1 = get_subdag(dag, sd_id='test_subdag')
sd2 = get_subdag(dag, sd_id='test_subdag_with_pool', pool=test_pool) |
@jlowin ill check it out now... thanks ! ... no need for apologies. let me know if I can do more ... I don't want to create a bunch of messy commits since you are already in deep. Let me know if you need anything from me other than testing. |
@jlowin are you using CeleryExecutor ? |
Not for testing. I'm using Sequential Executor -- are you using Celery Executor? I'm using a completely fresh, default version of Airflow. |
@jlowin
I'll continue testing different configurations. |
Set retries=2 and retry_delay=10
This also seems to work as expected ... will drill through the logs though to make sure. Should I be able to test with depends_on_past = true ? Or is there still an open issue ? Thanks ! |
@jlowin
The loop is not infinite ... but the sub-task queues, fails, and re-runs many times before the error bubbles up to the subdag operator. subdag operator log: node 1:
**
sub-task log: from ui:
|
@jlowin I noticed I had not set the depends_on_past=False in the subdag operator specification. Performed another test with this set and observed the same behavior as above. (multiple attempts before bubbling up to the subdag operator)
|
I've managed to replicate this behavior by dropping my |
@jlowin intersting observation ! ... I'll run some tests with upped hearbeat times. |
@syvineckruyk please give my latest commits a try. Here's what I found (I know I keep writing that but the bittersweet news is you're hitting lots of pain points!): the
Solution: This isn't a complete solution -- the problem will crop up if someone is running a scheduler along any other kind of job because the scheduler will try to run all queued tasks (in fact I wonder if it's the cause of some of the weird backfill errors people have been seeing). To do this properly, I will make the scheduler only work with its "own" tasks -- but that will involve more work. Please let me know if this solution works for now. p.s. if you want to really stress this scenario, set your scheduler_heartbeat to 0 Thanks! |
@jlowin this seems to be working much better. As far as actual execution, errors, and bubbling up of errors everything seems in order ... I am seeing more queuing activity then I would expect as I spin up workers ... Need to get the logs together ... Im at Strata conference all day today so more then likely will not get to this before this evening. Thanks again for all your help. Also just wanted to clarify your last statement.
Do you mean for example if someone were to run a backfill from cli while the schedule is running ? Are the subdag operator's "backfills" protected from this ? Thanks |
Yes, that's exactly what I mean -- the Scheduler is subtly greedy and tries to run ALL queued tasks even if they were queued by a subdagoperator's backfill (or potentially a CLI backfill). I will fix that with the PR that addresses all of your issues. |
You're making me sound like a basket case ;) |
I should have said: all of Airflow's issues! |
@jlowin In the subdag operator logs it appears that every task gets queued twice. Is it normal ? The task instance logs looks fine. Going to try with depends on past next. Code
Subdag operator log from worker 2 (no log on worker 1)
Subtask log from worker 1
Subtask log from worker 2
|
@syvineckruyk you're referring to the doubles near the top of the log, right? (After "starting attempt 1 of 3"). If so, that's normal. The line appears any time the task is sent to the executor. The first time, it gets queued into the pool and the second time it gets run. That's why you don't see it for the non-pooled tasks; they go straight to running. Maybe we can make that log a little clearer about executor queuing vs celery queuing vs pool queuing. |
@jlowin I think I understand ... just to be 100% sure. I am talking about these messages:
Which are also duplicated for "starting attempt 2 of 3" and "starting attempt 3 of 3"... Sounds like we are talking about the same thing though. |
Code run
Subdag Operator - Worker 1 - Log
Subdag Operator - Worker 2 - Log
Sub-Task - Worker 1 - Log
Sub-task - Worker 2 - Log
|
@syvineckruyk #1271 was just merged into master so I'm going to mark this issue as closed. If you have any other problems please let me know! |
thanks @jlowin great work on this fix. |
`SchedulerJob` contains a `set` of `TaskInstance`s called `queued_tis`. `SchedulerJob.process_events` loops through `queued_tis` and tries to remove completed tasks. However, without customizing `__eq__` and `__hash__`, the following two lines have no effect, never removing elements from `queued_tis` leading to infinite retries on failure. This is related to my comment on apache#216. The following code was introduced in the fix to apache#1225. ``` elif ti in self.queued_tis: self.queued_tis.remove(ti) ````
Dear Airflow Maintainers,
Before I tell you about my issue, let me describe my environment:
Environment
Centos 6
CeleryExecutor (redis)
Two workers, 1 scheduler, 1 webserver
MySql
$ uname -a
)Linux **** 2.6.32-431.29.2.el6.x86_64 Improving the search functionality in the graph view #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ python --version
Python 2.7.10
Now that you know a little about me, let me tell you about the issue I am having:
Description of Issue
When subdag operators are used with pools, sub-task failures are not "bubbled" up to the subdag operator. The contained failed task will retry indefinitely even when retries is set to 0.
When a subtask failure occurs the subdag operator remains in the running state, the sub-task quickly enters a queued state, and shortly there after re-enters a running state.
Reproduction Steps
This behavior has been observed on multiple versions. I did the testing for submitting this issue on 1.7.0rc1.
I also tested on HEAD 2016-03-26. The issue seems to get worst. The subdag which is not pooled (which works as expected on 1.7.0rc1) enters the retry state as it should .. but then never re-enters a running state. Hanging indefinitely.
The text was updated successfully, but these errors were encountered: