AIRFLOW-128 Optimize and refactor process_dag #1514

bolkedebruin · 2016-05-18T20:57:07Z

This addresses AIRFLOW-128.

@aoen @artwr @mistercrunch @r39132 @jlowin : ready for review.

Goals:

Improve readability of the code and generic assumptions (getters should not change a state) for DagRuns
Improve robustness and lower risk of race conditions in the scheduler
Reduce amount of calls to the database, limit connections in the scheduler
Identify speed optimizations possibilities

What has changed:

Two new TaskInstance states have been introduced. "REMOVED" and "SCHEDULED". REMOVED will be set when taskinstances are encountered that do no exist anymore in the DAG. This happens when a DAG is changed (ie. a new version). The "REMOVED" state exists for lineage purposes. "SCHEDULED" is used when a Task that did not have a state before is sent to the executor. It is used by both the scheduler and backfills. This state almost removes the race condition that exists if using multiple schedulers: due to the fact UP_FOR_RETRY is being managed by the TaskInstance (I think that is the wrong place) is still exists for that state.
get_active_runs was a getter that was also updating to the database. This patch refactors get_active_runs into two different functions that are now part of DagRun. 1) update_state updates the state of the dagrun based on the taskinstances of the dagrun. 2) verify_integrity checks and updates the dag run based on if the dag contains new or missing tasks.
DagRun.update_state has been updated to not call the database twice for the same functions. This reduces the time spent here by 50% in certain occasions when having many tasks in a Dag that need to be evaluated. Still this needs to be faster: for those Dags with many tasks the aggregation query in TaskInstance.are_dependencies_met is very expensive. It should be refactored.
process_dag has been updated to use the functions and cleaned up, making it much more readable. Tasks are now properly locked by the database. I have played with multiprocessing here (on dagruns and taskinstances) but left it out for now. Fixing the above will help more I think.

Stats:

New scheduler time spent at earlier stages is a bit more, due to eager creation of TaskIntances
Old scheduler MAX is higher due to "are_dependencies_met()" called twice
New scheduler fluctuates a bit more due to database locking and "are_dependencies_met" scanning the table (needs to wait for lock).

Old:

2016-05-23 11:17:39,031 INFO - Loop took: 0.018685 seconds
2016-05-23 11:17:44,022 INFO - Loop took: 0.013455 seconds
2016-05-23 11:17:49,033 INFO - Loop took: 0.018868 seconds
2016-05-23 11:17:54,031 INFO - Loop took: 0.019578 seconds
2016-05-23 11:17:59,024 INFO - Loop took: 0.013051 seconds
2016-05-23 11:18:05,026 INFO - Loop took: 1.010049 seconds
2016-05-23 11:18:10,399 INFO - Loop took: 1.390761 seconds
2016-05-23 11:18:32,112 INFO - Loop took: 18.104715 seconds
2016-05-23 11:19:03,432 INFO - Loop took: 31.089109 seconds
2016-05-23 11:19:14,140 INFO - Loop took: 10.492135 seconds
2016-05-23 11:19:38,339 INFO - Loop took: 24.05553 seconds
2016-05-23 11:20:06,281 INFO - Loop took: 27.887196 seconds
2016-05-23 11:20:30,215 INFO - Loop took: 23.9155 seconds
2016-05-23 11:20:53,953 INFO - Loop took: 23.375444 seconds
2016-05-23 11:21:29,168 INFO - Loop took: 35.191994 seconds
2016-05-23 11:22:42,276 INFO - Loop took: 72.384736 seconds
2016-05-23 11:23:06,276 INFO - Loop took: 23.831495 seconds
2016-05-23 11:23:38,852 INFO - Loop took: 32.333608 seconds

New:

2016-05-23 11:26:12,021 INFO - Loop took: 0.011257 seconds
2016-05-23 11:26:17,031 INFO - Loop took: 0.018259 seconds
2016-05-23 11:26:22,021 INFO - Loop took: 0.01233 seconds
2016-05-23 11:26:27,026 INFO - Loop took: 0.017952 seconds
2016-05-23 11:26:32,031 INFO - Loop took: 0.017606 seconds
2016-05-23 11:26:37,707 INFO - Loop took: 0.697367 seconds
2016-05-23 11:26:43,268 INFO - Loop took: 1.255278 seconds
2016-05-23 11:27:01,234 INFO - Loop took: 14.225399 seconds
2016-05-23 11:27:03,832 INFO - Loop took: 2.580292 seconds
2016-05-23 11:27:35,556 INFO - Loop took: 29.534056 seconds
2016-05-23 11:27:55,896 INFO - Loop took: 20.321862 seconds
2016-05-23 11:28:10,192 INFO - Loop took: 14.250471 seconds
2016-05-23 11:28:40,778 INFO - Loop took: 30.337702 seconds
2016-05-23 11:28:49,003 INFO - Loop took: 8.135393 seconds
2016-05-23 11:29:09,132 INFO - Loop took: 19.923375 seconds
2016-05-23 11:29:46,856 INFO - Loop took: 37.393256 seconds
2016-05-23 11:30:30,984 INFO - Loop took: 43.79019 seconds
2016-05-23 11:31:09,856 INFO - Loop took: 38.444254 seconds
2016-05-23 11:31:31,164 INFO - Loop took: 21.061177 seconds
2016-05-23 11:32:29,776 INFO - Loop took: 58.153763 seconds
2016-05-23 11:33:12,128 INFO - Loop took: 42.105758 seconds
2016-05-23 11:33:50,796 INFO - Loop took: 38.137385 seconds
2016-05-23 11:34:40,290 INFO - Loop took: 49.115855 seconds
2016-05-23 11:35:22,900 INFO - Loop took: 42.269646 seconds
2016-05-23 11:35:38,456 INFO - Loop took: 15.453262 seconds

** Note: unittests have been added to cover process_dag **

Dag used for testing:

Please note that both schedulers this Dag does not even finish. It only finished properly with the fully implemented new scheduler (including connecting dagruns by using previous ids)

from datetime import timedelta, datetime
from airflow.models import DAG, Pool
from airflow.operators import BashOperator, SubDagOperator, DummyOperator
from airflow.executors import SequentialExecutor
import airflow


# -----------------------------------------------------------------\
# DEFINE THE POOLS
# -----------------------------------------------------------------/
session = airflow.settings.Session()
for p in ['test_pool_1', 'test_pool_2', 'test_pool_3']:
    pool = (
        session.query(Pool)
        .filter(Pool.pool == p)
        .first())
    if not pool:
        session.add(Pool(pool=p, slots=8))
        session.commit()


# -----------------------------------------------------------------\
# DEFINE THE DAG
# -----------------------------------------------------------------/

# Define the Dag Name. This must be unique.
dag_name = 'hanging_subdags_n16_sqe'

# Default args are passed to each task
default_args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 04, 10),
    'retries': 0,
    'retry_interval': timedelta(minutes=5),
    'email': ['your@email.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'wait_for_downstream': False,
}

# Create the dag object
dag = DAG(dag_name,
          default_args=default_args,
          schedule_interval='0 0 * * *'
          )

# -----------------------------------------------------------------\
# DEFINE THE TASKS
# -----------------------------------------------------------------/


def get_subdag(dag, sd_id, pool=None):
    subdag = DAG(
        dag_id='{parent_dag}.{sd_id}'.format(
            parent_dag=dag.dag_id,
            sd_id=sd_id),
        params=dag.params,
        default_args=dag.default_args,
        template_searchpath=dag.template_searchpath,
        user_defined_macros=dag.user_defined_macros,
    )

    t1 = BashOperator(
        task_id='{sd_id}_step_1'.format(
            sd_id=sd_id
        ),
        bash_command='echo "hello" && sleep 1',
        dag=subdag,
        pool=pool
    )

    t2 = BashOperator(
        task_id='{sd_id}_step_two'.format(
            sd_id=sd_id
        ),
        bash_command='echo "hello" && sleep 2',
        dag=subdag,
        pool=pool
    )

    t2.set_upstream(t1)

    sdo = SubDagOperator(
        task_id=sd_id,
        subdag=subdag,
        retries=0,
        retry_delay=timedelta(seconds=5),
        dag=dag,
        depends_on_past=False,
        executor=SequentialExecutor()
    )

    return sdo

start_task = DummyOperator(
    task_id='start',
    dag=dag
)

for n in range(1, 17):
    sd_i = get_subdag(dag=dag, sd_id='level_1_{n}'.format(n=n), pool='test_pool_1')
    sd_ii = get_subdag(dag=dag, sd_id='level_2_{n}'.format(n=n), pool='test_pool_2')
    sd_iii = get_subdag(dag=dag, sd_id='level_3_{n}'.format(n=n), pool='test_pool_3')

    sd_i.set_upstream(start_task)
    sd_ii.set_upstream(sd_i)
    sd_iii.set_upstream(sd_ii)

criccomini · 2016-05-19T17:00:16Z

airflow/models.py

@@ -1061,6 +1060,7 @@ def are_dependencies_met(

        task = self.task

+        logging.info("Checkpoint A")


Assuming this is for debugging, and will go away before final merge

Absolutely, everything after my initial commit in this pr is actually profiling stuff (ie. WIP). A lot of time is spent in are_dependencies_met as it is iterating over all tasks now every time now.

landscape-bot · 2016-05-20T19:42:47Z

Repository health decreased by 0.48% when pulling 0a84f4a on bolkedebruin:process_dag into ccfc4c8 on apache:master.

14 new problems were found (including 1 error and 9 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-21T14:14:27Z

Repository health decreased by 0.42% when pulling 632ed76 on bolkedebruin:process_dag into ccfc4c8 on apache:master.

13 new problems were found (including 1 error and 8 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-21T19:01:42Z

Repository health decreased by 0.50% when pulling b0bfa3f on bolkedebruin:process_dag into ccfc4c8 on apache:master.

15 new problems were found (including 1 error and 10 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T11:08:54Z

Repository health decreased by 0.43% when pulling e60fa08 on bolkedebruin:process_dag into ccfc4c8 on apache:master.

14 new problems were found (including 1 error and 8 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T11:15:27Z

Repository health decreased by 0.52% when pulling 726de7c on bolkedebruin:process_dag into ccfc4c8 on apache:master.

16 new problems were found (including 1 error and 10 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T11:22:47Z

Repository health decreased by 0.71% when pulling d5d63bc on bolkedebruin:process_dag into ccfc4c8 on apache:master.

21 new problems were found (including 1 error and 14 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T15:15:43Z

Repository health decreased by 0.52% when pulling 529ceb2 on bolkedebruin:process_dag into ccfc4c8 on apache:master.

16 new problems were found (including 1 error and 10 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T17:27:30Z

Repository health decreased by 0.48% when pulling cf20a11 on bolkedebruin:process_dag into ccfc4c8 on apache:master.

15 new problems were found (including 1 error and 9 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T17:40:18Z

Repository health decreased by 0.48% when pulling dfd18bc on bolkedebruin:process_dag into ccfc4c8 on apache:master.

15 new problems were found (including 1 error and 9 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T19:06:32Z

Repository health decreased by 0.48% when pulling aad7554 on bolkedebruin:process_dag into ccfc4c8 on apache:master.

15 new problems were found (including 1 error and 9 code smells).
3 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T19:37:33Z

Repository health decreased by 0.46% when pulling 696063b on bolkedebruin:process_dag into 88f895a on apache:master.

15 new problems were found (including 1 error and 9 code smells).
6 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-22T19:42:55Z

Repository health decreased by 0.40% when pulling 3f97b2b on bolkedebruin:process_dag into 88f895a on apache:master.

13 new problems were found (including 1 error and 8 code smells).
7 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-23T08:56:42Z

Repository health decreased by 0.33% when pulling ff7ebba on bolkedebruin:process_dag into 88f895a on apache:master.

13 new problems were found (including 0 errors and 8 code smells).
7 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-23T08:56:54Z

Repository health decreased by 0.33% when pulling ff7ebba on bolkedebruin:process_dag into 88f895a on apache:master.

13 new problems were found (including 0 errors and 8 code smells).
7 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-23T09:50:07Z

Repository health decreased by 0.25% when pulling 68ec4dc on bolkedebruin:process_dag into 88f895a on apache:master.

13 new problems were found (including 0 errors and 6 code smells).
7 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-23T09:58:38Z

Repository health decreased by 0.25% when pulling 5a1fed9 on bolkedebruin:process_dag into 88f895a on apache:master.

13 new problems were found (including 0 errors and 6 code smells).
7 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-23T10:06:27Z

Repository health decreased by 0.34% when pulling a774a2e on bolkedebruin:process_dag into 88f895a on apache:master.

15 new problems were found (including 0 errors and 8 code smells).
7 problems were fixed (including 0 errors and 0 code smells).

landscape-bot · 2016-05-23T10:48:21Z

Repository health decreased by 0.33% when pulling d0ed971 on bolkedebruin:process_dag into 88f895a on apache:master.

14 new problems were found (including 0 errors and 8 code smells).
7 problems were fixed (including 0 errors and 0 code smells).

criccomini · 2016-06-01T14:22:49Z

Got it. Thanks! :)

mistercrunch · 2016-06-01T15:22:53Z

airflow/jobs.py

-                        or_(DagRun.external_trigger == False,
-                            # add % as a wildcard for the like query
-                            DagRun.run_id.like(DagRun.ID_PREFIX+'%')))
+                dag_id=dag.dag_id).filter(


my new preferred way to indent method chains is

qry = ( session.query(func.max(DagRun.execution_date)) .filter_by(dag_id = dag.dag_id) .filter(or_( DagRun.external_trigger == False, DagRun.run_id.like(DagRun.ID_PREFIX+'%') ) )

mistercrunch · 2016-06-01T15:29:25Z

@plypaul, how does this play with your upcoming PR? can you submit yours or share the branch you've been working with to get a sense on whether there's duplicated effort/overlap here?

bolkedebruin · 2016-06-01T20:51:53Z

@plypaul @mistercrunch I'm assuming #1559 is referred to. That one does raise some questions with me and @jlowin, but I dont mind working with @plypaul to make it work together. It should not be too difficult as #1559 is basically wrapping the areas I touched.

This patch addresses the following issues: get_active_runs was a getter that was also updating to the database. This patch refactors get_active_runs into two different functions that are part of DagRun. update_state update state of the dagrun based on the taskinstances of the dagrun. verify_integrity checks and updates the dag run based on if the dag contains new or missing tasks. Deadlock detection has been refactored to ensure that database does not get hit twice, in some circumstances this can reduce the time spent by 50%. process_dag has been refactored to use the functions of DagRun reducing complexity and reducing pressure on the database. In addition locking is now properly working under the assumption that the heartrate is longer than the time process_dag spends. Two new TaskInstance states have been introduced. "REMOVED" and "SCHEDULED". REMOVED will be set when taskinstances are encountered that do no exist anymore in the DAG. This happens when a DAG is changed (ie. a new version). The "REMOVED" state exists for lineage purposes. "SCHEDULED" is used when a Task that did not have a state before is sent to the executor. It is used by both the scheduler and backfills. This state almost removes the race condition that exists if using multiple schedulers: due to the fact UP_FOR_RETRY is being managed by the TaskInstance (I think that is the wrong place) is still exists for that state.

plypaul · 2016-06-01T23:36:38Z

Yeah, as @bolkedebruin mentioned, this should be mostly compatible with #1559. To clarify, what's the difference between the SCHEDULED and QUEUED states?

bolkedebruin · 2016-06-02T07:47:49Z

@plypaul SCHEDULED is set at handover from the scheduler to the executor only. It can only be set by the scheduler. It prevents race conditions. Ideally, it should always be the state of the task when it is sent to the executor. It isn't now due to UP_FOR_RETRY being handled by the TI instead of in the scheduler.

QUEUED indicates that a ti is waiting for a slot in the executor from either a pool or max parallelism. It is a bit ambiguous as it is managed in different locations. Some of the questions @jlowin and I have around #1559 are due to this ambiguity and your broader use of QUEUED.

criccomini · 2016-06-02T22:43:51Z

@bolkedebruin so is the order SCHEDULED -> QUEUED?

plypaul · 2016-06-03T02:28:17Z

Can you elaborate on the race condition that the SCHEDULED state helps to resolve?

criccomini · 2016-06-03T04:29:31Z

@plypaul via https://cwiki.apache.org/confluence/display/AIRFLOW/Scheduler+Basics

The scheduler processes tasks that have a state of NONE, QUEUED, and UP_FOR_RETRY. NONE is a newly created TaskInstance, QUEUED is a task that is waiting for a slot in an executor and UP_FOR_RETRY means a task that failed before but needs to be retried. At the moment the scheduler will not change the state of a Task. This creates the possibility of a race condition: if the executor cannot execute a task quickly enough its state will not be updated and it can get scheduled again. Due to the loop times being quite large in complex cases, this won't occur too often. As people try to limit scheduler loop times, or just prefer LocalExecutor more than the CeleryExecutor, they will run multiple schedulers at the same time, which increases the chances of this happening.

For this reason a "SCHEDULED" state is proposed in one of the PRs. This will not fully close the door to this race condition due to the fact that TaskInstance evaluate their own state in case of UP_FOR_RETRY (ie. checking if they should run now). Normally one would move this check to the scheduler was it not for that fact that we have backfills...

Basically, without a SCHEDULED state, the task is given to the executor. If the scheduler loops around fast enough again (or you have multiple schedulers running) before the executor updates the task's state, the scheduler will schedule it again, since the task's state hasn't changed (and it's still runnable).

plypaul · 2016-06-03T05:03:07Z

I share the same concern as @jlowin. Schedulers can be restarted at random points for deployment / instance replacements, and having orphaned tasks is something that needs to be addressed.

plypaul · 2016-06-03T05:32:09Z

Prior to adding the SCHEDULED state, I take it that you were seeing task instances getting queued multiple times in the executor? Since the code in TaskInstance.run() seems to (somewhat) mitigate this by detecting the RUNNING state, what sort of issues were you seeing?

bolkedebruin · 2016-06-03T06:36:36Z

@plypaul arbitrary means a SIGKILL in this case which leaves any process in an undetermined state. So the chances of it to occur are small (it also needs to coincide with the time between the change to scheduled and the executor receiving it at a kill of the scheduler). There is also a small chance the executor gets killed which would leave the task in limbo too.

To remove no 1. The with_for_update needs to wrap the "send" to the executor. The transaction to the db will then fail in case of a kill and the state will not be changed. To remove number 2 probably a kind of garbage collection needs to be added (at the time of checking for zombie tasks?).

Changes are slim and in general one should not go around kill -9 processes.

bolkedebruin · 2016-06-03T06:43:37Z

On the occurrence of the race condition please see the comment in the executor by @jlowin. As said when people get more complex environments and start running two local executors for example, the chances for this to occur would be there for every scheduler loop so for it to occur at a certain point in time would be 99% certain. People might just not have seen it.

plypaul · 2016-06-08T04:30:51Z

The condition where task instances can be orphaned doesn't require SIGKILL - if the scheduler is stopped by SIGTERM after the task is sent to the executor but before the task is run, all those task instances will be left in the SCHEDULED state, right?

This could be a problem in our setup where there can be 1000's of task instances waiting for a slot in the executor to run. That queue can take hours to clear, and if the scheduler is restarted at any point in that window, we'll have a bunch of orphaned tasks.

We also use the CeleryExecutor, which relies on an external queue mechanism. The queue can be cleared (for example, the Redis instance dies or needs to be restarted), and when this happens, we'll also have a bunch of orphaned tasks.

Overall, it would be great for Airflow to be resilient to restarts and failures. Machines die and services restart all the time, and the oncall who wakes up needs to have simple remediation procedures for handling the failure. Without a way to recover from these cases, it would make the oncall's life much more difficult.

bolkedebruin · 2016-06-08T05:18:36Z

@plypaul in case of celery that is a really short time frame: it gets to scheduled at the moment of sending it to the executor (thus celery). So if you get in between that moment yes you can get a orphaned task at the moment although chances imho are very slim. This can be further eliminated by surrounding the "send to executor" with the "for_update_block" this way the record wont get updated if sending to the executor fails.

In addition the executor should set a state when it picks up the task. This way you can do a bit of garbage collection of tasks that are in a certain state without a heartbeat for some time.

plypaul · 2016-06-08T06:12:31Z

So this was my understanding of how the state changes with this PR:

Scheduler processes DAGs, and if a task instance needs to be run, it is created in the ORM in the SCHEDULED state and added to the queue of tasks that should be sent to the executor. Task instances already in the SCHEDULED state do not get added to the queue.
Once all the DAG files have been processed, task instances are examined and sent to the executor.
The executor runs the task. When the task runs, the state of the task changes from SCHEDULED to RUNNING.

When there are many DAGs, step 1 can take a long time. In our case, it's about 6 minutes. If the scheduler exits between steps 1 and 2, there will be orphaned tasks and the 6 minute window is fairly large.

Also, the state of the task does not change when it's sent to the executor. The state of the task only changes when the task is actually run. With the CeleryExecutor, the task would stay in the SCHEDULED state until a worker becomes free, picks the task from the Redis queue, and runs it. On busy moments, the task can be in either the internal executor queue or the Redis queue for many minutes. If the scheduler restarts while the task is in the internal executor queue, there will be orphaned tasks. Likewise, if the Redis queue is cleared, there will be orphaned tasks.

bolkedebruin · 2016-06-08T06:27:24Z

Ah so yes like I mentioned we probably should move the set to scheduled state to to moment it is really sent to the executor and indeed not at evaluation time.

And yes at the moment the executor does not set a state but it should.

I'll have a look at #1 as that is the biggest issue and think about a garbage collector.

bolkedebruin · 2016-06-08T06:46:20Z

Btw the issue with the scheduler can/should only occur on a SIGKILL a sigterm allows for clean termination (which might need some work)

plypaul · 2016-06-08T22:52:27Z

Considering the operational issues with the scheduled state, it seems like the garbage collector is needed. Why don't we go back to the previous state logic, and put then in the state change + garbage collector together?

bolkedebruin · 2016-06-09T06:26:06Z

Ha that sounds like it is not needed (reverting - working together is fine :) ). Applying the previous state logic would just mean also evaluating the scheduled status and and re-scheduling those taskinstances everytime and let the taskinstance at run time figure out what to do. This is what the previous scheduler did with"none" states. That's kind of a one-liner.

What I would suggest is to move the "set to scheduled state" to when it the taskinstance is sent to the executor. This resolves the orphaning when the scheduler is killed and is a small change.

For the garbage collector we can add a time stamp last_updated to the task_instance which will get set everytime the record is updated (or a state change happens). The scheduler can then do a simple sweep by sql statement to set the state to "none" or to a new "reschedule" state by comparing it to an arbitrary timeout value. This would be orphaning down the line and would be future proof. Also a one liner.

Or even better we could ask the executor which tasks it knows about and compare that to the list of scheduled tasks. If they are not there reschedule them. This is a little bit more work and might need to be combined with the above one as in the past the reporting by the executor was not really trustworthy.

What do you think? Over the weekend I can have a patch for both I think.

plypaul · 2016-06-09T08:03:21Z

Changing the state when the task instance when the executor gets it reduces the window, but there are still cases where we can have a bunch of orphaned tasks. Hence, we want to try for a more robust solution.

As you point out, having the scheduler check the executor to see if a particular task instance has been submitted already and only queue if it hasn't been queued seems like a simple solution to the original problem. With that solution, there wouldn't need to be an additional timeout, right?

bolkedebruin · 2016-06-09T10:24:31Z

You are right on the possibility of orphaned tasks, but please approach them as two separate windows. Moving the "setting schedule state" to when it gets sent to the executor closes one as the DB safe guards against a kill of the scheduler by not committing the transaction. The second one is the time between the handover from the executor to the worker by means of the MQ (and then after not setting a different state).

Indeed asking the executor for the information should work. The timeout would just safe guard defensively against the executor not reporting back correctly. It 'lost' tasks in the past therefore @jlowin reverted some logic in process_queued_tasks before the release of 1.7.1 .

So yes I think we are on the right track :). Let's see if asking the executor solves the issue, but I do think scanning for orphaned tasks might be nice to have in also to also safe guard against any future changes. Ie. the scan has a holistic view and asking the executor it is only an particular view.

bolkedebruin · 2016-06-09T11:07:14Z

@plypaul I did a first iteration in #1581 (untested) lets continue the discussing there.

plypaul · 2016-06-29T04:09:50Z

airflow/models.py

+            state=State.unfinished(),
+            session=session
+        )
+        none_depends_on_past = all(t.task.depends_on_past for t in unfinished_tasks)


Should this be not t.task.depends_on_past?

bolkedebruin force-pushed the process_dag branch 2 times, most recently from dab8009 to 320285b Compare May 19, 2016 09:21

criccomini reviewed May 19, 2016
View reviewed changes

bolkedebruin force-pushed the process_dag branch from dcc3955 to 1f25928 Compare May 20, 2016 11:55

bolkedebruin mentioned this pull request May 21, 2016

AIRFLOW-124 Implement create_dagrun #1506

Merged

bolkedebruin changed the title ~~AIRFLOW-128 Refactor get_active_runs into DagRun and reduce roundtrips to database in process_dag~~ AIRFLOW-128 Optimize and refactor process_dag May 22, 2016

bolkedebruin force-pushed the process_dag branch from aad7554 to 696063b Compare May 22, 2016 19:35

bolkedebruin force-pushed the process_dag branch from 696063b to 3f97b2b Compare May 22, 2016 19:39

bolkedebruin force-pushed the process_dag branch 2 times, most recently from 825866c to ff7ebba Compare May 23, 2016 08:54

bolkedebruin force-pushed the process_dag branch from a774a2e to d0ed971 Compare May 23, 2016 10:45

mistercrunch reviewed Jun 1, 2016
View reviewed changes

bolkedebruin force-pushed the process_dag branch from d452b46 to b18c995 Compare June 1, 2016 20:58

asfgit merged commit b18c995 into apache:master Jun 1, 2016

bolkedebruin mentioned this pull request Jun 9, 2016

[AIRFLOW-224] Collect orphaned tasks and reschedule them #1581

Merged

plypaul reviewed Jun 29, 2016
View reviewed changes

This was referenced Mar 27, 2020

DagRuns are marked as failed as soon as one task fails #7939

Closed

DagRuns are marked as failed as soon as one task fails #7940

Closed

		@@ -1061,6 +1060,7 @@ def are_dependencies_met(

		task = self.task

		logging.info("Checkpoint A")

AIRFLOW-128 Optimize and refactor process_dag #1514

AIRFLOW-128 Optimize and refactor process_dag #1514

Conversation

bolkedebruin commented May 18, 2016 • edited Loading

criccomini May 19, 2016

Choose a reason for hiding this comment

bolkedebruin May 19, 2016

Choose a reason for hiding this comment

landscape-bot commented May 20, 2016

landscape-bot commented May 21, 2016

landscape-bot commented May 21, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 22, 2016

landscape-bot commented May 23, 2016

landscape-bot commented May 23, 2016

landscape-bot commented May 23, 2016

landscape-bot commented May 23, 2016

landscape-bot commented May 23, 2016

landscape-bot commented May 23, 2016

criccomini commented Jun 1, 2016

mistercrunch Jun 1, 2016

Choose a reason for hiding this comment

mistercrunch commented Jun 1, 2016

bolkedebruin commented Jun 1, 2016 • edited Loading

plypaul commented Jun 1, 2016

bolkedebruin commented Jun 2, 2016

criccomini commented Jun 2, 2016

plypaul commented Jun 3, 2016

criccomini commented Jun 3, 2016 • edited Loading

plypaul commented Jun 3, 2016

plypaul commented Jun 3, 2016

bolkedebruin commented Jun 3, 2016

bolkedebruin commented Jun 3, 2016 • edited Loading

plypaul commented Jun 8, 2016

bolkedebruin commented Jun 8, 2016

plypaul commented Jun 8, 2016

bolkedebruin commented Jun 8, 2016

bolkedebruin commented Jun 8, 2016

plypaul commented Jun 8, 2016 • edited Loading

bolkedebruin commented Jun 9, 2016

plypaul commented Jun 9, 2016 • edited Loading

bolkedebruin commented Jun 9, 2016 • edited Loading

bolkedebruin commented Jun 9, 2016

plypaul Jun 29, 2016

Choose a reason for hiding this comment

bolkedebruin commented May 18, 2016 •

edited

Loading

bolkedebruin commented Jun 1, 2016 •

edited

Loading

criccomini commented Jun 3, 2016 •

edited

Loading

bolkedebruin commented Jun 3, 2016 •

edited

Loading

plypaul commented Jun 8, 2016 •

edited

Loading

plypaul commented Jun 9, 2016 •

edited

Loading

bolkedebruin commented Jun 9, 2016 •

edited

Loading