[AIRFLOW-56] Airflow's scheduler can "lose" queued tasks #1378

Merged: 2 commits merged into apache:master from jlowin:queued-tasks on May 9, 2016

Conversation

@jlowin (Member) commented Apr 13, 2016

When the Scheduler is run with --num-runs, there can be multiple Schedulers and Executors all trying to run tasks. For queued tasks, the Scheduler previously only tried to run tasks that it itself had queued, but that doesn't work if the Scheduler is restarting. This PR reverts that behavior and adds two types of "best effort" protections: before running a TI, executors check whether it is already running, and before ending, executors call sync() one last time.

Closes https://issues.apache.org/jira/browse/AIRFLOW-56
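
For readers skimming the thread, here is a minimal, self-contained sketch of the two "best effort" protections described above. It is illustrative only: the classes and names (BestEffortExecutor, FakeTaskInstance, queue_command) are invented for this example and are not Airflow's real API. In the actual change, the first check lands in the executor heartbeat (see the base_executor diff further down) and the second is a final sync() before the executor ends.

# Minimal sketch, not Airflow's code: (1) before launching a queued task
# instance, re-check whether another job already started it; (2) call sync()
# one last time before the executor shuts down.

class FakeTaskInstance:
    """Stand-in for Airflow's TaskInstance; just enough for the sketch."""
    def __init__(self, key, state="queued"):
        self.key = key
        self.state = state

    def refresh_from_db(self):
        # In Airflow this would re-read the current state from the metadata DB.
        pass


class BestEffortExecutor:
    def __init__(self):
        self.queued_tasks = {}  # key -> (command, queue, task_instance)
        self.running = {}       # key -> command

    def queue_command(self, ti, command, queue=None):
        self.queued_tasks[ti.key] = (command, queue, ti)

    def heartbeat(self):
        for key, (command, queue, ti) in list(self.queued_tasks.items()):
            del self.queued_tasks[key]
            ti.refresh_from_db()
            if ti.state == "running":
                # Protection 1: another scheduler/executor already started it.
                continue
            self.running[key] = command
            self.execute_async(key, command, queue)

    def execute_async(self, key, command, queue=None):
        print("launching %s: %s" % (key, command))

    def sync(self):
        # In Airflow, sync() reconciles executor state with finished tasks.
        print("syncing executor state")

    def end(self):
        # Protection 2: one final sync so recently finished tasks are not
        # left behind when the scheduler exits (e.g. under --num-runs).
        self.sync()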

@jlowin (Member, Author) commented Apr 13, 2016

@abridgett please give this a shot regarding #1342
@syvineckruyk it's very important for you to have a look at this -- I had to partially undo one of the fixes that kept SchedulerJob from preying on BackfillJob's queued tasks. That fix made it so SchedulerJob only dealt with its own tasks, but when SchedulerJob runs with --num-runs, it keeps restarting, and the new scheduler was ignoring tasks queued by the old one.

@landscape-bot
Code Health: Repository health decreased by 0.03% when pulling 8fa7fd4 on jlowin:queued-tasks into 5d15d68 on airbnb:master.

@coveralls
Coverage Status: Coverage decreased (-0.02%) to 67.979% when pulling 8fa7fd487c59673c1163dee1b0631f4d49c6809e on jlowin:queued-tasks into 5d15d68 on airbnb:master.

@abridgett (Contributor)

It seems to have worked today; I will keep an eye on it. Many, many thanks @jlowin

@abridgett (Contributor)

It worked today as well, FYI (last time it was effectively updated in the middle of a run, so this was the first real test).

@jlowin (Member, Author) commented Apr 15, 2016

Great. @syvineckruyk any luck?

@syvineckruyk (Contributor)

@jlowin I won't be able to test until tomorrow, sorry about that. I have a bunch of test DAGs ready to go, so testing will be quick when I start.

The sample dags from #1350 and #1225 should be a good start though.

@jlowin (Member, Author) commented Apr 15, 2016

No rush, just wanted to make sure you had seen the issue.

@jlowin (Member, Author) commented Apr 21, 2016

@syvineckruyk just want to ping you on this; we may need to merge it to master shortly.

@landscape-bot
Code Health: Repository health decreased by 0.01% when pulling f592c65 on jlowin:queued-tasks into c1f485f on airbnb:master.

@coveralls
Coverage Status: Coverage decreased (-0.03%) to 66.996% when pulling f592c65fb6b8d974ddb1ce477cb7aefc5c0fff5b on jlowin:queued-tasks into c1f485f on airbnb:master.

@syvineckruyk (Contributor)

@jlowin gotcha ... starting to test now.

@syvineckruyk (Contributor)

@jlowin so I am running your "queued-tasks" branch and seeing some weird issues around DAG runs. I don't have an exact handle on it yet: failed DAG runs are getting created, and I am now seeing DAG runs created for 5 days prior to the start date.

@jlowin (Member, Author) commented Apr 22, 2016

I think the 5-days-before thing is a known issue with the scheduler logic: basically, if you have a non-scheduled execution, the scheduler picks up 5 days before it. We will have to deal with that separately.
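
Purely to illustrate the fallback described here (the exact rule in the scheduler may differ; the function below is a hypothetical stand-in, not Airflow's code): when there is no prior scheduled run to anchor on, the first execution date is picked roughly five days in the past.

from datetime import datetime, timedelta

def pick_first_execution_date(last_scheduled_run, now=None):
    # Hypothetical sketch of the fallback jlowin mentions: with no previous
    # run to anchor on, start about five days before "now".
    now = now or datetime.utcnow()
    if last_scheduled_run is not None:
        return last_scheduled_run
    return now - timedelta(days=5)

print(pick_first_execution_date(None, now=datetime(2016, 4, 22)))  # 2016-04-17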

@syvineckruyk (Contributor)

@jlowin cool. I found the cause of the failing DAG runs; it was an unrelated issue. Continuing to run this version today... so far so good.

@syvineckruyk (Contributor) commented Apr 24, 2016

@jlowin so far on this branch I have found a couple of issues. depends_on_past=True does not appear to be respected, at least on SDOs (SubDagOperators); I don't know about other tasks. I have SDOs in a failed state, and at the next interval the following task instance of the failed SDO is launched anyway.

This may be helpful: the job was behaving as expected until a few minutes ago, when we crossed over to April 24th UTC. That is when the task instances that should not have been kicked off began running. The schedule of those tasks was not midnight; it was actually 12 hours later, at 12pm. So something about the new UTC day seems to be involved.

I have also observed several instances of subtasks completing successfully, but the SDO never notices and stays in a running state indefinitely.

@syvineckruyk (Contributor)

@jlowin still trying to create a generic DAG to expose these bugs, but wanted to let you know that the SDO operators that do not get updated appear to be the ones in a DAG run where another task has failed:

  1. Task Fails
  2. Dag Run Fails
  3. Subtasks complete
  4. Status of SDOs remains running

@jlowin (Member, Author) commented Apr 24, 2016

Excellent, thanks so much for the feedback @syvineckruyk. I'm trying to figure out what in this code could have led to that behavior. I think it might be the extra sync() command I gave each executor. The idea was to avoid orphaned tasks by waiting for all tasks to complete -- but maybe there's a situation where the parent DAG has already quit before the orphaned tasks do... thinking further.

@syvineckruyk (Contributor)

@jlowin just a question: when I was running the last version you gave me, things were running well. What was the issue that required this area to need more work?

commit fd9388c0c27c2e469f4eb0362800323a08b76d68
Merge: 58abca2 b2844af
Author: bolkedebruin <bolkedebruin@users.noreply.github.com>
Date:   Tue Apr 5 22:24:15 2016 +0200

    Merge pull request #1290 from jlowin/subdag-backfill-status

    Make sure backfill deadlocks raise errors

commit b2844af020cb5a470bd83ead09ddb121923084ca
Author: jlowin <jlowin@users.noreply.github.com>
Date:   Mon Apr 4 18:59:13 2016 -0400

    Fix infinite retries with pools, with test

    Addresses the issue raised in #1299

@jlowin (Member, Author) commented Apr 24, 2016

The issue you were experiencing was because Scheduler and Backfill were both trying to queue your SubDag tasks. The fix I put in place was that Scheduler kept track of tasks it submitted to the executor, and only tried to run those tasks if they became queued. That way, the queued subdag tasks were left alone and Backfill handled them without interference. The problem is that folks often run Scheduler with the --num_runs parameter, which makes Scheduler restart periodically. When Scheduler restarts, it has no way to know what tasks it submitted before it quit, and so those tasks remain queued forever. The bug could occur less commonly if Scheduler were to quit for any other reason while tasks were queued.
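
To make the failure mode concrete, here is a tiny illustrative sketch; TinyScheduler and its methods are invented for this example and are not Airflow's SchedulerJob. It shows why an in-memory record of "tasks I queued" breaks when the scheduler restarts under --num_runs:

class TinyScheduler:
    def __init__(self):
        self.submitted = set()  # keys of TIs this scheduler itself queued

    def queue_task(self, key, queued_tis):
        self.submitted.add(key)
        queued_tis.add(key)

    def process_queued(self, queued_tis):
        # Old behavior: only handle queued tasks that this scheduler queued.
        # After a restart the `submitted` set is empty, so everything the
        # previous scheduler queued is ignored and stays queued forever.
        return [key for key in queued_tis if key in self.submitted]


queued = set()                                # stands in for queued TIs in the DB
old = TinyScheduler()
old.queue_task("dag.task.2016-04-24", queued)

new = TinyScheduler()                         # scheduler restarted by --num_runs
assert new.process_queued(queued) == []       # the queued task is "lost"
assert old.process_queued(queued) == ["dag.task.2016-04-24"]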

@landscape-bot
Code Health: Repository health decreased by 0.01% when pulling 88a211d on jlowin:queued-tasks into c1f485f on airbnb:master.

@coveralls commented Apr 26, 2016
Coverage Status: Coverage increased (+1.7%) to 68.704% when pulling 88a211d66894b8d02b8923378a97a3e5cce50373 on jlowin:queued-tasks into c1f485f on airbnb:master.

@syvineckruyk (Contributor)

@jlowin were you able to identify any of the issues I referred to above?

Thanks

@jlowin (Member, Author) commented Apr 26, 2016

@syvineckruyk not directly, but I removed some of the code that I suspect is causing it.

@jlowin (Member, Author) commented Apr 26, 2016

As a matter of fact, it looks like Travis is now failing with queued tasks -- so I think I need to leave that code in. More thinking...

@landscape-bot
Code Health: Repository health decreased by 0.02% when pulling 12b7b6a on jlowin:queued-tasks into c1f485f on airbnb:master.

@coveralls commented Apr 26, 2016
Coverage Status: Coverage increased (+0.04%) to 67.066% when pulling 12b7b6a116a0eadcf3101b636ffbec6239437870 on jlowin:queued-tasks into c1f485f on airbnb:master.

@bolkedebruin changed the title from "Handle queued tasks from multiple jobs/executors" to "[AIRFLOW-56] Airflow's scheduler can "loose" queued tasks" on May 6, 2016
@bolkedebruin added the kind:bug and Job/Executor labels and removed the Missing JIRA Issue label on May 6, 2016
@jlowin changed the title to "[AIRFLOW-56] Airflow's scheduler can "lose" queued tasks" on May 6, 2016
@@ -1159,7 +1167,7 @@ def run(
         self.pool = pool or task.pool
         self.test_mode = test_mode
         self.force = force
-        self.refresh_from_db()
+        self.refresh_from_db(lock_for_update=True)
Review comment (Contributor): Please provide the session as well, to make sure the commit happens in the same session.

Reply (@jlowin, Member, Author): Done.
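
A short sketch of the pattern being requested, using a stand-in model rather than Airflow's real TaskInstance (the class, columns, and signatures below are assumptions for illustration): the caller passes one SQLAlchemy session through, so the row lock taken by refresh_from_db(lock_for_update=True) and the later commit belong to the same transaction.

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class FakeTI(Base):
    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)
    state = Column(String)

    def refresh_from_db(self, session, lock_for_update=False):
        query = session.query(FakeTI).filter(FakeTI.id == self.id)
        if lock_for_update:
            # Row lock held by the caller's session/transaction.
            # (SQLite ignores FOR UPDATE; shown only for illustration.)
            query = query.with_for_update()
        self.state = query.one().state

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add(FakeTI(id=1, state="queued"))
session.commit()

ti = session.get(FakeTI, 1)
ti.refresh_from_db(session, lock_for_update=True)  # lock + read in this transaction
ti.state = "running"
session.commit()                                   # the same session commits and releases the lock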

@bolkedebruin (Contributor) commented May 9, 2016

I think that here: https://github.com/jlowin/airflow/blob/queued-tasks/airflow/jobs.py#L650 we need an extra session.commit() after the delete.

Update: we can actually lose a commit here.
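
As a sketch only, reusing the FakeTI stand-in from the example above rather than Airflow's jobs.py, the extra commit being discussed would look like this: commit immediately after deleting the stale queued rows so the delete cannot be lost to a later rollback or session reset.

def prune_stale_queued_tis(session, stale_ids):
    # Delete queued task-instance rows that no longer correspond to real work.
    (session.query(FakeTI)
            .filter(FakeTI.id.in_(stale_ids))
            .delete(synchronize_session=False))
    # The extra commit discussed above: persist the delete right away instead
    # of relying on a later commit that may never happen.
    session.commit()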

@@ -86,10 +84,23 @@ def heartbeat(self):
            key=lambda x: x[1][1],
            reverse=True)
        for i in range(min((open_slots, len(self.queued_tasks)))):
-           key, (command, priority, queue) = sorted_queue.pop(0)
-           self.running[key] = command
+           key, (command, priority, queue, ti) = sorted_queue.pop(0)
Review comment (Contributor): _ instead of priority (looks like it's unused).

@bolkedebruin (Contributor): +1!

jlowin added 2 commits May 9, 2016

Commit 1: When the Scheduler is run with `--num-runs`, there can be multiple Schedulers and Executors all trying to run tasks. For queued tasks, the Scheduler previously only tried to run tasks that it itself had queued, but that doesn't work if the Scheduler is restarting. This commit reverts that behavior and adds two types of "best effort" protections: before running a TI, executors check whether it is already running, and before ending, executors call sync() one last time.

Commit 2: The scheduler can encounter a queued task twice before the task actually starts to run -- this locks the task and avoids that condition.
@aoen (Contributor) commented May 9, 2016

LGTM

@asfgit merged commit c1aa93f into apache:master on May 9, 2016
asfgit pushed a commit that referenced this pull request May 9, 2016
@griffinqiu

We have been blocked on this issue in 1.7.0.
Do you have any suggestions for fixing it? Hot patch 1.7.0, or upgrade to 1.7.1? Is 1.7.1 stable enough?
1.6.2 seems to have the same issue. ;(

Thank you very much

@bolkedebruin (Contributor)

1.7.1 is in much better condition than 1.7.0. So I would definitely use that one.

@criccomini (Contributor)

To be clear, 1.7.1.2 is the version you should be using. This version is quite stable.

@griffinqiu

I have tested 1.7.1.2.
It seems there are still many issues with SubDAGs and pools, and the SubDAGs also waste many resources on Celery.

@criccomini (Contributor)

@griffinqiu can you open up JIRAs for the known issues?
