[AIRFLOW-69] Make backfill use dagruns #1667
Conversation
Force-pushed from 5bc21c9 to 14f88b1.
# make sure backfills are also considered
last_run = dag.get_last_dagrun(session=session)
if last_run:
    if next_run_date:
Try this:
if last_run and next_run_date:
    ...
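Applied to the diff above, the suggestion collapses the nested checks into a single guard. A minimal sketch; next_run_date and the surrounding scheduler code are assumed from context:

# make sure backfills are also considered
last_run = dag.get_last_dagrun(session=session)
if last_run and next_run_date:
    # schedule strictly after the most recent run, backfill or not
    ...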
Force-pushed from 14f88b1 to 4dc95fe.
Current coverage is 64.72% (diff: 82.14%)

@@            master   #1667    diff @@
========================================
  Files          126     127      +1
  Lines         9407    9481     +74
  Methods          0       0
  Messages         0       0
  Branches         0       0
========================================
+ Hits          6059    6137     +78
+ Misses        3348    3344      -4
  Partials         0       0
tasks_to_run[ti.key] = ti
session.merge(ti)
session.commit()
active_dag_runs = []
Should the backfill job honor a DAG's max_active_runs? For a subdag it wouldn't make much sense to honor it, since the subdag is part of a DAG. Would it matter for backfilling DAGs?
I think backfill should honor max_active_runs, because limiting active dag runs is one way of limiting concurrency against external resources (pools being another). My dependency-refactor PR already makes this the case. For subdags, the max dag runs should be unlimited by default.
Ah, that makes sense. I'm not sure we know at run time whether we are inside a SubDagOperator or not, but let's find out. Any suggestions on how to report this back to the user? Just skip it?
On a side note @aoen, I can probably make backfill use the scheduler quite easily. However, the side effect will be that backfills don't necessarily run at their point of origin. Is that a concern for you guys?
Having the subdag operator pass a flag to the backfill command would be one way to do it. Another would be to make the subdag operator mutate the subdag's max active dag runs (a sketch of that option follows this comment). Personally I think subdags should be first-class citizens in airflow (as opposed to an operator that shells out to a separate airflow CLI command), but that is for another time.
I think making backfill use the scheduler is definitely where we want to end up. One problem I can see for existing airflow use cases is that it would make it hard to iterate on/test dags, since you could not pass a local copy of a dag into a backfill. Once that problem is solved I can't think of any other blockers (at least for Airbnb). Maybe the backfill command could take an optional DAG folder argument which it would somehow ship to the workers; actually, now that I think about it, this becomes easy in the upcoming git time-machine model (users could push their dags to a branch and then specify the branch in the backfill command).
Note that this end-goal can already be achieved by using the same executor in the airflow.cfg on the backfill machine as the scheduler uses.
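Of the two options above, the mutation approach could be a very small change inside the operator. A hypothetical sketch, not the PR's code; the BackfillJob constructor arguments, the import paths, and None meaning "unlimited" are all assumptions:

from airflow.jobs import BackfillJob      # import path assumed
from airflow.models import BaseOperator   # import path assumed

class SubDagOperator(BaseOperator):
    # self.subdag is assumed to be set in __init__, as in the real operator
    def execute(self, context):
        # Hypothetical: lift the cap so pools and task concurrency,
        # not max_active_runs, govern the subdag's backfill.
        self.subdag.max_active_runs = None  # None assumed to mean "unlimited"
        BackfillJob(
            dag=self.subdag,
            start_date=context['execution_date'],
            end_date=context['execution_date'],
        ).run()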
@aoen the 'git time-machine model' is very interesting! 👍
@bolkedebruin made 2 comments, everything else lgtm :)
Force-pushed from 9afdbc5 to c5ef2ae.
@bolkedebruin nice catch on filtering out 'backfill_' jobs in the scheduler. A minor concern I have is that although max_active_runs is respected by backfill, backfill itself can still create more active dagruns than max_active_runs. @aoen mentioned that the end goal is to make backfill use the scheduler. How I interpreted it: instead of backfill doing everything locally and independently (the way it is now), BackfillJob would only create dagruns, and SchedulerJob (i.e. the scheduler) would pick them up from the db along with the other scheduled dagruns and process all of them in a consistent manner. I'm not sure if this interpretation is correct, but if it is, then it may be beyond the scope of this PR anyway. So I am content =P
@jgao54 good point. Enforcing the max amount of dag runs in backfills creates a bit of an issue, though: since the BackfillJob waits for all runs to finish before it exits, should it also wait for active_dag_runs < max_active_runs, which can take arbitrarily long? BackfillJobs should indeed be scheduled, but although the change to the BackfillJob logic is relatively small, all the tests need to be updated. That will take a bit longer and is outside the scope of this PR.
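For illustration, the wait being debated might look like the loop below. A hypothetical sketch, not the PR's code; the names active_dag_runs and max_active_runs follow this thread, and the State import path is from later Airflow versions:

import time
from airflow.utils.state import State  # import path assumed

def wait_for_run_slot(dag, active_dag_runs, session):
    # Hypothetical: block until the number of running dag runs drops
    # below the DAG's cap. This is the potentially "arbitrarily long"
    # wait discussed above.
    while dag.max_active_runs and len(active_dag_runs) >= dag.max_active_runs:
        time.sleep(5)
        for run in active_dag_runs:
            session.refresh(run)  # pick up state changes from the db
        active_dag_runs[:] = [
            run for run in active_dag_runs if run.state == State.RUNNING
        ]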
Force-pushed from c5ef2ae to ac5d430.
@bolkedebruin yeah, agreed that without using the scheduler, starvation is theoretically possible with backfills. I personally think that is an okay temporary sacrifice in order to respect max_active_runs, since in most cases a dag only has a single dagrun running, if any (and the default config for max_active_runs is 16).
@bolkedebruin just coming back to this PR again... since the existing backfill doesn't respect max active runs anyway, I am convinced that it's more valuable to merge this commit for now and refactor the scheduler in a separate PR.
Force-pushed from 0704498 to 9606066.
@plypaul @aoen updated to work with @plypaul's patches; tests fully pass here: https://travis-ci.org/bolkedebruin/airflow Can you please verify? Thanks!
tasks_to_run.pop(key)
session.commit()
continue
# set requiered transient field
Nit: typo
Done.
One thing I realized with the backfill is that if we create DAG runs, the current scheduler will pick up the tasks and send them to the executor in this call: https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L892 To avoid that, we could check the state of the DAG run in that function and make sure it's not a backfill.
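One way to do that check builds on the 'backfill_' run-id prefix mentioned earlier in this thread. A minimal hypothetical sketch; the BACKFILL_PREFIX constant, the schedulable_runs helper, and the query shape are assumptions, not the actual jobs.py code:

from airflow.models import DagRun       # import path assumed
from airflow.utils.state import State   # import path assumed

BACKFILL_PREFIX = 'backfill_'  # assumed run_id prefix for backfill runs

def schedulable_runs(dag, session):
    # Skip dag runs owned by a backfill: the BackfillJob manages their
    # task instances itself, so the scheduler must not also send them
    # to the executor.
    runs = session.query(DagRun).filter(
        DagRun.dag_id == dag.dag_id,
        DagRun.state == State.RUNNING,
    ).all()
    return [r for r in runs if not r.run_id.startswith(BACKFILL_PREFIX)]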
Sorry to be joining the conversation so late. I've been testing the PR, and as long as I have a scheduler running it works great. However, when there is no scheduler running, I get weird behavior: the backfill quickly iterates through all requested DagRuns and claims they are successful, when in reality the DagRuns aren't being created (they don't seem to be in the database) and the TaskInstances aren't either. Here's the log output of backfilling one of the example DAGs without a scheduler running:
Update: maybe I'm doing something wrong, because now no matter what I do I get the above behavior (with the exception of the first DagRun, which executes its tasks correctly). Am I missing something?
output:
Now I feel like maybe I'm making a mistake somewhere?
Force-pushed from 9606066 to e5b57d5.
@plypaul I see. I need to look at that function, because it uses TaskInstances as a basis, which I would like to rework to a DagRun basis. To make it work now I would need to add a ti.get_dag_run method, which is not guaranteed to be unique. @jlowin fixed the issue, good catch; I will add a test for that. Can you retest?
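The method being proposed might look roughly like the helper below. A hypothetical sketch; as the comment notes, (dag_id, execution_date) need not identify a unique run, hence the arbitrary first():

from airflow.models import DagRun  # import path assumed

def get_dag_run(ti, session):
    # Hypothetical ti.get_dag_run: resolve the DagRun a task instance
    # belongs to. Backfills can overlap scheduled runs, so the match is
    # not guaranteed to be unique; this returns an arbitrary one.
    return (
        session.query(DagRun)
        .filter(
            DagRun.dag_id == ti.dag_id,
            DagRun.execution_date == ti.execution_date,
        )
        .first()
    )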
Force-pushed from ffdaace to 92f50fb.
Backfill jobs create taskinstances without any associated DagRuns. This creates consistency errors. This patch addresses that issue and also makes the scheduler backfill-aware: the scheduler makes sure to schedule new dag runs after the last dag run, including backfills, and it will not pick up any tasks that are part of a backfill, as those are considered to be managed by the backfill process. This can (and should) be changed once backfills run in a scheduled fashion. It doesn't deal with the remaining issue that backfills can be scheduled on top of existing dag runs, so that TaskInstances can point to multiple DagRuns.
Force-pushed from 92f50fb to 13c452f.
@bolkedebruin looking good now! I can't tell because of the rebase: what did you change to fix it?
@jlowin I considered the next run date instead of just the start date, and made sure the dag run state is checked on every dag run iteration.
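The per-iteration check could be as small as the loop below. A hedged sketch under the assumption that DagRun.update_state(session=...) is available, as in later Airflow versions; the surrounding backfill loop is illustrative:

from airflow.utils.state import State  # import path assumed

def check_active_runs(active_dag_runs, session):
    # Re-evaluate each run's state from its task instances on every
    # backfill iteration, so failures surface immediately rather than
    # only when the job exits.
    still_active = []
    for run in active_dag_runs:
        run.update_state(session=session)
        if run.state == State.RUNNING:
            still_active.append(run)
    return still_active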
@bolkedebruin I think this is in good shape. +1 from me. I accidentally bumped into the "max active runs" issue by testing the ...
@jlowin thanks. I would like to leave the concurrency bit for when I move the backfill to a version that uses the scheduler; that should simplify things. Can you merge it for me?
@bolkedebruin To clarify: it's possible to have multiple DAG runs for the same (dag_id, execution_date, start_date)?
@bolkedebruin merged!
Closes apache#1667 from bolkedebruin/backfill_dagrun
Dear Airflow Maintainers,
Please accept this PR that addresses the following issues:
Testing Done:
Backfill jobs create taskinstances without any associated DagRuns. This creates consistency errors. This patch addresses that issue and also makes the scheduler backfill-aware.
It doesn't deal with the remaining issue that backfills can be scheduled on top of existing dag runs, so that TaskInstances can point to multiple DagRuns (this is the case now as well).
@plypaul @jlowin @aoen please share your thoughts.