Backfill crashes with "KeyError: TaskInstanceKey" when task has retries #13322
Thanks for opening your first issue here! Be sure to follow the issue template!
Hey @sarvothaman - you need to provide way more information in order to successfully report an issue. The Airflow version is a bare minimum, but without logs, information about what you have done to debug and check what's wrong, what the conditions are, and how to reproduce the issue, there is no way anyone will do anything with it. I am closing it as invalid until you provide enough information to reproduce it, or at least to understand what's wrong.
@potiuk Sorry, I accidentally hit enter before entering all the info (with no way to delete?). In any case, I've added the details.
I was too fast then :). Sorry. I see very comprehensive information now :). Looks like an interesting one to take a look at. I believe this is something we are already aware of - there are some cases where try_number is wrongly calculated. @turbaszek and @ashb - I know you had discussions about a similar case; maybe it is related?
@sarvothaman do you by any chance use a sensor in the DAG (especially in reschedule mode)?
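(For readers following along: "reschedule mode" means the sensor releases its worker slot between pokes, and the task instance cycles through the up_for_reschedule state each time, which is exactly the code path being probed here. A minimal sketch of such a sensor; the DAG id and file path are made up for illustration:)

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

# Illustrative only: dag_id and filepath are hypothetical.
with DAG(dag_id="example_reschedule", start_date=datetime(2021, 1, 1)) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/ready",   # hypothetical path to wait for
        mode="reschedule",       # default is "poke"; reschedule frees the slot
        poke_interval=300,       # seconds between pokes
    )
```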
Future start date with backfill?
We hit this last night; I'll follow up later today with some more information on the environment and the conditions it happened under.
@leonsmith Looking forward to more information. Not sure we can do much with this issue as it stands.
I got the same error. Let me explain my workflow: I submitted the Airflow job with the DebugExecutor on my Mac and submitted it to Amazon EMR. I searched Google for relevant information but didn't find anything. I sincerely hope this problem will be taken seriously and resolved as soon as possible. Thanks a lot! Here is my error: Here is my code: default_args = { Job_Flow_Overrides = {
I am having the same issue here. The exception happens in backfill_job.py at `ti_status.running.pop(key)`: the key being popped is TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 7, 0, 0, tzinfo=Timezone('UTC')), try_number=2), and this is what I have in `ti_status.running`:

```
{
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 1, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-01 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 2, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-02 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 3, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-03 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 4, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-04 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 5, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-05 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 6, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-06 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 7, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-07 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 8, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-08 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 9, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-09 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 10, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-10 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 11, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-11 00:00:00+00:00 [failed]>,
TaskInstanceKey(dag_id='refactor', task_id='task-1', execution_date=datetime.datetime(2021, 5, 12, 0, 0, tzinfo=Timezone('UTC')), try_number=3): <TaskInstance: refactor.task-1 2021-05-12 00:00:00+00:00 [failed]>
}
```
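(The dump above makes the failure mode easy to see: the popped key carries try_number=2, while every key in the dict carries try_number=3, so `dict.pop` raises KeyError even though the same task instance is present. A minimal sketch of the mismatch in plain Python, mimicking Airflow's TaskInstanceKey, which is a named tuple and therefore hashes on all of its fields:)

```python
from collections import namedtuple
from datetime import datetime

# Mimics airflow.models.taskinstance.TaskInstanceKey: a named tuple,
# so equality and hashing include try_number.
TaskInstanceKey = namedtuple(
    "TaskInstanceKey", ["dag_id", "task_id", "execution_date", "try_number"]
)

running = {
    TaskInstanceKey("refactor", "task-1", datetime(2021, 5, 7), 3): "ti",
}

# The backfill job later rebuilds the key from the TaskInstance, but the
# try_number it computes is off by one after a retry/reschedule:
stale_key = TaskInstanceKey("refactor", "task-1", datetime(2021, 5, 7), 2)
running.pop(stale_key)  # raises KeyError, even though the task is in `running`
```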
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
This happened in Airflow 2.1.0 as well; there is a bug in the backfill logic. I have fixed it in my environment and will submit a PR for it later when I have time...
Hey guys! I'm facing a similar issue. I've found two cases where it happens: the first one is described in #17305, and the second one occurs when I run tests for tasks that have already been executed with any other executor (hence they already have an increased try number in the DB).
I didn't have much time to investigate it, so I might be completely wrong, but I think that it works now only thanks to
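(A likely reason keys drift like this, paraphrased from memory of Airflow 2.x's airflow/models/taskinstance.py and worth verifying against your version: TaskInstance.try_number is a computed property that reports the number of the *next* attempt except while the task is running, so a key built from the same TaskInstance can differ depending on when you ask:)

```python
# Paraphrased sketch of Airflow 2.x behaviour, not a verbatim copy:
# the stored column is _try_number; the public property shifts by one
# depending on state, so keys built at different moments can disagree.
class TaskInstanceSketch:
    def __init__(self):
        self._try_number = 0
        self.state = None

    @property
    def try_number(self):
        if self.state == "running":
            return self._try_number
        return self._try_number + 1

ti = TaskInstanceSketch()
key_before = ti.try_number   # 1: the upcoming attempt
ti.state = "running"
ti._try_number += 1          # incremented when the attempt actually starts
key_during = ti.try_number   # 1: same while running
ti.state = "up_for_retry"
key_after = ti.try_number    # 2: keys built now won't match earlier ones
```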
I seem to be getting the same error. In my case, I successfully ran
This issue still happens with apache-airflow==2.1.3 and 2.1.4. @potiuk
Could you please open a new issue with a reproducible case?
Hi guys! Just for those who are facing this issue when debugging a DAG with the DebugExecutor: if you change your start_date to use the function days_ago instead of passing a datetime, it seems to work fine (although I don't know the reasons behind this behaviour...):

```python
from airflow.utils.dates import days_ago
default_args = {
'owner': 'owner',
# 'start_date': datetime(2021, 10, 5, 7, 45, 0, 0, tzinfo=TZINFO),
'start_date': days_ago(1),
'email': ['blabla@email.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 2
}
```
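(A plausible explanation for this workaround, offered as a guess rather than a confirmed root cause: days_ago returns a timezone-aware UTC datetime that is always in the past, while a hand-written datetime is naive unless tzinfo is passed and may lie in the future, and both future start dates and try_number bookkeeping have been implicated earlier in this thread:)

```python
from datetime import datetime

from airflow.utils.dates import days_ago

aware = days_ago(1)                   # tz-aware (UTC) and always in the past
naive = datetime(2021, 10, 5, 7, 45)  # naive unless tzinfo is supplied

print(aware.tzinfo)  # e.g. Timezone('UTC')
print(naive.tzinfo)  # None
```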
…eschedue state (#17305)

The backfill job fails to run when there are tasks that run into the rescheduling state. The error log, as reported in issue #13322:

```
Traceback (most recent call last):
  File "/opt/conda/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/cli.py", line 89, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/cli/commands/dag_command.py", line 103, in dag_backfill
    dag.run(
  File "/opt/conda/lib/python3.8/site-packages/airflow/models/dag.py", line 1701, in run
    job.run()
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/session.py", line 65, in wrapper
    return func(*args, session=session, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 799, in _execute
    self._execute_for_run_dates(
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/session.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 722, in _execute_for_run_dates
    processed_dag_run_dates = self._process_backfill_task_instances(
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/session.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 620, in _process_backfill_task_instances
    self._update_counters(ti_status=ti_status)
  File "/opt/conda/lib/python3.8/site-packages/airflow/utils/session.py", line 65, in wrapper
    return func(*args, session=session, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/airflow/jobs/backfill_job.py", line 211, in _update_counters
    ti_status.running.pop(key)
KeyError: TaskInstanceKey(dag_id='dag_id', task_id='task_name', execution_date=datetime.datetime(2020, 12, 15, 0, 0, tzinfo=Timezone('UTC')), try_number=2)
```

The root cause is that the field `try_number` doesn't increase when the task runs into the rescheduling state, but there is still a reduce operation on `try_number`. Currently I can't think of a good unit test for it, so I'm only posting the code here to help anyone affected by it to solve the problem.
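(The commit body above names the mismatch but not the patch itself. One way to make the counter update tolerant of the off-by-one key, shown as an illustrative sketch rather than the actual upstream fix, is to fall back to a copy of the key with the try number reduced by one; TaskInstanceKey is a named tuple, so `_replace` works:)

```python
from typing import Any, Dict, Optional

def pop_running(running: Dict[Any, Any], key) -> Optional[Any]:
    """Illustrative sketch, not the actual Airflow patch: pop the exact
    key if present, otherwise retry with try_number reduced by one, since
    a rescheduled task's key can differ from the one recorded at
    submission time."""
    if key in running:
        return running.pop(key)
    reduced = key._replace(try_number=key.try_number - 1)
    return running.pop(reduced, None)
```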
Apache Airflow version: 2.0.0
Kubernetes version (if you are using kubernetes) (use `kubectl version`): No Kubernetes
Environment: Docker Python environment (3.8)
OS (`uname -a`): Linux b494b1048cf4 5.4.39-linuxkit #1 SMP Fri May 8 23:03:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

What happened: The backfill command crashes with the KeyError stack trace quoted in the commit message above.
From the webserver, it looks like after the second try the task actually finished successfully (the first time there was a network error).
Just before the error I also see this warning:

```
WARNING - TaskInstanceKey(dag_id='dag_id', task_id='task_name', execution_date=datetime.datetime(2020, 12, 15, 0, 0, tzinfo=Timezone('UTC')), try_number=2) state success not in running=dict_values([<TaskInstance: dag_id.task_name 2020-12-15 00:00:00+00:00 [queued]>])
```
This happens whenever a task has to retry. The subsequent tasks are not run, and the backfill command has to be re-run to continue.
What you expected to happen: The backfill command to continue to the next step.
How to reproduce it: Not sure. Create a DAG with a future start date and a task that fails on the first try but succeeds on the second, keep it turned off, and run a backfill command for a single past date. Command that was used:
```
airflow dags backfill dag_id -s 2020-12-15 -e 2020-12-15
```
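(The reporter is not sure of exact reproduction steps, but the description translates into a DAG along these lines. Everything here is hypothetical and only mirrors the report: the dag_id, task_id, dates, and the simulated first-try failure are made up, not taken from the reporter's code:)

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def flaky(ti):
    # ti.try_number is 1 on the first attempt; fail once, then succeed,
    # simulating the transient network error described above.
    if ti.try_number == 1:
        raise RuntimeError("simulated network error")

with DAG(
    dag_id="dag_id",
    start_date=datetime(2022, 1, 1),  # future relative to the backfill date
    schedule_interval="@daily",
) as dag:
    PythonOperator(task_id="task_name", python_callable=flaky, retries=1)
```

Keeping the DAG paused and running the backfill command above for 2020-12-15 should then exercise the retry path that triggers the KeyError.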
Anything else we need to know: