
Tasks do not resume from DEFERRED state and fails eventually #434

Closed
pankajkoti opened this issue Jun 13, 2022 · 6 comments

@pankajkoti
Collaborator

Observing an issue where random tasks for random DAGs fail sometimes. The logs indicate that the task was DEFERRED and did not resume.

*** Reading remote log from s3://airflow-logs-ckqv957bt00000rso4cjtfi1s/cl2k1thcf796551f28bu039mtr/dag_id=example_s3_sensor/run_id=s3_sensor_dag_example_s3_sensor_2022-06-12/task_id=check_if_wildcard_exists/attempt=2.log.
[2022-06-13, 00:00:57 UTC] {taskinstance.py:1150} INFO - Dependencies not met for <TaskInstance: example_s3_sensor.check_if_wildcard_exists s3_sensor_dag_example_s3_sensor_2022-06-12 [up_for_retry]>, dependency 'Not In Retry Period' FAILED: Task is not ready for retry yet but will be retried automatically. Current date is 2022-06-13T00:00:57.946431+00:00 and task will be retried at 2022-06-13T00:01:55.958201+00:00.
[2022-06-13, 00:00:57 UTC] {local_task_job.py:101} INFO - Task is not able to be run

[2022-06-13, 00:01:57 UTC] {taskinstance.py:1160} INFO - Dependencies all met for <TaskInstance: example_s3_sensor.check_if_wildcard_exists s3_sensor_dag_example_s3_sensor_2022-06-12 [queued]>
[2022-06-13, 00:01:57 UTC] {taskinstance.py:1160} INFO - Dependencies all met for <TaskInstance: example_s3_sensor.check_if_wildcard_exists s3_sensor_dag_example_s3_sensor_2022-06-12 [queued]>
[2022-06-13, 00:01:57 UTC] {taskinstance.py:1357} INFO - 
--------------------------------------------------------------------------------
[2022-06-13, 00:01:57 UTC] {taskinstance.py:1358} INFO - Starting attempt 2 of 3
[2022-06-13, 00:01:57 UTC] {taskinstance.py:1359} INFO - 
--------------------------------------------------------------------------------
[2022-06-13, 00:01:57 UTC] {manager.py:53} WARNING - Unable to find an extractor. task_type=S3KeySensorAsync airflow_dag_id=example_s3_sensor task_id=check_if_wildcard_exists airflow_run_id=s3_sensor_dag_example_s3_sensor_2022-06-12 
[2022-06-13, 00:01:58 UTC] {taskinstance.py:1378} INFO - Executing <Task(S3KeySensorAsync): check_if_wildcard_exists> on 2022-06-12 00:00:00+00:00
[2022-06-13, 00:01:58 UTC] {standard_task_runner.py:52} INFO - Started process 454 to run task
[2022-06-13, 00:01:58 UTC] {standard_task_runner.py:79} INFO - Running: ['airflow', 'tasks', 'run', 'example_s3_sensor', 'check_if_wildcard_exists', 's3_sensor_dag_example_s3_sensor_2022-06-12', '--job-id', '21483', '--raw', '--subdir', 'DAGS_FOLDER/example_s3.py', '--cfg-path', '/tmp/tmp1kzpct42', '--error-file', '/tmp/tmp9msl3v03']
[2022-06-13, 00:01:58 UTC] {standard_task_runner.py:80} INFO - Job 21483: Subtask check_if_wildcard_exists
[2022-06-13, 00:01:58 UTC] {warnings.py:109} WARNING - /usr/local/lib/python3.9/site-packages/airflow/configuration.py:525: DeprecationWarning: The sql_alchemy_conn option in [core] has been moved to the sql_alchemy_conn option in [database] - the old setting has been used, but please update your config.
  option = self._get_environment_variables(deprecated_key, deprecated_section, key, section)

[2022-06-13, 00:01:58 UTC] {task_command.py:370} INFO - Running <TaskInstance: example_s3_sensor.check_if_wildcard_exists s3_sensor_dag_example_s3_sensor_2022-06-12 [running]> on host 10.1.3.34
[2022-06-13, 00:01:58 UTC] {taskinstance.py:1570} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_s3_sensor
AIRFLOW_CTX_TASK_ID=check_if_wildcard_exists
AIRFLOW_CTX_EXECUTION_DATE=2022-06-12T00:00:00+00:00
AIRFLOW_CTX_TRY_NUMBER=2
AIRFLOW_CTX_DAG_RUN_ID=s3_sensor_dag_example_s3_sensor_2022-06-12
[2022-06-13, 00:01:58 UTC] {taskinstance.py:1460} INFO - Pausing task as DEFERRED. dag_id=example_s3_sensor, task_id=check_if_wildcard_exists, execution_date=20220612T000000, start_date=20220613T000157
[2022-06-13, 00:01:58 UTC] {local_task_job.py:156} INFO - Task exited with return code 0
[2022-06-13, 00:01:58 UTC] {warnings.py:109} WARNING - /usr/local/lib/python3.9/site-packages/airflow/configuration.py:525: DeprecationWarning: The sql_alchemy_conn option in [core] has been moved to the sql_alchemy_conn option in [database] - the old setting has been used, but please update your config.
  option = self._get_environment_variables(deprecated_key, deprecated_section, key, section)

[2022-06-13, 00:01:58 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
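For reference, the DAG in these logs is an S3 wildcard sensor. A minimal sketch of what such a DAG might look like is below; the bucket name, key pattern, schedule, and retry settings are placeholders, and the sensor is assumed to be the S3KeySensorAsync shipped in astronomer-providers:

```python
# Hypothetical reconstruction of the DAG seen in the log above; bucket,
# key pattern, schedule, and retry settings are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from astronomer.providers.amazon.aws.sensors.s3 import S3KeySensorAsync

with DAG(
    dag_id="example_s3_sensor",
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=1)},
) as dag:
    # The async sensor defers itself to the triggerer instead of occupying
    # a worker slot; the task should resume when the trigger fires, but in
    # this issue it stays DEFERRED and eventually fails.
    check_if_wildcard_exists = S3KeySensorAsync(
        task_id="check_if_wildcard_exists",
        bucket_name="some-test-bucket",   # placeholder
        bucket_key="some/prefix/*.csv",   # placeholder wildcard key
        wildcard_match=True,
        aws_conn_id="aws_default",
    )
```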

@pankajkoti pankajkoti changed the title Taks does not resume from DEFERRED state and fails eventually Tasks do not resume from DEFERRED state and fails eventually Jun 13, 2022
@pankajkoti
Collaborator Author

pankajkoti commented Jun 27, 2022

I am guessing this could be due to an infra issue or an underlying core implementation issue. I was unable to reproduce it locally.
We will need to spend some time reproducing and debugging this.

@pankajkoti
Collaborator Author

We reached out to multiple support channels (#support-cloud and #astro-runtime) for help on this issue but did not get any response.

Hence, we have created a support ticket in Zendesk now: https://astronomer.zendesk.com/agent/tickets/10760

@pankajkoti
Collaborator Author

We need to do some analysis as part of the Providers team to determine the root cause (scheduler or triggerer) for this.
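For context, the defer/resume hand-off that such a deferrable sensor relies on looks roughly like the generic sketch below (an illustration using Airflow's public deferral API, not the actual astronomer-providers code). The task only leaves DEFERRED if the triggerer fires the trigger and the scheduler then re-queues the task instance, which is why both components are candidates for the root cause:

```python
# Generic illustration of the defer/resume cycle; ExampleDeferrableOperator
# and the 30-second trigger are made up for this sketch.
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class ExampleDeferrableOperator(BaseOperator):
    def execute(self, context):
        # The worker exits with return code 0 here and the task instance is
        # marked DEFERRED ("Pausing task as DEFERRED" in the log above).
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(seconds=30)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # The triggerer runs the trigger asynchronously; once it fires, the
        # scheduler is expected to queue the task again so this method runs
        # on a worker. In this issue that resumption never happened.
        self.log.info("Trigger fired, task resumed with event: %s", event)
```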

@pankajkoti pankajkoti self-assigned this Jul 4, 2022
@phanikumv phanikumv added the kind/stability Improving stability or resiliency label Jul 5, 2022
@pankajkoti
Collaborator Author

pankajkoti commented Jul 12, 2022

There was a known issue, apache/airflow#23824, which has been fixed in Airflow 2.3.3. We deployed a corresponding dev runtime image, astro-runtime-dev:5.0.6-alpha4, to the astronomer-dev.io deployment and did not observe this issue there.
After upgrading the base image to 5.0.6, we hopefully will not observe this issue anymore.
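As a quick sanity check after the upgrade, something like the hypothetical snippet below could confirm that a deployment is running an Airflow version that includes the fix (the packaging library is already an Airflow dependency):

```python
# Hypothetical post-upgrade check: fail fast if the running Airflow version
# predates the deferred-task fix released in 2.3.3.
import airflow
from packaging.version import Version

assert Version(airflow.__version__) >= Version("2.3.3"), (
    f"Airflow {airflow.__version__} predates the fix for apache/airflow#23824"
)
```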

@pankajkoti
Collaborator Author

Today, we deployed quay.io/astronomer/astro-runtime:5.0.6-base to our Integration Tests deployment in Astro Cloud, and a manually triggered run of the master DAG went fine for all DAGs (all green). We will watch one more run with our nightly job tomorrow morning to confirm that the issue is resolved, and then close the ticket.

@pankajkoti
Collaborator Author

This morning too, the master DAG ran fine and we did not observe this issue in any of the DAG runs. Closing the issue.
