-
Can you upload the scheduler logs from when this happens?
-
Converted it to a discussion until more information is provided that might enable us to assess whether this is an Airflow issue or not.
-
@ephraimbuddy Somehow I missed this. This issue happened again today: the tasks were scheduled in pods that belong to an EC2 spot instance node. Scheduler logs (newest first):

```
Date,Message
"2022-03-16T00:03:26.443Z","[2022-03-16 00:03:26,146] {scheduler_job.py:570} INFO - TaskInstance Finished: dag_id=continuous_load, task_id=load_table, run_id=scheduled__2022-03-15T23:45:00+00:00, run_start_date=2022-03-16 00:01:26.128956+00:00, run_end_date=None, run_duration=None, state=running, executor_state=failed, try_number=1, max_tries=1, job_id=403468, pool=default_pool, queue=default, priority_weight=1, operator=ExtractLoadOperator"
"2022-03-16T00:03:26.442Z","[2022-03-16 00:03:26,069] {kubernetes_executor.py:575} INFO - Changing state of (TaskInstanceKey(dag_id='continuous_load', task_id='load_table', run_id='scheduled__2022-03-15T23:45:00+00:00', try_number=1), , 'load_table.b45adf3ca7ef4f5482639bfedbc4c340', 'airflow-data-eng', '109859634') to failed"
"2022-03-16T00:03:26.442Z","[2022-03-16 00:03:26,064] {kubernetes_executor.py:374} INFO - Attempting to finish pod; pod_id: load_table.b45adf3ca7ef4f5482639bfedbc4c340; state: failed; annotations: {'dag_id': 'continuous_load', 'task_id': 'load_table', 'execution_date': None, 'run_id': 'scheduled__2022-03-15T23:45:00+00:00', 'try_number': '1'}"
"2022-03-16T00:00:21.204Z","[2022-03-16 00:00:20,563] {kubernetes_executor.py:297} INFO - Kubernetes job is (TaskInstanceKey(dag_id='continuous_load', task_id='load_table', run_id='scheduled__2022-03-15T23:45:00+00:00', try_number=1), ['airflow', 'tasks', 'run', 'continuous_load', 'load_table', 'scheduled__2022-03-15T23:45:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/continuous_load.py'], {'api_version': 'v1', 'kind': 'Pod', 'metadata': {'annotations': None, 'cluster_name': None, 'creation_timestamp': None, 'deletion_grace_period_seconds': None, 'deletion_timestamp': None, 'finalizers': None, 'generate_name': None, 'generation': None, 'initializers': None, 'labels': None, 'managed_fields': None, 'name': None, 'namespace': None, 'owner_references': None, 'resource_version': None, 'self_link': None, 'uid': None}, 'spec': {'active_deadline_seconds': None, 'affinity': None, 'automount_service_account_token': None, 'containers': [{'args': [], 'command': [], 'env': [], 'env_from': [], 'image': None, 'image_pull_policy': None, 'lifecycle': None, 'liveness_probe': None, 'name': 'base', 'ports': [], 'readiness_probe': None, 'resources': {'limits': {'cpu': '3000m', 'memory': '6G'}, 'requests': {'cpu': '1000m', 'memory': '2G'}}, 'security_context': None, 'stdin': None, 'stdin_once': None, 'termination_message_path': None, 'termination_message_policy': None, 'tty': None, 'volume_devices': None, 'volume_mounts': [], 'working_dir': None}], 'dns_config': None, 'dns_policy': None, 'enable_service_links': None, 'host_aliases': None, 'host_ipc': None, 'host_network': False, 'host_pid': None, 'hostname': None, 'image_pull_secrets': [], 'init_containers': None, 'node_name': None, 'node_selector': None, 'preemption_policy': None, 'priority': None, 'priority_class_name': None, 'readiness_gates': None, 'restart_policy': None, 'runtime_class_name': None, 'scheduler_name': None, 'security_context': None, 'service_account': None, 'service_account_name': None, 'share_process_namespace': None, 'subdomain': None, 'termination_grace_period_seconds': None, 'tolerations': None, 'volumes': []}, 'status': None}, None)"
"2022-03-16T00:00:01.934Z","[2022-03-16 00:00:00,968] {kubernetes_executor.py:530} INFO - Add task TaskInstanceKey(dag_id='continuous_load', task_id='load_table', run_id='scheduled__2022-03-15T23:45:00+00:00', try_number=1) with command ['airflow', 'tasks', 'run', 'continuous_load', 'load_table', 'scheduled__2022-03-15T23:45:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/continuous_load.py'] with executor_config {'KubernetesExecutor': {'request_memory': '2G', 'request_cpu': '1000m', 'limit_memory': '6G', 'limit_cpu': '3000m'}}"
"2022-03-16T00:00:01.933Z","[2022-03-16 00:00:00,953] {base_executor.py:82} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'continuous_load', 'load_table', 'scheduled__2022-03-15T23:45:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/continuous_load.py']"
"2022-03-16T00:00:01.933Z","[2022-03-16 00:00:00,952] {scheduler_job.py:473} INFO - Sending TaskInstanceKey(dag_id='continuous_load', task_id='load_table', run_id='scheduled__2022-03-15T23:45:00+00:00', try_number=1) to executor with priority 1 and queue default"
```
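The key line is the first one: the executor reported the pod as failed (executor_state=failed) while the task instance was still in state=running, and with try_number=1 / max_tries=1 the task was marked failed outright instead of being retried. For reference, a minimal sketch of how such a task would be defined (BashOperator stands in for the custom ExtractLoadOperator, and the 15-minute schedule is an assumption inferred from the run_id; the retries and executor_config values mirror the log above):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch only: BashOperator stands in for the custom ExtractLoadOperator.
with DAG(
    dag_id="continuous_load",
    start_date=datetime(2022, 3, 1),
    schedule_interval="*/15 * * * *",  # assumption, inferred from the scheduled__...T23:45:00 run_id
    catchup=False,
) as dag:
    load_table = BashOperator(
        task_id="load_table",
        bash_command="echo 'extract and load'",  # placeholder for the real work
        retries=1,  # matches max_tries=1 in the scheduler log
        # Legacy resource-dict style, exactly as echoed in the executor log:
        executor_config={
            "KubernetesExecutor": {
                "request_memory": "2G",
                "request_cpu": "1000m",
                "limit_memory": "6G",
                "limit_cpu": "3000m",
            }
        },
    )
```

Even with retries=1 configured, the scheduler moves the task straight to failed when the executor reports the pod as lost.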
-
We are facing this issue with our configuration as well. It seems like a bug: when we lose a machine to spot instance reclamation, our tasks fail without any retry. I will try to gather logs to share here, but I can say that this seems to be the case on our side. We expected the task to retry the other two times, but that is not happening.
-
Any update on this? We are still facing this issue with Airflow 2.8.1.
-
Apache Airflow version
2.2.3 (latest released)
What happened
We use the Kubernetes Executor in combination with AWS spot instances. From time to time a spot instance is evicted, and the tasks/pods running on it fail. All our tasks have retries >= 1, which generally works fine, but in this case the tasks running on the evicted K8s node are immediately marked as failed.
What you expected to happen
I expected the scheduler to honor the retry and trigger another instance of the task instead of marking it immediately as failed.
How to reproduce
Kill a pod while its task is running to obtain the same error described in this issue. A minimal reproduction sketch using the official Kubernetes Python client follows (the namespace and label selector are assumptions for illustration; adjust them to your deployment):
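```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Find the running Airflow worker pod for the task; the labels are assumed
# to follow the scheduler's dag_id/task_id pod labels.
pods = v1.list_namespaced_pod(
    namespace="airflow",  # placeholder namespace
    label_selector="dag_id=continuous_load,task_id=load_table",
)

for pod in pods.items:
    if pod.status.phase == "Running":
        # Force-delete to mimic an abrupt spot-instance eviction
        # rather than a graceful shutdown.
        v1.delete_namespaced_pod(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            grace_period_seconds=0,
        )
```

The task is then marked failed immediately, with no retry attempt, as described above.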
Operating System
Official Docker Airflow image 2.2.3
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==2.4.0
apache-airflow-providers-postgres==2.3.0
apache-airflow-providers-slack==4.1.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct