
When an ECS Task fails to start, ECS Operator raises a CloudWatch exception #15000

Closed
kanga333 opened this issue Mar 25, 2021 · 4 comments · Fixed by #18733
Labels: good first issue, kind:bug, provider:amazon

Comments

@kanga333 (Contributor)

Apache Airflow version: 1.10.13

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 4.14.209-160.339.amzn2.x86_64
  • Install tools: pip
  • Others:

What happened:

When an ECS task exits with stopCode: TaskFailedToStart, the ECS Operator fails with a ResourceNotFoundException from the GetLogEvents operation. This happens because the task never started, so no log stream was ever created.

[2021-03-14 02:32:49,792] {ecs_operator.py:147} INFO - ECS Task started: {'tasks': [{'attachments': [], 'availabilityZone': 'ap-northeast-1c', 'clusterArn': 'arn:aws:ecs:ap-northeast-1:xxxx:cluster/ecs-cluster', 'containerInstanceArn': 'arn:aws:ecs:ap-northeast-1:xxxx:container-instance/ecs-cluster/xxxx', 'containers': [{'containerArn': 'arn:aws:ecs:ap-northeast-1:xxxx:container/xxxx', 'taskArn': 'arn:aws:ecs:ap-northeast-1:xxxx:task/ecs-cluster/xxxx', 'name': 'container_image', 'image': 'xxxx.dkr.ecr.ap-northeast-1.amazonaws.com/ecr/container_image:latest', 'lastStatus': 'PENDING', 'networkInterfaces': [], 'cpu': '128', 'memoryReservation': '128'}], 'cpu': '128', 'createdAt': datetime.datetime(2021, 3, 14, 2, 32, 49, 770000, tzinfo=tzlocal()), 'desiredStatus': 'RUNNING', 'group': 'family:task', 'lastStatus': 'PENDING', 'launchType': 'EC2', 'memory': '128', 'overrides': {'containerOverrides': [{'name': 'container_image', 'command': ['/bin/bash', '-c', 'xxxx']}], 'inferenceAcceleratorOverrides': []}, 'startedBy': 'airflow', 'tags': [], 'taskArn': 'arn:aws:ecs:ap-northeast-1:xxxx:task/ecs-cluster/xxxx', 'taskDefinitionArn': 'arn:aws:ecs:ap-northeast-1:xxxx:task-definition/task:1', 'version': 1}], 'failures': [], 'ResponseMetadata': {'RequestId': 'xxxx', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'xxxx', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1471', 'date': 'Sun, 14 Mar 2021 02:32:48 GMT'}, 'RetryAttempts': 0}}
[2021-03-14 02:34:15,022] {ecs_operator.py:168} INFO - ECS Task stopped, check status: {'tasks': [{'attachments': [], 'availabilityZone': 'ap-northeast-1c', 'clusterArn': 'arn:aws:ecs:ap-northeast-1:xxxx:cluster/ecs-cluster', 'connectivity': 'CONNECTED', 'connectivityAt': datetime.datetime(2021, 3, 14, 2, 32, 49, 770000, tzinfo=tzlocal()), 'containerInstanceArn': 'arn:aws:ecs:ap-northeast-1:xxxx:container-instance/ecs-cluster/xxxx', 'containers': [{'containerArn': 'arn:aws:ecs:ap-northeast-1:xxxx:container/xxxx', 'taskArn': 'arn:aws:ecs:ap-northeast-1:xxxx:task/ecs-cluster/xxxx', 'name': 'container_image', 'image': 'xxxx.dkr.ecr.ap-northeast-1.amazonaws.com/ecr/container_image:latest', 'lastStatus': 'STOPPED', 'reason': 'CannotPullContainerError: failed to register layer: Error processing tar file(exit status 1): write /var/lib/xxxx: no space left on device', 'networkInterfaces': [], 'healthStatus': 'UNKNOWN', 'cpu': '128', 'memoryReservation': '128'}], 'cpu': '128', 'createdAt': datetime.datetime(2021, 3, 14, 2, 32, 49, 770000, tzinfo=tzlocal()), 'desiredStatus': 'STOPPED', 'executionStoppedAt': datetime.datetime(2021, 3, 14, 2, 34, 12, 810000, tzinfo=tzlocal()), 'group': 'family:task', 'healthStatus': 'UNKNOWN', 'lastStatus': 'STOPPED', 'launchType': 'EC2', 'memory': '128', 'overrides': {'containerOverrides': [{'name': 'container_image', 'command': ['/bin/bash', '-c', 'xxxx']}], 'inferenceAcceleratorOverrides': []}, 'pullStartedAt': datetime.datetime(2021, 3, 14, 2, 32, 51, 68000, tzinfo=tzlocal()), 'pullStoppedAt': datetime.datetime(2021, 3, 14, 2, 34, 13, 584000, tzinfo=tzlocal()), 'startedBy': 'airflow', 'stopCode': 'TaskFailedToStart', 'stoppedAt': datetime.datetime(2021, 3, 14, 2, 34, 12, 821000, tzinfo=tzlocal()), 'stoppedReason': 'Task failed to start', 'stoppingAt': datetime.datetime(2021, 3, 14, 2, 34, 12, 821000, tzinfo=tzlocal()), 'tags': [], 'taskArn': 'arn:aws:ecs:ap-northeast-1:xxxx:task/ecs-cluster/xxxx', 'taskDefinitionArn': 'arn:aws:ecs:ap-northeast-1:xxxx:task-definition/task:1', 'version': 2}], 'failures': [], 'ResponseMetadata': {'RequestId': 'xxxx', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'xxxx', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1988', 'date': 'Sun, 14 Mar 2021 02:34:14 GMT'}, 'RetryAttempts': 0}}
[2021-03-14 02:34:15,024] {ecs_operator.py:172} INFO - ECS Task logs output:
[2021-03-14 02:34:15,111] {credentials.py:1094} INFO - Found credentials in environment variables.
[2021-03-14 02:34:15,416] {taskinstance.py:1150} ERROR - An error occurred (ResourceNotFoundException) when calling the GetLogEvents operation: The specified log stream does not exist.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/ecs_operator.py", line 152, in execute
    self._check_success_task()
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/ecs_operator.py", line 175, in _check_success_task
    for event in self.get_logs_hook().get_log_events(self.awslogs_group, stream_name):
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/hooks/aws_logs_hook.py", line 85, in get_log_events
    **token_arg)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the GetLogEvents operation: The specified log stream does not exist.
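
The traceback bottoms out in a plain GetLogEvents call, so the same error can be reproduced directly with boto3 whenever the target log stream was never created (the log group and stream names below are placeholders):

```python
import boto3

# Placeholder names: the point is only that the stream was never created,
# which is exactly the situation after a TaskFailedToStart.
logs = boto3.client("logs", region_name="ap-northeast-1")
logs.get_log_events(
    logGroupName="/ecs/task",
    logStreamName="ecs/container_image/xxxx",
)
# -> botocore.errorfactory.ResourceNotFoundException:
#    The specified log stream does not exist.
```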

What you expected to happen:

The ResourceNotFoundException is misleading because it makes this look like a CloudWatch Logs problem. I would expect an AirflowException indicating that the task failed to start.
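
Roughly, I would expect a guard like the following before the log fetch in `_check_success_task` (just a sketch of the idea, not the actual operator code):

```python
from airflow.exceptions import AirflowException

def _check_success_task(self):
    response = self.client.describe_tasks(cluster=self.cluster, tasks=[self.arn])
    for task in response["tasks"]:
        # A task that never started has no log stream, so fail with a clear
        # Airflow error instead of letting GetLogEvents blow up.
        if task.get("stopCode") == "TaskFailedToStart":
            raise AirflowException(
                "The ECS task failed to start: {}".format(task.get("stoppedReason"))
            )
        # ... existing log fetching and container exit-code checks ...
```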

How to reproduce it:

This can be reproduced by running an ECS Task that fails to start, for example by specifying a non-existent entry_point.
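
For reference, a minimal DAG along these lines triggers it for me (task definition, cluster, command, and log group names are placeholders; anything that makes the container fail to start will do):

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator  # Airflow 1.10.x import path

with DAG("ecs_failed_to_start_repro",
         start_date=datetime(2021, 3, 1),
         schedule_interval=None) as dag:
    ECSOperator(
        task_id="failing_ecs_task",
        task_definition="task",        # placeholder task definition
        cluster="ecs-cluster",         # placeholder cluster
        launch_type="EC2",
        overrides={"containerOverrides": [{
            "name": "container_image",
            "command": ["/no/such/entrypoint"],  # invalid command so the task fails to start
        }]},
        awslogs_group="/ecs/task",                 # placeholder log group
        awslogs_stream_prefix="ecs/container_image",
    )
```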

Anything else we need to know:

I suspect Issue #11663 has the same problem, i.e. it's not a CloudWatch issue, but a failure to start an ECS Task.

kanga333 added the kind:bug label on Mar 25, 2021
boring-cyborg bot commented Mar 25, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

vikramkoka added the provider:amazon label on Mar 25, 2021
zachliu (Contributor) commented Apr 21, 2021

Yes. And Fargate tasks are more likely to hit this failed-to-start issue than conventional EC2 tasks. I've been back and forth with AWS support on this one for a long, long time. AWS's diagnosis is this:

  • AWS can randomly fail to provision the ENI (elastic network interface) when running your Fargate task, and when this happens the task fails to start silently. The only way to avoid this is to write your own "baby-sitting" routine that polls the ENI status of the ECS task after you trigger it (see the sketch at the end of this comment).

Luckily, we only see around 1% of our Fargate tasks fail to start because of this random issue.
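
If it helps anyone, the "baby-sitting" routine could look roughly like this with boto3 (the timeout, polling interval, and the exact ENI/status checks are my own assumptions based on what AWS support described):

```python
import time
import boto3

def babysit_fargate_task(cluster, task_arn, timeout=300, interval=10):
    """Poll a Fargate task until its ENI is attached and the task is RUNNING."""
    ecs = boto3.client("ecs")
    deadline = time.time() + timeout
    while time.time() < deadline:
        task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]
        if task.get("stopCode") == "TaskFailedToStart":
            raise RuntimeError("Task failed to start: %s" % task.get("stoppedReason"))
        enis = [a for a in task.get("attachments", [])
                if a["type"] == "ElasticNetworkInterface"]
        if enis and enis[0]["status"] == "ATTACHED" and task["lastStatus"] == "RUNNING":
            return task
        time.sleep(interval)
    raise RuntimeError("ENI was not attached within %d seconds" % timeout)
```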

I left #11663 open just to remind myself to create a new issue here, but apparently @kanga333 beat me to it 😁

eladkal (Contributor) commented Sep 27, 2021

@zachliu @kanga333 did removing the retry for now (#16150) solve the problem?

zachliu (Contributor) commented Sep 27, 2021

> @zachliu @kanga333 did removing the retry for now (#16150) solve the problem?

@eladkal unfortunately no; #16150 was for another issue that is only somewhat related to this one
