
[BUG] Service Termination Fails Silently on Invalid Credentials, Potentially Leading to Resource Leaks #4605

@andylizf

Description


Running sky serve down on the service appears to succeed:

andyl@andylizf-dev-server ~/skypilot (persistent-service)> sky serve down sky-service-6c01
Terminating service(s) 'sky-service-6c01'. Proceed? [Y/n]: 
Service 'sky-service-6c01' is scheduled to be terminated.

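The CLI confirms that the service is scheduled for termination. However, the controller logs below show that both replicas fail to terminate because the controller's GCP credentials lack the required permissions (HTTP 403, "Request had insufficient authentication scopes"), and that cleanup of the service's GCS storage bucket also fails with a 403 "Access denied" error. The failure itself is expected with these credentials, but no error is ever reported to the user: the CLI only says the service was scheduled for termination, so the replica VMs and the bucket can be left running and billing without the user noticing.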
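The missing piece is propagating the controller-side failure back to the caller. Below is a minimal sketch of that idea, not SkyPilot's actual code: it assumes a hypothetical `terminate_replica(replica_id)` callable that raises on failure, attempts every replica instead of stopping at the first error, and then raises a single hypothetical `ServiceTerminationError` that lists all failures so a caller could report something other than success.

```python
from typing import Callable, Dict, Iterable


class ServiceTerminationError(RuntimeError):
    """Hypothetical error listing every replica that failed to terminate."""

    def __init__(self, failures: Dict[int, Exception]) -> None:
        self.failures = failures
        details = '; '.join(
            f'replica {rid}: {exc}' for rid, exc in sorted(failures.items()))
        super().__init__(
            f'{len(failures)} replica(s) failed to terminate: {details}')


def terminate_all(replica_ids: Iterable[int],
                  terminate_replica: Callable[[int], None]) -> None:
    """Try to terminate every replica, then raise if any attempt failed.

    `terminate_replica` is a hypothetical stand-in for the per-replica
    teardown; it is expected to raise an exception on failure.
    """
    failures: Dict[int, Exception] = {}
    for rid in replica_ids:
        try:
            terminate_replica(rid)
        except Exception as exc:  # keep going so every replica gets an attempt
            failures[rid] = exc
    if failures:
        # Surface all failures at once instead of only logging them.
        raise ServiceTerminationError(failures)
```

Because sky serve down only schedules termination, surfacing this error to the user would also require the CLI to wait for, or later poll, the controller's result; the sketch covers only the aggregation step.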

/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching.  Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
I 01-25 02:02:15 service.py:103] Terminating replica 1 ...
I 01-25 02:02:15 service.py:103] Terminating replica 2 ...
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "insufficientPermissions"
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "insufficientPermissions"
E 01-25 02:02:20 replica_managers.py:163] Failed to terminate the sky serve replica cluster sky-service-6c01-1. Retrying after 5.001233025315587 seconds.Details: googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/us-central1-a/instances?filter=%28%28labels.ray-cluster-name+%3D+sky-service-6c01-1-e2dc%29%29&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">
E 01-25 02:02:20 replica_managers.py:167]   Traceback: Traceback (most recent call last):
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 151, in terminate_cluster
E 01-25 02:02:20 replica_managers.py:167]     sky.down(cluster_name)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:20 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/core.py", line 487, in down
E 01-25 02:02:20 replica_managers.py:167]     backend.teardown(handle, terminate=True, purge=purge)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:20 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 366, in _record
E 01-25 02:02:20 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend.py", line 146, in teardown
E 01-25 02:02:20 replica_managers.py:167]     self._teardown(handle, terminate, purge)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 3680, in _teardown
E 01-25 02:02:20 replica_managers.py:167]     self.teardown_no_lock(
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4022, in teardown_no_lock
E 01-25 02:02:20 replica_managers.py:167]     provisioner.teardown_cluster(repr(cloud),
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/provisioner.py", line 208, in teardown_cluster
E 01-25 02:02:20 replica_managers.py:167]     provision.terminate_instances(cloud_name, cluster_name.name_on_cloud,
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/__init__.py", line 52, in _wrapper
E 01-25 02:02:20 replica_managers.py:167]     return impl(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance.py", line 549, in terminate_instances
E 01-25 02:02:20 replica_managers.py:167]     handler_to_instances = _filter_instances(handlers, project_id, zone,
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance.py", line 38, in _filter_instances
E 01-25 02:02:20 replica_managers.py:167]     instance_dict = instance_handler.filter(
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance_utils.py", line 396, in filter
E 01-25 02:02:20 replica_managers.py:167]     response = (cls.load_resource().instances().list(
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
E 01-25 02:02:20 replica_managers.py:167]     return wrapped(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/googleapiclient/http.py", line 938, in execute
E 01-25 02:02:20 replica_managers.py:167]     raise HttpError(resp, content, uri=self.uri)
E 01-25 02:02:20 replica_managers.py:167] googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/us-central1-a/instances?filter=%28%28labels.ray-cluster-name+%3D+sky-service-6c01-1-e2dc%29%29&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">
E 01-25 02:02:20 replica_managers.py:167] 
E 01-25 02:02:20 replica_managers.py:163] Failed to terminate the sky serve replica cluster sky-service-6c01-2. Retrying after 6.67271440093055 seconds.Details: googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/us-central1-a/instances?filter=%28%28labels.ray-cluster-name+%3D+sky-service-6c01-2-e2dc%29%29&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">
E 01-25 02:02:20 replica_managers.py:167]   Traceback: Traceback (most recent call last):
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 151, in terminate_cluster
E 01-25 02:02:20 replica_managers.py:167]     sky.down(cluster_name)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:20 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/core.py", line 487, in down
E 01-25 02:02:20 replica_managers.py:167]     backend.teardown(handle, terminate=True, purge=purge)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:20 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 366, in _record
E 01-25 02:02:20 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend.py", line 146, in teardown
E 01-25 02:02:20 replica_managers.py:167]     self._teardown(handle, terminate, purge)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 3680, in _teardown
E 01-25 02:02:20 replica_managers.py:167]     self.teardown_no_lock(
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4022, in teardown_no_lock
E 01-25 02:02:20 replica_managers.py:167]     provisioner.teardown_cluster(repr(cloud),
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/provisioner.py", line 208, in teardown_cluster
E 01-25 02:02:20 replica_managers.py:167]     provision.terminate_instances(cloud_name, cluster_name.name_on_cloud,
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/__init__.py", line 52, in _wrapper
E 01-25 02:02:20 replica_managers.py:167]     return impl(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance.py", line 549, in terminate_instances
E 01-25 02:02:20 replica_managers.py:167]     handler_to_instances = _filter_instances(handlers, project_id, zone,
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance.py", line 38, in _filter_instances
E 01-25 02:02:20 replica_managers.py:167]     instance_dict = instance_handler.filter(
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance_utils.py", line 396, in filter
E 01-25 02:02:20 replica_managers.py:167]     response = (cls.load_resource().instances().list(
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
E 01-25 02:02:20 replica_managers.py:167]     return wrapped(*args, **kwargs)
E 01-25 02:02:20 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/googleapiclient/http.py", line 938, in execute
E 01-25 02:02:20 replica_managers.py:167]     raise HttpError(resp, content, uri=self.uri)
E 01-25 02:02:20 replica_managers.py:167] googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/us-central1-a/instances?filter=%28%28labels.ray-cluster-name+%3D+sky-service-6c01-2-e2dc%29%29&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">
E 01-25 02:02:20 replica_managers.py:167] 
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "insufficientPermissions"
E 01-25 02:02:28 replica_managers.py:163] Failed to terminate the sky serve replica cluster sky-service-6c01-1. Retrying after 9.805574426895168 seconds.Details: googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/us-central1-a/instances?filter=%28%28labels.ray-cluster-name+%3D+sky-service-6c01-1-e2dc%29%29&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">
E 01-25 02:02:28 replica_managers.py:167]   Traceback: Traceback (most recent call last):
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 151, in terminate_cluster
E 01-25 02:02:28 replica_managers.py:167]     sky.down(cluster_name)
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:28 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/core.py", line 487, in down
E 01-25 02:02:28 replica_managers.py:167]     backend.teardown(handle, terminate=True, purge=purge)
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:28 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 366, in _record
E 01-25 02:02:28 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend.py", line 146, in teardown
E 01-25 02:02:28 replica_managers.py:167]     self._teardown(handle, terminate, purge)
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 3680, in _teardown
E 01-25 02:02:28 replica_managers.py:167]     self.teardown_no_lock(
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4022, in teardown_no_lock
E 01-25 02:02:28 replica_managers.py:167]     provisioner.teardown_cluster(repr(cloud),
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/provisioner.py", line 208, in teardown_cluster
E 01-25 02:02:28 replica_managers.py:167]     provision.terminate_instances(cloud_name, cluster_name.name_on_cloud,
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/__init__.py", line 52, in _wrapper
E 01-25 02:02:28 replica_managers.py:167]     return impl(*args, **kwargs)
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance.py", line 549, in terminate_instances
E 01-25 02:02:28 replica_managers.py:167]     handler_to_instances = _filter_instances(handlers, project_id, zone,
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance.py", line 38, in _filter_instances
E 01-25 02:02:28 replica_managers.py:167]     instance_dict = instance_handler.filter(
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance_utils.py", line 396, in filter
E 01-25 02:02:28 replica_managers.py:167]     response = (cls.load_resource().instances().list(
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
E 01-25 02:02:28 replica_managers.py:167]     return wrapped(*args, **kwargs)
E 01-25 02:02:28 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/googleapiclient/http.py", line 938, in execute
E 01-25 02:02:28 replica_managers.py:167]     raise HttpError(resp, content, uri=self.uri)
E 01-25 02:02:28 replica_managers.py:167] googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/us-central1-a/instances?filter=%28%28labels.ray-cluster-name+%3D+sky-service-6c01-1-e2dc%29%29&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">
E 01-25 02:02:28 replica_managers.py:167] 
WARNING:googleapiclient.http:Encountered 403 Forbidden with reason "insufficientPermissions"
E 01-25 02:02:30 replica_managers.py:163] Failed to terminate the sky serve replica cluster sky-service-6c01-2. Retrying after 13.516306336032262 seconds.Details: googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/us-central1-a/instances?filter=%28%28labels.ray-cluster-name+%3D+sky-service-6c01-2-e2dc%29%29&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">
E 01-25 02:02:30 replica_managers.py:167]   Traceback: Traceback (most recent call last):
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 151, in terminate_cluster
E 01-25 02:02:30 replica_managers.py:167]     sky.down(cluster_name)
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:30 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/core.py", line 487, in down
E 01-25 02:02:30 replica_managers.py:167]     backend.teardown(handle, terminate=True, purge=purge)
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:30 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 366, in _record
E 01-25 02:02:30 replica_managers.py:167]     return f(*args, **kwargs)
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend.py", line 146, in teardown
E 01-25 02:02:30 replica_managers.py:167]     self._teardown(handle, terminate, purge)
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 3680, in _teardown
E 01-25 02:02:30 replica_managers.py:167]     self.teardown_no_lock(
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4022, in teardown_no_lock
E 01-25 02:02:30 replica_managers.py:167]     provisioner.teardown_cluster(repr(cloud),
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/provisioner.py", line 208, in teardown_cluster
E 01-25 02:02:30 replica_managers.py:167]     provision.terminate_instances(cloud_name, cluster_name.name_on_cloud,
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/__init__.py", line 52, in _wrapper
E 01-25 02:02:30 replica_managers.py:167]     return impl(*args, **kwargs)
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance.py", line 549, in terminate_instances
E 01-25 02:02:30 replica_managers.py:167]     handler_to_instances = _filter_instances(handlers, project_id, zone,
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance.py", line 38, in _filter_instances
E 01-25 02:02:30 replica_managers.py:167]     instance_dict = instance_handler.filter(
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/provision/gcp/instance_utils.py", line 396, in filter
E 01-25 02:02:30 replica_managers.py:167]     response = (cls.load_resource().instances().list(
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
E 01-25 02:02:30 replica_managers.py:167]     return wrapped(*args, **kwargs)
E 01-25 02:02:30 replica_managers.py:167]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/googleapiclient/http.py", line 938, in execute
E 01-25 02:02:30 replica_managers.py:167]     raise HttpError(resp, content, uri=self.uri)
E 01-25 02:02:30 replica_managers.py:167] googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/skypilot-375900/zones/us-central1-a/instances?filter=%28%28labels.ray-cluster-name+%3D+sky-service-6c01-2-e2dc%29%29&alt=json returned "Request had insufficient authentication scopes.". Details: "[{'message': 'Insufficient Permission', 'domain': 'global', 'reason': 'insufficientPermissions'}]">

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 160, in terminate_cluster
    raise RuntimeError('Failed to terminate the sky serve replica '
RuntimeError: Failed to terminate the sky serve replica cluster sky-service-6c01-2.
E 01-25 02:02:47 service.py:116] Replica 2 failed to terminate.
I 01-25 02:02:47 service.py:123] Cleaning up storage for version 1, task_yaml: /home/sky/.sky/serve/sky_service_6c01/task_v1.yaml
I 01-25 02:02:47 storage.py:645] Verifying bucket for storage skypilot-filemounts-andyl-75edb7ce
I 01-25 02:02:47 storage.py:1000] Storage type StoreType.GCS already exists.
E 01-25 02:02:55 service.py:78] Failed to clean up storage: sky.exceptions.StorageBucketDeleteError: Failed to delete GCS bucket skypilot-filemounts-andyl-75edb7ce.Detailed error: b'Removing gs://skypilot-filemounts-andyl-75edb7ce/job-75edb7ce/workdir/server.py#1737763198494882...\nRemoving gs://skypilot-filemounts-andyl-75edb7ce/job-75edb7ce/workdir/task.yaml#1737763198492095...\nRemoving gs://skypilot-filemounts-andyl-75edb7ce/...\nAccessDeniedException: 403 Access denied.\n'
E 01-25 02:02:55 service.py:81]   Traceback: Traceback (most recent call last):
E 01-25 02:02:55 service.py:81]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 2213, in _delete_gcs_bucket
E 01-25 02:02:55 service.py:81]     subprocess.check_output(remove_obj_command,
E 01-25 02:02:55 service.py:81]   File "/opt/conda/lib/python3.10/subprocess.py", line 421, in check_output
E 01-25 02:02:55 service.py:81]     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
E 01-25 02:02:55 service.py:81]   File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
E 01-25 02:02:55 service.py:81]     raise CalledProcessError(retcode, process.args,
E 01-25 02:02:55 service.py:81] subprocess.CalledProcessError: Command '[[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; };skypilot_gsutil rm -r gs://skypilot-filemounts-andyl-75edb7ce' returned non-zero exit status 1.
E 01-25 02:02:55 service.py:81] 
E 01-25 02:02:55 service.py:81] During handling of the above exception, another exception occurred:
E 01-25 02:02:55 service.py:81] 
E 01-25 02:02:55 service.py:81] Traceback (most recent call last):
E 01-25 02:02:55 service.py:81]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/serve/service.py", line 76, in cleanup_storage
E 01-25 02:02:55 service.py:81]     backend.teardown_ephemeral_storage(task)
E 01-25 02:02:55 service.py:81]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 01-25 02:02:55 service.py:81]     return f(*args, **kwargs)
E 01-25 02:02:55 service.py:81]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend.py", line 138, in teardown_ephemeral_storage
E 01-25 02:02:55 service.py:81]     return self._teardown_ephemeral_storage(task)
E 01-25 02:02:55 service.py:81]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 3631, in _teardown_ephemeral_storage
E 01-25 02:02:55 service.py:81]     storage.delete()
E 01-25 02:02:55 service.py:81]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1110, in delete
E 01-25 02:02:55 service.py:81]     store.delete()
E 01-25 02:02:55 service.py:81]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1907, in delete
E 01-25 02:02:55 service.py:81]     deleted_by_skypilot = self._delete_gcs_bucket(self.name)
E 01-25 02:02:55 service.py:81]   File "/home/sky/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 2220, in _delete_gcs_bucket
E 01-25 02:02:55 service.py:81]     raise exceptions.StorageBucketDeleteError(
E 01-25 02:02:55 service.py:81] sky.exceptions.StorageBucketDeleteError: Failed to delete GCS bucket skypilot-filemounts-andyl-75edb7ce.Detailed error: b'Removing gs://skypilot-filemounts-andyl-75edb7ce/job-75edb7ce/workdir/server.py#1737763198494882...\nRemoving gs://skypilot-filemounts-andyl-75edb7ce/job-75edb7ce/workdir/task.yaml#1737763198492095...\nRemoving gs://skypilot-filemounts-andyl-75edb7ce/...\nAccessDeniedException: 403 Access denied.\n'
E 01-25 02:02:55 service.py:81] 
E 01-25 02:02:55 service.py:289] Service sky-service-6c01 failed to clean up.
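The log above also shows how the controller retries: replica 1 waited about 5.0 s and then 9.8 s, and replica 2 about 6.7 s and then 13.5 s before a RuntimeError was raised with the HttpError as its "direct cause" (the wording Python prints for `raise ... from ...`). The following is a minimal sketch of that jittered exponential-backoff pattern. The function name and parameters (`teardown_with_retries`, `teardown`, `max_attempts`, `base_delay`, `factor`, `jitter`) are hypothetical and not taken from SkyPilot's source, and the sketch does not try to reproduce SkyPilot's exact delays.

```python
import logging
import random
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def teardown_with_retries(cluster_name: str,
                          teardown: Callable[[str], None],
                          max_attempts: int = 3,
                          base_delay: float = 5.0,
                          factor: float = 2.0,
                          jitter: float = 0.4,
                          sleep: Callable[[float], None] = time.sleep) -> None:
    """Retry `teardown(cluster_name)` with jittered exponential backoff.

    All names and defaults are illustrative assumptions. After the last
    failed attempt, raise RuntimeError chained to the last underlying error.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    delay = base_delay
    last_exc: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            teardown(cluster_name)
            return
        except Exception as exc:
            last_exc = exc
            if attempt == max_attempts:
                break
            # Random jitter keeps concurrent replicas from retrying in lockstep.
            wait = delay * (1 + random.uniform(0, jitter))
            logger.warning('Failed to terminate %s. Retrying after %.2f '
                           'seconds. Details: %s', cluster_name, wait, exc)
            sleep(wait)
            delay *= factor
    # `from last_exc` produces the "direct cause" traceback seen in the log.
    raise RuntimeError(
        f'Failed to terminate the cluster {cluster_name}.') from last_exc
```

Injecting `sleep` keeps the sketch testable without real delays. Note that retrying cannot fix a 403 "insufficient authentication scopes" error like the one above; the point of this report is that the final RuntimeError stays in the controller log instead of reaching the user.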
