Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage] Mount failure exception handling fix #2117

Merged
merged 2 commits into from
Jun 30, 2023

Conversation

romilbhardwaj
Copy link
Collaborator

Our storage mounting was failing silently and not raising any exceptions. E.g., if you change the install_cmd to exit 1, it would still succeed. This was because we were not re-raising the exception when e.returncode != exceptions.MOUNT_PATH_NON_EMPTY_CODE.

New output:

(base) ➜  test_yamls git:(storage_mounting_fail_handling) ✗ SKYPILOT_DEBUG=0 sky launch -c test --cloud gcp temp_test_storage_mounting.yaml
Task from YAML spec: temp_test_storage_mounting.yaml
I 06-21 15:59:15 optimizer.py:636] == Optimizer ==
I 06-21 15:59:15 optimizer.py:647] Target: minimizing cost
I 06-21 15:59:15 optimizer.py:659] Estimated cost: $0.4 / hour
I 06-21 15:59:15 optimizer.py:659] 
I 06-21 15:59:15 optimizer.py:732] Considered resources (1 node):
I 06-21 15:59:15 optimizer.py:781] --------------------------------------------------------------------------------------------
I 06-21 15:59:15 optimizer.py:781]  CLOUD   INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 06-21 15:59:15 optimizer.py:781] --------------------------------------------------------------------------------------------
I 06-21 15:59:15 optimizer.py:781]  GCP     n2-standard-8   8       32        -              us-central1   0.39          ✔     
I 06-21 15:59:15 optimizer.py:781] --------------------------------------------------------------------------------------------
I 06-21 15:59:15 optimizer.py:781] 
Launching a new cluster 'test'. Proceed? [Y/n]: Y
I 06-21 15:59:22 cloud_vm_ray_backend.py:3689] Creating a new cluster: "test" [1x GCP(n2-standard-8)].
I 06-21 15:59:22 cloud_vm_ray_backend.py:3689] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 06-21 15:59:23 cloud_vm_ray_backend.py:1321] To view detailed progress: tail -n100 -f /Users/romilb/sky_logs/sky-2023-06-21-15-59-15-853502/provision.log
I 06-21 15:59:25 cloud_vm_ray_backend.py:1644] Launching on GCP us-central1 (us-central1-a)
I 06-21 16:02:22 cloud_vm_ray_backend.py:1457] Successfully provisioned or found existing VM.
I 06-21 16:02:31 cloud_vm_ray_backend.py:3884] Processing 1 storage mount.
I 06-21 16:02:31 backend_utils.py:1286] Mounting (to 1 node): s3://digitalcorpora -> /mount_public_s3
⠹ MountingE 06-21 16:02:32 subprocess_utils.py:70] bash: warning: here-document at line 37 delimited by end-of-file (wanted `EOF')
E 06-21 16:02:32 subprocess_utils.py:70] Installing goofys...
E 06-21 16:02:32 subprocess_utils.py:70] 
Clusters
NAME  LAUNCHED        RESOURCES              STATUS  AUTOSTOP  COMMAND                        
test  a few secs ago  1x GCP(n2-standard-8)  UP      -         sky launch -c test --cloud...  

sky.exceptions.CommandError: Command to mount failed with return code 1.
Failed to run command before rsync s3://digitalcorpora -> /mount_public_s3. Ensure that the network is stable, then retry. See logs in ~/sky_logs/sky-2023-06-21-15-59-15-853502/storage_mounts.log

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • pytest tests/test_smoke.py -k 'test_aws_storage_mounts or test_gcp_storage_mounts or test_cloudflare_storage_mounts' --cloudflare --aws
  • pytest tests/test_smoke.py::test_spot_storage

Copy link
Collaborator

@landscapepainter landscapepainter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed the silent failure before the fix and the error raised without the heredoc being displayed after the fix. LGTM

@romilbhardwaj romilbhardwaj merged commit 18976d8 into master Jun 30, 2023
@romilbhardwaj romilbhardwaj deleted the storage_mounting_fail_handling branch June 30, 2023 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants