
[Storage] Error when launching another task with the same file mount #1319

Closed
pschafhalter opened this issue Oct 28, 2022 · 2 comments · Fixed by #1320
@pschafhalter commented:
Hi,

I'm repeatedly launching jobs on the same machine that use the same S3 mount point. On subsequent invocations, I run into an error when re-mounting:

```
echo "Mounting $SOURCE_BUCKET to $MOUNT_PATH with $MOUNT_BINARY..."
goofys -o allow_other --stat-cache-ttl 5s --type-cache-ttl 5s cloud-av-workloads /data
echo "Mounting done."
) && chmod +x ~/.sky/mount_785106.sh && bash ~/.sky/mount_785106.sh && rm ~/.sky/mount_785106.sh failed with return code 1.
Failed to run command before rsync cloud-av-workloads -> /data.
```

And here is the output of storage_mounts.log:

```
bash: warning: here-document at line 37 delimited by end-of-file (wanted `EOF')
Path already mounted - unmounting...
fusermount: failed to unmount /data: Device or resource busy
Successfully unmounted /data.
goofys already installed. Proceeding...
Mount path /data is not empty. Please make sure its empty.
```

I think this could be resolved by detecting whether the mount points are identical and keeping the file mount alive between jobs rather than unmounting and re-mounting.
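As a rough illustration of that idea, the mount script could check `/proc/mounts` and skip the unmount/remount cycle when the same bucket is already attached. This is a hypothetical sketch, not SkyPilot's actual code, and it assumes goofys lists the bucket name as the source device in `/proc/mounts`; the bucket and path names are taken from the log above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not SkyPilot's actual code): skip the unmount/remount
# cycle when the same bucket is already mounted at the target path.

# Assumption: a goofys mount appears in /proc/mounts with the bucket name as
# the source device, e.g. "cloud-av-workloads /data fuse ...".
already_mounted() {
  local source="$1" path="$2"
  grep -qs "^$source $path " /proc/mounts
}

SOURCE_BUCKET="cloud-av-workloads"   # names taken from the log above
MOUNT_PATH="/data"

if already_mounted "$SOURCE_BUCKET" "$MOUNT_PATH"; then
  action="keep-alive"   # same mount already present: nothing to do
else
  action="mount"        # would run goofys here
fi
echo "Planned action for $SOURCE_BUCKET at $MOUNT_PATH: $action"
```

On a machine where the bucket is not mounted, this falls through to the `mount` branch; on a machine that already has the identical mount, it keeps the mount alive instead of tearing it down.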

romilbhardwaj self-assigned this Oct 28, 2022

romilbhardwaj commented Oct 28, 2022

Thanks for the report @pschafhalter!

I was able to replicate this issue with this minimal YAML, which keeps the mounted path busy:

```yaml
resources:
  cloud: aws

file_mounts:
  /data:
    name: sky-romilb-mounttest

run: |
  tail -f /dev/null > /data/myfile &
```

The first `sky launch` succeeds, but the second one fails because the unmount fails and the mount script incorrectly continues anyway. Will fix this.
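The failure mode is visible in the log above: the script prints `Successfully unmounted /data.` even though `fusermount` returned an error, then proceeds to mount into a non-empty path. A minimal sketch of stricter handling follows; the path, variable names, and messages are illustrative stand-ins, not the actual SkyPilot mount script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: make the mount script abort when unmounting fails,
# instead of printing "Successfully unmounted" and continuing regardless.
# MOUNT_PATH is a stand-in directory; the real script uses /data.
MOUNT_PATH="${MOUNT_PATH:-/tmp/sky_demo_mount_path}"
mkdir -p "$MOUNT_PATH"

stage="start"
if mountpoint -q "$MOUNT_PATH"; then
  echo "Path already mounted - unmounting..."
  if ! fusermount -u "$MOUNT_PATH"; then
    # Previously the script fell through here and tried to mount anyway.
    echo "ERROR: failed to unmount $MOUNT_PATH; aborting." >&2
    exit 1
  fi
  echo "Successfully unmounted $MOUNT_PATH."
fi
stage="ready-to-mount"
echo "Unmount check passed; safe to mount at $MOUNT_PATH."
```

With this shape, a busy mount (as in the `Device or resource busy` log line) surfaces as a hard failure instead of the misleading `Mount path /data is not empty` error later on.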

Also (you might know this already): you can use `sky exec <cluster_name> <task.yaml>` to submit jobs to the cluster without re-running file mounts and setup :)

@romilbhardwaj commented:

@pschafhalter this should be fixed with #1320. Feel free to update your local branch and give it a go.
