[Demo][WIP] Storage mounting #527

Closed
wants to merge 2 commits into from

@romilbhardwaj romilbhardwaj commented Mar 10, 2022

This PR serves as a functioning demo for the upcoming storage mounting feature. This MVP has been implemented on AWS/S3 using goofys.

To try it out:

  1. Check out the storagemounting branch
  2. Edit examples/storage_mount_demo.yaml and specify a custom bucket name (instead of romil-fs)
  3. Do whatever you'd like in setup and run.
  4. sky launch storage_mount_demo.yaml
  5. Install goofys on your laptop and view the results locally with goofys <bucket> <mntpath>
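
For step 5, here is a minimal Python sketch of the local mount. It assumes goofys is already installed and on PATH and that AWS credentials are configured. The helper names (goofys_command, mount_bucket) are hypothetical and not part of sky; only the goofys <bucket> <mntpath> invocation comes from the steps above.

```python
import os
import subprocess
from typing import List


def goofys_command(bucket: str, mount_path: str) -> List[str]:
    """Build the argv for `goofys <bucket> <mntpath>`."""
    return ["goofys", bucket, os.path.expanduser(mount_path)]


def mount_bucket(bucket: str, mount_path: str) -> None:
    """Create the mount point if needed, then mount the bucket with goofys."""
    cmd = goofys_command(bucket, mount_path)
    os.makedirs(cmd[-1], exist_ok=True)
    # Raises CalledProcessError if goofys exits with a non-zero status.
    subprocess.run(cmd, check=True)
```

Calling mount_bucket with the bucket name chosen in step 2 and any empty local directory should then expose the files written by the task.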

Multi-node mounting (num_nodes: 2) has also been tested and works.

Ignore the code quality, ignore the failing tests - this is intended to be a demo PR and will not be merged.


gmittal commented Mar 10, 2022

👀

@Michaelvll

Wow, this looks very cool! I will try it out tomorrow.

@michaelzhiluo michaelzhiluo left a comment

Happy to try it out 😮

Michaelvll commented Mar 11, 2022

I tried to install goofys on macOS 11.5, but it failed. Even after installing macFUSE first with brew install --cask macfuse, restarting the computer, and then running brew install goofys, Homebrew still reports Error: goofys has been disabled because it requires closed-source macFUSE! I may need to build it from source, but that would be a poor experience for users.

name: romil-fs
source: ~/tmp # Empty dir for MVP, this will not be required if mode==MOUNT
persistent: True
mode: 'MOUNT'

Nit: Can't it just be mode: MOUNT? Why do we need the quotes?

Michaelvll commented Mar 11, 2022

I just tried distributed ResNet training with the mounted bucket for checkpoints, and it worked well. I used the commands and YAML file below.

  1. sky launch -c mount-test task.yaml
  2. sky cancel mount-test 1 (after 90 epochs)
  3. sky exec mount-test task.yaml (training resumed from epoch 90)

PS, I had to manually comment out the following lines to avoid the checkpoints being overwritten.
https://github.com/sky-proj/sky/blob/e63957ab0529442e7b7d43fbb2c40925b4e31e7a/sky/data/storage.py#L356-L362

name: resnet-distributed-app


resources:
    accelerators: V100

num_nodes: 2

file_mounts:
    /checkpoints:
        name: sky-checkpoints-zhwu
        source: ~/tmp
        mode: MOUNT

setup: |
    pip3 install --upgrade pip
    rm -r ./pytorch-distributed-resnet
    git clone https://github.com/Michaelvll/pytorch-distributed-resnet.git
    cd pytorch-distributed-resnet && pip3 install -r requirements.txt
    mkdir -p data  && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    tar -xvzf cifar-10-python.tar.gz
    mkdir -p /checkpoints/torch_ddp_resnet/

run: |
    cd pytorch-distributed-resnet
    git pull

    num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
    master_addr=`echo "$SKY_NODE_IPS" | head -n1`
    python3 -m torch.distributed.launch --nproc_per_node=1 \
    --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 100 --model_dir /checkpoints/torch_ddp_resnet/ \
    --resume --model_filename resnet_distributed-with-epochs.pth
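
The cancel-and-resume behavior reported in this comment depends on the training script writing its checkpoint under the mounted path and reading it back on restart. The --resume logic of resnet_ddp.py is not shown in this thread, so the following is only a stand-in sketch: it stores an epoch counter as JSON instead of a PyTorch checkpoint, and the function names and the stop_after parameter are hypothetical.

```python
import json
import os
from typing import Optional


def load_start_epoch(ckpt_path: str) -> int:
    """Return the last completed epoch, or 0 if no checkpoint exists yet."""
    if not os.path.exists(ckpt_path):
        return 0
    with open(ckpt_path) as f:
        return json.load(f)["epoch"]


def train(ckpt_path: str, num_epochs: int,
          stop_after: Optional[int] = None) -> int:
    """Run a stand-in training loop and return the last completed epoch."""
    start = load_start_epoch(ckpt_path)
    epoch = start
    for epoch in range(start + 1, num_epochs + 1):
        # One epoch of real training would go here.
        with open(ckpt_path, "w") as f:
            json.dump({"epoch": epoch}, f)
        if stop_after is not None and epoch >= stop_after:
            break  # Simulates the job being cancelled mid-run.
    return epoch
```

With ckpt_path pointing at a file under /checkpoints/torch_ddp_resnet/, a second invocation would pick up where the cancelled one stopped, because the file lives in the bucket rather than on the node's local disk.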

@Michaelvll

Out of curiosity, is it possible to mount the storage under the workdir? Many ML codebases save checkpoints in the same directory as the code, so can we mount the storage under ~/sky_workdir/checkpoints? If not, one possible fix is to add the target paths of the file_mounts to rsync's exclude list.
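
The exclude-list idea could look roughly like the sketch below. It is an illustration only: rsync_excludes is a hypothetical helper, not an existing sky function, and how sky actually invokes rsync is not shown in this thread. The only rsync behavior relied on is that an --exclude pattern with a leading / is anchored at the root of the transfer and a trailing / matches directories only.

```python
import posixpath
from typing import Iterable, List


def rsync_excludes(workdir: str, mount_paths: Iterable[str]) -> List[str]:
    """Build --exclude flags for mount targets located inside the workdir."""
    workdir = posixpath.normpath(workdir)
    flags = []
    for path in mount_paths:
        path = posixpath.normpath(path)
        # Only targets strictly inside the workdir need to be excluded.
        if path.startswith(workdir + "/"):
            rel = posixpath.relpath(path, workdir)
            # Leading "/" anchors the pattern at the transfer root.
            flags.append(f"--exclude=/{rel}/")
    return flags
```

For a mount at ~/sky_workdir/checkpoints, the resulting flag could be spliced into the sync command, for example ["rsync", "-a", *flags, "./", "<remote>:~/sky_workdir/"], where <remote> is a placeholder for the node address.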

@romilbhardwaj

Thanks for the comments everyone! Closing this PR for now.
