[Demo][WIP] Storage mounting #527
👀
Wow, this looks very cool! I will try it out tomorrow.
Happy to try it out 😮
I tried to install goofys on macOS 11.5, but it failed, even after installing macFUSE first with
name: romil-fs
source: ~/tmp # Empty dir for MVP, this will not be required if mode==MOUNT
persistent: True
mode: 'MOUNT'
Nit: Can't it just be mode: Mount? Why do we need the quote marks?
I just tried the distributed-resnet training with the mounted bucket for checkpoints. It worked well. I used the following yaml file and commands.
PS, I had to manually comment out the following lines to avoid the checkpoints being overwritten.
name: resnet-distributed-app
resources:
  accelerators: V100
num_nodes: 2
file_mounts:
  /checkpoints:
    name: sky-checkpoints-zhwu
    source: ~/tmp
    mode: MOUNT
setup: |
  pip3 install --upgrade pip
  rm -r ./pytorch-distributed-resnet
  git clone https://github.com/Michaelvll/pytorch-distributed-resnet.git
  cd pytorch-distributed-resnet && pip3 install -r requirements.txt
  mkdir -p data && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  tar -xvzf cifar-10-python.tar.gz
  mkdir -p /checkpoints/torch_ddp_resnet/
run: |
  cd pytorch-distributed-resnet
  git pull
  num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
  master_addr=`echo "$SKY_NODE_IPS" | head -n1`
  python3 -m torch.distributed.launch --nproc_per_node=1 \
    --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 100 --model_dir /checkpoints/torch_ddp_resnet/ \
    --resume --model_filename resnet_distributed-with-epochs.pth
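The run section of the YAML derives the node count and the master address from SKY_NODE_IPS using wc -l and head -n1. As a rough illustration, the same parsing could be sketched in Python. This assumes, as those shell commands imply, that the variable holds one IP per line with the head node first; the helper name parse_node_ips is made up for this example and is not part of sky.

```python
import os
from typing import List, Tuple


def parse_node_ips(raw: str) -> Tuple[int, str, List[str]]:
    """Split a newline-separated IP list into (num_nodes, master_addr, ips).

    Hypothetical helper: assumes one IP per line with the head node first,
    mirroring `wc -l` and `head -n1` in the YAML's run section.
    """
    # Drop blank lines so a trailing newline does not inflate the count.
    ips = [line.strip() for line in raw.splitlines() if line.strip()]
    if not ips:
        raise ValueError("SKY_NODE_IPS is empty or unset")
    return len(ips), ips[0], ips


if __name__ == "__main__":
    # Fallback to a single local address is an assumption for running
    # this sketch outside a cluster.
    raw = os.environ.get("SKY_NODE_IPS", "127.0.0.1")
    num_nodes, master_addr, _ = parse_node_ips(raw)
    print(f"nnodes={num_nodes} master_addr={master_addr}")
```

Unlike the shell version, this raises an error when the variable is empty instead of passing an empty master address on to the launcher.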
Out of curiosity, is it possible to mount the storage under the
Thanks for the comments everyone! Closing this PR for now.
This PR serves as a functioning demo for the upcoming storage mounting feature. This MVP has been implemented on AWS/S3 using goofys.
To try it out:
storagemounting branch
romil-fs) setup and run.
sky launch storage_mount_demo.yaml
goofys <bucket> <mntpath>
Also tested working for multinode mounting (num_nodes: 2).
Ignore the code quality, ignore the failing tests - this is intended to be a demo PR and will not be merged.
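For reference, the goofys <bucket> <mntpath> invocation quoted above could be wrapped roughly like this. It is only a sketch, not the code in this PR: build_goofys_command and mount_bucket are hypothetical names, and it assumes the goofys binary is on PATH, AWS credentials are already configured, and the caller has permission to create the mount point.

```python
import os
import subprocess
from typing import List


def build_goofys_command(bucket: str, mount_path: str) -> List[str]:
    """Return the argv for `goofys <bucket> <mntpath>` (the form quoted in the PR)."""
    if not bucket or not mount_path:
        raise ValueError("bucket and mount_path must be non-empty")
    # Expand a leading ~ so paths like ~/mnt work without a shell.
    return ["goofys", bucket, os.path.expanduser(mount_path)]


def mount_bucket(bucket: str, mount_path: str) -> None:
    """Create the mount point if needed, then invoke goofys.

    Assumes goofys is installed and credentials are available;
    raises CalledProcessError if goofys exits with a non-zero status.
    """
    cmd = build_goofys_command(bucket, mount_path)
    os.makedirs(cmd[2], exist_ok=True)
    subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the subprocess call lets the argument list be checked without goofys installed.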