Support accelerate multi-GPU training #558

mshukor · 2024-12-08T10:31:06Z

What this does

Based on this PR. It includes:

The ability to keep training without accelerate
Updated to the recent main
Some minor fixes
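A minimal sketch of how a training step can stay usable both with and without accelerate. It is an illustration, not the code in this PR: the helper names `backward_step` and `is_main_process` are hypothetical, and the only accelerate features assumed are the `Accelerator.backward(loss)` method and the `Accelerator.is_main_process` attribute.

```python
from typing import Any, Optional


def backward_step(loss: Any, optimizer: Any, accelerator: Optional[Any] = None) -> None:
    """Run one backward pass and optimizer update, with or without accelerate.

    `accelerator` is expected to be an `accelerate.Accelerator` instance, or
    None when training is started as a plain single-process run.
    """
    if accelerator is not None:
        # Let accelerate drive the backward pass (it handles gradient scaling
        # under mixed precision and gradient sync across processes).
        accelerator.backward(loss)
    else:
        # Plain PyTorch path: no accelerate involved.
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()


def is_main_process(accelerator: Optional[Any] = None) -> bool:
    """Return True if this process should log, evaluate, and save checkpoints.

    Without accelerate there is a single process, so it is always the main one.
    """
    return True if accelerator is None else bool(accelerator.is_main_process)
```

With this shape, the same loop body runs under `accelerate launch` (passing an `Accelerator`) or under plain `python` (passing `None`), and side effects such as wandb logging can be guarded with `is_main_process(...)`.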

Note: we still need to merge this with the vla branch before merging the PR.

How it was tested

ENV=aloha
ENV_TASK=AlohaTransferCube-v0
dataset_repo_id=lerobot/aloha_sim_transfer_cube_human
policy=act
LR=1e-5
LR_SCHEDULER=
USE_AMP=false
ASYNC_ENV=false

GPUS=2
EVAL_FREQ=10000
OFFLINE_STEPS=100000
TRAIN_BATCH_SIZE=4 # per-GPU batch size (global batch size / number of GPUs)
EVAL_BATCH_SIZE=50

TASK_NAME=lerobot_${ENV}_transfer_cube_${policy}_2gpus

python -m accelerate.commands.launch --num_processes=$GPUS --mixed_precision=fp16 lerobot/scripts/train.py \
 hydra.job.name=base_distributed_aloha_transfer_cube \
 hydra.run.dir=/data/mshukor/logs/lerobot/${TASK_NAME} \
 dataset_repo_id=$dataset_repo_id \
 policy=$policy \
 env=$ENV env.task=$ENV_TASK \
 training.offline_steps=$OFFLINE_STEPS training.batch_size=$TRAIN_BATCH_SIZE \
 training.eval_freq=$EVAL_FREQ eval.n_episodes=50 eval.use_async_envs=$ASYNC_ENV eval.batch_size=$EVAL_BATCH_SIZE \
 training.lr_scheduler=$LR_SCHEDULER training.lr=$LR \
 wandb.enable=true
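The batch-size comment in the script can be made concrete with a small sketch. It assumes `training.batch_size` is applied per process, as the script's comment suggests and as accelerate does by default when batches are not split across processes. The function names are hypothetical and are not part of lerobot or accelerate.

```python
def per_process_batch_size(global_batch_size: int, num_processes: int) -> int:
    """Split a global batch size evenly across processes (one per GPU)."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if global_batch_size % num_processes != 0:
        raise ValueError("global batch size must be divisible by num_processes")
    return global_batch_size // num_processes


def effective_global_batch_size(per_process: int, num_processes: int) -> int:
    """Number of samples consumed per optimizer step across all processes."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    return per_process * num_processes


# Values from the script above: TRAIN_BATCH_SIZE=4 and GPUS=2.
print(effective_global_batch_size(4, 2))
```

Under that assumption, the run above trains with 4 samples per GPU on 2 GPUs, so a single-GPU run would need a batch size of 8 to see the same number of samples per step.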

Cadene and others added 29 commits October 3, 2024 17:05

Enable CI for robot devices with mocked versions (huggingface#398)

26f97cf

Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>

Add support for Stretch (hello-robot) (huggingface#409)

1a343c3

Co-authored-by: Remi <remi.cadene@huggingface.co> Co-authored-by: Remi Cadene <re.cadene@gmail.com>

Fix nightly by updating .cache in dockerignore (huggingface#464)

d5b6696

Fix issue with wrong using index instead of camera_index in opencv (huggingface#466)

c29e70e

Add policy/act_aloha_real.yaml + env/act_real.yaml (huggingface#429)

97b1feb

Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>

Refactor record with add_frame (huggingface#468)

77478d5

Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>

Make say(blocking=True) work for Linux (huggingface#460)

cd0fc26

Fix gymnasium version as pre-1.0.0 (huggingface#471)

c351e1f

Co-authored-by: Remi <re.cadene@gmail.com> Co-authored-by: Remi <remi.cadene@huggingface.co>

Update 9_use_aloha.md, missing comma (huggingface#479)

2efee45

Fix link (huggingface#482)

114870d

Co-authored-by: Remi <remi.cadene@huggingface.co>

Add FeetechMotorsBus, SO-100, Moss-v1 (huggingface#419)

07e8716

Co-authored-by: jess-moss <jess.moss@huggingface.co> Co-authored-by: Simon Alibert <75076266+aliberts@users.noreply.github.com>

Fix autocalib moss (huggingface#486)

55e4ff6

[Fix] Move back to manual calibration (huggingface#488)

172809a

feat: enable to use multiple rgb encoders per camera in diffusion policy (huggingface#484)

538455a

Co-authored-by: Alexander Soare <alexander.soare159@gmail.com>

Fix config file (huggingface#495)

e0df56d

fix: broken images and a few minor typos in README (huggingface#499)

963738d

Signed-off-by: ivelin <ivelin117@gmail.com>

Add support for Windows (huggingface#494)

8af6935

bug causes error uploading to huggingface, unicode issue on windows. (huggingface#450)

20f4667

Add distinction between two unallowed cases in name check "eval_" (huggingface#489)

975c1c2

Rename deprecated argument (temporal_ensemble_momentum) (huggingface#490)

96c7052

Dataset v2.0 (huggingface#461)

32eb0ce

Co-authored-by: Remi <remi.cadene@huggingface.co>

add changes from accelerate branch

93e6c3b

training with accelerate utils

ec66c36

disable rendering

9f11f8a

remove disable rendering

bcd902b

log accelerate to wandb and fix symlink

6dbe067

fix loading to wandb

d3cbb77

fix eval on aloha

c70e17a

precommit

c3339f4
