
Reproducing K400 pretraining: loss stops decreasing #132

Open
Inscredion opened this issue Dec 20, 2024 · 0 comments


Comments


Inscredion commented Dec 20, 2024

I have recently been reproducing the K400 ViT-Small pretraining on 2×8 H100 GPUs with a per-GPU batch size of 50, but the loss plateaus at around 0.6 and will not go any lower.

```
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node={len(gpu_ids)} \
    --master_port={port} --nnodes={num_nodes} --node_rank={index} --master_addr={MASTER_ADDR} \
    --use-env \
    run_mae_pretraining.py \
    --data_path {DATA_PATH} \
    --mask_type tube \
    --mask_ratio 0.9 \
    --model pretrain_videomae_small_patch16_224 \
    --decoder_depth 4 \
    --batch_size 50 \
    --num_frames 16 \
    --sampling_rate 4 \
    --opt adamw \
    --opt_betas 0.9 0.95 \
    --warmup_epochs 40 \
    --lr 1.5e-4 \
    --save_ckpt_freq 50 \
    --epochs 800 \
    --resume {MODEL_PATH} \
    --log_dir {OUTPUT_DIR} \
    --output_dir {OUTPUT_DIR}
```
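As a quick sanity check on the settings above, here is a minimal sketch that computes the effective global batch size and the learning rate the training script would actually use, assuming the common MAE/VideoMAE convention of scaling the base `--lr` by `total_batch_size / 256` (this scaling rule is an assumption about `run_mae_pretraining.py`; verify it against your checkout):

```python
# Hypothetical sanity check for the run above.
# Assumption: the script applies the MAE-style linear scaling rule
#     actual_lr = base_lr * total_batch_size / 256
# (check run_mae_pretraining.py in your checkout to confirm).

batch_size_per_gpu = 50   # --batch_size
gpus_per_node = 8         # --nproc_per_node
num_nodes = 2             # --nnodes

total_batch_size = batch_size_per_gpu * gpus_per_node * num_nodes
base_lr = 1.5e-4          # --lr

actual_lr = base_lr * total_batch_size / 256
print(f"total batch size = {total_batch_size}")
print(f"effective lr     = {actual_lr:.6g}")
```

If the scaling rule applies, the run trains at a noticeably higher learning rate than the nominal `1.5e-4`, which is worth confirming when comparing loss curves against a reference run.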

The loss in the later stage of training looks like this:
```
eta: 0:03:24 lr: 0.000040 min_lr: 0.000040 loss: 0.6245 (0.6292) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.3587 (0.4485) time: 1.0438 data: 0.2533 max mem: 9071
```
Could you help me understand what might be causing this?
