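I recently tried to reproduce the K400 ViT-Small pre-training on 2x8 H100 GPUs with a per-GPU batch size of 50, but the loss stops decreasing once it reaches about 0.6.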
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node={len(gpu_ids)} \
    --master_port={port} --nnodes={num_nodes} --node_rank={index} --master_addr={MASTER_ADDR} \
    --use-env \
    run_mae_pretraining.py \
    --data_path {DATA_PATH} \
    --mask_type tube \
    --mask_ratio 0.9 \
    --model pretrain_videomae_small_patch16_224 \
    --decoder_depth 4 \
    --batch_size 50 \
    --num_frames 16 \
    --sampling_rate 4 \
    --opt adamw \
    --opt_betas 0.9 0.95 \
    --warmup_epochs 40 \
    --lr 1.5e-4 \
    --save_ckpt_freq 50 \
    --epochs 800 \
    --resume {MODEL_PATH} \
    --log_dir {OUTPUT_DIR} \
    --output_dir {OUTPUT_DIR}
The loss late in training looks like this:
eta: 0:03:24 lr: 0.000040 min_lr: 0.000040 loss: 0.6245 (0.6292) loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.3587 (0.4485) time: 1.0438 data: 0.2533 max mem: 9071
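For anyone comparing runs, a log line like the one above can be turned into numbers with a small parser. This is only a sketch: the helper name `parse_metrics` is hypothetical, and the regular expression assumes the `name: value (running_value)` layout seen in this one line rather than any documented log format.

```python
import re

# The log line quoted in this issue, copied verbatim.
LOG_LINE = (
    "eta: 0:03:24 lr: 0.000040 min_lr: 0.000040 loss: 0.6245 (0.6292) "
    "loss_scale: 65536.0000 (65536.0000) weight_decay: 0.0500 (0.0500) "
    "grad_norm: 0.3587 (0.4485) time: 1.0438 data: 0.2533 max mem: 9071"
)

# Matches "name: 1.23" optionally followed by " (4.56)".
# Requiring a decimal point skips "eta: 0:03:24" and "max mem: 9071".
METRIC_RE = re.compile(r"(\w+): (\d+\.\d+)(?: \((\d+\.\d+)\))?")


def parse_metrics(line):
    """Return {name: (value, value_in_parentheses_or_None)} for one log line."""
    metrics = {}
    for name, value, running in METRIC_RE.findall(line):
        metrics[name] = (float(value), float(running) if running else None)
    return metrics


metrics = parse_metrics(LOG_LINE)
for name, (value, running) in metrics.items():
    print(name, value, running)
```

Collecting `loss` and `grad_norm` this way across a whole log file makes it easier to see whether the loss is truly flat or still drifting down slowly.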
What could be causing this?
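One thing worth checking when reproducing on a different GPU count is the effective batch size and the learning rate it implies. The sketch below works through the arithmetic for the setup described here (2 nodes x 8 GPUs, `--batch_size 50`, `--lr 1.5e-4`). It assumes the training script applies the common linear scaling rule `lr * total_batch_size / 256`; that rule and the helper names are assumptions made for illustration, so confirm the actual behavior in `run_mae_pretraining.py`.

```python
def effective_batch_size(num_nodes, gpus_per_node, per_gpu_batch):
    """Total number of samples processed per optimizer step across all GPUs."""
    return num_nodes * gpus_per_node * per_gpu_batch


def linearly_scaled_lr(base_lr, total_batch, reference_batch=256):
    """Linear scaling rule (assumed, not confirmed for this script)."""
    return base_lr * total_batch / reference_batch


# Values taken from the issue: 2x8 H100, --batch_size 50, --lr 1.5e-4.
total_batch = effective_batch_size(num_nodes=2, gpus_per_node=8, per_gpu_batch=50)
scaled_lr = linearly_scaled_lr(base_lr=1.5e-4, total_batch=total_batch)

print(f"effective batch size: {total_batch}")
print(f"scaled peak lr (if linear scaling applies): {scaled_lr:.3e}")
```

If the reference run used a different number of GPUs or a different per-GPU batch size, its effective batch size and peak learning rate would differ from these values, which is one possible source of a mismatch in the loss curve.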