When training the model on a single GPU, I ran into an issue. #135
Then, when converting the weights to Hugging Face format, I got an error that the dimension and the target size of the expand operation do not match.
> This model has 32 layers, so if you only have one GPU, --num-layers must be 32
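To see why the conversion fails, note that the saved checkpoint only contains weights for the 2 layers that were trained, while the full model expects 32. A minimal sketch of the idea (the key names below are hypothetical, not the real checkpoint layout or conversion script):

```python
# Hypothetical illustration: training with --num-layers 2 saves weights for
# only 2 transformer layers, but RedPajama-INCITE-Chat-3B has 32, so the
# converter finds a size/layer-count mismatch when rebuilding the full model.
NUM_MODEL_LAYERS = 32   # full model depth (from the maintainer's reply)
trained_layers = 2      # what --num-layers 2 produced

# Pretend checkpoint: one entry per trained layer (key names are made up).
checkpoint = {f"layers.{i}.weight": None for i in range(trained_layers)}

missing = [f"layers.{i}.weight"
           for i in range(NUM_MODEL_LAYERS)
           if f"layers.{i}.weight" not in checkpoint]
print(f"missing weights for {len(missing)} of {NUM_MODEL_LAYERS} layers")
```

With a single GPU the fix is to keep `--num-layers 32` (and reduce batch size or sequence length instead if memory runs out).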
I run:

```
bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh
```

My configuration changes are as below:

```
--lr 1e-5 --seq-length 2048 --batch-size 8 --micro-batch-size 1 --gradient-accumulate-step 1 \
--num-layers 2 --embedding-dim 2560 \
--world-size 1 --pipeline-group-size 1 --data-group-size 1
```

The script launches training with:

```
(trap 'kill 0' SIGINT; \
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    &
```
My environment:
- GPU: NVIDIA RTX A4000
- Graphics card memory: 16 GB
- Number of CPUs available for use: 8
- Memory: 60 GB
- Free disk space: 200 GB
Error log:

```
Rank 0 node forward pass 0/1 takes 1.84s
{'loss': 16.892578125, 'lr': 1e-05}
Rank 0 node backward pass 0/1 takes 1.09s
cuda:0 cuda:0 cuda:0
!!! Warning: find inf in fp16 optimizer-step() !!!
/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
after cuda sync 0
Rank 0 node optimizer step takes 0.07s
Rank 0 node whole iteration takes 3.00s
Rank 0 node forward pass 0/1 takes 0.36s
{'loss': 16.744140625, 'lr': 9e-06}
```
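The "find inf in fp16 optimizer-step()" warning by itself is usually benign early in mixed-precision training: a dynamic-loss-scaling optimizer detects an inf/NaN gradient, skips that update, and lowers the loss scale. A toy sketch of that behavior (this is an assumption about the mechanism, not the repo's actual optimizer code):

```python
import math

# Toy dynamic-loss-scaling fp16 optimizer: when inf/NaN gradients are found,
# the update is skipped and the loss scale is halved, instead of corrupting
# the weights. This is the situation the warning in the log reports.
class ToyFP16Optimizer:
    def __init__(self, loss_scale=2.0 ** 16):
        self.loss_scale = loss_scale
        self.steps_taken = 0

    def step(self, grads):
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow detected: skip this step and back off the loss scale.
            self.loss_scale /= 2
            return False
        self.steps_taken += 1
        return True

opt = ToyFP16Optimizer()
print(opt.step([0.1, float("inf")]))  # overflowed step is skipped -> False
print(opt.step([0.1, 0.2]))           # clean step is applied -> True
print(opt.loss_scale)                 # 32768.0 after one back-off
```

If the warning appears on every iteration, though, the loss scale keeps shrinking and no updates are applied, which is worth investigating separately from the conversion error.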
Despite this error log, a model is still produced in model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10/. I don't know whether that model is correct?