Issue when training a model on a single GPU #135

yxy123 · 2023-06-08T06:02:47Z

bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh
My configuration changes are as follows:
--lr 1e-5 --seq-length 2048 --batch-size 8 --micro-batch-size 1 --gradient-accumulate-step 1
--num-layers 2 --embedding-dim 2560
--world-size 1 --pipeline-group-size 1 --data-group-size 1
(trap 'kill 0' SIGINT;
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0
&
My environment:
GPU: NVIDIA RTX A4000
GPU memory: 16 GB
Number of CPUs available for use: 8
Memory: 60 GB
Free space: 200 GB

error log:
Rank 0 node forward pass 0/1 takes 1.84s
{'loss': 16.892578125, 'lr': 1e-05}
Rank 0 node backward pass 0/1 takes 1.09s
cuda:0 cuda:0 cuda:0
!!! Warning: find inf in fp16 optimizer-step() !!!
/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
after cuda sync 0
Rank 0 node optimizer step takes 0.07s
Rank 0 node whole iteration takes 3.00s

Rank 0 node forward pass 0/1 takes 0.36s
{'loss': 16.744140625, 'lr': 9e-06}

Even with these warnings in the log, training still seems to produce a model in model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10/. I don't know whether this model is correct.
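The `find inf in fp16 optimizer-step()` line looks like the message a mixed-precision optimizer prints when it sees an overflowed gradient. Below is a minimal sketch of dynamic loss scaling in general, assuming the common "skip the step and shrink the scale" behaviour. It is not taken from OpenChatKit's code, and `DynamicLossScaler` and all of its parameters are hypothetical names chosen for illustration.

```python
import math
from typing import Iterable


class DynamicLossScaler:
    """Hypothetical helper: tracks a loss scale and decides whether to apply a step."""

    def __init__(self, init_scale: float = 2.0 ** 16, backoff: float = 0.5,
                 growth: float = 2.0, growth_interval: int = 1000) -> None:
        self.scale = init_scale
        self.backoff = backoff
        self.growth = growth
        self.growth_interval = growth_interval
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, grads: Iterable[float]) -> bool:
        """Return True if the optimizer step should be applied, False if skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow: skip this step and shrink the scale.
            self.scale *= self.backoff
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth
            self.good_steps = 0
        return True


scaler = DynamicLossScaler()
# Assumed toy gradients: the first step overflows, the second is clean.
for step, grads in enumerate([[0.5, float("inf")], [0.1, -0.2]]):
    applied = scaler.update(grads)
    print(f"step {step}: applied={applied}, scale={scaler.scale}")
```

Under this kind of scheme an occasional skipped step at the start of training is expected, so the warning alone does not necessarily mean the checkpoint is broken.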


yxy123 · 2023-06-08T07:47:52Z

I then tried converting the weights to Hugging Face format and got an error saying that the expanded tensor size does not match the existing size:
(myconda) root@ZaodV6:/mnt/tet/OpenChatKit-main# mkdir huggingface_models \
&& python tools/convert_to_hf_gptneox.py \
--config-name EleutherAI/pythia-6.9b-deduped \
--ckpt-path model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10 \
--save-path huggingface_models/Pythia-Chat-Base-3B \
--n-stages 4 \
--n-layer-per-stage 8 \
--fp16
loading config...
loaded config.
loading tokenizer...
loaded tokenizer.
creating empty model...
created empty model.
loading model ckpt...
loading stage 0
Traceback (most recent call last):
  File "tools/convert_to_hf_gptneox.py", line 123, in <module>
    load_decentralized_checkpoint(
  File "tools/convert_to_hf_gptneox.py", line 48, in load_decentralized_checkpoint
    model.gpt_neox.embed_in.weight.data[:] = _tmp['embed_in.weight']
RuntimeError: The expanded size of the tensor (4096) must match the existing size (2560) at non-singleton dimension 1. Target sizes: [50432, 4096]. Tensor sizes: [50432, 2560]

ChengYen-Tang · 2023-08-07T05:21:16Z

This model has 32 layers, so with only one GPU (--pipeline-group-size 1), --num-layers must be 32 rather than 2:
https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1/blob/main/config.json#L16
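A small sketch of the arithmetic behind that advice, assuming that `--num-layers` is the number of layers held by each pipeline stage and that layers are split evenly across stages. The helper name `layers_per_stage` is hypothetical and not part of the training scripts.

```python
def layers_per_stage(total_layers: int, pipeline_group_size: int) -> int:
    """Return how many layers each pipeline stage holds under an even split."""
    if pipeline_group_size < 1:
        raise ValueError("pipeline_group_size must be at least 1")
    if total_layers % pipeline_group_size != 0:
        raise ValueError(
            f"{total_layers} layers cannot be split evenly "
            f"across {pipeline_group_size} stages"
        )
    return total_layers // pipeline_group_size


TOTAL_LAYERS = 32  # layer count stated above for RedPajama-INCITE-Chat-3B-v1

print(layers_per_stage(TOTAL_LAYERS, 1))  # single GPU -> 32
print(layers_per_stage(TOTAL_LAYERS, 8))  # eight pipeline stages -> 4
```

With `--pipeline-group-size 1`, the single stage has to hold the whole model, which is why a value of 2 trains a truncated 2-layer network instead of the full 32-layer one.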

ChengYen-Tang · 2023-08-07T05:22:07Z

For the conversion error, see #86 (comment).
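As a rough illustration of why the conversion fails, the sketch below compares the values that appear in the commands and traceback in this thread. It only restates the numbers already shown (target size [50432, 4096] against checkpoint size [50432, 2560], and 4 stages of 8 layers against a run trained with 1 stage of 2 layers). The function `find_mismatches` is a hypothetical helper, not part of `tools/convert_to_hf_gptneox.py`.

```python
from typing import List, Sequence


def find_mismatches(model_embed_shape: Sequence[int],
                    ckpt_embed_shape: Sequence[int],
                    trained_layout: Sequence[int],
                    requested_layout: Sequence[int]) -> List[str]:
    """Compare what the converter expects with what the checkpoint contains.

    Layouts are (number of stages, layers per stage) pairs.
    """
    problems = []
    if tuple(model_embed_shape) != tuple(ckpt_embed_shape):
        problems.append(
            f"embedding shape: config expects {tuple(model_embed_shape)}, "
            f"checkpoint has {tuple(ckpt_embed_shape)}"
        )
    if tuple(trained_layout) != tuple(requested_layout):
        problems.append(
            f"stage layout: trained with {tuple(trained_layout)}, "
            f"conversion requested {tuple(requested_layout)}"
        )
    return problems


# Values copied from the commands and error message in this thread.
problems = find_mismatches(
    model_embed_shape=(50432, 4096),  # "Target sizes" in the RuntimeError
    ckpt_embed_shape=(50432, 2560),   # "Tensor sizes" in the RuntimeError
    trained_layout=(1, 2),            # --pipeline-group-size 1, --num-layers 2
    requested_layout=(4, 8),          # --n-stages 4, --n-layer-per-stage 8
)
for problem in problems:
    print(problem)
```

Both checks fail here, which suggests that `--config-name`, `--n-stages`, and `--n-layer-per-stage` need to describe the model that was actually trained rather than a different, larger one.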
