Megatron-LM pretrain_bert with deepspeed #179

@sj6077

I got the error below after applying the steps from the GPT2 tutorial to the BERT pretraining code (pretrain_bert.py).
Can you let me know what I missed?
The error message is "RuntimeError: expected scalar type Float but found Half (data at /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:1821)", and the full trace follows.

Traceback (most recent call last):
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 617, in <module>
    main()
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 595, in main
    timers, args)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 354, in train
    args, timers)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 304, in train_step
    args, timers)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/pretrain_bert.py", line 232, in forward_step
    checkpoint_activations=args.checkpoint_activations)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/deepspeed/pt/deepspeed_light.py", line 613, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/distributed.py", line 78, in forward
    return self.module(*inputs, **kwargs)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 65, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/model.py", line 82, in forward
    checkpoint_activations=checkpoint_activations)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/modeling.py", line 944, in forward
    output_all_encoded_layers=False, checkpoint_activations=checkpoint_activations)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/modeling.py", line 869, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/forked/DeepSpeed/DeepSpeedExamples/Megatron-LM/model/modeling.py", line 300, in forward
    embeddings = self.LayerNorm(embeddings)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/apex/normalization/fused_layer_norm.py", line 159, in forward
    input, self.weight, self.bias, self.normalized_shape,self.eps)
  File "/home/soojeong/deepspeed_venv/lib/python3.6/site-packages/apex/normalization/fused_layer_norm.py", line 25, in forward
    input_, ctx.normalized_shape, weight_, bias_, ctx.eps)
RuntimeError: expected scalar type Float but found Half (data at /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/include/ATen/core/TensorMethods.h:1821)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fcc922f5273 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: float* at::Tensor::data<float>() const + 0x449 (0x7fc8843aa5e9 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #2: cuda_layer_norm(at::Tensor*, at::Tensor*, at::Tensor*, at::Tensor*, int, int, c10::ArrayRef<long>, at::Tensor*, at::Tensor*, double) + 0x725 (0x7fc8843a76c5 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #3: layer_norm_affine(at::Tensor, c10::ArrayRef<long>, at::Tensor, at::Tensor, double) + 0x2a4 (0x7fc884394ca4 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1e254 (0x7fc8843a5254 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x1a8e0 (0x7fc8843a18e0 in /home/soojeong/deepspeed_venv/lib/python3.6/site-packages/fused_layer_norm_cuda.cpython-36m-x86_64-linux-gnu.so)
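
The trace ends inside apex's fused layer norm, which suggests that one of the tensors passed to the kernel has a different dtype than the others. One way to narrow this down is to compare the dtype of the layer's parameters with the dtype of its input just before the failing call. The snippet below is only a sketch under assumptions: `find_dtype_mismatches` is a hypothetical helper (not part of DeepSpeed, Megatron-LM, or apex), and a plain `torch.nn.LayerNorm` stands in for apex's fused layer so the check runs on CPU. Whether the mismatch in this issue is actually between the input and the LayerNorm parameters is not confirmed by the trace.

```python
import torch
from torch import nn


def find_dtype_mismatches(module: nn.Module, example_input: torch.Tensor):
    """Return (parameter_name, parameter_dtype) pairs whose dtype differs from the input's.

    Hypothetical diagnostic helper; it only compares dtypes and never runs the layer.
    """
    return [
        (name, param.dtype)
        for name, param in module.named_parameters()
        if param.dtype != example_input.dtype
    ]


# Stand-in for the failing layer: parameters cast to fp16, input left in fp32.
layer_norm = nn.LayerNorm(8).half()
embeddings = torch.randn(2, 8)  # float32 by default

mismatches = find_dtype_mismatches(layer_norm, embeddings)
for name, dtype in mismatches:
    print(f"{name}: parameter is {dtype}, input is {embeddings.dtype}")
```

In the real script, the same check could be called on `self.LayerNorm` and `embeddings` right before line 300 of model/modeling.py; an empty result would point to a cause other than an input/parameter dtype mismatch.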
