Description
Environment
8x A100 GPUs
Using container nvcr.io/nvidia/pytorch:21.05-py3
apt update
pip3 install nvidia-pyindex
pip3 install nvidia-tensorflow
pip3 install numpy --upgrade
export TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6+PTX"
DS_BUILD_OPS=1 pip3 install deepspeed
pip3 install mpi4py
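For reference, a quick sanity check of the environment (an illustrative sketch on my part, not one of the install steps above): it confirms the GPUs are visible and that the torch build in the container covers the sm_80 architecture requested via TORCH_CUDA_ARCH_LIST.

```python
import torch

# Illustrative check only, not part of the original setup:
# confirm all eight A100s are visible and that the container's torch
# build includes sm_80 as requested via TORCH_CUDA_ARCH_LIST.
print(torch.__version__, torch.version.cuda)   # 1.9.0a0+2ecb2c7 / 11.3 per ds_report below
print(torch.cuda.device_count())               # expect 8
print(torch.cuda.get_device_capability(0))     # expect (8, 0) on A100
print(torch.cuda.get_arch_list())              # expect 'sm_80' to be present
```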
root@x8a100-0000:/workspace# ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+2ecb2c7
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.4.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3
root@x8a100-0000:/workspace#
Without --deepspeed_transformer_kernel the training job runs fine on multiple A100 GPUs, but when I add --deepspeed_transformer_kernel it fails with the errors below (a sketch of how the kernel layer is built follows the traceback):
!!!! kernel execution error. (m: 6144, n: 2048, k: 2048, error: 13)
!!!! kernel execution error. (m: 2048, n: 2048, k: 8192, error: 13)
!!!! kernel execution error. (m: 6144, n: 2048, k: 2048, error: 13)
!!!! kernel execution error. (m: 512, n: 512, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 512, k: 512, error: 13)
Traceback (most recent call last):
File "train.py", line 519, in
main()
File "train.py", line 511, in main
run(args, model, optimizer)
File "train.py", line 482, in run
train(args, model, optimizer)
File "train.py", line 180, in train
validation(args, global_data_samples, model)
File "train.py", line 102, in validation
_, (tmp_mlm_loss, tmp_nsp_loss) = model.network(batch, log=False)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1086, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 1156, in forward
sequence_output, pooled_output = self.bert(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 981, in forward
encoded_layers = self.encoder(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 602, in forward
hidden_states = layer_module(hidden_states, attention_mask)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/transformer.py", line 592, in forward
return DeepSpeedTransformerFunction.apply(hidden_states,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/transformer.py", line 208, in forward
layer_norm_mean) = forward_func(config.layer_id,
RuntimeError: /home/scratch.efomenko_sw/ml/wip/cask.wip/xmma/cask_plugin/src/gemm/runner.cu:107: cudaFuncSetAttribute(kernel_entry, cudaFuncAttributeMaxDynamicSharedMemorySize, integer_cast<int32_t>(launch_configs[0].smemSizeInBytes)): an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
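For context, --deepspeed_transformer_kernel makes the BERT pretraining script build its encoder layers from the fused DeepSpeedTransformerLayer op that appears in the traceback, instead of the plain PyTorch modules. A rough sketch of how such a layer is constructed, loosely following the DeepSpeed transformer-kernel tutorial; the exact DeepSpeedTransformerConfig argument names and the sizes below are my assumptions and may not match 0.4.0 exactly:

```python
from deepspeed.ops.transformer import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

# Rough sketch only: argument names follow the DeepSpeed transformer-kernel
# tutorial and may differ between versions; the sizes are guesses based on
# the 2048/8192 GEMM shapes reported in the kernel execution errors above.
config = DeepSpeedTransformerConfig(batch_size=64,
                                    hidden_size=2048,
                                    heads=32,
                                    attn_dropout_ratio=0.1,
                                    hidden_dropout_ratio=0.1,
                                    num_hidden_layers=24,
                                    initializer_range=0.02,
                                    local_rank=0,
                                    seed=1234,
                                    fp16=True,
                                    pre_layer_norm=True)
layer = DeepSpeedTransformerLayer(config)
```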
Do you have any suggestions on how I can fix this?
Thank you.