Description
Environment
8x A100 GPUs
Using container nvcr.io/nvidia/pytorch:21.05-py3
apt update
pip3 install nvidia-pyindex
pip3 install nvidia-tensorflow
pip3 install numpy --upgrade
export TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6+PTX"
DS_BUILD_OPS=1 pip3 install deepspeed
pip3 install mpi4py
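For reference, a quick sanity check of the environment (an illustrative sketch on my part, not one of the install steps above): it confirms the GPUs are visible and that the torch build in the container covers the sm_80 architecture requested via TORCH_CUDA_ARCH_LIST.

```python
import torch

# Illustrative check only, not part of the original setup:
# confirm all eight A100s are visible and that the container's torch
# build includes sm_80 as requested via TORCH_CUDA_ARCH_LIST.
print(torch.__version__, torch.version.cuda)   # 1.9.0a0+2ecb2c7 / 11.3 per ds_report below
print(torch.cuda.device_count())               # expect 8
print(torch.cuda.get_device_capability(0))     # expect (8, 0) on A100
print(torch.cuda.get_arch_list())              # expect 'sm_80' to be present
```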
root@x8a100-0000:/workspace# ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+2ecb2c7
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.4.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3
root@x8a100-0000:/workspace#
Without --deepspeed_transformer_kernel the training job runs fine on multiple A100 GPUs, but when I add --deepspeed_transformer_kernel it fails with the errors below (a sketch of how the kernel layer is built follows the traceback):
!!!! kernel execution error. (m: 6144, n: 2048, k: 2048, error: 13)
!!!! kernel execution error. (m: 2048, n: 2048, k: 8192, error: 13)
!!!! kernel execution error. (m: 6144, n: 2048, k: 2048, error: 13)
!!!! kernel execution error. (m: 512, n: 512, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 512, k: 512, error: 13)
Traceback (most recent call last):
File "train.py", line 519, in
main()
File "train.py", line 511, in main
run(args, model, optimizer)
File "train.py", line 482, in run
train(args, model, optimizer)
File "train.py", line 180, in train
validation(args, global_data_samples, model)
File "train.py", line 102, in validation
_, (tmp_mlm_loss, tmp_nsp_loss) = model.network(batch, log=False)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1086, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 1156, in forward
sequence_output, pooled_output = self.bert(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 981, in forward
encoded_layers = self.encoder(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 602, in forward
hidden_states = layer_module(hidden_states, attention_mask)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/transformer.py", line 592, in forward
return DeepSpeedTransformerFunction.apply(hidden_states,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/transformer.py", line 208, in forward
layer_norm_mean) = forward_func(config.layer_id,
RuntimeError: /home/scratch.efomenko_sw/ml/wip/cask.wip/xmma/cask_plugin/src/gemm/runner.cu:107: cudaFuncSetAttribute(kernel_entry, cudaFuncAttributeMaxDynamicSharedMemorySize, integer_cast<int32_t>(launch_configs[0].smemSizeInBytes)): an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
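For context, --deepspeed_transformer_kernel makes the BERT pretraining script build its encoder layers from the fused DeepSpeedTransformerLayer op that appears in the traceback, instead of the plain PyTorch modules. A rough sketch of how such a layer is constructed, loosely following the DeepSpeed transformer-kernel tutorial; the exact DeepSpeedTransformerConfig argument names and the sizes below are my assumptions and may not match 0.4.0 exactly:

```python
from deepspeed.ops.transformer import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

# Rough sketch only: argument names follow the DeepSpeed transformer-kernel
# tutorial and may differ between versions; the sizes are guesses based on
# the 2048/8192 GEMM shapes reported in the kernel execution errors above.
config = DeepSpeedTransformerConfig(batch_size=64,
                                    hidden_size=2048,
                                    heads=32,
                                    attn_dropout_ratio=0.1,
                                    hidden_dropout_ratio=0.1,
                                    num_hidden_layers=24,
                                    initializer_range=0.02,
                                    local_rank=0,
                                    seed=1234,
                                    fp16=True,
                                    pre_layer_norm=True)
layer = DeepSpeedTransformerLayer(config)
```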
Do you have any suggestions on how I can fix this?
Thank you.