[FastTransformer v3.0/Pytorch] FasterTransformer v3.0 decoding doesn't work with small vocab_size #746

Closed
upczww opened this issue Nov 6, 2020 · 1 comment
Labels: bug (Something isn't working)

upczww commented Nov 6, 2020

Related to Model/Framework(s)
FasterTransformer/v3.0

Describe the bug
Ran FasterTransformer decoding under FP32 on PyTorch with:

./bin/decoding_gemm 8 4 8 64 3153 32 512 0
python pytorch/decoding_sample.py 8 6 32 8 64 4 3153 --time

where the vocab_size is 3153 instead of the original 31538. It raised the following error:

=============== Argument ===============
batch_size: 8
layer_num: 6
seq_len: 32
head_num: 8
head_size: 64
hidden_dim: 512
beam_size: 4
vocab_size: 3153
use_pretrained: False
use_fp16: False
TorchScript mode: False
test_time: True
========================================

tensor([[[1910, 1692, 1692,  ..., 1692, 1692, 1692],
         [2027, 1692, 1692,  ..., 1692, 1692, 1692],
         [1910, 1692, 1692,  ..., 1692, 1692, 2803],
         [1910, 1692, 1692,  ..., 1692, 2803, 2027]],

        [[2021,  154, 2021,  ..., 2794,  154, 2794],
         [2021,  154, 2021,  ..., 2794, 1892,  814],
         [2021,  154, 2021,  ..., 2794, 1892, 1892],
         [2021,  154, 2021,  ..., 2794,  814, 2794]],

        [[ 356, 2803, 2803,  ..., 2803, 2803, 2803],
         [2021, 2794, 2803,  ..., 2803, 2803, 2803],
         [2021, 2794, 2803,  ..., 2803, 2803, 2803],
         [2021, 2794, 2803,  ..., 2803, 2803, 2803]],

        ...,

        [[2696, 2027, 1782,  ..., 2696, 2027, 2696],
         [2696, 2027, 1782,  ..., 2696, 1779, 2696],
         [2696, 2027, 1782,  ..., 2696, 2696, 2027],
         [2696, 2027, 1782,  ..., 2696, 2027, 1146]],

        [[2794, 2794, 2794,  ..., 2794, 2794, 2794],
         [2794, 2794, 2794,  ..., 2794, 2794, 1146],
         [1910, 2794, 2794,  ..., 2794, 2794, 2794],
         [2794, 2794, 2794,  ..., 2794, 1146, 2794]],

        [[1910, 1910, 1910,  ..., 1910, 1910, 1910],
         [2803, 1910, 1910,  ..., 1910, 1910, 1910],
         [1910, 1910, 1910,  ..., 1910, 1910, 2027],
         [1910, 1910, 1910,  ..., 1910, 1910,  814]]], device='cuda:0',
       dtype=torch.int32)
tensor([[32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32]], device='cuda:0')
tensor([[[1910, 1692, 1692,  ..., 1692, 1692, 1692],
         [2027, 1692, 1692,  ..., 1692, 1692, 1692],
         [1910, 1692, 1692,  ..., 1692, 1692, 2803],
         [1910, 1692, 1692,  ..., 1692, 2803, 2027]],

        [[2021,  154, 2021,  ..., 2794,  154, 2794],
         [2021,  154, 2021,  ..., 2794, 1892,  814],
         [2021,  154, 2021,  ..., 2794, 1892, 1892],
         [2021,  154, 2021,  ..., 2794,  814, 2794]],

        [[ 356, 2803, 2803,  ..., 2803, 2803, 2803],
         [2021, 2794, 2803,  ..., 2803, 2803, 2803],
         [2021, 2794, 2803,  ..., 2803, 2803, 2803],
         [2021, 2794, 2803,  ..., 2803, 2803, 2803]],

        ...,

        [[2696, 2027, 1782,  ..., 2696, 2027, 2696],
         [2696, 2027, 1782,  ..., 2696, 1779, 2696],
         [2696, 2027, 1782,  ..., 2696, 2696, 2027],
         [2696, 2027, 1782,  ..., 2696, 2027, 1146]],

        [[2794, 2794, 2794,  ..., 2794, 2794, 2794],
         [2794, 2794, 2794,  ..., 2794, 2794, 1146],
         [1910, 2794, 2794,  ..., 2794, 2794, 2794],
         [2794, 2794, 2794,  ..., 2794, 1146, 2794]],

        [[1910, 1910, 1910,  ..., 1910, 1910, 1910],
         [2803, 1910, 1910,  ..., 1910, 1910, 1910],
         [1910, 1910, 1910,  ..., 1910, 1910, 2027],
         [1910, 1910, 1910,  ..., 1910, 1910,  814]]], device='cuda:0',
       dtype=torch.int32)
tensor([[32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32],
        [32, 32, 32, 32]], device='cuda:0')

Traceback (most recent call last):
  File "pytorch/decoding_sample.py", line 167, in <module>
    main()
  File "pytorch/decoding_sample.py", line 131, in main
    output2, lens2 = custom_decoding(args.batch_size, args.beam_size, args.seq_len, mem, mem_seq_lens)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/FasterTransformer/build/pytorch/utils/decoding.py", line 473, in forward
    output_ids, parent_ids, out_seq_lens = self.decoding.forward(batch_size, beam_size, max_seq_len, extended_memory, extended_memory_seq_lens)
RuntimeError: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_EXECUTION_FAILED /work/FasterTransformer/fastertransformer/cuda/open_decoder.cu:838
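
A side note on the failing call (my own reading, not anything stated in the log): each decoding step ends with a hidden-to-vocab logits GEMM, and open_decoder.cu:838 plausibly corresponds to that cuBLAS call. Below is a minimal PyTorch sketch of the same GEMM with the reported shapes; it succeeds for any vocab_size, which suggests the problem lies in the cuBLAS configuration recorded by decoding_gemm for this shape rather than in the shape itself:

import torch

# Shapes follow the printed arguments: batch_size=8, beam_size=4,
# hidden_dim = head_num * head_size = 8 * 64 = 512, vocab_size = 3153.
batch_size, beam_size, hidden_dim, vocab_size = 8, 4, 512, 3153

# Per-step decoder hidden states and the vocabulary projection weight.
hidden_states = torch.rand(batch_size * beam_size, hidden_dim, device="cuda")
weight = torch.rand(hidden_dim, vocab_size, device="cuda")

# The hidden-to-vocab GEMM: [32, 512] x [512, 3153] -> [32, 3153].
logits = hidden_states @ weight
print(logits.shape)  # torch.Size([32, 3153])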

To Reproduce
Steps to reproduce the behavior:
1. Build with the PyTorch image nvcr.io/nvidia/pytorch:20.03-py3:

mkdir build
cd build
cmake -DSM=60 -DCMAKE_BUILD_TYPE=Release -DBUILD_THE=ON -DBUILD_THS=ON -DBUILD_THSOP=ON -DCXX_STD=14 ..
make

2. Install OpenNMT-py:

pip install opennmt-py==1.1.1

3. Generate the GEMM config:

./bin/decoding_gemm 8 4 8 64 3153 32 512 0
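# Positional arguments above, as I read decoding_gemm's usage (the names are an assumption):
# batch_size=8, beam_width=4, head_num=8, size_per_head=64,
# vocab_size=3153, seq_len=32, memory_hidden_dim=512, is_fp16=0 (FP32)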

4. Run the decoding sample:

python pytorch/decoding_sample.py 8 6 32 8 64 4 3153 --time
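# Positional arguments above, matching the argument dump the sample prints:
# batch_size=8, layer_num=6, seq_len=32, head_num=8, head_size=64,
# beam_size=4, vocab_size=3153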

Expected behavior
The decoding sample works with vocab_size=31538, as in the decoding demos; I expected it to keep working when I decreased the vocab_size, but instead it raised the error above.

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): nvcr.io/nvidia/pytorch:20.03-py3
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): 4x Tesla P40 24G
  • CUDA driver version (e.g. 418.67): 440.64
  • HOST CUDA version: 10.2
@upczww upczww added the bug Something isn't working label Nov 6, 2020
@upczww upczww changed the title [Model/Framework] FasterTransformer v3.0 decoding doesn't work with small vocab_size [FastTransformer v3.0/Pytorch] FasterTransformer v3.0 decoding doesn't work with small vocab_size Nov 6, 2020
@byshiue byshiue self-assigned this Nov 6, 2020
byshiue (Collaborator) commented Nov 9, 2020

Thanks for your feedback. This bug is fixed in #747.

@byshiue byshiue closed this as completed Nov 9, 2020
changlan pushed a commit to changlan/DeepLearningExamples that referenced this issue Apr 5, 2021