FasterTransformer XLNet contains the XLNet model, an extension of Transformer-XL that is pre-trained with a generalized autoregressive method: it learns bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence's factorization order.
The XLNet model was presented in XLNet: Generalized Autoregressive Pretraining for Language Understanding. The abstract of the paper is the following:
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
This FasterTransformer XLNet implementation is aligned with the XLNet GitHub repository.
In this demo, you can run XLNet as a C++ program.
- CMake >= 3.8
- CUDA 11.0 or newer version
- Python 3 is recommended because some features are not supported in Python 2
- TensorFlow: verified on 1.15; 1.13 and 1.14 should also work
We recommend using the image nvcr.io/nvidia/tensorflow:22.09-tf1-py3; a quick version check for the toolchain is sketched below.
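A quick way to confirm that your environment meets these requirements is to print the tool versions; the sketch below only relies on each tool's standard version flag.

```bash
# Print the versions of the required toolchain components.
cmake --version | head -n 1        # expect CMake >= 3.8
nvcc --version | grep release      # expect CUDA 11.0 or newer
python3 --version                  # Python 3 is recommended
python3 -c "import tensorflow as tf; print(tf.__version__)"   # verified on 1.15
```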
- Start the docker container, ensuring the project directory is mounted into it. For example:
docker run \
  -it \
  --shm-size 5g \
  --rm \
  --gpus=all \
  -v {YOUR_FASTER_TRANSFORMER_PROJECT_DIR_ON_HOST}:/workspace/FasterTransformer \
  --workdir /workspace/FasterTransformer \
  nvcr.io/nvidia/tensorflow:22.09-tf1-py3 bash
cd /workspace/FasterTransformer
Here we use nvcr.io/nvidia/tensorflow:22.09-tf1-py3. You can switch to another CUDA-enabled TensorFlow container, as long as it complies with the requirements above.
- Build FasterTransformer with C++:
mkdir -p build
cd build
cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release ..
make -j12
Note: xx is the compute capability of your GPU. For example, 60 (P40), 61 (P4), 70 (V100), 75 (T4), or 80 (A100).
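If you are unsure of your GPU's compute capability, you can query it before running cmake. The sketch below assumes a driver whose nvidia-smi supports the compute_cap query field; on older drivers you may need to look the value up manually.

```bash
# Query the compute capability of GPU 0 (e.g. "7.5" for T4, "8.0" for A100)
# and strip the dot to get the value expected by -DSM (e.g. 7.5 -> 75).
cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0)
sm=${cc/./}
echo "Detected SM ${sm}"
cmake -DSM=${sm} -DCMAKE_BUILD_TYPE=Release ..
```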
cd /workspace/FasterTransformer/build
./bin/xlnet_gemm <batch_size> <sequence_length> <head_number> <size_per_head> <data_type>
./bin/xlnet_example <batch_size> <num_layers> <sequence_length> <head_number> <size_per_head> <data_type>
Data Type = 0 (FP32), 1 (FP16), or 2 (BF16). A small script that sweeps all three precisions is sketched after the examples below.
- Run XLNet under FP32 on C++
./bin/xlnet_gemm 8 128 12 64 0
./bin/xlnet_example 8 12 128 12 64 0
- Run XLNet under FP16 on C++
./bin/xlnet_gemm 8 128 12 64 1
./bin/xlnet_example 8 12 128 12 64 1
- Run XLNet under BF16 on C++
./bin/xlnet_gemm 8 128 12 64 2
./bin/xlnet_example 8 12 128 12 64 2
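To time several precisions in one go, the two binaries can be wrapped in a small loop. This sketch simply reuses the arguments from the examples above (batch size 8, 12 layers, sequence length 128, 12 heads, 64 per head); adjust them as needed for your own configuration.

```bash
#!/bin/bash
# Run the GEMM auto-tuner and then the XLNet example for each precision.
batch=8; layers=12; seq=128; heads=12; size=64

for dtype in 0 1 2; do   # 0 = FP32, 1 = FP16, 2 = BF16
    ./bin/xlnet_gemm    ${batch} ${seq} ${heads} ${size} ${dtype}
    ./bin/xlnet_example ${batch} ${layers} ${seq} ${heads} ${size} ${dtype}
done
```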
cd examples/tensorflow/xlnet
bash downloadModel.sh # Download the input and model data
bash verifyCorrectness.sh # For FP32 model
bash verifyCorrectness.sh -f 1 # For FP16 model
- Arguments for verifyCorrectness.sh (an example invocation follows the list):
Usage: bash verifyCorrectness.sh -d data_dir -m model_dir -s npz_dir -e gemm_file -g gpu_id -f is_use_fp16
-d The directory of input data. Default: ./data/STS-B
-n The data name. Default: sts-b
-m The directory of the xlnet models. Default: ./data/xlnet_cased_L-12_H-768_A-12
-s The directory which stores the generated npz files. Default: ./data
-e The gemm file for selecting cublas functions. Default: ./gemm.in
-g Specify which GPU to use. Default: 0
-f Specify whether to use float16. 1 means run in float16 mode. Default: 0
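For example, an invocation that spells out every option while enabling float16 could look like the following; the paths below simply repeat the defaults listed above.

```bash
bash verifyCorrectness.sh \
    -d ./data/STS-B \
    -n sts-b \
    -m ./data/xlnet_cased_L-12_H-768_A-12 \
    -s ./data \
    -e ./gemm.in \
    -g 0 \
    -f 1
```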
Hardware settings:
- T4 (with mclk 5000MHz, pclk 1590MHz) with Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
Here we compare the performance of TensorFlow XLA XLNet and FasterTransformer XLNet on T4. The hyper-parameters of the model are listed below, followed by a sample command for one configuration:
- head_num = 12
- size_per_head = 64
- num_layers = 12
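As an illustration, the FasterTransformer side of the first FP16 row below (batch size 16, sequence length 128) can be exercised with the C++ demo from above using these hyper-parameters; this is just the command pattern shown earlier, not a separate benchmark script.

```bash
./bin/xlnet_gemm 16 128 12 64 1
./bin/xlnet_example 16 12 128 12 64 1
```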
Batch_size | Seq_len | Precision | TensorFlow XLA Latency (ms) | FasterTransformer Latency (ms) | Speedup |
---|---|---|---|---|---|
16 | 128 | FP32 | 135.49 | 91.20 | 1.49 |
32 | 128 | FP32 | 271.12 | 186.88 | 1.45 |
64 | 128 | FP32 | 535.80 | 398.14 | 1.35 |
16 | 384 | FP32 | 646.19 | 381.30 | 1.69 |
32 | 384 | FP32 | 1291.76 | 763.32 | 1.69 |
64 | 384 | FP32 | 2751.80 | 1554.84 | 1.77 |
16 | 512 | FP32 | 949.05 | 544.85 | 1.74 |
32 | 512 | FP32 | 2057.27 | 1165.06 | 1.77 |
64 | 512 | FP32 | 1771.6 | 2269.14 | 1.94 |
Batch_size | Seq_len | Precision | TensorFlow XLA Latency (ms) | FasterTransformer Latency (ms) | Speedup |
---|---|---|---|---|---|
16 | 128 | FP16 | 53.04 | 26.08 | 2.03 |
32 | 128 | FP16 | 84.76 | 51.78 | 1.64 |
64 | 128 | FP16 | 151.09 | 106.12 | 1.42 |
16 | 384 | FP16 | 228.29 | 132.78 | 1.72 |
32 | 384 | FP16 | 528.38 | 255.89 | 2.06 |
64 | 384 | FP16 | 1079.63 | 539.69 | 2.00 |
16 | 512 | FP16 | 356.38 | 194.59 | 1.83 |
32 | 512 | FP16 | 851.15 | 397.92 | 2.14 |
64 | 512 | FP16 | 1771.6 | 773.58 | 2.29 |