add iluvatar bert case. #56

Merged
merged 8 commits on May 31, 2023
Changes from 4 commits
2 changes: 1 addition & 1 deletion training/benchmarks/bert/pytorch/config/_base.py
Collaborator

Please restore this file, or fix the trailing blank line below.

@@ -174,4 +174,4 @@
n_device: int = 1
n_gpu: int = 1

eval_interval_samples: int = 0
eval_interval_samples: int = 0
yuzhou03 marked this conversation as resolved.
1 change: 1 addition & 0 deletions training/benchmarks/driver/config_manager.py
Collaborator

Please restore the file or fix the blank line.

@@ -144,6 +144,7 @@ def activate(base_config,

    parsed_params = parse_from_args_and_config(params, cmd_args, ext_config,
                                               enable_extern_config)

Collaborator

Same as above.

    # TODO: consider a more elegant approach later
    if "tensorflow2" in base_config.__path__:
        base_config.override(parsed_params.__dict__, False)
40 changes: 40 additions & 0 deletions training/iluvatar/bert-pytorch/README.md
@@ -0,0 +1,40 @@
## Model Information
### Model Introduction

BERT stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Please refer to this paper for a detailed description of BERT:
[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
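
To make the "one additional output layer" point concrete, here is a minimal, hypothetical PyTorch sketch of attaching a classification head to a pre-trained encoder. The encoder interface and names below are illustrative assumptions, not the implementation used in this benchmark:

```python
import torch
from torch import nn


class BertWithClassificationHead(nn.Module):
    """Hypothetical fine-tuning wrapper: a pre-trained BERT body plus one linear output layer."""

    def __init__(self, bert_encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.bert = bert_encoder                              # pre-trained bidirectional encoder
        self.classifier = nn.Linear(hidden_size, num_labels)  # the single added output layer

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Assumes the encoder returns a pooled [CLS] representation of shape
        # (batch, hidden_size); the exact interface depends on the BERT implementation.
        pooled = self.bert(input_ids, attention_mask)
        return self.classifier(pooled)
```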

### Model Checkpoint Download
[Model checkpoint download](../../benchmarks/bert/pytorch/readme.md#模型信息与数据集模型checkpoint下载)
### Test Dataset Download
[Test dataset download](../../benchmarks/bert/pytorch/readme.md#模型信息与数据集模型checkpoint下载)

### Iluvatar BI-V100 GPU Configuration and Run Information
#### Environment Configuration
- ##### Hardware Environment
  - Machine and accelerator model: Iluvatar BI-V100 32GB

- ##### Software Environment
  - OS version: Ubuntu 20.04
  - OS kernel version: 4.15.0-156-generic x86_64
  - Accelerator driver version: 3.0.0
  - Docker version: 20.10.8
  - Training framework version: torch-1.10.2+corex.3.0.0
  - Dependencies: none


### Run Results
| Training resources    | Config file        | Run time (s) | Target accuracy | Converged accuracy | Steps   | Throughput (samples/s) |
| --------------------- | ------------------ | ------------ | --------------- | ------------------ | ------- | ---------------------- |
| Single node, 1 card   | config_BI-V100x1x1 | 17854.76     | 0.72            | 0.7325             | 25000   | 17.00                  |
| Single node, 8 cards  | config_BI-V100x1x8 | 20312.57     | 0.72            | 0.9619             | 25000   | 118.45                 |
| Two nodes, 8 cards each | config_BI-V100x2x8 | pending    | 0.72            | pending            | pending | pending                |
yuzhou03 marked this conversation as resolved.

### License

This project is licensed under the Apache 2.0 license.

Parts of this project's code are based on the MLCommons implementation at https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA.
25 changes: 25 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_Ampere_common.py
@@ -0,0 +1,25 @@
from config_common import *
from torch.cuda.amp import GradScaler
import os

grad_scaler = GradScaler(init_scale=float(os.getenv("INIT_LOSS_SCALE", 2**20)),
                         growth_interval=2000)

fp16 = True
ddp_type = "apex"
dist_backend = "nccl"

#train_batch_size = 56 if get_gpu_mem() > 75 else 27
yuzhou03 marked this conversation as resolved.
train_batch_size = 12
max_steps = 1000000

fused_gelu_bias = True
fused_mha = True
unpad = True
unpad_fmha = False
dense_seq_output = True
exchange_padding = True

dwu_num_rs_pg = 1
dwu_num_ar_pg = 1
dwu_num_blocks = 1
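
For context, a minimal sketch of the standard torch.cuda.amp loss-scaling loop that a GradScaler configured as above participates in; the tiny model, optimizer, and data here are placeholders for illustration, not part of this benchmark:

```python
import os

import torch
from torch.cuda.amp import GradScaler, autocast

# Same pattern as config_Ampere_common.py: the initial loss scale can be
# overridden through the INIT_LOSS_SCALE environment variable (default 2**20).
scaler = GradScaler(init_scale=float(os.getenv("INIT_LOSS_SCALE", 2**20)),
                    growth_interval=2000)

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
inputs = torch.randn(8, 16, device="cuda")
targets = torch.randn(8, 4, device="cuda")

with autocast():
    loss = torch.nn.functional.mse_loss(model(inputs), targets)

scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscales gradients; skips the step on inf/nan
scaler.update()                # grows/shrinks the scale for the next iteration
```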
19 changes: 19 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x1.py
@@ -0,0 +1,19 @@
from config_Ampere_common import *

gradient_accumulation_steps = 1
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
exchange_padding = False
learning_rate = 3.5e-4
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True

seed = 9031
18 changes: 18 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x2.py
yuzhou03 marked this conversation as resolved.
@@ -0,0 +1,18 @@
from config_Ampere_common import *

gradient_accumulation_steps = 1
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
learning_rate = 3.5e-4
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True

seed = 9031
18 changes: 18 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x4.py
@@ -0,0 +1,18 @@
from config_Ampere_common import *

gradient_accumulation_steps = 1
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
yuzhou03 marked this conversation as resolved.
learning_rate = 3.5e-4
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True

seed = 9031
18 changes: 18 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x8.py
@@ -0,0 +1,18 @@
from config_Ampere_common import *

gradient_accumulation_steps = 1
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
learning_rate = 3.5e-4
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True

seed = 9031
83 changes: 83 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_common.py
@@ -0,0 +1,83 @@
import torch
yuzhou03 marked this conversation as resolved.

# 'segmented' or 'full_iteration' options for CUDA graph capture.
# 'segmented' option: PyTorch Autograd orchestrates execution of backward ops every iteration.
# 'full_iteration' option: CUDA graph orchestrates execution of bwd ops every iteration without Autograd involvement (has composability limitations but could be more performant allowing optimizer and collectives capture).
cuda_graph_mode: str = "segmented"

# Maximum number of iterations to capture in a single graph.
# Requires 'full_iteration' option for '--cuda_graph_mode'.
max_iterations_per_graph: int = 4

# Whether to do allreduces during gradient accumulation steps.
allreduce_post_accumulation: bool = False

# Whether to do fp16 allreduce post accumulation.
allreduce_post_accumulation_fp16: bool = False

# Whether to run with unpadding.
unpad: bool = False

# Whether to run with unpadding.
unpad_fmha: bool = False

# Whether to pad tokens.
pad: bool = False

# Whether to disable fusion of the scaling to BMM1.
disable_fuse_scale: bool = False

# Whether to disable fusion of the QKV GEMMs.
disable_fuse_qkv: bool = False

# Whether to disable apex softmax.
disable_apex_softmax: bool = False

# Enable use of streams for pad case.
enable_stream: bool = False

# Whether to run with optimizations.
fused_mha: bool = False

# Enable CUDA graph execution.
use_cuda_graph: bool = False

# DDP type: 'apex' or 'native'.
ddp_type: str = "apex"

# Bypass AMP unscaling and inf/nan checks for SOL measurements.
bypass_amp: bool = False

# Whether to use distributed lamb.
distributed_lamb: bool = False

# distributed weight update group size. If arg is 0, defaults to one node
dwu_group_size: int = 0

# number of blocks in dwu scheme
dwu_num_blocks: int = 4

# number of chunks in dwu scheme
dwu_num_chunks: int = 1

# number of reduction-scatter streams in dwu scheme
dwu_num_rs_pg: int = 2

# number of all-reduce streams in dwu scheme
dwu_num_ar_pg: int = 4

# number of all-gather streams in dwu scheme
dwu_num_ag_pg: int = 2

# whether to overlap reductions with backprop
dwu_overlap_reductions: bool = False

# do allgather with e5m2 floats
dwu_e5m2_allgather: bool = False

# the apex optimization level, value: [O1, O2]
opt_level: str = "O2"


def get_gpu_mem():
    return torch.cuda.get_device_properties("cuda:0").total_memory / 1e+9
forestlee95 marked this conversation as resolved.
Empty file.
4 changes: 4 additions & 0 deletions training/iluvatar/bert-pytorch/config/layers/__init__.py
@@ -0,0 +1,4 @@
from .attention import *
from .fmha import *
from .mha import *
from .softmax import *
81 changes: 81 additions & 0 deletions training/iluvatar/bert-pytorch/config/layers/attention.py
@@ -0,0 +1,81 @@
from torch import nn
Collaborator

Model-structure code should not live under config; layers/attention.py is in the wrong location.

Contributor Author

This layers folder was also copied from NVIDIA's standard case.
[screenshot attached]


import apex
from apex.contrib.multihead_attn import SelfMultiheadAttn
from model.models.modeling import jit_dropout_add, BertSelfOutput
from .fmha import FMHA
from .mha import FastUnpadBertSelfAttention

#apex.amp.register_half_function(apex.normalization.fused_layer_norm, 'FusedLayerNorm')
import apex.normalization
#apex.amp.register_float_function(apex.normalization.FusedLayerNorm, 'forward')
from apex.contrib.layer_norm import FastLayerNorm as BertLayerNorm


# This module uses Apex C++ multihead attention implementation with fusions.
class FastBertAttention(nn.Module):

    def __init__(self, config):
        super(FastBertAttention, self).__init__()
        self.multi_head_attention = SelfMultiheadAttn(
            config.hidden_size,
            config.num_attention_heads,
            dropout=config.attention_probs_dropout_prob,
            bias=True,
            include_norm_add=False,
            impl='fast',
            separate_qkv_params=True,
            mask_additive=True)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.p = config.hidden_dropout_prob
        self.fused_dropout_add = config.fused_dropout_add
        self.layer_norm = BertLayerNorm(config.hidden_size, eps=1e-12)

    def forward(self, input_tensor, attention_mask, *args, **kwargs):
        residual = input_tensor
        multi_head_attention_output, _ = self.multi_head_attention(
            query=input_tensor,
            key=input_tensor,
            value=input_tensor,
            key_padding_mask=attention_mask,
            need_weights=True,
            attn_mask=None,
            is_training=self.training)
        if self.fused_dropout_add:
            attention_output = jit_dropout_add(multi_head_attention_output,
                                               residual, self.p, self.training)
            attention_output = self.layer_norm(attention_output)
            return attention_output
        else:
            attention_output = self.dropout(multi_head_attention_output)
            attention_output = self.layer_norm(attention_output + residual)
            return attention_output


class FastUnpadBertAttention(nn.Module):

    def __init__(self, config):
        super(FastUnpadBertAttention, self).__init__()
        if config.unpad_fmha:
            self.self = FMHA(config)
        else:
            self.self = FastUnpadBertSelfAttention(
                config,
                enable_stream=config.enable_stream,
                enable_sync=False,
                fuse_mask=config.fuse_mask,
                fuse_scale=config.fuse_scale,
                fuse_qkv=config.fuse_qkv,
                fuse_dropout=config.fuse_dropout,
                apex_softmax=config.apex_softmax,
                pad=config.pad)
        self.output = BertSelfOutput(config)

    def forward(self, input_tensor, attention_mask, seqlen, batch):
        self_output = self.self(input_tensor,
                                attention_mask,
                                seqlen,
                                batch,
                                is_training=self.training)
        attention_output = self.output(self_output, input_tensor)
        return attention_output