[kunlunxin] transformer model, fix running error (FlagOpen#337)
* [kunlunxin] transformer model, fix running error

add XACC args
install dllogger
compatibility with newer numpy
add xpu 1x1 and 2x8 configs

* [kunlunxin] update property info of transformer-pytorch with kunlunxin

* [kunlunxin] transformer model, remove save/load checkpoint

* [kunlunxin] transformer model, update README

---------

Co-authored-by: chenrui22 <chenrui22@baidu.com>
Co-authored-by: Zhou Yu <zycosmos@gmail.com>
3 people authored Dec 15, 2023
1 parent ced2419 commit 25d4d36
Showing 8 changed files with 49 additions and 15 deletions.
@@ -30,7 +30,7 @@ def write_longs(f, a):
     3: np.int16,
     4: np.int32,
     5: np.int64,
-    6: np.float,
+    6: np.float32,
     7: np.double,
 }

@@ -173,7 +173,7 @@ class IndexedDatasetBuilder(object):
         np.int16: 2,
         np.int32: 4,
         np.int64: 8,
-        np.float: 4,
+        np.float32: 4,
         np.double: 8
     }

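For context on the two hunks above: `np.float` was deprecated in NumPy 1.20 and removed in NumPy 1.24, so importing a module that still references it fails on newer NumPy; the fix maps dtype code 6 to the concrete `np.float32` instead. A small, hypothetical sanity check (not part of the commit) that the updated tables stay consistent:

```python
import numpy as np

# Dtype-code and element-size tables as they look after the fix; code 6 now
# resolves to the concrete np.float32 rather than the removed np.float alias.
dtypes = {1: np.uint8, 2: np.int8, 3: np.int16, 4: np.int32,
          5: np.int64, 6: np.float32, 7: np.double}
element_sizes = {np.uint8: 1, np.int8: 1, np.int16: 2, np.int32: 4,
                 np.int64: 8, np.float32: 4, np.double: 8}

# Every recorded element size should match the dtype's actual itemsize.
for code, dtype in dtypes.items():
    assert np.dtype(dtype).itemsize == element_sizes[dtype], (code, dtype)
print("dtype tables are consistent")
```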
2 changes: 0 additions & 2 deletions training/benchmarks/transformer/pytorch/train/trainer.py
@@ -38,7 +38,6 @@ def __init__(self, driver: Driver, adapter, evaluator: Evaluator,
         super(Trainer, self).__init__(self.config, self.model)

     def init(self, train_dataloader):
-        load_checkpoint(self.config, self, train_dataloader)
         # Send a dummy batch to warm the caching allocator
         src_dict, tgt_dict = data_utils.load_dictionaries(self.config)
         add_extra_items_to_checkpoint({'src_dict': src_dict, 'tgt_dict': tgt_dict})
@@ -107,7 +106,6 @@ def train_one_epoch(self, train_dataloader, valid_dataloader):
             state.converged_success()

         trainer.lr_step(epoch_itr.epoch, state.valid_loss)
-        save_checkpoint(args, trainer, epoch_itr, state.valid_loss)
         torch.cuda.synchronize()
         driver.event(Event.EPOCH_END, state.epoch)

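The two deletions above drop checkpoint loading and saving from the benchmark path, matching the commit note "remove save/load checkpoint". If checkpointing were ever needed again, one hedged alternative would be to gate the calls behind a config switch rather than delete them; the flag and helper below are purely illustrative and do not exist in the repo:

```python
from typing import Any, Callable

def maybe_save_checkpoint(config: Any, save_fn: Callable[..., None], *args: Any) -> None:
    """Invoke the real checkpoint saver only when an (illustrative)
    'enable_checkpoint' flag is set on the benchmark config."""
    if getattr(config, "enable_checkpoint", False):
        save_fn(*args)
```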
42 changes: 31 additions & 11 deletions training/kunlunxin/transformer-pytorch/README.md
@@ -16,15 +16,35 @@
- OS version: Ubuntu 20.04
- OS kernel version: 5.4.0-26-generic
- Accelerator driver version: 4.0.25
- Docker image and version: xmlir/xmlir_ubuntu_2004_x86_64:v0.24
- Training framework version: xmlir+a28ac56f
- Dependency versions: pytorch-1.12.1+cpu
- Docker image and version: pytorch1.12.1-cpu-ubuntu20.04:v0.01
- Training framework version: XPyTorch 1.12.1

#### Results

* General metrics

| Metric name | Metric value | Notes |
| ----------- | ------------ | ----- |
| Task category | Language Modelling && LLM | |
| Model | Transformer | |
| Dataset | WMT14 | http://statmt.org/wmt14/translation-task.html#Download |
| Data precision | precision, see "Performance metrics" | fp32 available |
| Hyperparameter changes | fix_hp, see "Performance metrics" | special hyperparameters needed to saturate the hardware when measuring throughput |
| Hardware device (short name) | R300 | |
| Device memory usage | mem, see "Performance metrics" | commonly called "device memory" (VRAM), in GiB |
| End-to-end time | e2e_time, see "Performance metrics" | total time plus Perf initialization time, etc. |
| Overall throughput | p_whole, see "Performance metrics" | actual number of training samples divided by total time (performance_whole) |
| Training throughput | p_train, see "Performance metrics" | excludes the evaluation time at the end of each epoch |
| Compute throughput | p_core, see "Performance metrics" | excludes data-IO time (p3>p2>p1) |
| Training result | bleu, see "Performance metrics" | BLEU (BiLingual Evaluation Understudy) is an automatic metric for machine-translated text that measures its similarity to a set of high-quality reference translations |
| Extra modifications | | |


* Performance metrics

| Config | precision | fix_hp | e2e_time | p_whole | p_train | p_core | final_bleu | mem |
| ------ | --------- | ------ | -------- | ------- | ------- | ------ | ---------- | --- |
| R300 single node, 1 card (1x1) | fp32 | | | | | | | 30.5/32.0 |
| R300 single node, 8 cards (1x8) | fp32 | | | | | | 27.07 | 26.7/32.0 |
| R300 two nodes, 8 cards each (2x8) | fp32 | | | | | | | 27.4/32.0 |


### Results

| Training resources | Config file | Run time (s) | Target accuracy | Converged accuracy | Steps | Performance (tokens/s) |
| ------------------ | ----------- | ------------ | --------------- | ------------------ | ----- | ---------------------- |
| single node, 8 cards | config_R300x1x8 | | 27.0 | 27.27 | 24370 | |

The [official accuracy](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/Transformer#training-performance-nvidia-dgx-a100-8x-a100-40gb) is 27.92; following the [official configuration](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/Transformer#training-performance-nvidia-dgx-a100-8x-a100-40gb), training yields an accuracy of 27.08.
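`final_bleu` above is scored with BLEU, and `sacrebleu` is already listed in the benchmark requirements; below is a minimal, illustrative example of computing a corpus-level score (the benchmark itself scores the full WMT14 test set inside its evaluator):

```python
import sacrebleu

# Toy hypothesis/reference pair, purely for illustration.
hypotheses = ["the cat sat on the mat"]
references = [["the cat sat on the mat"]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```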
@@ -0,0 +1,6 @@
+from config_common import *
+
+max_tokens = 8192
+max_epoch = 30
+max_update = 6500
+lr = [0.000846]
@@ -0,0 +1,6 @@
+from config_common import *
+
+max_tokens = 8192
+max_epoch = 30
+max_update = 3000
+lr = [0.000846]
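The two new config files above follow the repository's pattern: import the shared defaults from `config_common`, then override a few hyperparameters per scale (note the different `max_update` values). A hedged sketch of how a launcher might pick one by module name; the helper is illustrative, and only `config_common` and `config_R300x1x8` are named elsewhere in this commit:

```python
import importlib
from types import ModuleType

def load_scale_config(name: str) -> ModuleType:
    """Import a per-scale config module such as 'config_R300x1x8'
    (illustrative helper, not part of the repository)."""
    return importlib.import_module(name)

# cfg = load_scale_config("config_R300x1x8")
# print(cfg.max_tokens, cfg.max_epoch, cfg.max_update, cfg.lr)
```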
@@ -0,0 +1,2 @@
+export XACC=1
+export XACC_ARGS="-L O0 -Lamp"
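These exports come from the new environment_variables.sh and, per the commit note about XACC args, presumably enable the XACC-accelerated path with "-L O0 -Lamp" as its options. A hypothetical Python launcher helper (not from the repo) showing how such variables could be injected when starting training:

```python
import os
import subprocess
from typing import List

def run_with_xacc(cmd: List[str]) -> int:
    """Run a training command with the XACC environment applied
    (values copied verbatim from the diff; the helper itself is illustrative)."""
    env = dict(os.environ)
    env["XACC"] = "1"
    env["XACC_ARGS"] = "-L O0 -Lamp"
    return subprocess.call(cmd, env=env)

# Example invocation (entry point is illustrative):
# run_with_xacc(["python", "run_pretraining.py"])
```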
@@ -1 +1,2 @@
 sacrebleu
+dllogger
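`dllogger` is added presumably because the training code (derived from NVIDIA's Transformer reference implementation) logs through it and the run otherwise fails on the missing import, which fits the commit note "install dllogger". A minimal usage sketch; the backend choice, file name, and logged fields are illustrative:

```python
import dllogger
from dllogger import JSONStreamBackend, StdOutBackend, Verbosity

# Initialize once per process; the backends and file name here are illustrative.
dllogger.init(backends=[
    StdOutBackend(Verbosity.DEFAULT),
    JSONStreamBackend(Verbosity.VERBOSE, "train_log.json"),
])

# Log one training step's metrics, then flush before exit.
dllogger.log(step=0, data={"loss": 9.87, "tokens_per_sec": 1234.5})
dllogger.flush()
```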
1 change: 1 addition & 0 deletions training/run_benchmarks/config/test_conf.py
@@ -124,6 +124,7 @@
     # "distilbert:pytorch:R300:1:8:1": "/raid/dataset/distilbert/",
     # "swin_transformer:pytorch:R300:1:8:1": "/raid/dataset/ImageNet_1k_2012/",
     # "tacotron2:pytorch:R300:1:8:1": "/raid/dataset/tacotron2/LJSpeech/",
+    # "transformer:pytorch:R300:1:8:1": "/raid/dataset/transformer/wmt14_en_de_joined_dict",
     # "bigtransfer:pytorch:R300:1:8:1": "/raid/dataset/ImageNet_1k_2012/",

     # mthreads cases
