Add T5-Small training model #201

Merged
merged 8 commits into from
Sep 18, 2023

Conversation

dynamicheart
Contributor

No description provided.

@dynamicheart
Contributor Author

Running Log:
run20230816220901.zip

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch import nn
import config
Contributor

Some of the import statements are never used and can be removed.

Contributor Author

Removed.

@shh2000
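Collaborator

shh2000 commented Aug 21, 2023

Please compute the MFU for this case. For single-node training of a transformer decoder model, if the MFU is not between 0.2 and 0.6, we need to determine whether the original Model class is at fault or whether the way the training pipeline computes p_core includes wasted work. If the Model class in use (for example T5ForGeneratexxx) is the problem, please compute the MFU of a gradient-free forward pass of the bare model on a dummy_input and note it in the README.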

@dynamicheart
Contributor Author

params: 60 million (https://github.com/google-research/text-to-text-transfer-transformer)
A100 fp32: 19.5 TFLOPS (https://www.nvidia.com/en-us/data-center/a100/)

p_core: 186.49376621231423
tokens per sample: 1024

MFU = 186 * 1024 * 60000000 * 6 / (19.5 * 1000 * 1000 * 1000 * 1000) / 8 = 43.95%
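
The arithmetic in this comment can be reproduced with a small helper. This is only a sketch: the function name `mfu` and its parameter names are hypothetical and are not part of FlagPerf, and it assumes p_core is the aggregate samples per second across all GPUs, which is what dividing by 8 in the formula implies.

```python
def mfu(samples_per_second: float,
        tokens_per_sample: int,
        num_params: float,
        peak_flops_per_gpu: float,
        num_gpus: int = 1,
        flops_per_param_token: int = 6) -> float:
    """Return model FLOPs utilization as a fraction between 0 and 1.

    Assumption: samples_per_second is the aggregate throughput over all
    GPUs, so the available peak is peak_flops_per_gpu * num_gpus.
    """
    achieved_flops = (samples_per_second * tokens_per_sample
                      * num_params * flops_per_param_token)
    return achieved_flops / (peak_flops_per_gpu * num_gpus)


if __name__ == "__main__":
    # Numbers quoted in the comment: p_core rounded to 186 samples/s,
    # 1024 tokens per sample, 60 million parameters, 19.5 TFLOPS, 8 GPUs.
    print(f"{mfu(186, 1024, 60e6, 19.5e12, num_gpus=8):.2%}")  # 43.95%
```

Because the peak throughput is a parameter, the same helper can be reused with a different hardware peak without changing the formula.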

@dynamicheart dynamicheart reopened this Aug 28, 2023
@dynamicheart
Contributor Author

Correction: the GEMMs in this fp32 training run actually execute in TF32, and the A100's TF32 peak is 156 TFLOPS.

186 * 1024 * 60000000 * 6 / (156 * 1000 * 1000 * 1000 * 1000) / 8 = 5.49%

@shh2000
Collaborator

shh2000 commented Aug 28, 2023

> Correction: the GEMMs in this fp32 training run actually execute in TF32, and the A100's TF32 peak is 156 TFLOPS.
>
> 186 * 1024 * 60000000 * 6 / (156 * 1000 * 1000 * 1000 * 1000) / 8 = 5.49%

Please use a method like the following:
dummyinput = torch.randn(xx,xx,xx).float().cuda()
torch.cuda.synchronize()
for i in range(10000):
    y = model(dummyinput)
torch.cuda.synchronize()
to compute the MFU of the T5ForGeneratexxx module class itself. If it also comes out near 5.49%, then adding the model to perf and computing p_core introduced nothing abnormal.
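
The measurement pattern requested here (synchronize, run many forward passes, synchronize again, then read the clock) can be sketched without depending on any framework. The helper name `time_forward` and its arguments are hypothetical. The device barrier is passed in as a callable so the sketch runs anywhere; on a CUDA machine one would presumably pass `torch.cuda.synchronize` and a closure that calls the model on the dummy input, ideally under `torch.no_grad()`.

```python
import time
from typing import Callable


def time_forward(forward: Callable[[], object],
                 synchronize: Callable[[], None],
                 num_iters: int = 100,
                 warmup: int = 10) -> float:
    """Return the average wall-clock seconds per forward() call.

    synchronize() is called right before the clock starts and right
    before it stops, so asynchronously queued device work is counted.
    """
    for _ in range(warmup):          # warm-up iterations are not timed
        forward()
    synchronize()                    # drain pending work before timing
    start = time.perf_counter()
    for _ in range(num_iters):
        forward()
    synchronize()                    # wait for all queued work to finish
    return (time.perf_counter() - start) / num_iters


if __name__ == "__main__":
    # CPU stand-in workload with a no-op barrier, for illustration only.
    seconds = time_forward(lambda: sum(range(10_000)), lambda: None)
    print(f"average seconds per iteration: {seconds:.6f}")
```

Dividing the batch size by the returned time gives a samples-per-second figure that can be fed into the MFU formula used in this thread.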

@dynamicheart
Contributor Author

dynamicheart commented Aug 29, 2023

run20230829110822.zip
With exclusive use of the DGX server:

The 1x8 MFU is corrected to: 293.32947086516964 * 1024 * 60000000 * 6 / (156 * 1000 * 1000 * 1000 * 1000) / 8 = 8.66%

I will try the MFU measurement method suggested above a bit later.

@dynamicheart
Contributor Author

dynamicheart commented Aug 29, 2023

Single GPU, forward pass only, with synchronize added before and after the timed region:

MFU = 123.43439543146734 * 1024 * 60000000 * 2 / (156 * 1000 * 1000 * 1000 * 1000) = 9.72% (note: the parameter count is multiplied by 2 here)

The change is as follows:

+        num_steps = 10000
         for step, batch in enumerate(data_loader):
+            if step >= num_steps:
+                break
             batch = self.process_batch(batch, device)

+            torch.cuda.synchronize()
             pure_start_time = time.time()

             outputs = model(**batch)
             loss = outputs.loss

-            self.accelerator.backward(loss)
-            optimizer.step()
-            self.lr_scheduler.step()
-            optimizer.zero_grad()
+            #self.accelerator.backward(loss)
+            #optimizer.step()
+            #self.lr_scheduler.step()
+            #optimizer.zero_grad()

             if step % self.config.log_freq == 0:
                 print("Train Step " + str(step) + "/" + str(len(data_loader)) +
                       ", Loss : " + str(float(loss)))
-
+            torch.cuda.synchronize()
             self.training_state.purecomputetime += time.time(
             ) - pure_start_time

Log:
run20230829191725.zip
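
The forward-only estimate in this comment uses a factor of 2 while the training estimates in this thread use 6. That matches the common rule of thumb for dense transformers, stated here as an assumption rather than something derived in the thread: roughly 2 FLOPs per parameter per token for the forward pass and roughly twice that for the backward pass, ignoring attention-specific terms. With N parameters, T tokens per sample, G GPUs and per-GPU peak F_peak:

```latex
\begin{aligned}
C_{\text{forward}}  &\approx 2N \quad \text{FLOPs per token} \\
C_{\text{backward}} &\approx 4N \quad \text{FLOPs per token} \\
C_{\text{train}}    &= C_{\text{forward}} + C_{\text{backward}} \approx 6N \\[4pt]
\text{MFU} &= \frac{p_{\text{core}} \cdot T \cdot C}{G \cdot F_{\text{peak}}},
\qquad C \in \{C_{\text{forward}},\; C_{\text{train}}\}
\end{aligned}
```

Under this approximation a forward-only MFU and a full-training MFU are directly comparable, since each divides the FLOPs actually required by the step being timed by the same hardware peak.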

@@ -0,0 +1,2 @@
train_batch_size = 4
Collaborator

I see this only uses 4.3 GiB of VRAM. Could you raise it so that p_core goes a bit higher, for example by setting both batch sizes to 32 or 64?

Contributor Author

I can try raising it. At the time I was following the standard Hugging Face Transformers configuration.



def convert_model(model: nn.Module) -> nn.Module:
    """convert_model"""
Collaborator

Please confirm: there is no handling of DDP parallelism, AMP, and so on here. Is that because it is already covered by the accelerate.prepare() function?

Contributor Author

accelerate.prepare() does convert the model into a DDP model, but accelerate.prepare() does not apply AMP.


| Config | precision | fix_hp | e2e_time | p_whole | p_train | p_core | rouge1 | rouge2 | rougeL | rougeLsum | mem |
| ------------------ | --------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| A100, single node, 8 GPUs (1x8) | fp32 | / | 2658 | 135 | 168 | 186 | 41.27 | 19.02 | 29.27 | 38.47 | 4.3/40.0 |
Collaborator

@shh2000 shh2000 Aug 30, 2023

This result looks fine. Please also add the 1x1 and 2x8 experiments. There is no need to fill in e2e_time or any of the metrics tied to training results; just run stably for 10-20 minutes and compute a p_core.

Contributor Author

OK

@dynamicheart
Contributor Author

Log for 1x8 after raising the batch size to 32; memory usage is 36 GB and convergence is normal.
run20230914180104.zip

MFU = 400.26068691305795 * 1024 * 60000000 * 6 / (156 * 1000 * 1000 * 1000 * 1000) / 8 = 11.8%

@dynamicheart
Contributor Author

dynamicheart commented Sep 18, 2023

Log for a 1x1 run of more than two hours:
run20230918111640.zip

@shh2000 shh2000 merged commit 14bbdc0 into FlagOpen:main Sep 18, 2023
yuzhou03 added a commit to yuzhou03/FlagPerf that referenced this pull request Nov 14, 2023
* Add T5-Small training model (FlagOpen#201)

* add t5 small

* t5_small use huggingface accelerate

* fix coding style for t5_small model

* update t5_small bs config

* add MFU information in t5-small nvidia README

* fix t5_small doc typo
