
support tf2 resnet50 #1

Closed
wants to merge 72 commits
Commits
64e97e1
support resnet50-tf2
Jan 20, 2023
49eecd5
support resnet50-tf2
Jan 29, 2023
1a73ac4
rm
Jan 29, 2023
7d14804
rm unnecessary file
Jan 31, 2023
5e71532
Modify according to review comments
Feb 6, 2023
5f74151
Modify according to review comments
Feb 6, 2023
80798ab
check pep8
Feb 9, 2023
c17e0c2
check pep8
Feb 9, 2023
caf2032
support adapter
Feb 9, 2023
30b7aaa
update pip source to baidu for more stable accessibility (#20)
yuzhou03 Mar 10, 2023
2eb5bce
add standard case modification PR requirements
Mar 10, 2023
62f587f
update run_pretraining.example.py. config should be declared as globa…
Mar 13, 2023
e7d7139
rename file to skip flake8 check
Mar 13, 2023
a2abc50
init_dist_training_env support single gpu (#23)
upvenly Mar 13, 2023
d106cfc
Merge pull request #25 from yuzhou03/docs-dev
upvenly Mar 13, 2023
b59cbf7
Update tox.ini
yuzhou03 Mar 14, 2023
a0a01e9
Create .github/workflows/yapf-check.yaml
yuzhou03 Mar 15, 2023
b7d374b
GLM: fix distributed handling for single-card training (#28)
yuzhou03 Mar 16, 2023
8caf9a7
fix single-card training for CPM (#27)
yuzhou03 Mar 17, 2023
a9f5625
GLM: add kunlunxin R300x1x8 config
dynamicheart Mar 16, 2023
d410c08
fix yapf style check warnings
Mar 17, 2023
f059298
update markdown:1.solve display issues of angle brackets. 2.replace l…
yuzhou03 Mar 18, 2023
2b697c9
source environment_variables.sh before start task (#33)
dynamicheart Mar 20, 2023
05f517b
Merge pull request #32 from yuzhou03/yapf-style
upvenly Mar 20, 2023
7353e9f
Merge branch 'main' into doc-case-standard
yuzhou03 Mar 20, 2023
e977f03
Merge pull request #24 from yuzhou03/doc-case-standard
upvenly Mar 20, 2023
12ec9ef
doc: update supported case table for repo readme
Mar 21, 2023
114e67a
update case-adaption spec: add configs
Mar 21, 2023
b8226b3
remove get_finished_info from helper.py. update related content in st…
Mar 21, 2023
99f52db
GLM: add nv configs, add running statistics to GLM-pytorch case readme
Mar 21, 2023
d210b62
process logEvent for evaluate, step_begin and init_evaluate (#35)
yuzhou03 Mar 22, 2023
a5f0aff
Merge pull request #30 from yuzhou03/repo-readme
upvenly Mar 22, 2023
83b605f
Merge pull request #36 from yuzhou03/doc-adaption
upvenly Mar 22, 2023
e5be778
Merge pull request #29 from yuzhou03/glm-running-stat
upvenly Mar 22, 2023
4ef6057
GLM: add comments for kunlunxin config
dynamicheart Mar 23, 2023
48ad35b
Merge pull request #31 from dynamicheart/main
upvenly Mar 24, 2023
d229491
add pytorch glm by iluvatar
yan-rui Mar 24, 2023
b0397dd
fix some bugs
yan-rui Mar 24, 2023
14efa78
print cmd result after running for easy debug
yan-rui Mar 24, 2023
f1c5d7d
update iluvatar GLM-pytorch info
yan-rui Mar 24, 2023
f0e8f32
update iluvatar GLM-pytorch info
yan-rui Mar 24, 2023
2389c3f
first commit for iluvatar cpm-pytorch
yan-rui Mar 24, 2023
843095c
fix some bugs
yan-rui Mar 25, 2023
98ef713
add configs : 1x1 1x2 1x4 1x8 2x8
yan-rui Mar 25, 2023
8845a3c
Rename config_BI100x1x8.py to config_BI-V100x1x8.py
yan-rui Mar 25, 2023
75809c1
rename configs
Mar 25, 2023
f712404
update performance of iluvatar cpm-pytorch
yan-rui Mar 25, 2023
07949a1
iluvatar glm-pytorch reformat by yapf
Mar 25, 2023
c222721
iluvatar cpm-pytorch reformat by yapf
Mar 25, 2023
37cb352
Merge pull request #37 from yuzhou03/finished_info
upvenly Mar 27, 2023
25f5795
modify tf2-res50
Mar 27, 2023
95c11c9
Iluvatar glm-pytorch: remove unused code
yan-rui Mar 27, 2023
6d01d59
add email for software; fix some bugs
yan-rui Mar 27, 2023
e86056b
1. Adjusted the config file.
forestlee95 Mar 27, 2023
ec6d202
GLM: add GLM case document for kunlunxin
dynamicheart Mar 27, 2023
107c1a4
modify tf2-res50
Mar 28, 2023
b8f09f9
modify tf2-res50
Mar 28, 2023
2ac585e
modify tf2-res50
Mar 28, 2023
da56212
Iluvatar GLM-pytorch: remove unused files
yan-rui Mar 28, 2023
a3378ef
reset doc
Mar 28, 2023
0254099
Iluvatar CPM-pytorch: merge glm
yan-rui Mar 28, 2023
8f380af
rm unnecessary file
Mar 28, 2023
f6ce7fa
Merge pull request #40 from yan-rui/iluvatar_cpm
upvenly Mar 28, 2023
bbb4eeb
Merge pull request #41 from dynamicheart/kunlunxin_doc
upvenly Mar 29, 2023
7607122
rm unnecessary file
Mar 29, 2023
cb598c9
support resume & pep8
Mar 29, 2023
8dee852
update readme
Mar 29, 2023
d9e6055
update readme
Mar 29, 2023
800b9c8
update readme
Mar 29, 2023
a18737f
update readme
Mar 29, 2023
ef80bcb
update readme
Mar 29, 2023
bb54b1e
rm unnecessary code & update readme
Mar 30, 2023
2 changes: 1 addition & 1 deletion .github/linters/tox.ini
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[flake8]
max-line-length = 120

extend-ignore = E203,E265,E401,E402,E713,F401,W291,W292
extend-ignore = E231,E266,E203,E265,E401,E402,E713,F401,F403,W291,W292

# reference: https://www.flake8rules.com/
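As a hedged illustration of what the widened ignore list permits (rule meanings taken from the flake8 rules reference above; this snippet is not part of the PR):

```python
# Patterns that would normally trigger two of the newly ignored codes:
#   E231 - missing whitespace after a comma
#   E266 - too many leading '#' in a block comment
def scale(values,factor):  # "values,factor" would raise E231
    ## this double-hash comment would raise E266
    return [v * factor for v in values]

print(scale([1, 2, 3], 2))
```

With the extended ignore list, flake8 no longer flags either pattern, so yapf's formatting (checked separately by the new workflow) becomes the main style gate.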
17 changes: 17 additions & 0 deletions .github/workflows/yapf-check.yml
@@ -0,0 +1,17 @@
name: YAPF Formatting Check
on:
push:
branches: [ "main", "master"]
pull_request:
branches: [ "main", "master"]
workflow_dispatch: {}
jobs:
formatting-check:
name: Formatting Check
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: run YAPF to test if python code is correctly formatted
uses: AlexanderMelde/yapf-action@master
with:
args: --verbose
3 changes: 2 additions & 1 deletion .gitignore
@@ -3,4 +3,5 @@
.ijwb/
.vscode/
__pycache__/
.pytest_cache
.pytest_cache
training/result/*
156 changes: 143 additions & 13 deletions README.md
@@ -8,17 +8,147 @@ FlagPerf is a general-purpose benchmarking platform for heterogeneous AI chips. We hope

----------
### Supported Models
| Model | AI Framework | Status | Adapted Vendors |
| ---- | --------------- | -------- | ---- |
| [BERT](training/benchmarks/bert/paddle/readme.md) | [paddle](training/benchmarks/bert/paddle/readme.md) | Supported | |
| [CPM](training/benchmarks/cpm/pytorch/readme.md) | [pytorch](training/benchmarks/cpm/pytorch/readme.md) | Supported | |
| [GLM](training/benchmarks/glm/README.md) | [pytorch](training/benchmarks/glm/pytorch/readme.md) | Supported ||
| ResNet50 | pytorch | DOING | |
| MobileNetV2 | pytorch | DOING ||
| ViT | pytorch | DOING ||
| EVA-CLIP |pytorch | DOING ||
| EVA-Seg | pytorch | DOING ||
| EVA-Det | pytorch | DOING ||
<table width="960" border="0" cellpadding="0" cellspacing="0" style='width:960pt;border-collapse:collapse;table-layout:fixed;'>
<col width="73.60" style='mso-width-source:userset;mso-width-alt:3588;'/>
<col width="70" style='mso-width-source:userset;mso-width-alt:3413;'/>
<col width="200.75" style='mso-width-source:userset;mso-width-alt:9788;'/>
<col width="195.80" style='mso-width-source:userset;mso-width-alt:9547;'/>
<col width="185.40" style='mso-width-source:userset;mso-width-alt:9040;'/>
<tr height="16.80" class="xl65" style='height:16.80pt;'>
<td class="xl66" height="16.80" width="73.60" style='height:16.80pt;width:73.60pt;' x:str>Model</td>
<td class="xl66" width="70" style='width:70.00pt;' x:str>Framework</td>
<td class="xl67" width="581.95" colspan="3" style='width:581.95pt;border-right:none;border-bottom:none;' x:str>Vendor</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td height="16.80" colspan="2" style='height:16.80pt;mso-ignore:colspan;'></td>
<td class="xl65" x:str>Nvidia</td>
<td class="xl65" x:str>Kunlunxin</td>
<td class="xl65" x:str>Iluvatar</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="2" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>BERT</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="2" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>CPM</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>✅</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl70" height="33.60" rowspan="2" style='height:33.60pt;border-right:none;border-bottom:none;' x:str><a href="https://github.com/FlagOpen/FlagPerf/tree/main/training/benchmarks/glm" target="_parent">GLM</a></td>
<td class="xl71" x:str><a href="https://github.com/FlagOpen/FlagPerf/tree/main/training/benchmarks/glm/pytorch" target="_parent">PyTorch</a></td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>✅</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="50.40" rowspan="3" style='height:50.40pt;border-right:none;border-bottom:none;' x:str>ResNet50</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>TensorFlow2</td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="2" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>MobileNetV2</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>✅</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="2" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>ViT</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="2" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>EVA-CLIP</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>doing</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="2" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>EVA-Seg</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>doing</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl65" height="33.60" rowspan="2" style='height:33.60pt;border-right:none;border-bottom:none;' x:str>EVA-Det</td>
<td class="xl69" x:str>PyTorch</td>
<td class="xl69" x:str>doing</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
<tr height="16.80" style='height:16.80pt;'>
<td class="xl69" x:str>Paddle</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
<td class="xl69" x:str>N/A</td>
</tr>
</table>

### Deployment Instructions

Expand Down Expand Up @@ -95,7 +225,7 @@ __Tips:__
'''Test Configs, including'''
# -*-coding:utf-8 -*-

# Set accelerator's vendor name, e.g. iluvatar, cambricon and kunlun.
# Set accelerator's vendor name, e.g. iluvatar, cambricon and kunlunxin.
# We will run benchmarks in training/<vendor>
VENDOR = "nvidia"
# Accelerator options for docker. TODO FIXME support more accelerators.
@@ -260,7 +390,7 @@ nvidia_monitor.log rank1.out.log rank4.out.log rank7.out.log

### Contributing

This project is currently being built jointly by BAAI (Beijing Academy of Artificial Intelligence), Iluvatar, Baidu PaddlePaddle, and Kunlunxin
This project is currently being built jointly by BAAI (Beijing Academy of Artificial Intelligence), Kunlunxin, Iluvatar, and Baidu PaddlePaddle
Framework and chip teams, as well as individuals, are warmly invited to participate!
### Contact Us

3 changes: 1 addition & 2 deletions docs/dev/readme-templates/case-readme-template.md
@@ -1,8 +1,7 @@
# Case README template

> Document information:
>
> - Location: each case's README lives under training/benchmarks/<model>/ <Framework>
> - Location: each case's README lives under training/benchmarks/&lt;model&gt;/&lt;Framework&gt;
> - Language: Chinese README.md by default; an English README.en.md may be provided
> - Content: further processing/conversion requirements for datasets and model files, plus reference run configuration and run information on Nvidia GPUs

8 changes: 3 additions & 5 deletions docs/dev/readme-templates/model-readme-template.md
@@ -2,7 +2,7 @@

> Document information
>
> - Location: each model's README lives under training/benchmarks/<model>/
> - Location: each model's README lives under training/benchmarks/&lt;model&gt;/
> - Language: Chinese README.md by default; an English README.en.md may be provided
> - Content: description of the standard model, including model information (model details, original model code), how to download the test dataset and model checkpoint, and the currently supported frameworks

@@ -46,14 +46,12 @@

## 4. Framework and chip support

- FlagPerf currently provides a <Framework> implementation; see: *<link to the standard case README for the corresponding framework, with matching Chinese/English versions>*
- FlagPerf currently provides a &lt;Framework&gt; implementation; see: *<link to the standard case README for the corresponding framework, with matching Chinese/English versions>*
- Chips adapted for this model so far:

*// The table below is an example. Nvidia GPU is supported by default. For each chip, a link to the corresponding case README in the vendor directory may be added, at <vendor>/<model>-<framework>/README.md; if no README exists, simply mark ✅*
*// The table below is an example. Nvidia GPU is supported by default. For each chip, a link to the corresponding case README in the vendor directory may be added, at &lt;vendor&gt;/&lt;model&gt;-&lt;framework&gt;/README.md; if no README exists, simply mark ✅*

| | *Pytorch* | *Paddle* | *TensorFlow2* |
| ------------ | ---------------------------- | -------- | ------------- |
| *Nvidia GPU* | *✅* | *✅* | *N/A* |
| *Kunlun XPU* | *<link to Kunlun's case README>* | *N/A* | *N/A* |

- ##
36 changes: 18 additions & 18 deletions docs/dev/readme-templates/readme-layout.md
@@ -2,39 +2,39 @@

## README directory hierarchy

docs/dev/    main directory for specification documents
```bash

run_pretraining.example.py # run_pretraining template
docs/dev/    main directory for specification documents

├── readme-templates *# main directory of README templates*
│ ├── case-readme-template.md *# case README*
│ ├── model-readme-template.md *# model README*
│ ├── readme-layout.md *# README hierarchy*
│ ├── vendor-case-readme-template.md *# vendor case README*
│ └── vendor-readme-template.md *# vendor README*
└── specifications *# main directory of specification documents*
├── case-adatpion-spec.md # vendor case-adaptation spec
└── standard-case-spec.md # standard case spec
run_pretraining.example.py # run_pretraining template

├── readme-templates # main directory of README templates
│ ├── case-readme-template.md # case README
│ ├── model-readme-template.md # model README
│ ├── readme-layout.md # README hierarchy
│ ├── vendor-case-readme-template.md # vendor case README
│ └── vendor-readme-template.md # vendor README
└── specifications # main directory of specification documents
├── case-adatpion-spec.md # vendor case-adaptation spec
└── standard-case-spec.md # standard case spec
```


Repo README: in the repo root; usually no changes needed

Model README: under training/benchmark/<model>, one document per model
Model README: under training/benchmark/&lt;model&gt;, one document per model

Standard case README: under training/benchmark/<model>-<framework>, one document per case
Standard case README: under training/benchmark/&lt;model&gt;-&lt;framework&gt;, one document per case

Vendor README: under training/<vendor>/, one document per vendor, introducing the vendor and describing the software/hardware environment used to adapt FlagPerf test cases and the accelerator monitoring metrics collected
Vendor README: under training/&lt;vendor&gt;/, one document per vendor, introducing the vendor and describing the software/hardware environment used to adapt FlagPerf test cases and the accelerator monitoring metrics collected

Vendor case README: under training/<vendor>/<model>-<framework>, one document per vendor case
Vendor case README: under training/&lt;vendor&gt;/&lt;model&gt;-&lt;framework&gt;, one document per vendor case

```Bash
├── LICENSE.md
├── README.md # REPO README
├── README.md # REPO README
├── docs
│ └── dev
│ ├── case_adaptation.md # case adaptation spec
│ ├── case_creation.md # case creation spec
│ └── run_pretraining.example.py # run_pretraining template
├── flagperf_wechat.png
├── logo.png
2 changes: 1 addition & 1 deletion docs/dev/readme-templates/vendor-case-readme-template.md
@@ -2,7 +2,7 @@

> Focus on the vendor adaptation and how the case runs on machines with the vendor's chips.
>
> File path: under training/<vendor>/<model>-<framework>/
> File path: under training/&lt;vendor&gt;/&lt;model&gt;-&lt;framework&gt;/
>
> Language: Chinese README.md by default; an English README.en.md may be provided

8 changes: 4 additions & 4 deletions docs/dev/readme-templates/vendor-readme-template.md
@@ -1,6 +1,6 @@
# Vendor README template

- Location: each vendor's README lives under training/<vendor>/
- Location: each vendor's README lives under training/&lt;vendor&gt;/
- Language: Chinese README.md by default; an English README.en.md may be provided
- Purpose: introduce the vendor, describing the software/hardware environment used to adapt FlagPerf test cases and the accelerator monitoring metrics collected.

@@ -34,10 +34,10 @@
### Container image information

- Container build information
- Dockerfile path: <vendor>/docker_image/<framework>/Dockerfile
- Post-build software installation script: <vendor>/docker_image/<framework>/<framework>_install.sh
- Dockerfile path: &lt;vendor&gt;/docker_image/&lt;framework&gt;/Dockerfile
- Post-build software installation script: &lt;vendor&gt;/docker_image/&lt;framework&gt;/&lt;framework&gt;_install.sh
- *// software installation, directory creation, file copying, etc.*
- Core software information (e.g. CUDA; usually included in the container base image or in the installation script <framework>_install.sh)
- Core software information (e.g. CUDA; usually included in the container base image or in the installation script &lt;framework&gt;_install.sh)
- AI framework & version
- *// e.g. pip install torch_xmlir --pipsource <pipsource>*

Expand Down
@@ -34,6 +34,7 @@

def main() -> Tuple[Any, Any]:
global logger
global config

# init
init_helper = InitHelper(config)
@@ -141,8 +142,23 @@ def main() -> Tuple[Any, Any]:
config, state = main()
if not dist_pytorch.is_main_process():
sys.exit(0)

# write training info to the log
global_batch_size = dist_pytorch.global_batch_size(config)
finished_info = get_finished_info(start, state, config.do_train,
global_batch_size)
e2e_time = time.time() - now
if config.do_train:

# TODO build the statistics needed for training, including but not limited to: e2e_time, training_sequences_per_second,
# converged, final_accuracy, raw_train_time, init_time
training_perf = (dist_pytorch.global_batch_size(config) *
state.global_steps) / state.raw_train_time
finished_info = {
"e2e_time": e2e_time,
"training_sequences_per_second": training_perf,
"converged": state.converged,
"final_accuracy": state.eval_accuracy,
"raw_train_time": state.raw_train_time,
"init_time": state.init_time,
}
else:
finished_info = {"e2e_time": e2e_time}
logger.log(Event.FINISHED, message=finished_info, stacklevel=0)
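The diff above replaces the old `get_finished_info` helper with an inline dict built at the end of `main()`. A rough, self-contained sketch of that logic (the `TrainState` class and the numbers here are hypothetical stand-ins for illustration, not the repo's real classes):

```python
# Hypothetical stand-in for the run_pretraining training state; field names
# mirror those used in the diff above, values are made up for illustration.
class TrainState:
    global_steps = 1000      # optimizer steps completed
    raw_train_time = 50.0    # seconds spent purely in the training loop
    init_time = 5.0          # seconds spent in initialization
    converged = True
    eval_accuracy = 0.81


def build_finished_info(global_batch_size, state, e2e_time, do_train=True):
    """Assemble the FINISHED-event payload, mirroring the logic in the diff."""
    if not do_train:
        return {"e2e_time": e2e_time}
    # Throughput: samples processed per second of pure training time.
    training_perf = (global_batch_size * state.global_steps) / state.raw_train_time
    return {
        "e2e_time": e2e_time,
        "training_sequences_per_second": training_perf,
        "converged": state.converged,
        "final_accuracy": state.eval_accuracy,
        "raw_train_time": state.raw_train_time,
        "init_time": state.init_time,
    }


info = build_finished_info(32, TrainState(), e2e_time=60.0)
print(info["training_sequences_per_second"])  # 32 * 1000 / 50.0 = 640.0
```

Note the `global config` declaration added at the top of the diff: because `main()` reassigns the module-level `config`, Python would otherwise treat it as a new local variable and the code after `config, state = main()` would not see the updated object.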