【metax】First PR & faster_rcnn project (FlagOpen#402)
* update readme

* add company info

* faster_rcnn update & first PR

* fix readme

* add config 1x8 bs=16

* fix typo A100->C500

* remove torchvision in requirements.txt

* update readme

* update fasterrcnn readme

* update

* add 2x8 info & add bandwidth

* fix typo

* delete history

* update info

* update info

* update table

* delete history

* add info in test-conf

* fix typo

* delete history

* fix env bug & add mx tf32 env

* update requirements

* fix bug

---------

Co-authored-by: Shengchu Zhao <shengchu.zhao@metax-tech.com>
fred1912 and Shengchu Zhao authored Jan 25, 2024
1 parent fd5e799 commit 9419bde
Showing 16 changed files with 465 additions and 2 deletions.
5 changes: 5 additions & 0 deletions training/benchmarks/driver/helper.py
@@ -83,3 +83,8 @@ def set_seed(self, seed: int, vendor: str = None):
        else:
            # TODO: other vendors set their seed here; extend as needed
            pass

        if os.environ.get("METAX_USE_TF32"):
            import torch
            torch.backends.cuda.matmul.allow_tf32 = True
            torch.backends.cudnn.allow_tf32 = True
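Review note: `os.environ.get("METAX_USE_TF32")` is truthy for any non-empty string, so `METAX_USE_TF32=0` would still enable TF32. A minimal sketch of stricter parsing follows; `env_flag` and `_TRUE_VALUES` are hypothetical names for illustration and are not part of FlagPerf.

```python
import os

# Hypothetical helper (not part of FlagPerf): interpret an environment
# variable as a boolean flag instead of relying on string truthiness.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Return True only when the variable holds an explicit 'true' value."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


# Plain truthiness treats "0" as enabled; env_flag does not.
os.environ["METAX_USE_TF32"] = "0"
print(bool(os.environ.get("METAX_USE_TF32")))  # True
print(env_flag("METAX_USE_TF32"))              # False
```

With the exported value `METAX_USE_TF32=1` used in this commit, both checks agree; the difference only matters if someone tries to disable the flag by setting it to `0`.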
@@ -0,0 +1,5 @@
# =================================================
# Export variables
# =================================================

export METAX_USE_TF32=1
6 changes: 6 additions & 0 deletions training/kunlunxin/faster_rcnn-pytorch/requirements.txt
@@ -0,0 +1,6 @@
/root/.cache/torch/hub/checkpoints/torchvision-0.15.1+mc2.19.0.2-cp38-cp38-linux_x86_64.whl
/root/.cache/torch/hub/checkpoints/torch-2.0.0+gite544b36-cp38-cp38-linux_x86_64.whl
pycocotools
numpy
tqdm
schedule
70 changes: 70 additions & 0 deletions training/metax/README.md
@@ -0,0 +1,70 @@
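# Vendor Information

Website: https://www.metax-tech.com/

MetaX Integrated Circuits (Shanghai) Co., Ltd. was founded in Shanghai in September 2020 and has established wholly owned subsidiaries and R&D centers in Beijing, Nanjing, Chengdu, Hangzhou, Shenzhen, Wuhan, Changsha, and other cities. MetaX has a team with comprehensive technical expertise and extensive design and commercialization experience. Its core members average nearly 20 years of end-to-end R&D experience with high-performance GPU products and have led the development and mass production of more than ten mainstream high-performance GPU products, covering the full flow from GPU architecture definition, GPU IP design, and GPU SoC design to mass-production delivery of GPU system solutions.

MetaX is committed to providing full-stack GPU chips and solutions for heterogeneous computing. These can be widely applied in frontier fields such as intelligent computing, smart cities, cloud computing, autonomous driving, digital twins, and the metaverse, providing strong computing power for the development of the digital economy.

MetaX builds full-stack GPU chip products: the 曦思®N series for intelligent-computing inference, the 曦云®C series for general-purpose computing, and the 曦彩®G series for graphics rendering, meeting the demand for computing power that is both energy-efficient and general-purpose. All MetaX products use GPU IP developed entirely in-house, with an instruction set and architecture covered by fully independent intellectual property, together with a complete software stack (MXMACA®) compatible with the mainstream GPU ecosystem. This gives them inherent advantages in energy efficiency and generality and lets MetaX offer customers an integrated hardware-software ecosystem solution, serving as a computing foundation for building the digital economy and for the digital and intelligent transformation of industry under the "dual carbon" goals.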



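# FlagPerf Adaptation and Validation Environment
## Environment Configuration Reference
- Hardware
  - Machine model: 同泰怡 G658V3
  - Accelerator model: 曦云®C500 64G
  - Multi-node network type and bandwidth: InfiniBand, 2x200 Gb/s
- Software
  - OS version: Ubuntu 20.04.6
  - OS kernel version: 5.4.0-26-generic
  - Accelerator driver version: 2.18.0.8
  - VBIOS: 1.0.102.0
  - Docker version: 24.0.7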


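## Container Image Information
- Container build information
  - Dockerfile path: metax/docker_image/pytorch_2.0/Dockerfile
  - Post-build software installation script: metax/docker_image/pytorch_2.0/pytorch_install.sh

- Core software information
  - AI framework and related versions:
    - torch: pytorch-2.0-mc
    - torchvision: torchvision-0.15-mc
    - maca: 2.18.0.8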


## Accelerator Monitoring and Collection
- Command for collecting accelerator usage information

```shell
mx_smi
```
- Example of monitored items:

      +---------------------------------------------------------------------------------+
      | MX-SMI 2.0.12                        Kernel Mode Driver Version: 2.2.0          |
      | MACA Version: 2.0                    BIOS Version: 1.0.102.0                    |
      |------------------------------------+---------------------+----------------------+
      | GPU        NAME                    | Bus-id              | GPU-Util             |
      | Temp       Power                   | Memory-Usage        |                      |
      |====================================+=====================+======================|
      | 0          MXC500                  | 0000:1b:00.0        | 0%                   |
      | 35C        56W                     | 914/65536 MiB       |                      |
      +------------------------------------+---------------------+----------------------+


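- Description of collected accelerator usage items

| Monitored item | Log file | Format |
|---|---|---|
| Temperature | mx_monitor.log | xxx C |
| Power consumption | mx_monitor.log | xxx W |
| Device memory used | mx_monitor.log | xxx MiB |
| Total device memory | mx_monitor.log | xxx MiB |
| Device memory utilization | mx_monitor.log | xxx % |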

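As a rough illustration of how the monitored items map onto the sample output, the sketch below parses the temperature, power, and memory row of one GPU entry and derives memory utilization. The regular expression and the `parse_usage_row` helper are assumptions made for this example; the real monitor script and the exact `mx_smi` output layout may differ.

```python
import re

# Hypothetical parser for the second row of a GPU entry in the sample output,
# e.g. "| 35C  56W  | 914/65536 MiB |  |". Real mx_smi output may differ.
ROW_RE = re.compile(
    r"(?P<temp>\d+)\s*C\s+(?P<power>\d+)\s*W\s*\|"
    r"\s*(?P<used>\d+)\s*/\s*(?P<total>\d+)\s*MiB"
)


def parse_usage_row(line: str) -> dict:
    """Extract temperature, power, and memory figures from one table row."""
    m = ROW_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    used, total = int(m["used"]), int(m["total"])
    return {
        "temp_c": int(m["temp"]),
        "power_w": int(m["power"]),
        "mem_used_mib": used,
        "mem_total_mib": total,
        # Utilization is derived, not read from the table.
        "mem_util_pct": round(100.0 * used / total, 2),
    }


row = parse_usage_row("| 35C        56W             | 914/65536 MiB       |           |")
print(row)
```

Deriving the utilization from used and total memory keeps the five logged items consistent with each other.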


3 changes: 3 additions & 0 deletions training/metax/docker_image/pytorch_2.0/Dockerfile
@@ -0,0 +1,3 @@
FROM maca-2.18.0.8-ubuntu18.04-amd64:FlagPerf-base-v1
ENV PATH="/opt/conda/bin:${PATH}"
RUN /bin/bash -c "uname -a"
5 changes: 5 additions & 0 deletions training/metax/docker_image/pytorch_2.0/README.md
@@ -0,0 +1,5 @@
# The following software packages must be obtained by contacting MetaX

> Contact email: shengchu.zhao@metax-tech.com
docker image: maca-2.18.0.8-ubuntu18.04-amd64
1 change: 1 addition & 0 deletions training/metax/docker_image/pytorch_2.0/pytorch_install.sh
@@ -0,0 +1 @@
#!/bin/bash
59 changes: 59 additions & 0 deletions training/metax/faster_rcnn-pytorch/README.md
@@ -0,0 +1,59 @@
### Downloading the Model Backbone Weights
[Download the model backbone weights](../../benchmarks/faster_rcnn)

The path is specified in FlagPerf/training/benchmarks/faster_rcnn/pytorch/model/\_\_init__.py:

```python
torchvision.models.resnet.ResNet50_Weights.IMAGENET1K_V1.value.url = 'https://download.pytorch.org/models/resnet50-0676ba61.pth'
```
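By default, this case automatically downloads the backbone weights from the same path on the official site (0676ba61). To specify the weights manually, download them to a path that is mounted inside the container, then change the path here to "file://"+download_path.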

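A minimal sketch of building such a `file://` URL is shown below. The directory `/mnt/weights` is a hypothetical mount point chosen for illustration; only the checkpoint file name comes from the URL above.

```python
from pathlib import PurePosixPath

# Hypothetical location of a manually downloaded checkpoint inside the container.
download_path = PurePosixPath("/mnt/weights/resnet50-0676ba61.pth")

# as_uri() requires an absolute path and yields the same string as
# "file://" + download_path for a plain ASCII path like this one.
weights_url = download_path.as_uri()
print(weights_url)  # file:///mnt/weights/resnet50-0676ba61.pth
```

The resulting string can then be assigned in place of the `https://download.pytorch.org/...` URL shown above.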
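### Downloading the Test Dataset

[Download the test dataset](https://cocodataset.org/)

### MetaX C500 GPU Configuration and Run Information Reference
#### Environment Configuration
- ##### Hardware environment
  - Machine and accelerator model: 曦云®C500 64G
  - Multi-node network type and bandwidth: InfiniBand, 2x200 Gb/s

- ##### Software environment
  - OS version: Ubuntu 20.04.6
  - OS kernel version: 5.4.0-26-generic
  - Accelerator driver version: 2.2.0
  - Docker version: 24.0.7
  - Training framework version: pytorch-2.0.0+mc2.18.0.8-cp38-cp38-linux_x86_64.whl
  - Dependency versions: none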




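* General metrics

| Metric | Value | Notes |
| -------------- | ----------------------- | ------------------------------------------- |
| Task category | Image object detection | |
| Model | fasterRCNN | |
| Dataset | coco2017 | |
| Data precision | precision, see "Performance metrics" | fp32/amp/fp16 available |
| Hyperparameter changes | fix_hp, see "Performance metrics" | Special hyperparameters needed to saturate the hardware when measuring throughput |
| Hardware short name | MXC500 | |
| Device memory usage | mem, see "Performance metrics" | Commonly called "video memory"; unit: GiB |
| End-to-end time | e2e_time, see "Performance metrics" | Total time plus Perf initialization and similar overhead |
| Overall throughput | p_whole, see "Performance metrics" | Number of images actually trained divided by total time (performance_whole) |
| Training throughput | p_train, see "Performance metrics" | Excludes the evaluation time at the end of each epoch |
| **Compute throughput** | **p_core, see "Performance metrics"** | Excludes data IO time (p3>p2>p1) |
| Training result | map, see "Performance metrics" | Reported as mean object-detection accuracy |
| Additional modifications | | |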

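The relationship between the three throughput metrics can be sketched as follows. This is an illustration under assumptions, not FlagPerf's actual timing code: the `RunTimings` fields and all numbers are made up, and the sketch simply applies the definitions from the table (p_train excludes evaluation time, p_core additionally excludes data IO time).

```python
from dataclasses import dataclass


@dataclass
class RunTimings:
    """Hypothetical timing record; field names are not taken from FlagPerf."""
    images_trained: int  # images actually consumed by training
    total_s: float       # whole run: training + evaluation + data IO
    eval_s: float        # time spent in end-of-epoch evaluation
    data_io_s: float     # time spent on data IO during training


def throughputs(t: RunTimings) -> dict:
    """Compute images/second for each metric by shrinking the denominator."""
    train_s = t.total_s - t.eval_s
    core_s = train_s - t.data_io_s
    return {
        "p_whole": t.images_trained / t.total_s,
        "p_train": t.images_trained / train_s,
        "p_core": t.images_trained / core_s,
    }


# Made-up numbers, for illustration only.
m = throughputs(RunTimings(images_trained=120_000, total_s=1000.0,
                           eval_s=200.0, data_io_s=50.0))
print(m)
```

Because each metric divides the same image count by a smaller time window, p_core > p_train > p_whole whenever evaluation and data IO take non-zero time, which matches the ordering noted in the table.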

* Performance metrics

| Configuration | precision | fix_hp | e2e_time | p_whole | p_train | p_core | map | mem |
| --------------------- | --------- | ------------ | -------- | ------- | ------- | ------ | --- | --- |
| MXC500 single node, 8 cards (1x8) | fp32 | / | | | | | | 9.9/64 |
| MXC500 single node, 8 cards (1x8) | fp32 | bs=16,lr=0.16 | | | | | 36.7% | 44.5/64 |
| MXC500 single node, single card (1x1) | fp32 | / | / | | | | | 31.8/64 |
| MXC500 two nodes, 8 cards each (2x8) | fp32 | / | / | | | | | 44.3/64 |

4 changes: 4 additions & 0 deletions training/metax/faster_rcnn-pytorch/config/config_C500x1x1.py
@@ -0,0 +1,4 @@
vendor: str = "metax"
train_batch_size = 16
eval_batch_size = 16
lr = 0.16
3 changes: 3 additions & 0 deletions training/metax/faster_rcnn-pytorch/config/config_C500x1x8.py
@@ -0,0 +1,3 @@
vendor: str = "metax"
train_batch_size = 2
eval_batch_size = 2
4 changes: 4 additions & 0 deletions training/metax/faster_rcnn-pytorch/config/config_C500x2x8.py
@@ -0,0 +1,4 @@
vendor: str = "metax"
train_batch_size = 16
eval_batch_size = 16
lr = 0.08
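Review note: when comparing the three config files, it may help to look at the effective global batch size. The sketch below assumes that `train_batch_size` is a per-device value; that is an assumption and has not been checked against FlagPerf's data loader. The `global_batch_size` helper is hypothetical.

```python
# Assumption: train_batch_size is per device. FlagPerf itself is not consulted here.
def global_batch_size(per_device_bs: int, nodes: int, devices_per_node: int) -> int:
    """Effective batch size across all devices in a data-parallel run."""
    return per_device_bs * nodes * devices_per_node


# (train_batch_size, nodes, devices per node) as read from the config file names.
configs = {
    "C500x1x1": (16, 1, 1),
    "C500x1x8": (2, 1, 8),
    "C500x2x8": (16, 2, 8),
}

sizes = {name: global_batch_size(*args) for name, args in configs.items()}
print(sizes)
```

Under that assumption the 1x1 and 1x8 configs train with the same global batch size, while the 2x8 config uses a much larger one.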
3 changes: 3 additions & 0 deletions training/metax/faster_rcnn-pytorch/config/requirements.txt
@@ -0,0 +1,3 @@
pycocotools
numpy
tqdm
Empty file.