
Add device arg & Ascend npu support #2536

Merged (8 commits) on Jun 24, 2024

Conversation

@MengqingCao (Contributor)

Related to #2513.

  • Add a device argument and remove the CUDA-specific hard-coding, so that other backends already supported by torch can be plugged in easily (a sketch of the pattern follows this list).
  • Keep the original GPU usage, so that existing code is not broken by large-scale changes.
  • Add Ascend NPU support for training and inference.
  • Add aishell s0 and whisper training scripts that run on the NPU (compared with the GPU training scripts, they change the hardware-information command, and pass the device argument to train.py and recognize.py).
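For readers skimming the thread, here is a minimal sketch of the device-handling pattern described above; the names are illustrative and this is not the PR's exact code:

```python
import torch

def resolve_device(device_str: str) -> torch.device:
    # device_str would come from the new --device argument: "cpu", "cuda", "npu", ...
    if "npu" in device_str:
        import torch_npu  # noqa: F401  # importing torch_npu registers the Ascend backend
    return torch.device(device_str)

device = resolve_device("cpu")             # args.device in practice
model = torch.nn.Linear(4, 2).to(device)   # instead of a hard-coded model.cuda()
x = torch.randn(1, 4, device=device)
y = model(x)
```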

@MengqingCao (Contributor, Author)

Please help review and suggest any changes, thanks! @xingchensong @Mddct

@xingchensong (Member)

Could you first post:

  1. A screenshot of a successful training log on NPU
  2. A screenshot of a successful inference log on NPU
  3. A screenshot of a successful training log on GPU, with the same code
  4. A screenshot of a successful inference log on GPU, with the same code

We don't have an NPU here, so it's hard for us to test.

Comment on lines 139 to 142
parser.add_argument('--device',
                    type=str,
                    default="cpu",
                    help='accelerator to use')
Member

Limit the allowed values with the choices option:
choices=["cpu", "npu", "cuda"], etc.

Contributor Author

done

Member

Did the code update not get pushed? I still see the old version.

@MengqingCao (Contributor Author, May 30, 2024)

The latest code has been pushed.

@MengqingCao (Contributor Author)

> Could you first post:
>
> 1. A screenshot of a successful training log on NPU
> 2. A screenshot of a successful inference log on NPU
> 3. A screenshot of a successful training log on GPU, with the same code
> 4. A screenshot of a successful inference log on GPU, with the same code
>
> We don't have an NPU here, so it's hard for us to test.

OK, I'll put those together. We are also applying for an NPU machine that can be used for community CI; we can contribute it to the community later, to help advance and maintain wenet on Ascend.

@MengqingCao (Contributor Author, May 28, 2024)

NPU

  • train: [log screenshot]
  • average model: [log screenshot]
  • inference: [log screenshot]

GPU

  • train: [log screenshot]
  • average model: [log screenshot]
  • inference: [log screenshot]

@MengqingCao (Contributor Author, May 28, 2024)

The screenshots of successful training and inference have been updated.
Both training and inference use a modified examples/aishell/s0/run.sh; the modifications are:

  1. The dataset path.
  2. Reads of GPU-related variables changed to their NPU equivalents (NPU side only).
  3. On NPU, device is set to "npu"; on GPU, the original usage is kept (device defaults to cuda).

try:
    import torch_npu  # noqa
    return True
except ImportError:
    print("Module \"torch_npu\" not found. \"pip install torch_npu\" \
Contributor Author

A check is added here to keep this message from being printed during GPU-side inference, where it would hurt the user experience.

[log screenshot]

Member

OK.

@Mddct (Collaborator, May 31, 2024)

The model's device does not need to be set up in __init__; when the model is initialized, it is always moved to a device with .to(device).

Any device needed in forward can be obtained from the input.
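A rough sketch of what this suggestion amounts to (illustrative module, not the PR's code): the module stores no device, and forward derives it from its input.

```python
import torch

class TinyEncoder(torch.nn.Module):  # hypothetical module
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)  # note: no self.device stored here

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        device = x.device  # derive the device from the input tensor
        mask = torch.ones(x.size(0), 1, device=device)
        return self.proj(x) * mask

model = TinyEncoder().to("cpu")  # init_model moves the whole module to a device once
y = model(torch.randn(2, 8))
```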

@robin1001 (Collaborator)

Do you have any benchmark results, such as training speed and inference speed? We could write an article together introducing this work.

@xingchensong (Member)

@MengqingCao Hi, please add me on WeChat (currycode); communicating here is a bit inefficient.

@MengqingCao (Contributor Author)

> Do you have any benchmark results, such as training speed and inference speed? We could write an article together introducing this work.

I'm still debugging that part. Our machine is ARM-based, and building openfst and srilm on it seems to be problematic, so there are no benchmark results yet. Once there are results, I'd be happy to write an article introducing this work :)

@MengqingCao (Contributor Author)

@robin1001 @Mddct @xingchensong Latest benchmark: attention decoding accuracy is on par, while CTC decoding accuracy shows some deviation. The 4-GPU results come from https://github.com/wenet-e2e/wenet/blob/main/examples/aishell/s0/README.md

Conformer Result
Feature info: using fbank feature, dither, cmvn, online speed perturb
Training info: lr 0.002, batch size 18 (gpu) / 16 (npu), 4-gpu / 1-npu, acc_grad 4, 240 epochs, dither 0.1
Decoding info: ctc_weight 0.5, average_num 20

| decoding mode          | CER 4-GPU | CER 1-NPU |
| ---------------------- | --------- | --------- |
| attention decoder      | 5.18      | 5.11      |
| ctc greedy search      | 4.94      | 5.36      |
| ctc prefix beam search | 4.94      | 5.37      |
| attention rescoring    | 4.61      | 4.79      |

@yushanyong

I happen to have an experiment result on Aishell-1 that may serve as a reference:

train config: examples/aishell/s0/conf/train_conformer.yaml (7ce2126)
Training info: batch size 18
Decoding info: ctc_weight 0.3, reverse_weight 0.5, average_num 30

[results screenshot]

@MengqingCao (Contributor Author)

> I happen to have an experiment result on Aishell-1 that may serve as a reference:
>
> train config: examples/aishell/s0/conf/train_conformer.yaml (7ce2126) Training info: batch size 18 Decoding info: ctc_weight 0.3, reverse_weight 0.5, average_num 30 [results screenshot]

Thanks for sharing! Did you run the training on this branch?

@yushanyong

> Thanks for sharing! Did you run the training on this branch?

Yes.

@@ -106,7 +109,8 @@ def train(self, model, optimizer, scheduler, train_data_loader,
                         "lrs":
                             [group['lr'] for group in optimizer.param_groups]
                     })
-                save_model(model, info_dict)
+                if self.step % 100 == 0:
+                    save_model(model, info_dict)
Member

This self.step % 100 check duplicates the one at line 94.

Comment on lines +225 to +228
if "cuda" in args.device:
torch.cuda.set_device(local_rank)
elif "npu" in args.device and TORCH_NPU_AVAILABLE:
torch.npu.set_device(local_rank)
Member

Add an else to handle the unexpected case.
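One possible shape for the requested else branch (a sketch that assumes the surrounding args/local_rank context; the exact exception and message are up to the authors):

```python
if "cuda" in args.device:
    torch.cuda.set_device(local_rank)
elif "npu" in args.device and TORCH_NPU_AVAILABLE:
    torch.npu.set_device(local_rank)
else:
    # Fail fast on an unsupported device string, or on "npu" without torch_npu installed.
    raise ValueError(f"Unsupported device '{args.device}' for distributed training")
```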

@xingchensong xingchensong merged commit dec409b into wenet-e2e:main Jun 24, 2024
6 checks passed
@285220927

Could you share your NPU inference code or script? When I run inference on the NPU, it is even slower than on the CPU.

@MengqingCao (Contributor Author)

> Could you share your NPU inference code or script? When I run inference on the NPU, it is even slower than on the CPU.

@285220927 I run inference directly with stage 5:
https://github.com/MengqingCao/wenet/blob/125cd5c8e71aa851193cc2bb15a52cbc934ea4a3/examples/aishell/s0/run_npu.sh#L189
Is your inference single-utterance or batched? For single-utterance inference, the NPU is indeed currently slower than the CPU; in that case, NPU initialization and host-device data copies account for a larger share of the time, while the NPU is better suited to large matrix operations.

@285220927 (Aug 2, 2024)

> @285220927 I run inference directly with stage 5: https://github.com/MengqingCao/wenet/blob/125cd5c8e71aa851193cc2bb15a52cbc934ea4a3/examples/aishell/s0/run_npu.sh#L189 Is your inference single-utterance or batched? ...

I'm using the same script, with batched inference and batch_size set to 16, and it is unbearably slow.
I've tried many models recently, and loading a model directly onto the NPU for inference is always very slow; only converting to ONNX and then to OM gives acceptable speed.
