- Demonstration of VAD + ASR + Speaker Confirmation on an Android device.
- The demo models have been uploaded to Google Drive: https://drive.google.com/drive/folders/1ErEdY6QMyJCW0yuQR03If905IhdyAHFw?usp=drive_link
- Baidu: https://pan.baidu.com/s/1Si-4ebtqm2HA9omxqHCMuQ?pwd=dake (extraction code: dake)
- After downloading, place the model files into the assets folder.
- Remember to decompress the *.so zip file stored in the libs/arm64-v8a folder.
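
A minimal sketch (Kotlin, using the onnxruntime-android Java API) of loading one of the demo models from the assets folder; the file name "FSMN-VAD.onnx" and the thread count are assumptions, not this project's actual values:

```kotlin
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import android.content.Context

fun createSession(context: Context): OrtSession {
    val env = OrtEnvironment.getEnvironment()
    // Read the model bytes straight out of the assets folder.
    val modelBytes = context.assets.open("FSMN-VAD.onnx").readBytes()  // assumed file name
    val options = OrtSession.SessionOptions().apply {
        setIntraOpNumThreads(4)  // assumed tuning; adjust for your device
    }
    return env.createSession(modelBytes, options)
}
```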
- The demo models, named 'FSMN-VAD', 'Paraformer', and 'ERes2Net', were converted from ModelScope and optimized in code for maximum execution speed.
- As a result, the inputs and outputs of the demo models differ slightly from the originals.
- To better adapt to ONNX Runtime on Android, the models were exported without dynamic axes; the exported ONNX models may therefore not be optimal for x86_64 systems.
- Due to the model's limitations, the English version currently does not perform well.
- We will publish the export method later.
- On first launch, you must grant the recording permission and then relaunch the app.
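
The standard AndroidX runtime-permission flow covers this step; a minimal sketch (the request code 1 is arbitrary, and the manifest must also declare android.permission.RECORD_AUDIO):

```kotlin
import android.Manifest
import android.content.pm.PackageManager
import androidx.appcompat.app.AppCompatActivity
import androidx.core.app.ActivityCompat
import androidx.core.content.ContextCompat

fun AppCompatActivity.ensureRecordPermission() {
    // Ask for the microphone permission if it has not been granted yet.
    if (ContextCompat.checkSelfPermission(this, Manifest.permission.RECORD_AUDIO)
        != PackageManager.PERMISSION_GRANTED
    ) {
        ActivityCompat.requestPermissions(
            this, arrayOf(Manifest.permission.RECORD_AUDIO), 1  // arbitrary request code
        )
    }
}
```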
- ASR inference takes about 25 ms, speaker confirmation about 25 ms, and VAD about 2 ms.
- In the demo, the monitoring frequency defaults to 16 FPS (about 60 ms per round), so the roughly 52 ms of combined inference fits within each round; this lets an offline ASR model approximate the responsiveness of an online streaming one while keeping the offline model's accuracy. A sketch of this loop follows.
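
A minimal sketch of such a ~60 ms monitoring round, assuming 16 kHz mono PCM input; runVad and runAsr are placeholders for the actual ONNX Runtime inference calls, not this project's real API:

```kotlin
import android.annotation.SuppressLint
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

const val SAMPLE_RATE = 16_000
const val FRAME_SAMPLES = SAMPLE_RATE * 60 / 1000  // 960 samples = one 60 ms round

@SuppressLint("MissingPermission")  // RECORD_AUDIO must already be granted
fun monitorLoop(runVad: (ShortArray) -> Boolean, runAsr: (ShortArray) -> String) {
    val minBuf = AudioRecord.getMinBufferSize(
        SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
    )
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.MIC, SAMPLE_RATE,
        AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT,
        maxOf(minBuf, FRAME_SAMPLES * 2)  // buffer size in bytes
    )
    recorder.startRecording()
    val frame = ShortArray(FRAME_SAMPLES)
    while (true) {                 // one iteration per ~60 ms round
        recorder.read(frame, 0, frame.size)
        if (runVad(frame)) {       // speech detected -> run the offline ASR
            val text = runAsr(frame)
            // ... hand `text` to wake-word / command handling
        }
    }
}
```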
- You can set an awakening word of your choice; the system matches it with fuzzy tone matching (Chinese awakening words only) to minimize the effect of natural speech variations. After waking, it stays active for 30 seconds by default (adjustable).
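
How the fuzzy matching works internally is not documented here; one plausible sketch is an edit-distance comparison over pinyin renderings, so near-homophones still trigger. toPinyin() stands in for a real Chinese-to-pinyin converter, and the tolerance of 2 is an assumption:

```kotlin
// Classic Levenshtein edit distance between two strings.
fun editDistance(a: String, b: String): Int {
    val dp = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 0..a.length) dp[i][0] = i
    for (j in 0..b.length) dp[0][j] = j
    for (i in 1..a.length) for (j in 1..b.length) {
        val cost = if (a[i - 1] == b[j - 1]) 0 else 1
        dp[i][j] = minOf(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    }
    return dp[a.length][b.length]
}

// Treat the ASR output as the wake word if its pinyin is "close enough".
fun isWakeWord(asrText: String, wakeWord: String, toPinyin: (String) -> String): Boolean =
    editDistance(toPinyin(asrText), toPinyin(wakeWord)) <= 2  // assumed tolerance
```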
- You can issue simple commands directly without the awakening word (wake-free mode), for example: Open xxx, Close xxx, Navigate to xxx, Play someone's songs, and so on.
- If a sentence contains conjunctions such as 'and' (和, 还有, 然后) or other common continuation words, the system performs multi-intent judgment, as sketched below.
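
A minimal sketch of that splitting step: break the recognized sentence on common Chinese conjunctions and treat each clause as its own command. The conjunction list and handleCommand are illustrative assumptions:

```kotlin
// Conjunctions that separate independent intents in one utterance.
val conjunctions = listOf("和", "还有", "然后", "以及")

fun dispatchIntents(sentence: String, handleCommand: (String) -> Unit) {
    var clauses = listOf(sentence)
    for (c in conjunctions) {
        clauses = clauses.flatMap { it.split(c) }  // split every clause on this conjunction
    }
    clauses.map { it.trim() }
        .filter { it.isNotEmpty() }
        .forEach(handleCommand)  // handle each clause as a separate command
}
```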
- Simply say keywords such as 'adding voice' or 'adding permission', and the system will register your voice as authorized. The same applies to 'deleting permission'.
- Once permission is added, only the voice's owner can modify it; the system will accept commands only from the authorized voice.
- The success rate of speaker verification is not guaranteed. For more information, please refer to the ERes2Net model introduction.
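
ERes2Net is a speaker-embedding model, and speaker verification is conventionally done by comparing embeddings with cosine similarity; a minimal sketch under that assumption (the 0.6 threshold is a guess to be tuned, not the demo's actual value):

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two speaker embeddings of equal length.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]
    }
    return dot / (sqrt(na) * sqrt(nb))
}

// Accept a command only if the incoming voice matches the enrolled one.
fun isAuthorizedSpeaker(enrolled: FloatArray, incoming: FloatArray): Boolean =
    cosine(enrolled, incoming) >= 0.6f  // assumed threshold
```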
- The model quantization method can be found in the "Do_Quantize" folder.
- The q4 (uint4) quantization method is not currently recommended because ONNX Runtime's "MatMulNBits" operator performs poorly.
- See more projects: https://dakeqq.github.io/overview/
- This GIF was generated at 7 FPS; therefore, the ASR output may not look smooth.