model: models/ECAPA_lip/ecapa_tdnn_lip.py
loss: losses/angleproto.py
- E. Z. Xu, Z. Song, C. Feng, M. Ye, and M. Z. Shou, “AVA-AVD: Audio-visual speaker diarization in the wild,” CoRR, vol. abs/2111.14448, 2021.
- R. Gao and K. Grauman, “VisualVoice: Audio-visual speech separation with cross-modal consistency,” in Proc. CVPR, 2021, pp. 15495–15505.
- R. Tao, Z. Pan, R. K. Das, X. Qian, M. Z. Shou, and H. Li, “Is someone speaking?: Exploring long-term temporal features for audio-visual active speaker detection,” in Proc. ACM Multimedia, 2021, pp. 3927–3935.
- S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” CoRR, vol. abs/2110.13900, 2021.