DyViSE

model: models/ECAPA_lip/ecapa_tdnn_lip.py

loss: losses/angleproto.py
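
The file name suggests an angular prototypical loss; the repository file itself is the authority on what is actually implemented. As a rough, dependency-free illustration of that general technique only, the sketch below assumes each speaker contributes one query utterance and one or more support utterances, and that the similarity scale `w` and bias `b` are fixed scalars (in trainable implementations they are usually learned parameters). The function name `angular_prototypical_loss`, the helper `cosine`, and the default values are hypothetical and not taken from this repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def angular_prototypical_loss(embeddings, w=10.0, b=-5.0):
    """Illustrative angular prototypical loss (assumed formulation).

    embeddings[s][m] is the embedding of utterance m of speaker s.
    Utterance 0 is the query; the remaining utterances form the support set.
    w and b are fixed here for simplicity (assumption).
    """
    queries = [spk[0] for spk in embeddings]

    # Centroid (prototype) of each speaker's support utterances.
    centroids = []
    for spk in embeddings:
        support = spk[1:]
        dim = len(support[0])
        centroids.append(
            [sum(vec[d] for vec in support) / len(support) for d in range(dim)]
        )

    # Cross-entropy over scaled cosine similarities; the correct class for
    # query j is centroid j.
    total = 0.0
    for j, query in enumerate(queries):
        logits = [w * cosine(query, c) + b for c in centroids]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[j]
    return total / len(queries)
```

When each query lies close to its own speaker's centroid the loss approaches zero, and it grows when queries are closer to another speaker's centroid.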

References

E. Z. Xu, Z. Song, C. Feng, M. Ye, and M. Z. Shou, “AVA-AVD: Audio-visual speaker diarization in the wild,” CoRR, vol. abs/2111.14448, 2021.
R. Gao and K. Grauman, “VisualVoice: Audio-visual speech separation with cross-modal consistency,” in Proc. CVPR, 2021, pp. 15495–15505.
R. Tao, Z. Pan, R. K. Das, X. Qian, M. Z. Shou, and H. Li, “Is someone speaking?: Exploring long-term temporal features for audio-visual active speaker detection,” in Proc. ACM Multimedia, 2021, pp. 3927–3935.
S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” CoRR, vol. abs/2110.13900, 2021.