This resource list is maintained by Ramon Sanabria, Edinburgh NLP and The Centre for Speech Technology Research, The University of Edinburgh. Formerly at the Language Technologies Institute and Robotics Institute, CMU.
This list is probably biased towards my current research directions, so if anything is missing, please let me know. Suggestions are super welcome :)
- Kamper H, Jansen A, Goldwater S. Fully unsupervised small-vocabulary speech recognition using a segmental Bayesian model. INTERSPEECH 2015
- Kamper H, Jansen A, King S, Goldwater S. Unsupervised lexical clustering of speech segments using fixed-dimensional acoustic embeddings. SLT 2014
- van den Oord A, Li Y, Vinyals O. Representation learning with contrastive predictive coding. arXiv 2018
- Chung YA, Hsu WN, Tang H, Glass J. An unsupervised autoregressive model for speech representation learning. INTERSPEECH 2019
- Pascual S, Ravanelli M, Serrà J, Bonafonte A, Bengio Y. Learning problem-agnostic speech representations from multiple self-supervised tasks. INTERSPEECH 2019
- Klejch O, Fainberg J, Bell P, Renals S. Speaker Adaptive Training using Model Agnostic Meta-Learning. ASRU 2019
- Klejch O, Fainberg J, Bell P. Learning to adapt: a meta-learning approach for speaker adaptation. INTERSPEECH 2018
- Wiesner M, Renduchintala A, Watanabe S, Liu C, Dehak N, Khudanpur S. Pretraining by Backtranslation for End-to-end ASR in Low-Resource Settings. INTERSPEECH 2019
- Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, Le QV. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. INTERSPEECH 2019
- Baskar MK, Burget L, Watanabe S, Karafiát M, Hori T, Černocký JH. Promising Accurate Prefix Boosting for sequence-to-sequence ASR. ICASSP 2019
- Collobert R, Hannun A, Synnaeve G. A fully differentiable beam search decoder. arXiv 2019
- Zeyer A, Bahar P, Irie K, Schlüter R, Ney H. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. ASRU 2019
- Dong L, Xu S, Xu B. Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition. ICASSP 2018
- Sperber M, Niehues J, Neubig G, Stüker S, Waibel A. Self-Attentional Acoustic Models. INTERSPEECH 2018
- Zhou S, Dong L, Xu S, Xu B. Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese. INTERSPEECH 2018
- Prasad M, van Esch D, Ritchie S, Mortensen JF. Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data. INTERSPEECH 2019
- Menon R, Kamper H, van der Westhuizen E, Quinn J, Niesler T. Feature exploration for almost zero-resource ASR-free keyword spotting using a multilingual bottleneck extractor and correspondence autoencoders. INTERSPEECH 2019
- Moriya Y, Jones GJ. Multimodal Speaker Adaptation of Acoustic Model and Language Model for ASR Using Speaker Face Embedding. ICASSP 2019
- Caglayan O, Sanabria R, Palaskar S, Barrault L, Metze F. Multimodal Grounding for Sequence-to-sequence Speech Recognition. ICASSP 2019
- Palaskar S, Sanabria R, Metze F. End-to-end multimodal speech recognition. ICASSP 2018
- Gupta A, Miao Y, Neves L, Metze F. Visual features for context-aware speech recognition. ICASSP 2017
- Sun F, Harwath D, Glass J. Look, Listen, and Decode: Multimodal Speech Recognition with Images. SLT 2016
- Pasad A, et al. On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval. INTERSPEECH 2019
- Owens A, Wu J, McDermott JH, Freeman WT, Torralba A. Learning sight from sound: Ambient sound provides supervision for visual learning. IJCV 2018
- Kamper H, Settle S, Shakhnarovich G, Livescu K. Visually grounded learning of keyword prediction from untranscribed speech. INTERSPEECH 2017
- Shi H, Mao J, Gimpel K, Livescu K. Visually Grounded Neural Syntax Acquisition. ACL 2019
- Inaguma H, Duh K, Kawahara T, Watanabe S. Multilingual End-to-End Speech Translation. arXiv 2019
- Bansal S, Kamper H, Livescu K, Lopez A, Goldwater S. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. NAACL 2019
- Chung YA, Weng WH, Tong S, Glass J. Towards unsupervised speech-to-text translation. ICASSP 2019
- Gu J, Wang Y, Chen Y, Cho K, Li VO. Meta-learning for low-resource neural machine translation. EMNLP 2018
- Caglayan O, Madhyastha P, Specia L, Barrault L. Probing the Need for Visual Context in Multimodal Machine Translation. NAACL 2019
- Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. ICML 2017
- Kawakami K, Dyer C, Blunsom P. Learning to Discover, Ground and Use Words with Segmental Neural Language Models. ACL 2019
- Kawakami K, Dyer C, Blunsom P. Unsupervised Word Discovery with Segmental Neural Language Models. arXiv 2018
- Mahajan D, Girshick R, Ramanathan V, He K, Paluri M, Li Y, Bharambe A, van der Maaten L. Exploring the limits of weakly supervised pretraining. ECCV 2018
- Patterson G, Hays J. COCO Attributes: Attributes for People, Animals, and Objects. ECCV 2016
- Black AW. CMU Wilderness Multilingual Speech Dataset. ICASSP 2019
- Boito MZ, Havard WN, Garnerin M, Ferrand ÉL, Besacier L. MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible. arXiv 2019
- Godard P, Adda G, Adda-Decker M, Benjumea J, Besacier L, Cooper-Leavitt J, Kouarata GN, Lamel L, Maynard H, Müller M, Rialland A. A very low resource language speech corpus for computational language documentation experiments. LREC 2018
- Di Gangi MA, Cattoni R, Bentivogli L, Negri M, Turchi M. MuST-C: a Multilingual Speech Translation Corpus. NAACL 2019
- Salesky E, Burger S, Niehues J, Waibel A. Towards fluent translations from disfluent speech. SLT 2018
- Post M, Kumar G, Lopez A, Karakos D, Callison-Burch C, Khudanpur S. Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus. IWSLT 2013
- Guzmán F, Chen PJ, Ott M, Pino J, Lample G, Koehn P, Chaudhary V, Ranzato MA. Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. EMNLP 2019
- Sanabria R, Caglayan O, Palaskar S, Elliott D, Barrault L, Specia L, Metze F. How2: a large-scale dataset for multimodal language understanding. NeurIPS 2018 Workshop
- Great resource from @josh_meyer here
- Great collection of video datasets here