Paper Title: CCC-wav2vec 2.0: Clustering aided Cross Contrastive Self-Supervised Learning of Speech Representations. Accepted at IEEE SLT 2022 (arXiv link).
ccc-wav2vec 2.0 is a pre-training method that uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. The clustering module scales down the influence of those negative examples that are highly similar to the positive. The cross-contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation, and vice versa, which brings robustness to the pre-training strategy.
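For intuition, here is a minimal PyTorch sketch of the cross-contrastive objective described above; the function names, tensor shapes, and loss weights are illustrative assumptions, not the repository's actual implementation.

```python
# Hedged sketch: an InfoNCE-style contrastive term plus the two cross terms
# between the original and augmented views. Shapes and names are assumptions.
import torch
import torch.nn.functional as F

def info_nce(c, q_pos, q_negs, temperature=0.1):
    """c: (B, T, D) encoder outputs; q_pos: (B, T, D) quantized targets;
    q_negs: (K, B, T, D) sampled distractors. The positive sits at index 0."""
    candidates = torch.cat([q_pos.unsqueeze(0), q_negs], dim=0)       # (K+1, B, T, D)
    logits = F.cosine_similarity(c.unsqueeze(0), candidates, dim=-1)  # (K+1, B, T)
    logits = (logits / temperature).permute(1, 2, 0).reshape(-1, q_negs.shape[0] + 1)
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

def ccc_objective(c_orig, q_orig, negs_orig, c_aug, q_aug, negs_aug,
                  alpha=1.0, beta=0.5, gamma=0.5):
    """alpha/beta/gamma mirror the paper's loss weights; values here are illustrative."""
    l_self = info_nce(c_orig, q_orig, negs_orig)    # standard wav2vec 2.0 term
    l_cross_oa = info_nce(c_orig, q_aug, negs_aug)  # original encoder vs. augmented targets
    l_cross_ao = info_nce(c_aug, q_orig, negs_orig) # augmented encoder vs. original targets
    return alpha * l_self + beta * l_cross_oa + gamma * l_cross_ao
```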
Primary Contributions:
- We introduce an augmentation of the original sample and use its representations to add an auxiliary Cross-Contrastive loss to the existing contrastive loss in wav2vec 2.0.
- We demonstrate the usefulness of a clustering module to segregate the negative examples and thereby control the effect of the weak non-informative negative examples in the contrastive learning task.
- Combining the above two modules leads to the development of ccc-wav2vec 2.0, a robust pre-training approach that consistently outperforms wav2vec 2.0 in tasks such as ASR, Domain Adaptation, and zero-shot decoding.
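Likewise, a minimal sketch of the clustering idea, assuming fast-pytorch-kmeans (installed below) and a simple down-weighting scheme; the names `cluster_factor` and `scale_factor` are borrowed from the pre-training configs, but their exact semantics here are assumptions.

```python
# Hedged sketch: cluster the quantized targets and scale down the contribution
# of negatives that share the positive's cluster. Hypothetical, not the repo's API.
import torch
from fast_pytorch_kmeans import KMeans

def cluster_negative_weights(q_pos, q_negs, cluster_factor=16, scale_factor=0.1):
    """q_pos: (N, D) positive targets; q_negs: (K, N, D) sampled negatives.
    Returns (K, N) weights to multiply into the negatives' similarity logits."""
    kmeans = KMeans(n_clusters=cluster_factor, mode="cosine")
    pos_labels = kmeans.fit_predict(q_pos)                            # (N,)
    neg_labels = kmeans.predict(q_negs.reshape(-1, q_negs.shape[-1]))
    neg_labels = neg_labels.view(q_negs.shape[0], -1)                 # (K, N)
    same_cluster = neg_labels == pos_labels.unsqueeze(0)              # weakly informative negatives
    weights = torch.ones_like(same_cluster, dtype=torch.float)
    weights[same_cluster] = scale_factor                              # scale down their influence
    return weights
```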
The ccc-wav2vec 2.0 BASE model pre-trained on LibriSpeech-960h has been evaluated on multiple downstream tasks from the SUPERB benchmark. The proposed method comprehensively outperforms the baseline wav2vec 2.0 BASE model across the downstream tasks in SUPERB.
The WERs reported below are without the use of any language model.
| Model | Pre-training data | Fine-tuning data | Model links | WER (test-clean / test-other) |
|---|---|---|---|---|
| wav2vec 2.0 Base | LibriSpeech-360h | No fine-tuning | fairseq, huggingface | --- |
| wav2vec 2.0 Base | LibriSpeech-360h | LibriSpeech-100h | fairseq, huggingface | 12.8 / 31.7 |
| ccc-wav2vec 2.0 Base | LibriSpeech-360h | No fine-tuning | fairseq, huggingface | --- |
| ccc-wav2vec 2.0 Base | LibriSpeech-360h | LibriSpeech-100h | fairseq, huggingface | 10.8 / 27.7 |
| ccc-wav2vec 2.0 Base | LibriSpeech-960h | No fine-tuning | fairseq, huggingface | --- |
| ccc-wav2vec 2.0 Base | LibriSpeech-960h | LibriSpeech-100h | fairseq, huggingface | 5.5 / 12.4 |
| ccc-wav2vec 2.0 Base SUPERB | LibriSpeech-960h | No fine-tuning | fairseq SUPERB model, huggingface SUPERB model | --- |
- Pre-training and fine-tuning procedures can be found here.
- PyTorch version >= 1.10.0
- Python version >= 3.8
- For training new models, you'll also need an NVIDIA GPU and NCCL
- To install fairseq with ccc-wav2vec 2.0 and develop locally:

```bash
git clone https://github.com/Speech-Lab-IITM/CCC-wav2vec-2.0
cd CCC-wav2vec-2.0
pip install --editable ./
```
- For faster training install NVIDIA's apex library:

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
```
- For large datasets install PyArrow:

```bash
pip install pyarrow
```

- If you use Docker, make sure to increase the shared memory size, either with `--ipc=host` or `--shm-size`, as command-line options to `nvidia-docker run`.
- For augmentations to work, install torchaudio-augmentations (a minimal usage sketch follows these installation steps):

```bash
git clone https://github.com/Speech-Lab-IITM/torchaudio-augmentations
cd torchaudio-augmentations
pip install --editable ./
```
- The clustering module runs on the GPU and requires fast-pytorch-kmeans:

```bash
pip install fast-pytorch-kmeans
```
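As referenced in the installation step above, here is a minimal usage sketch for torchaudio-augmentations; the transforms and parameter values shown are illustrative choices, not the exact augmentation chain used by ccc-wav2vec 2.0.

```python
# Hedged sketch of a waveform augmentation pipeline with torchaudio-augmentations;
# transforms and parameters are illustrative, not the repo's exact pipeline.
import torch
from torchaudio_augmentations import Compose, Gain, Noise, PitchShift, RandomApply

sample_rate, num_samples = 16000, 16000          # one second of 16 kHz audio
transform = Compose([
    RandomApply([Gain()], p=0.5),                               # random gain change
    RandomApply([Noise(min_snr=0.001, max_snr=0.005)], p=0.5),  # additive noise
    RandomApply([PitchShift(n_samples=num_samples, sample_rate=sample_rate)], p=0.3),
])
waveform = torch.randn(1, num_samples)           # stand-in for a real audio clip
augmented = transform(waveform)
```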
- The $\alpha$, $\beta$, and $\gamma$ values from the paper can be calibrated through the `cc_weights` parameter in the `criterion` section of the pre-training configs.
- The `cluster_factor` and `scale_factor` parameters can be modified in the `model` section of the pre-training configs (a hypothetical excerpt follows this list).
- The augmentations used by ccc-wav2vec 2.0 require the noise set of the MUSAN dataset. Its path must be specified in the `path_to_musan_noise_set` variable of the `__getitem__` method of the `raw_audio_dataset.py` file.
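As noted above, here is a hypothetical excerpt of a pre-training config showing where these parameters could live; the field layout follows fairseq's Hydra-style YAML configs, and all values are illustrative assumptions.

```yaml
# Hypothetical pre-training config excerpt; all values are illustrative.
criterion:
  _name: wav2vec
  cc_weights: [1.0, 0.5, 0.5]   # alpha, beta, gamma from the paper

model:
  _name: wav2vec2
  cluster_factor: 16            # assumed: number of k-means clusters
  scale_factor: 0.1             # assumed: down-weighting of same-cluster negatives
```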
- This repository is built on top of fairseq, Facebook AI Research's sequence-to-sequence toolkit written in Python.