I have a few concerns and suggestions regarding the current training setup:
- On-the-fly Feature Extraction: MERT, mHuBERT, and DCAE are all frozen during training, yet their features are still extracted on-the-fly. This significantly slows down training and often causes out-of-memory (OOM) errors. It would be much more efficient to extract these features beforehand and load them during training. At the very least, the DCAE features should be pre-extracted: compared to MERT and mHuBERT features, they are much smaller on disk and easier to handle offline.
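The caching pattern could look roughly like this (a minimal numpy sketch; `extract_and_cache`, `load_cached`, and the `clip_id`-based file layout are hypothetical names, and `extractor` stands in for a frozen model's forward pass):

```python
import numpy as np
from pathlib import Path

def extract_and_cache(clips, extractor, cache_dir):
    """Run the frozen extractor once per clip and save the features to disk.

    `clips` is an iterable of (clip_id, audio) pairs; `extractor` stands in
    for a frozen model's forward pass (MERT / mHuBERT / DCAE).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for clip_id, audio in clips:
        out = cache_dir / f"{clip_id}.npy"
        if not out.exists():  # skip clips that were already cached
            np.save(out, extractor(audio))

def load_cached(clip_id, cache_dir):
    """Load pre-extracted features instead of re-running the frozen model."""
    return np.load(Path(cache_dir) / f"{clip_id}.npy")
```

The training dataloader would then just call `load_cached` in `__getitem__`, so none of the frozen models needs to sit in GPU memory during training.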
- Using Intermediate Layers from MERT: For MERT, it might be worth exploring embeddings from intermediate layers (e.g., layer 7) instead of the final layer. I discussed this with one of the MERT authors, and they suggested that using middle-layer representations might yield better results.
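To illustrate what this change amounts to downstream, here is a sketch with dummy numpy arrays standing in for MERT's per-layer outputs (with the Hugging Face `transformers` checkpoint one would pass `output_hidden_states=True` and index `outputs.hidden_states` the same way; layer index 7 is the suggestion above, not a tuned value):

```python
import numpy as np

# Dummy stand-ins for the per-layer outputs of a 12-layer model:
# 13 arrays (embedding output plus 12 transformer layers), each of
# shape (time_steps, hidden_dim).
rng = np.random.default_rng(0)
hidden_states = [rng.standard_normal((50, 768)) for _ in range(13)]

# Current behavior: take the final layer.
feat_last = hidden_states[-1]

# Proposed: take a middle layer instead (e.g., layer 7).
feat_mid = hidden_states[7]

# Same shape, so it is a drop-in replacement for everything downstream.
assert feat_mid.shape == feat_last.shape
```

Since the shapes match, trying layer 7 (or sweeping a few middle layers) requires no change to the rest of the pipeline.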