Hi @ylacombe,
Thank you for the new blog post about fine-tuning w2v-BERT.
However, I have some doubts about the "average duration seen by each token", or perhaps I might be mistaken.
The feature extractor employs a `hop_length` of 160 and a reshape with a stride of 2. Therefore, for a 1-second signal with 16000 samples, it outputs 16000 / 160 / 2 = 50 (actually 48) tokens. This means that each token sees 1000 ms / 50 = 20 ms of signal.
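To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in Python, using only the numbers quoted above (the variable names are mine for illustration; only `hop_length` corresponds to an actual feature-extractor parameter):

```python
# Token-rate arithmetic for the w2v-BERT feature extractor,
# using only the numbers quoted in the paragraph above.
sampling_rate = 16_000   # input samples per second
hop_length = 160         # feature-extractor hop, as stated above
reshape_stride = 2       # consecutive frames are stacked in pairs

frames_per_second = sampling_rate / hop_length           # 100 mel frames / s
tokens_per_second = frames_per_second / reshape_stride   # 50 encoder tokens / s
ms_per_token = 1000 / tokens_per_second                  # 20 ms of signal each

print(tokens_per_second, ms_per_token)  # 50.0 20.0
```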
And if we concatenate the encoder with a single conv adapter layer with an `adapter_stride` of 2, the 50 tokens get subsampled to 25 tokens, which means that each token now sees 40 ms of signal.
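Continuing the same sketch, the adapter just halves the token rate again (again plain arithmetic, not the actual `transformers` code):

```python
# Effect of a single stride-2 conv adapter layer on the token rate.
tokens_per_second = 50   # encoder token rate from the sketch above
adapter_stride = 2       # single conv adapter layer, stride 2

tokens_after_adapter = tokens_per_second / adapter_stride  # 25 tokens / s
ms_per_adapter_token = 1000 / tokens_after_adapter         # 40 ms of signal each

print(tokens_after_adapter, ms_per_adapter_token)  # 25.0 40.0
```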