Hi @ylacombe,
Thank you for the new blog post about fine-tuning w2v-BERT.
However, I have some doubts about the "average duration seen by each token", or perhaps I might be mistaken.
The feature extractor employs a `hop_length` of 160 and a reshape with a stride of 2. Therefore, for a 1-second signal with 16000 samples, it outputs 16000 / 160 / 2 = 50 (actually 48) tokens. This means that each token sees 1000 ms / 50 = 20 ms of signal.
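To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in Python, using only the numbers quoted above (the variable names are mine for illustration; only `hop_length` corresponds to an actual feature-extractor parameter):

```python
# Token-rate arithmetic for the w2v-BERT feature extractor,
# using only the numbers quoted in the paragraph above.
sampling_rate = 16_000   # input samples per second
hop_length = 160         # feature-extractor hop, as stated above
reshape_stride = 2       # consecutive frames are stacked in pairs

frames_per_second = sampling_rate / hop_length           # 100 mel frames / s
tokens_per_second = frames_per_second / reshape_stride   # 50 encoder tokens / s
ms_per_token = 1000 / tokens_per_second                  # 20 ms of signal each

print(tokens_per_second, ms_per_token)  # 50.0 20.0
```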
And if we concatenate the encoder with a single conv adapter layer with an `adapter_stride` of 2, the 50 tokens get subsampled to 25 tokens, which means that each token now sees 40 ms of signal.
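Continuing the same sketch, the adapter just halves the token rate again (again plain arithmetic, not the actual `transformers` code):

```python
# Effect of a single stride-2 conv adapter layer on the token rate.
tokens_per_second = 50   # encoder token rate from the sketch above
adapter_stride = 2       # single conv adapter layer, stride 2

tokens_after_adapter = tokens_per_second / adapter_stride  # 25 tokens / s
ms_per_adapter_token = 1000 / tokens_after_adapter         # 40 ms of signal each

print(tokens_after_adapter, ms_per_adapter_token)  # 25.0 40.0
```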