[w2v-bert] Questions about average duration by token #1774

Open
bofenghuang opened this issue Jan 27, 2024 · 0 comments
Hi @ylacombe,

Thank you for the new blog post about fine-tuning w2v-BERT.

However, I have some doubts about the "average duration seen by each token" figure, though I might be mistaken.

The feature extractor uses a hop_length of 160 and then stacks frames with a stride of 2. Therefore, for a 1-second signal of 16000 samples, it outputs 16000 / 160 / 2 = 50 tokens (48 in practice, due to edge effects). This means that each token sees 1000 ms / 50 = 20 ms of signal.
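For reference, here is a minimal arithmetic sketch of that calculation (plain Python, no library calls; the 160-sample hop at 16 kHz and the stride-2 frame stacking are taken from the description above, not from the actual implementation):

```python
# Back-of-the-envelope check of tokens per second for the feature extractor,
# assuming a 160-sample hop at 16 kHz (10 ms per frame) followed by stacking
# pairs of frames (stride 2).
sampling_rate = 16_000
hop_length = 160          # samples per frame -> 10 ms
stack_stride = 2          # frames stacked into one token

samples = sampling_rate   # 1 second of audio
frames = samples // hop_length       # 100 frames of 10 ms
tokens = frames // stack_stride      # 50 tokens (~48 in practice)
ms_per_token = 1000 * samples / sampling_rate / tokens
print(frames, tokens, ms_per_token)  # 100 50 20.0
```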

And if we append a single conv adapter layer to the encoder with an adapter_stride of 2, the 50 tokens are subsampled to 25 tokens, which means that each token now sees 40 ms of signal.
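Continuing the same back-of-the-envelope sketch (the adapter_stride and num_adapter_layers names below only mirror the wording above; treat them as illustrative rather than exact config keys):

```python
# Extra subsampling from a single convolutional adapter layer with stride 2.
encoder_tokens_per_sec = 50   # 20 ms per token out of the encoder
adapter_stride = 2
num_adapter_layers = 1

adapter_tokens_per_sec = encoder_tokens_per_sec // (adapter_stride ** num_adapter_layers)
ms_per_token = 1000 / adapter_tokens_per_sec
print(adapter_tokens_per_sec, ms_per_token)  # 25 40.0
```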
