How is the number of BERT model parameters calculated? #656
Comments
Usually a Transformer model has many parameters to train. For example, for the BERT BASE model: L=12, H=768, A=12, the total number of trainable parameters is 12 * 768 * 12 = 110M.
12 * 768 * 12 = 110M ?
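For reference, 12 * 768 * 12 is only about 110 thousand, not 110M. The ~110M figure comes from summing the embedding tables, the 12 encoder layers, and the pooler. Below is a back-of-the-envelope sketch of that sum, assuming the standard BERT-Base config values (vocab 30522, hidden 768, 12 layers, 12 heads, intermediate 3072, 512 positions, 2 token types):

```python
# Rough parameter count for BERT-Base (assumed config values, not read from the repo).
V, H, L, I, P, T = 30522, 768, 12, 3072, 512, 2

embeddings = V * H + P * H + T * H + 2 * H      # token + position + segment embeddings + LayerNorm
attention  = 4 * (H * H + H) + 2 * H            # Q, K, V, output projections (with biases) + LayerNorm
ffn        = (H * I + I) + (I * H + H) + 2 * H  # two feed-forward dense layers + LayerNorm
per_layer  = attention + ffn
pooler     = H * H + H

total = embeddings + L * per_layer + pooler
print(f"{total:,}")  # 109,482,240  ->  ~110M
```

So roughly 24M parameters sit in the embeddings and about 7M in each of the 12 encoder layers.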
Here is one layer of the Transformer: [parameter table/image not captured]
So does the attention head number get included?
I think the attention head number is chosen such that H / A = 64 for all models, where H is the hidden size and A is the number of attention heads.
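For instance, assuming the two published BERT configurations, the per-head size works out to 64 in both cases:

```python
# Per-head size H / A stays at 64 across the released BERT configs (assumed values).
for name, (H, A) in {"BERT-Base": (768, 12), "BERT-Large": (1024, 16)}.items():
    print(name, H // A)  # -> 64 for both
```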
Thanks @liuqiangict
They are different. If they were shared, the weight size could be reduced by a factor of the number of heads.
Yes, it does. For each head, the attention layer projects the input (of size [768]) down to a smaller size ([64]). There are 12 heads in the attention layer, and 64 * 12 = 768. The implementation does not have 12 separate heads explicitly; instead, the 12 heads are fused into one linear layer (768 * 768). In terms of the computation, the two are the same.
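To illustrate the point about the fused projection, here is a small sketch (not the repository's code) showing that one [768, 768] matrix applied at once gives the same result as 12 separate [768, 64] per-head projections stacked side by side:

```python
import numpy as np

H, A = 768, 12
head_size = H // A  # 64

rng = np.random.default_rng(0)
x = rng.normal(size=(5, H))   # 5 tokens, hidden size 768
W = rng.normal(size=(H, H))   # fused query projection: 768 * 768 weights

# One fused projection for all heads.
fused = x @ W                 # [5, 768]

# Equivalent: 12 independent [768, 64] projections, one per head.
per_head = [x @ W[:, i * head_size:(i + 1) * head_size] for i in range(A)]

assert np.allclose(fused, np.concatenate(per_head, axis=-1))
```

The parameter count is identical either way; fusing the heads into a single matrix multiply is just more efficient on hardware.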
I'm a bit confused about the 110M parameters. How is it calculated?