In the paper, the authors said: "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer."
But I am still confused. What does 10^3 mean, and how was the value 0.1 obtained?
@tangbohu I assume that β is 10^3 / batch_size / (feature_map_size)^2; in practice, this division occurs in the average function here. The batch size is set to 128 by default, and the feature map size takes the values 32x32, 16x16, and 8x8, so the aforementioned expression varies around 0.1. This is just my own conjecture from the code.
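To make the conjecture above concrete, here is a small sketch that evaluates β = 10^3 / (batch_size × number of attention-map elements) for the three feature-map sizes mentioned in the comment. The batch size of 128 and the spatial sizes 32x32, 16x16, and 8x8 are taken from the comment itself; this is only an illustration of the formula, not code from the repository.

```python
# Conjectured formula from the comment:
#   beta = 10^3 / (batch_size * n_elements)
# where n_elements is the number of elements in the attention map.
batch_size = 128  # default batch size per the comment

for side in (32, 16, 8):       # spatial sizes of the attention maps
    n_elements = side * side   # elements in one attention map
    beta = 1e3 / (batch_size * n_elements)
    print(f"{side}x{side}: beta = {beta:.4f}")
```

Running this gives roughly 0.0076, 0.0305, and 0.1221 for the three layers, so β indeed lands in the neighborhood of 0.1 only for the smallest (8x8) maps; "varies about 0.1" seems to describe the order of magnitude rather than an exact value.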