In the paper, the authors said: "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer."
But I am still confused. What does 10^3 mean, and how was the value 0.1 obtained?
@tangbohu I assume that β is 10^3 / batch_size / (feature_map_size)^2; in practice, this division occurs in the average function here. The batch size is set to 128 by default, and the feature map size takes the values 32x32, 16x16, and 8x8, so the aforementioned expression varies around 0.1. This is just my own conjecture from the code.
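To make the conjecture above concrete, here is a small sketch that evaluates β = 10^3 / (batch_size × number of attention-map elements) for the three feature-map sizes mentioned in the comment. The batch size of 128 and the spatial sizes 32x32, 16x16, and 8x8 are taken from the comment itself; this is only an illustration of the formula, not code from the repository.

```python
# Conjectured formula from the comment:
#   beta = 10^3 / (batch_size * n_elements)
# where n_elements is the number of elements in the attention map.
batch_size = 128  # default batch size per the comment

for side in (32, 16, 8):       # spatial sizes of the attention maps
    n_elements = side * side   # elements in one attention map
    beta = 1e3 / (batch_size * n_elements)
    print(f"{side}x{side}: beta = {beta:.4f}")
```

Running this gives roughly 0.0076, 0.0305, and 0.1221 for the three layers, so β indeed lands in the neighborhood of 0.1 only for the smallest (8x8) maps; "varies about 0.1" seems to describe the order of magnitude rather than an exact value.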