Hi, thank you for your contribution!
It seems that the "factor" in "model.py" is None at first, is set to a fixed value after the first batch of the first epoch, and is then kept fixed for the rest of training. What is the benefit of this strategy?
Besides, should the "factor" in "model.py" be removed, or should only each embedding's own norm be considered when obtaining embeddings? As it stands, the factor also depends on the other data in the same batch unless we set batch_size = 1.
Thanks again!
We multiply molecule embeddings by sqrt(dim) / mean(norm(first_batch)), which is self.factor. In other words, we first divide molecule embeddings by their average norm so that their lengths are approximately one (not exactly one, which I will explain later), and then rescale them to length sqrt(dim). The reason is to keep the distribution of embedding entries stable with respect to the embedding dimension. For example, suppose you have a 2-d vector and a 1024-d vector: if you normalize both of them to unit length, the entries of the 1024-d vector will be much smaller than those of the 2-d vector. This is why we multiply by sqrt(dim), so that the entry distribution is not affected by dim.
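If it helps, here is a minimal sketch of how such a lazily initialized, fixed factor could be computed and applied. The class and attribute names here are hypothetical; the actual model.py may be organized differently.

```python
import torch

class EmbeddingRescaler(torch.nn.Module):
    """Sketch: factor = sqrt(dim) / mean(norm(first_batch)), computed once
    from the first batch seen and reused unchanged for all later batches."""

    def __init__(self):
        super().__init__()
        self.factor = None  # set lazily on the first batch, then kept fixed

    def forward(self, embeddings):  # embeddings: (batch_size, dim)
        if self.factor is None:
            dim = embeddings.shape[1]
            mean_norm = embeddings.norm(dim=1).mean()
            # sqrt(dim) / mean(norm) makes the *average* length sqrt(dim),
            # so individual entries keep a dim-independent scale.
            self.factor = (dim ** 0.5) / mean_norm.detach()
        return embeddings * self.factor
```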
The second question is: why not scale each molecule embedding by sqrt(dim) / norm(this_embedding), i.e., why should the factor be fixed across all embeddings? The reason is that if we scaled each embedding by sqrt(dim) / norm(this_embedding), every embedding would end up with exactly the same length sqrt(dim). This is undesirable because molecules have different sizes: we expect large molecules to have large embeddings and small molecules to have small embeddings, so that the embedding space is more physically meaningful. Note that a raw molecule embedding is the sum of all its atom embeddings, so the raw embedding of a large molecule will naturally be large. If we rescale all embeddings with one fixed factor, a large molecule still ends up with a large embedding.
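To see the difference numerically, here is a toy example (not code from the repository) contrasting per-embedding scaling with a shared fixed factor:

```python
import torch

dim = 8
torch.manual_seed(0)
small = torch.randn(dim)   # stands in for a small molecule's raw embedding
large = small * 5.0        # a "large molecule": same direction, 5x the norm

# Per-embedding scaling: both end up with identical length sqrt(dim),
# so the size information is erased.
per_emb = [v * (dim ** 0.5) / v.norm() for v in (small, large)]
print([round(v.norm().item(), 3) for v in per_emb])   # both ~= sqrt(8) = 2.828

# Shared fixed factor: relative magnitudes are preserved, so the large
# molecule still has the larger embedding after rescaling (1:5 ratio kept).
factor = (dim ** 0.5) / torch.stack([small.norm(), large.norm()]).mean()
fixed = [v * factor for v in (small, large)]
print([round(v.norm().item(), 3) for v in fixed])
```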