
When extending embeddings, multivariate distribution isn't correctly estimated even when the calculated sigma matrix is symmetric and positive definite #35075

Closed
MayStepanyan opened this issue Dec 4, 2024 · 7 comments

MayStepanyan commented Dec 4, 2024

System Info

  • transformers version: 4.37.1
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

When resizing token embeddings for models like MobileBert, iBert, etc., resize_token_embeddings calls the underlying transformers.modeling_utils._init_added_embeddings_with_mean, which is supposed to initialize the new embedding weights from the old ones (a rough sketch follows the list below):

  1. calculate the mean vector of the old embedding vectors
  2. calculate a sigma matrix from this vector: vector * vector.T / vector_dim
  3. check whether it is positive definite, i.e. whether it can be used as a covariance matrix for a new distribution
  • if so, sample the new embeddings from the estimated distribution
  • otherwise, just initialize the new embeddings from the mean vector of the previous ones
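
Here is a rough, self-contained sketch of that flow (the tensor shapes and variable names are illustrative stand-ins, not the exact transformers implementation):

import torch

# illustrative stand-ins for the old embedding matrix and the number of added tokens
old_embeddings = torch.randn(1000, 64)   # (old_vocab_size, hidden_dim)
num_new_tokens = 4

# 1. mean vector of the old embedding vectors
mean_embedding = old_embeddings.mean(dim=0)

# 2. sigma matrix estimated from the old embeddings (symmetrized here to guard
#    against floating-point asymmetry from the matmul)
sigma = old_embeddings.T @ old_embeddings / old_embeddings.shape[1]
sigma = 0.5 * (sigma + sigma.T)

# 3. if sigma passes the positive-definiteness check, sample the new rows from the
#    estimated multivariate normal; otherwise fall back to copying the mean vector
eigenvalues = torch.linalg.eigvals(sigma)
is_covariance_psd = bool(
    (sigma == sigma.T).all() and not torch.is_complex(eigenvalues) and (eigenvalues > 0).all()
)
if is_covariance_psd:
    distribution = torch.distributions.MultivariateNormal(mean_embedding, covariance_matrix=sigma)
    new_embeddings = distribution.sample((num_new_tokens,))
else:
    # with the check as it stands, this branch always runs, which is the bug described below
    new_embeddings = mean_embedding.repeat(num_new_tokens, 1)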

I noticed that the check in step 3 ALWAYS fails, i.e. no matrix is ever considered positive definite.

The problem seems to be in these lines:

eigenvalues = torch.linalg.eigvals(covariance)
is_covariance_psd = bool(
    (covariance == covariance.T).all() and not torch.is_complex(eigenvalues) and (eigenvalues > 0).all()
)

since the eigenvalues calculated with torch.linalg.eigvals are returned as a complex tensor and torch.is_complex returns True for them. Hence, the main logic, i.e. constructing a multivariate distribution from the previous embeddings and sampling from it, might never run (at least it never did in my experiments).
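
A quick standalone check makes this visible: for any real-valued input, eigvals returns a complex dtype, so the torch.is_complex test fails regardless of the eigenvalues themselves.

import torch

eigenvalues = torch.linalg.eigvals(torch.eye(3))  # the identity matrix has purely real eigenvalues
print(eigenvalues.dtype)                          # torch.complex64
print(torch.is_complex(eigenvalues))              # True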

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here's an isolated example testing the lines I mentioned above:

import torch

covariance = torch.tensor([[2., 1.], [1., 2.]])  # symmetric and positive definite
eigenvalues = torch.linalg.eigvals(covariance)
is_covariance_psd = bool(
    (covariance == covariance.T).all() and not torch.is_complex(eigenvalues) and (eigenvalues > 0).all()
)
print(is_covariance_psd)

This outputs False even though the matrix is symmetric and has two positive real eigenvalues, 3 and 1.

Expected behavior

The function should successfully generate a multivariate normal distribution whenever the calculated sigma is positive definite and symmetric.

I think the check might be replaced with something like:

from torch.distributions import constraints

is_psd = constraints.positive_definite.check(covariance).item()
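
For instance, applying this check to a small symmetric, positive definite matrix (the same one as in the reproduction above) gives the expected result:

import torch
from torch.distributions import constraints

covariance = torch.tensor([[2., 1.], [1., 2.]])  # symmetric and positive definite
print(constraints.positive_definite.check(covariance).item())  # True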

Rocketknight1 commented Dec 4, 2024

Hi @MayStepanyan, there's another related issue with the embeddings initialization here: #34570. It seems like we should probably refactor this block.

Would you be willing to open a PR? I think a combination of your suggestion plus the faster eigvalsh function (which applies here because covariance matrices are always symmetric) would be really nice to have. The eigvalsh function might also help avoid this issue: eigvals returns a complex tensor even when the eigenvalues are all real, whereas eigvalsh returns a float tensor, because symmetric matrices always have real eigenvalues.
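
To illustrate that dtype difference with a standalone check (not transformers code):

import torch

m = torch.tensor([[2., 1.], [1., 2.]])   # symmetric, eigenvalues 1 and 3
print(torch.linalg.eigvals(m).dtype)     # torch.complex64
print(torch.linalg.eigvalsh(m).dtype)    # torch.float32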

@MayStepanyan (Author)

Hi @Rocketknight1.

Of course, I’d like to work on the bug fix when I have some spare time.

As a side note, I don't think we need to replace eigvals with anything: the constraint check from torch.distributions.constraints seems to be enough, and we don't need to compute the eigenvalues explicitly at all.

@Rocketknight1 (Member)

@MayStepanyan you're totally right, good point! When skimming the code I missed that the eigenvalues were just used to check that condition.

@abuelnasr0 (Contributor)

Hi @MayStepanyan,
Thanks for this, you are right: the eigenvalues are always returned as complex.
I think your suggested solution is perfect. I have tried it locally and can confirm it works.
My only tip for when you open a PR: note that we check 1e-9 * covariance, not covariance, so the line should be is_psd = constraints.positive_definite.check(1e-9 * covariance).item()
Not noticing that in the first place was the reason behind PR #34037.

@MayStepanyan (Author)

Hi @abuelnasr0, thanks for the tip.

I wasn't planning to check the scaled matrix, since for a positive definite matrix A and a positive epsilon, epsilon * A is also positive definite. But now that you've mentioned it, I realize we might very occasionally hit numerical instabilities when multiplying the matrix by 1e-9, so it does make sense to check the condition on the scaled matrix.

By the way, what is the reasoning behind using 1e-9 * covariance? Is it just to make the final distribution less scattered, or is there some other reason?

@abuelnasr0 (Contributor)

In the original article, the author multiplied the covariance matrix by 1e-5: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
The article does not mention the reason behind this, but I had thought of it the same way you did: #33325 (comment)

The reason for 1e-9 rather than 1e-5 is that some models have high covariance, so one or more outliers can be sampled and cause this test case to fail: https://github.com/abuelnasr0/transformers/blob/8e60a368221f15720fbd5a34b41c95eafd66b0d7/tests/test_modeling_common.py#L1855
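
To illustrate the effect of the scaling: shrinking the covariance only tightens the sampled vectors around the mean, which makes extreme outliers far less likely (a standalone toy example, not the transformers code):

import torch
from torch.distributions import MultivariateNormal

mean = torch.zeros(4)
covariance = torch.eye(4) * 10.0   # toy stand-in for a model with high covariance

wide = MultivariateNormal(mean, covariance_matrix=covariance)
tight = MultivariateNormal(mean, covariance_matrix=1e-9 * covariance)

print(wide.sample((1000,)).std())    # roughly sqrt(10) ≈ 3.2
print(tight.sample((1000,)).std())   # roughly sqrt(10 * 1e-9) ≈ 1e-4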


github-actions bot commented Jan 9, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
