-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add SIF-like coef #174
Conversation
Codecov ReportAttention: Patch coverage is
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of comments
Should be a value >= 0 and < 1.0. A value of 1e-4 is a good default. | ||
:param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words. | ||
:return: The SIF coefficient to use. | ||
:raises: ValueError if the PCA dimension is larger than the number of dimensions in the embeddings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is not checked here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is on purpose, because even though the value errors are not thrown here, they can be thrown by this function because it calls private functions. So it is still useful to put it here, and not useful to put anywhere else.
:raises: ValueError if the vocabulary is empty after token removal. | ||
|
||
""" | ||
if apply_zipf is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this part can be a bit counterintuitive, if you do apply_zipf=False, sif_coefficient=1e-4
, sif_coefficient is set to None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is meant to be this way. This is the scenario: somebody has apply_zipf
set to False
in their code, they upgrade, but don't change their code. They get the warning, ignore it, but then weighting is still applied. This is the only way in which the original behavior (i.e.,apply_zipf=False
means no weighting) can be maintained.
logger.info("Applying SIF and Zipf weighting") | ||
if sif_coefficient <= 0: | ||
raise ValueError("SIF coefficient must be positive.") | ||
inv_rank = 1 / (np.arange(2, embeddings.shape[0] + 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why +2? Is it to skip special tokens? If so, there's more than 2 right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is based on the zipf formula, b
is a free parameter which we set to 1 (normally, this is set to 2.7, but I haven't experimented with it. It seemed easier to just leave it as an integer.
The full formula is:
np.arange(b, len(vocab) + b) ** a
with b==2.7
and a==1
, usually. So since a==1
, we leave it out.
model2vec/distill/distillation.py
Outdated
logger.info("Applying Zipf weighting") | ||
embeddings *= np.log(1 + np.arange(embeddings.shape[0]))[:, None] | ||
if sif_coefficient is not None: | ||
logger.info("Applying SIF and Zipf weighting") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.info("Applying SIF and Zipf weighting") | |
logger.info("Applying SIF weighting") |
I think we don't do zipf weighting anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still Zipf, because we estimate the probability through Zipf's law. I think it could be rephrased.
This PR removes Zipf weighting and adds a SIF coefficient instead.
apply_zipf
is still available, but throws a DeprecationWarning.