Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add SIF-like coef #174

Merged
merged 10 commits into from
Feb 5, 2025
Merged

feat: Add SIF-like coef #174

merged 10 commits into from
Feb 5, 2025

Conversation

stephantul
Copy link
Member

@stephantul stephantul commented Feb 4, 2025

This PR removes Zipf weighting and adds a SIF coefficient instead.

apply_zipf is still available, but throws a DeprecationWarning.

@stephantul stephantul changed the title Sif coef feat: Add SIF-like coef Feb 4, 2025
@stephantul stephantul requested a review from Pringled February 4, 2025 09:39
Copy link

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 83.33333% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
tests/test_distillation.py 50.00% 5 Missing ⚠️
model2vec/distill/distillation.py 96.15% 1 Missing ⚠️
Files with missing lines Coverage Δ
model2vec/distill/distillation.py 95.83% <96.15%> (+0.52%) ⬆️
tests/test_distillation.py 90.00% <50.00%> (-1.77%) ⬇️

Copy link
Member

@Pringled Pringled left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments

model2vec/distill/distillation.py Show resolved Hide resolved
Should be a value >= 0 and < 1.0. A value of 1e-4 is a good default.
:param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words.
:return: The SIF coefficient to use.
:raises: ValueError if the PCA dimension is larger than the number of dimensions in the embeddings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is not checked here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is on purpose, because even though the value errors are not thrown here, they can be thrown by this function because it calls private functions. So it is still useful to put it here, and not useful to put anywhere else.

model2vec/distill/distillation.py Show resolved Hide resolved
:raises: ValueError if the vocabulary is empty after token removal.

"""
if apply_zipf is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this part can be a bit counterintuitive, if you do apply_zipf=False, sif_coefficient=1e-4, sif_coefficient is set to None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is meant to be this way. This is the scenario: somebody has apply_zipf set to False in their code, they upgrade, but don't change their code. They get the warning, ignore it, but then weighting is still applied. This is the only way in which the original behavior (i.e.,apply_zipf=False means no weighting) can be maintained.

model2vec/distill/distillation.py Outdated Show resolved Hide resolved
logger.info("Applying SIF and Zipf weighting")
if sif_coefficient <= 0:
raise ValueError("SIF coefficient must be positive.")
inv_rank = 1 / (np.arange(2, embeddings.shape[0] + 2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why +2? Is it to skip special tokens? If so, there's more than 2 right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on the zipf formula, b is a free parameter which we set to 1 (normally, this is set to 2.7, but I haven't experimented with it. It seemed easier to just leave it as an integer.

The full formula is:

np.arange(b, len(vocab) + b) ** a with b==2.7 and a==1, usually. So since a==1, we leave it out.

logger.info("Applying Zipf weighting")
embeddings *= np.log(1 + np.arange(embeddings.shape[0]))[:, None]
if sif_coefficient is not None:
logger.info("Applying SIF and Zipf weighting")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.info("Applying SIF and Zipf weighting")
logger.info("Applying SIF weighting")

I think we don't do zipf weighting anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still Zipf, because we estimate the probability through Zipf's law. I think it could be rephrased.

@stephantul stephantul merged commit da79779 into main Feb 5, 2025
5 of 6 checks passed
@stephantul stephantul deleted the sif_coef branch February 5, 2025 08:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants