New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

feat: Add SIF-like coef #174

Merged

stephantul merged 10 commits into main from sif_coef

Feb 5, 2025

Member

stephantul commented Feb 4, 2025 •

edited

Loading

This PR removes Zipf weighting and adds a SIF coefficient instead.

apply_zipf is still available, but throws a DeprecationWarning.

stephantul added 5 commits

January 26, 2025 21:26


          feat: token mean and sif weighting

5b8e3ae


          fix error

b971fcb


          Merge branch 'main' into sif_coef

99fedd0


          rename

984c38f


          fix tests

dd7f2a7

stephantul changed the title ~~Sif coef~~ feat: Add SIF-like coef

stephantul requested a review from Pringled

February 4, 2025 09:39

codecov bot commented Feb 4, 2025 •

edited

Loading

Codecov Report

Attention: Patch coverage is 83.33333% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
tests/test_distillation.py	50.00%	5 Missing ⚠️
model2vec/distill/distillation.py	96.15%	1 Missing ⚠️

Files with missing lines	Coverage Δ
model2vec/distill/distillation.py	`95.83% <96.15%> (+0.52%)`	⬆️
tests/test_distillation.py	`90.00% <50.00%> (-1.77%)`	⬇️


          feat: update config

2a3ecfd

Pringled requested changes

View reviewed changes

Member

Pringled left a comment

Couple of comments

model2vec/distill/distillation.py Show resolved Hide resolved

model2vec/distill/distillation.py

+                      Should be a value >= 0 and < 1.0. A value of 1e-4 is a good default.
+                  :param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words.
+                  :return: The SIF coefficient to use.
+                  :raises: ValueError if the PCA dimension is larger than the number of dimensions in the embeddings.

Member

Pringled Feb 4, 2025

This one is not checked here

Member Author

stephantul Feb 4, 2025

This is on purpose, because even though the value errors are not thrown here, they can be thrown by this function because it calls private functions. So it is still useful to put it here, and not useful to put anywhere else.

model2vec/distill/distillation.py Show resolved Hide resolved

model2vec/distill/distillation.py

+                  :raises: ValueError if the vocabulary is empty after token removal.
+                  """
+                  if apply_zipf is not None:

Member

Pringled Feb 4, 2025

I think this part can be a bit counterintuitive, if you do apply_zipf=False, sif_coefficient=1e-4, sif_coefficient is set to None

Member Author

stephantul Feb 4, 2025

This is meant to be this way. This is the scenario: somebody has apply_zipf set to False in their code, they upgrade, but don't change their code. They get the warning, ignore it, but then weighting is still applied. This is the only way in which the original behavior (i.e.,apply_zipf=False means no weighting) can be maintained.

model2vec/distill/distillation.py Outdated Show resolved Hide resolved

model2vec/distill/distillation.py

+                      logger.info("Applying SIF and Zipf weighting")
+                      if sif_coefficient <= 0:
+                          raise ValueError("SIF coefficient must be positive.")
+                      inv_rank = 1 / (np.arange(2, embeddings.shape[0] + 2))

Member

Pringled Feb 4, 2025

Why +2? Is it to skip special tokens? If so, there's more than 2 right?

Member Author

stephantul Feb 4, 2025

This is based on the zipf formula, b is a free parameter which we set to 1 (normally, this is set to 2.7, but I haven't experimented with it. It seemed easier to just leave it as an integer.

The full formula is:

np.arange(b, len(vocab) + b) ** a with b==2.7 and a==1, usually. So since a==1, we leave it out.

model2vec/distill/distillation.py Outdated

-                      logger.info("Applying Zipf weighting")
-                      embeddings *= np.log(1 + np.arange(embeddings.shape[0]))[:, None]
+                  if sif_coefficient is not None:
+                      logger.info("Applying SIF and Zipf weighting")

Member

Pringled Feb 4, 2025

Suggested change

      
                    logger.info("Applying SIF and Zipf weighting")
          
                    logger.info("Applying SIF weighting")

I think we don't do zipf weighting anymore?

Member Author

stephantul Feb 4, 2025

It's still Zipf, because we estimate the probability through Zipf's law. I think it could be rephrased.


          reviewer comments

66b4314

Pringled approved these changes

View reviewed changes

stephantul added 3 commits

February 5, 2025 09:18


          set normalize to True by default

4c696b9


          fix tests

6d6b036


          fix: bug and tests

79fa2a1

stephantul merged commit da79779 into main

5 of 6 checks passed

stephantul deleted the sif_coef branch

February 5, 2025 08:53

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet