
Hybrid tokenizers #25

Merged 15 commits into main from hybrid_tokenizers on Sep 25, 2024
Conversation

stephantul (Collaborator)

No description provided.

@Pringled (Member) left a comment:
LGTM

:param device: The device to use.
:param pca_dims: The number of components to use for PCA. If this is None, we don't apply PCA.
:param apply_zipf: Whether to apply Zipf weighting to the embeddings.
:param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary.
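The `apply_zipf` parameter in the docstring above refers to Zipf weighting of the token embeddings. As an illustration only, here is a minimal sketch of one common log-rank weighting scheme; the function name and the exact formula are assumptions for this example and may differ from what the project actually implements.

```python
import math


def zipf_weight(embeddings: list[list[float]]) -> list[list[float]]:
    """Apply a Zipf-style weight to each embedding row.

    Assumes rows are ordered by descending token frequency, so the
    token's rank is its row index + 1. The log(1 + rank) factor is an
    illustrative choice, not necessarily the library's actual formula.
    """
    weighted = []
    for rank, vec in enumerate(embeddings, start=1):
        w = math.log(1 + rank)  # hypothetical Zipf-style weight
        weighted.append([w * x for x in vec])
    return weighted
```

A scheme like this damps the influence of very frequent tokens relative to a uniform average of token embeddings.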
Pringled (Member) commented:
This also changes the actual tokenizer from subword to word level, right? I would also specify that in the description.

:param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary.
:raises: ValueError if the PCA dimension is larger than the number of dimensions in the embeddings.
:raises: ValueError if the vocabulary contains duplicate tokens.
:return: A StaticModdel
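The two `:raises:` lines in the quoted docstring describe input validation. A minimal sketch of those checks, using a hypothetical function name and parameter names (not the library's actual API), might look like:

```python
def validate_static_model_inputs(pca_dims, embedding_dim, vocabulary):
    """Sketch of the two ValueError conditions documented above.

    All names here are hypothetical and for illustration only.
    """
    # PCA cannot project into more dimensions than the embeddings have.
    if pca_dims is not None and pca_dims > embedding_dim:
        raise ValueError(
            f"PCA dims ({pca_dims}) larger than embedding dims ({embedding_dim})."
        )
    # A vocabulary with duplicate tokens would map two entries to one row.
    if vocabulary is not None and len(set(vocabulary)) != len(vocabulary):
        raise ValueError("Vocabulary contains duplicate tokens.")
```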

Suggested change:
-:return: A StaticModdel
+:return: A StaticModel.

@stephantul stephantul merged commit 0a1b672 into main Sep 25, 2024
@stephantul stephantul deleted the hybrid_tokenizers branch September 25, 2024 17:26
2 participants