
Fix distill model bos and eos token #78

Merged
merged 1 commit into MinishLab:main from zecheng_fix_bos_eos on Oct 12, 2024

Conversation

zechengz
Contributor

The bos_token_id and eos_token_id appear to be reversed in the create_output_embeddings_from_model_name function within model2vec/distill/inference.py.

When testing with the following code:

from transformers import AutoModel, AutoTokenizer
from model2vec.distill import distill_from_model

# Load a BERT model and its tokenizer, then distill a static Model2Vec model
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256)

before the fix, bos_token_id is 102 and eos_token_id is 101. For the BERT tokenizer, the correct values are:

>>> tokenizer.cls_token_id
101
>>> tokenizer.sep_token_id
102

@Pringled Pringled self-assigned this Oct 12, 2024
@Pringled Pringled self-requested a review October 12, 2024 13:38
Member

@Pringled Pringled left a comment


Great catch, thanks for fixing this! As a quick sanity check I re-ran a few benchmarks and, fortunately, the results don't change much. I'll dig a bit deeper into different tasks to see whether this improves performance anywhere, which I expect it will.

@Pringled Pringled merged commit 97d3677 into MinishLab:main Oct 12, 2024
@zechengz zechengz deleted the zecheng_fix_bos_eos branch October 12, 2024 21:49