
Fix distill model bos and eos token #78

Merged
merged 1 commit into MinishLab:main from zecheng_fix_bos_eos on Oct 12, 2024

Conversation

zechengz
Contributor

The bos_token_id and eos_token_id appear to be reversed in the create_output_embeddings_from_model_name function within model2vec/distill/inference.py.

When testing with the following code:

from transformers import AutoModel, AutoTokenizer
from model2vec.distill import distill_from_model

# Load a BERT model and its tokenizer, then distill a static Model2Vec model
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256)

before the fix, bos_token_id is 102 and eos_token_id is 101. For the BERT tokenizer, the correct values are:

>>> tokenizer.cls_token_id
101
>>> tokenizer.sep_token_id
102

@Pringled Pringled self-assigned this Oct 12, 2024
@Pringled Pringled self-requested a review October 12, 2024 13:38
Member

@Pringled Pringled left a comment


Great catch, thanks for fixing this! As a quick sanity check I re-ran a few benchmarks and, fortunately, the results don't change much. I'll dig a bit deeper into different tasks to see whether this improves performance anywhere, which I expect it will.

@Pringled Pringled merged commit 97d3677 into MinishLab:main Oct 12, 2024
@zechengz zechengz deleted the zecheng_fix_bos_eos branch October 12, 2024 21:49