Additional special tokens re-added after calling train_new_from_iterator
#1277
Hey! As mentioned in the documentation, you should try to print:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained(
...     "google/flan-t5-small",
...     additional_special_tokens=None,
...     extra_ids=0,
...     model_max_length=1e30,
... )
>>> print(tokenizer.additional_special_tokens)
['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', ...]

This is because the internal state is not updated when those arguments are passed: calling

>>> tokenizer.encode("<extra_id_0>")
[32099, 1]

still treats <extra_id_0> as a single known special token.
I would recommend trying option 1. 😉
Hello @ArthurZucker, thank you for your reply. I had thought about going with option 1, but I am not sure how to create a vocab file using the tokenizers library. Previously I worked with the SentencePiece library alone, and its trainer produces the vocab file directly.
I have resorted to training a SentencePiece tokenizer and then using that vocab file for the T5 tokenizer:

import io
import os

import sentencepiece as spm
from transformers import T5Tokenizer

os.makedirs(os.path.join(config.output_dir_path, "tokenizer"), exist_ok=True)
spm_filepath = os.path.join(config.output_dir_path, "tokenizer", "spiece.model")

# Train a SentencePiece model directly from the dataset iterator, keeping it in memory.
spm_model = io.BytesIO()
spm.SentencePieceTrainer.Train(
    sentence_iterator=(example["text"] for example in split_dataset["train"]),
    model_writer=spm_model,
    vocab_size=config.vocab_size,
    num_threads=config.num_proc,
    pad_id=0,
    bos_id=-1,
    eos_id=1,
    unk_id=2,
    character_coverage=1.0,
)

# Write the trained model to disk so it can serve as the T5 vocab file.
with open(spm_filepath, "wb") as f:
    f.write(spm_model.getvalue())

# Build a slow T5 tokenizer around the new vocab file, without any extra sentinel ids.
tokenizer = T5Tokenizer(spm_filepath, extra_ids=0, model_max_length=1e30)
print(f"Tokenizer Vocab Size {len(tokenizer)}")

pretrained_tokenizer_path = os.path.join(config.output_dir_path, "tokenizer")
tokenizer.save_pretrained(pretrained_tokenizer_path)

Let me know if there is already an easier/native way to do this.
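For completeness, here is the sanity check I would run on the result (paths as above; the commented output is the expected behaviour, not verified here):

from transformers import T5Tokenizer

# Reload the saved tokenizer and confirm no <extra_id_*> tokens were re-added.
reloaded = T5Tokenizer.from_pretrained(pretrained_tokenizer_path)
print(reloaded.additional_special_tokens)  # expected: []
print(len(reloaded))                       # expected: config.vocab_size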
I think this is the correct approach 😉
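If you want to avoid the standalone sentencepiece dependency, something along these lines with the tokenizers library should also work (a rough, untested sketch; the corpus iterator, vocab size, and special tokens are placeholders):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import T5TokenizerFast

# Train a Unigram model (the same model family SentencePiece uses by default).
tok = Tokenizer(models.Unigram())
tok.pre_tokenizer = pre_tokenizers.Metaspace()
trainer = trainers.UnigramTrainer(
    vocab_size=32_000,
    special_tokens=["<pad>", "</s>", "<unk>"],
    unk_token="<unk>",
)
tok.train_from_iterator(corpus_iterator, trainer=trainer)  # placeholder iterator

# Wrap the trained tokenizers object in a fast T5 tokenizer, with no sentinel tokens.
fast_tokenizer = T5TokenizerFast(
    tokenizer_object=tok,
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    extra_ids=0,
)
fast_tokenizer.save_pretrained("path/to/tokenizer")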
BTW, the reason behind this is that T5TokenizerFast (and the slow T5Tokenizer) always add the special tokens when initializing, which is not a good practice.
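A quick way to see this with the defaults (printed values shown for illustration):

from transformers import T5Tokenizer

# With the default extra_ids=100, the sentinel tokens are added back at init time.
tok = T5Tokenizer.from_pretrained("google/flan-t5-small")
print(len(tok.additional_special_tokens))  # 100
print(tok.additional_special_tokens[:3])   # ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>']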
I am trying to train a new tokenizer without additional special tokens, using a pretrained T5Tokenizer as the base. The issue is that after setting additional_special_tokens=None and extra_ids=0, the tokenizer.json file still includes the additional special tokens. Here is my code:
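A minimal sketch of this setup would look roughly like the following (the checkpoint name, corpus iterator, and save path are placeholders, not the original snippet):

from transformers import AutoTokenizer

# Start from a pretrained T5 tokenizer, asking for no sentinel tokens.
base_tokenizer = AutoTokenizer.from_pretrained(
    "google/flan-t5-small",
    additional_special_tokens=None,
    extra_ids=0,
)

corpus = (example["text"] for example in dataset["train"])  # placeholder corpus iterator

# Train a new tokenizer on the corpus and save it.
new_tokenizer = base_tokenizer.train_new_from_iterator(corpus, vocab_size=32_000)
new_tokenizer.save_pretrained("new_tokenizer")  # tokenizer.json still lists <extra_id_*>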
I expect the tokenizer.json to only include special tokens such as bos_token, pad_token, etc., but not additional_special_tokens. However, the additional special tokens are still present.
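To confirm what ends up in the saved file, one can inspect the added_tokens section of the generated tokenizer.json (path as in the sketch above):

import json

with open("new_tokenizer/tokenizer.json") as f:
    tok_json = json.load(f)

# The <extra_id_*> sentinels still appear among the added tokens.
print([t["content"] for t in tok_json["added_tokens"]][:5])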