
Additional special tokens re-added after calling train_new_from_iterator. #1277

Closed
Kinyugo opened this issue Jun 15, 2023 · 8 comments


Kinyugo commented Jun 15, 2023

I am trying to train a new tokenizer without additional special tokens, using a pretrained T5Tokenizer as the base. The issue is that even after setting additional_special_tokens=None and extra_ids=0, the tokenizer.json file still includes the additional special tokens.

Here is my code:

  import os

  from transformers import AutoTokenizer

  # Load the pretrained tokenizer, trying to disable the <extra_id_*> sentinel tokens
  tokenizer = AutoTokenizer.from_pretrained(
      "google/flan-t5-small",
      additional_special_tokens=None,
      extra_ids=0,
      model_max_length=1e30,
  )
  # Retrain it on the new corpus with the same settings
  tokenizer = tokenizer.train_new_from_iterator(
      batch_iterator(split_dataset["train"], config.batch_size),
      vocab_size=config.vocab_size,
      additional_special_tokens=None,
      extra_ids=0,
  )

  pretrained_tokenizer_path = os.path.join(config.output_dir_path, "tokenizer")
  tokenizer.save_pretrained(pretrained_tokenizer_path)

I expect tokenizer.json to only include the special tokens such as bos_token, pad_token, etc., but not the additional_special_tokens. However, the additional special tokens are still present.

...
    {
      "id": 3,
      "content": "<extra_id_99>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 4,
      "content": "<extra_id_98>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 5,
      "content": "<extra_id_97>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
...
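
For reference, the same list can be pulled out of the saved tokenizer.json programmatically. A rough sketch, reusing the pretrained_tokenizer_path variable from the snippet above:

    import json
    import os

    # tokenizer.json keeps the extra special tokens in the top-level "added_tokens" list
    with open(os.path.join(pretrained_tokenizer_path, "tokenizer.json")) as f:
        tokenizer_json = json.load(f)

    # the <extra_id_*> entries still show up here, even though extra_ids=0 was passed
    print([t["content"] for t in tokenizer_json["added_tokens"] if t["special"]])
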
@ArthurZucker (Collaborator)

Hey! As mentioned in the documentation of train_new_from_iterator:

Trains a tokenizer on a new corpus with the same defaults (in terms of special tokens or tokenization pipeline) as the current one.

You should try to print tokenizer.additional_special_tokens right before you call train_new_from_iterator:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained(
...     "google/flan-t5-small",
...     additional_special_tokens=None,
...     extra_ids=0,
...     model_max_length=1e30,
... )
>>> print(tokenizer.additional_special_tokens)
['<extra_id_0>',
 '<extra_id_1>',
 '<extra_id_2>',
 '<extra_id_3>',
 ...]

This is because calling from_pretrained will overwrite the special tokens after the init. This is expected. If you don't want any special tokens, you have two possibilities:

  1. You initialise a new tokenizer, using tokenizer = T5Tokenizer(vocab_file, extra_ids=0).
  2. You remove the special tokens after from_pretrained: tokenizer.additional_special_tokens = []. The problem with 2 is that you would still have:
>>> tokenizer.encode("<extra_id_0>")
[32099, 1]

because the internal state is not updated.
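
For illustration, option 1 would look roughly like this (a sketch, assuming you already have a local SentencePiece model file such as spiece.model):

    from transformers import T5Tokenizer

    # building the tokenizer directly from the vocab file with extra_ids=0 never registers the sentinels
    tokenizer = T5Tokenizer("spiece.model", extra_ids=0)
    print(tokenizer.additional_special_tokens)  # []
    print(tokenizer.encode("<extra_id_0>"))     # split into ordinary subword ids, no single special id

Because the special tokens are never added in the first place, both the Python attributes and the underlying vocabulary stay consistent.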

@ArthurZucker (Collaborator)

I would recommend that you try 1. 😉


Kinyugo commented Jun 16, 2023

Hello @ArthurZucker, thank you for your reply. I had thought about going with 1, but I am not sure how to create a vocab file using the tokenizers library. Previously I worked with the SentencePiece library alone, and it had a .Train method that took care of everything. How do I go about this using the tokenizers library?


Kinyugo commented Jun 16, 2023

I have resorted to training a SentencePiece tokenizer and then using that vocab file for the T5 tokenizer.

    import io
    import os

    import sentencepiece as spm
    from transformers import T5Tokenizer

    os.makedirs(os.path.join(config.output_dir_path, "tokenizer"), exist_ok=True)
    spm_filepath = os.path.join(config.output_dir_path, "tokenizer", "spiece.model")

    # Train a SentencePiece model in memory, with no <extra_id_*> sentinels
    spm_model = io.BytesIO()
    spm.SentencePieceTrainer.Train(
        sentence_iterator=(example["text"] for example in split_dataset["train"]),
        model_writer=spm_model,
        vocab_size=config.vocab_size,
        num_threads=config.num_proc,
        pad_id=0,
        bos_id=-1,
        eos_id=1,
        unk_id=2,
        character_coverage=1.0,
    )

    # Write the trained model to disk so it can be used as the T5 vocab file
    with open(spm_filepath, "wb") as f:
        f.write(spm_model.getvalue())

    # Build a T5 tokenizer from the new vocab file; extra_ids=0 keeps the sentinels out
    tokenizer = T5Tokenizer(spm_filepath, extra_ids=0, model_max_length=1e30)
    print(f"Tokenizer Vocab Size {len(tokenizer)}")
    pretrained_tokenizer_path = os.path.join(config.output_dir_path, "tokenizer")
    tokenizer.save_pretrained(pretrained_tokenizer_path)

Let me know if there is already an easier/native way to do this.
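
A sanity check that may be worth appending to the script above (a sketch, reusing pretrained_tokenizer_path and config from that snippet): reload the saved tokenizer and confirm that no sentinel tokens survived.

    # reload from disk and verify that no <extra_id_*> tokens were re-added
    reloaded = T5Tokenizer.from_pretrained(pretrained_tokenizer_path)
    print(reloaded.additional_special_tokens)  # expected: []
    print(len(reloaded))                       # should match config.vocab_size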

@ArthurZucker (Collaborator)

I think this is the correct approach 😉


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Dec 20, 2023
@ArthurZucker (Collaborator)

BTW, the reason behind this is that T5TokenizerFast and the slow T5Tokenizer always add special tokens when initializing, which is not a good practice.
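
A minimal illustration of that default behaviour (a sketch, assuming a local spiece.model file):

    from transformers import T5Tokenizer

    # with the default extra_ids=100, the init itself appends 100 <extra_id_*> sentinels
    tok = T5Tokenizer("spiece.model")
    print(len(tok.additional_special_tokens))  # 100

    # passing extra_ids=0 at init is what keeps them out
    tok_clean = T5Tokenizer("spiece.model", extra_ids=0)
    print(len(tok_clean.additional_special_tokens))  # 0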

github-actions bot removed the Stale label Dec 22, 2023

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Jan 21, 2024
github-actions bot closed this as not planned Jan 27, 2024