
Additional special tokens re-added after calling train_new_from_iterator. #1277

Closed
Kinyugo opened this issue Jun 15, 2023 · 8 comments


Kinyugo commented Jun 15, 2023

I am trying to train a new tokenizer without additional special tokens, using a pretrained T5Tokenizer as the base. The issue is that even after setting additional_special_tokens=None and extra_ids=0, the tokenizer.json file still includes the additional special tokens.

Here is my code:

  import os

  from transformers import AutoTokenizer

  # Load the pretrained tokenizer, trying to disable the <extra_id_*> sentinel tokens
  tokenizer = AutoTokenizer.from_pretrained(
      "google/flan-t5-small",
      additional_special_tokens=None,
      extra_ids=0,
      model_max_length=1e30,
  )
  # Retrain it on the new corpus with the same settings
  tokenizer = tokenizer.train_new_from_iterator(
      batch_iterator(split_dataset["train"], config.batch_size),
      vocab_size=config.vocab_size,
      additional_special_tokens=None,
      extra_ids=0,
  )

  pretrained_tokenizer_path = os.path.join(config.output_dir_path, "tokenizer")
  tokenizer.save_pretrained(pretrained_tokenizer_path)

I expect tokenizer.json to only include the special tokens such as bos_token, pad_token, etc., but not the additional_special_tokens. However, the additional special tokens are still present.

...
    {
      "id": 3,
      "content": "<extra_id_99>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 4,
      "content": "<extra_id_98>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 5,
      "content": "<extra_id_97>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
...
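
For reference, the same list can be pulled out of the saved tokenizer.json programmatically. A rough sketch, reusing the pretrained_tokenizer_path variable from the snippet above:

    import json
    import os

    # tokenizer.json keeps the extra special tokens in the top-level "added_tokens" list
    with open(os.path.join(pretrained_tokenizer_path, "tokenizer.json")) as f:
        tokenizer_json = json.load(f)

    # the <extra_id_*> entries still show up here, even though extra_ids=0 was passed
    print([t["content"] for t in tokenizer_json["added_tokens"] if t["special"]])
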
@ArthurZucker (Collaborator)

Hey! As mentioned in the documentation of train_new_from_iterator:

Trains a tokenizer on a new corpus with the same defaults (in terms of special tokens or tokenization pipeline) as the current one.

You should try to print tokenizer.additional_special_tokens right before you call train_new_from_iterator:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained(
...     "google/flan-t5-small",
...     additional_special_tokens=None,
...     extra_ids=0,
...     model_max_length=1e30,
... )
>>> print(tokenizer.additional_special_tokens)
['<extra_id_0>',
 '<extra_id_1>',
 '<extra_id_2>',
 '<extra_id_3>',
 ...]

This is because calling from_pretrained will overwrite the special tokens after the init. This is expected. If you don't want any special tokens, you have two possibilities:

  1. You initialise a new tokenizer, using tokenizer = T5Tokenizer(vocab_file, extra_ids=0).
  2. You remove the special tokens after from_pretrained: tokenizer.additional_special_tokens = []. The problem with 2 is that you would still have:
>>> tokenizer.encode("<extra_id_0>")
[32099, 1]

because the internal state is not updated.
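
For illustration, option 1 would look roughly like this (a sketch, assuming you already have a local SentencePiece model file such as spiece.model):

    from transformers import T5Tokenizer

    # building the tokenizer directly from the vocab file with extra_ids=0 never registers the sentinels
    tokenizer = T5Tokenizer("spiece.model", extra_ids=0)
    print(tokenizer.additional_special_tokens)  # []
    print(tokenizer.encode("<extra_id_0>"))     # split into ordinary subword ids, no single special id

Because the special tokens are never added in the first place, both the Python attributes and the underlying vocabulary stay consistent.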

@ArthurZucker (Collaborator)

I would recommend that you try 1. 😉


Kinyugo commented Jun 16, 2023

Hello @ArthurZucker, thank you for your reply. I had thought about going with 1, but I am not sure how to create a vocab file using the tokenizers library. Previously I worked with the SentencePiece library alone, and it had a .Train method that took care of everything. How do I go about this using the tokenizers library?


Kinyugo commented Jun 16, 2023

I have resorted to training a SentencePiece tokenizer and then using that vocab file for the T5 tokenizer.

    import io
    import os

    import sentencepiece as spm
    from transformers import T5Tokenizer

    os.makedirs(os.path.join(config.output_dir_path, "tokenizer"), exist_ok=True)
    spm_filepath = os.path.join(config.output_dir_path, "tokenizer", "spiece.model")

    # Train a SentencePiece model in memory, with no <extra_id_*> sentinels
    spm_model = io.BytesIO()
    spm.SentencePieceTrainer.Train(
        sentence_iterator=(example["text"] for example in split_dataset["train"]),
        model_writer=spm_model,
        vocab_size=config.vocab_size,
        num_threads=config.num_proc,
        pad_id=0,
        bos_id=-1,
        eos_id=1,
        unk_id=2,
        character_coverage=1.0,
    )

    # Write the trained model to disk so it can be used as the T5 vocab file
    with open(spm_filepath, "wb") as f:
        f.write(spm_model.getvalue())

    # Build a T5 tokenizer from the new vocab file; extra_ids=0 keeps the sentinels out
    tokenizer = T5Tokenizer(spm_filepath, extra_ids=0, model_max_length=1e30)
    print(f"Tokenizer Vocab Size {len(tokenizer)}")
    pretrained_tokenizer_path = os.path.join(config.output_dir_path, "tokenizer")
    tokenizer.save_pretrained(pretrained_tokenizer_path)

Let me know if there is already an easier/native way to do this.
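
A sanity check that may be worth appending to the script above (a sketch, reusing pretrained_tokenizer_path and config from that snippet): reload the saved tokenizer and confirm that no sentinel tokens survived.

    # reload from disk and verify that no <extra_id_*> tokens were re-added
    reloaded = T5Tokenizer.from_pretrained(pretrained_tokenizer_path)
    print(reloaded.additional_special_tokens)  # expected: []
    print(len(reloaded))                       # should match config.vocab_size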

@ArthurZucker (Collaborator)

I think this is the correct approach 😉


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Dec 20, 2023
@ArthurZucker (Collaborator)

BTW, the reason behind this is that T5TokenizerFast and the slow T5Tokenizer always add special tokens when initializing, which is not a good practice.
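
A minimal illustration of that default behaviour (a sketch, assuming a local spiece.model file):

    from transformers import T5Tokenizer

    # with the default extra_ids=100, the init itself appends 100 <extra_id_*> sentinels
    tok = T5Tokenizer("spiece.model")
    print(len(tok.additional_special_tokens))  # 100

    # passing extra_ids=0 at init is what keeps them out
    tok_clean = T5Tokenizer("spiece.model", extra_ids=0)
    print(len(tok_clean.additional_special_tokens))  # 0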

github-actions bot removed the Stale label Dec 22, 2023

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Jan 21, 2024
github-actions bot closed this as not planned Jan 27, 2024