[SpeechT5] Decode function strips space after special token #26547

Closed
4 tasks
xenova opened this issue Oct 2, 2023 · 1 comment · Fixed by #28522

@xenova
Contributor

xenova commented Oct 2, 2023

System Info

  • transformers version: 4.34.0.dev0
  • Platform: Windows-10-10.0.22621-SP0
  • Python version: 3.8.1
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: 0.23.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 1.12.1+cu116 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. First load the speecht5 tokenizer

         from transformers import SpeechT5Tokenizer
         tokenizer = SpeechT5Tokenizer.from_pretrained('microsoft/speecht5_tts')
         ids = tokenizer.encode("a = b")
         # [4, 7, 4, 3, 4, 25, 2]    (3 = unknown token, 4 = metaspace)

  2. Convert ids to tokens, showing that metaspace is added before and after the unknown token

         tokenizer.convert_ids_to_tokens(ids)
         # ['▁', 'a', '▁', '<unk>', '▁', 'b', '</s>']    (metaspace before and after unknown)

  3. Decode, showing the space being removed after the unknown token.

         tokenizer.decode(ids)
         # "a <unk>b</s>"    (no space after <unk>)

Seems to be caused by this strip:

Related to huggingface/tokenizers#826

Expected behavior

The decoded string should be "a <unk> b</s>" (with a space after <unk>).

@ArthurZucker
Collaborator

ArthurZucker commented Oct 3, 2023

Thanks for reporting! This is happening because:

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        current_sub_tokens = []
        out_string = ""
        for token in tokens:
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                out_string += self.sp_model.decode(current_sub_tokens) + token
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string.strip()

passes the tokens to the sentencepiece model only after they have been split at each special token, so what self.sp_model sees is the following:

  1. ['▁', 'a', '▁']
  2. ['▁', 'b']

Each chunk is decoded on its own, and thus the prefix space is removed for both.
This needs a fix 🎐
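
The chunking behaviour described above can be reproduced without downloading the model. The following is only a minimal sketch: `toy_sp_decode` is a hypothetical stand-in that assumes the sentencepiece decoder maps the metaspace to a space and drops one leading space per call, and `convert_tokens_to_string_keep_space` is one possible workaround, not necessarily the change that was eventually merged.

```python
SPIECE_UNDERLINE = "▁"
# Assumption: only these two special tokens matter for the example.
SPECIAL_TOKENS = {"<unk>", "</s>"}


def toy_sp_decode(pieces):
    """Hypothetical stand-in for sp_model.decode.

    Joins the pieces, maps the metaspace to a space and drops one
    leading space, which is the behaviour assumed in this sketch.
    """
    text = "".join(pieces).replace(SPIECE_UNDERLINE, " ")
    return text[1:] if text.startswith(" ") else text


def convert_tokens_to_string(tokens):
    """Same control flow as the method quoted above, using the toy decoder."""
    current_sub_tokens, out_string = [], ""
    for token in tokens:
        if token in SPECIAL_TOKENS:
            out_string += toy_sp_decode(current_sub_tokens) + token
            current_sub_tokens = []
        else:
            current_sub_tokens.append(token)
    out_string += toy_sp_decode(current_sub_tokens)
    return out_string.strip()


def convert_tokens_to_string_keep_space(tokens):
    """Possible workaround: re-add the space that the decoder drops when a
    chunk following a special token starts with the metaspace."""
    current_sub_tokens, out_string = [], ""
    prev_is_special = False
    for token in tokens:
        if token in SPECIAL_TOKENS:
            out_string += toy_sp_decode(current_sub_tokens) + token
            current_sub_tokens = []
            prev_is_special = True
        else:
            if prev_is_special and token.startswith(SPIECE_UNDERLINE):
                out_string += " "
            current_sub_tokens.append(token)
            prev_is_special = False
    out_string += toy_sp_decode(current_sub_tokens)
    return out_string.strip()


tokens = ["▁", "a", "▁", "<unk>", "▁", "b", "</s>"]
buggy = convert_tokens_to_string(tokens)
patched = convert_tokens_to_string_keep_space(tokens)
print(repr(buggy))    # 'a <unk>b</s>'
print(repr(patched))  # 'a <unk> b</s>'
```

Tracking whether the previous token was special keeps the change local to the loop; the toy decoder would need to be swapped for the real `self.sp_model.decode` to check this against the actual tokenizer.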

ArthurZucker self-assigned this Oct 3, 2023
xenova added a commit to huggingface/transformers.js that referenced this issue Oct 23, 2023
* Add vocoder to export

* Add tokenizer.json export for speecht5 models

* Update speecht5 supported models

* Create `SpeechT5Tokenizer`

* Add `ones` and `ones_like` tensor functions

* Add support for speecht5 text-to-speech

* Disambiguate `SpeechSeq2Seq` and `Seq2SeqLM`

* Create `TextToAudioPipeline`

* Add listed support for `text-to-audio` / `text-to-speech`

* Use unquantized vocoder by default

* Skip speecht5 unit tests for now

Due to bug in transformers: huggingface/transformers#26547

* Update example pipeline output

* Create simple in-browser TTS demo

* Add template README

* Delete package-lock.json

* Update required transformers.js version

* Add link to Transformers.js

* Double -> Single quotes

* Add link to text-to-speech demo

* Update sample speaker embeddings