[SpeechT5] Decode function strips space after special token #26547

xenova · 2023-10-02T18:32:58Z

System Info

transformers version: 4.34.0.dev0
Platform: Windows-10-10.0.22621-SP0
Python version: 3.8.1
Huggingface_hub version: 0.16.4
Safetensors version: 0.3.3
Accelerate version: 0.23.0
Accelerate config: not found
PyTorch version (GPU?): 1.12.1+cu116 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

First, load the SpeechT5 tokenizer and encode a string that contains a character outside the vocabulary:

from transformers import SpeechT5Tokenizer
tokenizer = SpeechT5Tokenizer.from_pretrained('microsoft/speecht5_tts')
ids = tokenizer.encode("a = b")
# [4, 7, 4, 3, 4, 25, 2]    (3 = unknown token, 4 = metaspace)

Convert ids to tokens, showing that metaspace is added before and after the unknown token

tokenizer.convert_ids_to_tokens(ids)
# ['▁', 'a', '▁', '<unk>', '▁', 'b', '</s>']    (metaspace before and after unknown)

Decode, showing the space being removed after the unknown token.

tokenizer.decode(ids)
# "a <unk>b</s>"    (no space after <unk>)

Seems to be caused by this strip:

transformers/src/transformers/models/speecht5/tokenization_speecht5.py

Line 192 in 9ed538f

return out_string.strip()

Related to huggingface/tokenizers#826

Expected behavior

The decoded string should be "a <unk> b</s>" (with a space after <unk>)

ArthurZucker · 2023-10-03T11:56:07Z

Thanks for reporting! This is happening because:

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        current_sub_tokens = []
        out_string = ""
        for token in tokens:
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                out_string += self.sp_model.decode(current_sub_tokens) + token
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string.strip()

passes the tokens to the sentencepiece model only after they have been split at each special token, so self.sp_model decodes the following groups separately:

['▁', 'a', '▁']
['▁', 'b']

Sentencepiece removes the prefix space from each group, so the space that followed <unk> is lost.
This needs a fix 🎐
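
To make the mechanism above concrete, here is a minimal, self-contained sketch that reproduces the behaviour without loading the real model. Everything in it is an assumption for illustration: `fake_sp_decode` is a simplified stand-in for `self.sp_model.decode` (it only joins pieces, maps '▁' to a space, and drops one leading space), the special-token set is hard-coded, and `convert_fixed` is just one possible workaround, not necessarily the change that was merged upstream.

```python
METASPACE = "▁"
SPECIAL_TOKENS = {"<unk>", "</s>"}  # hard-coded for this sketch


def fake_sp_decode(pieces):
    """Simplified stand-in for sp_model.decode (assumption, not the real library)."""
    text = "".join(pieces).replace(METASPACE, " ")
    # Emulate removal of the prefix space at the start of the decoded group.
    return text[1:] if text.startswith(" ") else text


def convert_buggy(tokens):
    """Same control flow as the convert_tokens_to_string quoted above."""
    current, out = [], ""
    for token in tokens:
        if token in SPECIAL_TOKENS:
            out += fake_sp_decode(current) + token
            current = []
        else:
            current.append(token)
    out += fake_sp_decode(current)
    return out.strip()


def decode_group(pieces, after_special):
    """Decode one group, restoring the space dropped after a special token."""
    text = fake_sp_decode(pieces)
    if after_special and pieces and pieces[0].startswith(METASPACE):
        text = " " + text
    return text


def convert_fixed(tokens):
    """Hypothetical workaround: keep the leading space of later groups."""
    current, out, seen_special = [], "", False
    for token in tokens:
        if token in SPECIAL_TOKENS:
            out += decode_group(current, seen_special) + token
            current, seen_special = [], True
        else:
            current.append(token)
    out += decode_group(current, seen_special)
    return out.strip()


tokens = ["▁", "a", "▁", "<unk>", "▁", "b", "</s>"]
buggy = convert_buggy(tokens)
fixed = convert_fixed(tokens)
print(repr(buggy))
print(repr(fixed))
```

With the token list from the reproduction, the first function drops the space after <unk> and the second keeps it. Whether the real sentencepiece decoder matches this stand-in in every edge case (for example, consecutive special tokens) is not verified here.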

* Add vocoder to export
* Add tokenizer.json export for speecht5 models
* Update speecht5 supported models
* Create `SpeechT5Tokenizer`
* Add `ones` and `ones_like` tensor functions
* Add support for speecht5 text-to-speech
* Disambiguate `SpeechSeq2Seq` and `Seq2SeqLM`
* Create `TextToAudioPipeline`
* Add listed support for `text-to-audio` / `text-to-speech`
* Use unquantized vocoder by default
* Skip speecht5 unit tests for now. Due to bug in transformers: huggingface/transformers#26547
* Update example pipeline output
* Create simple in-browser TTS demo
* Add template README
* Delete package-lock.json
* Update required transformers.js version
* Add link to Transformers.js
* Double -> Single quotes
* Add link to text-to-speech demo
* Update sample speaker embeddings

ArthurZucker self-assigned this Oct 3, 2023

ArthurZucker mentioned this issue Jan 16, 2024

[SpeechT5Tokenization] Add copied from and fix the convert_tokens_to_string to match the fast decoding scheme #28522

Merged

ArthurZucker closed this as completed in #28522 Jan 16, 2024
