Issue with `SentencePieceUnigramTokenizer` Handling Unknown Tokens #1576

Munikumar09 · 2024-07-22T10:20:05Z

Description:

When using `SentencePieceUnigramTokenizer` with a custom vocabulary, the constructor exposes no parameter for setting `unk_id`, so encoding text that contains pieces missing from the vocabulary raises an error.

Example:

Vocabulary: {'a': -1.23, 'b': -1.34, 'c': -1.45}
Encoding: tokenizer.encode("bcd")

Suggested Fix:

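from typing import List, Optional, Tuple

from tokenizers import Regex, Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.implementations.base_tokenizer import BaseTokenizer
from tokenizers.models import Unigram


class SentencePieceUnigramTokenizer(BaseTokenizer):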
    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
        unk_id: int = 0,
    ):
        if vocab is not None:
            tokenizer = Tokenizer(Unigram(vocab, unk_id=unk_id))
        else:
            tokenizer = Tokenizer(Unigram())

        tokenizer.normalizer = normalizers.Sequence([
            normalizers.Nmt(),
            normalizers.NFKC(),
            normalizers.Replace(Regex(" {2,}"), " "),
        ])
        tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement=replacement)
        tokenizer.decoder = decoders.Metaspace(replacement=replacement)

        parameters = {
            "model": "SentencePieceUnigram",
            "replacement": replacement,
            "add_prefix_space": add_prefix_space,
        }

        super().__init__(tokenizer, parameters)

Issue:

The current implementation does not handle the unk_id, leading to errors when encountering unknown tokens. Adding support for unk_id in the tokenizer initialization would resolve this issue.

Please let me know if there is an existing solution for this.


ArthurZucker · 2024-08-12T08:02:26Z

Well, in this case the unknown token itself is missing from the vocab that you passed to the Tokenizer, so there is nothing for `unk_id` to point at. Why not add "<unk>": 0 to the vocab?
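
A minimal sketch of that suggestion, under a few assumptions: the vocabulary is given as a list of `(piece, score)` tuples (the type the constructor above declares), the unknown piece is spelled `"<unk>"` with a score of `0.0`, and `Unigram` accepts the `unk_id` keyword in the way the suggested fix above uses it. The helper names `with_unk` and `build_tokenizer` are hypothetical and not part of the library.

```python
from typing import List, Tuple

Vocab = List[Tuple[str, float]]

UNK_TOKEN = "<unk>"  # assumed spelling of the unknown piece


def with_unk(vocab: Vocab, unk_token: str = UNK_TOKEN,
             unk_score: float = 0.0) -> Tuple[Vocab, int]:
    """Return (vocab, unk_id), prepending unk_token if it is missing."""
    for idx, (piece, _score) in enumerate(vocab):
        if piece == unk_token:
            # Already present: reuse its position as the unk id.
            return list(vocab), idx
    # Not present: put it first so that its id is 0.
    return [(unk_token, unk_score)] + list(vocab), 0


def build_tokenizer(vocab: Vocab):
    """Hypothetical helper: build a Unigram tokenizer that has an unk piece."""
    # Imported lazily so with_unk() can be used without tokenizers installed.
    from tokenizers import Tokenizer
    from tokenizers.models import Unigram

    full_vocab, unk_id = with_unk(vocab)
    return Tokenizer(Unigram(full_vocab, unk_id=unk_id))


vocab = [("a", -1.23), ("b", -1.34), ("c", -1.45)]
full_vocab, unk_id = with_unk(vocab)
```

With the unknown piece in the vocabulary, calling `build_tokenizer(vocab).encode("bcd")` should map the out-of-vocabulary `d` to that piece rather than failing, assuming the installed `tokenizers` version behaves as described in this thread.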
