Unify Tokenizer Behavior and Ensure Sane Interfaces #936

Open
lapp0 opened this issue Jun 1, 2024 · 0 comments
lapp0 (Contributor) commented Jun 1, 2024

What behavior of the library made you think about the improvement?

@brandonwillard and I were looking into the LlamaCppTokenizer and noticed a number of issues:

  • It's not obvious that __getstate__ is used to serialize the tokenizer for hashing.
  • LlamaCppTokenizer and TransformerTokenizer are subclasses of outlines Tokenizer, but the vLLM tokenizer is not.
  • LlamaCppTokenizer.__init__ doesn't load special_tokens.
  • vLLM and transformers tokenizers use adapt_tokenizer, but llamacpp doesn't.
  • Tokenizers are intended to be immutable, but that isn't programmatically guaranteed.
  • __hash__ and _stablehash(serialized) are recomputed on every call rather than having their values cached (see the sketch after this list).
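For illustration, here is a minimal, hypothetical sketch of that pattern (the class and the choice of cloudpickle as serializer are assumptions, not outlines' actual code): __getstate__ quietly doubles as the serialization hook the cache relies on for hashing, and the hash is recomputed from scratch on every call.

```python
import cloudpickle  # assumed serializer, purely for illustration


class CurrentStyleTokenizer:
    """Hypothetical tokenizer mirroring the pattern criticized above."""

    def __init__(self, vocabulary, eos_token_id):
        self.vocabulary = vocabulary
        self.eos_token_id = eos_token_id

    def __getstate__(self):
        # Non-obvious: the caching layer relies on this to serialize the
        # tokenizer so it can be hashed.
        return {"vocabulary": self.vocabulary, "eos_token_id": self.eos_token_id}

    def __hash__(self):
        # Recomputed from scratch on every call; nothing is cached.
        return hash(cloudpickle.dumps(self.__getstate__()))
```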

How would you like it to behave?

There are a lot of minor changes here; please let me know if I've missed or accidentally excluded anything.

  • __getstate__ is a fallback for outlines.caching, and by default we implement _stablehash.
  • vLLM becomes an outlines Tokenizer and uses the standard interfaces.
  • Good parameterized tests for all three tokenizers.
  • outlines Tokenizer mutation is disabled.
  • adapt_tokenizer is removed; all models pass themselves to their respective Tokenizer to be constructed.
  • _stablehash and __hash__ are only calculated once (a sketch follows this list).
  • The llamacpp tokenizer should have identical "batch decoding" behavior to the other tokenizers. link
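A rough, hypothetical sketch of how the immutability and one-time hash calculation could fit together (class, attribute names, and the cloudpickle serializer are assumptions for illustration, not a concrete design):

```python
import cloudpickle  # assumed serializer, purely for illustration


class ProposedTokenizer:
    """Hypothetical sketch: immutable tokenizer with a one-time hash."""

    def __init__(self, vocabulary, eos_token_id):
        # Bypass the frozen __setattr__ during construction.
        object.__setattr__(self, "vocabulary", vocabulary)
        object.__setattr__(self, "eos_token_id", eos_token_id)

    def __setattr__(self, name, value):
        # Tokenizers are intended to be immutable: reject mutation after init.
        raise TypeError(f"{type(self).__name__} is immutable")

    def __getstate__(self):
        # Remains only a fallback for outlines.caching.
        return {"vocabulary": self.vocabulary, "eos_token_id": self.eos_token_id}

    def __hash__(self):
        # Computed once lazily, then reused on subsequent calls.
        if "_cached_hash" not in self.__dict__:
            object.__setattr__(
                self, "_cached_hash", hash(cloudpickle.dumps(self.__getstate__()))
            )
        return self._cached_hash
```

With this shape, repeated hashing by the cache is cheap, and accidental mutation (e.g. `tokenizer.eos_token_id = 5`) fails loudly instead of silently invalidating the hash.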

Some of the work needed to fix this can be resurrected from #676.

Status

On hold until ExLlamaV2 integration is complete (#807)
