Add the Tokenizer object logic #1874

Merged 24 commits into main from joss-refactor-1-tokenizer on Oct 25, 2023
Conversation

@JosselinSomervilleRoberts (Contributor) commented on Oct 4, 2023:

This PR introduces the Tokenizer object and deprecates the tokenize and decode methods of Client.

Here are the main changes:

  • Client is now purely abstract, and there is a new CachableClient which requires a cache_config as well as a tokenizer. CachableClient still implements the tokenize and decode methods by calling its tokenizer and raising a deprecation warning.
  • The Tokenizer object is introduced. It is purely abstract and declares the same two methods. Most tokenizers inherit from CachableTokenizer, which handles all the caching and formatting of requests and responses (a minimal sketch of this structure follows the list). A few more details:
    • use_encode_in_cache_key is an attribute that controls the cache key. Since some tokenizers return both the ids and the strings, we keep both to make the Cache more powerful.
    • Some tokenizers currently do not use the encode argument in the request, which is a problem (AI21Tokenizer, HTTPModelTokenizer, LitGPTTokenizer, SimpleTokenizer).
    • The TiktokenTokenizer accepts a List[str] for the tokens in decode even though the type should be List[int]. This was needed to work around earlier issues, but it should be fixed.
  • AutoClient is now the only client that inherits directly from Client rather than CachableClient. The _get_tokenizer_client() -> Client method is deleted and replaced by a simpler _get_tokenizer() -> Tokenizer. AutoClient still supports the tokenize and decode methods for now (which is not ideal and should be removed in a future PR).
  • Some scripts have been adapted, but this might not be complete.
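
To make the structure above concrete, here is a minimal sketch of the class hierarchy. It is illustrative only: the dict-backed cache, the _tokenize_do_it / _decode_do_it helper names, and the simplified request dataclasses are assumptions; the merged code uses HELM's Cache and the richer request/result classes in helm.common.

import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class TokenizationRequest:
    tokenizer: str
    text: str
    encode: bool = False  # True: return token ids; False: return token strings


@dataclass(frozen=True)
class DecodeRequest:
    tokenizer: str
    tokens: List[int]


class Tokenizer(ABC):
    """Pure-abstract interface that takes over tokenize/decode from Client."""

    @abstractmethod
    def tokenize(self, request: TokenizationRequest) -> Dict[str, Any]:
        ...

    @abstractmethod
    def decode(self, request: DecodeRequest) -> Dict[str, Any]:
        ...


class CachableTokenizer(Tokenizer):
    """Handles caching and request/response formatting for concrete tokenizers."""

    # If the backend returns both ids and strings, `encode` can be dropped from
    # the cache key so a single cached entry serves both variants.
    use_encode_in_cache_key: bool = True

    def __init__(self) -> None:
        # Stand-in for HELM's Cache(cache_config); keyed by a serialized request.
        self._cache: Dict[str, Dict[str, Any]] = {}

    def _get_or_compute(self, key: Dict[str, Any], compute: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
        serialized = json.dumps(key, sort_keys=True)
        if serialized not in self._cache:
            self._cache[serialized] = compute()
        return self._cache[serialized]

    def tokenize(self, request: TokenizationRequest) -> Dict[str, Any]:
        key: Dict[str, Any] = {"tokenizer": request.tokenizer, "text": request.text}
        if self.use_encode_in_cache_key:
            key["encode"] = request.encode
        return self._get_or_compute(key, lambda: self._tokenize_do_it(request))

    def decode(self, request: DecodeRequest) -> Dict[str, Any]:
        key: Dict[str, Any] = {"tokenizer": request.tokenizer, "tokens": request.tokens}
        return self._get_or_compute(key, lambda: self._decode_do_it(request))

    @abstractmethod
    def _tokenize_do_it(self, request: TokenizationRequest) -> Dict[str, Any]:
        """Perform the actual (uncached) tokenization call."""
        ...

    @abstractmethod
    def _decode_do_it(self, request: DecodeRequest) -> Dict[str, Any]:
        """Perform the actual (uncached) decode call."""
        ...

A concrete tokenizer then only implements the two uncached helpers and inherits the caching behavior, while CachableClient forwards its deprecated tokenize/decode calls to its Tokenizer and emits a warning.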


Inline review comment on:

def decode(self, request: DecodeRequest) -> DecodeRequestResult:
Collaborator:

No action: we should have an eventual plan to remove tokenize and decode from this class e.g. by introducing AutoTokenizer or TokenizerFactory.

Contributor Author:

I added a TODO for this.
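
For reference, the kind of factory suggested above might look roughly like the following. This is a hypothetical sketch only; TokenizerFactory, its constructor-registry argument, and the organization-prefix lookup are illustrative assumptions, not the design that was eventually merged.

from typing import Callable, Dict


class TokenizerFactory:
    """Hypothetical factory mapping an organization prefix to a Tokenizer constructor."""

    def __init__(self, constructors: Dict[str, Callable[[], "Tokenizer"]]) -> None:
        self._constructors = constructors
        self._instances: Dict[str, "Tokenizer"] = {}

    def get_tokenizer(self, tokenizer_name: str) -> "Tokenizer":
        # "huggingface/gpt2" -> organization "huggingface"
        organization = tokenizer_name.split("/")[0]
        if organization not in self._instances:
            if organization not in self._constructors:
                raise ValueError(f"No tokenizer registered for organization: {organization}")
            self._instances[organization] = self._constructors[organization]()
        return self._instances[organization]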

@JosselinSomervilleRoberts marked this pull request as ready for review on October 7, 2023 at 00:43.
@yifanmai (Collaborator) left a comment:

Looks mostly good! I have just one high-level restructure request; see the comment on CachableTokenizer. I can re-review once that's done.

@yifanmai (Collaborator) left a comment:

Looks good at a high level. Will re-review tokenizers after the structure is finalized.

Collaborator:

No action required: We should probably delete some of these scripts eventually so that we don't have to keep maintaining them...

Contributor Author:

I agree

Collaborator:

I assume the tokenizer files are just copy-pasted, but let me know if there is anything specific I should look at.

Contributor Author:

Most of them, yes. I added a Cache for LitGPT and made a few syntax changes in most of them, but they should behave the same.

Collaborator:

This needs to be updated with the changes to #1876 and #1912... sorry about that.

Contributor Author:

Could you help me with that? It might be related to the error I describe below.

@JosselinSomervilleRoberts (Contributor Author) commented:

Cache Test Results: I tried to check that we reuse the existing Cache.
Here are the steps I followed to test this:

  • Delete my local cache
  • Run a lot of queries on this branch
  • Go to main and rerun the same query to check that everything is cached.
    I ran this:
    helm-run --conf-paths configs/tokenizers.conf --suite tokenizers_cache_test --max-eval-instances 5
    On this config:
entries: [
   {description: "billsum_legal_summarization:model=ai21/j2-jumbo", priority: 1},
   {description: "billsum_legal_summarization:model=AlephAlpha/luminous-base", priority: 1},
   {description: "billsum_legal_summarization:model=anthropic/claude-v1.3", priority: 1},
   {description: "billsum_legal_summarization:model=cohere/xlarge-20220609", priority: 1}, # HuggingFace error: Already borrowed
   {description: "billsum_legal_summarization:model=openai/davinci", priority: 1},
   {description: "billsum_legal_summarization:model=openai/gpt-4-32k-0613", priority: 1},
   {description: "billsum_legal_summarization:model=writer/palmyra-base", priority: 1},
   # {description: "billsum_legal_summarization:model=writer/silk-road", priority: 1}, # Internal error
   {description: "billsum_legal_summarization:model=simple/model1", priority: 1},

   # Together
   # {description: "billsum_legal_summarization:model=together/bloom", priority: 1}, # Not supported
   # {description: "billsum_legal_summarization:model=together/gpt-j-6b", priority: 1}, # Not supported
   {description: "billsum_legal_summarization:model=eleutherai/pythia-1b-v0", priority: 1},
   # {description: "billsum_legal_summarization:model=mistralai/mistral-7b-v0.1", priority: 1},
   # MistralAI error: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4096. Given: 8076 `inputs` tokens and 1024 `max_new_tokens`
   # {description: "billsum_legal_summarization:model=mosaicml/mpt-7b", priority: 1}, # Not supported
   {description: "billsum_legal_summarization:model=tiiuae/falcon-7b", priority: 1},
   # {description: "billsum_legal_summarization:model=together/yalm", priority: 1}, # Not supported
]

Here are the results:

CacheStats.print_status {
      prod_env/cache/ai21.sqlite: 22 queries, 0 computes
      prod_env/cache/perspectiveapi.sqlite: 50 queries, 0 computes
      prod_env/cache/AlephAlpha.sqlite: 50 queries, 0 computes
      prod_env/cache/anthropic.sqlite: 25 queries, 0 computes
      prod_env/cache/cohere.sqlite: 45 queries, 0 computes
      prod_env/cache/huggingface.sqlite: 185 queries, 5 computes
      prod_env/cache/openai.sqlite: 15 queries, 0 computes
      prod_env/cache/writer.sqlite: 5 queries, 0 computes
      prod_env/cache/simple.sqlite: 5 queries, 0 computes
      prod_env/cache/EleutherAI.sqlite: 45 queries, 0 computes
      prod_env/cache/eleutherai.sqlite: 5 queries, 0 computes
      prod_env/cache/tiiuae.sqlite: 50 queries, 0 computes
    }

Everything seems to work except for the HuggingFaceTokenizer. There are two problems:

  • 5 out of the 185 queries were recomputed; this needs investigating.
  • An error is currently raised (I think when we access a second HuggingFace tokenizer): "Already borrowed". This might be due to a bad merge.

@yifanmai (Collaborator) commented:

Cache test looks good, thanks!

"Already borrowed" is a known issue (#1421); as far as I can tell, it is a warning and does not cause failures.

Inline review comment on:

handle_module_not_found_error(e)


class LitGPTTokenizer(CachingTokenizer):
Collaborator:

Note that the Lit-GPT tokenizer was a singleton before, but it is no longer a singleton here.
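
If the single shared instance ever needs to be restored, a common pattern is a lazily initialized module-level instance, sketched below. This is illustrative only; the helper name and the constructor arguments are assumptions, not the merged code.

import threading
from typing import Optional

_lit_gpt_tokenizer_lock = threading.Lock()
_lit_gpt_tokenizer: Optional["LitGPTTokenizer"] = None


def get_lit_gpt_tokenizer(checkpoint_dir: str) -> "LitGPTTokenizer":
    """Lazily create and reuse a single LitGPTTokenizer instance (hypothetical helper)."""
    global _lit_gpt_tokenizer
    with _lit_gpt_tokenizer_lock:
        if _lit_gpt_tokenizer is None:
            # Hypothetical constructor arguments, for illustration only.
            _lit_gpt_tokenizer = LitGPTTokenizer(checkpoint_dir)
        return _lit_gpt_tokenizer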

@yifanmai merged commit 50e6565 into main on Oct 25, 2023.
@yifanmai deleted the joss-refactor-1-tokenizer branch on October 25, 2023 at 00:13.
brianwgoldman pushed a commit that referenced this pull request on Nov 6, 2023.