Add the Tokenizer object logic #1874

Merged 24 commits into main from joss-refactor-1-tokenizer on Oct 25, 2023
Conversation

@JosselinSomervilleRoberts (Contributor) commented on Oct 4, 2023:

This PR introduces the Tokenizer object and deprecates the tokenize and decode methods of Client.

Here are the main changes:

  • Client is now purely abstract, and there is a new CachableClient which requires a cache_config as well as a tokenizer. CachableClient still implements the tokenize and decode methods by calling its tokenizer and raising a deprecation warning.
  • The Tokenizer object is introduced. It is purely abstract and declares the same two methods. Most tokenizers inherit from CachableTokenizer, which handles all the caching and formatting of requests and responses (a minimal sketch of this structure follows the list). A few more details:
    • use_encode_in_cache_key is an attribute that controls the cache key. Since some tokenizers return both the ids and the strings, we keep both to make the Cache more powerful.
    • Some tokenizers currently do not use the encode argument in the request, which is a problem (AI21Tokenizer, HTTPModelTokenizer, LitGPTTokenizer, SimpleTokenizer).
    • The TiktokenTokenizer accepts a List[str] for the tokens in decode even though the type should be List[int]. This was needed to work around earlier issues, but it should be fixed.
  • AutoClient is now the only client that inherits directly from Client rather than CachableClient. The _get_tokenizer_client() -> Client method is deleted and replaced by a simpler _get_tokenizer() -> Tokenizer. AutoClient still supports the tokenize and decode methods for now (which is not ideal and should be removed in a future PR).
  • Some scripts have been adapted, but this might not be complete.
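
To make the structure above concrete, here is a minimal sketch of the class hierarchy. It is illustrative only: the dict-backed cache, the _tokenize_do_it / _decode_do_it helper names, and the simplified request dataclasses are assumptions; the merged code uses HELM's Cache and the richer request/result classes in helm.common.

import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class TokenizationRequest:
    tokenizer: str
    text: str
    encode: bool = False  # True: return token ids; False: return token strings


@dataclass(frozen=True)
class DecodeRequest:
    tokenizer: str
    tokens: List[int]


class Tokenizer(ABC):
    """Pure-abstract interface that takes over tokenize/decode from Client."""

    @abstractmethod
    def tokenize(self, request: TokenizationRequest) -> Dict[str, Any]:
        ...

    @abstractmethod
    def decode(self, request: DecodeRequest) -> Dict[str, Any]:
        ...


class CachableTokenizer(Tokenizer):
    """Handles caching and request/response formatting for concrete tokenizers."""

    # If the backend returns both ids and strings, `encode` can be dropped from
    # the cache key so a single cached entry serves both variants.
    use_encode_in_cache_key: bool = True

    def __init__(self) -> None:
        # Stand-in for HELM's Cache(cache_config); keyed by a serialized request.
        self._cache: Dict[str, Dict[str, Any]] = {}

    def _get_or_compute(self, key: Dict[str, Any], compute: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
        serialized = json.dumps(key, sort_keys=True)
        if serialized not in self._cache:
            self._cache[serialized] = compute()
        return self._cache[serialized]

    def tokenize(self, request: TokenizationRequest) -> Dict[str, Any]:
        key: Dict[str, Any] = {"tokenizer": request.tokenizer, "text": request.text}
        if self.use_encode_in_cache_key:
            key["encode"] = request.encode
        return self._get_or_compute(key, lambda: self._tokenize_do_it(request))

    def decode(self, request: DecodeRequest) -> Dict[str, Any]:
        key: Dict[str, Any] = {"tokenizer": request.tokenizer, "tokens": request.tokens}
        return self._get_or_compute(key, lambda: self._decode_do_it(request))

    @abstractmethod
    def _tokenize_do_it(self, request: TokenizationRequest) -> Dict[str, Any]:
        """Perform the actual (uncached) tokenization call."""
        ...

    @abstractmethod
    def _decode_do_it(self, request: DecodeRequest) -> Dict[str, Any]:
        """Perform the actual (uncached) decode call."""
        ...

A concrete tokenizer then only implements the two uncached helpers and inherits the caching behavior, while CachableClient forwards its deprecated tokenize/decode calls to its Tokenizer and emits a warning.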


Inline review comment on:

def decode(self, request: DecodeRequest) -> DecodeRequestResult:
Collaborator:

No action: we should have an eventual plan to remove tokenize and decode from this class e.g. by introducing AutoTokenizer or TokenizerFactory.

Contributor Author:

I added a TODO for this.
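
For reference, the kind of factory suggested above might look roughly like the following. This is a hypothetical sketch only; TokenizerFactory, its constructor-registry argument, and the organization-prefix lookup are illustrative assumptions, not the design that was eventually merged.

from typing import Callable, Dict


class TokenizerFactory:
    """Hypothetical factory mapping an organization prefix to a Tokenizer constructor."""

    def __init__(self, constructors: Dict[str, Callable[[], "Tokenizer"]]) -> None:
        self._constructors = constructors
        self._instances: Dict[str, "Tokenizer"] = {}

    def get_tokenizer(self, tokenizer_name: str) -> "Tokenizer":
        # "huggingface/gpt2" -> organization "huggingface"
        organization = tokenizer_name.split("/")[0]
        if organization not in self._instances:
            if organization not in self._constructors:
                raise ValueError(f"No tokenizer registered for organization: {organization}")
            self._instances[organization] = self._constructors[organization]()
        return self._instances[organization]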

@JosselinSomervilleRoberts marked this pull request as ready for review on October 7, 2023 at 00:43.
@yifanmai (Collaborator) left a comment:

Looks mostly good! I have just one high-level restructure request; see the comment on CachableTokenizer. I can re-review once that's done.

@yifanmai (Collaborator) left a comment:

Looks good at a high level. Will re-review tokenizers after the structure is finalized.

Collaborator:

No action required: We should probably delete some of these scripts eventually so that we don't have to keep maintaining them...

Contributor Author:

I agree

Collaborator:

I assume the tokenizer files are just copy-pasted, but let me know if there is anything specific I should look at.

Contributor Author:

Most of them, yes. I added a Cache for LitGPT and made a few syntax changes in most of them, but they should behave the same.

Collaborator:

This needs to be updated with the changes to #1876 and #1912... sorry about that.

Contributor Author:

Could you help me with that? It might be related to the error I describe below.

@JosselinSomervilleRoberts (Contributor Author) commented:

Cache Test Results: I tried to check that we reuse the existing Cache.
Here are the steps I followed to test this:

  • Delete my local cache
  • Run a lot of queries on this branch
  • Go to main and rerun the same query to check that everything is cached.
    I ran this:
    helm-run --conf-paths configs/tokenizers.conf --suite tokenizers_cache_test --max-eval-instances 5
    On this config:
entries: [
   {description: "billsum_legal_summarization:model=ai21/j2-jumbo", priority: 1},
   {description: "billsum_legal_summarization:model=AlephAlpha/luminous-base", priority: 1},
   {description: "billsum_legal_summarization:model=anthropic/claude-v1.3", priority: 1},
   {description: "billsum_legal_summarization:model=cohere/xlarge-20220609", priority: 1}, # HuggingFace error: Already borrowed
   {description: "billsum_legal_summarization:model=openai/davinci", priority: 1},
   {description: "billsum_legal_summarization:model=openai/gpt-4-32k-0613", priority: 1},
   {description: "billsum_legal_summarization:model=writer/palmyra-base", priority: 1},
   # {description: "billsum_legal_summarization:model=writer/silk-road", priority: 1}, # Internal error
   {description: "billsum_legal_summarization:model=simple/model1", priority: 1},

   # Together
   # {description: "billsum_legal_summarization:model=together/bloom", priority: 1}, # Not supported
   # {description: "billsum_legal_summarization:model=together/gpt-j-6b", priority: 1}, # Not supported
   {description: "billsum_legal_summarization:model=eleutherai/pythia-1b-v0", priority: 1},
   # {description: "billsum_legal_summarization:model=mistralai/mistral-7b-v0.1", priority: 1},
   # MistralAI error: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4096. Given: 8076 `inputs` tokens and 1024 `max_new_tokens`
   # {description: "billsum_legal_summarization:model=mosaicml/mpt-7b", priority: 1}, # Not supported
   {description: "billsum_legal_summarization:model=tiiuae/falcon-7b", priority: 1},
   # {description: "billsum_legal_summarization:model=together/yalm", priority: 1}, # Not supported
]

Here are the results:

CacheStats.print_status {
      prod_env/cache/ai21.sqlite: 22 queries, 0 computes
      prod_env/cache/perspectiveapi.sqlite: 50 queries, 0 computes
      prod_env/cache/AlephAlpha.sqlite: 50 queries, 0 computes
      prod_env/cache/anthropic.sqlite: 25 queries, 0 computes
      prod_env/cache/cohere.sqlite: 45 queries, 0 computes
      prod_env/cache/huggingface.sqlite: 185 queries, 5 computes
      prod_env/cache/openai.sqlite: 15 queries, 0 computes
      prod_env/cache/writer.sqlite: 5 queries, 0 computes
      prod_env/cache/simple.sqlite: 5 queries, 0 computes
      prod_env/cache/EleutherAI.sqlite: 45 queries, 0 computes
      prod_env/cache/eleutherai.sqlite: 5 queries, 0 computes
      prod_env/cache/tiiuae.sqlite: 50 queries, 0 computes
    }

Everything seems to work except for the HuggingFaceTokenizer. There are two problems:

  • 5 out of the 185 queries were recomputed; this needs investigating.
  • An error is currently raised (I think when we access a second HuggingFace tokenizer): "Already borrowed". This might be due to a bad merge.

@yifanmai (Collaborator) commented:

Cache test looks good, thanks!

"Already borrowed" is a known issue (#1421); as far as I can tell, it is a warning and does not cause failures.

Inline review comment on:

handle_module_not_found_error(e)


class LitGPTTokenizer(CachingTokenizer):
Collaborator:

Note that the Lit-GPT tokenizer was a singleton before, but it is no longer a singleton here.
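
If the single shared instance ever needs to be restored, a common pattern is a lazily initialized module-level instance, sketched below. This is illustrative only; the helper name and the constructor arguments are assumptions, not the merged code.

import threading
from typing import Optional

_lit_gpt_tokenizer_lock = threading.Lock()
_lit_gpt_tokenizer: Optional["LitGPTTokenizer"] = None


def get_lit_gpt_tokenizer(checkpoint_dir: str) -> "LitGPTTokenizer":
    """Lazily create and reuse a single LitGPTTokenizer instance (hypothetical helper)."""
    global _lit_gpt_tokenizer
    with _lit_gpt_tokenizer_lock:
        if _lit_gpt_tokenizer is None:
            # Hypothetical constructor arguments, for illustration only.
            _lit_gpt_tokenizer = LitGPTTokenizer(checkpoint_dir)
        return _lit_gpt_tokenizer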

@yifanmai merged commit 50e6565 into main on Oct 25, 2023.
@yifanmai deleted the joss-refactor-1-tokenizer branch on October 25, 2023 at 00:13.
brianwgoldman pushed a commit that referenced this pull request on Nov 6, 2023.