[kushalkodnad/tokenizer-registry] Introduce new registry for tokenizers #1386

kushalkodn-db · 2024-07-23T00:29:40Z

What Does This PR Do?

This PR introduces a new registry specifically for tokenizers. With it, the build_tokenizer() function in llmfoundry/utils/builders.py can build a tokenizer from the registry, as long as the tokenizer implements the transformers.PreTrainedTokenizerBase interface. For instance, the existing TiktokenTokenizerWrapper in llmfoundry/tokenizers/tiktoken.py can be built from this registry instead of through if-else clauses in build_tokenizer(), which would become tedious as support for more tokenizers is added.

File-Specific Changes

`llmfoundry/registry.py`

This is where I created the tokenizers registry. A brief description of the registry is defined in _tokenizers_description. I followed the way the callbacks registry is created and made the same create_registry call for tokenizers.

`llmfoundry/utils/builders.py`

The main idea is to replace the if-else clause that currently special-cases "tiktoken". As long as the tokenizer is a subclass of transformers.PreTrainedTokenizerBase, it will be constructed from the registry.
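A rough sketch of the lookup-based dispatch described above, using only the standard library. All names here (`TOKENIZERS`, `build_tokenizer_sketch`, `TiktokenLike`, `fallback`) are hypothetical; in particular, `fallback` stands in for whatever path handles names that are not registered, which this description does not spell out.

```python
from typing import Any, Callable, Dict, Type


class TokenizerBase:
    """Hypothetical stand-in for transformers.PreTrainedTokenizerBase."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class TiktokenLike(TokenizerBase):
    """Hypothetical stand-in for TiktokenTokenizerWrapper."""


# Hypothetical registry contents: tokenizer name -> tokenizer class.
TOKENIZERS: Dict[str, Type[TokenizerBase]] = {"tiktoken": TiktokenLike}


def build_tokenizer_sketch(
    name: str,
    kwargs: Dict[str, Any],
    fallback: Callable[..., Any],
) -> Any:
    """Build a registered tokenizer by name, else defer to `fallback`."""
    cls = TOKENIZERS.get(name)
    if cls is None:
        # Unregistered names take whatever path handled them before.
        return fallback(name, **kwargs)
    if not issubclass(cls, TokenizerBase):
        raise TypeError(f"{name!r} does not implement the tokenizer interface")
    return cls(**kwargs)
```

With this shape, supporting another tokenizer means adding a registry entry rather than another branch in the builder.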

`llmfoundry/tokenizers/__init__.py`

Following the callbacks example, the addition to this file shows how to register a transformers.PreTrainedTokenizerBase subclass with the tokenizers registry. Here, I registered the tiktoken tokenizer, because it implements that interface.
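As an illustration of registration at import time, the sketch below registers two subclasses by name in a module-level mapping. It assumes nothing about llmfoundry's actual registration call; `register_tokenizer`, `TOKENIZERS`, `TiktokenLike`, and `DummyTokenizer` are hypothetical names.

```python
from typing import Dict, Type


class TokenizerBase:
    """Hypothetical stand-in for transformers.PreTrainedTokenizerBase."""


# Hypothetical module-level registry: tokenizer name -> tokenizer class.
TOKENIZERS: Dict[str, Type[TokenizerBase]] = {}


def register_tokenizer(name: str, cls: Type[TokenizerBase]) -> None:
    """Add `cls` under `name`, enforcing the shared interface."""
    if not (isinstance(cls, type) and issubclass(cls, TokenizerBase)):
        raise TypeError(f"{cls!r} is not a TokenizerBase subclass")
    TOKENIZERS[name] = cls


class TiktokenLike(TokenizerBase):
    """Hypothetical stand-in for the tiktoken wrapper."""


class DummyTokenizer(TokenizerBase):
    """Hypothetical tokenizer of the kind a test might register."""


# These calls run when the module is imported, the way a registration
# placed in a package's __init__.py would.
register_tokenizer("tiktoken", TiktokenLike)
register_tokenizer("dummy", DummyTokenizer)
```

Because registration happens on import, any code that imports the package can then look the tokenizer up by name.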

kushalkodn-db added 2 commits July 22, 2024 17:23

Create new tokenizers registry

a7efff2

Add commented out section to build tokenizer from registry

3a8d570

kushalkodn-db requested review from b-chu and dakinggg July 23, 2024 00:29

kushalkodn-db added 9 commits July 22, 2024 17:49

Add new build_tokenizer_from_registry method for tokenizer creation

c8271b0

Register tiktoken to tokenizers registry

c08168e

Update build_tokenizer() to use the tokenizers registry

6800bb0

Merge branch 'main' into kushalkodnad/tokenizer-registry

b2fb7de

Remove unused import statement

e156310

Test out tokenizers registry functionality with dummy tokenizer

54f0ef9

Fixed pyright issue

ba8cac5

Merge branch 'main' into kushalkodnad/tokenizer-registry

275804b

Add tokenizers to expected_registry_names

11c5e71

dakinggg marked this pull request as ready for review July 23, 2024 02:44

dakinggg requested a review from a team as a code owner July 23, 2024 02:44

dakinggg approved these changes Jul 23, 2024

llmfoundry/registry.py (resolved)

tests/tokenizers/test_registry.py (outdated, resolved)

kushalkodn-db added 2 commits July 22, 2024 21:26

Add tokenizers registry to __all__

e83772c

Remove unnecessary parts of DummyTokenizer for test

c5d897f

b-chu approved these changes Jul 23, 2024

kushalkodn-db added 2 commits July 23, 2024 09:17

Merge branch 'main' into kushalkodnad/tokenizer-registry

afb136a

Merge branch 'main' into kushalkodnad/tokenizer-registry

3f3ce74

kushalkodn-db merged commit 51949c4 into main Jul 23, 2024
9 checks passed

kushalkodn-db deleted the kushalkodnad/tokenizer-registry branch July 23, 2024 23:18
