[kushalkodnad/tokenizer-registry] Introduce new registry for tokenizers #1386

Merged: 15 commits merged into main from kushalkodnad/tokenizer-registry on Jul 23, 2024

Conversation

kushalkodn-db
Contributor

@kushalkodn-db kushalkodn-db commented Jul 23, 2024

What Does This PR Do?

This PR introduces a new registry specifically for tokenizers. With it, the build_tokenizer() function in llmfoundry/utils/builders.py can build a tokenizer from the registry, as long as the tokenizer subclasses transformers.PreTrainedTokenizerBase. For instance, the existing TiktokenTokenizerWrapper in llmfoundry/tokenizers/tiktoken.py can be built from this registry instead of through if-else clauses in build_tokenizer, which become tedious as support for more tokenizers is added.

File-Specific Changes

llmfoundry/registry.py

This is where I created the tokenizers registry. A brief description of the llmfoundry tokenizers registry is defined in _tokenizers_description. The create_registry call for tokenizers follows the same pattern as the existing callbacks registry.

llmfoundry/utils/builders.py

The main idea is to replace the if-else clause that currently special-cases "tiktoken". Any tokenizer in the registry is constructed, as long as it is a subclass of transformers.PreTrainedTokenizerBase.
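
The lookup-then-construct idea can be sketched without any dependencies. Everything here is an assumption made for illustration: `TokenizerBase` stands in for transformers.PreTrainedTokenizerBase, `TOKENIZER_REGISTRY` is a plain dict rather than the llmfoundry registry, and `build_tokenizer_sketch` is not the actual build_tokenizer. An unregistered name is treated as an error in this sketch; what the real builder does in that case is not described in this PR.

```python
from typing import Any, Dict, Optional, Type


class TokenizerBase:
    """Stand-in for transformers.PreTrainedTokenizerBase (assumption)."""


class DummyTiktokenWrapper(TokenizerBase):
    """Hypothetical tokenizer used only to exercise the lookup."""

    def __init__(self, model_name: str = 'gpt-4'):
        self.model_name = model_name


# Hypothetical registry: a plain dict from name to tokenizer class.
TOKENIZER_REGISTRY: Dict[str, Type[Any]] = {'tiktoken': DummyTiktokenWrapper}


def build_tokenizer_sketch(
    name: str,
    kwargs: Optional[Dict[str, Any]] = None,
) -> TokenizerBase:
    kwargs = kwargs or {}
    cls = TOKENIZER_REGISTRY.get(name)
    if cls is None:
        raise KeyError(f'No tokenizer registered under {name!r}')
    # Only classes that share the base interface are constructed.
    if not (isinstance(cls, type) and issubclass(cls, TokenizerBase)):
        raise TypeError(f'{name!r} is not a TokenizerBase subclass')
    return cls(**kwargs)


tok = build_tokenizer_sketch('tiktoken', {'model_name': 'gpt-3.5-turbo'})
```

Because construction is driven by the registry key, supporting another tokenizer means adding a registry entry, not another branch in the builder.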

llmfoundry/tokenizers/__init__.py

Following the callbacks example, the addition to this file shows how to register a transformers.PreTrainedTokenizerBase subclass with the tokenizers registry. Here, I registered the tiktoken tokenizer, because it implements that interface.

@dakinggg dakinggg marked this pull request as ready for review July 23, 2024 02:44
@dakinggg dakinggg requested a review from a team as a code owner July 23, 2024 02:44
@kushalkodn-db kushalkodn-db merged commit 51949c4 into main Jul 23, 2024
9 checks passed
@kushalkodn-db kushalkodn-db deleted the kushalkodnad/tokenizer-registry branch July 23, 2024 23:18