Documentation about loading a fast tokenizer within Transformers (#11029)

* Documentation about loading a fast tokenizer within Transformers

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
1 parent 6c25f52, commit 9f4e0c2. Showing 5 changed files with 111 additions and 7 deletions.
Using tokenizers from 🤗 Tokenizers
=======================================================================================================================

The :class:`~transformers.PreTrainedTokenizerFast` class depends on the `tokenizers
<https://huggingface.co/docs/tokenizers>`__ library. Tokenizers obtained from the 🤗 Tokenizers library can be loaded
directly into 🤗 Transformers.

Before getting into the specifics, let's start by creating a dummy tokenizer in a few lines:

.. code-block::

    >>> from tokenizers import Tokenizer
    >>> from tokenizers.models import BPE
    >>> from tokenizers.trainers import BpeTrainer
    >>> from tokenizers.pre_tokenizers import Whitespace

    >>> # A BPE model that maps out-of-vocabulary input to "[UNK]"
    >>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    >>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

    >>> # Split the raw text on whitespace and punctuation before applying BPE
    >>> tokenizer.pre_tokenizer = Whitespace()

    >>> files = [...]  # paths to the plain-text training files
    >>> tokenizer.train(files, trainer)

We now have a tokenizer trained on the files we defined. We can either continue using it in the current runtime, or
save it to a JSON file for future reuse.
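
Because the ``files`` list above is elided, that snippet cannot be run as written. As a minimal sketch, assuming a 🤗 Tokenizers release that provides ``Tokenizer.train_from_iterator``, the same kind of tokenizer can be trained on a small in-memory corpus instead. The ``corpus`` sentences and the ``show_progress=False`` setting are illustrative choices, not part of the original example.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical toy corpus standing in for the elided `files` list.
corpus = [
    "Hello world, this is a tiny corpus.",
    "Tokenizers split text into tokens.",
    "A fast tokenizer is backed by the tokenizers library.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,  # assumption: keep the output quiet for a tiny corpus
)

# train_from_iterator consumes any iterable of strings, so no files are needed.
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("Hello world")
print(encoding.tokens)
print(encoding.ids)
```

The special tokens are added to the vocabulary first, in the order they are listed, so ``tokenizer.token_to_id("[UNK]")`` is ``0`` in this sketch.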

Loading directly from the tokenizer object
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The
:class:`~transformers.PreTrainedTokenizerFast` class can be instantiated directly from it, by passing the instantiated
:obj:`tokenizer` object as the :obj:`tokenizer_object` argument:

.. code-block::

    >>> from transformers import PreTrainedTokenizerFast

    >>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to :doc:`the tokenizer
page <main_classes/tokenizer>` for more information.
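
One practical detail when wrapping a raw tokenizer object: the wrapper is not told which vocabulary entries act as the padding, unknown, or mask tokens unless they are passed explicitly. The sketch below assumes a tokenizer trained on a hypothetical in-memory corpus (via ``train_from_iterator``, so that it runs without any files), passes the special tokens as keyword arguments, and then pads a small batch.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# Hypothetical toy corpus; any iterable of strings would do.
corpus = ["hello world", "hello there", "a small world of tokens"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=special_tokens, show_progress=False)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Tell the wrapper which tokens play which role; otherwise e.g. pad_token is unset.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)

# Padding needs a pad token, which is why it was declared above.
batch = fast_tokenizer(["hello world", "hello"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence is padded on the right with the id of ``[PAD]``, and its attention mask is ``0`` at the padded positions.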
|
||
Loading from a JSON file | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
In order to load a tokenizer from a JSON file, let's first start by saving our tokenizer: | ||

.. code-block::

    >>> tokenizer.save("tokenizer.json")

The path to which we saved this file can be passed to the :class:`~transformers.PreTrainedTokenizerFast` initialization
method using the :obj:`tokenizer_file` parameter:

.. code-block::

    >>> from transformers import PreTrainedTokenizerFast

    >>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to :doc:`the tokenizer
page <main_classes/tokenizer>` for more information.
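
To check that the JSON round trip preserves the tokenizer, the ids produced by the original object can be compared with those produced by the reloaded wrapper. This is a minimal sketch under the same assumptions as before: the corpus is a hypothetical in-memory list, and a temporary directory is used in place of ``"tokenizer.json"`` so that nothing is left on disk.

```python
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Hypothetical toy corpus; any iterable of strings would do.
corpus = ["hello world", "hello there", "a small world of tokens"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.json")
    tokenizer.save(path)  # serialize the full tokenizer pipeline to JSON
    fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file=path)

text = "hello world"
ids_original = tokenizer.encode(text).ids
ids_reloaded = fast_tokenizer(text)["input_ids"]

# No post-processor is configured, so the wrapper adds no extra tokens.
print(ids_original == ids_reloaded)
```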