Commit

This reverts commit ab8e9d9.
fabiocapsouza authored Nov 15, 2020
1 parent 424f76a commit f43899b
Showing 2 changed files with 1 addition and 8 deletions.
6 changes: 0 additions & 6 deletions docs/source/preprocessing.rst
@@ -284,12 +284,6 @@ The tokenizer also accept pre-tokenized inputs. This is particularly useful when
predictions in `named entity recognition (NER) <https://en.wikipedia.org/wiki/Named-entity_recognition>`__ or
`part-of-speech tagging (POS tagging) <https://en.wikipedia.org/wiki/Part-of-speech_tagging>`__.

-.. warning::
-
-    Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them though the tokenizer
-    if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
-    like BPE).
-
If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
tokenizer. For instance, we have:

3 changes: 1 addition & 2 deletions src/transformers/tokenization_utils_base.py
@@ -1088,8 +1088,7 @@ def all_special_ids(self) -> List[int]:
returned to provide some overlap between truncated and overflowing sequences. The value of this
argument defines the number of overlapping tokens.
is_pretokenized (:obj:`bool`, `optional`, defaults to :obj:`False`):
-    Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer
-    will skip the pre-tokenization step. This is useful for NER or token classification.
+    Whether or not the input is already tokenized.
pad_to_multiple_of (:obj:`int`, `optional`):
If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
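
The two docstring wordings in this revert differ on one point: whether "pre-tokenized" input is merely split into words or already fully tokenized. The following is a minimal toy sketch of the reading given in the removed text, where the word-splitting step is skipped but subword splitting still runs. Everything in it (`TOY_VOCAB`, `toy_subword_split`, `toy_encode`) is hypothetical and written for illustration only; it is not the library's implementation, and only the flag name `is_pretokenized` comes from the diff.

```python
from typing import List, Union

# Toy vocabulary of subword pieces; purely illustrative, not from any real model.
TOY_VOCAB = {"token", "##izer", "##s", "split", "the"}


def toy_subword_split(word: str) -> List[str]:
    """Greedy longest-match split of one word into pieces from TOY_VOCAB."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that do not start the word carry a "##" continuation prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in TOY_VOCAB:
                pieces.append(candidate)
                break
            end -= 1
        else:
            # No piece matched at this position: treat the whole word as unknown.
            return ["[UNK]"]
        start = end
    return pieces


def toy_encode(text: Union[str, List[str]], is_pretokenized: bool = False) -> List[str]:
    """Split into words (unless already done), then into subword pieces."""
    if is_pretokenized:
        words = list(text)  # caller already split the input into words
    else:
        if not isinstance(text, str):
            raise TypeError("expected a string when is_pretokenized is False")
        words = text.split()  # the word-splitting step
    # Subword splitting runs in both cases.
    return [piece for word in words for piece in toy_subword_split(word.lower())]


raw = toy_encode("split the tokenizers")
pre = toy_encode(["split", "the", "tokenizers"], is_pretokenized=True)
print(raw)
print(pre)
```

In this sketch both calls produce the same pieces, because the flag only skips the whitespace split. A word list passed in this way is still broken into subwords, which is why word-level labels for NER or POS tagging generally need to be realigned to the resulting pieces.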