diff --git a/docs/source/preprocessing.rst b/docs/source/preprocessing.rst
index f1435490874c33..76eade2f4d0cac 100644
--- a/docs/source/preprocessing.rst
+++ b/docs/source/preprocessing.rst
@@ -284,6 +284,12 @@ The tokenizer also accept pre-tokenized inputs. This is particularly useful when
 predictions in `named entity recognition (NER) `__ or `part-of-speech tagging (POS tagging) `__.
 
+.. warning::
+
+    Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
+    if that were the case) but just split into words (which is often the first step in subword tokenization algorithms
+    like BPE).
+
 If you want to use pre-tokenized inputs, just set :obj:`is_pretokenized=True` when passing your inputs to the
 tokenizer. For instance, we have:
diff --git a/src/transformers/tokenization_utils_base.py b/src/transformers/tokenization_utils_base.py
index 39d09b8e0d81f3..017fd77477e60d 100644
--- a/src/transformers/tokenization_utils_base.py
+++ b/src/transformers/tokenization_utils_base.py
@@ -1088,7 +1088,8 @@ def all_special_ids(self) -> List[int]:
                 returned to provide some overlap between truncated and overflowing sequences. The value of this
                 argument defines the number of overlapping tokens.
             is_pretokenized (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Whether or not the input is already tokenized.
+                Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer
+                will skip the pre-tokenization step. This is useful for NER or token classification.
             pad_to_multiple_of (:obj:`int`, `optional`):
                 If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
                 the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
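
A minimal sketch of the distinction this diff documents, assuming a transformers release where the
argument is still named :obj:`is_pretokenized` (it was later renamed) and using ``bert-base-cased``
purely as an illustrative checkpoint:

    # Pre-tokenized input: already split into words, NOT into subword tokens.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    words = ["Hello", "I'm", "a", "single", "sentence"]
    encoding = tokenizer(words, is_pretokenized=True)

    # The tokenizer skipped its own word-splitting step but still ran its
    # subword algorithm (WordPiece here), so one word can yield several tokens.
    print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
    # e.g. ['[CLS]', 'Hello', 'I', "'", 'm', 'a', 'single', 'sentence', '[SEP]']

Note how ``"I'm"`` is still broken into three subword tokens: passing pre-tokenized inputs only
skips the word-splitting step, which is exactly what the new warning is trying to convey.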