In case we keep the option to preprocess text inside the library:
Probable problem here: https://github.com/InseeFrLab/torch-fastText/blob/main/torchFastText/datasets/tokenizer.py#L281-L282.
Basically, in the tokenizer's constructor, we count and map the words from the raw training_text.
However, if the user passes the preprocess parameter to tokenizer.tokenize, the preprocessed text is tokenized against the mapping and word counts that were established on the raw text, so the two no longer agree.
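To make the mismatch concrete, here is a minimal, self-contained sketch. It is not the library's actual code; the class, the preprocessing function, and the `preprocess_text` flag are illustrative stand-ins. It only shows how a vocabulary built on raw text can fail to cover tokens produced after preprocessing:

```python
import re


def toy_preprocess(text: str) -> str:
    # Illustrative preprocessing: lowercase and strip punctuation.
    return re.sub(r"[^\w\s]", "", text.lower())


class ToyTokenizer:
    def __init__(self, training_text: list[str]):
        # Vocabulary is built from the *raw* training text,
        # analogous to what happens in the tokenizer's constructor.
        self.word_to_id: dict[str, int] = {}
        for sentence in training_text:
            for word in sentence.split():
                self.word_to_id.setdefault(word, len(self.word_to_id))

    def tokenize(self, text: str, preprocess_text: bool = False) -> list[int]:
        # If the caller preprocesses here, the resulting tokens no longer
        # match the raw-text vocabulary built in __init__.
        if preprocess_text:
            text = toy_preprocess(text)
        return [self.word_to_id.get(word, -1) for word in text.split()]  # -1 = unknown


raw = ["Hello, World!"]
tok = ToyTokenizer(raw)
print(tok.tokenize("Hello, World!"))                        # [0, 1]  -- matches the vocabulary
print(tok.tokenize("Hello, World!", preprocess_text=True))  # [-1, -1] -- "hello"/"world" were never counted
```

If preprocessing stays available at tokenization time, the same preprocessing should presumably also be applied to the training text before the counts and mapping are built, so that both sides see the same token forms.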