In case we keep the option to preprocess text inside the library:
Probable problem here: https://github.com/InseeFrLab/torch-fastText/blob/main/torchFastText/datasets/tokenizer.py#L281-L282.
Basically, in the tokenizer's constructor, we count and map the words from the raw training_text.
However, if the user passes the preprocess parameter to tokenizer.tokenize, the preprocessed text is tokenized against the mapping and word counts that were established on the raw text, so the two no longer agree.
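To make the mismatch concrete, here is a minimal, self-contained sketch. It is not the library's actual code; the class, the preprocessing function, and the `preprocess_text` flag are illustrative stand-ins. It only shows how a vocabulary built on raw text can fail to cover tokens produced after preprocessing:

```python
import re


def toy_preprocess(text: str) -> str:
    # Illustrative preprocessing: lowercase and strip punctuation.
    return re.sub(r"[^\w\s]", "", text.lower())


class ToyTokenizer:
    def __init__(self, training_text: list[str]):
        # Vocabulary is built from the *raw* training text,
        # analogous to what happens in the tokenizer's constructor.
        self.word_to_id: dict[str, int] = {}
        for sentence in training_text:
            for word in sentence.split():
                self.word_to_id.setdefault(word, len(self.word_to_id))

    def tokenize(self, text: str, preprocess_text: bool = False) -> list[int]:
        # If the caller preprocesses here, the resulting tokens no longer
        # match the raw-text vocabulary built in __init__.
        if preprocess_text:
            text = toy_preprocess(text)
        return [self.word_to_id.get(word, -1) for word in text.split()]  # -1 = unknown


raw = ["Hello, World!"]
tok = ToyTokenizer(raw)
print(tok.tokenize("Hello, World!"))                        # [0, 1]  -- matches the vocabulary
print(tok.tokenize("Hello, World!", preprocess_text=True))  # [-1, -1] -- "hello"/"world" were never counted
```

If preprocessing stays available at tokenization time, the same preprocessing should presumably also be applied to the training text before the counts and mapping are built, so that both sides see the same token forms.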