Is your feature request related to a problem? Please describe.
Sometimes it is handy to have some text processing happen before tokenization, at the character level.
For instance, lowercasing could happen at that layer, as could Unicode normalization (#407).
Resolving HTML entities and removing tags could also happen here.
So could handling diacritics.
Describe the solution you'd like
Following Lucene, we could introduce the concept of a CharFilter that runs before the tokenizer.
It might be time to rename the tokenizer module to analyzer or even "text", and have an analyzer
embed a charstream AND a tokenizer.
Some bookkeeping will be required so that token offsets still refer to positions in the original, unfiltered text.
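
To make the offset bookkeeping concrete, here is a rough sketch in Python, not a proposal for the actual API. All names (`CharMap`, `apply_char_filters`, `tokenize`) are hypothetical, and it assumes each filter maps one input char to zero or more output chars. That covers lowercasing and diacritics, but not multi-char rewrites such as HTML tag removal, which would need a real stream-based filter.

```python
import unicodedata
from typing import Callable, List, Tuple

# Hypothetical type: maps one input char to zero or more output chars.
CharMap = Callable[[str], str]


def lowercase(ch: str) -> str:
    # Lowercasing may change length (e.g. U+0130 becomes two chars).
    return ch.lower()


def strip_diacritics(ch: str) -> str:
    # Decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def apply_char_filters(text: str, filters: List[CharMap]) -> Tuple[str, List[int]]:
    """Return the filtered text plus, for each output char,
    the index of the original char it came from."""
    out_chars: List[str] = []
    offsets: List[int] = []
    for i, ch in enumerate(text):
        mapped = ch
        for f in filters:
            mapped = "".join(f(c) for c in mapped)
        for c in mapped:
            out_chars.append(c)
            offsets.append(i)
    return "".join(out_chars), offsets


def tokenize(filtered: str, offsets: List[int]) -> List[Tuple[str, int, int]]:
    """Whitespace tokenizer over the filtered text. Each token carries
    (text, start, end) where start/end are offsets in the ORIGINAL text."""
    tokens: List[Tuple[str, int, int]] = []
    start = None
    # The trailing space is a sentinel that flushes the last token.
    for pos, c in enumerate(filtered + " "):
        if c.isspace():
            if start is not None:
                tokens.append(
                    (filtered[start:pos], offsets[start], offsets[pos - 1] + 1)
                )
                start = None
        elif start is None:
            start = pos
    return tokens


if __name__ == "__main__":
    original = "Caf\u00e9  \u00c9T\u00c9"
    filtered, offsets = apply_char_filters(original, [lowercase, strip_diacritics])
    for text, start, end in tokenize(filtered, offsets):
        print(text, start, end, repr(original[start:end]))
```

The point of the offset list is that the tokenizer never needs to know filtering happened: it works on the filtered text and translates positions back at the end, so highlighting and snippets can still slice the original string.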