Add support for `CharFilter` #409

fulmicoton · 2018-09-06T01:08:44Z

Is your feature request related to a problem? Please describe.

Sometimes, it is handy to have some text processing happen before tokenization, at the char level.
For instance, lowercasing could happen at that layer. Unicode normalization as well (#407).

Resolving HTML entities and removing tags could also happen here.
Handling diacritics as well.

Describe the solution you'd like

Following Lucene, we could introduce the concept of a `CharFilter` that runs before the tokenizer.
It might be time to rename the `tokenizer` module to `analyzer` or even `text`, and have an analyzer
embed a char stream and a tokenizer.
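
To make the idea concrete, here is a rough sketch of what such a layering could look like. Every name below (`CharFilter`, `LowerCaseCharFilter`, `Analyzer`, `whitespace_tokenizer`) is hypothetical and only meant to illustrate the shape; none of it is an existing API, and the tokenizer is reduced to a plain function for brevity.

```rust
/// Hypothetical trait: rewrites the raw text before it reaches the tokenizer.
trait CharFilter {
    fn filter(&self, text: &str) -> String;
}

/// Example char filter: lowercases the whole input.
struct LowerCaseCharFilter;

impl CharFilter for LowerCaseCharFilter {
    fn filter(&self, text: &str) -> String {
        text.to_lowercase()
    }
}

/// Hypothetical analyzer: a chain of char filters followed by a tokenizer.
/// The tokenizer is a plain function pointer here to keep the sketch short.
struct Analyzer {
    char_filters: Vec<Box<dyn CharFilter>>,
    tokenizer: fn(&str) -> Vec<String>,
}

impl Analyzer {
    /// Applies each char filter in order, then tokenizes the result.
    fn analyze(&self, text: &str) -> Vec<String> {
        let mut filtered = text.to_string();
        for char_filter in &self.char_filters {
            filtered = char_filter.filter(&filtered);
        }
        (self.tokenizer)(&filtered)
    }
}

/// Stand-in tokenizer: splits on Unicode whitespace.
fn whitespace_tokenizer(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}
```

With an analyzer built from `LowerCaseCharFilter` and `whitespace_tokenizer`, the tokenizer would only ever see lowercased text. Note that this version returns a fresh `String` and loses the original offsets.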

A bit of juggling will be required to keep track of the original char offsets.
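
As an illustration of that juggling, a char filter could return, alongside the filtered text, a table mapping each output byte back to its position in the original text. This is a minimal sketch under that assumption: `Filtered`, `original_offset` and `resolve_amp` are made-up names, only the `&amp;` entity is handled, and a real implementation would probably store sparse corrections rather than one entry per byte. Lucene's `CharFilter` exposes a `correctOffset` method for the same purpose.

```rust
/// Output of a hypothetical offset-aware char filter.
struct Filtered {
    text: String,
    /// offsets[i] is the byte offset in the original text that produced
    /// byte i of `text`; the last entry maps the end of `text`.
    offsets: Vec<usize>,
}

impl Filtered {
    /// Maps a byte offset in the filtered text back to the original text.
    fn original_offset(&self, filtered_offset: usize) -> usize {
        self.offsets[filtered_offset]
    }
}

/// Replaces every `&amp;` with `&`, recording where each output byte came from.
fn resolve_amp(original: &str) -> Filtered {
    const ENTITY: &str = "&amp;";
    let mut text = String::with_capacity(original.len());
    let mut offsets = Vec::with_capacity(original.len() + 1);
    let mut pos = 0;
    while pos < original.len() {
        if original[pos..].starts_with(ENTITY) {
            // Five input bytes collapse into a single output byte.
            text.push('&');
            offsets.push(pos);
            pos += ENTITY.len();
        } else {
            // Copy one char unchanged; each of its bytes maps to itself.
            let c = original[pos..].chars().next().unwrap();
            text.push(c);
            for i in 0..c.len_utf8() {
                offsets.push(pos + i);
            }
            pos += c.len_utf8();
        }
    }
    // Sentinel so that token end offsets can be mapped too.
    offsets.push(original.len());
    Filtered { text, offsets }
}
```

A tokenizer running on the filtered text would then translate each token's start and end through `original_offset`, so that highlighting still points at the right span of the original document, including the full `&amp;` entity.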

fulmicoton added the enhancement label Sep 6, 2018

fulmicoton added this to the 0.8 milestone Sep 6, 2018

fulmicoton modified the milestones: 0.8, 0.9 Dec 26, 2018

fulmicoton mentioned this issue Aug 20, 2019

Discussion : Roadmap to 1.0.0 #638

Open
