Add support for CharFilter #409

Open
fulmicoton opened this issue Sep 6, 2018 · 0 comments

Is your feature request related to a problem? Please describe.

Sometimes it is handy to run some text processing before tokenization, at the char level.
For instance, lowercasing could happen at that layer, and so could Unicode normalization (#407).

Resolving HTML entities and removing tags could also happen here, as could handling diacritics.
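
To make the idea concrete, here is a minimal sketch of a char-level pass that runs before any tokenizer sees the text. The function name `strip_tags_and_lowercase` is hypothetical and not part of any existing API; the tag handling is deliberately naive (it drops everything between `<` and `>` and does not resolve entities or handle comments).

```rust
/// Hypothetical helper, for illustration only.
/// Drops everything between '<' and '>' and lowercases what remains.
fn strip_tags_and_lowercase(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            // Entering a tag: stop emitting characters.
            '<' => in_tag = true,
            // Leaving a tag: resume emitting characters.
            '>' if in_tag => in_tag = false,
            // Inside a tag: skip.
            _ if in_tag => {}
            // Regular text: lowercasing may yield more than one char.
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

Calling `strip_tags_and_lowercase("<b>Hello</b> World")` gives `"hello world"`, which a tokenizer could then split as usual. Note that the output no longer lines up with the input byte for byte.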

Describe the solution you'd like

Following Lucene, we could introduce the concept of a CharFilter that would run before the tokenizer.
It might be time to rename the tokenizer module to analyzer, or even "text", and have an analyzer
embed a charstream AND a tokenizer.

A bit of juggling will be required to keep track of the original char offsets.
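
One possible shape for this, as a rough sketch only: the filter returns the filtered text together with a per-byte map back to the original text, and the tokenizer uses that map to report offsets in the original. Every name here (`CharFilter`, `Filtered`, `StripTags`, `tokenize_with_original_offsets`) is an assumption made for illustration, not an existing API, and the whitespace tokenizer is a stand-in for a real one. The sketch also assumes the filter only deletes characters; a filter that rewrites characters would need a different mapping.

```rust
/// Filtered text plus, for each byte of `text`, the byte offset
/// in the original input that it was copied from.
struct Filtered {
    text: String,
    offsets: Vec<usize>,
}

/// Hypothetical trait: a char-level pass that runs before the tokenizer.
trait CharFilter {
    fn filter(&self, input: &str) -> Filtered;
}

/// Naive example filter: drops everything between '<' and '>'.
struct StripTags;

impl CharFilter for StripTags {
    fn filter(&self, input: &str) -> Filtered {
        let mut text = String::new();
        let mut offsets = Vec::new();
        let mut in_tag = false;
        for (orig, c) in input.char_indices() {
            match c {
                '<' => in_tag = true,
                '>' if in_tag => in_tag = false,
                _ if in_tag => {}
                _ => {
                    text.push(c);
                    // One map entry per emitted byte.
                    offsets.extend((0..c.len_utf8()).map(|k| orig + k));
                }
            }
        }
        Filtered { text, offsets }
    }
}

/// Whitespace tokenizer over the filtered text.
/// Returns (token, start, end) with offsets into the ORIGINAL input.
fn tokenize_with_original_offsets(
    filter: &dyn CharFilter,
    input: &str,
) -> Vec<(String, usize, usize)> {
    let f = filter.filter(input);
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in f.text.char_indices() {
        if c.is_whitespace() {
            if let Some(s) = start.take() {
                // End is one past the original offset of the token's last byte.
                tokens.push((f.text[s..i].to_string(), f.offsets[s], f.offsets[i - 1] + 1));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        let end = f.text.len();
        tokens.push((f.text[s..].to_string(), f.offsets[s], f.offsets[end - 1] + 1));
    }
    tokens
}
```

With the input `"<b>Hello</b> World"`, the token `Hello` comes back with offsets 3 to 8, so slicing the original input with those offsets yields `Hello` without the surrounding tags. A per-byte map is the simplest thing that works; a more compact representation (for example, a sorted list of cumulative offset corrections) would likely be preferable in practice.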

@fulmicoton fulmicoton added this to the 0.8 milestone Sep 6, 2018
@fulmicoton fulmicoton modified the milestones: 0.8, 0.9 Dec 26, 2018