Corpus#removeWords is not working properly with unicode characters #8

Open
1 task
namirsab opened this issue Jan 9, 2017 · 0 comments

namirsab commented Jan 9, 2017

Observed

If a document contains a word such as zurück and the list of words to remove is ['zur'], this step removes zur from inside the word, turning zurück into ück.
This happens because the function relies on word boundaries (\b), which treat only ASCII letters, digits, and the underscore as word characters. The ü therefore counts as a non-word character, and a boundary is matched between r and ü.
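
The behaviour can be reproduced outside the library with a plain regular expression. This is only a minimal sketch: it assumes the function builds a JavaScript pattern of the form \bword\b, and `removeWordAsciiBoundary` is a hypothetical name, not the library's actual code.

```javascript
// Hypothetical stand-in for the removal step (assumption: the library
// wraps each word in \b ... \b). The word is assumed to contain no
// regexp special characters.
function removeWordAsciiBoundary(text, word) {
  const pattern = new RegExp(`\\b${word}\\b`, 'g');
  return text.replace(pattern, '');
}

// \b sees "r" as a word character and "ü" as a non-word character,
// so it reports a boundary in the middle of "zurück".
console.log(removeWordAsciiBoundary('zurück', 'zur')); // ück
```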

Expected

  • The function uses a Unicode-compatible regexp.
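
One possible way to get there, sketched under assumptions: replace \b with lookarounds built on Unicode property escapes, so that any letter, combining mark, digit, or underscore next to the word blocks the match. The names `escapeRegExp` and `removeWordsUnicode` are hypothetical and not part of the library, and the sketch needs a JavaScript engine that supports the `u` flag, `\p{…}`, and lookbehind.

```javascript
// Escape regexp syntax characters so each word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical Unicode-aware replacement for the \b-based removal.
function removeWordsUnicode(text, words) {
  if (words.length === 0) return text;
  const alternation = words.map(escapeRegExp).join('|');
  // A word counts as standalone only if it is not preceded or followed
  // by a letter (\p{L}), combining mark (\p{M}), digit (\p{N}), or "_".
  const wordChar = '[\\p{L}\\p{M}\\p{N}_]';
  const pattern = new RegExp(
    `(?<!${wordChar})(?:${alternation})(?!${wordChar})`,
    'gu'
  );
  return text.replace(pattern, '');
}

console.log(removeWordsUnicode('zurück', ['zur']));    // zurück
console.log(removeWordsUnicode('zur Stadt', ['zur'])); // " Stadt"
```

As with the original behaviour, surrounding whitespace is left in place, so removing a standalone word keeps the space that followed it.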