Corpus#removeWords is not working properly with unicode characters #8

namirsab · 2017-01-09T10:34:29Z

Observed

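If a document contains a word like zurück and the list of words to remove is ['zur'],
this step removes zur from inside the word, turning zurück into ück.
This happens because the function relies on word boundaries (\b), which in JavaScript regular expressions treat only ASCII letters, digits, and underscores as word characters. The ü is therefore seen as a non-word character, and a boundary is found between zur and ück.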
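
A minimal reproduction of the behaviour, assuming the function builds its pattern roughly like this. `removeWordsAscii` is a hypothetical stand-in, not the library's actual code:

```javascript
// Hypothetical stand-in for the removal step; the real implementation may differ.
// Assumes the words contain no regexp metacharacters.
function removeWordsAscii(text, words) {
  // \b only recognises [A-Za-z0-9_] as word characters.
  const pattern = new RegExp('\\b(?:' + words.join('|') + ')\\b', 'g');
  return text.replace(pattern, '');
}

// "ü" is not an ASCII word character, so \b matches between "r" and "ü".
const asciiResult = removeWordsAscii('zurück', ['zur']);
console.log(asciiResult); // prints "ück"
```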

Expected

The function uses a Unicode-compatible regexp.
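
One possible shape for such a regexp, as a sketch only. It assumes a JavaScript engine with ES2018 support (lookbehind and `\p{...}` property escapes with the `u` flag). `removeWordsUnicode` and `escapeRegExp` are hypothetical helper names, not part of the library:

```javascript
// Escape regexp metacharacters so each word is matched literally.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: removes whole words only, where "word character"
// means any Unicode letter, combining mark, digit, or underscore.
function removeWordsUnicode(text, words) {
  if (words.length === 0) return text;
  const wordChar = '[\\p{L}\\p{M}\\p{N}_]';
  const alternatives = words.map(escapeRegExp).join('|');
  // Lookarounds replace \b: no word character directly before or after.
  const pattern = new RegExp(
    '(?<!' + wordChar + ')(?:' + alternatives + ')(?!' + wordChar + ')',
    'gu'
  );
  return text.replace(pattern, '');
}

console.log(removeWordsUnicode('zurück', ['zur']));          // prints "zurück"
console.log(removeWordsUnicode('zur Post zurück', ['zur'])); // prints " Post zurück"
```

The lookarounds make the definition of a word character explicit, so letters such as ü no longer split a word.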
