
Tokenizer improvements #40

Open
sylvinus opened this issue Mar 16, 2016 · 3 comments

@sylvinus
Contributor

Our current tokenizer is... rather simple :)

Let's discuss reasonable short-term improvements, as well as some mid-term ideas.

We should take into account the way documents are indexed in elasticsearch (currently a big list of words) and the tokenization we could do on search queries (currently none).

@Sentimentron
Contributor

One thing that's occurred to me is that Python 2's re module isn't fully Unicode-aware. Picking some examples from Wikipedia's page on this:

>>> _RE_WHITESPACE.split(u']\u2029[')
[u']\u2029[']

Whereas in Python 3's interpreter:

>>> _RE_WHITESPACE.split(u']\u2029[')
[']', '[']

Back in Python 2, the string's own split() method actually handles this correctly:

>>> u']\u2029['.split()
[u']', u'[']
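
For reference, a minimal workaround sketch (assuming _RE_WHITESPACE is compiled from a \s+ pattern, which I haven't checked): passing re.UNICODE when compiling makes \s match Unicode whitespace such as U+2029 under Python 2 as well:

>>> import re
>>> # hypothetical re-definition with the Unicode flag; the real pattern may differ
>>> _RE_WHITESPACE = re.compile(r'\s+', re.UNICODE)
>>> _RE_WHITESPACE.split(u']\u2029[')
[u']', u'[']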

@sylvinus
Contributor Author

Right. One more reason not to use simple regexes for this :)

@Sentimentron
Contributor

I've just become aware of NLTK's nltk.tokenize.casual module, which might be appropriate for this job.
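
In case it helps, a quick sketch of what that might look like (function name as in NLTK 3.x's nltk.tokenize.casual; the sample text and the exact output shown are only indicative):

>>> from nltk.tokenize.casual import casual_tokenize
>>> casual_tokenize(u"Check out http://example.com :) #search")
[u'Check', u'out', u'http://example.com', u':)', u'#search']

It keeps URLs, emoticons and hashtags as single tokens, which is probably closer to what we'd want for search queries than a bare whitespace split.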
