
Tokenizer improvements #40

Open
sylvinus opened this issue Mar 16, 2016 · 3 comments

@sylvinus
Contributor

Our current tokenizer is... rather simple :)

Let's discuss reasonable short-term improvements, as well as some mid-term ideas.

We should take into account the way documents are indexed in elasticsearch (currently a big list of words) and the tokenization we could do on search queries (currently none).

@Sentimentron
Contributor

One thing that's occurred to me is that Python 2's re module isn't fully Unicode-aware. Picking some examples from Wikipedia's page on this:

>>> _RE_WHITESPACE.split(u']\u2029[')
[u']\u2029[']

Whereas in Python 3's interpreter:

>>> _RE_WHITESPACE.split(u']\u2029[')
[']', '[']

Back in Python 2, the string's own split() method actually handles this correctly:

>>> u']\u2029['.split()
[u']', u'[']
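
For reference, a minimal workaround sketch (assuming _RE_WHITESPACE is compiled from a \s+ pattern, which I haven't checked): passing re.UNICODE when compiling makes \s match Unicode whitespace such as U+2029 under Python 2 as well:

>>> import re
>>> # hypothetical re-definition with the Unicode flag; the real pattern may differ
>>> _RE_WHITESPACE = re.compile(r'\s+', re.UNICODE)
>>> _RE_WHITESPACE.split(u']\u2029[')
[u']', u'[']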

@sylvinus
Contributor Author

Right. One more reason not to use simple regexes for this :)

@Sentimentron
Contributor

I've just become aware of NLTK's nltk.tokenize.casual module, which might be appropriate for this job.
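
In case it helps, a quick sketch of what that might look like (function name as in NLTK 3.x's nltk.tokenize.casual; the sample text and the exact output shown are only indicative):

>>> from nltk.tokenize.casual import casual_tokenize
>>> casual_tokenize(u"Check out http://example.com :) #search")
[u'Check', u'out', u'http://example.com', u':)', u'#search']

It keeps URLs, emoticons and hashtags as single tokens, which is probably closer to what we'd want for search queries than a bare whitespace split.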
