
Bad tokenization of hyphenated words and em dash separated tokens #302

Closed
rlvoyer opened this issue Mar 23, 2016 · 4 comments
Labels
bug Bugs and behaviour differing from documentation

Comments


rlvoyer commented Mar 23, 2016

I just observed a couple of obvious tokenization errors related to the handling of hyphens:

In [274]: s = u"The well-bred maid instinctively makes little of a guest's accident, and decent--let alone well-bred--people."

In [275]: doc = nlp(s)

In [276]: [t for t in doc]
Out[276]:
[The ,
 well,
 -,
 bred ,
 maid ,
 instinctively ,
 makes ,
 little ,
 of ,
 a ,
 guest,
 's ,
 accident,
 , ,
 and ,
 decent--let ,
 alone ,
 well,
 -,
 bred--people,
 .]

Any suggestions for better handling of hyphens and em dashes?

@rlvoyer rlvoyer changed the title Bad tokenization of hyphenated words and em-dash separated tokens Bad tokenization of hyphenated words and em dash separated tokens Mar 23, 2016
@syllog1sm syllog1sm added the bug Bugs and behaviour differing from documentation label Mar 29, 2016
@honnibal
Member

I've added a rule to the infix.txt file that addresses most of the problem. We still have some difficulty with the case of well-bred--people; that one requires a change to the Tokenizer class itself, not just to the rules being loaded. This has been reported before and we do intend to fix it, but we haven't gotten to it yet.

The next data release will include the updated tokenizer rule. If you want to work around the problem in the meantime, you can add the rule to the infix.txt file in your local copy of the data.
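For illustration, here is a minimal standard-library sketch (not spaCy's actual implementation, and `split_infixes` is a hypothetical name) of the kind of infix rule involved: split a token on a double hyphen used as an em dash, while leaving single intra-word hyphens intact.

```python
import re

# Hypothetical infix pattern: a double hyphen acting as an em dash.
# A single hyphen inside a word (e.g. "well-bred") is deliberately
# NOT matched, so hyphenated compounds stay together.
INFIX_RE = re.compile(r"--")

def split_infixes(token):
    """Split a whitespace-delimited token on '--', emitting '--' as its own token."""
    parts = []
    last = 0
    for m in INFIX_RE.finditer(token):
        if m.start() > last:
            parts.append(token[last:m.start()])
        parts.append(m.group())  # keep the separator as a token
        last = m.end()
    if last < len(token):
        parts.append(token[last:])
    return parts

print(split_infixes("decent--let"))    # ['decent', '--', 'let']
print(split_infixes("well-bred"))      # ['well-bred']
print(split_infixes("bred--people"))   # ['bred', '--', 'people']
```

With a rule along these lines, the example sentence tokenizes to `decent`, `--`, `let`, ..., `well-bred`, `--`, `people` instead of fusing the em dash into its neighbors.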

@rlvoyer
Author

rlvoyer commented Mar 29, 2016

I'm happy to send a PR for this if it's a fix that doesn't require exhaustive knowledge of the code. A pointer in the right direction would be awesome.

@honnibal
Member

Sorry for the delay on this. I think I've got it fixed.

I was reluctant to set you loose on it, since the code in the tokenizer is quite tricky, and there's not much documentation.

@honnibal honnibal closed this as completed May 4, 2016
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018
3 participants