
Bad tokenization of hyphenated words and em dash separated tokens #302

Closed
rlvoyer opened this issue Mar 23, 2016 · 4 comments
Labels
bug Bugs and behaviour differing from documentation

Comments


rlvoyer commented Mar 23, 2016

I just observed a couple of obvious tokenization errors related to the handling of hyphens:

In [274]: s = u"The well-bred maid instinctively makes little of a guest's accident, and decent--let alone well-bred--people."

In [275]: doc = nlp(s)

In [276]: [t for t in doc]
Out[276]:
[The ,
 well,
 -,
 bred ,
 maid ,
 instinctively ,
 makes ,
 little ,
 of ,
 a ,
 guest,
 's ,
 accident,
 , ,
 and ,
 decent--let ,
 alone ,
 well,
 -,
 bred--people,
 .]

Any suggestions for better handling of hyphens and em dashes?

@rlvoyer rlvoyer changed the title Bad tokenization of hyphenated words and em-dash separated tokens Bad tokenization of hyphenated words and em dash separated tokens Mar 23, 2016
@syllog1sm syllog1sm added the bug Bugs and behaviour differing from documentation label Mar 29, 2016
@honnibal
Member

I've added a rule to the infix.txt file that addresses most of the problem. We still have some difficulty with the case of well-bred--people; that one requires a change to the Tokenizer class itself, not just to the rules being loaded. This has been reported before and we do intend to fix it, but we haven't gotten to it yet.

The next data release will include the updated tokenizer rule. If you want to work around the problem in the meantime, you can add the rule to the infix.txt file in your local copy of the data.
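For illustration, here is a minimal standard-library sketch (not spaCy's actual implementation, and `split_infixes` is a hypothetical name) of the kind of infix rule involved: split a token on a double hyphen used as an em dash, while leaving single intra-word hyphens intact.

```python
import re

# Hypothetical infix pattern: a double hyphen acting as an em dash.
# A single hyphen inside a word (e.g. "well-bred") is deliberately
# NOT matched, so hyphenated compounds stay together.
INFIX_RE = re.compile(r"--")

def split_infixes(token):
    """Split a whitespace-delimited token on '--', emitting '--' as its own token."""
    parts = []
    last = 0
    for m in INFIX_RE.finditer(token):
        if m.start() > last:
            parts.append(token[last:m.start()])
        parts.append(m.group())  # keep the separator as a token
        last = m.end()
    if last < len(token):
        parts.append(token[last:])
    return parts

print(split_infixes("decent--let"))    # ['decent', '--', 'let']
print(split_infixes("well-bred"))      # ['well-bred']
print(split_infixes("bred--people"))   # ['bred', '--', 'people']
```

With a rule along these lines, the example sentence tokenizes to `decent`, `--`, `let`, ..., `well-bred`, `--`, `people` instead of fusing the em dash into its neighbors.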

@rlvoyer
Author

rlvoyer commented Mar 29, 2016

I'm happy to send a PR for this if it's a fix that doesn't require exhaustive knowledge of the code. A pointer in the right direction would be awesome.

@honnibal
Member

Sorry for the delay on this. I think I've got it fixed.

I was reluctant to set you loose on it, since the code in the tokenizer is quite tricky, and there's not much documentation.

@honnibal honnibal closed this as completed May 4, 2016
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018
3 participants