Tokenization issues #326
…n empirical data, to make sure this doesn't break other cases.
Thanks, am thinking these through. Currently the tokenizer is fairly conservative in segmentation: it tends to under-segment rather than over-segment. I think we should rather switch to often over-segmenting, and then use the […]. This sort of change takes some experimentation, though. It's at least partly an empirical question, because it's not easy to intuit which cases are common. I'll keep this ticket open and update when I've had a chance to experiment.
Are you aware of any quick fix for (3)?
@honnibal: I found another tokenization issue yesterday that was doing my head in. Possibly it's already mentioned in the above. `Turn on the tv.` = […]
This issue should be fixed with the recent updates to the language data. Re 1./2./3.: […]. Re 4.: the inconsistency should now be fixed – unless an exception is added, all infix hyphens are split. By default, all tokens are handled this way. If you want to add custom tokenization rules, for example to keep […]. Re 5.: to stay consistent with the parser training data, spaCy follows the Penn Treebank tokenization scheme, which splits […].
Sorry to bump this thread, but it seems like the special cases for English (e.g. […])
I am aware that I could just add some exceptions, but I don't think I could catch them all; I was wondering if there's any quick fix on your side.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Tokenization seems incorrect in a number of cases:

1. `Hello,world` is currently kept as one token, but should be `Hello`, `,`, `world`.
2. `.,;:hello:!.world` is currently kept as one token, but should be `.`, `,`, `;`, `:`, `hello`, `:`, `!`, `.`, `world`.
3. `.Hello world.` gives `.Hello`, `world`, `.` (but should be `.`, `Hello`, `world`, `.`).

I suppose dots are preserved as part of a token in case they make up an acronym, but they should not be allowed at the beginning. Basically, no punctuation should be allowed at the beginning, middle or end, except hyphens/dashes/en-dashes in the middle for compounds (as pointed out in #302) and dots for acronyms (in the middle or end):

- `a.m.` > `a.m.`
- `CIA.` > `CIA`, `.`
- `K.G.B.` > `K.G.B.`
- `.A.` > `.`, `A`, `.`
- `.AB.` > `.`, `AB`, `.`
- `.AB.C` > `.`, `AB`, `.`, `C`
- `.AB.C.` > `.`, `AB.C.`
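The edge-punctuation rule proposed above (no punctuation at the start of a token, infix hyphens and acronym dots kept) could be sketched with a small regex. This is a hypothetical illustration of the suggestion, not spaCy's actual tokenizer, and it diverges from one of the listed expectations: it would keep `AB.C` together in the `.AB.C` case.

```python
import re

# Hypothetical sketch of the rule proposed above, NOT spaCy's tokenizer:
# words keep infix hyphens and acronym dots (plus an acronym-final dot);
# every other punctuation character becomes its own token.
TOKEN = re.compile(r"""
      [A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+\.?   # acronyms: a.m., K.G.B.
    | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)+       # hyphen compounds: jack-in-the-box
    | [A-Za-z0-9]+                         # plain words
    | [^\sA-Za-z0-9]                       # any other punctuation, one char each
""", re.VERBOSE)

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize(".Hello world."))    # ['.', 'Hello', 'world', '.']
print(tokenize("CIA. K.G.B."))      # ['CIA', '.', 'K.G.B.']
print(tokenize("jack-in-the-box"))  # ['jack-in-the-box']
```

Ordering the alternatives longest-first matters: the acronym and compound branches must be tried before the plain-word branch, or `a.m.` would be broken up at its dots.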
Something like `E.ON` (the energy supplier) would cause trouble, but it would be a rare exception (in fact, it should be `E·ON`).

4. Hyphenated compounds are split inconsistently:
   - `next-of-kin` is currently `next`, `-`, `of-kin`
   - `three-year-old` is currently `three`, `-`, `year-old`
   - `jack-in-the-box` is currently `jack`, `-`, `in-the-box`

   But they should be one word. The third case is particularly interesting, as it generates a token with more than one hyphen (`in-the-box`). Clearly, the tokenizer seems to split only on the first hyphen.

5. `cannot` is currently tokenized as `can`, `not`. Strict grammarians would say there is a difference between these two forms, so `cannot` should not be tokenized as `can`, `not`. I understand spaCy might not want to make this distinction, in which case I wonder how I can force the tokenizer to keep `cannot` as one word without modifying any files. Ideally, I'd like to add this exception dynamically while/after loading spaCy.

Thank you.
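For what it's worth, spaCy does expose a runtime hook for this kind of thing, `nlp.tokenizer.add_special_case`, which registers an exception string and the tokens it should produce without editing any data files. The snippet below is a deliberately simplified pure-Python sketch of that mechanism (all names here are hypothetical, not spaCy's API): a user exceptions table consulted before the default rules.

```python
# Hypothetical sketch (not spaCy's internals): a runtime exceptions table
# that is consulted before the tokenizer's default rules.
DEFAULT_RULES = {"cannot": ["can", "not"]}  # mimics the behaviour reported above
user_rules = {}

def add_special_case(string, pieces):
    """Register a user exception at runtime; it overrides the defaults."""
    user_rules[string] = list(pieces)

def tokenize_words(text):
    out = []
    for word in text.split():
        out.extend(user_rules.get(word) or DEFAULT_RULES.get(word) or [word])
    return out

print(tokenize_words("I cannot fly"))   # ['I', 'can', 'not', 'fly']
add_special_case("cannot", ["cannot"])  # keep "cannot" as a single token
print(tokenize_words("I cannot fly"))   # ['I', 'cannot', 'fly']
```

Because the user table is checked first, an exception added after loading shadows the built-in rule without touching it, which is exactly the "dynamically while/after loading" behaviour asked for above.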