Bad tokenization of hyphenated words and em dash separated tokens #302
Comments
I've added a rule to the The next data release should include the updated tokenizer rule, allowing you to work with it. If you want to work around the problem currently, you could make the addition to the
I'm happy to send a PR for this if it's a fix that doesn't require exhaustive knowledge of the code. A pointer in the right direction would be awesome.
Sorry for the delay on this. I think I've got it fixed. I was reluctant to set you loose on it, since the code in the tokenizer is quite tricky, and there's not much documentation.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I just observed a couple of obvious tokenization errors related to handling of hyphens:
Any suggestions for better handling of hyphens and em dashes?
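For illustration, here is a minimal sketch of one way to split hyphens and dashes out of whitespace-delimited tokens using only Python's standard `re` module. It is not the project's tokenizer or the rule mentioned in the comments. The names `_DASH_SPLIT`, `split_dashes`, and `tokenize` are hypothetical, and treating a hyphen between two letters as its own token is an assumption about the desired behaviour.

```python
import re
from typing import List

# Hypothetical pattern (an assumption, not the library's rule):
# split on an em dash, an en dash, a double hyphen, or a single
# hyphen that sits directly between two ASCII letters.
_DASH_SPLIT = re.compile(r"(—|–|--|(?<=[A-Za-z])-(?=[A-Za-z]))")


def split_dashes(chunk: str) -> List[str]:
    """Split one whitespace-delimited chunk around dashes.

    The capturing group makes re.split keep each separator as its
    own item; empty strings at the edges are dropped.
    """
    return [piece for piece in _DASH_SPLIT.split(chunk) if piece]


def tokenize(text: str) -> List[str]:
    """Whitespace-split the text, then split each chunk on dashes."""
    tokens: List[str] = []
    for chunk in text.split():
        tokens.extend(split_dashes(chunk))
    return tokens


if __name__ == "__main__":
    print(tokenize("a well-known fact—really"))
```

Because the single-hyphen branch requires a letter on both sides, a leading minus sign or a numeric range such as `3-4` is left intact in this sketch. Whether hyphenated words should be split at all depends on the downstream task.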