tokenization issues: ' followed by s, m, t, etc #1
Comments
Are you saying that it should be tokenized as |
yes, those should be |
lmk if you need or want some assistance scripting changes like that |
I think I've got this, thanks! Will let you know if I need help though |
How about cases where a noun is followed by
|
Here's what I'm seeing when inspecting some processed data:
Can't find the cases you're talking about. Was that perhaps only for the raw annotated data? |
the possessive when i was going through the data myself, i'd occasionally fix them when i came across such errors
|
i'm fairly certain most of those can be cleaned up via a script... just look for again, i can take that on ... maybe i should just go ahead and do that |
If you could, that would be great. If you have time, of course. |
I'm about half done with checking incorrect
and in one file, |
alright, i have taken on the the others are still TODO |
US, titles, and ellipses are now cleared up. Would still like to look for decade+s |
did the decades as well maybe still need to look for |
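For the decade+s check mentioned above, a minimal sketch of how candidates could be located. This is an assumption-heavy illustration, not the script used in this thread: it assumes the data is already a list of string tokens, that a decade looks like 1980, 80, or '80, and that the split-off suffix is a standalone s or 's. The name find_split_decades is hypothetical.

```python
import re

# Assumed shape of a decade token: 1980, 80, or '80 (always ending in 0).
DECADE = re.compile(r"^'?(?:\d{2})?\d0$")


def find_split_decades(tokens):
    """Return the indices of decade-like tokens followed by a bare s or 's."""
    return [
        i
        for i in range(len(tokens) - 1)
        if DECADE.match(tokens[i]) and tokens[i + 1] in ("s", "'s")
    ]


# Usage: list the positions to inspect by hand before changing anything.
hits = find_split_decades(["back", "in", "the", "1980", "s"])
print(hits)
```

It only reports positions and does not rewrite anything, since the thread does not state the target form for decades.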
it's gets tokenized into three tokens: it, ', s
That should be fixed. Same with 'm, 't, etc.
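The fix requested in this issue could be scripted roughly as follows. This is a sketch under stated assumptions, not the project's actual cleanup code: it assumes the processed data is a list of string tokens, that the target form is two tokens (it + 's), and that the suffix list beyond s, m, and t is a guess at what "etc" covers. The name merge_clitics is hypothetical.

```python
# Suffixes that may follow the apostrophe. Only s, m, t come from the issue;
# the rest are an assumed reading of "etc".
CLITIC_SUFFIXES = {"s", "m", "t", "d", "re", "ve", "ll"}
APOSTROPHES = ("'", "\u2019")  # straight and curly apostrophe


def merge_clitics(tokens):
    """Rejoin a lone apostrophe token with the clitic suffix after it."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        is_split_clitic = (
            tok in APOSTROPHES
            and nxt is not None
            and nxt.lower() in CLITIC_SUFFIXES
            and merged  # require a preceding word to attach to
        )
        if is_split_clitic:
            merged.append(tok + nxt)
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged


print(merge_clitics(["it", "'", "s", "fine"]))  # ['it', "'s", 'fine']
```

A lone apostrophe used as a quote mark is left untouched unless the next token happens to be one of the suffixes, so the output is still worth spot-checking by hand.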