strange behavior of skipping characters in tokenization #784

Closed
bishanyang opened this issue Oct 26, 2018 · 3 comments
Labels
wrapper-bug A problem with a package that provides an API or interface to CoreNLP, but not CoreNLP itself

Comments

@bishanyang

I found that the PTBLexer sometimes decides to remove characters from the input text; as a result, the character offsets no longer apply to the original text. I wonder why that is and how to avoid it. For example, if the input is "%ACTL", the tokenizer removes "%AC", treating it as a single character, and parses only "TL". I tried setting different "untokenizable" options (e.g., "allKeep", "noneKeep"), but that doesn't seem to help. Could someone please help me with this? Thanks.

@bishanyang
Author

bishanyang commented Oct 26, 2018

I realized that this problem doesn't exist if I use the tokenizer from Java, so it may have something to do with Unicode characters. I am currently using the Python interface with Python 3.6 and passing text.encode('utf-8') as input to the CoreNLP server. Not sure if I did anything wrong.
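For context, here is a minimal sketch of the round trip a Python wrapper typically makes to the CoreNLP server; the server URL, port, and annotator list are assumptions for illustration, not details from this thread, and py-corenlp's internals may differ:

```python
# Sketch of sending text to a running CoreNLP server over HTTP.
# Assumes a server at localhost:9000 started with the default settings.
import json
import requests

text = "%ACTL"
props = {"annotators": "tokenize,ssplit", "outputFormat": "json"}

resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},  # properties travel as a URL query parameter
    data=text.encode("utf-8"),                 # the text itself travels as UTF-8 bytes in the body
)
tokens = [t["originalText"] for s in resp.json()["sentences"] for t in s["tokens"]]
print(tokens)  # if "%AC" gets mis-decoded along the way, only "TL" comes back
```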

@manning
Member

manning commented Oct 28, 2018

I don't think this is a Unicode thing but rather a "web thing". That is, things aren't being correctly marshalled/unmarshalled across the web-service requests. The requests are percent-encoded (https://en.wikipedia.org/wiki/Percent-encoding), so %AC is treated as the binary byte 0xAC. The Python side should avoid this by escaping the %, encoding it as %25, but it is failing to do so.
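The effect can be reproduced with the standard library's percent-encoding helpers; this is just an illustration of the decoding step, not code from CoreNLP or any wrapper:

```python
from urllib.parse import quote, unquote_to_bytes

text = "%ACTL"

# Decoded without escaping, "%AC" collapses into the single byte 0xAC and only "TL"
# remains as readable text: the behavior reported above.
print(unquote_to_bytes(text))      # b'\xacTL'

# Escaping the percent sign first ("%" becomes "%25") keeps the literal text intact.
escaped = quote(text, safe="")
print(escaped)                     # '%25ACTL'
print(unquote_to_bytes(escaped))   # b'%ACTL'
```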

What package are you using on the Python side? There are several Python interfaces to CoreNLP.

@bishanyang
Author

Thanks for your reply, Chris. I think you are right. I implemented a web service for the tokenizer using Java Spark and then used it from Python, and the problem seems to be gone now. I was using the Python interface from https://github.com/smilli/py-corenlp.
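For anyone hitting the original symptom, a quick sanity check that the returned offsets still line up with the input text can catch this kind of mangling early. This is a hypothetical helper, not part of py-corenlp or CoreNLP, and it assumes the server's JSON output with characterOffsetBegin/characterOffsetEnd fields:

```python
def offsets_ok(original_text, sentences):
    """Return True if every token's character offsets index back into the original text."""
    for sent in sentences:
        for tok in sent["tokens"]:
            start, end = tok["characterOffsetBegin"], tok["characterOffsetEnd"]
            if original_text[start:end] != tok["originalText"]:
                return False
    return True

# Usage with the JSON returned by the server, e.g.:
#   offsets_ok(text, resp.json()["sentences"])
```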
