strange behavior of skipping characters in tokenization #784

Closed
bishanyang opened this issue Oct 26, 2018 · 3 comments
Labels
wrapper-bug A problem with a package that provides an API or interface to CoreNLP, but not CoreNLP itself

Comments

@bishanyang

I found that the PTBLexer sometimes decides to remove characters from the input text; as a result, the character offsets no longer apply to the original text. I wonder why that is and how to avoid it. For example, if the input is "%ACTL", the tokenizer removes "%AC", treating it as a single character, and parses only "TL". I tried setting different "untokenizable" options (e.g., "allKeep", "noneKeep"), but that doesn't seem to help. Could someone please help me with this? Thanks.

@bishanyang
Author

bishanyang commented Oct 26, 2018

I realized that this problem doesn't exist if I use the tokenizer from Java, so it may have something to do with Unicode characters. I am currently using the Python interface with Python 3.6 and passing text.encode('utf-8') as input to the CoreNLP server. Not sure if I did anything wrong.
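For context, here is a minimal sketch of the round trip a Python wrapper typically makes to the CoreNLP server; the server URL, port, and annotator list are assumptions for illustration, not details from this thread, and py-corenlp's internals may differ:

```python
# Sketch of sending text to a running CoreNLP server over HTTP.
# Assumes a server at localhost:9000 started with the default settings.
import json
import requests

text = "%ACTL"
props = {"annotators": "tokenize,ssplit", "outputFormat": "json"}

resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},  # properties travel as a URL query parameter
    data=text.encode("utf-8"),                 # the text itself travels as UTF-8 bytes in the body
)
tokens = [t["originalText"] for s in resp.json()["sentences"] for t in s["tokens"]]
print(tokens)  # if "%AC" gets mis-decoded along the way, only "TL" comes back
```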

@manning
Member

manning commented Oct 28, 2018

I don't think this is a Unicode thing but rather a "web thing". That is, things aren't being correctly marshalled/unmarshalled across the web-service requests. The requests are percent-encoded (https://en.wikipedia.org/wiki/Percent-encoding), so %AC is treated as the binary byte 0xAC. The Python side should avoid this by escaping the %, encoding it as %25, but it is failing to do so.
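The effect can be reproduced with the standard library's percent-encoding helpers; this is just an illustration of the decoding step, not code from CoreNLP or any wrapper:

```python
from urllib.parse import quote, unquote_to_bytes

text = "%ACTL"

# Decoded without escaping, "%AC" collapses into the single byte 0xAC and only "TL"
# remains as readable text: the behavior reported above.
print(unquote_to_bytes(text))      # b'\xacTL'

# Escaping the percent sign first ("%" becomes "%25") keeps the literal text intact.
escaped = quote(text, safe="")
print(escaped)                     # '%25ACTL'
print(unquote_to_bytes(escaped))   # b'%ACTL'
```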

What package are you using on the Python side? There are several Python interfaces to CoreNLP.

@bishanyang
Author

Thanks for your reply, Chris. I think you are right. I implemented a web service for the tokenizer using Java Spark and then used it from Python, and the problem seems to be gone now. I was using the Python interface from https://github.com/smilli/py-corenlp.
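For anyone hitting the original symptom, a quick sanity check that the returned offsets still line up with the input text can catch this kind of mangling early. This is a hypothetical helper, not part of py-corenlp or CoreNLP, and it assumes the server's JSON output with characterOffsetBegin/characterOffsetEnd fields:

```python
def offsets_ok(original_text, sentences):
    """Return True if every token's character offsets index back into the original text."""
    for sent in sentences:
        for tok in sent["tokens"]:
            start, end = tok["characterOffsetBegin"], tok["characterOffsetEnd"]
            if original_text[start:end] != tok["originalText"]:
                return False
    return True

# Usage with the JSON returned by the server, e.g.:
#   offsets_ok(text, resp.json()["sentences"])
```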
