I found that the PTBLexer sometimes removes characters from the input text; as a result, the character offsets no longer apply to the original text. I wonder why that is and how to avoid it. For example, if the input is "%ACTL", the tokenizer removes "%AC", treating it as a single character, and parses only "TL". I tried setting different "untokenizable" options (e.g., "allKeep", "noneKeep"), but that doesn't seem to help. Could someone please help me with this? Thanks.
I realized that this problem doesn't exist if I use the tokenizer from Java, so it may have something to do with Unicode characters. I am currently using the Python interface under Python 3.6 and passing text.encode('utf-8') as input to the CoreNLP server. Not sure if I did anything wrong.
I don't think this is a Unicode issue but rather a "web issue". That is, the data isn't being correctly marshalled/unmarshalled across the web service requests. The requests are percent-encoded (https://en.wikipedia.org/wiki/Percent-encoding), so %AC is treated as the single binary byte 0xAC. What should happen is that the Python side escapes the % by encoding it as %25 before sending, but it is failing to do so.
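A minimal sketch of the mismatch, using only the standard library's urllib.parse to simulate what a percent-decoding server sees (the input string "%ACTL" is taken from the report above):

```python
from urllib.parse import quote, unquote_to_bytes

text = "%ACTL"

# If the text is sent unescaped, a percent-decoding server reads
# "%AC" as the single byte 0xAC, leaving only "TL" as real text:
decoded = unquote_to_bytes(text)
print(decoded)  # b'\xacTL'

# Escaping "%" as "%25" on the client side (quote with safe='')
# lets the literal text survive the round trip intact:
escaped = quote(text, safe='')
print(escaped)                    # %25ACTL
print(unquote_to_bytes(escaped))  # b'%ACTL'
```

This reproduces the exact symptom in the issue: the unescaped request loses "%AC", while the properly escaped one preserves the original string (so character offsets line up again).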
What package are you using on the Python side? There are several Python interfaces to CoreNLP.
Thanks for your reply, Chris. I think you are right. I implemented a web service for the tokenizer using Java Spark and then called it from Python, and the problem is gone now. I was using the Python interface from https://github.com/smilli/py-corenlp.