I am trying to tokenize multilingual (or rather, multi-script) strings into components, where each component contains only one script (as defined by Unicode). I tried using -segment_alphabet_change, but this also breaks at spaces.
The following string
the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति
should break into 4 tokens:
"the root" "कृ " "in the sense of frequency; e.g." "चर्करीति, चर्कर्ति, बोभवीति बोभोति"
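To make the desired behaviour concrete, here is a minimal Python sketch of the segmentation I have in mind. It is not how the tokenizer works internally, and the names `script_of` and `split_by_script` are made up for illustration. Python's standard library does not expose the Unicode Script property, so the sketch assumes the first word of a character's Unicode name (e.g. LATIN, DEVANAGARI) is a good enough stand-in for its script. Spaces, punctuation and digits are treated as neutral and stay attached to the run they follow.

```python
import unicodedata


def script_of(ch):
    """Approximate the script of a character, or return None if neutral.

    Assumption: the first word of the Unicode character name stands in
    for the Script property, which the standard library does not expose.
    """
    category = unicodedata.category(ch)
    # Only letters (L*) and marks (M*) decide the script of a run.
    if category[0] not in ("L", "M"):
        return None
    name = unicodedata.name(ch, "")
    if not name:
        return None
    prefix = name.split(" ", 1)[0]
    # Generic combining marks (e.g. COMBINING ACUTE ACCENT) inherit the
    # script of their base character, so treat them as neutral.
    if prefix == "COMBINING":
        return None
    return prefix


def split_by_script(text):
    """Split text into runs whose letters all share one approximate script."""
    runs = []
    current = []
    current_script = None
    for ch in text:
        script = script_of(ch)
        # Start a new run only when a letter of a different script appears.
        if script is not None and current_script is not None and script != current_script:
            runs.append("".join(current))
            current = []
        if script is not None:
            current_script = script
        # Neutral characters fall through and join the current run.
        current.append(ch)
    if current:
        runs.append("".join(current))
    return runs


if __name__ == "__main__":
    sample = "the rootकृ in the sense of frequency; e.g. चर्करीति, चर्कर्ति, बोभवीति बोभोति"
    for run in split_by_script(sample):
        print(repr(run))
```

Because neutral characters stick to the preceding run, a space that follows a run ends up as trailing whitespace on that run; callers that do not want it can strip each run afterwards.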