Tokenizing Ellipsis creates empty tokens #120
Comments
@polm Thank you for the report. We believe this can be fixed in the same way as the Java Sudachi PR WorksApplications/Sudachi#118. Let us look into it. |
Currently, the one-character ellipsis
By applying the fix already applied to the Java Sudachi
And the empty (zero-length) morphemes will still be there. This is not a bug; by specification, it is the expected behavior of Sudachi. The original input has only one character, so Sudachi allots it to the first morpheme and sets the remaining morphemes to be "zero-length" morphemes. These empty zero-length morphemes do have the normalized form For the |
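To make the explanation above concrete, here is a minimal sketch of how zero-length surfaces can arise when tokens are cut from normalized text and then mapped back to the original input. This is not SudachiPy's actual code: the function name `surfaces_from_offsets`, the offset table `norm_to_orig`, and the assumption that the normalized text is three characters long are all hypothetical and used only for illustration.

```python
from typing import List, Sequence, Tuple


def surfaces_from_offsets(
    original: str,
    token_spans: Sequence[Tuple[int, int]],
    norm_to_orig: Sequence[int],
) -> List[str]:
    """Recover each token's surface from the ORIGINAL text.

    token_spans  -- (begin, end) positions of each token in the normalized text
    norm_to_orig -- hypothetical table mapping each normalized-text position
                    (0..len(normalized)) to a position in the original text
    """
    surfaces = []
    for begin, end in token_spans:
        # Slice the original text using the mapped positions. If both ends
        # map to the same original position, the surface is empty.
        surfaces.append(original[norm_to_orig[begin]:norm_to_orig[end]])
    return surfaces


# Assumption: one original character became three normalized characters,
# and the analysis produced three one-character tokens.
spans = [(0, 1), (1, 2), (2, 3)]

# The whole original character is allotted to the first token.
first_gets_char = surfaces_from_offsets("…", spans, [0, 1, 1, 1])

# The whole original character is allotted to the last token.
last_gets_char = surfaces_from_offsets("…", spans, [0, 0, 0, 1])
```

With these inputs, `first_gets_char` is `['…', '', '']` and `last_gets_char` is `['', '', '…']`. Either way, one original character cannot be shared among three tokens, so two of them end up with empty surfaces.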
Huh, OK. I applied the fix and got the output above and assumed I had done something wrong.
Where can I see the specification? I'm curious what the motivation for generating zero-length morphemes is. |
Actually, there is no written specification, as far as I know ... The above is according to the main developer @kazuma-t , and I meant something like "that was what Sudachi was intended to do".
This is because there are more tokens after normalization than in the original input. |
Ah OK then, good to know.
Couldn't you just treat |
Right, it is in the lexicon, so the result above is simply how the analysis happened to come out, given the scores. We could treat this particular We let users configure the character normalization, so it is possible to have cases where the output morphemes are longer than the input. For example, if the input
Approach 3 means giving up on the correct analysis. Between 1 and 2, we think that 2 is probably the better solution, hence the current behavior. @kazuma-t Please correct me if I have explained something wrong, or elaborate to make it clearer. |
While working on spaCy Japanese model support and integrating Sudachi, I ran into an issue where the one-character ellipsis (…) was causing errors. If you tokenize this ellipsis you get three tokens from SudachiPy, with surfaces like ['', '', '…'].
I assume this is a bug but wasn't able to track down where it's happening. I also checked ㍻, and while that is also normalized internally, it seems to be output as a single character without issue.
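For a downstream consumer that cannot handle zero-length tokens, one possible workaround is to drop them before further processing. The sketch below is only an illustration operating on plain surface strings; the helper name `drop_empty_surfaces` is hypothetical, and this is not taken from spaCy or SudachiPy.

```python
from typing import Iterable, List


def drop_empty_surfaces(surfaces: Iterable[str]) -> List[str]:
    """Return the surfaces with zero-length entries removed, order preserved."""
    return [surface for surface in surfaces if surface]


# Surfaces as reported in this issue for the one-character ellipsis.
reported = ["", "", "…"]
cleaned = drop_empty_surfaces(reported)
```

Here `cleaned` is `['…']`. Note that dropping tokens this way also discards whatever other attributes the empty morphemes carried, so it is a trade-off rather than a fix.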