This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

Tokenizing Ellipsis creates empty tokens #120

Open
polm opened this issue May 28, 2020 · 6 comments

polm (Contributor) commented May 28, 2020

While working on spaCy Japanese model support and integrating Sudachi, I ran into an issue where the one-character ellipsis (…) was causing errors. If you tokenize this ellipsis, you get three tokens from SudachiPy, with surfaces like ['', '', '…'].

I assume this is a bug, but I wasn't able to track down where it's happening. I also checked ㍻; while that is also normalized internally, it seems to be output as a single character without issue.

sorami (Collaborator) commented May 28, 2020

@polm Thank you for the report.

We believe this can be fixed in the same way as the Java Sudachi PR WorksApplications/Sudachi#118.

Let us look into it.

sorami (Collaborator) commented May 29, 2020

Currently, the one-character ellipsis is analyzed as follows:

$ echo … | sudachipy
	補助記号,句点,*,*,*,*	.
	補助記号,句点,*,*,*,*	.
…	補助記号,句点,*,*,*,*	.
EOS

Once the fix already made in Java Sudachi (WorksApplications/Sudachi#118) is applied, this will change to:

$ echo … | sudachipy
…	補助記号,句点,*,*,*,*	.
	補助記号,句点,*,*,*,*	.
	補助記号,句点,*,*,*,*	.
EOS

And the empty (zero-length) morphemes will still be there. This is not a bug, but by specification, the expected behavior of Sudachi.

The original input has only one character, so Sudachi allots it to the first morpheme and makes the remaining ones "zero-length" morphemes. These empty morphemes still have the normalized form ".".

In the ㍻ case, the input is also only one character, but after normalization 平成 is a single morpheme as well, hence there are no empty morphemes.
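
One way to picture how both outputs above could arise is an offset table that maps each normalized character back to a position in the original text. The sketch below is an illustration only: it uses Python's NFKC as a stand-in for Sudachi's normalization, and the names build_offsets and surfaces are hypothetical, not taken from the Sudachi or SudachiPy source.

```python
import unicodedata


def build_offsets(text, first_char_owns_span):
    """Normalize text char by char and map normalized positions to original ones.

    Hypothetical helper: NFKC stands in for Sudachi's own normalizer.
    offsets[k] is the original index at which normalized char k starts;
    a final sentinel marks the end of the original text.
    """
    normalized, offsets = [], []
    for i, ch in enumerate(text):
        for k, out in enumerate(unicodedata.normalize("NFKC", ch)):
            normalized.append(out)
            if first_char_owns_span:
                # The first normalized char covers the original char;
                # the remaining ones start where the original char ends.
                offsets.append(i if k == 0 else i + 1)
            else:
                # Every normalized char starts at the original char.
                offsets.append(i)
    offsets.append(len(text))
    return "".join(normalized), offsets


def surfaces(text, spans, offsets):
    """Slice the original text for each (begin, end) span over the normalized text."""
    return [text[offsets[b]:offsets[e]] for b, e in spans]


text = "…"
spans = [(0, 1), (1, 2), (2, 3)]  # three "." morphemes in the normalized text

_, before = build_offsets(text, first_char_owns_span=False)
_, after = build_offsets(text, first_char_owns_span=True)
print(surfaces(text, spans, before))  # ['', '', '…']
print(surfaces(text, spans, after))   # ['…', '', '']
```

With ㍻, the normalized text 平成 is covered by a single span (0, 2), so either offset table yields the one surface ㍻ and no empty morphemes.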

sorami self-assigned this May 29, 2020

polm (Contributor, Author) commented May 29, 2020

Huh, OK. I applied the fix and got the output above and assumed I had done something wrong.

> And the empty (zero-length) morphemes will still be there. This is not a bug, but by specification, the expected behavior of Sudachi.

Where can I see the specification?

I'm curious what the motivation for generating zero-length morphemes is.

sorami (Collaborator) commented May 29, 2020

Actually, there is no written specification, as far as I know ... The above is according to the main developer @kazuma-t , and I meant something like "that was what Sudachi was intended to do".

> I'm curious what the motivation for generating zero-length morphemes is.

This is because the number of tokens after normalization is bigger than the original ones. … is normalized to three morphemes . / . / . and there simply aren't enough tokens to align with. It is not that we want to have zero-length morphemes, but that is the only way we can think of.
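
The length mismatch can be checked with Python's standard unicodedata module. This is only an approximation: NFKC is assumed here as a stand-in for the normalization step, and Sudachi's own configurable normalizer may behave differently. The helper name nfkc_expansion is hypothetical.

```python
import unicodedata


def nfkc_expansion(ch):
    """Return the NFKC form of a character and its length in code points."""
    normalized = unicodedata.normalize("NFKC", ch)
    return normalized, len(normalized)


for ch in ["…", "㍻"]:
    normalized, length = nfkc_expansion(ch)
    print(repr(ch), "->", repr(normalized), "length", length)
```

Both inputs are one character long; the ellipsis expands to three periods and ㍻ expands to the two characters 平成.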

polm (Contributor, Author) commented May 29, 2020

> Actually, there is no written specification, as far as I know ... The above is according to the main developer @kazuma-t , and I meant something like "that was what Sudachi was intended to do".

Ah OK then, good to know.

> This is because the number of tokens after normalization is bigger than the original ones. … is normalized to three morphemes . / . / . and there simply aren't enough tokens to align with. It is not that we want to have zero-length morphemes, but that is the only way we can think of.

Couldn't you just treat ... as a single morpheme? It's already an entry in small_lex.csv...

sorami (Collaborator) commented Jun 1, 2020

> Couldn't you just treat ... as a single morpheme? It's already an entry in small_lex.csv...

Right, it is in the lexicon, so the result above just happened to come out that way because of the candidates' scores.

We could handle this particular case as a single morpheme, but in general the "zero-length morpheme" case can still happen.

We let users configure the character normalization, so it is possible to have cases where the output morphemes are longer than the input.

For example, if the input A produces the longer output B C D, we can think of three ways to treat such a case:

  1. Each output has the same repeated original form: A -> B(A) C(A) D(A) (If we concatenate the original forms there will be duplicates)
  2. Only the first one keeps the original form: A -> B(A) C() D() (The current behavior of Sudachi)
  3. Always make it a single morpheme: A -> BCD(A)

Approach 3 means giving up on the correct analysis. Between 1 and 2, we think 2 is probably the better solution, hence the current behavior.

@kazuma-t Please correct me if I am explaining something wrong, or elaborate to make it clearer.
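
The three options can be compared with a small sketch. It models a morpheme as a (normalized form, surface) pair; the function names are hypothetical and this is not how Sudachi represents morphemes internally.

```python
def repeat_original(original, parts):
    """Option 1: every output morpheme repeats the original form."""
    return [(part, original) for part in parts]


def first_keeps_original(original, parts):
    """Option 2: only the first morpheme keeps the original form."""
    return [(part, original if i == 0 else "") for i, part in enumerate(parts)]


def single_morpheme(original, parts):
    """Option 3: merge everything into one morpheme."""
    return [("".join(parts), original)]


def joined_surface(morphemes):
    """Concatenate the surfaces, as a consumer rebuilding the input would."""
    return "".join(surface for _, surface in morphemes)


original, parts = "A", ["B", "C", "D"]
for strategy in (repeat_original, first_keeps_original, single_morpheme):
    morphemes = strategy(original, parts)
    print(strategy.__name__, morphemes, "->", repr(joined_surface(morphemes)))
```

Only options 2 and 3 reproduce the input when the surfaces are concatenated, and only options 1 and 2 keep the three-way split; option 2 is the one that satisfies both, at the cost of empty surfaces.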
