Tokenizing Ellipsis creates empty tokens #120

polm · 2020-05-28T07:57:21Z

While working on the spaCy Japanese model support and integrating Sudachi, I ran into an issue where the one-character ellipsis (…) was causing errors. If you tokenize this ellipsis you get three tokens from SudachiPy, with surfaces like ['', '', '…'].

I assume this is a bug but wasn't able to track down where it's happening. I also checked ㍻, and while that is also normalized internally it seems to be output as a single character without issue.


sorami · 2020-05-28T09:08:04Z

@polm Thank you for the report.

We believe this can be fixed in the same way as the Java Sudachi PR WorksApplications/Sudachi#118.

Let us look into it.

sorami · 2020-05-29T01:44:38Z

Currently, the one-character ellipsis … is analyzed as follows:

$ echo … | sudachipy
	補助記号,句点,*,*,*,*	.
	補助記号,句点,*,*,*,*	.
…	補助記号,句点,*,*,*,*	.
EOS

Applying the fix already made in Java Sudachi (WorksApplications/Sudachi#118) will change this to:

$ echo … | sudachipy
…	補助記号,句点,*,*,*,*	.
	補助記号,句点,*,*,*,*	.
	補助記号,句点,*,*,*,*	.
EOS

And the empty (zero-length) morphemes will still be there. This is not a bug, but by specification, the expected behavior of Sudachi.

The original input has only one character, so Sudachi allots it to the first morpheme and makes the remaining morphemes zero-length. These empty zero-length morphemes still have the normalized form ".".

In the ㍻ case, the input is also a single character, but its normalized form 平成 is a single morpheme as well, so there are no empty morphemes.
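For downstream code that cannot accept empty tokens, one option is to drop the zero-length morphemes after analysis. Below is a minimal sketch in plain Python. It does not call SudachiPy: the (surface, normalized_form) tuples and both helper functions are hypothetical stand-ins that mirror the post-fix output shown above.

```python
from typing import List, Tuple

# Hypothetical stand-in for analyzer output: (surface, normalized_form) pairs.
# The values mirror the post-fix output shown above for the input "…".
Morpheme = Tuple[str, str]


def drop_zero_length(morphemes: List[Morpheme]) -> List[Morpheme]:
    """Keep only morphemes whose surface is non-empty."""
    return [m for m in morphemes if m[0] != ""]


def surfaces_cover_input(text: str, morphemes: List[Morpheme]) -> bool:
    """Check that the concatenated surfaces reproduce the input exactly."""
    return "".join(surface for surface, _ in morphemes) == text


analyzed = [("…", "."), ("", "."), ("", ".")]
kept = drop_zero_length(analyzed)
print(kept)
print(surfaces_cover_input("…", analyzed), surfaces_cover_input("…", kept))
```

Dropping the empty morphemes keeps the surface text intact but discards two of the three normalized "." forms, so whether that is acceptable depends on the application.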

polm · 2020-05-29T04:00:28Z

Huh, OK. I applied the fix and got the output above and assumed I had done something wrong.

> And the empty (zero-length) morphemes will still be there. This is not a bug, but by specification, the expected behavior of Sudachi.

Where can I see the specification?

I'm curious what the motivation for generating zero-length morphemes is.

sorami · 2020-05-29T04:12:10Z

Actually, there is no written specification, as far as I know ... The above is according to the main developer @kazuma-t , and I meant something like "that was what Sudachi was intended to do".

> I'm curious what the motivation for generating zero-length morphemes is.

This is because the number of tokens after normalization is bigger than the original ones. … is normalized to three morphemes . / . / . and there simply aren't enough tokens to align with. It is not that we want to have zero-length morphemes, but that is the only way we can think of.

polm · 2020-05-29T04:17:38Z

> Actually, there is no written specification, as far as I know ... The above is according to the main developer @kazuma-t , and I meant something like "that was what Sudachi was intended to do".

Ah OK then, good to know.

> This is because the number of tokens after normalization is bigger than the original ones. … is normalized to three morphemes . / . / . and there simply aren't enough tokens to align with. It is not that we want to have zero-length morphemes, but that is the only way we can think of.

Couldn't you just treat ... as a single morpheme? It's already an entry in small_lex.csv...

sorami · 2020-06-01T10:30:50Z

> Couldn't you just treat ... as a single morpheme? It's already an entry in small_lex.csv...

Right, it is in the lexicon. The analysis above simply happened to come out that way because of the scores.

We could treat this particular … case as a single morpheme, but in general the "zero-length morpheme" case can still happen.

We let users configure the character normalization, so there can be cases where the output has more morphemes than the input has characters to assign to them.

For example, if the input A produces the longer output B C D, we can think of three ways to treat such a case:

1. Each output morpheme has the same repeated original form: A -> B(A) C(A) D(A) (if we concatenate the original forms, there will be duplicates)
2. Only the first one keeps the original form: A -> B(A) C() D() (the current behavior of Sudachi)
3. Always make it a single morpheme: A -> BCD(A)

Approach 3 means giving up on the correct analysis. Between 1 and 2, we think 2 is probably the better solution, hence the current behavior.
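The three approaches can be compared with a small sketch. This is illustrative Python only, not Sudachi's implementation: the function names and the (normalized_form, original_surface) tuple representation are assumptions made for the example.

```python
from typing import List, Tuple

# Each output morpheme is modeled as (normalized_form, original_surface).
# This representation is hypothetical, chosen only for illustration.
Aligned = List[Tuple[str, str]]


def repeat_original(original: str, normalized: List[str]) -> Aligned:
    """Approach 1: every output morpheme carries the full original form."""
    return [(n, original) for n in normalized]


def first_keeps_original(original: str, normalized: List[str]) -> Aligned:
    """Approach 2: only the first morpheme carries the original form."""
    return [(n, original if i == 0 else "") for i, n in enumerate(normalized)]


def single_morpheme(original: str, normalized: List[str]) -> Aligned:
    """Approach 3: merge all normalized pieces into one morpheme."""
    return [("".join(normalized), original)]


def joined_surface(aligned: Aligned) -> str:
    """Concatenate the original surfaces of all morphemes."""
    return "".join(surface for _, surface in aligned)


for strategy in (repeat_original, first_keeps_original, single_morpheme):
    result = strategy("A", ["B", "C", "D"])
    print(strategy.__name__, result, repr(joined_surface(result)))
```

Only approaches 2 and 3 give back the input A when the original surfaces are concatenated, and only approaches 1 and 2 keep B, C and D as separate morphemes. Approach 2 is the one that satisfies both, at the cost of zero-length surfaces.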

@kazuma-t Please correct me if I am explaining something wrong, or elaborate to make it clearer.

polm mentioned this issue May 28, 2020

Japanese Model explosion/spaCy#3756

Closed

sorami self-assigned this May 29, 2020

sorami mentioned this issue Jun 2, 2020

Fix a bug causing … is converted to "", "", "…" #121

Merged

polm mentioned this issue Aug 24, 2020

cannot analyze ̄ ̄ with japanese models explosion/spaCy#5961

Closed
