Fix the out of alphabet token handling in analyses generation #52

AMR-KELEG · 2019-05-10T23:33:11Z

Solves #45
Consider alphanumeric characters to be part of the vocabulary.

unhammer · 2019-05-11T15:16:04Z

This looks like a major change to tokenisation. Has this been tested with several language pairs to ensure no regressions?

AMR-KELEG · 2019-05-11T15:22:33Z

> This looks like a major change to tokenisation. Has this been tested with several language pairs to ensure no regressions?

I haven't tested it on large language-pairs.
Do you have recommendations for doing so?
What should the input and the expected output be?

TinoDidriksen · 2019-05-11T15:42:32Z

I just noted the build checks passed - that's good enough for me to merge it. If this breaks downstream pairs, revert and add a relevant build test.

unhammer · 2019-05-14T07:52:18Z

I haven't seen any regressions in nno-nob at least (240k lines passed without changes to output). I suppose it'll affect pairs with missing <alphabet> members more (in which case it's probably a change we want).
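A check of this kind (same corpus through the old and the new binary, then compare the output) could be scripted roughly as below. This is only a sketch: it assumes each command reads text on stdin and writes one output line per input line, and the binary and dictionary paths in the usage comment are hypothetical placeholders.

```python
import subprocess
from itertools import zip_longest


def run(cmd, text):
    """Feed `text` to `cmd` on stdin and return its stdout split into lines."""
    result = subprocess.run(
        cmd, input=text, capture_output=True,
        text=True, encoding="utf-8", check=True,
    )
    return result.stdout.splitlines()


def diff_outputs(old_cmd, new_cmd, text):
    """Return (line_number, old_line, new_line) for every line that differs.

    If one command emits fewer lines than the other, the missing side
    is reported as None.
    """
    old_lines = run(old_cmd, text)
    new_lines = run(new_cmd, text)
    return [
        (n, old, new)
        for n, (old, new) in enumerate(zip_longest(old_lines, new_lines), start=1)
        if old != new
    ]


# Hypothetical usage (paths are placeholders, not real locations):
#   corpus = open("corpus.txt", encoding="utf-8").read()
#   changes = diff_outputs(
#       ["/path/to/old/lt-proc", "/path/to/analyser.bin"],
#       ["/path/to/new/lt-proc", "/path/to/analyser.bin"],
#       corpus,
#   )
#   An empty list means the corpus passed without changes to output.
```

An empty result only shows that this particular corpus is unaffected; pairs with incomplete alphabets would need their own corpora.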

The reason I asked is that it takes away some freedom in defining tokenisation. For example, with an empty alphabet you could define a very stupid tokeniser for languages written without spaces (e.g. Thai):

$ echo nullein | ~/src/ap/lttoolbox/lttoolbox/lt-proc /tmp/foo.bin # before this commit
^null/null<det><qnt><un><pl>$^ein/ein<det><qnt><m><sg>$

$ echo nullein | /usr/bin/lt-proc /tmp/foo.bin  # after this commit
^nullein/*nullein$

(i.e. it could analyse with no spaces between LUs even though they're in a type="standard" section), but I don't think anyone's seriously doing that, since LRLM (left-to-right longest match) fails on anything non-trivial. I guess if there actually are breakages people will complain :)
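The difference between the two outputs can be illustrated with a toy left-to-right longest-match segmenter. This is not lttoolbox's implementation, only a minimal sketch under stated assumptions: the lexicon is a plain set of surface forms, a match is accepted only when the next character is not alphabetic, and the two `is_alphabetic` predicates (hypothetical names) stand in for "empty alphabet" and "alphanumerics always count as alphabetic".

```python
def segment(text, lexicon, is_alphabetic):
    """Toy left-to-right longest-match segmentation.

    Returns a list of (surface, known) pairs. A lexicon match ending at
    position j is accepted only if it ends at a boundary, i.e. at the end
    of the input or before a character that is not alphabetic.
    """
    tokens = []
    i = 0
    while i < len(text):
        best = None
        # Try the longest candidate first.
        for j in range(len(text), i, -1):
            at_boundary = j == len(text) or not is_alphabetic(text[j])
            if text[i:j] in lexicon and at_boundary:
                best = j
                break
        if best is not None:
            tokens.append((text[i:best], True))
            i = best
        elif is_alphabetic(text[i]):
            # Unknown word: swallow the whole run of alphabetic characters.
            j = i
            while j < len(text) and is_alphabetic(text[j]):
                j += 1
            tokens.append((text[i:j], False))
            i = j
        else:
            # Non-alphabetic character outside any match: skip it as a blank.
            i += 1
    return tokens


LEXICON = {"null", "ein"}  # illustrative entries taken from the example above


def empty_alphabet(ch):
    """No character is alphabetic, so a match may end anywhere."""
    return False


def alnum_alphabet(ch):
    """Alphanumeric characters always count as alphabetic."""
    return ch.isalnum()


before = segment("nullein", LEXICON, empty_alphabet)
after = segment("nullein", LEXICON, alnum_alphabet)
print(before)  # [('null', True), ('ein', True)]
print(after)   # [('nullein', False)]
```

With the second predicate, "null" is rejected because the following "e" is alphabetic, so the whole string falls through as one unknown word, which mirrors the `*nullein` output shown above.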

unhammer · 2019-10-28T08:06:06Z

> I guess if there actually are breakages people will complain :)

And they do! See #75

Fix the out of alphabet token handling in analyses generation

5c8ec17

Solves apertium#45 Consider alphanumeric characters to be part of the vocabulary.

TinoDidriksen merged commit 944ed25 into apertium:master May 11, 2019

flammie self-assigned this May 14, 2019

unhammer mentioned this pull request Oct 28, 2019

Support empty alphabet, for simple CJK word segmentation #75

Open
