Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the out of alphabet token handling in analyses generation #52

Merged
merged 1 commit into from
May 11, 2019

Conversation

AMR-KELEG
Copy link
Contributor

Solves #45
Consider alphanumeric characters to be part of the vocabulary.

Solves apertium#45
Consider alphanumeric characters to be part of the vocabulary.
@TinoDidriksen TinoDidriksen merged commit 944ed25 into apertium:master May 11, 2019
@unhammer
Copy link
Member

unhammer commented May 11, 2019 via email

@AMR-KELEG
Copy link
Contributor Author

AMR-KELEG commented May 11, 2019

This looks like a major change to tolenisation. Had this been tested with several language pairs to ensure no regressions? Den Lau 11 mai 2019, klokka 09:02, skreiv Tino Didriksen:

Merged #52 <#52> into master. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#52 (comment)>, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAN4JBSR6RDBCAE4ZIG5QDPUZ4SHANCNFSM4HMGZ6GA.

I haven't tested it on large language-pairs.
Do you have recommendations for doing so?
What should the input and the expected output be?

@TinoDidriksen
Copy link
Member

I just noted the build checks passed - that's good enough for me to merge it. If this breaks downstream pairs, revert and add a relevant build test.

@unhammer
Copy link
Member

unhammer commented May 14, 2019

I haven't seen any regressions in nno-nob at least (240k lines passed without changes to output). I suppose it'll affect pairs with missing <alphabet> members more (in which case it's probably a change we want).


The reason I asked is that it takes away some freedom in defining tokenisation, e.g. with an empty alphabet you could define a very stupid tokeniser for languages without spaces (thai):

$ echo nullein | ~/src/ap/lttoolbox/lttoolbox/lt-proc /tmp/foo.bin # before this commit
^null/null<det><qnt><un><pl>$^ein/ein<det><qnt><m><sg>$

$ echo nullein|/usr/bin/lt-proc /tmp/foo.bin  # after this commit
^nullein/*nullein$

(ie. it could analyse with no spaces between LU's even though they're in a type="standard" section) but I don't think anyone's seriously doing that since LRLM fails on anything non-trivial. I guess if there actually are breakages people will complain :)

@unhammer
Copy link
Member

I guess if there actually are breakages people will complain :)

And they do! See #75

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants