Expand (some) ICU character classes in regex_compiler? #160
Comments
The code can ask ICU for the list of characters, but then the finished FST will differ depending on which version of ICU (and thus of Unicode) it was built with. It could encode the ICU version in the file and redo the expansion at runtime if the version differs.
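A minimal sketch of the version check suggested here, assuming a hypothetical `FstHeader` with a version field (the current lttoolbox binary format has no such field, and every name below is made up for illustration). The four-byte layout mirrors ICU's `UVersionInfo`; in a real build the runtime value would presumably come from ICU's `u_getUnicodeVersion`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical: four version bytes (major, minor, milli, micro),
// mirroring the layout of ICU's UVersionInfo.
using UnicodeVersion = std::array<std::uint8_t, 4>;

// Hypothetical header; not part of the real lttoolbox file format.
struct FstHeader {
    UnicodeVersion built_with;  // Unicode version used when classes were expanded
};

// True when the expanded classes may be stale, i.e. the Unicode data
// available at runtime differs from what the file was compiled against.
bool needs_reexpansion(const FstHeader& header, const UnicodeVersion& runtime) {
    return header.built_with != runtime;
}
```

A loader could call `needs_reexpansion` once when reading the file and only pay for re-expansion when it returns true.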
Classes typically only get wider, so that sounds fine by me. I don't see a need for FSTs to be perfectly reproducible when built against differing libraries, though encoding the ICU version in the file sounds like a good idea anyway.
The current binary format for alphabets makes some assumptions about alphabet symbols (see apertium/apertium-yid#3 (comment)) that I think would make having non-expanded class symbols almost certainly require a file version bump (though I suppose you'd get that from including the ICU version anyway...).
At the very least, getting Lower and Upper ranges would be nice, so we could
and whatnot.
If we do the "simple" thing and just expand the classes the way ranges are expanded in https://github.com/apertium/lttoolbox/blob/acx-spaces/lttoolbox/regexp_compiler.cc, we get quite a lot of transitions: https://www.compart.com/en/unicode/category/Ll (probably an unreliable source) claims 2155 lowercase letters. But maybe it's OK if we keep regexes in their own <section>; more research needed.

Alternatively, could/should we do something like insert a special symbol and have fst_processor treat it specially? (Any tools operating on the compiled FSTs, like lt-trim or lt-print|hfst-txt2fst|hfst-stuff, would just have to treat it opaquely.)
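To make the special-symbol alternative concrete, here is a rough sketch of how a processor might match such a symbol at runtime instead of following one transition per character. The symbol codes, the function names and the ASCII-only predicates are all assumptions for illustration; nothing here is existing fst_processor code, and a real version would presumably call ICU's `u_islower` / `u_isupper`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reserved symbol codes for unexpanded classes. Negative values
// are used here only to keep them apart from literal code points.
constexpr std::int32_t CLASS_LOWER = -1000001;
constexpr std::int32_t CLASS_UPPER = -1000002;

// Stand-in predicates covering ASCII only; a real build would presumably
// delegate to ICU's u_islower / u_isupper instead.
bool is_lower(char32_t c) { return c >= U'a' && c <= U'z'; }
bool is_upper(char32_t c) { return c >= U'A' && c <= U'Z'; }

// Decide whether a transition labelled `symbol` accepts input character `c`.
bool transition_accepts(std::int32_t symbol, char32_t c) {
    if (symbol == CLASS_LOWER) return is_lower(c);
    if (symbol == CLASS_UPPER) return is_upper(c);
    // Ordinary symbol: accept only an exact code point match.
    return symbol >= 0 && static_cast<char32_t>(symbol) == c;
}
```

With this shape the compiled FST carries a single transition per class, and tools that never interpret the symbol can pass it through unchanged.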