Unidirectional Compilation #155

Merged: 7 commits, Aug 17, 2022

mr-martian
Contributor

The goal of this PR is to make it so that, in place of

lt-comp lr spa.dix spa.automorf.bin # compile left-to-right and general paths
lt-comp rl spa.dix spa.autogen.bin  # compile right-to-left and general paths

we can instead write

lt-comp u spa.dix .deps/spa.dix.bin               # compile all paths, marking ones that have restrictions
lt-restrict lr .deps/spa.dix.bin spa.automorf.bin # remove right-to-left paths
lt-restrict rl .deps/spa.dix.bin .deps/spa.RL.bin # remove left-to-right paths
lt-invert .deps/spa.RL.bin spa.autogen.bin        # invert
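
To make the division of labour between the four commands concrete, here is a toy Python sketch. It is not how lttoolbox is implemented: a transducer is modelled as a flat list of paths, and the `Path` class, the function names, and the sample entries are all hypothetical. It only illustrates the idea of compiling every path once with its restriction recorded, then deriving each direction by filtering and inverting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Path:
    """One input/output pair in a toy transducer (hypothetical model)."""
    left: str
    right: str
    restriction: Optional[str] = None  # "LR", "RL", or None for both directions


def compile_unidirectional(entries: List[Path]) -> List[Path]:
    """Keep every path and its restriction mark (the role played by `lt-comp u`)."""
    return list(entries)


def restrict(paths: List[Path], direction: str) -> List[Path]:
    """Drop paths marked for the opposite direction (the role of `lt-restrict`)."""
    keep = direction.upper()
    return [p for p in paths if p.restriction in (None, keep)]


def invert(paths: List[Path]) -> List[Path]:
    """Swap the two sides of every path (the role of `lt-invert`)."""
    return [Path(p.right, p.left, p.restriction) for p in paths]


# Made-up sample entries, not taken from spa.dix.
entries = [
    Path("casa", "casa<n>"),
    Path("color", "color<n>"),
    Path("colour", "color<n>", "LR"),  # only valid left-to-right
    Path("perro", "perro<n>", "RL"),   # only valid right-to-left
]

compiled = compile_unidirectional(entries)   # done once
analyser = restrict(compiled, "lr")          # general + left-to-right paths
generator = invert(restrict(compiled, "rl"))  # general + right-to-left, inverted

print([p.left for p in analyser])
print([p.left for p in generator])
```

The expensive step (building `compiled`) happens once, and both direction-specific results are derived from it by cheap passes.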

Why, you might ask, would we want to replace 2 commands with 4 (or 3, if I make lt-restrict invert the FST when the direction is rl)? If LT_RELEASE is unset or set to no, lt-restrict does not minimize the transducer. Minimization is, even after recent optimizations, still by far the biggest piece of the process, so skipping it significantly cuts overall compile time, especially for languages like -oci, where the dictionary is getting compiled 6 times.

This PR is a draft because in order for this to be fully usable, I need to also write a tool to apply an ACX file to an already-compiled transducer.

Oh, and I wrote a wrapper around getopt because I was tired of typing the same boilerplate over and over again.

@mr-martian
Contributor Author

I tested this on -oci and got

| setup | invocations of lt-comp | time |
|---|---|---|
| current | 6 | ~7:30 |
| merging variants | 1 | ~14:00 |
| merging directions | 3 | ~6:00 |

I also observed a slowdown at runtime. If that is due to the different FST structure, it would roughly cancel out the benefits for any workflow that involves running a large corpus through the pipeline after each recompilation.

It would probably also be worth checking whether a language with less divergence between variants would have as much of a slowdown from merging them.

And I should add tests.

@mr-martian
Contributor Author

mr-martian commented Jul 25, 2022

I tested this on -cat and got

| setup | invocations of lt-comp | time |
|---|---|---|
| current | 4 | 2:07 |
| merging variants | 1 | 1:23 |
| merging, release mode | 1 | 2:01 |

So it seems that the usefulness of this will need to be determined on a language-by-language basis.
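
For anyone wanting to compare the two sets of numbers directly, the small Python sketch below converts the reported wall-clock times to seconds and prints each setup's change relative to "current". The helper names are made up for this example, and the -oci figures were reported as approximate, so the percentages are only rough.

```python
def to_seconds(clock: str) -> int:
    """Convert a 'm:ss' string (optionally prefixed with '~') to seconds."""
    minutes, seconds = clock.lstrip("~").split(":")
    return int(minutes) * 60 + int(seconds)


def relative_change(baseline: str, candidate: str) -> float:
    """Percentage change of candidate versus baseline (negative means faster)."""
    base = to_seconds(baseline)
    return (to_seconds(candidate) - base) / base * 100


# Times as reported in the comments in this thread.
timings = {
    "-oci": {
        "current": "~7:30",
        "merging variants": "~14:00",
        "merging directions": "~6:00",
    },
    "-cat": {
        "current": "2:07",
        "merging variants": "1:23",
        "merging, release mode": "2:01",
    },
}

for language, setups in timings.items():
    baseline = setups["current"]
    for setup, clock in setups.items():
        if setup == "current":
            continue
        change = relative_change(baseline, clock)
        print(f"{language} {setup}: {change:+.0f}%")
```

On these figures, merging variants is roughly 87% slower for -oci but about 35% faster for -cat, while merging directions saves about 20% for -oci.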

I also made lt-restrict rl invert the transducers, and made lt-comp accept more than one variant or alt value, separated by spaces (replacing apertium-genvdix).

@mr-martian mr-martian marked this pull request as ready for review August 2, 2022 16:30
@mr-martian mr-martian merged commit f8fce9a into master Aug 17, 2022