Option to output '~' as compound separator #177

Open
unhammer opened this issue Oct 19, 2023 · 1 comment

@unhammer
Member

unhammer commented Oct 19, 2023

apertium-pretransfer has an option -e ("treat ~ as compound separator"). I don't know if any other tools have this, but it would be nice to support it throughout the pipeline, so that we can keep + in the sense of <j/> separate from ~ in the sense of compounds.

Motivation:

Currently, transfer has no way of knowing whether there was an actual space in the input or whether it was placed there by pretransfer, which saw a + and output a space.

If you want a transfer rule to match a compound followed by something else, you currently have to output the first two parts with no <b/> between them, and then a <b/>. But the first blank that you output will be the space that was added by pretransfer, with rules like

<out>
  <lu><clip pos="1" side="tl" part="whole"/></lu>      <!-- no b/ here -->
  <lu><clip pos="2" side="tl" part="whole"/></lu><b/>  <!-- this will output the blank that was between 1 and 2! -->
  <lu><clip pos="3" side="tl" part="whole"/></lu>
</out>

This has two consequences. First, we get luftputebåten<em>min</em> → lt-proc → luftpute+båten[<em>]min[</em>] → pretransfer → luftpute båten[<em>]min[</em>] → transfer → luftputebåten min[<em>][</em>]. Second, it's not really possible to write a general rule that matches, say, both dynamic compounds and number compounds, which should be treated the same way but where the latter had no space added by pretransfer: that first <b/> will then be empty, turning 2.-kvartalet deres into 2.kvartaletdeira.
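
To make the ambiguity concrete, here is a minimal sketch in Python. It is a toy model, not the actual apertium-pretransfer code: the function name toy_pretransfer, the compound_sep parameter (standing in for -e), the example tags, and the choice to split on ~ without inserting a blank are all assumptions made for illustration. It also ignores escapes and any + or ~ inside tags.

```python
def toy_pretransfer(lu: str, compound_sep: bool = False) -> str:
    """Toy model: split one lexical unit such as ^a<n>+b<n>$ into several.

    Assumption (not confirmed by the real tool): '+' inserts a blank
    between the parts, while '~' (only when compound_sep is set)
    splits the unit without inserting any blank.
    """
    body = lu.strip("^$")  # drop the surrounding ^ and $ delimiters
    result = "^"
    for ch in body:
        if ch == "+":
            result += "$ ^"  # joined words: a space appears in the stream
        elif ch == "~" and compound_sep:
            result += "$^"   # compound: split, but no space is added
        else:
            result += ch
    return result + "$"


# With '+', the output has a space that transfer cannot tell apart
# from a space that was present in the original input.
plus_split = toy_pretransfer("^luftpute<n>+båt<n><def>$")

# With '~' and the hypothetical compound_sep flag, no space is added,
# so a later <b/> would only ever carry blanks from the real input.
tilde_split = toy_pretransfer("^luftpute<n>~båt<n><def>$", compound_sep=True)

print(plus_split)
print(tilde_split)
```

Under these assumptions, a rule could output a <b/> between the compound and the following word without it accidentally picking up a blank that pretransfer invented.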

@unhammer
Member Author

@ftyers @TinoDidriksen @mr-martian thoughts? It seems ~ was chosen in pretransfer already, but it stopped there. Any reason not to implement this in lt-proc/apertium-tagger/cg-proc?
