apertium-pretransfer has the option -e ("treat ~ as compound separator"). I don't know whether any other tools have this, but it would be nice to support it throughout the pipeline, so that + in the sense of <j/> and ~ in the sense of compounds are kept separate.
Motivation:
Currently, transfer has no way of knowing if there was an actual space in input or it was just placed there by pretransfer which saw a + and output a space.
If in transfer you want to match a compound followed by something else, you currently have to output the first two parts with no <b/> between them, and then a <b/>. But that first blank you output will be the space that was added by pretransfer, with rules like:
<out>
<lu><clip pos="1" side="tl" part="whole"/></lu> <!-- no b/ here -->
<lu><clip pos="2" side="tl" part="whole"/></lu><b/> <!-- this will output the blank that was between 1 and 2! -->
<lu><clip pos="3" side="tl" part="whole"/></lu>
</out>
This means that, on the one hand, we get luftputebåten<em>min</em> → lt-proc → luftpute+båten[<em>]min[</em>] → pretransfer → luftpute båten[<em>]min[</em>] → transfer → luftputebåten min[<em>][</em>]. On the other hand, it's not really possible to write a general rule that matches, say, both dynamic compounds and number compounds that should be treated the same way but had no space added by pretransfer: that first <b/> will then be empty, turning 2.-kvartalet deres into 2.kvartaletdeira.
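To make the ambiguity concrete, here is a rough Python sketch of the splitting step. It is a toy model, not the real apertium-pretransfer code: the function name `toy_pretransfer` and the `compound_sep` parameter are made up for illustration, the example analyses are invented, and the assumption that ~ is split into adjacent units with no blank between them is only my reading of what -e is meant to do.

```python
import re


def toy_pretransfer(stream: str, compound_sep: bool = False) -> str:
    """Toy model of pretransfer splitting (hypothetical, for illustration).

    '+' inside a lexical unit ^...$ becomes two units separated by a blank.
    With compound_sep=True, '~' becomes two adjacent units with NO blank
    (an assumption about what -e is meant to do). The real tool also deals
    with tags, escapes and superblanks, which this sketch ignores.
    """

    def split_unit(match: re.Match) -> str:
        body = match.group(1)
        body = body.replace("+", "$ ^")  # inserts a blank that was not in the input
        if compound_sep:
            body = body.replace("~", "$^")  # assumed: split without adding a blank
        return "^" + body + "$"

    return re.sub(r"\^(.*?)\$", split_unit, stream)


# A '+' join and two units separated by a real space look identical afterwards:
joined = toy_pretransfer("^luftpute<n>+båt<n><def>$ ^min<det>$")
spaced = toy_pretransfer("^luftpute<n>$ ^båt<n><def>$ ^min<det>$")

# With a separate '~' marker, the compound stays distinguishable:
compound = toy_pretransfer("^luftpute<n>~båt<n><def>$ ^min<det>$", compound_sep=True)
```

In this sketch `joined` and `spaced` come out as the same string, which is exactly why transfer cannot tell an inserted space from a real one, while `compound` keeps the two parts adjacent so that any blank seen later must have come from the input.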
@ftyers @TinoDidriksen @mr-martian thoughts? It seems ~ was already chosen in pretransfer, but it stopped there. Any reason not to implement this in lt-proc/apertium-tagger/cg-proc?