Write a utility to assign weights to a compiled transducer based on a corpus #16

ftyers · 2018-07-01T16:30:45Z

I imagine it will be called lt-reweight

It should have two arguments:

a binary lttoolbox file e.g. grn.automorf.bin
a tagged corpus grn.tagged

$ lt-reweight grn.automorf.bin grn.tagged

Where grn.tagged looks like:

^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^Guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^ombohéra/o<prn><p3><sg>+mbohéra<v><tv><pres>$
^hikuái/hikuái<aux><impf><p3><pl>$
^umi/umi<adj><dem><pl>$
^Guaranikuéra/guarani<n>+kuéra<det><pl>$
^pe/pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^teépe/tee<n>+pe<post>$
^./.<sent>$

^Guarani/guarani<n>$
^haʼe/haʼe<vbser><iv><pres>$
^peteĩva/peteĩ<num>+va<subs><dem>$
^umi/umi<adj><dem><pl>$
^teʼyikuéra/teʼyi<n>+kuéra<det><pl>$
^Amérika-gua/Amérika<np><top>+gua<post>$
^ñeʼẽnguéra/ñeʼẽ<n>+kuéra<det><pl>$
^apytépe/apytépe<post>$
^hetave/heta<adv>+ve<comp>$
^iñeʼẽhárava/iñeʼẽhárava<adj>$
^,/,<cm>$
^oñemohendáva/o<prn><p3><sg>+je<pass>+mohenda<v><tv><pres>+va<subs><dem>$
^irundy/irundy<num>$
^tetãnguéra/tetã<n>+kuéra<det><pl>$
^iñambuévape/iñambuéva<adj>+pe<post>$
^(/(<lpar>$
^Paraguái/Paraguái<np><top>$
^,/,<cm>$
^Argentina/Argentina<np><top>$
^,/,<cm>$
^Volívia/Volívia<np><top>$
^ha/ha<cnjcoo>$
^Brasil/Brasil<np><top>$
^)/)<rpar>$
^./.<sent>$

^Avei/avei<adv>$
^,/,<cm>$
^haʼe/haʼe<vbser><iv><pres>$
^ñoite/ñoite<adv>$
^ojehechakuaáva/o<prn><p3><sg>+je<pass>+hechakuaa<v><tv><pres>+va<subs><dem>$
^ñeʼẽ/ñeʼẽ<n>$
^teéramo/tee<n>+ramo<post>$
^peteĩ/peteĩ<num>$
^tetã/tetã<n>$
^Ñembyamérika-guápe/Ñembyamérika<np><top>+gua<post>+pe<post>$
^./.<sent>$

^Tupi/Tupi<n>$
^ha/ha<cnjcoo>$
^guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^aty/aty<n>$
^guasu/guasu<adj>$
^rehegua/rehegua<post>$

^,/,<cm>$
^oguereko/o<prn><p3><sg>+guereko<v><tv><pres>$
^hetáichagua/hetáichagua<adj>$

^ñeʼẽnunga/ñeʼẽnunga<n>$
^,/,<cm>$
^upéicharõ/upéicha<adv>+rõ<post>$
^jepe/jepe<adv>$
^oĩ/oĩ<v><iv><pres>$
^jekupyty/jekupyty<v><tv><pres>$
^ijapytepekuéra/i<prn><p3><sg>+japyte<n>+pe<post>+kuéra<det><pl>$
^ha/ha<cnjcoo>$
^heta/heta<adv>$
^mbaʼépe/mbaʼe<n>+pe<post>$
^ojojogua/ojojogua<n>$
^koʼã/koʼã<adj><dem><pl>$
^ñeʼẽnungakuéra/ñeʼẽnunga<n>+kuéra<det><pl>$
^./.<sent>$

^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^haʼe/haʼe<vbser><iv><pres>$
^Paraguái/Paraguái<np><top>$
^retãme/tetã<n>+pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^ary/ary<n>$
^1992/1992<num>$
^guive/guive<post>$
^./.<sent>$

^Japypateĩ/Japypateĩ<num>$
^2006/2006<num>$
^guive/guive<post>$
^haʼe/haʼe<vbser><iv><pres>$
^avei/avei<adv>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^Mercosur-pe/Mercosur<np><org>+pe<case>$
^,/,<cm>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^ha/ha<cnjcoo>$
^poytugañeʼẽ/poytugañeʼẽ<n>$
^ykére/ykére<post>$
^./.<sent>$

And the output of the analyser for e.g. poytugañeʼẽ is:

^poytugañeʼẽ/poytugañeʼẽ<n>/a<prn><p1><sg>+poytugañeʼẽ<n>/re<prn><p2><sg>+poytugañeʼẽ<n>$^./.<sent>$

So the analyses should be weighted as follows:

poytugañeʼẽ : poytugañeʼẽ<n> = 1.0
poytugañeʼẽ : a<prn><p1><sg>+poytugañeʼẽ<n> = 0.0
poytugañeʼẽ : re<prn><p2><sg>+poytugañeʼẽ<n> = 0.0
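As a rough illustration of the weighting above, here is a minimal Python sketch that counts (surface, analysis) pairs in a tagged corpus and turns them into relative frequencies per surface form. It does not touch the binary transducer at all: the `candidates` dictionary stands in for the analyser output, and the function names `read_tagged` and `unigram_weights` are made up for this example.

```python
from collections import Counter, defaultdict

def read_tagged(lines):
    """Yield (surface, analysis) pairs from lines shaped like ^surface/analysis$."""
    for line in lines:
        line = line.strip()
        if not (line.startswith("^") and line.endswith("$")):
            continue  # blank lines separate sentences
        surface, _, analysis = line[1:-1].partition("/")
        yield surface, analysis

def unigram_weights(pairs, candidates):
    """Relative frequency of each candidate analysis, per surface form."""
    counts = Counter(pairs)
    totals = defaultdict(int)
    for (surface, _), n in counts.items():
        totals[surface] += n
    weights = {}
    for surface, analyses in candidates.items():
        for analysis in analyses:
            seen = counts[(surface, analysis)]
            total = totals[surface]
            # Candidates never seen in the corpus get 0.0
            weights[(surface, analysis)] = seen / total if total else 0.0
    return weights

corpus = ["^ha/ha<cnjcoo>$", "^poytugañeʼẽ/poytugañeʼẽ<n>$", "", "^./.<sent>$"]
# Hypothetical stand-in for what the analyser returns for one form
candidates = {
    "poytugañeʼẽ": [
        "poytugañeʼẽ<n>",
        "a<prn><p1><sg>+poytugañeʼẽ<n>",
        "re<prn><p2><sg>+poytugañeʼẽ<n>",
    ]
}
weights = unigram_weights(read_tagged(corpus), candidates)
```

The sketch splits each line on the first `/`, so it assumes one analysis per token and no escaped slashes in the surface form.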


Techievena · 2018-07-05T23:00:27Z

Is it similar to supervised tagger training? @flammie


flammie · 2018-07-06T01:08:10Z

Pretty much, I'd say; a unigram tagger should work exactly the same, if I haven't missed anything.


ftyers · 2020-06-27T08:21:50Z

So one way this could work is:

Load original transducer, A
Read tagged corpus into a weighted FST, B
Intersect B and A, making C
Priority union C and A.
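The four steps above can be sketched on toy data, treating each transducer as a plain collection of (surface, analysis) pairs rather than as an automaton. This is only an illustration under stated assumptions: `intersect` and `priority_union` are hypothetical helpers, not lttoolbox functions, and priority is decided per pair here, which is one possible reading of the plan.

```python
def intersect(b, a):
    """C: corpus pairs that the analyser A also accepts, keeping the corpus weights."""
    return {pair: w for pair, w in b.items() if pair in a}

def priority_union(c, a, default=0.0):
    """Every pair of A, with the weight from C where C has one, else `default`."""
    result = {pair: default for pair in a}
    result.update(c)
    return result

# A: the original (unweighted) analyser, as a set of pairs
A = {
    ("poytugañeʼẽ", "poytugañeʼẽ<n>"),
    ("poytugañeʼẽ", "a<prn><p1><sg>+poytugañeʼẽ<n>"),
    ("poytugañeʼẽ", "re<prn><p2><sg>+poytugañeʼẽ<n>"),
}
# B: weights estimated from the tagged corpus (the second pair is unknown to A)
B = {
    ("poytugañeʼẽ", "poytugañeʼẽ<n>"): 1.0,
    ("ykére", "ykére<post>"): 1.0,
}

C = intersect(B, A)
reweighted = priority_union(C, A)
```

Because every pair ends up in the result exactly once, this avoids duplicate analyses with different weights, but a real implementation would have to do the same on paths through the automata instead of enumerated pairs.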

Questions:

Does intersection in lttoolbox do the right thing with weights?
We don't have priority union or subtract, and it seems a bit difficult to do without either of those.

@flammie @unhammer thoughts?

flammie · 2020-06-27T13:07:36Z

I don't think even OpenFst has a defined intersection of weighted or two-tape automata; they just do the encoded intersection, where a:b::W is treated as a single special symbol in an ordinary automaton intersection. It might be possible to add weights by way of the intersection algorithm, at least when the automata are mostly synchronised; otherwise I'd just go with composing.

For the experiments I published on weighting automata, we did compose(A, B), or at most compose(minus'(A', B'), B), which does something similar to priority union, I guess. It required some trickery, though. One could even just do union(A, B), since B is a gold corpus with good tags, right? With the compose method you mainly lose out if there is a non-1:1 relation in the direction you compose, I think, e.g. if you have foo+X:bar and foo+X:baz.

The part of A that doesn't get weighted by corpus should usually receive the penalty weight of unseen tokens.
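One way to picture the penalty weight is as a smoothed cost. The sketch below is an assumption-laden example, not the method from the experiments mentioned above: it uses additive (add-alpha) smoothing over (surface, analysis) pairs and converts each probability into a cost of -log(p), so lower is better, unlike the probability-like 1.0 and 0.0 in the original post. The function name `tropical_weights` is hypothetical.

```python
import math
from collections import Counter

def tropical_weights(corpus_pairs, analyser_pairs, alpha=1.0):
    """Cost -log(p) for every pair the analyser knows, with add-alpha smoothing."""
    counts = Counter(corpus_pairs)
    # Only count corpus pairs that the analyser can actually produce
    seen_total = sum(counts[pair] for pair in analyser_pairs)
    total = seen_total + alpha * len(analyser_pairs)
    return {
        pair: -math.log((counts[pair] + alpha) / total)
        for pair in analyser_pairs
    }

corpus_pairs = [("ha", "ha<cnjcoo>")] * 3 + [("tee", "tee<adj>")]
analyser_pairs = [("ha", "ha<cnjcoo>"), ("tee", "tee<adj>"), ("tee", "tee<n>")]
costs = tropical_weights(corpus_pairs, analyser_pairs)
unseen_penalty = costs[("tee", "tee<n>")]
```

Here the pair that never occurs in the toy corpus, `tee<n>`, receives the highest cost of the three, which plays the role of the penalty for unseen tokens.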

ftyers · 2020-06-27T15:24:45Z

The reason for not just doing union is that then we would have multiple identical analyses with different weights, right?

I was thinking compose would work, but we also don't have an implementation of compose in lttoolbox at the moment.

unhammer · 2022-09-24T11:26:25Z

#161 adds a compose (optional on matching sub-paths), though it is not very extensively tested :) Also, I have no idea what the expected value of composed weights would be.

flammie · 2022-09-24T12:04:27Z

I think weights are just added together in our WFSAs? Or, theoretically, combined using the weight structure's semiring's collect operation, but we've always used the tropical semiring, where that is just +.

unhammer · 2022-09-24T14:45:03Z

So whatever operation you use on weights when following arcs should also be used when composing? And if you want to compose g . f without changing the weights that are in f, then all arcs of g need to carry the identity weight (i.e. 0 if the operation is +)?
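The question can be checked on a toy example, assuming tropical-style weights where the weights of matched arcs are added and the cheapest of several alternatives is kept. The relations are plain dictionaries from (input, output) pairs to weights, and `compose` is a hypothetical helper for this sketch, unrelated to the implementation in #161.

```python
def compose(f, g):
    """Apply f first, then g; weights of matching pairs are added."""
    out = {}
    for (x, y), wf in f.items():
        for (y2, z), wg in g.items():
            if y == y2:
                w = wf + wg
                # If several routes give the same pair, keep the cheapest one
                out[(x, z)] = min(w, out.get((x, z), float("inf")))
    return out

f = {("a", "b"): 0.5, ("a", "c"): 2.0}
g_identity = {("b", "b"): 0.0, ("c", "c"): 0.0}  # weight 0 on every arc
g_weighted = {("b", "b"): 1.5, ("c", "c"): 0.0}  # adds a cost to one arc

unchanged = compose(f, g_identity)
shifted = compose(f, g_weighted)
```

With weight 0 on every pair of g, the result carries exactly the weights of f; any non-zero weight in g is added on top.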

unhammer · 2022-09-25T21:55:16Z

@flammie newest uses +

Techievena added 5 commits to Techievena/lttoolbox that referenced this issue between Jul 5 and Jul 31, 2018, each with the same message:

lt-reweight: assign weights to a compiled transducer based on a corpus

Write the utility to process tagged corpus and the binary lttoolbox file and return weighted analyses. Closes apertium#16

fc49173 (Jul 5, 2018)
444c595 (Jul 5, 2018)
29f19c1 (Jul 12, 2018)
847e2cc (Jul 14, 2018)
9b99a07 (Jul 31, 2018)

ftyers added the enhancement and weighting labels on Jun 27, 2020
