Write a utility to assign weights to a compiled transducer based on a corpus #16

Open
ftyers opened this issue Jul 1, 2018 · 9 comments
Labels
enhancement New feature or request weighting

Comments

@ftyers
Member

ftyers commented Jul 1, 2018

I imagine it will be called lt-reweight

It should have two arguments:

  1. a binary lttoolbox file, e.g. grn.automorf.bin
  2. a tagged corpus, e.g. grn.tagged

$ lt-reweight grn.automorf.bin grn.tagged

Where grn.tagged looks like:

^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^Guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^ombohéra/o<prn><p3><sg>+mbohéra<v><tv><pres>$
^hikuái/hikuái<aux><impf><p3><pl>$
^umi/umi<adj><dem><pl>$
^Guaranikuéra/guarani<n>+kuéra<det><pl>$
^pe/pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^teépe/tee<n>+pe<post>$
^./.<sent>$

^Guarani/guarani<n>$
^haʼe/haʼe<vbser><iv><pres>$
^peteĩva/peteĩ<num>+va<subs><dem>$
^umi/umi<adj><dem><pl>$
^teʼyikuéra/teʼyi<n>+kuéra<det><pl>$
^Amérika-gua/Amérika<np><top>+gua<post>$
^ñeʼẽnguéra/ñeʼẽ<n>+kuéra<det><pl>$
^apytépe/apytépe<post>$
^hetave/heta<adv>+ve<comp>$
^iñeʼẽhárava/iñeʼẽhárava<adj>$
^,/,<cm>$
^oñemohendáva/o<prn><p3><sg>+je<pass>+mohenda<v><tv><pres>+va<subs><dem>$
^irundy/irundy<num>$
^tetãnguéra/tetã<n>+kuéra<det><pl>$
^iñambuévape/iñambuéva<adj>+pe<post>$
^(/(<lpar>$
^Paraguái/Paraguái<np><top>$
^,/,<cm>$
^Argentina/Argentina<np><top>$
^,/,<cm>$
^Volívia/Volívia<np><top>$
^ha/ha<cnjcoo>$
^Brasil/Brasil<np><top>$
^)/)<rpar>$
^./.<sent>$

^Avei/avei<adv>$
^,/,<cm>$
^haʼe/haʼe<vbser><iv><pres>$
^ñoite/ñoite<adv>$
^ojehechakuaáva/o<prn><p3><sg>+je<pass>+hechakuaa<v><tv><pres>+va<subs><dem>$
^ñeʼẽ/ñeʼẽ<n>$
^teéramo/tee<n>+ramo<post>$
^peteĩ/peteĩ<num>$
^tetã/tetã<n>$
^Ñembyamérika-guápe/Ñembyamérika<np><top>+gua<post>+pe<post>$
^./.<sent>$

^Tupi/Tupi<n>$
^ha/ha<cnjcoo>$
^guarani/guarani<n>$
^ñeʼẽ/ñeʼẽ<n>$
^aty/aty<n>$
^guasu/guasu<adj>$
^rehegua/rehegua<post>$

^,/,<cm>$
^oguereko/o<prn><p3><sg>+guereko<v><tv><pres>$
^hetáichagua/hetáichagua<adj>$

^ñeʼẽnunga/ñeʼẽnunga<n>$
^,/,<cm>$
^upéicharõ/upéicha<adv>+rõ<post>$
^jepe/jepe<adv>$
^oĩ/oĩ<v><iv><pres>$
^jekupyty/jekupyty<v><tv><pres>$
^ijapytepekuéra/i<prn><p3><sg>+japyte<n>+pe<post>+kuéra<det><pl>$
^ha/ha<cnjcoo>$
^heta/heta<adv>$
^mbaʼépe/mbaʼe<n>+pe<post>$
^ojojogua/ojojogua<n>$
^koʼã/koʼã<adj><dem><pl>$
^ñeʼẽnungakuéra/ñeʼẽnunga<n>+kuéra<det><pl>$
^./.<sent>$

^Avañeʼẽ/avañeʼẽ<n>$
^ha/ha<cnjcoo>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^haʼe/haʼe<vbser><iv><pres>$
^Paraguái/Paraguái<np><top>$
^retãme/tetã<n>+pe<post>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^ary/ary<n>$
^1992/1992<num>$
^guive/guive<post>$
^./.<sent>$

^Japypateĩ/Japypateĩ<num>$
^2006/2006<num>$
^guive/guive<post>$
^haʼe/haʼe<vbser><iv><pres>$
^avei/avei<adv>$
^ñeʼẽ/ñeʼẽ<n>$
^tee/tee<adj>$
^Mercosur-pe/Mercosur<np><org>+pe<case>$
^,/,<cm>$
^karaiñeʼẽ/karaiñeʼẽ<n>$
^ha/ha<cnjcoo>$
^poytugañeʼẽ/poytugañeʼẽ<n>$
^ykére/ykére<post>$
^./.<sent>$

And the output of the analyser for e.g. poytugañeʼẽ is:

^poytugañeʼẽ/poytugañeʼẽ<n>/a<prn><p1><sg>+poytugañeʼẽ<n>/re<prn><p2><sg>+poytugañeʼẽ<n>$^./.<sent>$

So the analyses should be weighted as follows:

poytugañeʼẽ : poytugañeʼẽ<n> = 1.0
poytugañeʼẽ : a<prn><p1><sg>+poytugañeʼẽ<n> = 0.0
poytugañeʼẽ : re<prn><p2><sg>+poytugañeʼẽ<n> = 0.0
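
A minimal sketch of how such weights might be derived: the Python below counts each surface/analysis pair in a tagged corpus and turns the counts into relative frequencies per surface form. The function names (`read_tagged`, `relative_weights`) and the three-line sample corpus are illustrative assumptions, not part of lttoolbox. The candidate analyses are passed in by hand here, where a real tool would obtain them from the compiled transducer.

```python
import re
from collections import Counter, defaultdict

# One lexical unit per line in the tagged corpus: ^surface/analysis$
LU = re.compile(r"^\^(.*?)/(.*)\$$")

def read_tagged(lines):
    """Count (surface, analysis) pairs from lines like ^ha/ha<cnjcoo>$."""
    counts = defaultdict(Counter)
    for line in lines:
        m = LU.match(line.strip())
        if m:  # blank lines (sentence breaks) do not match and are skipped
            counts[m.group(1)][m.group(2)] += 1
    return counts

def relative_weights(counts, surface, analyses):
    """Relative frequency of each candidate analysis of `surface` (0.0 if unseen)."""
    seen = counts.get(surface, Counter())
    total = sum(seen.values())
    return {a: (seen[a] / total if total else 0.0) for a in analyses}

# Hypothetical three-line corpus in the format shown above.
corpus = [
    "^karaiñeʼẽ/karaiñeʼẽ<n>$",
    "^ha/ha<cnjcoo>$",
    "^poytugañeʼẽ/poytugañeʼẽ<n>$",
]
counts = read_tagged(corpus)

# Candidate analyses as listed in the analyser output above.
candidates = [
    "poytugañeʼẽ<n>",
    "a<prn><p1><sg>+poytugañeʼẽ<n>",
    "re<prn><p2><sg>+poytugañeʼẽ<n>",
]
weights = relative_weights(counts, "poytugañeʼẽ", candidates)
for analysis, w in weights.items():
    print("poytugañeʼẽ :", analysis, "=", w)
```

Only the analysis attested in the corpus receives a non-zero weight, which matches the weighting listed above.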
@Techievena
Member

Is it similar to supervised tagger training? @flammie

Techievena added a commit to Techievena/lttoolbox that referenced this issue Jul 5, 2018
Write the utility to process tagged corpus and the binary lttoolbox
file and return weighted analyses.

Closes apertium#16
@flammie
Member

flammie commented Jul 6, 2018

Pretty much I'd say, a unigram tagger should work exactly the same if I haven't missed anything.

Techievena added a commit to Techievena/lttoolbox that referenced this issue Jul 12, 2018
Write the utility to process tagged corpus and the binary lttoolbox
file and return weighted analyses.

Closes apertium#16
Techievena added a commit to Techievena/lttoolbox that referenced this issue Jul 14, 2018
Write the utility to process tagged corpus and the binary lttoolbox
file and return weighted analyses.

Closes apertium#16
Techievena added a commit to Techievena/lttoolbox that referenced this issue Jul 31, 2018
Write the utility to process tagged corpus and the binary lttoolbox
file and return weighted analyses.

Closes apertium#16
@ftyers ftyers added enhancement New feature or request weighting labels Jun 27, 2020
@ftyers
Member Author

ftyers commented Jun 27, 2020

So one way this could work is:

  • Load original transducer, A
  • Read tagged corpus into a weighted FST, B
  • Intersect B and A, making C
  • Priority union C and A.
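
The four steps can be sketched on finite relations, with plain Python sets and dicts standing in for the transducers. This is an assumption-laden toy, not lttoolbox code: `intersect` and `priority_union` are hypothetical helpers, A is modelled as a set of (surface, analysis) pairs, B as a dict from such pairs to corpus weights, and unweighted pairs are assumed to default to 0.0.

```python
def intersect(a_pairs, b_weights):
    """C: the pairs of B that A also accepts, keeping B's corpus weight."""
    return {pair: w for pair, w in b_weights.items() if pair in a_pairs}

def priority_union(c_weights, a_pairs, default=0.0):
    """Every pair of A, where pairs also in C take C's weight and the rest get `default`."""
    result = {pair: default for pair in a_pairs}
    result.update(c_weights)  # C has priority over A
    return result

# A: what the original analyser accepts (unweighted).
A = {
    ("poytugañeʼẽ", "poytugañeʼẽ<n>"),
    ("poytugañeʼẽ", "a<prn><p1><sg>+poytugañeʼẽ<n>"),
    ("poytugañeʼẽ", "re<prn><p2><sg>+poytugañeʼẽ<n>"),
}
# B: weighted pairs read from the tagged corpus.
B = {
    ("poytugañeʼẽ", "poytugañeʼẽ<n>"): 1.0,
    ("ykére", "ykére<post>"): 1.0,  # in the corpus but, in this toy, not in A
}

C = intersect(A, B)
reweighted = priority_union(C, A)
```

The corpus pair that A does not accept is dropped by the intersection, and the two analyses of poytugañeʼẽ that the corpus never attests keep the default weight.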

Questions:

  • Does intersection in lttoolbox do the right thing with weights?
  • We don't have priority union or subtract; it seems a bit difficult to do without either of those.

@flammie @unhammer thoughts?

@flammie
Member

flammie commented Jun 27, 2020

I don't think even openfst has a defined intersection of weighted or two-tape automata; they just do the encoded intersection, where a:b::W is treated as a special symbol in an ordinary automaton intersection. It might be possible to add weights by way of the intersection algorithm, at least when the automata are mostly synchronised; otherwise I'd just go with composing.

For the experiments I published on weighting automata we did compose(A, B), or at most compose(minus'(A', B'), B), which does something similar to priority union, I guess. It required some trickery though. One could even just do union(A, B), since B is a gold corpus with good tags, right? With the compose method you mainly lose out if there is a non-1:1 relation in the direction you compose, I think, e.g. if you have foo+X:bar foo+X:baz.

The part of A that doesn't get weighted by corpus should usually receive the penalty weight of unseen tokens.
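
One way the penalty could be chosen, as a sketch only: convert corpus counts to negative log probabilities (so that lower is better, as in the tropical semiring) and give unseen analyses a weight strictly above that of any seen one. The formula for the penalty, -log(1 / (total + 1)), and the function name `tropical_weights` are assumptions made for illustration; the thread does not specify either.

```python
import math

def tropical_weights(counts, analyses):
    """Map counts to -log relative frequency; unseen analyses get a penalty weight."""
    total = sum(counts.values())
    # Assumed penalty: as if the analysis had been seen less than once.
    penalty = -math.log(1.0 / (total + 1))
    weights = {}
    for a in analyses:
        c = counts.get(a, 0)
        weights[a] = -math.log(c / total) if c else penalty
    return weights

# Hypothetical counts for one surface form.
counts = {"x<n>": 3, "x<v>": 1}
w = tropical_weights(counts, ["x<n>", "x<v>", "x<adj>"])
```

With these counts the frequent analysis gets the lowest weight, the rare one a higher weight, and the unseen analysis the highest of the three.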

@ftyers
Member Author

ftyers commented Jun 27, 2020

The reason for not just doing union is that then we would have multiple identical analyses with different weights, right?
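
The duplication can be seen on a toy model in which each transducer is a dict from (surface, analysis) pairs to a weight. This is an illustrative sketch, with `union` a hypothetical helper and the unweighted analyser assumed to carry weight 0.0 on every path; it does not use any lttoolbox API.

```python
from collections import defaultdict

def union(*relations):
    """Plain union: keep every path of every operand, duplicates included."""
    paths = defaultdict(list)
    for rel in relations:
        for pair, weight in rel.items():
            paths[pair].append(weight)
    return dict(paths)

# A: unweighted analyser (weight 0.0 assumed on every path).
a = {
    ("poytugañeʼẽ", "poytugañeʼẽ<n>"): 0.0,
    ("poytugañeʼẽ", "a<prn><p1><sg>+poytugañeʼẽ<n>"): 0.0,
}
# B: the corpus, with the weight from the example at the top of the issue.
b = {("poytugañeʼẽ", "poytugañeʼẽ<n>"): 1.0}

u = union(a, b)
duplicated = {pair: ws for pair, ws in u.items() if len(ws) > 1}
```

Here the analysis poytugañeʼẽ<n> ends up with two paths, one weighted 0.0 from A and one weighted 1.0 from B, which is the situation a priority union would avoid.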

I was thinking compose would work, but we also don't have an implementation of compose in lttoolbox at the moment.

@unhammer
Member

unhammer commented Sep 24, 2022

#161 adds a compose (optional on matching sub-paths), though it is not very extensively tested :) Also, I have no idea what the expected value of composed weights would be.

@flammie
Member

flammie commented Sep 24, 2022

I think weights are just added together in our WFSAs? Or, theoretically, combined using the collect operation of the weight structure's semiring, but we've always used the tropical semiring, which is just +.

@unhammer
Member

So whatever operation you use on weights when following arcs should be used when composing? And if you want to compose g . f without changing the weights that are in f, then all arcs of g need to carry the identity weight (i.e. 0 if the operation is +)?
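
A small sketch of that question on finite weighted relations, assuming the tropical semiring (weights along a path are added, and where several paths give the same pair the minimum is kept). The function `compose_then` is hypothetical and only models the arithmetic; it says nothing about how the compose in #161 is implemented.

```python
def compose_then(f, g):
    """Apply f, then g: (x, y) in f and (y, z) in g give (x, z) with weight wf + wg."""
    out = {}
    for (x, y), wf in f.items():
        for (y2, z), wg in g.items():
            if y == y2:
                w = wf + wg  # weights along a path are added
                key = (x, z)
                out[key] = min(w, out.get(key, w))  # keep the best competing path
    return out

f = {("a", "b"): 1.5, ("a", "c"): 2.0}

# g maps each symbol to itself with weight 0, the identity for +.
g_identity = {("b", "b"): 0.0, ("c", "c"): 0.0}
unchanged = compose_then(f, g_identity)

# The same g with a non-zero weight shifts every path of f by that amount.
g_shift = {("b", "b"): 0.5, ("c", "c"): 0.5}
shifted = compose_then(f, g_shift)
```

Under these assumptions, composing with zero-weight arcs leaves the weights of f as they were, while any non-zero weight in g is added on top.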

@unhammer
Member

@flammie newest uses +

4 participants