együttjárhatnak: incorrect POS tags #8

Open
DavidNemeskey opened this issue Nov 15, 2016 · 5 comments

DavidNemeskey commented Nov 15, 2016

The analysis of együttjárhatnak (QT,HFSTLemm,ML3-PosLem-hfstcode) is [V][_Mod][Prs.NDef.3Pl], which is incorrect: the tags [V] and [_Mod] should be [/V] and [_Mod/V], respectively.

HFST does not recognize the word (probably because it should be written separately), so it might be some fallback module that produces this analysis?

Similar invalid analyses are

  • Mal-csoporthoz / [N][All]
  • Tíz- / [Num][Nom] (interestingly enough, HFST returns an analysis for tíz-, so why doesn't it appear in GATE? This word was at the beginning of the sentence, hence the capitalization, but usually it is not a problem)
@sassbalint
Member

Balázs, can you check this issue, please? :)


dlazesz commented Nov 29, 2016

Did you check the training corpus? Do we have a bug tracker for it at all?
(We have discovered numerous bugs in it.)
For example: Edit (a female first name) has the lemma Edi and some accusative tag...

I checked the corpus:

./cwszt.conll-2009_ready.disamb.new:együttjárhatnak	együttjárhat	V	SubPOS=m|Mood=i|Tense=p|Per=3|Num=p|Def=n	együttjár[V][_Mod][Prs.NDef.3Pl]
./gazdtar.conll-2009_ready.disamb.new:utasíthatja.[Gt.	utasíthatja.[Gt.	X	_	utasít[V][_Mod][Prs.Def.3Sg][Punct]

@DavidNemeskey:
Could you check whether the gold-standard analysis of each word in the training corpus is in the set of analyses that emMorph gives for it? This would be the first step in fixing this kind of issue.
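A minimal sketch of the check described above, assuming the corpus can be read as (word, gold analysis) pairs and that some wrapper returns the set of analyses for a word. The function `find_uncovered`, the `analyze` callable, and the toy data are all hypothetical; this is not the actual emMorph API, and the analyses in the toy lookup are made up for illustration.

```python
def find_uncovered(gold, analyze):
    """Return the (word, gold_analysis) pairs the analyzer does not produce.

    gold: iterable of (word, gold_analysis) pairs read from the corpus.
    analyze: hypothetical callable mapping a word to an iterable of analyses;
             a stand-in for a real emMorph wrapper, not its actual API.
    """
    cache = {}      # analyze each distinct word only once
    missing = []
    for word, gold_analysis in gold:
        if word not in cache:
            cache[word] = set(analyze(word))
        if gold_analysis not in cache[word]:
            missing.append((word, gold_analysis))
    return missing


# Toy stand-in for the analyzer (assumption: invented data, not real output).
toy_lexicon = {
    'ház': {'ház[/N][Nom]'},
    'együttjárhatnak': set(),   # the issue reports HFST does not recognize it
}

toy_gold = [
    ('ház', 'ház[/N][Nom]'),
    ('együttjárhatnak', 'együttjár[V][_Mod][Prs.NDef.3Pl]'),
]

uncovered = find_uncovered(toy_gold, lambda w: toy_lexicon.get(w, set()))
for word, analysis in uncovered:
    print(word, analysis, sep='\t')
```

Every pair reported this way is either a corpus error or a gap in the analyzer, so the list could serve as input for a corpus bug tracker.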

@vinczev:
If I get a newer version of the corpus, I'll retrain the model and this issue will be solved instantly.
(I do not want to make changes in it on my own as it would diverge from the versions used by others...)
There should be some central repository with a bugtracker for the corpus too!

@DavidNemeskey
Contributor Author

@dlazesz Sorry, but I think this should be done by the owner of the corpus and the tagger, not a third party. :)

That said, the above three are all I have discovered; though I did not specifically look for these differences, I added a mapping for the erroneous tags, and this is the list I ended up with:

{
  '[N]': '[/N]',
  '[V]': '[/V]',
  '[Num]': '[/Num]',
  '[_Mod]': '[_Mod/V]'
}
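For illustration, here is one way such a mapping could be applied to an analysis string. This is a sketch, not the code actually used; the names `TAG_FIXES`, `TAG_RE` and `fix_tags` are hypothetical. It matches whole bracketed tags, so tags that are already in the new style (such as `[/V]`) are left untouched.

```python
import re

# Mapping from this comment: erroneous tag -> expected tag.
TAG_FIXES = {
    '[N]': '[/N]',
    '[V]': '[/V]',
    '[Num]': '[/Num]',
    '[_Mod]': '[_Mod/V]',
}

# Matches one complete bracketed tag, e.g. "[V]" or "[Prs.NDef.3Pl]".
TAG_RE = re.compile(r'\[[^\[\]]*\]')


def fix_tags(analysis):
    """Replace each erroneous tag; keep every other tag and all other text."""
    return TAG_RE.sub(lambda m: TAG_FIXES.get(m.group(0), m.group(0)), analysis)


print(fix_tags('[V][_Mod][Prs.NDef.3Pl]'))  # [/V][_Mod/V][Prs.NDef.3Pl]
print(fix_tags('[N][All]'))                 # [/N][All]
```

Note that this only patches the tagger output after the fact; the erroneous tags would still be present in the training corpus.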


dlazesz commented Nov 29, 2016

I am not the owner of the corpus.
To avoid later errors in the chain I'll wait for a new version of the corpus.
This issue has nothing to do with the tagger.

@DavidNemeskey:
Please be so kind and help the corpus owners by finding bugs in the corpus instead of blaming others, who have nothing to do with the issue.

DavidNemeskey removed their assignment Nov 29, 2016

@DavidNemeskey
Contributor Author

I am not blaming anybody, I just don't know where this error stems from. I have already listed all errors I found.

@vinczev I second the notion of having a bug-tracker for the corpus. The errors I sent a few weeks earlier (disagreement between the old and new-style tags, [Acc] missing, etc.) via email should also be fixed before a new model can be trained.
