ML3 tokenizer chokes on certain inputs #19

Open
DavidNemeskey opened this issue Dec 13, 2016 · 0 comments
@DavidNemeskey (Contributor) commented:

This sentence breaks HungarianTokenizerSentenceSplitter: "Abban az esetben, ha a - fiktív - www.kereso.elte.hu szervertől kérjük a www.kereso.elte.hu/nev=kiss,jozsef%kar=jog%tagozat=nappali címen található oldalt, akkor az elképzelt kiszolgálónk a kérésre megmutatná a megnevezett egyetemi hallgatóról rendelkezésre álló adatokat." (Roughly: "If we request the page at www.kereso.elte.hu/nev=kiss,jozsef%kar=jog%tagozat=nappali from the - fictional - server www.kereso.elte.hu, our imagined server would, in response, display the available data about the named university student.") The online demo, on the other hand, processes this sentence without problems.

As far as I can tell, the error has two sources:

  • the URL is broken into tokens
  • MySplitter inserts both the URL and the tokens returned by the main splitter class, so the tokens would be [..., "www.kereso.elte.hu/nev=kiss,jozsef%kar=jog%tagozat=nappali", "www", ".", "kereso.elte.hu", ...]

The former behaviour is the same in the online demo; the latter seems to be particular to hunlp-GATE. This is strange, because even after I downloaded magyarlanc and replaced the jar in hunlp-GATE with it, the error persisted.
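To make the suspected duplication concrete, here is a minimal Python sketch of the behaviour described above. All names (`buggy_split`, `main_splitter`, the URL regex) are hypothetical stand-ins, not the actual hunlp-GATE or magyarlanc code; the point is only to illustrate how appending both the matched URL and the sub-tokens from the main splitter produces the duplicated token stream.

```python
import re

# Illustrative URL pattern, not the one hunlp-GATE actually uses.
URL_RE = re.compile(r"\bwww\.\S+")

def main_splitter(text):
    # Stand-in for the main tokenizer: keeps word characters together
    # and splits off every punctuation mark (including dots).
    return re.findall(r"\w+|[^\w\s]", text)

def buggy_split(text):
    """Sketch of the reported behaviour: the URL is appended as a
    whole token, but the sub-tokens from the main splitter are
    appended as well instead of being skipped."""
    tokens = []
    for word in text.split():
        if URL_RE.match(word):
            tokens.append(word)                  # whole URL ...
            tokens.extend(main_splitter(word))   # ... and its pieces too
        else:
            tokens.extend(main_splitter(word))
    return tokens

def fixed_split(text):
    """Expected behaviour: a recognized URL stays a single token."""
    tokens = []
    for word in text.split():
        if URL_RE.match(word):
            tokens.append(word)                  # URL kept intact
        else:
            tokens.extend(main_splitter(word))
    return tokens
```

With this sketch, `buggy_split("a www.kereso.elte.hu oldalt")` yields both `"www.kereso.elte.hu"` and its fragments (`"www"`, `"."`, ...), mirroring the token list quoted above, while `fixed_split` keeps the URL as one token.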
