
Correct language information from the VLO is sometimes overridden by erroneous language #18

Open
twagoo opened this issue Apr 4, 2019 · 2 comments


twagoo commented Apr 4, 2019

The language detection logic sometimes misclassifies the language of text content (plain text, PDF, ...). In some (not all) cases, correct language information is passed on by the calling tool (i.e. the VLO). When these two cases coincide, correct language information is overridden by incorrect language information. We should think of ways to avoid this. Some thoughts:

  • Never automatically override provided language information but let the user 'opt in' to detection
  • In case of a conflict between provided and detected language, offer both options to the user
  • Converting to plain text before detection (Update dropwizard (fix Snyk vulnerabilities) #75) might help to improve detection accuracy
  • Perhaps a confidence score for the detection is available and can be utilised somehow?
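
A minimal sketch of how the first, second and fourth options could fit together. This is not the Switchboard's actual code: `Detection`, `Resolution`, `resolve_language`, the `detect_opt_in` flag and the `min_confidence` threshold are all hypothetical names and values, and it assumes the provided and detected codes use the same scheme (e.g. both ISO 639-3).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a language code plus a confidence in [0, 1]."""
    language: str
    confidence: float


@dataclass(frozen=True)
class Resolution:
    """The language to preselect and any other candidates to offer the user."""
    language: Optional[str]
    alternatives: Tuple[str, ...] = ()


def resolve_language(
    provided: Optional[str],
    detected: Optional[Detection],
    detect_opt_in: bool = False,
    min_confidence: float = 0.9,  # assumed threshold, would need tuning
) -> Resolution:
    # Nothing was provided by the caller: fall back to detection, if any.
    if provided is None:
        return Resolution(detected.language if detected else None)

    # No detection result, or both sources agree: keep the provided value.
    if detected is None or detected.language == provided:
        return Resolution(provided)

    # Conflict: only prefer the detected language when the user opted in
    # AND the detector is confident; in every case offer the other option.
    if detect_opt_in and detected.confidence >= min_confidence:
        return Resolution(detected.language, (provided,))
    return Resolution(provided, (detected.language,))
```

With this policy, a provided `"nld"` is never silently replaced by a detected `"eng"`; the detected value only shows up as an alternative unless the user has opted in and the confidence is high enough.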

emanueldima transferred this issue from clarin-eric/LRSwitchboard on Sep 26, 2019

emanueldima commented

@twagoo can you provide a test case? Since April 2019 we changed the library for language detection, and this bug may not reproduce anymore. Related to point 3, we now convert to text before language detection.


twagoo commented Sep 21, 2021

> @twagoo can you provide a test case? Since April 2019 we changed the library for language detection, and this bug may not reproduce anymore. Related to point 3, we now convert to text before language detection.

I don't have a test case at hand, and unfortunately I didn't document how to reproduce this when the ticket was created. If we have not experienced this ourselves since 2019, i.e. if language detection is deemed reliable, I think this ticket could be closed. If we want to consider using the language information from the metadata (passed by the VLO) in some way, that should probably be a separate, new issue.
