Correct language information from the VLO is sometimes overridden by erroneous language #18

twagoo · 2019-04-04T07:36:19Z

The language detection logic sometimes misclassifies the language of text content (plain text, PDF, ...). In some (not all) cases, correct language information is passed on by the calling tool (i.e. the VLO). These two cases overlap, which means that correct language information is sometimes overridden by incorrect detected language information. We should think of ways to avoid this. Some thoughts:

1. Never automatically override provided language information, but let the user 'opt in' to detection
2. In case of a conflict between provided and detected language, offer both options to the user
3. Converting to plain text before detection (Update dropwizard (fix Snyk vulnerabilities) #75) might help to improve detection accuracy
4. Perhaps a confidence score for the detection is available and can be utilised somehow?
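
To make the options above concrete, here is a minimal sketch of how provided and detected language could be reconciled. It is not the Switchboard's actual code: `LanguageDecision`, `resolve_language`, the `min_confidence` threshold of 0.9 and the use of ISO 639-3 codes are all hypothetical, and it assumes the detector can return a confidence score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LanguageDecision:
    """Outcome of reconciling provided and detected language (hypothetical type)."""
    language: Optional[str]        # language to preselect, if any
    ask_user: bool                 # True when the user should choose between candidates
    candidates: Tuple[str, ...]    # options to show to the user


def resolve_language(
    provided: Optional[str],
    detected: Optional[str],
    confidence: Optional[float] = None,
    min_confidence: float = 0.9,   # assumed threshold, not taken from any real detector
) -> LanguageDecision:
    # No detection result, or detection agrees with the caller: keep what was provided.
    if detected is None or detected == provided:
        candidates = (provided,) if provided else ()
        return LanguageDecision(provided, False, candidates)

    # The caller gave no language, so detection has nothing to override.
    if provided is None:
        return LanguageDecision(detected, False, (detected,))

    # Conflict: the provided language is never silently replaced.
    # A missing confidence score is treated as a detection worth showing.
    trustworthy = confidence is None or confidence >= min_confidence
    if trustworthy:
        # Offer both options, with the provided language preselected.
        return LanguageDecision(provided, True, (provided, detected))

    # Low-confidence detection is ignored in favour of the provided language.
    return LanguageDecision(provided, False, (provided,))
```

With this policy, `resolve_language("nld", "deu", 0.95)` keeps `"nld"` preselected and asks the user to choose, while `resolve_language("nld", "deu", 0.4)` keeps `"nld"` without asking.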

emanueldima · 2021-09-21T07:57:41Z

@twagoo can you provide a test case? Since April 2019 we changed the library for language detection, and this bug may not reproduce anymore. Related to point 3, we now convert to text before language detection.

twagoo · 2021-09-21T08:24:23Z

> @twagoo can you provide a test case? Since April 2019 we changed the library for language detection, and this bug may not reproduce anymore. Related to point 3, we now convert to text before language detection.

I don't have a test case at hand, and unfortunately I didn't document how to reproduce this when the ticket was created. If we have not experienced this ourselves since 2019, i.e. if language detection is now deemed reliable, I think this ticket can be closed. If we want to consider using the language information from the metadata (passed by the VLO) in some way, that should probably be a separate, new issue.

emanueldima transferred this issue from clarin-eric/LRSwitchboard Sep 26, 2019
