The language detection logic sometimes misclassifies the language of text content (plain text, PDF, ...). In some (not all) cases, correct language information is passed on by the calling tool (i.e. the VLO). These two cases overlap, which means that correct language information gets overridden by incorrect detected language information. We should think of ways to avoid this. Some thoughts:
Never automatically override provided language information but let the user 'opt in' to detection
In case of a conflict between provided and detected language, offer both options to the user
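To make the two suggestions above concrete, here is a minimal sketch of how they could be combined. It is not taken from the actual codebase: `resolve_language`, `LanguageChoice` and the `opt_in_detection` flag are hypothetical names, and language codes are assumed to be plain strings such as `"nl"` or `"de"`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LanguageChoice:
    """Hypothetical result type; not part of the real code."""
    selected: Optional[str]        # language used by default
    alternatives: Tuple[str, ...]  # other candidates to offer the user
    conflict: bool                 # True if provided and detected disagree


def resolve_language(provided: Optional[str],
                     detected: Optional[str],
                     opt_in_detection: bool = False) -> LanguageChoice:
    """Combine provided (e.g. from the VLO) and detected language.

    Assumed policy: provided information is never overridden
    automatically. Detection is only consulted when nothing was
    provided, or when the user opted in, in which case a conflicting
    detection result is offered as an alternative.
    """
    if provided is None:
        # Nothing was passed on by the caller: detection is the only source.
        return LanguageChoice(detected, (), False)
    if not opt_in_detection or detected is None or detected == provided:
        # Keep the provided language; there is nothing to offer.
        return LanguageChoice(provided, (), False)
    # Conflict: keep the provided language selected, offer the detected one.
    return LanguageChoice(provided, (detected,), True)
```

With this policy, a caller passing `"nl"` keeps `"nl"` even if detection reports `"de"`; the detected value only surfaces as an alternative once the user has opted in.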
@twagoo can you provide a test case? In April 2019 we changed the library used for language detection, so this bug may no longer reproduce. Related to point 3, we now convert to text before language detection.
I don't have a test case at hand, and unfortunately I didn't document how to reproduce this when the ticket was created. If we have not experienced this ourselves since 2019, i.e. if language detection is deemed reliable, I think this ticket could be closed. If we want to consider using the language information from the metadata (passed by the VLO) in some way, that should probably be a separate, new issue.