What is the correct answer for detect() on a multilingual string? #13

Closed
@fergald

If a string is 100 characters of English followed by 900 characters of French, what is the ideal result? Is it the following?

[
  {language: "en", confidence:.1},
  {language: "fr", confidence:.9},
]

I haven't been able to come up with a better idea than that each language in the result should tell you what fraction of the string is in that language.

This gets more complicated when the language of a segment of the string is itself ambiguous. E.g. for an English article talking about words that are shared between Chinese and Japanese, what is the correct answer? Assume the text is 80% English and the remaining 20% could be either Chinese or Japanese. What is the ideal result? Is it

[
  {language: "en", confidence:.8},
  {language: "ja", confidence:.1},
  {language: "zh", confidence:.1},
]

even though 20% of the text is Japanese and 20% is Chinese? I can't think of a better "correct" answer but maybe there is one.
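
To make the arithmetic behind that result concrete, here is a minimal sketch of one possible reading: each segment's share of the string is split evenly among the languages it could plausibly be. The function name, the `{length, candidates}` input shape, and the even split are all assumptions for illustration, not something any spec or implementation defines.

```javascript
// Hypothetical helper (not part of any real API).
// Each segment is {length, candidates}: its length in characters and the
// languages it could plausibly be written in.
function fractionsFromSegments(segments) {
  const total = segments.reduce((sum, s) => sum + s.length, 0);
  if (total === 0) return [];
  const shares = new Map();
  for (const { length, candidates } of segments) {
    // Assumption: an ambiguous segment's share is divided evenly among its candidates.
    const share = length / total / candidates.length;
    for (const language of candidates) {
      shares.set(language, (shares.get(language) ?? 0) + share);
    }
  }
  return [...shares].map(([language, confidence]) => ({ language, confidence }));
}

// 80% unambiguous English, 20% shared between Japanese and Chinese.
const ambiguousExample = fractionsFromSegments([
  { length: 800, candidates: ["en"] },
  { length: 200, candidates: ["ja", "zh"] },
]);
```

Under this reading the confidences always sum to 1, which is why "ja" and "zh" each end up near 0.1 rather than 0.2.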

Also, from an implementation perspective, the above "correct" answer is relatively easy to compute. Models may have a fixed maximum input size, and the above can be calculated by breaking the string into chunks and averaging over the results for each chunk.
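
A rough sketch of that chunk-and-average idea, assuming a length-weighted average so that a short final chunk does not count as much as a full one. `detectByChunks` and the `detectChunk` callback are hypothetical names; `detectChunk` stands in for a synchronous per-chunk model call and is not a real API.

```javascript
// Hypothetical helper: combines per-chunk results into one length-weighted result.
// detectChunk(chunkText) is assumed to return [{language, confidence}, ...].
function detectByChunks(text, maxChunkLength, detectChunk) {
  // Split by code point so surrogate pairs are never cut in half.
  const codePoints = Array.from(text);
  const totals = new Map();
  for (let start = 0; start < codePoints.length; start += maxChunkLength) {
    const chunk = codePoints.slice(start, start + maxChunkLength);
    // Weight each chunk by the fraction of the whole string it covers.
    const weight = chunk.length / codePoints.length;
    for (const { language, confidence } of detectChunk(chunk.join(""))) {
      totals.set(language, (totals.get(language) ?? 0) + weight * confidence);
    }
  }
  return [...totals]
    .map(([language, confidence]) => ({ language, confidence }))
    .sort((a, b) => b.confidence - a.confidence);
}

// Toy stand-in detector for illustration only: "e" means English, anything else French.
const toyDetect = (chunk) => [{ language: chunk[0] === "e" ? "en" : "fr", confidence: 1 }];
const chunkedExample = detectByChunks("e".repeat(100) + "f".repeat(900), 100, toyDetect);
```

With the toy detector, the 100/900 split from the first example comes out as roughly 0.9 for "fr" and 0.1 for "en", up to floating-point rounding.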

Questions

  • Should we even be trying to spec this level of detail?
  • If so, should we spec the above?
