confidences of language detection

Language detection returns a map from language to confidence. There are many languages: Chrome's current language detector outputs about 120 entries in this map. Usually a small number of them have high confidence and most have a confidence below 1%. The long tail is just noise. However, it is good for the confidences to sum to 1.0.

If we mask low-confidence languages, we should add their weight to the undefined language (`"und"`) so that the sum continues to be 1.0.

# Approaches to cutting off the output:

## Fixed cutoff

We set a fixed cutoff and hide any language whose confidence falls below it. This could be problematic when a text is genuinely multilingual and contains many small portions of different languages: any language below the cutoff will not be mentioned. It is even possible for every language to be below the cutoff.
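A minimal sketch of the fixed cutoff, assuming the detector output is available as a plain dict from language tag to confidence. The function name `apply_fixed_cutoff` and the example numbers are hypothetical and are not part of any Chrome API.

```python
from __future__ import annotations


def apply_fixed_cutoff(confidences: dict[str, float], cutoff: float = 0.01) -> dict[str, float]:
    """Keep languages at or above `cutoff`; fold everything else into "und"."""
    kept = {
        lang: conf
        for lang, conf in confidences.items()
        if lang != "und" and conf >= cutoff
    }
    # The masked weight (plus any "und" weight already in the input) moves to
    # "und", so the total of the output equals the total of the input.
    masked = sum(conf for lang, conf in confidences.items() if lang not in kept)
    if masked > 0:
        kept["und"] = masked
    return kept


# Hypothetical detector output with a small noisy tail.
mostly_english = {"en": 0.6, "fr": 0.3, "de": 0.096, "nl": 0.004}
print(apply_fixed_cutoff(mostly_english))

# The failure case described above: 200 made-up languages at 0.5% each are
# all below the 1% cutoff, so only "und" survives.
many_small = {f"lang{i}": 0.005 for i in range(200)}
print(apply_fixed_cutoff(many_small))
```

The second call illustrates the failure mode: every language is individually below the cutoff, so the output carries no language information at all.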

## Fixed cumulative cutoff

We set a fixed cutoff and sum the weights from highest to lowest until the running total exceeds the cutoff. We merge all subsequent languages into `"und"`. E.g. with a cumulative cutoff of 0.99, the returned languages make up at least 99% of the weight and the omitted languages make up at most 1%. If the text contains equal amounts of many different languages, all or most of them will be present in the output.
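The sort-then-accumulate idea can be sketched as follows, again assuming a plain dict from language tag to confidence. The name `apply_cumulative_cutoff` and the example values are hypothetical, and stopping once the running total reaches or exceeds the cutoff is one possible reading of the rule above.

```python
from __future__ import annotations


def apply_cumulative_cutoff(confidences: dict[str, float], cutoff: float = 0.99) -> dict[str, float]:
    """Keep the highest-confidence languages until their sum reaches `cutoff`."""
    # Rank real languages from highest to lowest confidence; "und" is never ranked.
    ranked = sorted(
        ((lang, conf) for lang, conf in confidences.items() if lang != "und"),
        key=lambda item: item[1],
        reverse=True,
    )
    kept: dict[str, float] = {}
    running_total = 0.0
    for lang, conf in ranked:
        if running_total >= cutoff:
            break  # the kept languages already cover the cutoff
        kept[lang] = conf
        running_total += conf
    # Everything not kept (including any existing "und" weight) merges into "und".
    remainder = sum(conf for lang, conf in confidences.items() if lang not in kept)
    if remainder > 0:
        kept["und"] = remainder
    return kept


# Hypothetical output with a noisy tail: the tail is merged into "und".
with_tail = {"en": 0.7, "fr": 0.25, "de": 0.045, "nl": 0.003, "da": 0.002}
print(apply_cumulative_cutoff(with_tail))

# Ten made-up languages in equal amounts: all of them stay in the output.
evenly_mixed = {f"lang{i}": 0.1 for i in range(10)}
print(apply_cumulative_cutoff(evenly_mixed))
```

In the evenly mixed case, a per-language cutoff above 0.1 would hide every language, while the cumulative rule keeps all ten.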

# Conclusion

I think fixed cumulative is simple enough to implement (sort, then accumulate). We still need to pick a cutoff. Dropping a tail of at most 1% (a cumulative cutoff of 0.99) seems reasonable: if the tail sums to less than 1%, it seems like it cannot be impactful.
