confidences of language detection #39

Closed
@fergald

Language detection returns a map of languages to confidences. There are many languages: Chrome's current language detector outputs about 120 entries in this map. Usually a small number of entries will have high confidence and most will have a confidence below 1%. The long tail is just noise. However, it is good to have the confidences sum to 1.0.

If we mask low-confidence languages, we should add their weight to the undefined language ("und") so the sum continues to be 1.0.

Approaches to cutting off the output:

Fixed cutoff

We set a fixed cutoff and hide any language whose confidence is below it. This could be problematic when a text is genuinely multilingual and contains many small portions of different languages: any language below the cutoff will not be mentioned, and it is possible for every language to be below the cutoff.
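
To make the failure mode concrete, here is a minimal TypeScript sketch of a fixed cutoff. The function name `applyFixedCutoff`, the `Confidences` type, and the plain-object representation of the map are assumptions made for illustration; none of them come from the API under discussion.

```typescript
// Hypothetical representation: language tag -> confidence, summing to ~1.0.
type Confidences = Record<string, number>;

// Hypothetical helper: keep languages at or above `cutoff`, and fold the
// weight of everything else (plus any existing "und") into "und".
function applyFixedCutoff(confidences: Confidences, cutoff: number): Confidences {
  const result: Confidences = {};
  let masked = 0;
  for (const language of Object.keys(confidences)) {
    const confidence = confidences[language];
    if (language !== "und" && confidence >= cutoff) {
      result[language] = confidence;
    } else {
      masked += confidence;
    }
  }
  if (masked > 0) {
    result["und"] = masked;
  }
  return result;
}

// One dominant language: "en" survives, the tail becomes "und".
const typical = applyFixedCutoff({ en: 0.97, fr: 0.02, de: 0.01 }, 0.05);

// Five equal languages with a cutoff of 0.25: every language is hidden
// and the whole weight ends up in "und".
const evenlyMixed = applyFixedCutoff(
  { en: 0.2, fr: 0.2, de: 0.2, es: 0.2, it: 0.2 },
  0.25,
);
```

In the second call the output contains only "und", even though the input carried a clear (if evenly split) signal.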

Fixed cumulative cutoff

We set a fixed cutoff and sum the weights from highest to lowest until the running total reaches or exceeds the cutoff. We merge all subsequent languages into "und". E.g. with a cumulative cutoff of 0.99, the returned languages make up at least 99% of the weight and the omitted languages make up at most 1%. If the text contains equal amounts of many different languages, all or most of them will be present in the output.
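
A minimal TypeScript sketch of the cumulative variant follows. As before, `applyCumulativeCutoff`, the `Confidences` type, and the plain-object representation are hypothetical names chosen for illustration, and treating an existing "und" entry as part of the merged remainder is an assumption.

```typescript
// Hypothetical representation: language tag -> confidence, summing to ~1.0.
type Confidences = Record<string, number>;

// Hypothetical helper: sort by confidence (descending), keep languages until
// the running total reaches `cumulativeCutoff`, and merge the rest into "und".
function applyCumulativeCutoff(
  confidences: Confidences,
  cumulativeCutoff: number,
): Confidences {
  const ranked = Object.keys(confidences)
    .filter((language) => language !== "und")
    .sort((a, b) => confidences[b] - confidences[a]);

  const result: Confidences = {};
  let kept = 0;
  // Assumption: an existing "und" entry always counts toward the remainder.
  let rest = confidences["und"] ?? 0;
  for (const language of ranked) {
    if (kept < cumulativeCutoff) {
      result[language] = confidences[language];
      kept += confidences[language];
    } else {
      rest += confidences[language];
    }
  }
  if (rest > 0) {
    result["und"] = rest;
  }
  return result;
}

// "en" and "fr" together reach 0.99, so "de" and "es" are merged into "und".
const skewed = applyCumulativeCutoff(
  { en: 0.6, fr: 0.395, de: 0.003, es: 0.002 },
  0.99,
);

// Four equal languages: all four are needed to reach 0.99, so none is hidden.
const evenlyMixed = applyCumulativeCutoff(
  { en: 0.25, fr: 0.25, de: 0.25, es: 0.25 },
  0.99,
);
```

Unlike the fixed cutoff, the evenly mixed input keeps every language here, because the cutoff applies to the accumulated weight rather than to each entry.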

Conclusion

I think fixed cumulative is simple enough to implement (sort, then accumulate). We still need to pick a cutoff. A tail of 1% (a cumulative cutoff of 0.99) seems reasonable: if the tail sums to less than 1%, it seems like it cannot be impactful.
