Use all char points when generating indexes instead of only the first #41

TonyStew · 2024-11-19T19:07:42Z

When generating the index for a given label, we currently only take into account the first code point of each char. This causes problems for chars that are multi-code-point sequences. For example, the common LGR currently provided by ICANN contains:

    <char cp="0073 0073" ref="118" comment="Sequence added for variant mapping">
      <var cp="00DF" type="blocked" ref="118" comment="IDNA2003 Compatibility" />
      <var cp="03B2" type="blocked" ref="118" />
    </char>

This is meant to mark "ß" as a variant of "ss", but because of the current index generation logic, the sequence "ss" in any label is reduced to a single "s" in the index, e.g. "sharpness" -> "sharpnes". I don't think this is intended behavior, and it will cause a lot of collisions.

This PR seems to resolve the issue for me ("sharpness" -> "sharpness", "teßt" -> "tesst"), but more rigorous testing may be needed.
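To illustrate the difference, here is a minimal Python sketch of the two behaviors. It is a simplified model written for this description, not the project's actual code: `REPERTOIRE`, `variant_set`, `segment`, `index_first_only`, and `index_all_code_points` are all hypothetical names, and the assumption that the index picks the smallest member of each variant set is mine.

```python
# Hypothetical, simplified model of index generation (not the project's code).
# Each repertoire entry maps a char (a tuple of code points) to its variants.
REPERTOIRE = {
    (0x0073, 0x0073): [(0x00DF,), (0x03B2,)],  # "ss" with variants U+00DF and U+03B2
}


def variant_set(char):
    """Return the char together with its variants, or just the char if unlisted."""
    for base, variants in REPERTOIRE.items():
        members = [base, *variants]
        if char in members:
            return members
    return [char]


def segment(label):
    """Greedily split a label into chars, preferring known sequences."""
    cps = tuple(ord(c) for c in label)
    sequences = sorted((k for k in REPERTOIRE if len(k) > 1), key=len, reverse=True)
    chars, i = [], 0
    while i < len(cps):
        for seq in sequences:
            if cps[i:i + len(seq)] == seq:
                chars.append(seq)
                i += len(seq)
                break
        else:
            chars.append((cps[i],))
            i += 1
    return chars


def index_first_only(label):
    """Old behavior: only the first code point of each candidate is considered."""
    out = [min(member[0] for member in variant_set(char)) for char in segment(label)]
    return "".join(chr(cp) for cp in out)


def index_all_code_points(label):
    """Proposed behavior: the whole code point sequence is compared and emitted."""
    out = []
    for char in segment(label):
        out.extend(min(variant_set(char)))  # tuples compare element by element
    return "".join(chr(cp) for cp in out)
```

In this model, `index_first_only("sharpness")` and `index_first_only("sharpnes")` both give `"sharpnes"`, which is the collision described above, while `index_all_code_points` keeps the two labels distinct and maps "teßt" to `"tesst"`.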

Use all char points when generating indexes instead of only the first

2751926
