Use all char points when generating indexes instead of only the first #41
When generating indexes for a given label, we currently take into account only the first code point of each char. This can cause issues when a char is a sequence of several code points. For example, the common LGR currently provided by ICANN contains:
This is meant to mark "ß" as a variant of "ss", but because of the current index generation logic, any label containing "ss" returns an index containing only a single "s", e.g. "sharpness" -> "sharpnes". I don't think this is intended behavior, and it will cause a lot of collisions.
This PR seems to resolve the issue for me ("sharpness" -> "sharpness", "teßt" -> "tesst"), but more rigorous testing may be needed.
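To make the difference concrete, here is a minimal, self-contained sketch of the two behaviors. It is not the library's actual code: the `VARIANTS` table, `split_label`, `index_first_cp_only`, and `index_all_cps` are hypothetical names, and it assumes the index is built by picking, for each char, the smallest entry among the char and its variants.

```python
# Hypothetical model: a "char" is a tuple of code points, mapped to its variants.
# This is an illustration only, not the library's real data structure.
VARIANTS = {
    (0x0073, 0x0073): [(0x00DF,)],  # "ss" has variant "ß"
    (0x00DF,): [(0x0073, 0x0073)],  # "ß" has variant "ss"
}


def split_label(label):
    """Split a label into chars, preferring the longest known sequence."""
    cps = [ord(c) for c in label]
    max_len = max(len(key) for key in VARIANTS)
    chars = []
    i = 0
    while i < len(cps):
        for n in range(max_len, 0, -1):
            candidate = tuple(cps[i:i + n])
            if len(candidate) == n and candidate in VARIANTS:
                chars.append(candidate)
                i += n
                break
        else:
            # No known sequence starts here: treat the code point as its own char.
            chars.append((cps[i],))
            i += 1
    return chars


def index_first_cp_only(label):
    """Old behavior: compare and keep only the first code point of each char."""
    out = []
    for char in split_label(label):
        candidates = [char] + VARIANTS.get(char, [])
        smallest = min(candidates, key=lambda c: c[0])
        out.append(smallest[0])  # the rest of the sequence is dropped here
    return "".join(map(chr, out))


def index_all_cps(label):
    """Proposed behavior: compare and keep the full code point sequence."""
    out = []
    for char in split_label(label):
        candidates = [char] + VARIANTS.get(char, [])
        out.extend(min(candidates))  # tuples compare element by element
    return "".join(map(chr, out))


print(index_first_cp_only("sharpness"))  # sharpnes
print(index_all_cps("sharpness"))        # sharpness
print(index_all_cps("teßt"))             # tesst
```

Under these assumptions, keeping the full sequence means "sharpness" is left intact while "teßt" still maps onto the "ss" form, so the two spellings of a variant pair share an index without shortening unrelated labels.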