Correct handling of context-sensitive unicode case folding #11

djudd · 2022-07-15T00:15:40Z

There are a couple of places where we ignore the fact that Unicode case folding can generate multiple characters, and I don't think there are any cases where we correctly handle context. rust-lang/rust#98490 would make this much easier.
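To make the two problems concrete, here is a minimal sketch using only the standard library. The helper names `upper_first` and `lower_per_char` are hypothetical and are not taken from this crate. The first shows a case mapping that expands to more than one character; the second shows how mapping one `char` at a time loses the context needed for the Greek final sigma.

```rust
/// Uppercase only the first character, keeping every character the
/// mapping yields (`char::to_uppercase` returns an iterator, not a char).
/// Hypothetical helper, for illustration only.
fn upper_first(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        // U+00DF (sharp s) uppercases to the two characters "SS".
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Lowercase one char at a time. Each char is mapped in isolation,
/// so the result cannot depend on neighbouring characters.
/// Hypothetical helper, for illustration only.
fn lower_per_char(s: &str) -> String {
    s.chars().flat_map(char::to_lowercase).collect()
}
```

`str::to_lowercase` sees the whole string and maps a word-final capital sigma (U+03A3) to the final form U+03C2, while `lower_per_char` always produces U+03C3. Code that takes only `first.to_uppercase().next()` would likewise drop the second `S` of the sharp-s expansion.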

djudd · 2022-07-15T02:34:52Z

A little more research suggests what we actually want is case folding, e.g. via https://github.com/unicode-rs/rust-caseless. But this might be slower (particularly given the need for a different normalization), and it's not clear whether it really matters, given that we're often running against already-transliterated data or are only interested in an initial.
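As a rough illustration of why case folding differs from lowercasing when comparing strings, the sketch below uses a deliberately tiny, hypothetical fold function (`toy_case_fold`) that special-cases only three code points. It is not the API of the caseless crate, and a real implementation would use the full Unicode CaseFolding.txt table and, for canonical matching, normalize as well.

```rust
/// Hypothetical, deliberately incomplete stand-in for Unicode full case
/// folding. Only three special cases are handled; everything else falls
/// back to the per-char lowercase mapping.
fn toy_case_fold(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // U+00DF (sharp s) and U+1E9E (capital sharp s) fold to "ss".
            '\u{DF}' | '\u{1E9E}' => out.push_str("ss"),
            // U+03C2 (final sigma) folds to U+03C3 (sigma).
            '\u{3C2}' => out.push('\u{3C3}'),
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}

/// Comparison by lowercasing both sides.
fn lowercase_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Comparison by folding both sides with the toy table above.
fn toy_caseless_eq(a: &str, b: &str) -> bool {
    toy_case_fold(a) == toy_case_fold(b)
}
```

With these definitions, `lowercase_eq("Straße", "STRASSE")` is false because lowercasing leaves the sharp s untouched, whereas `toy_caseless_eq` treats the two spellings as equal. Folding is context-free, so it also sidesteps the final-sigma question for comparisons, though not for display casing.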

djudd · 2022-07-18T12:38:35Z

#20 makes us consistently respect the possibility of multiple characters for a case mapping, but I'll leave this open to handle context.

djudd mentioned this issue Jul 18, 2022

Improve correctness of titlecasing & case folding #20

Merged
