Add support for non-ASCII digits #74

Closed
siegfriedpammer opened this issue Apr 10, 2024 · 4 comments

@siegfriedpammer

: c is >= '0' and <= '9'
? TokenDigits
: TokenOther;

There are many more Unicode code points that can be used as digits, as can be seen here: https://www.compart.com/en/unicode/category/Nd. Each of these has a numeric value assigned, for example https://www.compart.com/en/unicode/U+0A68 (which has the value 2).

I suggest using char.IsDigit instead to handle this correctly.
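
To illustrate the difference, here is a minimal sketch that puts the ASCII-only check next to `char.IsDigit`. The class and method names are made up for this example and are not part of NaturalSort.Extension. Note that `char.IsDigit(char)` looks at a single UTF-16 code unit, so Nd digits outside the Basic Multilingual Plane would need separate handling.

```csharp
using System;

// Hypothetical helper for illustration only; not the library's code.
public static class DigitCheckSketch
{
    // The ASCII-only test from the snippet quoted above.
    public static bool IsAsciiDigit(char c) => c is >= '0' and <= '9';

    // The suggested replacement: true for any char in Unicode category Nd.
    public static bool IsUnicodeDigit(char c) => char.IsDigit(c);

    // Numeric value (0-9) of a char for which IsUnicodeDigit is true.
    // char.GetNumericValue returns a double, so it is cast to int here.
    public static int NumericValue(char c) => (int)char.GetNumericValue(c);
}
```

For `'\u0A68'` (੨), `IsAsciiDigit` returns `false`, `IsUnicodeDigit` returns `true`, and `NumericValue` returns `2`.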

See https://github.com/christophwille/poc-oh/blob/main/src/NaturalSortTests/Program.cs for a comparison with StrCmpLogicalW:

Input: A, A10, A11, Z, A੨, A੨੨
NaturalSort.Extensions: A, A੨, A੨੨, A10, A11, Z
StrCmpLogicalW: A, A੨, A10, A11, A੨੨, Z

The sort order of StrCmpLogicalW makes perfect sense if you replace ੨ with 2.

@tompazourek
Owner

tompazourek commented Apr 11, 2024

Good point, thanks for contributing.

If I use these Unicode digits, I'll need to find some way to compare the string segments that are composed of Unicode digits, essentially "parsing" the Unicode digits into numbers and comparing them. Currently, because I only consider 0-9, the comparison is trivial and fast, and no number parsing occurs at all. I'm not sure if the current simple comparison of digit values would work well enough, but I suppose it might work better than just treating Unicode digits as "other characters".
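
One possible way to compare such segments without parsing them into an integer type is sketched below: skip leading zeros, compare the remaining lengths, then compare digit values position by position. This is only a sketch under assumptions, not the code that ended up in the library. The names `DigitRunComparerSketch`, `CompareDigitRuns`, and `SkipLeadingZeros` are hypothetical, and every char in both inputs is assumed to satisfy `char.IsDigit`.

```csharp
using System;

// Hypothetical sketch; not the implementation used by NaturalSort.Extension.
public static class DigitRunComparerSketch
{
    // Compares two runs of decimal digits by numeric value without parsing
    // them into an integer type, so very long runs cannot overflow.
    // Assumes every char in both strings satisfies char.IsDigit.
    public static int CompareDigitRuns(string left, string right)
    {
        int leftStart = SkipLeadingZeros(left);
        int rightStart = SkipLeadingZeros(right);
        int leftLength = left.Length - leftStart;
        int rightLength = right.Length - rightStart;

        // With leading zeros skipped, the longer run is the larger number.
        if (leftLength != rightLength)
            return leftLength.CompareTo(rightLength);

        // Same length: the first position with differing digit values decides.
        for (int i = 0; i < leftLength; i++)
        {
            int l = (int)char.GetNumericValue(left[leftStart + i]);
            int r = (int)char.GetNumericValue(right[rightStart + i]);
            if (l != r)
                return l.CompareTo(r);
        }

        return 0;
    }

    // Index of the first char that is not a leading zero; the last char is
    // always kept so that a run of only zeros still compares as "0".
    private static int SkipLeadingZeros(string run)
    {
        int start = 0;
        while (start < run.Length - 1 && char.GetNumericValue(run[start]) == 0)
            start++;
        return start;
    }
}
```

With this sketch, `CompareDigitRuns("੨", "10")` is negative and `CompareDigitRuns("੨੨", "11")` is positive, which matches the StrCmpLogicalW order quoted in the issue. Runs with equal numeric value, such as `"2"` and `"੨"`, compare as equal here, so a real comparer would still need a tie-breaker.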

I like your example comparing results to StrCmpLogicalW. I think these sorts of comparisons would be useful to add into tests.

@tompazourek
Owner

I see that the Windows comparison treats ੨ as something between 2 and 3. It would be interesting to find a simple mechanism that lets me do the same thing fast:

A
A2
A੨
A3
A10
A11
A22
A੨੨
A33
Z
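
One mechanism that reproduces this particular order is to compare digit runs by numeric value first and fall back to an ordinal comparison of the raw chars only when the values are equal. The sketch below does that. It is an assumption on my part, not a description of what StrCmpLogicalW does internally, and the names `DigitTieBreakSketch`, `ValueOf`, and `Compare` are hypothetical. `BigInteger` is used for clarity rather than speed, and every char is assumed to satisfy `char.IsDigit`.

```csharp
using System;
using System.Numerics;

// Hypothetical sketch; not taken from StrCmpLogicalW or NaturalSort.Extension.
public static class DigitTieBreakSketch
{
    // Numeric value of a digit run. BigInteger avoids overflow on long runs.
    // Assumes every char in the run satisfies char.IsDigit.
    private static BigInteger ValueOf(string run)
    {
        BigInteger value = BigInteger.Zero;
        foreach (char c in run)
            value = value * 10 + (int)char.GetNumericValue(c);
        return value;
    }

    // Orders by numeric value first; runs with equal value are ordered by
    // their UTF-16 code units, which places ASCII digits before other scripts.
    public static int Compare(string leftRun, string rightRun)
    {
        int byValue = ValueOf(leftRun).CompareTo(ValueOf(rightRun));
        if (byValue != 0)
            return byValue;

        return string.CompareOrdinal(leftRun, rightRun);
    }
}
```

Applied to the digit parts of the list above, this puts `2` before `੨` before `3`, and `22` before `੨੨` before `33`.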

tompazourek changed the title from "GetTokenFromChar does not support non-ASCII digits" to "Add support for non-ASCII digits" on Apr 11, 2024

@christophwille

> I like your example comparing results to StrCmpLogicalW. I think these sorts of comparisons would be useful to add into tests.

Feel free to do so. The code is from https://github.com/icsharpcode/ILSpy/blob/master/ILSpy/TreeNodes/NaturalStringComparer.cs; we were looking for options to no longer use a native import. That is when we were like "Wait, Unicode is more than 0-9".

@tompazourek
Owner

tompazourek commented Apr 14, 2024

This is now implemented in c303896.

It is released as version 4.3.0 (https://github.com/tompazourek/NaturalSort.Extension/releases/tag/4.3.0)

In case you find discrepancies, please file new issues.

Thank you again for contributing this idea; it wouldn't have happened without you.
