UTF-8 combining characters and normalization in reverse() #6165
I think we decided not to normalize strings by default. Strings are user data. |
Yeah, but the problem is how should be string operations like |
I know reversing codepoints is not what you always want, but at least it's a primitive. Having only a |
I suspect that doing NFC normalization during reversal is the right default. There can be a |
+1 |
We might want UnicodeNormalization objects with NFC <: UnicodeNormalization, etc. That way we could pass the normalization as an argument and dispatch on it, which could be quite handy. |
What is the use case for reversing strings? The intended application should determine functionality here. If you want to reverse graphemes, utf8proc can identify the graphemes for you. (Normalization won't necessarily eliminate all combining characters.) Note that we already have Unicode case folding. |
That's the problem – we don't know, since we're writing a standard library, not an application. You're right that grapheme reversal is probably the better way to do this. It would be good to come up with a reverse function that generally does what you want. It should generally operate at the level of graphemes, but cases like ligatures would, ideally, be split before reversing. Ligature splitting would imply that |
Silly question: if the |
Very possible. Maybe |
@StefanKarpinski, even library writers should have some kind of practical application in mind. Otherwise, why include a |
Largely because every language standard library seems to have one. I'm not entirely sure what it's useful for except for cute examples with sorting dictionary words. If someone has real world examples of using the |
I agree it's not so important. One reason to have it is that reversing UTF-8 code points efficiently is difficult, so on the off chance you need this it's good to have built in. I prefer not to guess what people want, and instead do something primitive so that you can understand what it does and build what you want on top. |
See e.g. this discussion on supposed uses for string reversal. It is telling that all of the answers seem to be bogus (there always seems to be a much better way to solve the problem that string reversal supposedly solves). My favorite answer is:
I vote that we just remove |
This function is probably most useful as a showcase of how smart Julia is when it comes to handling Unicode strings. Not sure that justifies its existence... ;-) |
I'm not sure whether including a useless function shows off how smart we are... |
It's possible that |
We could also have a reverse iteration protocol, which strings already kind of have with prevind. |
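A minimal sketch of what walking a string backwards with `prevind` can look like, assuming a recent Julia where `firstindex`, `lastindex` and `prevind` are available in Base. The helper name `chars_backward` is hypothetical and only for illustration; it visits code points, not graphemes.

```julia
# Hypothetical helper: collect the code points of `s` from last to first
# without building a reversed copy of the string first.
function chars_backward(s::AbstractString)
    out = Char[]
    i = lastindex(s)              # index of the last code point (0 for "")
    while i >= firstindex(s)
        push!(out, s[i])
        i = prevind(s, i)         # step back to the previous valid index
    end
    return out
end

backward = String(chars_backward("noël"))   # precomposed ë, a single code point
```

Because this steps by code point, a combining mark would still be visited before the base character it modifies.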
Adding that (maybe |
So shall we deprecate |
I actually just used it (related: #6276). |
Yeah, I really don't think that removing the |
@carlobaldassi Wouldn't it be better to use reverse search rather than reversing the whole string just to match a few characters? |
It would but reverse string search has just as many difficulties as reversing a string does if you think about it. And PCRE doesn't support it.
Well, if I want to find the last occurrence of |
It could almost be reworked fairly easily to use In any case, I was so annoyed by the experience that I started writing a |
@nalimilan see also #6276 |
Are you looking for |
@carlobaldassi Yeah, but your solution isn't perfect either (is @StefanKarpinski The point is, you don't need to reverse any of them to find the last occurrence of |
I think that @StefanKarpinski's proposal of an argument to
Indeed in this case it would, it'd just be more verbose (not a particularly compelling argument for keeping |
With #9261, (The |
Perhaps grapheme reversal is the correct default behavior. |
If the application of |
Since we have two possible behaviors, we should document which we have picked. |
@stevengj You scared me! This indeed sounds like a very cool feature, and I agree with @StefanKarpinski that it should be the default. I think it would be more intuitive, as it is based on what the standard reference you quoted (http://www.unicode.org/reports/tr29/) calls "user-perceived characters". On the contrary, the current behavior only makes sense when using regexps, and it requires a deep understanding of Unicode. I'd actually argue that iterating over strings should also go over graphemes, and that calling
# ë as a single codepoint, OK
julia> [c for c in "noël"]
4-element Array{Char,1}:
'n'
'o'
'ë'
'l'
julia> join(reverse([c for c in "noël"]))
"lëon"
julia> reverse("noël")
"lëon"
# ë as two codepoints, weird
julia> [c for c in "noe\u0308l"]
5-element Array{Char,1}:
'n'
'o'
'e'
'̈'
'l'
# These really don't make any sense to 80% of users IMHO
julia> join(reverse([c for c in "noe\u0308l"]))
"l̈eon"
julia> reverse("noe\u0308l")
"l̈eon"
I'd rather offer a function to reverse regex matching (as I suggested at #9249 (comment)), and make |
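For comparison, here is a hedged sketch of the two behaviors side by side, assuming a recent Julia where `graphemes` is provided by the `Unicode` standard library. `reverse_graphemes` is a hypothetical helper name, not a function in Base.

```julia
using Unicode  # standard library; provides `graphemes`

# Hypothetical helper: reverse by user-perceived characters (grapheme
# clusters) instead of by code points.
reverse_graphemes(s::AbstractString) = join(reverse(collect(graphemes(s))))

decomposed = "noe\u0308l"   # 'e' followed by U+0308 COMBINING DIAERESIS

by_codepoint = reverse(decomposed)            # diaeresis ends up after 'l'
by_grapheme  = reverse_graphemes(decomposed)  # diaeresis stays attached to 'e'
```

With code point reversal the combining mark migrates to the `l`; with grapheme reversal the `e` plus its diaeresis is moved as one unit.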
@nalimilan, making iteration go over graphemes by default would be a huge change because graphemes are substrings, not codepoints ( I continue to think that the behavior of any function, A reverse-regex function should be discussed in #6276. My preference would be to simply document this in |
Ah, right. Unicode is so complex. Maybe we just need some amount of normalization when iterating and reversing, so that they give the same result for both I don't really like the idea that |
Problems with just exposing a function for
Normalizing on iteration (and reversal) would (a) greatly slow down those functions, (b) involve more memory allocation, and (c) wouldn't fix the problem anyway because NFC eliminates some but not all combining characters. |
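Point (c) can be illustrated with a small sketch, assuming a recent Julia where normalization is exposed as `Unicode.normalize` (the function had a different name when this thread was written). The choice of `q` as a base letter is an assumption for illustration: to my knowledge Unicode has no precomposed "q with diaeresis", so NFC has nothing to compose it into.

```julia
using Unicode  # standard library; provides `Unicode.normalize`

composable   = "e\u0308"   # has a precomposed form, U+00EB (ë)
uncomposable = "q\u0308"   # assumed to have no precomposed form

nfc_composable   = Unicode.normalize(composable, :NFC)
nfc_uncomposable = Unicode.normalize(uncomposable, :NFC)

# Count code points after normalization.
n_composable   = length(nfc_composable)     # combining mark folded away
n_uncomposable = length(nfc_uncomposable)   # combining mark still present
```

So a code point reversal of the second string would still detach the diaeresis even after NFC.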
@stevengj: "I vote that we just remove reverse(::String) function from Base". I may not have thought this through, but doesn't reverse imply that indexing must be well defined (and it isn't; see #9297 (comment))? If/when indexing is gone and you can reintroduce it explicitly, saying you want codepoint indexing or whatever (byte, grapheme), does reverse become trivial then? |
Add the breaking label? It would be appropriate if we change from the codepoint meaning (or just drop the function). It seems I do not have the rights to add it myself. |
I think this issue can be closed. Per the above discussion, it seems to have become clear that we need a No one has come up with any practical use for reversing graphemes. This is possible in Julia as noted above, via the |
👏 Good to see this closed! |
(it appears that Github also doesn't like them, examples taken from here)
Backspace support is also broken on the REPL: if you press backspace twice from "noël", it shows you "no", but actually represents "noe".
What should be our approach in this case? NFKC does solve both issues above, but I'm not sure how good it is to normalize every string before operating on them:
(related to #5434, #5903)