Improve handling of invalid UTF-8 #11968
But this loses information. Is this really a good idea? If I pass a list of bytes to some subsystem that assumes Unicode, losing this information seems like a very bad idea.
If the subsystem intends to interpret the list of bytes as Unicode code points encoded with UTF-8, then no interpretable information is lost when invalid bytes are replaced with U+FFFD. Note that the unicode module already does this non-reversible replacement in
Meh, ok.
This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.
The primary goal of this PR is to fix the handling of invalid UTF-8 in the unicode module, but it also contains some other minor changes. The diff is pretty big so I've tried to list the changes in the PR description. Although this PR contains breaking changes, the behavior will be the same for most cases. All the old tests for the unicode module still pass. I still need to write some tests and documentation, but I'm opening a draft PR early so I don't forget about this and in case someone wants to do a review.
Breaking changes:
- Rune is now distinct range[0..0x10FFFF]
- Invalid UTF-8 is treated as the replacement rune (0xFFFD). This change affects every proc in the unicode module that takes a string as input.
- … and strip that don't process the entire string will still output invalid UTF-8 if they receive it.
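To make the replacement behavior concrete, here is a minimal sketch of decoding with substitution. It is not the implementation from this PR: `decodeAt` and `decodeLossy` are hypothetical names, code points are plain `int`s instead of `Rune`, and consuming exactly one byte per invalid sequence is an assumption made for the example.

```nim
const ReplacementCodePoint = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER

proc decodeAt(s: string, i: Natural): tuple[codePoint: int, size: int] =
  ## Decodes the UTF-8 sequence starting at byte `i` of `s`.
  ## Assumption: an invalid sequence yields U+FFFD and consumes one byte.
  let b0 = ord(s[i])
  if b0 < 0x80:
    return (b0, 1)                    # ASCII
  var extra = 0                       # number of continuation bytes expected
  var cp = 0
  var minCp = 0                       # smallest code point allowed at this length
  if b0 >= 0xC2 and b0 <= 0xDF:
    extra = 1
    cp = b0 and 0x1F
    minCp = 0x80
  elif b0 >= 0xE0 and b0 <= 0xEF:
    extra = 2
    cp = b0 and 0x0F
    minCp = 0x800
  elif b0 >= 0xF0 and b0 <= 0xF4:
    extra = 3
    cp = b0 and 0x07
    minCp = 0x10000
  else:
    return (ReplacementCodePoint, 1)  # stray continuation or invalid lead byte
  if i + extra >= s.len:
    return (ReplacementCodePoint, 1)  # sequence is cut off by the end of the string
  for k in 1 .. extra:
    let b = ord(s[i + k])
    if (b and 0xC0) != 0x80:
      return (ReplacementCodePoint, 1)  # not a continuation byte
    cp = (cp shl 6) or (b and 0x3F)
  # Reject overlong encodings, surrogates, and values above U+10FFFF.
  if cp < minCp or cp > 0x10FFFF or (cp >= 0xD800 and cp <= 0xDFFF):
    return (ReplacementCodePoint, 1)
  result = (cp, extra + 1)

proc decodeLossy(s: string): seq[int] =
  ## Decodes the whole string, substituting U+FFFD for invalid input.
  var i = 0
  while i < s.len:
    let (cp, size) = decodeAt(s, i)
    result.add cp
    inc i, size

# Valid input decodes as usual.
doAssert decodeLossy("a\xC3\xA9") == @[0x61, 0xE9]
# Two different invalid inputs decode to the same result, so the original
# bytes cannot be recovered from the output.
doAssert decodeLossy("a\xFFb") == @[0x61, 0xFFFD, 0x62]
doAssert decodeLossy("a\xFEb") == @[0x61, 0xFFFD, 0x62]
```

The last two assertions show the trade-off raised in the discussion: the replacement is not reversible, but every value the decoder returns is a valid code point.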
New exported symbols:
- `<`(Rune, Rune): bool
- `<=`(Rune, Rune): bool
- runeAndSizeAt(string, Natural): (Rune, int)
- const ReplacementRune* = Rune(0xFFFD)
- runeSizeAt
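As a rough illustration of the tuple-returning shape of `runeAndSizeAt`, the sketch below builds a similar helper from procs that already exist in `std/unicode`. The helper name `runeAndSize` is hypothetical, and its behavior on invalid UTF-8 is whatever `runeAt` and `runeLenAt` do today, not what this PR specifies.

```nim
import std/unicode

proc runeAndSize(s: string, i: Natural): (Rune, int) =
  ## Hypothetical stand-in: returns the rune at byte `i` and its byte length.
  (s.runeAt(i), s.runeLenAt(i))

proc codePoints(s: string): seq[int] =
  ## Walks the string one rune at a time, advancing by the returned size.
  var i = 0
  while i < s.len:
    let (r, size) = runeAndSize(s, i)
    result.add int(int32(r))
    inc i, size

# "h", then U+00E9 encoded as two bytes, then "llo".
doAssert codePoints("h\xC3\xA9llo") == @[0x68, 0xE9, 0x6C, 0x6C, 0x6F]
```

Returning the rune and its size together lets the caller advance the index without a `var` out-parameter, which is how `fastRuneAt` does it.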
Deprecated symbols:
- Rune16
- `<%`(Rune, Rune) (replaced by <)
- `<=%`(Rune, Rune) (replaced by <=)
- fastRuneAt (replaced by runeAndSizeAt)
- fastToUtf8Copy (replaced by add)
- runeLenAt (replaced by runeSizeAt)
TODO
Fixes nim-lang/RFCs#151
Fixes #10750