Improve handling of invalid UTF-8 #11968

GULPF · 2019-08-17T09:02:06Z

The primary goal of this PR is to fix the handling of invalid UTF-8 in the unicode module, but it also contains some other minor changes. The diff is pretty big so I've tried to list the changes in the PR description. Although this PR contains breaking changes, the behavior will be the same for most cases. All the old tests for the unicode module still pass. I still need to write some tests and documentation, but I'm opening a draft PR early so I don't forget about this and in case someone wants to do a review.

Breaking changes:

Rune is now distinct range[0..0x10FFFF]
Invalid UTF-8 input is now properly handled, with every invalid byte being
treated as the replacement rune (0xFFFD). This change affect every proc
in the unicode module that takes a string as input.
Invalid UTF-8 is never generated anymore. However, procs like align
and strip that doesn't process the entire string will still output
invalid UTF-8 if they received it.

New exported symbols:

`<`(Rune, Rune): bool
`<=`(Rune, Rune): bool
runeAndSizeAt(string, Natural): (Rune, int)
const ReplacementRune* = Rune(0xFFFD)
runeSizeAt

Deprecated symbols:

Rune16
`<%`(Rune, Rune) (replaced by <)
`<=%`(Rune, Rune) (replaced by <=)
fastRuneAt (replaced by runeAndSizeAt)
fastToUtf8Copy (replaced by add)
runeLenAt (replaced by runeSizeAt)

TODO

Tests for invalid UTF-8
Documentation

Fixes nim-lang/RFCs#151
Fixes #10750

Araq · 2019-08-17T09:32:23Z

Invalid UTF-8 input is now properly handled, with every invalid byte being
treated as the replacement rune (0xFFFD). This change affect every proc
in the unicode module that takes a string as input.

But this loses information, is this really a good idea? If I pass a list of bytes to some subsystem that assumes Unicode losing this information seems like a very bad idea.

GULPF · 2019-08-17T10:59:45Z

But this loses information, is this really a good idea? If I pass a list of bytes to some subsystem that assumes Unicode losing this information seems like a very bad idea.

If the subsystem intends to interpret the list of bytes as Unicode code points encoded with UTF-8, then no interpretable information is lost when invalid bytes are replaced with U+FFFD. The unicode module currently treats UTF-8 as a way to encode arbitrary int32 values, which is silly.

Note that the unicode module already does this non-reversible replacement in fastRuneAt (although it ignores most cases). Meanwhile, other procs like runeLenAt performs no input validation at all. This PR standardizes the behavior so that the entire module behaves as recommended by the Unicode conformance document:

For example, in UTF-8 every code unit of the form 110xxxx must be followed by a code unit of the form 10xxxxxx. A sequence such as 110xxxxx 0xxxxxxx is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit 110xxxxx as an illegally terminated code unit sequence—for example, by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character.

Araq · 2019-08-17T17:30:42Z

Note that the unicode module already does this non-reversible replacement in fastRuneAt (although it ignores most cases).

Meh, ok.

stale · 2020-08-31T19:06:10Z

This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.

GULPF added 6 commits August 18, 2019 08:56

Improve handling of invalid UTF-8

861c460

toUTF8->toUtf8

5752ce8

Add proc sanitizeUtf8

26392ce

Readd procs removed by accident

a7cd922

Fix failing runnable example

93f450a

Minor changes

19038ef

GULPF force-pushed the new-unicode branch from d6e8ac4 to 19038ef Compare September 1, 2019 18:16

stale bot added the stale Staled PR/issues; remove the label after fixing them label Aug 31, 2020

stale bot closed this Sep 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of invalid UTF-8 #11968

Improve handling of invalid UTF-8 #11968

GULPF commented Aug 17, 2019 •

edited

Loading

Araq commented Aug 17, 2019

GULPF commented Aug 17, 2019

Araq commented Aug 17, 2019

stale bot commented Aug 31, 2020

Improve handling of invalid UTF-8 #11968

Improve handling of invalid UTF-8 #11968

Conversation

GULPF commented Aug 17, 2019 • edited Loading

Breaking changes:

New exported symbols:

Deprecated symbols:

TODO

Araq commented Aug 17, 2019

GULPF commented Aug 17, 2019

Araq commented Aug 17, 2019

stale bot commented Aug 31, 2020

GULPF commented Aug 17, 2019 •

edited

Loading