Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of invalid UTF-8 #11968

Closed
wants to merge 6 commits into from
Closed

Conversation

GULPF
Copy link
Member

@GULPF GULPF commented Aug 17, 2019

The primary goal of this PR is to fix the handling of invalid UTF-8 in the unicode module, but it also contains some other minor changes. The diff is pretty big so I've tried to list the changes in the PR description. Although this PR contains breaking changes, the behavior will be the same for most cases. All the old tests for the unicode module still pass. I still need to write some tests and documentation, but I'm opening a draft PR early so I don't forget about this and in case someone wants to do a review.

Breaking changes:

  • Rune is now distinct range[0..0x10FFFF]
  • Invalid UTF-8 input is now properly handled, with every invalid byte being
    treated as the replacement rune (0xFFFD). This change affect every proc
    in the unicode module that takes a string as input.
  • Invalid UTF-8 is never generated anymore. However, procs like align
    and strip that doesn't process the entire string will still output
    invalid UTF-8 if they received it.

New exported symbols:

  • `<`(Rune, Rune): bool
  • `<=`(Rune, Rune): bool
  • runeAndSizeAt(string, Natural): (Rune, int)
  • const ReplacementRune* = Rune(0xFFFD)
  • runeSizeAt

Deprecated symbols:

  • Rune16
  • `<%`(Rune, Rune) (replaced by <)
  • `<=%`(Rune, Rune) (replaced by <=)
  • fastRuneAt (replaced by runeAndSizeAt)
  • fastToUtf8Copy (replaced by add)
  • runeLenAt (replaced by runeSizeAt)

TODO

  • Tests for invalid UTF-8
  • Documentation

Fixes nim-lang/RFCs#151
Fixes #10750

@Araq
Copy link
Member

Araq commented Aug 17, 2019

Invalid UTF-8 input is now properly handled, with every invalid byte being
treated as the replacement rune (0xFFFD). This change affect every proc
in the unicode module that takes a string as input.

But this loses information, is this really a good idea? If I pass a list of bytes to some subsystem that assumes Unicode losing this information seems like a very bad idea.

@GULPF
Copy link
Member Author

GULPF commented Aug 17, 2019

But this loses information, is this really a good idea? If I pass a list of bytes to some subsystem that assumes Unicode losing this information seems like a very bad idea.

If the subsystem intends to interpret the list of bytes as Unicode code points encoded with UTF-8, then no interpretable information is lost when invalid bytes are replaced with U+FFFD. The unicode module currently treats UTF-8 as a way to encode arbitrary int32 values, which is silly.

Note that the unicode module already does this non-reversible replacement in fastRuneAt (although it ignores most cases). Meanwhile, other procs like runeLenAt performs no input validation at all. This PR standardizes the behavior so that the entire module behaves as recommended by the Unicode conformance document:

For example, in UTF-8 every code unit of the form 110xxxx must be followed by a code unit of the form 10xxxxxx. A sequence such as 110xxxxx 0xxxxxxx is ill-formed and must never be generated. When faced with this ill-formed code unit sequence while transforming or interpreting text, a conformant process must treat the first code unit 110xxxxx as an illegally terminated code unit sequence—for example, by signaling an error, filtering the code unit out, or representing the code unit with a marker such as U+FFFD replacement character.

@Araq
Copy link
Member

Araq commented Aug 17, 2019

Note that the unicode module already does this non-reversible replacement in fastRuneAt (although it ignores most cases).

Meh, ok.

@stale
Copy link

stale bot commented Aug 31, 2020

This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.

@stale stale bot added the stale Staled PR/issues; remove the label after fixing them label Aug 31, 2020
@stale stale bot closed this Sep 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Staled PR/issues; remove the label after fixing them
Projects
None yet
2 participants