overlong utf-8 sequences should be treated as invalid utf-8 #3787

Closed
Blub opened this issue Oct 16, 2012 · 4 comments
Labels
A-Unicode Area: Unicode

Comments

@Blub

Blub commented Oct 16, 2012

The current UTF-8 implementation asserts that the bytes it receives are valid UTF-8. Specifically, passing a sequence such as "\x80\xae" to a string function fails with Assertion is_utf8(v) failed.
However, overlong encodings are accepted without any such error. A sequence such as "\xC0\xAE" (an overlong encoding of \x2E, a dot) is accepted and appears in the resulting Rust string.
This raises security concerns, as described in RFC 3629, Section 10:
https://tools.ietf.org/html/rfc3629#section-10

Short example: when a program lets a user access files but wants to block "../", it must not be possible to circumvent that check with an overlong encoding of a dot, and the author of the program shouldn't have to rely on the OS to perform such a check either.

fn main() {
    // overlong dot, should be invalid but is accepted
    let s1 = str::from_bytes([0xc0 as u8, 0xae as u8]);
    io::println(fmt!("len: %u, chars: %u, value: %s", s1.len(), str::char_len(s1), s1));
    // regular invalid utf, triggering an assertion fail
    let s2 = str::from_bytes([0x80 as u8, 0xae as u8]);
    io::println(fmt!("len: %u, chars: %u, value: %s", s2.len(), str::char_len(s2), s2));
}
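
To make the concern concrete, here is a minimal sketch in current Rust (not the 2012 API used above). Both `naive_decode_two_byte` and `contains_dot_dot_slash` are hypothetical helpers written for illustration; they are not taken from the Rust standard library or from any real program. The sketch assumes a filter that scans raw bytes for "../" and a later stage that decodes 2-byte sequences without checking for overlong forms.

```rust
// Hypothetical, deliberately lenient decoder for 2-byte sequences.
// It checks only the bit patterns (110xxxxx 10xxxxxx) and does NOT
// reject results below 0x80, which is what makes overlong forms pass.
fn naive_decode_two_byte(b0: u8, b1: u8) -> Option<u32> {
    if b0 & 0xE0 == 0xC0 && b1 & 0xC0 == 0x80 {
        Some((((b0 & 0x1F) as u32) << 6) | ((b1 & 0x3F) as u32))
    } else {
        None
    }
}

// Hypothetical byte-level filter that looks only for a literal "../".
fn contains_dot_dot_slash(bytes: &[u8]) -> bool {
    bytes.windows(3).any(|w| w == &b"../"[..])
}

fn main() {
    // Two overlong dots followed by a slash.
    let path = [0xC0u8, 0xAE, 0xC0, 0xAE, b'/'];

    // The byte-level filter sees no "../" in the raw bytes...
    assert!(!contains_dot_dot_slash(&path));

    // ...yet the lenient decoder turns each pair into U+002E ('.').
    assert_eq!(naive_decode_two_byte(path[0], path[1]), Some(0x2E));
    assert_eq!(naive_decode_two_byte(path[2], path[3]), Some(0x2E));
}
```

Rejecting any 2-byte sequence whose decoded value is below 0x80 closes this particular gap, which is why the check belongs in the UTF-8 validation itself rather than in each caller.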
@vertexclique
Member

This also fails on Mac OS X 10.8.2 when running make check-stage2-rpass TESTNAME=*. That command executes ./x86_64-apple-darwin/stage2/bin/compiletest, which throws Assertion is_utf8(v) failed.

Changes made in Eclipse or an Eclipse-like IDE are saved in the OS's current charset rather than UTF-8. This causes the error above.

@lilyball
Contributor

There are some other invalid UTF-8 sequences that the is_utf8 check doesn't handle, such as surrogate characters (which are invalid in UTF-8) or codepoints over U+10FFFF.
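
The three classes mentioned in this thread (overlong encodings, surrogates, and code points above U+10FFFF) can all be expressed as checks on the decoded value and the sequence length. The following is only a sketch in current Rust; `is_canonical_scalar` is a hypothetical name and is not how is_utf8 was implemented.

```rust
/// Hypothetical helper: returns true if `cp`, decoded from a UTF-8
/// sequence of `len` bytes, is a canonically encoded Unicode scalar value.
fn is_canonical_scalar(cp: u32, len: usize) -> bool {
    // Smallest code point that genuinely needs `len` bytes;
    // anything smaller is an overlong encoding.
    let min = match len {
        1 => 0x0,
        2 => 0x80,
        3 => 0x800,
        4 => 0x1_0000,
        _ => return false, // 5- and 6-byte forms are not valid UTF-8
    };
    let is_surrogate = (0xD800..=0xDFFF).contains(&cp);
    cp >= min && cp <= 0x10_FFFF && !is_surrogate
}

fn main() {
    assert!(is_canonical_scalar(0x2E, 1)); // '.' in one byte
    assert!(!is_canonical_scalar(0x2E, 2)); // overlong '.'
    assert!(!is_canonical_scalar(0xD800, 3)); // surrogate
    assert!(!is_canonical_scalar(0x11_0000, 4)); // above U+10FFFF
}
```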

bors added a commit that referenced this issue Jul 30, 2013
Fix is_utf8 and UTF-8 char width functions to deny non-canonical 'overlong encodings' in UTF-8.

We address the function is_utf8 to make it more strict and correct, but no changes are made to the handling of invalid UTF-8.

Fixes issue #3787
@bluss
Member

bluss commented Jul 30, 2013

Overlong encodings are fixed; surrogate characters remain. Anything else?

@thestinger
Contributor

This is fixed, and surrogate characters are issue #8319.
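
For readers on a current Rust toolchain, the byte sequences discussed in this thread can be checked with `std::str::from_utf8`, which returns an error for invalid input instead of asserting. This is a sketch against today's standard library, not the 2012-2013 API shown in the original report; `is_valid` is a hypothetical wrapper added only to keep the example short.

```rust
use std::str;

// Hypothetical wrapper around the standard library's validation.
fn is_valid(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

fn main() {
    assert!(is_valid(&[0x2E])); // plain '.'
    assert!(!is_valid(&[0xC0, 0xAE])); // overlong '.'
    assert!(!is_valid(&[0x80, 0xAE])); // stray continuation byte
    assert!(!is_valid(&[0xED, 0xA0, 0x80])); // encoded surrogate U+D800
    assert!(!is_valid(&[0xF4, 0x90, 0x80, 0x80])); // above U+10FFFF
}
```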
