path: Windows paths may contain non-utf8-representable sequences #12056

lilyball · 2014-02-06T02:05:39Z

It turns out that Windows paths may contain unpaired UTF-16 surrogates, which is not legally representable in UTF-8. This means that WindowsPath is not capable of representing such paths, as it uses ~str internally.

The text was updated successfully, but these errors were encountered:

SimonSapin · 2014-05-12T16:24:32Z

It turns out that Windows paths may contain unpaired UTF-16 surrogates

@kballard What is this based on? Trying to use CreateFileW directly to create such a file failed for me. Test case:

extern crate libc;

use std::ptr;
use std::io::IoError;

fn main() {
    for name in [&['a' as u16, 0xD83D, 0xDCA9, 'b' as u16, 0],
                 &['a' as u16, 0xD83D,         'b' as u16, 0]].iter() {
        let handle = unsafe {
            libc::CreateFileW(name.as_ptr(),
                              libc::FILE_GENERIC_WRITE,
                              0,
                              ptr::mut_null(),
                              libc::CREATE_ALWAYS,
                              libc::FILE_ATTRIBUTE_NORMAL,
                              ptr::mut_null())
        };
        let is_invalid = handle == libc::INVALID_HANDLE_VALUE as libc::HANDLE;
        println!("{} {} {}", name, is_invalid, IoError::last_error())
    }
}

Output:

[97, 55357, 56489, 98, 0] false unknown error (OS Error 0: The operation completed successfully.
)
[97, 55357, 98, 0] true unknown error (OS Error 87: The parameter is incorrect.
)

lilyball · 2014-05-12T16:57:31Z

@SimonSapin I don't know the precise details, but there exist portions of Windows in which paths are UCS2 rather than UTF-16. I ignored it because I thought it wasn't going to be an issue but at some point someone (and I wish I could remember who) showed me some output that showed that they were actually getting a UCS2 path from some Windows call and Path was unable to parse it.

SimonSapin · 2014-05-12T16:59:10Z

Maybe CreateFileW is doing an explicit check that not all parts of the API are doing. sigh

lilyball · 2014-05-12T17:06:31Z

My recollection is the bad path came from iterating a temporary directory on this person's filesystem. But I don't remember for certain.

SimonSapin · 2014-05-12T17:33:07Z

I’d be interested to see a test case, because this seems like it would affect every other language or library that tries to use filenames as Unicode.

lilyball · 2014-05-12T17:34:07Z

I'd love a test case too. Not being a Windows user is making it hard for me to actually test this stuff.

SimonSapin · 2014-05-12T17:36:16Z

FWIW I used a virtual machine image from http://www.modern.ie/ and a nightly build to run the test above.

klutzy · 2014-05-13T01:09:50Z

@kballard @SimonSapin #13338?
I think it depends on OS version and possibly even on locale. I ran your code on win7/kr and got:

[97, 55357, 56489, 98, 0] false unknown error (OS Error 0 (FormatMessageW() returned error 15100))
[97, 55357, 98, 0] false unknown error (OS Error 0 (FormatMessageW() returned error 15105))

SimonSapin · 2014-05-13T07:52:10Z

@klutzy false for is_invalid and OS Error 0 looks like CreateFileW was successful. Do you see the two files created in the current directory?

SimonSapin · 2014-05-13T07:59:49Z

Oh, wait, "success" here is not good news as it means you managed to create a file with a broken name. Does io::fs::readdir() trigger this failure, then?

klutzy · 2014-05-13T10:50:50Z

@SimonSapin

Do you see the two files created in the current directory?
Does io::fs::readdir() trigger this failure, then?

Yes and yes. :'( I'm curious if it only occurs on recent OSes. (maybe >= 7?) Could you confirm OS name of VM you use?

SimonSapin · 2014-05-13T10:53:23Z

Windows 7 Enterprise SP1, 32-bit, en-US locale

klutzy · 2014-05-13T10:59:46Z

Ugh. I have no idea now.

Anyway, I'm ok to just ignore non-unicode filenames: I don't think there will be many use cases regarding it.

lilyball · 2014-05-13T18:36:35Z

Perhaps. If it really is that rare, then maybe it's ok. It certainly simplifies things. And the programmer always has the option of dropping down and doing the system calls directly if they need to handle this.

But it makes me a bit uncomfortable regardless.

However, as I'm not a Windows user, I'll defer to Windows users on this subject.

vadimcn · 2014-08-30T07:40:47Z

Non-UTF-16 file names are certainly rare, but when you come across one... it's very annoying when you cannot use any standard tools to delete such a file, for example.
Will I be a terrible person if I suggest that we bend rfc3629 a bit, and allow encoding unpaired UTF-16 surrogates "as themselves" in UTF-8 ?

SimonSapin · 2014-08-30T08:56:07Z

Will I be a terrible person if I suggest that we bend rfc3629 a bit, and allow encoding unpaired UTF-16 surrogates "as themselves" in UTF-8 ?

Yes, this would be terrible. Let’s not do that in the String and str types, it would not be UTF-8. We could have a NotReallyUtf8String type separately, but it would be of limited use.

retep998 · 2014-08-30T09:03:44Z

We could have a NotReallyUtf8String type separately

Wouldn't that just be called Path?

SimonSapin · 2014-08-30T09:13:13Z

Maybe it wouldn’t be so terrible if it’s an internal detail of std::path::windows::Path, with the display() method converting to real UTF-8. But then we might as well just use UCS-2 in Vec<u16> internally. (Not sure how that would interact with BytesContainer, though. Would windows::Path::new convert from UTF-8?)

lilyball · 2014-08-30T10:03:54Z

Using UCS-2 means none of the *_str() methods would ever work. Encoding UCS-2 as an invalid UTF-8 as an internal implementation detail is something I considered a while back and is probably the best approach.

On Aug 30, 2014, at 2:13 AM, Simon Sapin notifications@github.com wrote:

Maybe it wouldn’t be so terrible if it’s an internal detail of std::path::windows::Path, with the display() method converting to real UTF-8. But then we might as well just use UCS-2 in Vec internally. (Not sure how that would interact with BytesContainer, though. Would windows::Path::new convert from UTF-8?)

—
Reply to this email directly or view it on GitHub.

retep998 · 2014-09-06T17:13:45Z

For comparison, MSVC C++ when converting an std::path to a utf-8 string will encode UCS-2 as invalid UTF-8, so there is precedent for us to follow this route.

SimonSapin · 2014-09-16T18:05:53Z

I suggest calling this WTF-8: an encoding that uses the same algorithm as UTF-8, but has the "value space" of UCS-2. (That is, bigger than Unicode since unpaired surrogates are allowed.) We can decide later what the name an acronym for.

Note: I call UCS-2 here any sequence of 16 bit units. Surrogate pairs have a special meaning, but unpaired surrogates are allowed.

To convert UCS-2 to WTF-8:

Valid surrogate pairs are interpreted as a non-BMP code point and encoded as a 4-byte UTF-8 sequence
Any other 16 bit code unit, including lone surrogates, are encoded as sequence of 1 to 3 bytes using UTF-8’s algorithm. This is invalid UTF-8 for lone surrogates, but valid WTF-8.

To convert WTF-8 to UCS-2:

4-byte sequences are interpreted as a non-BMP code point and encoded as a surrogate pair of 16 bit units
1 to 3 bytes sequences in UTF-8’s algorithm, including surrogates, are encoded as a one 16 bit unit

Consecutive 3-byte sequences for a lead surrogate followed by a trail surrogate are invalid in WTF-8. A 4-byte sequence should be used instead. (This ensures that the WTF-8 encoding of any UCS-2 string is unique.)

WTF-8 has the same "value space" as UCS-2, which is bigger than Unicode (since they include unpaired surrogates.)

Any valid UTF-8 string is also valid WTF-8, with the same byte representation. A WTF-8 string is also valid UTF-8 if and only if its UCS-2 conversion is valid UTF-16.

To convert a WTF-8 to UTF-8, either:

Strictly: return an error if the string contains a 3-byte sequence for a surrogate code point, otherwise return the string unchanged
Lossily: replace any 3-byte sequence for a surrogate code point by 0xEF 0xBF 0xBD, the UTF-8 representation of the replacement character U+FFFD.

To concatenate two WTF-8 strings: if the earlier one ends with a lead surrogate and the latter one starts with a trail surrogate, both surrogate need to be removed and replaced with a 4-byte sequence.

Note: WTF-8 is different from CESU-8.

In terms of Rust implementation, WTF-8 data should be kept in a dedicated type that wraps a private Vec<u8> field, with APIs that maintain the encoding invariants, like String does for UTF-8.

SimonSapin · 2014-09-26T13:50:20Z

I’m gonna work of WTF-8 anyway, for Servo to interact with JavaScript. ECMAScript clearly says that strings are sequences of 16-bit integers.

For Windows however, it’s not so clear to me that this is actually needed. MSDN claims that the encoding used in Windows is UTF-16. The documentation for functions converting to and from UTF-16 says:

Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16. The function used on earlier operating systems encodes or decodes lone surrogate halves or mismatched surrogate pairs.

I believe (by opposition to the following sentence) that "fully conforms" here means replaces unpaired surrogates with U+FFFD. (I haven’t tested it, though.)

Now, we’ve seen that it’s possible to create a file with an invalid UTF-16 name. But it’s not easy, we’re sometimes prevented from doing it. It may be considered a bug that it was possible. The question is, and I couldn’t find an answer on MSDN, are Windows applications expected to handle such files correctly?

retep998 · 2014-09-26T20:09:27Z

According to MSDN

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs.

This isn't exactly UTF-16. So while the rest of WinAPI uses UTF-16, the file API doesn't. I've tested creating filenames with invalid surrogates in their name and I was able to create them and manipulate them easily through both code, and numerous applications including cmd.exe Notepad and Windows explorer. So far Windows 8.1 hasn't tried to stop me from using invalid surrogates for files, so I definitely think Windows applications are expected to handle such files correctly.

SimonSapin · 2014-09-26T22:46:04Z

"Unicode normalization" refers to something else, but "opaque sequence of WCHARs" does indeed imply that unpaired surrogates can occur. Alright then, WTF-8 for Path it is.

vadimcn · 2014-09-27T00:22:06Z

So why use WTF-8 for internal representation, and not UCS-2 then? If WTFString is a distinct type from String, you might as well save on encoding/decoding them when interacting with Windows APIs.

SimonSapin · 2014-09-27T00:54:52Z

As noted by @kballard above, none of the *_str methods would work with UCS-2 internally.

This PR implements [path reform](rust-lang/rfcs#474), and motivation and details for the change can be found there. For convenience, the old path API is being kept as `old_path` for the time being. Updating after this PR is just a matter of changing imports to `old_path` (which is likely not needed, since the prelude entries still export the old path API). This initial PR does not include additional normalization or platform-specific path extensions. These will be done in follow up commits or PRs. [breaking-change] Closes rust-lang#20034 Closes rust-lang#12056 Closes rust-lang#11594 Closes rust-lang#14028 Closes rust-lang#14049 Closes rust-lang#10035

828: Avoid using display on paths. r=Emilgardis a=Alexhuszagh Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases. On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common). So, this doesn't change ergonomics: the resulting code is as compact and correct. It also requires less copies in most cases, which should be a good thing. But most importantly, if we're mounting data we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and also print more information than there was just an invalid Unicode character. Co-authored-by: Alex Huszagh <ahuszagh@gmail.com>

Fixes: rust-lang#12050 - `identity_op` correctly suggests a deference for coerced references When `identity_op` identifies a `no_op`, provides a suggestion, it also checks the type of the type of the variable. If the variable is a reference that's been coerced into a value, e.g. ``` let x = &0i32; let _ = x + 0; ``` the suggestion will now use a derefence. This is done by identifying whether the variable is a reference to an integral value, and then whether it gets dereferenced. changelog: false positive: [`identity_op`]: corrected suggestion for reference coerced to value. fixes: rust-lang#12050

Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. At the moment, we work around this problem with things like the ddio module. If we make sure that we always use std::filesystem::paths, then we’ll be able to get rid of or simplify a lot of the ddio module. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>

Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>

Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056> TODO: • Test on Windows.

Before this change, Base_directory was a char array. In general, it’s better to use an std::filesystem::path to represent paths. Here’s why: • char arrays have a limited size. If a user creates a file or directory that has a very long path, then it might not be possible to store that path in a given char array. You can try to make the char array long enough to store the longest possible path, but future operating systems may increase that limit. With std::filesystem::paths, you don’t have to worry about path length limits. • char arrays cannot necessarily represent all paths. We try our best to use UTF-8 for all char arrays [1], but unfortunately, it’s possible to create a path that cannot be represent using UTF-8 [2]. With an std::filesystem::paths, stuff like that is less likely to happen because std::filesystem::path::value_type is platform-specific. • char arrays don’t have any built-in mechanisms for manipulating paths. Before this commit, the code would work around this problem in two different ways: 1. It would use Descent 3–specific functions that implement path operations on char arrays/pointers (for example, ddio_MakePath()). Once all paths use std::filesystem::path, we’ll be able to simplify the code by removing those functions. 2. It would temporarily convert Base_directory into an std::filesystem::path. This commit simplifies the code by removing those temporary conversions because they are no longer necessary. [1]: 7cd7902 (Merge pull request DescentDevelopers#494 from Jayman2000/encoding-improvements, 2024-07-20) [2]: <rust-lang/rust#12056>

SimonSapin mentioned this issue May 13, 2014

RFC: Add byte and byte string literals rust-lang/rfcs#69

Merged

brson added the A-windows label Aug 12, 2014

retep998 mentioned this issue Sep 5, 2014

Refactor of readdir to not fail as much #17034

Merged

SimonSapin mentioned this issue Sep 24, 2014

Support document.write servo/html5ever#6

Closed

aturon mentioned this issue Jan 29, 2015

Rename std::path to std::old_path; introduce new std::path #21759

Merged

bors closed this as completed in #21759 Feb 4, 2015

simonbyrne mentioned this issue May 24, 2017

[WIP] Using types and not strings to represent Paths JuliaLang/Juleps#26

Closed

2 tasks

nojb mentioned this issue Jun 10, 2017

Unicode support for the Windows runtime: Let's do it! ocaml/ocaml#1200

Merged

8 tasks

ckamm mentioned this issue Oct 17, 2018

cannot encode <Cyrillic-names> to local encoding 2252 owncloud/client#6810

Closed

This was referenced Oct 18, 2018

fs: cannot interact with invalid UTF-16 filenames on Windows, even with Buffers nodejs/node#23735

Closed

win: uv_fs_XXX lossily decoding invalid UTF-16 filenames / paths libuv/libuv#2048

Closed

B4dM4n mentioned this issue Nov 7, 2018

Invalid characters in the Local File System cause rclone copy and sync to re-download existing files under the new name rclone/rclone#2633

Closed

hundt mentioned this issue May 30, 2019

syscall: Windows filenames with unpaired surrogates are not handled correctly golang/go#32334

Closed

jyn514 mentioned this issue Jun 23, 2019

Handle binary for shebangs proot-me/proot-rs#9

Closed

phadej mentioned this issue Dec 22, 2020

The Char Kind ghc-proposals/ghc-proposals#387

Merged

ryanofsky mentioned this issue Sep 10, 2021

Use std::filesystem. Remove Boost Filesystem & System bitcoin/bitcoin#20744

Merged

ryanofsky mentioned this issue Oct 5, 2021

refactor: Forbid calling unsafe fs::path(std::string) constructor and fs::path::string() method bitcoin/bitcoin#22937

Merged

Alexhuszagh mentioned this issue Jun 21, 2022

Avoid using display on paths. cross-rs/cross#828

Merged

EliahKagan mentioned this issue Sep 22, 2023

Add more checks for the validity of refnames gitpython-developers/GitPython#1672

Merged

KartikSoneji mentioned this issue Nov 13, 2024

fix: build-sbf Home Directory Error on Windows anza-xyz/agave#3597

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

path: Windows paths may contain non-utf8-representable sequences #12056

path: Windows paths may contain non-utf8-representable sequences #12056

lilyball commented Feb 6, 2014

SimonSapin commented May 12, 2014

lilyball commented May 12, 2014

SimonSapin commented May 12, 2014

lilyball commented May 12, 2014

SimonSapin commented May 12, 2014

lilyball commented May 12, 2014

SimonSapin commented May 12, 2014

klutzy commented May 13, 2014

SimonSapin commented May 13, 2014

SimonSapin commented May 13, 2014

klutzy commented May 13, 2014

SimonSapin commented May 13, 2014

klutzy commented May 13, 2014

lilyball commented May 13, 2014

vadimcn commented Aug 30, 2014

SimonSapin commented Aug 30, 2014

retep998 commented Aug 30, 2014

SimonSapin commented Aug 30, 2014

lilyball commented Aug 30, 2014 •

edited

Loading

retep998 commented Sep 6, 2014

SimonSapin commented Sep 16, 2014

SimonSapin commented Sep 26, 2014

retep998 commented Sep 26, 2014

SimonSapin commented Sep 26, 2014

vadimcn commented Sep 27, 2014

SimonSapin commented Sep 27, 2014

path: Windows paths may contain non-utf8-representable sequences #12056

path: Windows paths may contain non-utf8-representable sequences #12056

Comments

lilyball commented Feb 6, 2014

SimonSapin commented May 12, 2014

lilyball commented May 12, 2014

SimonSapin commented May 12, 2014

lilyball commented May 12, 2014

SimonSapin commented May 12, 2014

lilyball commented May 12, 2014

SimonSapin commented May 12, 2014

klutzy commented May 13, 2014

SimonSapin commented May 13, 2014

SimonSapin commented May 13, 2014

klutzy commented May 13, 2014

SimonSapin commented May 13, 2014

klutzy commented May 13, 2014

lilyball commented May 13, 2014

vadimcn commented Aug 30, 2014

SimonSapin commented Aug 30, 2014

retep998 commented Aug 30, 2014

SimonSapin commented Aug 30, 2014

lilyball commented Aug 30, 2014 • edited Loading

retep998 commented Sep 6, 2014

SimonSapin commented Sep 16, 2014

SimonSapin commented Sep 26, 2014

retep998 commented Sep 26, 2014

SimonSapin commented Sep 26, 2014

vadimcn commented Sep 27, 2014

SimonSapin commented Sep 27, 2014

lilyball commented Aug 30, 2014 •

edited

Loading