Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clients of the new path need to be updated for non-utf8 paths #9639

Closed
1 of 2 tasks
lilyball opened this issue Oct 1, 2013 · 11 comments
Closed
1 of 2 tasks

Clients of the new path need to be updated for non-utf8 paths #9639

lilyball opened this issue Oct 1, 2013 · 11 comments
Labels
A-Unicode Area: Unicode

Comments

@lilyball
Copy link
Contributor

lilyball commented Oct 1, 2013

The new Path module (coming soon) represents POSIX paths as vectors (#7225) of non-NUL bytes instead of as ~str. This is because Linux allows for non-unicode paths.

Unfortunately, there are a lot of clients of Path, including related functionality like Process arguments and environment variables, that still use &str. These need to be updated to handle &[u8] so they can represent non-unicode Paths.

All of the known locations where this is an issue are marked up by // FIXME lines referencing this issue. Note that in the case of Process args and environments, these comments are at the usage locations rather than the implementation.

Sub-tasks that have been filed for various components:

bors added a commit that referenced this issue Oct 16, 2013
Rewrite the entire `std::path` module from scratch.

`PosixPath` is now based on `~[u8]`, which fixes #7225.
Unnecessary allocation has been eliminated.

There are a lot of clients of `Path` that still assume utf-8 paths.
This is covered in #9639.
@SimonSapin
Copy link
Contributor

Path::display seems to still assume UTF-8. Is that something that needs to be fixed?

@alexcrichton
Copy link
Member

When formatted with fmt::Default, the display object will replace all non-utf8 sequences with the "unicode replacement character" which I believe is \ufffd or something like that. The idea with it was to provide an easy way to print the string in a readable fashion, but not to make it usable to other functions (because it's not the original path).

@SimonSapin
Copy link
Contributor

@alexcrichton , so display will have lots of replacement characters for byte-based non-UTF-8 filesystems. Is that considered good enough?

@alexcrichton
Copy link
Member

It's considered good enough in the sense that Path::new(format!("{}", path.to_str())) != path. The display is not expected to be equal to the path itself, and is intended purely for display purposes when something expects a number of utf-8 bytes to show.

@SimonSapin
Copy link
Contributor

Yes, not round-tripping in all cases for display is fine. But in cases the filesystem is known to have a different encoding (see Python’s sys.getfilesystemencoding()), shouldn’t we try and use that?

@alexcrichton
Copy link
Member

Right now we don't have a great story of encoding/decoding strings in and out of UTF-8, but it would be pretty awesome once that story gets fleshed out to support re-encoding as utf-8 if we know the filesystem is different.

@lilyball
Copy link
Contributor Author

I agree. Being able to support other encodings based on the known filesystem encoding would be great. But that can be added later, once it's actually possible to do so.

@SimonSapin
Copy link
Contributor

@kballard Thanks, I think that answers what I wanted to know. Supporting various character encodings is a separate issue. (See my API proposal on the wiki and related discussion on rust-dev linked from there.) But once we have that, dealing with filesystems in various encodings is something that should be fixed.

@lilyball
Copy link
Contributor Author

@SimonSapin Yeah, this wasn't actually planned initially, but due to how Display works, if we can figure out the filesystem encoding, Display could be updated to use the appropriate encoding. Alternatively, we could add another method .display_encoding(enc) that returns a Display object for the appropriate encoding, for the situation where Path can't figure out the filesystem encoding but the caller can.

@alexcrichton
Copy link
Member

This has largely been accomplished since this was written both in terms of an implementation strategy as well as propagating throughout the libraries. The stabilization passes we've been doing will turn up any more related clients, and I've requested that the remaining issue, #11916, be transferred to the rust-lang/rust repo, so closing.

@Eh2406
Copy link
Contributor

Eh2406 commented Sep 3, 2017

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 3, 2023
…=Nilstrieb

Handle non-utf8 rpaths (fix FIXME)

Removes a FIXME for rust-lang#9639 which is closed since long ago.

Part of rust-lang#44366 which is E-help-wanted.

(Also see rust-lang#114377)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 3, 2023
compiletest: Handle non-utf8 paths (fix FIXME)

Removes the last FIXME in the code for rust-lang#9639  🎉 (which was closed 8 years ago)

Part of rust-lang#44366 which is E-help-wanted.

(The other two PRs that does this are rust-lang#114377 and rust-lang#114427)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 4, 2023
compiletest: Handle non-utf8 paths (fix FIXME)

Removes the last FIXME in the code for rust-lang#9639  🎉 (which was closed 8 years ago)

Part of rust-lang#44366 which is E-help-wanted.

(The other two PRs that does this are rust-lang#114377 and rust-lang#114427)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 4, 2023
…=Nilstrieb

Handle non-utf8 rpaths (fix FIXME)

Removes a FIXME for rust-lang#9639 which is closed since long ago.

Part of rust-lang#44366 which is E-help-wanted.

(Also see rust-lang#114377)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Aug 9, 2023
…tf-8, r=thomcc

test_get_dbpath_for_term(): handle non-utf8 paths (fix FIXME)

Removes a FIXME for rust-lang#9639

Part of rust-lang#44366 which is E-help-wanted

The remaining two FIXMEs for rust-lang#9639 are considerably more complicated, so I will create separate PRs for them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Unicode Area: Unicode
Projects
None yet
Development

No branches or pull requests

5 participants