Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid using display on paths. #828

Merged
merged 5 commits into from
Jun 21, 2022
Merged

Avoid using display on paths. #828

merged 5 commits into from
Jun 21, 2022

Conversation

Alexhuszagh
Copy link
Contributor

Adds a helper trait ToUtf8 which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change display() to to_utf8()? and be completely correct in all cases.

On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common).

So, this doesn't change ergonomics: the resulting code is as compact and correct. It also requires less copies in most cases, which should be a good thing. But most importantly, if we're mounting data we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and also print more information than there was just an invalid Unicode character.

Adds a helper trait `ToUtf8` which converts the ath to UTF-8 if
possible, and if not, creates a pretty debugging representation of the
path. We can then change `display()` to `to_utf8()?` and be completely
correct in all cases.

On POSIX systems, this doesn't matter since paths must be UTF-8 anyway,
but on Windows some paths can be WTF-8 (see @rust-lang/rust#12056).
Either way, this can avoid creating a copy in cases, is always correct,
and is more idiomatic about what we're doing.
@Alexhuszagh Alexhuszagh added the no changelog A valid PR without changelog (no-changelog) label Jun 21, 2022
@Alexhuszagh Alexhuszagh requested a review from a team as a code owner June 21, 2022 07:27
@Alexhuszagh
Copy link
Contributor Author

Alexhuszagh commented Jun 21, 2022

I'll be adding the other as_posix helper in this PR, but a quick explanation on a Linux system:

Say I create a file with an ISO-8859-1 filename, ensuring this is done in Python:

>>> with open("Überraschung".encode("ISO-8859-15"), "w") as f:
...     f.write("Hello")
... 
5
>>> import os; os.listdir(os.getcwd())
['\udcdcberraschung']

We can confirm this is correct in Bash by checking the output in UTF-8 (only works if the terminal emulator also expects UTF-8 encoded output, which it should):

$ LANG=en_us.UTF-8 ls
''$'\334''berraschung

So now I generate my Rust code:

use std::{fs, io, env};

pub fn main() -> io::Result<()> {
    let cwd = env::current_dir()?;
    for entry in fs::read_dir(cwd)? {
        let path = entry?.path();
        println!("debug={path:?}");
        println!("display={}", path.display());
    }

    Ok(())
}

And this outputs:

./main 
debug="/home/ahuszagh/Desktop/tmp/\xDCberraschung"
display=/home/ahuszagh/Desktop/tmp/�berraschung

Or, our output isn't exactly correct: we shouldn't be using display either for error messages or mounting. At least in the debug output, we get the \xDC escape character printed, so we should be able to guess our encoding that our file we failed to mount was.

@Emilgardis
Copy link
Member

does it make sense to add Path::display to clippy lint disallowed_methods

@Alexhuszagh
Copy link
Contributor Author

Alexhuszagh commented Jun 21, 2022

I have no direct testing, but considering as_os_str would be the encoding of the host locale, and our container should probably always be in UTF-8 I believe, we might also need to look over the code in docker_cwd. Let me test a bit and convert this to a draft.

Update: that would require us having a non-Unicode locale on the host (extremely unlikely) and change the locale on the image to a non-Unicode one as well. Both are very unlikely, especially for new development, so let's just formalize UTF-8 assumptions and fix them if the need ever occurs.

@Alexhuszagh
Copy link
Contributor Author

does it make sense to add Path::display to clippy lint disallowed_methods

Yes I think so. It is incorrect for our uses, even if it's a good method for more general cases.

@Alexhuszagh Alexhuszagh marked this pull request as draft June 21, 2022 07:40
Since we use our mount paths, which will be UTF-8 inside the Linux
container, it makes sense to ensure they're valid UTF-8 prior so they
can be valid paths for the locale in the container.

I've left the following functions alone:
- `MountFinder::new`, since `OsStr` is cheaper for path comparisons than
  `str` is.
- `cargo_metadata_with_args`, since the manifest path is on the host,
  and therefore this is always valid.
- `pretty_path`, since it handles both valid Unicode paths and debug
  representations of non-Unicode paths.
@Alexhuszagh Alexhuszagh marked this pull request as ready for review June 21, 2022 08:03
@Alexhuszagh Alexhuszagh removed the no changelog A valid PR without changelog (no-changelog) label Jun 21, 2022
clippy.toml Outdated Show resolved Hide resolved
@Alexhuszagh
Copy link
Contributor Author

bors r=Emilgardis

bors bot added a commit that referenced this pull request Jun 21, 2022
828: Avoid using display on paths. r=Emilgardis a=Alexhuszagh

Adds a helper trait `ToUtf8` which converts the path to UTF-8 if possible, and if not, creates a pretty debugging representation of the path. We can then change `display()` to `to_utf8()?` and be completely correct in all cases.

On POSIX systems, this doesn't matter since paths must be UTF-8 anyway, but on Windows some paths can be WTF-8 (see rust-lang/rust#12056). Either way, this can avoid creating a copy in cases, is always correct, and is more idiomatic about what we're doing. We might not be able to handle non-UTF-8 paths currently (like ISO-8859-1 paths, which are historically very common).

So, this doesn't change ergonomics: the resulting code is as compact and correct. It also requires less copies in most cases, which should be a good thing. But most importantly, if we're mounting data we can't silently fail or produce incorrect data. If something was lossily generated, we probably shouldn't try to mount with a replacement character, and also print more information than there was just an invalid Unicode character.

Co-authored-by: Alex Huszagh <ahuszagh@gmail.com>
@Alexhuszagh
Copy link
Contributor Author

Alexhuszagh commented Jun 21, 2022

Just an additional comment in case this ever needs to be discussed again, as unlikely as that is: we need both the host and the container locale to be the same, I believe, since we need to provide bind mount mappings from the host to the container that are identical. Code like this (I believe) cannot exist if the path is not valid Unicode, since our container expects this to be a path that can be represented in UTF-8:

cross/src/docker/shared.rs

Lines 173 to 186 in 2937032

pub(crate) fn mount(
docker: &mut Command,
val: &Path,
prefix: &str,
verbose: bool,
) -> Result<PathBuf> {
let host_path = file::canonicalize(val)?;
let mount_path = canonicalize_mount_path(&host_path, verbose)?;
docker.args(&[
"-v",
&format!("{}:{prefix}{}", host_path.display(), mount_path.display()),
]);
Ok(mount_path)
}

However, nothing here stops you from providing non-UTF-8 filenames or directories as part of your crate data or external mounts to cross, just they cannot be the mount points. For example, if I have /path/to/dir/{{ISO-8859-15}}, I can provide a mount to /path/to/dir, and this should work fine, allowing me to access my files with non-UTF-8 filenames without issue (common if you need to process data extracted from a legacy source onto a modern system).

If any of these assumptions are wrong, this PR has made it easier to patch assumptions about UTF-8 paths in the future.

@Alexhuszagh
Copy link
Contributor Author

Alexhuszagh commented Jun 21, 2022

bors r-

This isn't a bug, but it does lead to / being duplicated for the path immediately after root, so /foo/bar becomes //foo/bar. I'll add tests too for this.

@bors
Copy link
Contributor

bors bot commented Jun 21, 2022

Canceled.

Previously, it would double the `//root` path, since it would add a `/`
for the root, and an additional one for the first path afterwards. Also
added unittests to ensure `as_posix` works as expected with drive
letters and normal paths.
@Alexhuszagh
Copy link
Contributor Author

bors r=Emilgardis

@bors
Copy link
Contributor

bors bot commented Jun 21, 2022

Build succeeded:

@bors bors bot merged commit 0cb97dd into cross-rs:main Jun 21, 2022
Alexhuszagh added a commit to Alexhuszagh/cross that referenced this pull request Jun 21, 2022
Prior to cross-rs#828, we added a '/' separator before component in our
makeshift `as_posix`, so the path would always become absolute, even
though the actual path was relative. `as_posix` therefore handled this
as a relative path, so using `project` and not `/project` as the working
directory for the cross command caused commands to fail.
bors bot added a commit that referenced this pull request Jun 21, 2022
831: Fixes regression in #828 with absolute path. r=Emilgardis a=Alexhuszagh

Prior to #828, we added a '/' separator before component in our makeshift `as_posix`, so the path would always become absolute, even though the actual path was relative. `as_posix` therefore handled this as a relative path, so using `project` and not `/project` as the working directory for the cross command caused commands to fail.

Co-authored-by: Alex Huszagh <ahuszagh@gmail.com>
@Alexhuszagh Alexhuszagh deleted the utf8_path branch June 23, 2022 23:25
@Emilgardis Emilgardis added this to the v0.2.2 milestone Jun 24, 2022
@Emilgardis Emilgardis mentioned this pull request Jul 3, 2022
@Alexhuszagh Alexhuszagh added the no-ci-targets PRs that do not affect any cross-compilation targets. label Nov 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-ci-targets PRs that do not affect any cross-compilation targets.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants