-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add non-ut8 support to glob #11972
Add non-ut8 support to glob #11972
Conversation
@kballard r? |
use std::path::{is_sep_byte, BytesContainer}; | ||
|
||
// Bytes used by the Pattern matcher | ||
// as UTF-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "as UTF-8" mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
erm, looks like I forgot to update the comments!
This is an unfortunate change, to handling only bytes. What happens on say windows where paths are unicode? E.g. what happens with a pattern like |
And, just thinking about it: what happens to that pattern on linux? At a guess, it will behave like This might be the "correct" behaviour, but it's certainly very peculiar. Python 2 has 2 "modes":
(Python3 has the same behaviour as the |
@@ -28,10 +28,23 @@ | |||
#[crate_type = "dylib"]; | |||
#[license = "MIT/ASL2"]; | |||
|
|||
extern mod extra; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra
is only used for the tests. You should #[cfg(test)]
this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmh, right!
Seems related to #11650. |
I am going to get lunch, I will continue reviewing this afternoon. I also share @huonw's worry about supporting clients who want to use unicode codepoints in their character classes. Plausibly, we could detect whether the pattern is a string or a byte vector, and if it's a string, use character processing after all, although I am concerned that this may be confusing. If we do this, we still need to support non-utf8 filenames as those can still match |
Agreed. I hadn't replied to @huonw's comment because I was putting some thought on it. I'm not sure what the right solution to this is. As you pointed out supporting both will be confusing. We also have different implementations of |
Actually, maybe supporting both is actually our best shot for now. |
What happens in Windows if you glob, using system interfaces, for "X_" or "_Y", where X and Y are UTF-16 surrogates? If that works, then glob needs to based on possibly-invalid UTF-16 to be correct on Windows. Isn't it possible to call OS APIs (e.g. FindFirstFile) though instead of reimplementing them in Rust? |
@bill-myers "X_" or "_Y" where X and Y are UTF-16 surrogates is not valid unicode. |
@kballard Yes, it's not valid UTF-16, but I'm not sure whether Windows really rejects invalid UTF-16 everywhere, such as in glob patterns. In fact, I suspect that Windows doesn't check for valid UTF-16 anywhere. |
@alexcrichton Last I heard @flaper87 was still planning on working on this, but I don't know what his immediate plans are right now. |
@alexcrichton It could've been merged as it was and then add support for unicode paths in a follow-up patch. We thought about adding that support in this patch right away. I'd like to keep working on it but I'm focused on the other bugs now, if someone wants to complete it that'd be awesome. Otherwise, we could merge it and add support for unicode later. The patch already improves our current situation. I'll rebase it, anyway. |
Closing due to inactivity, but feel free to reopen with a rebase! |
@kballard It looks like I've been failing at getting back to this patch. I still think there's some value in it as-is. What do you think about merging it and adding support for utf-8 in a follow-up patch? I can create an issue for that. This patch at least fixes glob to some extent. |
@flaper87 I think I need to finish my review ;) |
@kballard awesome, I'll re-open it for now. Feel free to re-close it if you think it is not worth to merge as is (or with some fixes). |
|
||
let chars = pattern.chars().to_owned_vec(); | ||
let bytes = pattern.container_as_bytes().to_owned(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not need to be owned.
It occurs to me that |
Flagged A-unicode |
Closing it for now. It is taking longer to get back to it and it still requires some work. If someone wants to take it, I'm fine with that. |
Do not suggest `[T; n]` instead of `vec![T; n]` if `T` is not `Copy` changelog: [`useless_vec`]: do not suggest replacing `&vec![T; N]` by `&[T; N]` if `T` is not `Copy` Fix rust-lang#11958
The patch adds
ByteContainer
pretty much everywhere and treats paths and patterns asbytes
instead ofstr
to support non-utf8 characters.cc #11916