Add non-utf8 support to glob #11972

Closed
wants to merge 3 commits into from

Contributor

@flaper87 flaper87 commented Feb 1, 2014

The patch adds BytesContainer pretty much everywhere and treats paths and patterns as bytes instead of str, in order to support non-UTF-8 characters.

cc #11916

@flaper87
Contributor Author

flaper87 commented Feb 1, 2014

@kballard r?

use std::path::{is_sep_byte, BytesContainer};

// Bytes used by the Pattern matcher
// as UTF-8
Member

What does "as UTF-8" mean?

Contributor Author

erm, looks like I forgot to update the comments!

@huonw
Member

huonw commented Feb 1, 2014

Switching to handling only bytes is an unfortunate change. What happens on, say, Windows, where paths are Unicode? For example, what happens with a pattern like [ä-ö]?

@huonw
Member

huonw commented Feb 1, 2014

And, just thinking about it: what happens to that pattern on Linux? At a guess, it will behave like [\xC3\xA4-\xC3\xB6] (where \x is a byte, not a codepoint), i.e. it will match any byte that is \xC3, \xB6, or anything between \xA4 and \xC3, rather than matching codepoints from ä to ö.

This might be the "correct" behaviour, but it's certainly very peculiar.

Python 2 has 2 "modes":

>>> from fnmatch import fnmatch
>>> fnmatch(u'å', u'[ä-ö]')
True
>>> fnmatch('å', '[ä-ö]')
False

(Python3 has the same behaviour as the u'...' unicode strings.)
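To make the byte-versus-codepoint difference concrete, here is a minimal sketch in current Rust. It is not code from the patch: `char_class_matches` and `byte_class_matches` are hypothetical helpers that hard-code the two readings of [ä-ö] described above.

```rust
/// Hypothetical helper: does `c` fall in the codepoint range `lo..=hi`?
fn char_class_matches(lo: char, hi: char, c: char) -> bool {
    lo <= c && c <= hi
}

/// Hypothetical helper: the UTF-8 bytes of "ä-ö" read as a byte class,
/// i.e. the single bytes 0xC3 and 0xB6 plus the range 0xA4..=0xC3.
fn byte_class_matches(b: u8) -> bool {
    b == 0xC3 || b == 0xB6 || (0xA4..=0xC3).contains(&b)
}

fn main() {
    // Codepoint view: å (U+00E5) lies between ä (U+00E4) and ö (U+00F6).
    assert!(char_class_matches('ä', 'ö', 'å'));

    // Byte view: å is the two bytes 0xC3 0xA5. Each byte is in the class,
    // but a single-byte class consumes only one of them.
    let bytes = "å".as_bytes();
    assert_eq!(bytes, &[0xC3u8, 0xA5][..]);
    assert!(bytes.iter().all(|&b| byte_class_matches(b)));

    // The byte class also accepts bytes that have nothing to do with ä..ö.
    assert!(byte_class_matches(0xB0));
    // Plain ASCII such as 'z' (0x7A) is still rejected.
    assert!(!byte_class_matches(b'z'));
}
```

This is consistent with the Python 2 byte-string result above: the class matches a single byte, so it cannot match the two-byte 'å' as a whole.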

@@ -28,10 +28,23 @@
#[crate_type = "dylib"];
#[license = "MIT/ASL2"];

extern mod extra;
Contributor

extra is only used for the tests. You should #[cfg(test)] this.

Contributor Author

mmh, right!

@erickt
Contributor

erickt commented Feb 1, 2014

Seems related to #11650.

@lilyball
Contributor

lilyball commented Feb 1, 2014

I am going to get lunch; I will continue reviewing this afternoon. I also share @huonw's worry about supporting clients who want to use Unicode codepoints in their character classes. Plausibly, we could detect whether the pattern is a string or a byte vector and, if it's a string, use character processing after all, although I am concerned that this may be confusing. If we do this, we still need to support non-UTF-8 filenames, as those can still match * and ? tokens.
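One way the "string or byte vector" idea could look is sketched below in current Rust. This is only an illustration under stated assumptions, not the design of the patch: `RangeClass` and `first_char` are hypothetical names, and the filename is always taken as raw bytes so that non-UTF-8 names remain representable.

```rust
/// Hypothetical range class, tagged by how the pattern was supplied.
enum RangeClass {
    /// Pattern came from a string: compare codepoints.
    Chars(char, char),
    /// Pattern came from raw bytes: compare single bytes.
    Bytes(u8, u8),
}

impl RangeClass {
    /// Try to match the class at the start of `name`.
    /// On success, return how many bytes of `name` were consumed.
    fn match_prefix(&self, name: &[u8]) -> Option<usize> {
        match *self {
            RangeClass::Bytes(lo, hi) => {
                let b = *name.first()?;
                if lo <= b && b <= hi { Some(1) } else { None }
            }
            RangeClass::Chars(lo, hi) => {
                // Invalid UTF-8 at this position simply fails to match the class.
                let c = first_char(name)?;
                if lo <= c && c <= hi { Some(c.len_utf8()) } else { None }
            }
        }
    }
}

/// Decode the first UTF-8 character of `bytes`, if a valid one starts there.
fn first_char(bytes: &[u8]) -> Option<char> {
    // A UTF-8 sequence is at most 4 bytes long.
    for len in 1..=bytes.len().min(4) {
        if let Ok(s) = std::str::from_utf8(&bytes[..len]) {
            return s.chars().next();
        }
    }
    None
}

fn main() {
    // String-style class: consumes the whole two-byte character å.
    assert_eq!(RangeClass::Chars('ä', 'ö').match_prefix("åx".as_bytes()), Some(2));
    // Byte-style class: consumes only the first byte.
    assert_eq!(RangeClass::Bytes(0xA4, 0xC3).match_prefix(&[0xC3, 0xA5]), Some(1));
    // A non-UTF-8 filename byte never matches a codepoint class.
    assert_eq!(RangeClass::Chars('ä', 'ö').match_prefix(&[0xFF]), None);
}
```

In this sketch the caller must know which variant it built, which is exactly where the confusion mentioned above could arise.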

@flaper87
Contributor Author

flaper87 commented Feb 1, 2014

Agreed. I hadn't replied to @huonw's comment because I was giving it some thought.

I'm not sure what the right solution to this is. As you pointed out, supporting both will be confusing. We also have different implementations of Path for POSIX and Windows; perhaps we should do something similar for Pattern. However, I don't think that will actually help: it would just keep things separated, and the issue would still remain.

@flaper87
Contributor Author

flaper87 commented Feb 1, 2014

Actually, maybe supporting both is our best shot for now.

@bill-myers
Contributor

What happens in Windows if you glob, using system interfaces, for "X_" or "_Y", where X and Y are UTF-16 surrogates?

If that works, then glob needs to be based on possibly-invalid UTF-16 to be correct on Windows.

Isn't it possible to call OS APIs (e.g. FindFirstFile) though instead of reimplementing them in Rust?

@lilyball
Contributor

lilyball commented Feb 3, 2014

@bill-myers "X_" or "_Y" where X and Y are UTF-16 surrogates is not valid unicode.

@bill-myers
Contributor

@kballard Yes, it's not valid UTF-16, but I'm not sure whether Windows really rejects invalid UTF-16 everywhere, such as in glob patterns.

In fact, I suspect that Windows doesn't check for valid UTF-16 anywhere.
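For illustration, the sketch below (current Rust, standard library only) shows why a name such as "X_" with X an unpaired surrogate cannot be held in a Rust string, even though it is a perfectly storable sequence of 16-bit units. It says nothing about what Windows itself accepts; `is_well_formed_utf16` is a hypothetical helper.

```rust
use std::char::decode_utf16;

/// Hypothetical helper: true if `units` contains no unpaired surrogates.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}

fn main() {
    // "X_" where X is an unpaired high surrogate (0xD800), followed by '_'.
    let lone = [0xD800u16, 0x005F];
    assert!(!is_well_formed_utf16(&lone));

    // Strict conversion to a Rust string rejects it.
    assert!(String::from_utf16(&lone).is_err());

    // Lossy conversion substitutes U+FFFD, so the original name is not recoverable.
    assert_eq!(String::from_utf16_lossy(&lone), "\u{FFFD}_");

    // A well-formed surrogate pair (U+1F600) converts without loss.
    let paired = [0xD83Du16, 0xDE00];
    assert!(is_well_formed_utf16(&paired));
    assert_eq!(String::from_utf16(&paired).unwrap(), "\u{1F600}");
}
```

A glob that wanted to match such names would therefore have to work on the 16-bit units directly rather than on a decoded string.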

@alexcrichton
Member

@kballard, @flaper87: what's the status on this?

@lilyball
Contributor

@alexcrichton Last I heard @flaper87 was still planning on working on this, but I don't know what his immediate plans are right now.

@flaper87
Contributor Author

@alexcrichton It could have been merged as it was, with support for Unicode paths added in a follow-up patch. We thought about adding that support in this patch right away. I'd like to keep working on it, but I'm focused on other bugs now; if someone wants to complete it, that'd be awesome.

Otherwise, we could merge it and add support for unicode later. The patch already improves our current situation.

I'll rebase it, anyway.

@alexcrichton
Member

Closing due to inactivity, but feel free to reopen with a rebase!

@flaper87
Contributor Author

@kballard It looks like I've been failing at getting back to this patch. I still think there's some value in it as-is. What do you think about merging it and adding support for utf-8 in a follow-up patch? I can create an issue for that. This patch at least fixes glob to some extent.

@lilyball
Contributor

@flaper87 I think I need to finish my review ;)

@flaper87
Contributor Author

@kballard awesome, I'll re-open it for now. Feel free to re-close it if you think it is not worth merging as is (or with some fixes).

@flaper87 flaper87 reopened this Feb 21, 2014

let chars = pattern.chars().to_owned_vec();
let bytes = pattern.container_as_bytes().to_owned();
Contributor

This does not need to be owned.

@lilyball
Contributor

It occurs to me that AnyByte here is matching one byte. Previous behavior of ? was to match one character. This needs to be resolved before this can be merged in, because this patch shouldn't modify the behavior of pre-existing calls (that use strings).
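The behavior change can be seen with a deliberately tiny matcher that understands only literals and ?. This is a sketch in current Rust, not the patch's implementation; `matches_chars` and `matches_bytes` are hypothetical names standing in for the old character-based and new byte-based semantics.

```rust
/// Hypothetical matcher where `?` consumes exactly one character.
fn matches_chars(pattern: &str, name: &str) -> bool {
    pattern.chars().count() == name.chars().count()
        && pattern.chars().zip(name.chars()).all(|(p, n)| p == '?' || p == n)
}

/// Hypothetical matcher where `?` consumes exactly one byte.
fn matches_bytes(pattern: &[u8], name: &[u8]) -> bool {
    pattern.len() == name.len()
        && pattern.iter().zip(name).all(|(&p, &n)| p == b'?' || p == n)
}

fn main() {
    let name = "é.txt"; // é is two bytes in UTF-8

    // Character semantics: one `?` covers é.
    assert!(matches_chars("?.txt", name));

    // Byte semantics: the same pattern no longer matches...
    assert!(!matches_bytes(b"?.txt", name.as_bytes()));
    // ...and two `?` tokens are needed instead.
    assert!(matches_bytes(b"??.txt", name.as_bytes()));
}
```

Existing callers that pass string patterns such as "?.txt" would silently stop matching names like this one, which is the regression described above.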

@pzol pzol added the A-unicode label Feb 26, 2014
@pzol
Contributor

pzol commented Feb 26, 2014

Flagged A-unicode

@flaper87
Contributor Author

Closing it for now. It is taking longer to get back to it and it still requires some work. If someone wants to take it, I'm fine with that.

@flaper87 flaper87 closed this Feb 28, 2014
@flaper87 flaper87 deleted the issue-11916 branch March 11, 2014 09:16
flip1995 pushed a commit to flip1995/rust that referenced this pull request Jan 11, 2024
Do not suggest `[T; n]` instead of `vec![T; n]` if `T` is not `Copy`

changelog: [`useless_vec`]: do not suggest replacing `&vec![T; N]` by `&[T; N]` if `T` is not `Copy`

Fix rust-lang#11958
Labels
A-Unicode Area: Unicode

7 participants