Add non-utf8 support to glob #11972

flaper87 · 2014-02-01T13:55:44Z

The patch uses BytesContainer pretty much everywhere and treats paths and patterns as bytes instead of str, in order to support non-UTF-8 paths.

cc #11916

flaper87 · 2014-02-01T13:55:54Z

@kballard r?

huonw · 2014-02-01T14:15:38Z

src/libglob/lib.rs

+use std::path::{is_sep_byte, BytesContainer};
+
+// Bytes used by the Pattern matcher
+// as UTF-8


What does "as UTF-8" mean?

erm, looks like I forgot to update the comments!

huonw · 2014-02-01T14:33:22Z

This is an unfortunate change, to handling only bytes. What happens on, say, Windows, where paths are Unicode? E.g. what happens with a pattern like [ä-ö]?

huonw · 2014-02-01T14:41:12Z

And, just thinking about it: what happens to that pattern on Linux? At a guess, it will behave like [\xC3\xA4-\xC3\xB6] (where \x is a byte, not a codepoint), i.e. it will match any byte that is \xC3, \xB6, or anything between \xA4 and \xC3, rather than matching codepoints from ä to ö.

This might be the "correct" behaviour, but it's certainly very peculiar.

Python 2 has 2 "modes":

>>> from fnmatch import fnmatch
>>> fnmatch(u'å', u'[ä-ö]')
True
>>> fnmatch('å', '[ä-ö]')
False

(Python3 has the same behaviour as the u'...' unicode strings.)
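
A minimal sketch of the two readings of [ä-ö] discussed above, written in current Rust rather than the 2014 API. The helpers `byte_class`, `byte_class_matches` and `char_range_matches` are hypothetical names made up for illustration; they are not part of libglob, and the parsing is deliberately simplified (no negation, no escaping).

```rust
/// Parse the text between `[` and `]` one byte at a time into inclusive
/// byte ranges. Hypothetical helper, not the libglob parser.
fn byte_class(inner: &[u8]) -> Vec<(u8, u8)> {
    let mut ranges = Vec::new();
    let mut i = 0;
    while i < inner.len() {
        if i + 2 < inner.len() && inner[i + 1] == b'-' {
            // `lo-hi`: a range between two single bytes.
            ranges.push((inner[i], inner[i + 2]));
            i += 3;
        } else {
            // A lone byte is a range containing only itself.
            ranges.push((inner[i], inner[i]));
            i += 1;
        }
    }
    ranges
}

/// Byte-wise reading: does the single byte `b` fall in any range?
fn byte_class_matches(ranges: &[(u8, u8)], b: u8) -> bool {
    ranges.iter().any(|&(lo, hi)| lo <= b && b <= hi)
}

/// Codepoint-wise reading: does `c` fall between `lo` and `hi`?
fn char_range_matches(lo: char, hi: char, c: char) -> bool {
    lo <= c && c <= hi
}
```

Under these assumptions, `byte_class("ä-ö".as_bytes())` produces the three ranges \xC3, \xA4-\xC3 and \xB6, so the class matches exactly one byte at a time, while å is two bytes in UTF-8. The codepoint reading, `char_range_matches('ä', 'ö', 'å')`, does accept å.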

lilyball · 2014-02-01T19:58:01Z

src/libglob/lib.rs

@@ -28,10 +28,23 @@
 #[crate_type = "dylib"];
 #[license = "MIT/ASL2"];

+extern mod extra;


extra is only used for the tests. You should #[cfg(test)] this.

mmh, right!

erickt · 2014-02-01T20:18:38Z

Seems related to #11650.

lilyball · 2014-02-01T20:51:55Z

I am going to get lunch, I will continue reviewing this afternoon. I also share @huonw's worry about supporting clients who want to use unicode codepoints in their character classes. Plausibly, we could detect whether the pattern is a string or a byte vector, and if it's a string, use character processing after all, although I am concerned that this may be confusing. If we do this, we still need to support non-utf8 filenames as those can still match * and ? tokens.
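
One way the "detect whether the pattern is a string or a byte vector" idea could look, sketched in current Rust. The `Mode` enum, the `PatternSource` trait and `mode_of` are hypothetical names invented for this illustration; they only loosely mirror what BytesContainer did in 2014 and are not an existing API.

```rust
/// How the pattern should be interpreted. Hypothetical type.
#[derive(Debug, PartialEq)]
enum Mode {
    /// Pattern came from a string: character classes work on codepoints.
    Chars,
    /// Pattern came from raw bytes: character classes work on bytes.
    Bytes,
}

/// Anything that can be used as a pattern. Hypothetical trait.
trait PatternSource {
    fn mode(&self) -> Mode;
    fn raw(&self) -> &[u8];
}

impl PatternSource for str {
    fn mode(&self) -> Mode {
        Mode::Chars
    }
    fn raw(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl PatternSource for [u8] {
    fn mode(&self) -> Mode {
        Mode::Bytes
    }
    fn raw(&self) -> &[u8] {
        self
    }
}

/// Pick the matching mode purely from the static type of the pattern.
fn mode_of<P: PatternSource + ?Sized>(pattern: &P) -> Mode {
    pattern.mode()
}
```

With this shape, `mode_of("[ä-ö]")` selects `Mode::Chars` and `mode_of(&b"[\xC3\xA4]"[..])` selects `Mode::Bytes`. The filename being matched would still have to be handled as bytes in both modes, since a non-UTF-8 name can match * and ?.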

flaper87 · 2014-02-01T21:16:53Z

Agreed. I hadn't replied to @huonw's comment because I was giving it some thought.

I'm not sure what the right solution to this is. As you pointed out, supporting both will be confusing. We also have different implementations of Path for posix and windows, so perhaps we should do something similar for Pattern. However, I don't think that will actually help: it would keep things separated, but the issue would still remain.

flaper87 · 2014-02-01T21:19:22Z

Actually, maybe supporting both is actually our best shot for now.

bill-myers · 2014-02-02T21:56:37Z

What happens on Windows if you glob, using system interfaces, for "X_" or "_Y", where X and Y are UTF-16 surrogates?

If that works, then glob needs to be based on possibly-invalid UTF-16 to be correct on Windows.

Isn't it possible to call OS APIs (e.g. FindFirstFile) though instead of reimplementing them in Rust?

lilyball · 2014-02-03T03:34:13Z

@bill-myers "X_" or "_Y" where X and Y are UTF-16 surrogates is not valid unicode.

bill-myers · 2014-02-06T05:09:37Z

@kballard Yes, it's not valid UTF-16, but I'm not sure whether Windows really rejects invalid UTF-16 everywhere, such as in glob patterns.

In fact, I suspect that Windows doesn't check for valid UTF-16 anywhere.
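
Independently of what Windows itself accepts, the "X_" case can be checked for well-formedness with Rust's current standard library. This sketch uses `String::from_utf16` and `String::from_utf16_lossy` from today's std (not the 2014 API); the wrapper names `is_well_formed_utf16` and `decode_lossy` are hypothetical.

```rust
/// True if `units` is well-formed UTF-16, i.e. every surrogate is part of
/// a correctly ordered high/low pair.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}

/// Decode leniently, replacing each unpaired surrogate with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

Here `[0xD800, 0x005F]` (a high surrogate followed by `_`) and `[0x005F, 0xDC00]` (`_` followed by a low surrogate) are both rejected, while a proper pair such as `[0xD83D, 0xDE00]` is accepted. This says nothing about whether the OS rejects such names, which is the open question in the comments above.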

alexcrichton · 2014-02-14T05:46:20Z

@kballard, @flaper87: what's the status on this?

lilyball · 2014-02-14T07:18:48Z

@alexcrichton Last I heard @flaper87 was still planning on working on this, but I don't know what his immediate plans are right now.

flaper87 · 2014-02-14T08:23:17Z

@alexcrichton It could have been merged as it was, with support for unicode paths added in a follow-up patch. We thought about adding that support in this patch right away. I'd like to keep working on it, but I'm focused on other bugs now; if someone wants to complete it, that'd be awesome.

Otherwise, we could merge it and add support for unicode later. The patch already improves our current situation.

I'll rebase it, anyway.

alexcrichton · 2014-02-20T23:16:45Z

Closing due to inactivity, but feel free to reopen with a rebase!

flaper87 · 2014-02-21T07:59:53Z

@kballard It looks like I've been failing at getting back to this patch. I still think there's some value in it as-is. What do you think about merging it and adding support for utf-8 in a follow-up patch? I can create an issue for that. This patch at least fixes glob to some extent.

lilyball · 2014-02-21T16:52:22Z

@flaper87 I think I need to finish my review ;)

flaper87 · 2014-02-21T16:59:16Z

@kballard awesome, I'll re-open it for now. Feel free to re-close it if you think it is not worth merging as is (or with some fixes).

lilyball · 2014-02-21T17:26:35Z

src/libglob/lib.rs


-        let chars = pattern.chars().to_owned_vec();
+        let bytes = pattern.container_as_bytes().to_owned();


This does not need to be owned.

lilyball · 2014-02-21T17:28:55Z

It occurs to me that AnyByte here is matching one byte. Previous behavior of ? was to match one character. This needs to be resolved before this can be merged in, because this patch shouldn't modify the behavior of pre-existing calls (that use strings).
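
The behavioural difference can be sketched as follows, in current Rust. `skip_any_byte` and `skip_any_char` are hypothetical helpers standing in for what a single ? consumes under each interpretation; they are not the patch's AnyByte implementation.

```rust
/// Byte-wise `?`: consume exactly one byte and return the remainder,
/// or `None` if the name is empty.
fn skip_any_byte(name: &[u8]) -> Option<&[u8]> {
    name.split_first().map(|(_, rest)| rest)
}

/// Character-wise `?`: consume exactly one codepoint and return the
/// remainder, or `None` if the name is empty.
fn skip_any_char(name: &str) -> Option<&str> {
    let mut chars = name.chars();
    chars.next().map(|_| chars.as_str())
}
```

For the name "ä", `skip_any_char` leaves an empty remainder, so the pattern ? matches the whole name. `skip_any_byte` on the same name leaves the trailing byte \xA4, so a byte-wise ? no longer matches it, which is the regression for existing string callers described above.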

pzol · 2014-02-26T18:17:30Z

Flagged A-unicode

flaper87 · 2014-02-28T20:54:15Z

Closing it for now. It is taking longer to get back to it and it still requires some work. If someone wants to take it, I'm fine with that.


huonw reviewed Feb 1, 2014

lilyball reviewed Feb 1, 2014

alexcrichton closed this Feb 20, 2014

flaper87 reopened this Feb 21, 2014

lilyball reviewed Feb 21, 2014

Add support for non-utf8 filenames

333dee3

flaper87 added 2 commits February 21, 2014 19:13

Add a non_utf8 test for matches

89e69a2

Pattern::escape should also use BytesContainer

220ece2

pzol added the A-unicode label Feb 26, 2014

flaper87 closed this Feb 28, 2014

flaper87 deleted the issue-11916 branch March 11, 2014 09:16

WindSoilder mentioned this pull request Jun 7, 2022

print warning message if meet non utf-8 path nushell/nushell#5731 (merged)

