Support glob patterns #97

Closed
jakwings opened this issue Oct 11, 2017 · 12 comments

Comments

@jakwings
Contributor

Tracking #96

Add the flag --glob to treat <pattern> as a glob pattern, and the flag --regex to override --glob.

@jakwings
Contributor Author

jakwings commented Oct 11, 2017

Currently it is impossible to match arbitrary bytes with (?-u) (--regex), so non-UTF8-encoded filenames cannot be matched. What do you think?

@sharkdp
Owner

sharkdp commented Oct 11, 2017

Thank you for your contribution!

If I understand this correctly, this would basically be like find's -name/-iname, right?

How would glob-matching work, exactly?

  • Would this respect fd's smart-case-by-default setting and change to case sensitive for -s?
  • How would this work together with --full-path? Would we try to match the glob pattern on the full path as well? Does that make sense?

As a side remark: we do have the -e, --extension <ext> option, so fd -e ext is basically the same as fd --glob '*.ext'.

@jakwings
Contributor Author

jakwings commented Oct 11, 2017

Would this respect fd's smart-case-by-default setting and change to case sensitive for -s?

Yes, the glob pattern is converted to a regex pattern, prefixed with '(?-u)' to match arbitrary bytes (not UTF-8 aware, but I removed it in the patch).
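
For illustration only, here is a minimal Python sketch of the idea of translating a glob into a regex and then applying a smart-case rule. It uses Python's standard `fnmatch.translate`, not the globset crate or fd's actual code; the helper names `glob_to_regex` and `smart_case` are hypothetical, and the smart-case rule shown (case-sensitive only when the pattern contains an uppercase letter) is an assumption about the intended behavior.

```python
import fnmatch
import re


def smart_case(pattern: str) -> bool:
    """Assumed smart-case rule: case-sensitive only if the pattern has an uppercase letter."""
    return any(ch.isupper() for ch in pattern)


def glob_to_regex(glob: str, case_sensitive: bool) -> "re.Pattern[str]":
    """Translate a shell-style glob into a compiled regex (hypothetical helper)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # fnmatch.translate returns a regex string anchored at the end of the input.
    return re.compile(fnmatch.translate(glob), flags)


pattern = "*.rs"
regex = glob_to_regex(pattern, case_sensitive=smart_case(pattern))

print(bool(regex.match("main.rs")))      # lowercase pattern, so matching ignores case
print(bool(regex.match("MAIN.RS")))
print(bool(regex.match("main.rs.bak")))  # the translated regex is anchored at the end
```

Because the glob ends up as an ordinary regex, case handling can be applied with the same flags as for a regex pattern.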

How would this work together with --full-path? Would we try to match the glob pattern on the full path as well? Does that make sense?

It does not work well currently; I only just noticed that flag. A glob pattern like ** is also useful on the full path.

As a side remark: we do have the -e, --extension option, so fd -e ext is basically the same as fd --glob '*.ext'.

Omitted too. :P But there is a conflict: fd -e .rs src this_repo gives nothing. (wrong)

@jakwings
Contributor Author

Oh, there is no conflict between -e and <pattern>.

@jakwings
Contributor Author

PS. The syntax for globs: https://docs.rs/globset/0.2.0/globset/#syntax

@sharkdp
Owner

sharkdp commented Oct 12, 2017

Thank you for the clarification.

prefixed with '(?-u)' to match arbitrary bytes (not UTF-8 aware, but I removed it in the patch).

Could you please explain this in more detail? I'm not familiar with (?-u)

@jakwings
Contributor Author

(?-u) means "disable Unicode support". So that . matches a single byte, not a valid UTF-8 byte sequence. Normally the regex crate has Unicode support enabled.

Ref: https://doc.rust-lang.org/regex/regex/index.html#grouping-and-flags
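
As a loose analogy (Python's `re` module, not the Rust regex crate, and without a `(?-u)` flag of its own), the difference between byte-oriented and text-oriented matching can be sketched like this. The file name `k\xe4k` is just an example of a name that is not valid UTF-8.

```python
import re

name = b"k\xe4k"  # a lone 0xE4 byte is not a valid UTF-8 sequence

# Byte-oriented matching: "." consumes any single byte, so the name matches.
byte_match = re.fullmatch(rb"(?s)k.k", name)

# Text-oriented matching needs a valid string first; strict decoding fails here.
try:
    name.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(byte_match is not None, decodable)
```

The byte pattern matches the name, while the text-oriented route never gets as far as matching, because the name cannot be decoded.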

@sharkdp
Copy link
Owner

sharkdp commented Oct 13, 2017

Sorry to be pedantic, but what does this mean, exactly?

Do we need Unicode to be disabled to match files with UTF-8-broken filenames?
Do we need Unicode to be enabled to be able to match on glob patterns with non-ascii characters in them?

@jakwings
Contributor Author

jakwings commented Oct 13, 2017

Different file systems behave differently when creating files with arbitrary bytes in their names. Some keep any byte (except NUL and "/"). On macOS's HFS+, invalid bytes are escaped like URI special characters, e.g. 0xE4 becomes %E4.

With Unicode support enabled, the empty pattern "", "^", and the like can match all files, while "^.*$" or more complex patterns may not.

I think it is fine to enable Unicode support for glob patterns, if fd does not respect system locales and only supports UTF-8. Take GNU find for example:

$ touch $'k\xe4k'
$ ls
k?k
$ echo *
k�k
$ LC_ALL=C find . -name '*'
.
./k?k
$ LC_ALL=C.UTF-8 find . -name '*'
.

What if we want to check for file paths with invalid bytes? Just do a subtraction on sets:

$ touch kkk $'k\xe4k'
$ comm -z23 <(fd -0 | sort -zu) <(fd -0 '(?s)^.*$' | sort -zu)
k?k
$ comm -z23 <(fd -0 '' ~ | sort -zu) <(fd -0 '(?s)^.*$' ~ | sort -zu)  # now search your $HOME
...

EDIT: s/rfind/fd/
EDIT: add (?s) to match linefeeds
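
The same set subtraction can be sketched in Python, assuming the file names are available as raw bytes. This is only an illustration of the idea, not what fd or comm do internally; `is_valid_utf8` and `undecodable_names` are hypothetical helper names.

```python
from typing import Iterable, Set


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw name decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def undecodable_names(names: Iterable[bytes]) -> Set[bytes]:
    """All names minus the names that are valid UTF-8 (hypothetical helper)."""
    all_names = set(names)
    valid = {raw for raw in all_names if is_valid_utf8(raw)}
    return all_names - valid


sample = [b"kkk", b"k\xe4k"]
print(undecodable_names(sample))
```

On a POSIX system, `os.listdir(b".")` returns names as bytes and could supply the input for a real directory.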

@jakwings
Contributor Author

The command for checking can be simplified to find -not -name '*' (GNU)

@nedbat

nedbat commented Apr 19, 2018

fd looks great, thanks for it. Converting to it from find still leaves one point of friction: when thinking about filenames, I think in globs, not regexes. This feature request looks like exactly what I was looking for, but it's been closed.

I see the feature has been added to a fork, but not here?

@sharkdp
Owner

sharkdp commented Apr 19, 2018

@nedbat Thanks! I have opened a new ticket (that tries to summarize all aspects) here: #284.
