Byte pattern detection works inconsistently #1838

bdlmt · 2021-03-31T01:44:49Z

What version of ripgrep are you using?

ripgrep 12.1.1 (rev 7cb2113)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

Initially, I found the bug while using 11.0.2:

ripgrep 11.0.2
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

sudo dpkg -i ./ripgrep_12.1.1_amd64.deb

What operating system are you using ripgrep on?

Ubuntu 20.04.2 LTS

Linux local 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Describe your bug.

Using command line mode, escaping from Unicode mode in order to scan for bytes doesn't seem to work as documented.

What are the steps to reproduce the behavior?

Simple to reproduce. A scan of /usr/bin will suffice.

What is the actual behavior?

An rg scan of /usr/bin for ELF magic values returns 0 results, while there should be over 1000 matching files:

user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/
user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/*
user@local:~$

Using the same byte pattern, a direct scan of individual files works as expected:

user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/ls
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/ed
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/file
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/find
Binary file matches (found "\u{0}" byte around offset 7)

Reducing the set of files scanned seems to change the results when using a filename wildcard, but not when just just specifying the directory:

user@local:~$ mkdir test
user@local:~/test$ cd test
user@local:~/test$ cp /usr/bin/ls .
user@local:~/test$ cp /usr/bin/ed . 
user@local:~/test$ cp /usr/bin/file .
user@local:~/test$ cp /usr/bin/find .
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./*
Binary file ./ls matches (found "\u{0}" byte around offset 7)

Binary file ./find matches (found "\u{0}" byte around offset 7)

Binary file ./file matches (found "\u{0}" byte around offset 7)

Binary file ./ed matches (found "\u{0}" byte around offset 7)
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./ls 
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./ed
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./file 
Binary file matches (found "\u{0}" byte around offset 7)
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./find 
Binary file matches (found "\u{0}" byte around offset 7)

Removing the \x00 bytes from the tail of the pattern changes ripgrep's behavior, and seems to work as expected:

user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin/ | wc -l
0
user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' /usr/bin/ | wc -l
1092

Here are some debug outputs from the 4-file ~/test directory:

user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./ --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' ./* --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
Binary file ./ls matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

Binary file ./find matches (found "\u{0}" byte around offset 7)

Binary file ./file matches (found "\u{0}" byte around offset 7)

Binary file ./ed matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' ./ --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
Binary file ./ls matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

Binary file ./find matches (found "\u{0}" byte around offset 7)

Binary file ./file matches (found "\u{0}" byte around offset 7)

Binary file ./ed matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

user@local:~/test$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' ./ls --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
Binary file matches (found "\u{0}" byte around offset 7)

What is the expected behavior?

ripgrep should consistently match on files in a search path which contain a specified byte pattern, using the documented method.

The text was updated successfully, but these errors were encountered:

bdlmt · 2021-03-31T01:58:19Z

Further test results, when copying all of /usr/bin into a local test directory.

Somehow, there are more pattern hits after a copy:

user@local:~$ mkdir test2
user@local:~$ cd test2/
user@local:~/test2$ cp /usr/bin/* .
cp: -r not specified; omitting directory '/usr/bin/X11'
user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' /usr/bin | wc -l
1092
user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01' | wc -l
1468

The full byte pattern search still doesn't match any files, unless individual files are targeted:

user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' | wc -l
0
user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' ./ls | wc -l
1

And here's some more --debug output:

user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

BurntSushi · 2021-03-31T02:16:54Z

At first, I thought this might just be some confusion at how ripgrep handles binary files. The guide's section on binary data tries to explain the different modes. But since you are enabling binary mode via -uuu in all of your searches, that should cause things to behave consistently.

On mobile at the moment, but I suspect there may be a bug in how ripgrep is detecting and handling binary data specifically when a match occurs. I'll take a look when I'm back on my workstation.

Out of curiosity, do your results change if you add the --no-mmap flag? What about --mmap?

BurntSushi · 2021-03-31T02:18:29Z

(I believe there were changes in this area between the 11 and 12 releases, so please make sure you're doing all your testing with the latest release.)

bdlmt · 2021-03-31T04:45:48Z

(I believe there were changes in this area between the 11 and 12 releases, so please make sure you're doing all your testing with the latest release.)

All reported test results have been with latest.
(I was using 11.0.2 initially, but replicated all results in 12.1.1 for the bug report.)

At first, I thought this might just be some confusion at how ripgrep handles binary files. The guide's section on binary data tries to explain the different modes. But since you are enabling binary mode via -uuu in all of your searches, that should cause things to behave consistently.

On mobile at the moment, but I suspect there may be a bug in how ripgrep is detecting and handling binary data specifically when a match occurs. I'll take a look when I'm back on my workstation.

Out of curiosity, do your results change if you add the --no-mmap flag? What about --mmap?

The issue seems unchanged by '--no-mmap':

user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' --no-mmap | wc -l
0

But the results do change with '--mmap':

user@local:~/test2$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' --mmap | wc -l
1447

1447 seems to be the correct number of matches for that directory.

I tested using '--mmap' with a much larger set of files as well, and again got the expected number of matches.
(Though, slower than I expected, but maybe that's a side effect of escaping out of Unicode mode to do byte searches with regex.)

Here's a debug run with '--mmap' over the test dir with only 4 files:

$ rg -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' --mmap --debug
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:58: literal prefixes detected: Literals { lits: [Complete(ELF)], limit_size: 250, limit_class: 10 }
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexesBinary file ls matches (found "\u{0}" byte around offset 7)


Binary file file matches (found "\u{0}" byte around offset 7)

Binary file find matches (found "\u{0}" byte around offset 7)

Binary file ed matches (found "\u{0}" byte around offset 7)
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
DEBUG|globset|crates/globset/src/lib.rs:431: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes

BurntSushi · 2021-03-31T12:17:18Z

(Though, slower than I expected, but maybe that's a side effect of escaping out of Unicode mode to do byte searches with regex.)

Yeah, forcing memory maps when searching a bunch of tiny files actually results in some pretty serious overhead that slows down the entire enterprise.

In any case, I am able to reproduce this on my own /usr/bin directory, so I'll investigate this as soon as I can. Thanks for the great bug report!

bdlmt · 2021-03-31T15:33:56Z

In any case, I am able to reproduce this on my own /usr/bin directory, so I'll investigate this as soon as I can.

If I were to attempt to investigate and fix the issue, where's your best guess to start looking?
I'm still new to Rust, but willing to dig in.

Thanks for the great bug report!

No problem; thank you for creating and maintaining a great project.

BurntSushi · 2021-03-31T16:36:22Z

OK, so I took a look at this, and yeesh, this is actually behaving as intended. Which is kind of scary, because the UX here is just totally abysmal. This is a perfect storm, basically.

TL;DR - If you put --no-mmap in your ripgrep config file (or alias or whatever), then you'll get consistent behavior. For the specific case of searching for explicit binary data, you almost always want the -a flag. Otherwise, a \x00 byte will never match anything, by construction.

So the top-level issue here is that binary detection/handling fundamentally works differently depending on whether memory maps are used or not.

The underlying motivation for binary handling is a desire for ripgrep to treat files as "text." Of course, ripgrep, like any grep, can also search binary files. The main problem is that you generally don't want to accidentally print out results from a binary file because it can muck with your terminal via escape sequences and what not. On top of that, there is the issue that "text" files are typically line oriented where as binary data isn't. That means a single "line" in a binary data (typically a meaningless concept) can actually be quite large. Since ripgrep requires that each line be able to fit in memory, it uses a trick: it converts \x00 bytes into line terminators in an effort to make binary data "look" like text. When this happens, the file is also marked as "binary." If a match occurs, you get the "match found but it's a binary file so we aren't going to print anything" message.

Only looking at the above, we can explain why rg -uuu 'foo\x00\x00' finds nothing and why rg -uuu 'foo' finds something:

The third -u flag is synonymous with --binary. This causes ripgrep to disable auto-filtering of binary data. The main UX concern here is that if no match is found, then one should be able to conclude that no matches exist.
Since the -a/--text flag is not enabled, ripgrep does not treat everything as if it were "text." Instead, it tries to do binary detection. As stated above, this has a dual purpose: to prevent dumping arbitrary binary data to your terminal, but also to avoid using gratuitous amounts of memory for long lines in binary files. Thus, all \x00 bytes in every file searched are rewritten to \n bytes. This is what explains why foo\x00 won't match: \x00 can never match because all such bytes are re-mapped to line terminators. But foo contains no NUL bytes, and thus can match.

All of that behavior was lifted exactly from GNU grep. Including the NUL byte rewriting. However, GNU grep makes it much harder to search for NUL bytes. It doesn't recognize things like \x00, and inserting a NUL byte into a shell string is weird.

If that weren't confusing enough as it is, there's another catch here. When you search a single file directly, ripgrep will typically switch over to using memory maps to read the file since it tends to be faster in that case. The issue with memory maps is that we can't apply the above technique in the same way without destroying the performance benefit gained by using memory maps in the first place. In particular:

We can't search the entire file for \x00 without adverse consequences. e.g., In particularly big files, in the worst case, this could require reading the entire file twice. When it comes to memory maps, they aren't searched in blocks, but rather, as a single contiguous slice. Thus binary detection is itself different for memory maps.
When using memory maps, we never copy the data from the actual file just for searching. Therefore, there is no real opportunity to rewrite \x00 bytes as line terminators, even if we could find all such \x00 bytes.

So, with memory maps, we do binary detection slightly differently:

We look for a \x00 byte in the first N bytes of a file.
When printing matches or contextual lines, we search only those specific lines for a \x00 byte. In this way, we prevent dumping binary data to the terminal.

Thus, when you search a file directly, you get matches because ripgrep is using memory maps for searching and thus doesn't rewrite \x00 bytes as line terminators. So your pattern finds a hit.

So what could we do differently here? I don't think documentation is going to really help much here. It's such an oddball case. We could add to the relevant section on the guide with this particular case study. And in particular, we could say something like, "add --no-mmap to your config file to make binary data handling consistent." The performance improvement from memory maps isn't that great, after all.

In terms of behavior changes, here are some things I can think of in no particular order. But I also include reasons why they problematic:

We could force the --no-mmap strategy to have the same binary handling logic as the --mmap strategy. e.g., By only looking for binary data in the first N bytes and otherwise only searching matching lines. This would at least avoid dumping binary data to your terminal. However, this gives up on the heuristic for avoiding large heap usages due to long "lines" in binary data. Memory maps don't have this problem since the OS handles paging things in and out of memory transparently. Having the heap explode when searching binary data seems unfortunate.
We could get rid of memory maps entirely. The problem here is that they do have some nice perf benefits. They also enable certain configurations to run very very fast that wouldn't otherwise be possible without memory maps.
We could somehow force the memory map binary handling to match the non-memory map handling. The only way I can think to do this would be to chunk up the memory map, copy data from it and rewrite its bytes like we do in the non-mmap case. But this is adding a lot of extra costs to the memory map handling that we can safely ignore precisely because of the nature of memory maps.
We could forcefully disable the use of memory maps whenever --binary mode is enabled. But this is enabled by default when searching a specific file, so this would effectively shut off memory maps completely.

This message will emit the binary detection mechanism being used for each file. This does not noticeably increases the number of log messages, as the 'trace' level is already used for emitting messages for every file searched. This trace message was added in the course of investigating #1838.

bdlmt · 2021-03-31T18:30:57Z

Thank you for your detailed explanation!

For the specific case of searching for explicit binary data, you almost always want the -a flag. Otherwise, a \x00 byte will never match anything, by construction.

Aha! I missed that in the docs.

That means a single "line" in a binary data (typically a meaningless concept) can actually be quite large. Since ripgrep requires that each line be able to fit in memory, it uses a trick: it converts \x00 bytes into line terminators in an effort to make binary data "look" like text.

Okay, got it.

The third -u flag is synonymous with --binary.

After reading through your response and the man page, I'm not able to rectify this behavior:

user@local:~$ rg -a --binary '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
0
user@local:~$ rg -a -uuu '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
1081
user@local:~$ rg -a --binary --mmap '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
1080
user@local:~$ rg -a -uuu --mmap '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
1081
user@local:~$ rg -a --binary --no-mmap '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
0
user@local:~$ rg -a -uuu --no-mmap '(?-u)\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00' /usr/bin | wc -l
1081

-uuu and --binary seem to have different behavior when combined with -a, --mmap, and --no-mmap.

Assuming I understand your explanation correctly, the -uuu cases are behaving as expected, while the --binary cases don't seem to be.

(Edit: rearranged the test cases to make more sense side-by-side.)

bdlmt · 2021-03-31T19:21:57Z

In terms of behavior changes, here are some things I can think of in no particular order. But I also include reasons why they problematic:

We could force the --no-mmap strategy to have the same binary handling logic as the --mmap strategy. e.g., By only looking for binary data in the first N bytes and otherwise only searching matching lines. This would at least avoid dumping binary data to your terminal. However, this gives up on the heuristic for avoiding large heap usages due to long "lines" in binary data. Memory maps don't have this problem since the OS handles paging things in and out of memory transparently. Having the heap explode when searching binary data seems unfortunate.

We could get rid of memory maps entirely. The problem here is that they do have some nice perf benefits. They also enable certain configurations to run very very fast that wouldn't otherwise be possible without memory maps.

We could somehow force the memory map binary handling to match the non-memory map handling. The only way I can think to do this would be to chunk up the memory map, copy data from it and rewrite its bytes like we do in the non-mmap case. But this is adding a lot of extra costs to the memory map handling that we can safely ignore precisely because of the nature of memory maps.

We could forcefully disable the use of memory maps whenever --binary mode is enabled. But this is enabled by default when searching a specific file, so this would effectively shut off memory maps completely.

Would it make sense to have this as an option as well?
5. Use memory map mode whenever --binary mode is enabled. Treat a binary file as a long line and let the OS handle paging. No \x00 replacement overhead would be required, and the heap wouldn't explode on large binaries.

BurntSushi · 2021-03-31T19:36:18Z

@bdlmt The -a/--text and --binary flag are mutually exclusive things. They actually override each other. So -a --binary enables binary searching while --binary -a makes everything searched as if it were text. Of course, the former is equivalent to --binary without -a and the latter is equivalent to -a without --binary.

Use memory map mode whenever --binary mode is enabled. Treat a binary file as a long line and let the OS handle paging. No \x00 replacement overhead would be required, and the heap wouldn't explode on large binaries.

It wouldn't. Aside from the downside of this seriously regressing performance when --binary is enabled, it's also just plain impossible. Memory maps can only be used in a subset of cases. They can't be used on streams and certain files (like /proc/cpuinfo) can't be memory mapped. If it was just a matter of performance regressing, then it would make sense to list it for completeness. But it's actually impossible. Standard read syscalls are the only universal way to search data.

bdlmt · 2021-03-31T19:58:37Z

The -a/--text and --binary flag are mutually exclusive things. They actually override each other. So -a --binary enables binary searching while --binary -a makes everything searched as if it were text. Of course, the former is equivalent to --binary without -a and the latter is equivalent to -a without --binary.

I see now. I read the man page incorrectly. I misunderstood this line under --binary:

This flag can be disabled with --no-binary. It overrides the -a/--text flag.

I thought "It" referred to --no-binary, not --binary. I should have realized I had it wrong, based on the intended nature of -a/--text and --binary.

They can't be used on streams and certain files (like /proc/cpuinfo) can't be memory mapped.

Good point; I wasn't thinking far enough outside my recent use case.

bdlmt · 2021-04-01T18:48:54Z

We could force the --no-mmap strategy to have the same binary handling logic as the --mmap strategy. e.g., By only looking for binary data in the first N bytes and otherwise only searching matching lines. This would at least avoid dumping binary data to your terminal. However, this gives up on the heuristic for avoiding large heap usages due to long "lines" in binary data. Memory maps don't have this problem since the OS handles paging things in and out of memory transparently. Having the heap explode when searching binary data seems unfortunate.

From a UX point of view, behavior change option 1 seems best. Mixing -a/--text with -uuu to enable text mode during a binary search in order to successfully match on raw \x00 bytes is a bit awkward and not intuitive. The change would also address the unexpected disparity between using -a/--text with --binary (mutually exclusive) vs. using -a/--text with -uuu (text mode enabled on binaries), because -a/--text would no longer be necessary when matching on \x00 bytes.

From a performance point of view, it seems like leaving it as-is would be best, unless there's a way to split a binary file into multiple "lines" without replacing bytes.

As a user, I like the sound of improving the UX, but obviously not if there's a significant performance hit.

Basically, unless the -a/--text flag is given, it is generally always an error to search for an explicit NUL byte because the binary detection will prevent it from matching. Fixes #1838

14.0.2 (2023-11-27) =================== This is a patch release with a few small bug fixes. Bug fixes: * [BUG #2654](BurntSushi/ripgrep#2654): Fix `deb` release sha256 sum file. * [BUG #2658](BurntSushi/ripgrep#2658): Fix partial regression in the behavior of `--null-data --line-regexp`. * [BUG #2659](BurntSushi/ripgrep#2659): Fix Fish shell completions. * [BUG #2662](BurntSushi/ripgrep#2662): Fix typo in documentation for `-i/--ignore-case`. 14.0.1 (2023-11-26) =================== This a patch release meant to fix `cargo install ripgrep` on Windows. Bug fixes: * [BUG #2653](BurntSushi/ripgrep#2653): Include `pkg/windows/Manifest.xml` in crate package. 14.0.0 (2023-11-26) =================== ripgrep 14 is a new major version release of ripgrep that has some new features, performance improvements and a lot of bug fixes. The headlining feature in this release is hyperlink support. In this release, they are an opt-in feature but may change to an opt-out feature in the future. To enable them, try passing `--hyperlink-format default`. If you use [VS Code], then try passing `--hyperlink-format vscode`. Please [report your experience with hyperlinks][report-hyperlinks], positive or negative. [VS Code]: https://code.visualstudio.com/ [report-hyperlinks]: BurntSushi/ripgrep#2611 Another headlining development in this release is that it contains a rewrite of its regex engine. You generally shouldn't notice any changes, except for some searches may get faster. You can read more about the [regex engine rewrite on my blog][regex-internals]. Please [report your performance improvements or regressions that you notice][report-perf]. [report-perf]: BurntSushi/ripgrep#2652 Finally, ripgrep switched the library it uses for argument parsing. Users should not notice a difference in most cases (error messages have changed somewhat), but flag overrides should generally be more consistent. For example, things like `--no-ignore --ignore-vcs` work as one would expect (disables all filtering related to ignore rules except for rules found in version control systems such as `git`). [regex-internals]: https://blog.burntsushi.net/regex-internals/ **BREAKING CHANGES**: * `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`. Instead, they only partially override `-C`. Build process changes: * ripgrep's shell completions and man page are now created by running ripgrep with a new `--generate` flag. For example, `rg --generate man` will write a man page in `roff` format on stdout. The release archives have not changed. * The optional build dependency on `asciidoc` or `asciidoctor` has been dropped. Previously, it was used to produce ripgrep's man page. ripgrep now owns this process itself by writing `roff` directly. Performance improvements: * [PERF #1746](BurntSushi/ripgrep#1746): Make some cases with inner literals faster. * [PERF #1760](BurntSushi/ripgrep#1760): Make most searches with `\b` look-arounds (among others) much faster. * [PERF #2591](BurntSushi/ripgrep#2591): Parallel directory traversal now uses work stealing for faster searches. * [PERF #2642](BurntSushi/ripgrep#2642): Parallel directory traversal has some contention reduced. Feature enhancements: * Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo, Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V * [FEATURE #665](BurntSushi/ripgrep#665): Add a new `--hyperlink-format` flag that turns file paths into hyperlinks. * [FEATURE #1709](BurntSushi/ripgrep#1709): Improve documentation of ripgrep's behavior when stdout is a tty. * [FEATURE #1737](BurntSushi/ripgrep#1737): Provide binaries for Apple silicon. * [FEATURE #1790](BurntSushi/ripgrep#1790): Add new `--stop-on-nonmatch` flag. * [FEATURE #1814](BurntSushi/ripgrep#1814): Flags are now categorized in `-h/--help` output and ripgrep's man page. * [FEATURE #1838](BurntSushi/ripgrep#1838): An error is shown when searching for NUL bytes with binary detection enabled. * [FEATURE #2195](BurntSushi/ripgrep#2195): When `extra-verbose` mode is enabled in zsh, show extra file type info. * [FEATURE #2298](BurntSushi/ripgrep#2298): Add instructions for installing ripgrep using `cargo binstall`. * [FEATURE #2409](BurntSushi/ripgrep#2409): Added installation instructions for `winget`. * [FEATURE #2425](BurntSushi/ripgrep#2425): Shell completions (and man page) can be created via `rg --generate`. * [FEATURE #2524](BurntSushi/ripgrep#2524): The `--debug` flag now indicates whether stdin or `./` is being searched. * [FEATURE #2643](BurntSushi/ripgrep#2643): Make `-d` a short flag for `--max-depth`. * [FEATURE #2645](BurntSushi/ripgrep#2645): The `--version` output will now also contain PCRE2 availability information. Bug fixes: * [BUG #884](BurntSushi/ripgrep#884): Don't error when `-v/--invert-match` is used multiple times. * [BUG #1275](BurntSushi/ripgrep#1275): Fix bug with `\b` assertion in the regex engine. * [BUG #1376](BurntSushi/ripgrep#1376): Using `--no-ignore --ignore-vcs` now works as one would expect. * [BUG #1622](BurntSushi/ripgrep#1622): Add note about error messages to `-z/--search-zip` documentation. * [BUG #1648](BurntSushi/ripgrep#1648): Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail. * [BUG #1701](BurntSushi/ripgrep#1701): Fix bug where some flags could not be repeated. * [BUG #1757](BurntSushi/ripgrep#1757): Fix bug when searching a sub-directory didn't have ignores applied correctly. * [BUG #1891](BurntSushi/ripgrep#1891): Fix bug when using `-w` with a regex that can match the empty string. * [BUG #1911](BurntSushi/ripgrep#1911): Disable mmap searching in all non-64-bit environments. * [BUG #1966](BurntSushi/ripgrep#1966): Fix bug where ripgrep can panic when printing to stderr. * [BUG #2046](BurntSushi/ripgrep#2046): Clarify that `--pre` can accept any kind of path in the documentation. * [BUG #2108](BurntSushi/ripgrep#2108): Improve docs for `-r/--replace` syntax. * [BUG #2198](BurntSushi/ripgrep#2198): Fix bug where `--no-ignore-dot` would not ignore `.rgignore`. * [BUG #2201](BurntSushi/ripgrep#2201): Improve docs for `-r/--replace` flag. * [BUG #2288](BurntSushi/ripgrep#2288): `-A` and `-B` now only each partially override `-C`. * [BUG #2236](BurntSushi/ripgrep#2236): Fix gitignore parsing bug where a trailing `\/` resulted in an error. * [BUG #2243](BurntSushi/ripgrep#2243): Fix `--sort` flag for values other than `path`. * [BUG #2246](BurntSushi/ripgrep#2246): Add note in `--debug` logs when binary files are ignored. * [BUG #2337](BurntSushi/ripgrep#2337): Improve docs to mention that `--stats` is always implied by `--json`. * [BUG #2381](BurntSushi/ripgrep#2381): Make `-p/--pretty` override flags like `--no-line-number`. * [BUG #2392](BurntSushi/ripgrep#2392): Improve global git config parsing of the `excludesFile` field. * [BUG #2418](BurntSushi/ripgrep#2418): Clarify sorting semantics of `--sort=path`. * [BUG #2458](BurntSushi/ripgrep#2458): Make `--trim` run before `-M/--max-columns` takes effect. * [BUG #2479](BurntSushi/ripgrep#2479): Add documentation about `.ignore`/`.rgignore` files in parent directories. * [BUG #2480](BurntSushi/ripgrep#2480): Fix bug when using inline regex flags with `-e/--regexp`. * [BUG #2505](BurntSushi/ripgrep#2505): Improve docs for `--vimgrep` by mentioning footguns and some work-arounds. * [BUG #2519](BurntSushi/ripgrep#2519): Fix incorrect default value in documentation for `--field-match-separator`. * [BUG #2523](BurntSushi/ripgrep#2523): Make executable searching take `.com` into account on Windows. * [BUG #2574](BurntSushi/ripgrep#2574): Fix bug in `-w/--word-regexp` that would result in incorrect match offsets. * [BUG #2623](BurntSushi/ripgrep#2623): Fix a number of bugs with the `-w/--word-regexp` flag. * [BUG #2636](BurntSushi/ripgrep#2636): Strip release binaries for macOS.

BurntSushi added doc An issue with or an improvement to documentation. question An issue that is lacking clarity on one or more points. labels Mar 31, 2021

BurntSushi added a commit that referenced this issue Nov 25, 2023

cli: error when searching for NUL

6304568

Basically, unless the -a/--text flag is given, it is generally always an error to search for an explicit NUL byte because the binary detection will prevent it from matching. Fixes #1838

BurntSushi added the rollup A PR that has been merged with many others in a rollup. label Nov 25, 2023

BurntSushi closed this as completed in 9ed7565 Nov 25, 2023

sankalp-khare mentioned this issue Nov 26, 2023

ripgrep 14.0.0 Homebrew/homebrew-core#155592

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Byte pattern detection works inconsistently #1838

Byte pattern detection works inconsistently #1838

bdlmt commented Mar 31, 2021

bdlmt commented Mar 31, 2021

BurntSushi commented Mar 31, 2021 •

edited

Loading

BurntSushi commented Mar 31, 2021

bdlmt commented Mar 31, 2021

BurntSushi commented Mar 31, 2021

bdlmt commented Mar 31, 2021

BurntSushi commented Mar 31, 2021 •

edited

Loading

bdlmt commented Mar 31, 2021 •

edited

Loading

bdlmt commented Mar 31, 2021

BurntSushi commented Mar 31, 2021

bdlmt commented Mar 31, 2021

bdlmt commented Apr 1, 2021

Byte pattern detection works inconsistently #1838

Byte pattern detection works inconsistently #1838

Comments

bdlmt commented Mar 31, 2021

What version of ripgrep are you using?

How did you install ripgrep?

What operating system are you using ripgrep on?

Describe your bug.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

bdlmt commented Mar 31, 2021

BurntSushi commented Mar 31, 2021 • edited Loading

BurntSushi commented Mar 31, 2021

bdlmt commented Mar 31, 2021

BurntSushi commented Mar 31, 2021

bdlmt commented Mar 31, 2021

BurntSushi commented Mar 31, 2021 • edited Loading

bdlmt commented Mar 31, 2021 • edited Loading

bdlmt commented Mar 31, 2021

BurntSushi commented Mar 31, 2021

bdlmt commented Mar 31, 2021

bdlmt commented Apr 1, 2021

BurntSushi commented Mar 31, 2021 •

edited

Loading

BurntSushi commented Mar 31, 2021 •

edited

Loading

bdlmt commented Mar 31, 2021 •

edited

Loading