-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Byte pattern detection works inconsistently #1838
Comments
Further test results, when copying all of /usr/bin into a local test directory. Somehow, there are more pattern hits after a copy:
The full byte pattern search still doesn't match any files, unless individual files are targeted:
And here's some more --debug output:
|
At first, I thought this might just be some confusion at how ripgrep handles binary files. The guide's section on binary data tries to explain the different modes. But since you are enabling binary mode via On mobile at the moment, but I suspect there may be a bug in how ripgrep is detecting and handling binary data specifically when a match occurs. I'll take a look when I'm back on my workstation. Out of curiosity, do your results change if you add the |
(I believe there were changes in this area between the 11 and 12 releases, so please make sure you're doing all your testing with the latest release.) |
All reported test results have been with latest.
The issue seems unchanged by '--no-mmap':
But the results do change with '--mmap':
1447 seems to be the correct number of matches for that directory. I tested using '--mmap' with a much larger set of files as well, and again got the expected number of matches. Here's a debug run with '--mmap' over the test dir with only 4 files:
|
Yeah, forcing memory maps when searching a bunch of tiny files actually results in some pretty serious overhead that slows down the entire enterprise. In any case, I am able to reproduce this on my own |
If I were to attempt to investigate and fix the issue, where's your best guess to start looking?
No problem; thank you for creating and maintaining a great project. |
OK, so I took a look at this, and yeesh, this is actually behaving as intended. Which is kind of scary, because the UX here is just totally abysmal. This is a perfect storm, basically. TL;DR - If you put So the top-level issue here is that binary detection/handling fundamentally works differently depending on whether memory maps are used or not. The underlying motivation for binary handling is a desire for ripgrep to treat files as "text." Of course, ripgrep, like any grep, can also search binary files. The main problem is that you generally don't want to accidentally print out results from a binary file because it can muck with your terminal via escape sequences and what not. On top of that, there is the issue that "text" files are typically line oriented where as binary data isn't. That means a single "line" in a binary data (typically a meaningless concept) can actually be quite large. Since ripgrep requires that each line be able to fit in memory, it uses a trick: it converts Only looking at the above, we can explain why
All of that behavior was lifted exactly from GNU grep. Including the NUL byte rewriting. However, GNU grep makes it much harder to search for NUL bytes. It doesn't recognize things like If that weren't confusing enough as it is, there's another catch here. When you search a single file directly, ripgrep will typically switch over to using memory maps to read the file since it tends to be faster in that case. The issue with memory maps is that we can't apply the above technique in the same way without destroying the performance benefit gained by using memory maps in the first place. In particular:
So, with memory maps, we do binary detection slightly differently:
Thus, when you search a file directly, you get matches because ripgrep is using memory maps for searching and thus doesn't rewrite So what could we do differently here? I don't think documentation is going to really help much here. It's such an oddball case. We could add to the relevant section on the guide with this particular case study. And in particular, we could say something like, "add In terms of behavior changes, here are some things I can think of in no particular order. But I also include reasons why they problematic:
|
This message will emit the binary detection mechanism being used for each file. This does not noticeably increases the number of log messages, as the 'trace' level is already used for emitting messages for every file searched. This trace message was added in the course of investigating #1838.
Thank you for your detailed explanation!
Aha! I missed that in the docs.
Okay, got it.
After reading through your response and the man page, I'm not able to rectify this behavior:
Assuming I understand your explanation correctly, the (Edit: rearranged the test cases to make more sense side-by-side.) |
Would it make sense to have this as an option as well? |
@bdlmt The
It wouldn't. Aside from the downside of this seriously regressing performance when |
I see now. I read the man page incorrectly. I misunderstood this line under
I thought "It" referred to
Good point; I wasn't thinking far enough outside my recent use case. |
From a UX point of view, behavior change option 1 seems best. Mixing From a performance point of view, it seems like leaving it as-is would be best, unless there's a way to split a binary file into multiple "lines" without replacing bytes. As a user, I like the sound of improving the UX, but obviously not if there's a significant performance hit. |
Basically, unless the -a/--text flag is given, it is generally always an error to search for an explicit NUL byte because the binary detection will prevent it from matching. Fixes #1838
14.0.2 (2023-11-27) =================== This is a patch release with a few small bug fixes. Bug fixes: * [BUG #2654](BurntSushi/ripgrep#2654): Fix `deb` release sha256 sum file. * [BUG #2658](BurntSushi/ripgrep#2658): Fix partial regression in the behavior of `--null-data --line-regexp`. * [BUG #2659](BurntSushi/ripgrep#2659): Fix Fish shell completions. * [BUG #2662](BurntSushi/ripgrep#2662): Fix typo in documentation for `-i/--ignore-case`. 14.0.1 (2023-11-26) =================== This a patch release meant to fix `cargo install ripgrep` on Windows. Bug fixes: * [BUG #2653](BurntSushi/ripgrep#2653): Include `pkg/windows/Manifest.xml` in crate package. 14.0.0 (2023-11-26) =================== ripgrep 14 is a new major version release of ripgrep that has some new features, performance improvements and a lot of bug fixes. The headlining feature in this release is hyperlink support. In this release, they are an opt-in feature but may change to an opt-out feature in the future. To enable them, try passing `--hyperlink-format default`. If you use [VS Code], then try passing `--hyperlink-format vscode`. Please [report your experience with hyperlinks][report-hyperlinks], positive or negative. [VS Code]: https://code.visualstudio.com/ [report-hyperlinks]: BurntSushi/ripgrep#2611 Another headlining development in this release is that it contains a rewrite of its regex engine. You generally shouldn't notice any changes, except for some searches may get faster. You can read more about the [regex engine rewrite on my blog][regex-internals]. Please [report your performance improvements or regressions that you notice][report-perf]. [report-perf]: BurntSushi/ripgrep#2652 Finally, ripgrep switched the library it uses for argument parsing. Users should not notice a difference in most cases (error messages have changed somewhat), but flag overrides should generally be more consistent. For example, things like `--no-ignore --ignore-vcs` work as one would expect (disables all filtering related to ignore rules except for rules found in version control systems such as `git`). [regex-internals]: https://blog.burntsushi.net/regex-internals/ **BREAKING CHANGES**: * `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`. Instead, they only partially override `-C`. Build process changes: * ripgrep's shell completions and man page are now created by running ripgrep with a new `--generate` flag. For example, `rg --generate man` will write a man page in `roff` format on stdout. The release archives have not changed. * The optional build dependency on `asciidoc` or `asciidoctor` has been dropped. Previously, it was used to produce ripgrep's man page. ripgrep now owns this process itself by writing `roff` directly. Performance improvements: * [PERF #1746](BurntSushi/ripgrep#1746): Make some cases with inner literals faster. * [PERF #1760](BurntSushi/ripgrep#1760): Make most searches with `\b` look-arounds (among others) much faster. * [PERF #2591](BurntSushi/ripgrep#2591): Parallel directory traversal now uses work stealing for faster searches. * [PERF #2642](BurntSushi/ripgrep#2642): Parallel directory traversal has some contention reduced. Feature enhancements: * Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo, Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V * [FEATURE #665](BurntSushi/ripgrep#665): Add a new `--hyperlink-format` flag that turns file paths into hyperlinks. * [FEATURE #1709](BurntSushi/ripgrep#1709): Improve documentation of ripgrep's behavior when stdout is a tty. * [FEATURE #1737](BurntSushi/ripgrep#1737): Provide binaries for Apple silicon. * [FEATURE #1790](BurntSushi/ripgrep#1790): Add new `--stop-on-nonmatch` flag. * [FEATURE #1814](BurntSushi/ripgrep#1814): Flags are now categorized in `-h/--help` output and ripgrep's man page. * [FEATURE #1838](BurntSushi/ripgrep#1838): An error is shown when searching for NUL bytes with binary detection enabled. * [FEATURE #2195](BurntSushi/ripgrep#2195): When `extra-verbose` mode is enabled in zsh, show extra file type info. * [FEATURE #2298](BurntSushi/ripgrep#2298): Add instructions for installing ripgrep using `cargo binstall`. * [FEATURE #2409](BurntSushi/ripgrep#2409): Added installation instructions for `winget`. * [FEATURE #2425](BurntSushi/ripgrep#2425): Shell completions (and man page) can be created via `rg --generate`. * [FEATURE #2524](BurntSushi/ripgrep#2524): The `--debug` flag now indicates whether stdin or `./` is being searched. * [FEATURE #2643](BurntSushi/ripgrep#2643): Make `-d` a short flag for `--max-depth`. * [FEATURE #2645](BurntSushi/ripgrep#2645): The `--version` output will now also contain PCRE2 availability information. Bug fixes: * [BUG #884](BurntSushi/ripgrep#884): Don't error when `-v/--invert-match` is used multiple times. * [BUG #1275](BurntSushi/ripgrep#1275): Fix bug with `\b` assertion in the regex engine. * [BUG #1376](BurntSushi/ripgrep#1376): Using `--no-ignore --ignore-vcs` now works as one would expect. * [BUG #1622](BurntSushi/ripgrep#1622): Add note about error messages to `-z/--search-zip` documentation. * [BUG #1648](BurntSushi/ripgrep#1648): Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail. * [BUG #1701](BurntSushi/ripgrep#1701): Fix bug where some flags could not be repeated. * [BUG #1757](BurntSushi/ripgrep#1757): Fix bug when searching a sub-directory didn't have ignores applied correctly. * [BUG #1891](BurntSushi/ripgrep#1891): Fix bug when using `-w` with a regex that can match the empty string. * [BUG #1911](BurntSushi/ripgrep#1911): Disable mmap searching in all non-64-bit environments. * [BUG #1966](BurntSushi/ripgrep#1966): Fix bug where ripgrep can panic when printing to stderr. * [BUG #2046](BurntSushi/ripgrep#2046): Clarify that `--pre` can accept any kind of path in the documentation. * [BUG #2108](BurntSushi/ripgrep#2108): Improve docs for `-r/--replace` syntax. * [BUG #2198](BurntSushi/ripgrep#2198): Fix bug where `--no-ignore-dot` would not ignore `.rgignore`. * [BUG #2201](BurntSushi/ripgrep#2201): Improve docs for `-r/--replace` flag. * [BUG #2288](BurntSushi/ripgrep#2288): `-A` and `-B` now only each partially override `-C`. * [BUG #2236](BurntSushi/ripgrep#2236): Fix gitignore parsing bug where a trailing `\/` resulted in an error. * [BUG #2243](BurntSushi/ripgrep#2243): Fix `--sort` flag for values other than `path`. * [BUG #2246](BurntSushi/ripgrep#2246): Add note in `--debug` logs when binary files are ignored. * [BUG #2337](BurntSushi/ripgrep#2337): Improve docs to mention that `--stats` is always implied by `--json`. * [BUG #2381](BurntSushi/ripgrep#2381): Make `-p/--pretty` override flags like `--no-line-number`. * [BUG #2392](BurntSushi/ripgrep#2392): Improve global git config parsing of the `excludesFile` field. * [BUG #2418](BurntSushi/ripgrep#2418): Clarify sorting semantics of `--sort=path`. * [BUG #2458](BurntSushi/ripgrep#2458): Make `--trim` run before `-M/--max-columns` takes effect. * [BUG #2479](BurntSushi/ripgrep#2479): Add documentation about `.ignore`/`.rgignore` files in parent directories. * [BUG #2480](BurntSushi/ripgrep#2480): Fix bug when using inline regex flags with `-e/--regexp`. * [BUG #2505](BurntSushi/ripgrep#2505): Improve docs for `--vimgrep` by mentioning footguns and some work-arounds. * [BUG #2519](BurntSushi/ripgrep#2519): Fix incorrect default value in documentation for `--field-match-separator`. * [BUG #2523](BurntSushi/ripgrep#2523): Make executable searching take `.com` into account on Windows. * [BUG #2574](BurntSushi/ripgrep#2574): Fix bug in `-w/--word-regexp` that would result in incorrect match offsets. * [BUG #2623](BurntSushi/ripgrep#2623): Fix a number of bugs with the `-w/--word-regexp` flag. * [BUG #2636](BurntSushi/ripgrep#2636): Strip release binaries for macOS.
What version of ripgrep are you using?
Initially, I found the bug while using 11.0.2:
How did you install ripgrep?
What operating system are you using ripgrep on?
Ubuntu 20.04.2 LTS
Describe your bug.
Using command line mode, escaping from Unicode mode in order to scan for bytes doesn't seem to work as documented.
What are the steps to reproduce the behavior?
Simple to reproduce. A scan of /usr/bin will suffice.
What is the actual behavior?
An rg scan of /usr/bin for ELF magic values returns 0 results, while there should be over 1000 matching files:
Using the same byte pattern, a direct scan of individual files works as expected:
Reducing the set of files scanned seems to change the results when using a filename wildcard, but not when just just specifying the directory:
Removing the \x00 bytes from the tail of the pattern changes ripgrep's behavior, and seems to work as expected:
Here are some debug outputs from the 4-file ~/test directory:
What is the expected behavior?
ripgrep should consistently match on files in a search path which contain a specified byte pattern, using the documented method.
The text was updated successfully, but these errors were encountered: