Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possible bug with symlink following #106

Closed
BurntSushi opened this issue Nov 6, 2016 · 4 comments
Closed

possible bug with symlink following #106

BurntSushi opened this issue Nov 6, 2016 · 4 comments
Assignees
Milestone

Comments

@BurntSushi
Copy link

To reproduce:

$ git clone --depth 1 git://github.com/BurntSushi/linux linux-burntsushi
Cloning into 'linux-burntsushi'...
remote: Counting objects: 58881, done.
remote: Compressing objects: 100% (55208/55208), done.
remote: Total 58881 (delta 4182), reused 21851 (delta 2810), pack-reused 0
Receiving objects: 100% (58881/58881), 155.87 MiB | 8.87 MiB/s, done.
Resolving deltas: 100% (4182/4182), done.
Checking out files: 100% (55499/55499), done.
$ cd linux-burntsushi/
$ rg -n -u -tc -w -L '[A-Z]+_SUSPEND' | wc -l
403
$ LC_ALL=en_US.UTF-8 egrep -R -n --include='*.c' --include='*.h' -w '[A-Z]+_SUSPEND' | wc -l
403
$ LC_ALL=C egrep -R -n --include='*.c' --include='*.h' -w '[A-Z]+_SUSPEND' | wc -l
403
$ ucg --type=cc -w '[A-Z]+_SUSPEND' | wc -l
ucg: warning: './arch/powerpc/boot/dts/include/dt-bindings': recursive directory loop
ucg: warning: './arch/metag/boot/dts/include/dt-bindings': recursive directory loop
ucg: warning: './arch/arm64/boot/dts/include/dt-bindings': recursive directory loop
ucg: warning: './arch/arm/boot/dts/include/dt-bindings': recursive directory loop
ucg: warning: './arch/cris/boot/dts/include/dt-bindings': recursive directory loop
ucg: warning: './arch/mips/boot/dts/include/dt-bindings': recursive directory loop
391

In particular, I don't believe there are any recursive directory loops.

Are you perhaps not descending into the symlinked directory because it links to a directory that has already been searched? (Either way, it's at least not a loop.)

@BurntSushi
Copy link
Author

Version info:

ucg --version
UniversalCodeGrep 0.3.0
Copyright (C) 2015-2016 Gary R. Van Sickle.

This program is free software; you can redistribute it and/or modify
it under the terms of version 3 of the GNU General Public License as
published by the Free Software Foundation.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.

Build info

Repo version: 0.3.0-4-g6867e8e

Compiler info:
 Name ($(CXX)): g++
 Version string: "g++ (GCC) 6.2.1 20160830"

ISA extensions in use:
 sse4.2: yes
 popcnt: yes

libpcre info:
 Version: 8.39 2016-06-14
 JIT support built in?: yes
 JIT target architecture: x86 64bit (little endian + unaligned)
 Newline style: LF

libpcre2-8 info:
 Version: 10.22 2016-07-29
 JIT support built in?: yes
 JIT target architecture: x86 64bit (little endian + unaligned)
 Newline style: LF

@gvansickle
Copy link
Owner

(note to self, that's current master).

Hey @BurntSushi ,

Coincidentally, I'm right in the middle of reworking the directory tree traversal, so this may get changed in the process. It's not really a bug though, but I suppose it depends on how you look at it. E.g. try this:

$ ucg --type=cc -w '[A-Z]+_SUSPEND' ~/src/TestCorpus/linux --dirjobs=1 | wc -l
404
$ ucg --type=cc -w '[A-Z]+_SUSPEND' ~/src/TestCorpus/linux --dirjobs=2 | wc -l
392

Now do something similar with grep (note the -R (follow symlinks) vs. -r (don't follow symlinks)):

$ egrep -R -n --include='*.c' --include='*.h' -w '[A-Z]+_SUSPEND' ~/src/TestCorpus/linux | wc -l
404
$ egrep -r -n --include='*.c' --include='*.h' -w '[A-Z]+_SUSPEND' ~/src/TestCorpus/linux | wc -l
390

Are you perhaps not descending into the symlinked directory because it links to a directory that has already been searched?

That's exactly what's going on, but as always, it's a bit more complicated than that. When running with more than 1 dir tree thread, I have to handle symlink loop detection myself. I do that by not re-visiting directories I've already visited. But with only one thread, the fts library I use is able to handle this for me; however, it appears to be detecting only real cycles, and not just previously-visited dirs. So in that case, any "forward-crosslinks" (gotta be an actual term for that) get visited more than once.

grep uses the same fts library, which similarly explains its behavior. If you do a physical traversal, you hit each file exactly once, and you get the lower number. If you do a logical traversal, you're safe from cycles, but any files on the other side of any forward-crosslinks will get visited multiple times.

I noticed this when I first went to multithreaded directory traversal, and made the decision that:

  1. For the use-case at hand, it seems like not re-searching files that have already been searched is the expected and preferred behavior.
  2. Since this would be the behavior with the default 2 threads, plus my (maybe naive) expectation that symlinks would be rare in the trees being searched, it wasn't worth worrying about the duplicate hits in the 1 thread case.

So it's not a bug of ommission.

(Either way, it's at least not a loop.)

Indeed. It probably would make sense to at least reword the warning in this case.

Thanks for the report Andrew. Looks like you've been pretty busy!

GRVS

@gvansickle gvansickle added this to the Version 0.4.0 milestone Nov 7, 2016
@BurntSushi
Copy link
Author

BurntSushi commented Nov 7, 2016

I admit, it's a little strange for the results returned to vary based on the number of threads you're using, but I can see why skipping previously visited directories might be helpful (if a bit surprising, based on the behavior of others tools).

That's exactly what's going on, but as always, it's a bit more complicated than that. When running with more than 1 dir tree thread, I have to handle symlink loop detection myself. I do that by not re-visiting directories I've already visited. But with only one thread, the fts library I use is able to handle this for me; however, it appears to be detecting only real cycles, and not just previously-visited dirs. So in that case, any "forward-crosslinks" (gotta be an actual term for that) get visited more than once.

I just recently added a lock-free parallel recursive directory iterator myself, which also of course has to handle symlinks manually, but it only detects true loops (which matches single threaded behavior). Rust being fairly new, I also had to write the single threaded iterator. :P They both check for symlink loops in exactly the same way.

@gvansickle
Copy link
Owner

This has been resolved since release 0.3.2 in a few ways:

  • The message has been changed to "warning: '': already visited this directory, possible recursive directory loop?"
  • Replacement of fts lib with hand-rolled DirTree, which behaves the same wrt this issue regardless of the number of threads.

Closing as resolved.

@gvansickle gvansickle self-assigned this May 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants