-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
preserve order of input files in output without sacrificing parallelism #152
Comments
You can do this today, but you have to give up parallelism. If you run |
Yeah, so I don't want to give up parallelism, obviously. GNU parallel let's me be parallel and preserve the order of output with |
I get what you're saying. Here are my thoughts on the matter:
All of these things combined together means I'm not particularly motivated to work on this, but I grant it's something I'd be willing to explore given unbounded resources. |
You can implement ordering + parallelism with no additional cost; all rg has to do is read output from the threads in the order that they are launched. This means, of course, that some of the threads will block waiting to be read because their output buffers will fill up, but the performance of such an implementation would be no worse than You can improve the performance simply by making the output buffers larger so that they block less. "You just have to put up with your search running eight times slower on your eight-core machine," is not really a reasonable answer, especially since I don't have to put up with it, since as I pointed out, there are other tools that will allow me to search and preserve ordering in the output. I frankly get the impression that you are rationalizing this idea being not worthwhile because the internal implementation of rg is not conducive to it. That's putting the cart before the horse. If it's too hard to do this right now in rg, fine, that's a reasonable answer. But to then try to hand-wave away the value of the idea because it's hard to do is a step too far. |
Unless I'm missing something, it sounds like the additional cost there is both memory and time.
But it sounds like it will be slower than Comparing your suggestion to
Of course it's a reasonable answer. This is about time preferences and priorities. Two reasonable people can disagree on how important this particular issue is. I happen to think it's not that important. You clearly do. That's OK. If GNU parallel lets you achieve this already, then it sounds like you can use
I'm sorry I gave you that impression because that wasn't my intent. I think we should try to avoid reading into others' motivations and give the benefit of the doubt that I'm saying what I mean. Without an assumption of good faith here, it's hard to have a productive conversation. (To be frank, this comment by you is putting me on the defensive and making it hard for me to respond.) To respond though, I don't think |
Point taken. Thanks for the thoughtful response. |
Unfortunately, I frequently get "Out of memory (os error 12)" from ripgrep when I try to use "rg -j1" with GNU parallel. |
@jikamens Well... that's bad. :P If you'd be willing to file a bug, I'd be happy to look into it. I'm not a user of GNU parallel, so I think I'd need at least the full command you're running. If you can reproduce the issue on data we can both access, that's even better, but I understand if that's not possible. |
I've just told you all that I'm able to tell you, so you can file the bug yourself if you want. ;-) Seriously, the reason I haven't filed a bug is because the data I'm searching is very large and proprietary and so are the search strings I'm using, so all I'd be able to say in the bug is, "Yeah, I'm getting |
@BurntSushi on ripgrep 0.4.0,
It would be great if Thanks for all your hard work! ripgrep is a joy to use. We have type coupling between microservices so I am often searching for |
@ryanberckmans There's no debate that Glad you're enjoying ripgrep. :-) |
Since the documentation states that search time is guaranteed to be linear, perhaps we'd get more consistency (without guaranteeing full determinism of output) by ordering the files to search from largest to smallest by default? This of course would require completing the directory traversal before even starting a search, but I'm not sure how big a burden that is, either in terms of programming effort to make the modification or runtime memory usage to store and traverse the file list. (I'm guessing that Obviously, to guarantee consistency, there's no way to escape the fact that sometimes the output will "hang" while waiting for the next thread to complete its search. |
Storing the entire tree in memory isn't a path I'm excited about pursuing. It's not only a huge code change, but it will introduce additional latency and a memory requirement that is proportional to the size of the tree. |
That makes sense. Does ripgrep currently search all files in a directory before going into subdirectories? If so, simply sorting the files in each directory by size (largest first) after getting the listing from the file system seems like it would probably be an improvement. (I'm guessing that getting the total size of each child directory is probably not worthwhile if exploring the entire file tree isn't an option.) |
@BatmanAoD The directory traversal itself is parallelized. It's a key reason why ripgrep is so fast even when it needs to process gitignore files. Regardless, standard directory traversal is depth first, not breadth first, precisely for memory constraints. (Large but flat directories are more common than exceedingly deep directories.) |
Right, but there's still an ordering of tasks available for execution (in parallel), based on which files haven't been searched yet and which folders haven't been explored yet. And even in a depth-first search, the full list of entries in a single directory is still retrieved from the file system before the children are actually searched (or added to the work queue). So I'm suggesting that in between the call to ......but now that I've written that out, I think the obvious question I should have asked earlier is, what kind of work queue does ripgrep actually use? If it's simply FIFO, perhaps switching to a priority queue based on size would be the best way to ensure that larger files are searched first if possible? |
Nope. :-) In depth first search, the only things you need to store in memory are:
With that said, for a directory tree of depth At least, this is roughly how
And that was pretty much it. The The worker thread could then be made to "block" until it was the entry's turn to be written, which could be done by checking the sequence id of the most recently printed entry. (There's probably a bit more to this, and perhaps latency could be reduced by introducing a ring queue and another worker that is only responsible for printing. This last worker would need to do a bit of windowing not unlike how TCP works.) I'd also like to take this time to note that "sorting by file size" is itself a non-starter because it would slow ripgrep down a bit. Namely, ripgrep would need to issue a stat call for each file. With that said, ripgrep no longer uses this simplistic approach. The core problem with it is that it forces the processing of all filter logic to be single threaded. Given the size of some gitignore files, this can end up being a substantial portion of ripgrep's runtime. Thus, a while ago, I set out to parallelize directory traversal itself. It still uses a work queue, but the fundamental difference is that there are no longer separate producers or consumers. Instead, every producer is, itself, a consumer. The reason for this is that if you're parallelizing directory traversal, then you end up reading the contents of multiple directories simultaneously, so there's no single point of truth about when it terminates. Indeed, termination is complicated. The big picture problem is that this form of parallelism makes determinism much harder than the simpler work queue formulation that ripgrep used to have. In particular, each item in a directory is sent along a queue, which means there is no point at which you could sort the files you're searching and realistically expect that to result in more deterministic output. Namely, even if you collected all of the entries of a directory and sorted them before sending them to the queue, you would in fact be racing with the N other threads that are doing the same. To solve the deterministic output problem while simultaneously using parallelism, I think you need to do one of these things:
|
If you look at my pseudo code above, you might wonder why one couldn't just move the So if you go down this route, you basically need to move construction of the filters themselves into the workers as well. But if you do this, you either need to be willing to pay for constructing each filter N times (for each of the N threads), or you need to do some type of synchronization so that each filter is built once. If you pick the former, then you're right back to where you started: you're not actually parallelizing a good chunk of the filter process, although you are parallelizing the filter match process, which is a small win. If you pick the latter, then at some point, you're going to ask If you really tumble down the rabbit hole, then maybe there is a smarter way to prioritize which entries get searched based on which filters are available, but then you've probably given up determinism again. If you get this far, then you're back to door number 3 in my previous comment. |
Okay, I think I understand much better now what you meant by "the directory traversal itself is parallelized" and why that presents an essentially intractable problem for obtaining determinism via the sort of minor tweaks I was trying to propose. And at the risk of making my complete ignorance of the issues at play even more obvious (! 😆), I didn't even realize that directories could be read as streams in Linux, let alone in any sort of cross-platform application (at least without circumventing the OS filesystem API and reading/interpreting the raw bits). ...and, looking back, I'm not even sure why I suggested that largest-file-first would be valuable, even heuristically; I think my idea was that this would prioritize actual regex-matching over output printing, but I'm pretty sure that's exactly the reverse of what you'd want. Based on the very high-level explanation of Go-routine scheduling and stealing here, I do not immediately see any reason why that form of scheduling would make output-sorting any easier. |
Anyway, thanks for the detailed explanation! That's quite helpful in understanding what's going on. |
Hi all. I also want this feature for ripgrep, as it's one of my most missed features from Priority Queue SolutionSuppose you're searching this directory:
Let's say we sort the output of readdir.
I propose we use a priority queue, ordered by path name:
Notes about this approach:
Questions for @BurntSushi: Does this make sense? Do you agree that this will mostly sort the output? Or do you see some critical flaw I didn't consider? Is this something you'd be willing to merge? Assuming everything pans out - files are usually sorted and there's no major performance change. |
@bmalehorn did you happen to read my most recent comments on this? Namely, i don't see how this helps the matter. Remember, in the current implementation, the consumer of the queue is also its producer. Also, I'm pretty sure ag doesn't have this feature. Perhaps you could elaborate on that point? |
The other issue I see is that your queue example seems to indicate a breadth first search. But most recursive directory iterators use depth first search (including ripgrep). Which kind of makes the priority queue idea moot. |
Actually, I don't think ripgrep uses depth first search.
I tried swapping it out for a stack, and sorting readdir:
The result is that output is often sorted. However, it can get out of order if, for example:
Now you will interleave You can view the commit here. Even if you don't want to use my half-assed sorting approach, I'd still argue that ripgrep should use a stack instead of a queue. The stack won't grow nearly as large as the queue. |
Uh... Maybe in single threaded mode, but certainly not when search is parallelized.
I see. I was thinking about
The sorting approach is probably something that would need good benchmarking before getting merged. Particularly on very large directories. If ripgrep were used on a directory with many files, then it won't even start searching until the entire list is in memory. If using a stack ends up with a more consistent order in practice, then I'd be fine with that! |
Change the pool of "files to search" from queue to a stack. This causes ripgrep to approximate depth-first search instead of breadth-first search. This dramatically reduces the size of the pool, since most directories are much more "wide" than they are "deep". As a result, ripgrep uses ~45% less peak memory: Before: /usr/bin/time -f '%M' rg PM_RESUME > /dev/null ... 8876 After: /usr/bin/time -f '%M' rg PM_RESUME > /dev/null 16240 Throughput improves a tiny bit: Before: linux_literal (pattern: PM_RESUME) ---------------------------------- rg (ignore)* 0.376 +/- 0.003 (lines: 16)* After: linux_literal (pattern: PM_RESUME) ---------------------------------- rg (ignore)* 0.371 +/- 0.004 (lines: 16)* Related to BurntSushi#152.
To be clear, they're both non-deterministic. But there are a few "levels" of how non-deterministic their outputs are:
In practice, 2 is good enough. I've used this patched version of "Priority queue" and "stack + sort readdir" are somewhere between 2 and 3. I've made a PR for switching queue -> stack because it has some performance gains. It doesn't attempt to put the files in order.
Yeah I agree. I'm going to play around with sorting and priority queues and see what I find. Hopefully, we can either get sorting as the default, or make PS. Thanks for being such a responsive maintainer! |
I've created two builds:
In my opinion, the minor performance loss of |
Realistically speaking, I don't think this feature request is going to get satisfied any time soon. If deterministic ordering is important to you, then for now you'll need to use |
I would also be interested by this feature. As a programmer myself I would see a way to solve that. It would indeed have a cost, but it would be way better than But you seem to have had a lot of discussion about that, could you help me with a TL;DR of this discussion about why it's impossible to do? Thank you very much |
I never said it was impossible, so I'm confused as to why you're asking this question. This feature is precisely about trade offs, in terms of runtime performance and also in terms of implementation complexity. Honestly, I haven't thought about this specific problem in a long time, so I would need to re-read this thread myself to give you a convincing TL;DR. But given that it wouldn't be answering the question you asked, I think it would be better if you just went ahead and read the thread. If you have specific questions after that, then I'd be happy to try and answer those. |
Right, wrong wording, I considered that if the issue's been closed it meant that after the discussion you arrived to the conclusion that the trade offs were not good enough to spend more time on it. No worries if you don't remember the thread, I'll read it then ;) |
I've implemented an ordered tree walk, originally based off the ignore create. I'm just using internally, it would require quite a bit of work to get back to ignore api state and feature set, but it might serve as a good reference point for evaluating how much performance and complexity overhead an ordered walk would take. These times are all for walking over linux checkout with the configuration
In my ordered walk there are two kinds of order. The initial processing happens in "mostly" order. Each worker maintains a priority queue of work (ordered by index_path or each entry) . Workers exchange work items occasionally to keep queues balanced.
On top of that "mostly" order you can apply a complete ordering. This works by gathering the results from each worker into another priority queue, and pulling the final results off the top when the top equals struct NextEntry.
Generally this all seems to be working to my liking. The mostly ordered processing doesn't add much overhead. The complete ordering does certainly add some overhead, but it also saves a lot of time if you are only interested in first_n entires as would be the case in my UI's. Probably lots of room for improvement. This is my first big chunk of rust code, and also my first code that does more than just split up and process parallel processing. I'm sharing this code for anyone to do what they like with it. I would love to see the ignore crate get this type of complete ordering, but with parallel processing. https://gist.github.com/jessegrosjean/6b3ca3a52249b65f58294dfafca12c91 |
@jessegrosjean Lovely! Thanks for sharing! |
I just posted an update to that gist that improves performance and simplifies things. Previously work was shared between workers through a single channel. Also workers would occasionally "balance" their local work queue by pulling values out of this channel (if they needed work), or pushing into it (if they had too much). The updated gist instead creates a separate channel for each worker. And each worker now gets a Vec of the sender half of all these channels. Now for each new work item the worker scheduling the item just sends it randomly to one of the other workers. This is simpler and faster since there's no longer any need for "balance" work queue logic. Now I see only 3-4ms difference between ignore's implementation and the ordered_walk_visit_all_entries_in_mostly_order test case. |
Ok, I said I was learning... I keep noticing places where my ordered walk needs fixing. After last update I realized that my version wasn't really doing work stealing. A work item that took a long time to process could block other work items (in that workers queue) even when other workers needed work. To solve this I now share all work queues with all workers. And when a workers queue is empty it will now try to steal work from another workers queue. This means all queues now need to be accessed through a mutex, but performance still seems really good. Stealing doesn't happen very often, so not much contention. I also realized that in doing my benchmarks I was doing some extra work specific to my use. The end result is that the current version of the gist is performing a little better (by a few ms) than the default ignore crate. Summary is that (unless I'm missing something, possible!) performance shouldn't be an issue. This ordered walk is just as fast or faster. Here I'm talking about the underlying mostly ordered performance. Applying a total order (using I'm going to leave this be for now, but let me know if you've got any questions. |
Thanks for the writeup! What does memory usage look like? I think the
current parallel implementation is actually using more memory than it
should be, so I would be interested in both theoretical and actual answers,
if that makes sense. :)
…On Wed, Nov 7, 2018, 19:38 Jesse Grosjean ***@***.*** wrote:
Ok, I said I was learning... I keep noticing places where my ordered walk
needs fixing. After last update I realized that my version wasn't really
doing work stealing. A work item that took a long time to process could
block other work items (in that workers queue) even when other workers
needed work.
To solve this I now share all work queues with all workers. And when a
workers queue is empty it will now try to steal work from another workers
queue. This means all queues now need to be accessed through a mutex, but
performance still seems really good. Stealing doesn't happen very often, so
not much contention.
I also realized that in doing my benchmarks I was doing some extra work
specific to my use. The end result is that the current version of the gist
is performing a little better (by a few ms) than the default ignore crate.
Summary is that (unless I'm missing something, possible!) performance
shouldn't be an issue. This ordered walk is just as fast or faster. Here
I'm talking about the underlying mostly ordered performance. Applying a
total order (using entries method) does cost more (about double).
I'm going to leave this be for now, but let me know if you've got any
questions.
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#152 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAb34mdEzBcEs5uZL0URjJ9hIfNHjMypks5us30OgaJpZM4KPYye>
.
|
Is there a good way to measure memory usage from within rust? Or should I use external tool? For theoretical usage, let me see. I'm mostly making this up... mostly I just build my app and if it doesn't explode I call it NP good. My understanding is that the way the current ignore works is for each WalkEntry encountered it pushes it into a channel. When a worker needs work it pulls off that channel. I "think" ordered walk is doing pretty much the same thing in terms of overall memory use. When ordered walk encounters a new WalkEntry it is also pushed into a channel. The difference is that instead of a single channel each worker has it's own channel. Each WalkEntry gets pushed into a randomly selected worker's channel. So instead of a single big channel, there are N smaller channels. And then each worker transfers the contents of their channel to a priority queue. Besides the overhead of extra channels, priority queues, etc... I don't think the actually number of entries alive in memory at once changes between the two implementations. When a worker needs work and it's local queue is empty it will just steal from another worker. So I think the total number of in flight entries is the same.
I'm not sure... I guess memory grows when you process a folder, since it generates more entries. So if you wanted to use less memory in the current implementation you could prioritize processing of files over processing of folders. That would keep the number of entries buffered in the channel smaller I think. If that's actually true, then I guess the ordered walk code might help to actually lower memory usage. Instead of sorting alphabetically as the current code does you could sort files ahead of folders. And I think that might theoretically use less memory. Though it also might slow things down as you would be more likely to run into empty queue cases and be forced to try work stealing. Maybe. |
Figured I should comment here since I've messed around with file order and memory usage. @BurntSushi you're right that there are memory problems in the current implementation. Since ripgrep uses a queue for searching files, it does something close to a BFS search. BFS uses memory proportional to the widest part of the tree. How wide is the widest part of your average source tree? E.g. how many files are 3 directories deep? Quite a few; I think Linux can get up to 10k elements in ripgrep's search queue. In #450 I changed the work queue into a work stack, changing from BFS-ish to DFS-ish. It reduced memory usage from 8.8 MB to about 16 MB, since memory usage was now proportional to the max depth. It didn't get merged because of other reasons (mucking with the the sort order). When you use a priority queue for the work, it will do something close to DFS. "Sorted order of files" is just a DFS walk over the files. So @jessegrosjean's patches will probably use less memory. Jesse if you want to test peak memory usage and you're on Linux, you can run tl;dr: ripgrep does BFS, priority queues do DFS, BFS uses more memory than DFS |
My implementation could be made much simpler if there was a fast priority queue that the workers could all share, just as they share a single channel in the current implementation. Is there such a thing in rust? A priority channel? Right now the only reason that each worker has own work queue is because if I use a single shared queue wrapped in a mutex there's to much contention and it's slow. |
FWIW, I would try to find a way to use rayon. It's possibly a research task though. I don't know how well it fits. |
@BurntSushi I have recently read through the source code in the I used this as a motivation to try and implement my own simple parallel-walk implementation for a disk-usage-counting tool (which does not need the https://github.com/sharkdp/diskus/blob/v0.4.0/src/walk.rs I am rather sure that this is NOT an ideal solution, since this was the first time that I have worked with rayon. Still, the rayon-based implementation is a bit faster than the One thing that I don't like about my current implementation is that I need to copy all Paths into a (obviously my rayon-based implementation does not fulfill any ordering requirements, so I'm sorry if this is not really of any use here). * This might have other reasons too. I have noticed that the Edit: Peak memory usage is a factor of four smaller for the |
@sharkdp Somehow I missed your comment! Just seeing it now. Just wanted to chime in and thank you. Hopefully some day I'll be able to find time to dig back into this and look more closely and your work and @jessegrosjean's work, and see about rolling it back into |
For anybody who stumbles on this discussion by accident, piping the result through Example: rg -e foo -nH file1 file2 file3 | sort -t: -k1,1 -k2n,2 The main downside is you don't see the results until the search completes, of course. |
@dgutov that's a great example of sorting the output and very fast indeed. I found that
|
@theimpostor The state of command line tools on Apple systems is really bad (another instance of that I've heard of recently is broken behavior of the ancient version of Is there a possibility to install newer |
There should be a way to tell rg to preserve the order of input files specified on the command line in the output.
E.g., if I'm searching a bunch of files which are in chronological order, I want the output to be in chronological order.
The text was updated successfully, but these errors were encountered: