Add --offset option to allow for pagination of large files #608

Closed
wcamarao opened this issue Sep 21, 2017 · 18 comments

Comments

@wcamarao commented Sep 21, 2017

Motivation:

Say I want to grep a large file but there are too many results, and I want to display the results elsewhere (e.g. a web app).

I could start paginating with rg --max-count 100 pattern 1gb.log, which will quickly return the first 100 matching lines, but I don't think ripgrep currently has an option to return the next 100.

So I'm doing rg pattern 1gb.log | sed -n 101,200p for the next 100 lines, which is really slow because ripgrep produces every matching line just so sed can display only the next 100.

Suggestion:

Add an --offset option so ripgrep can paginate through large result sets like piping to sed -n, but much faster, since it wouldn't load all matching lines into memory.

Note on naming:

Since ripgrep supports both --max-count and --max-columns, maybe the new flag could be named --count-offset to make it clear that it's related to --max-count.
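
To illustrate, the proposed usage could look something like this (purely hypothetical; neither --offset nor --count-offset is an existing ripgrep flag):

# skip the first 200 matching lines, then report the next 100
rg --count-offset 200 --max-count 100 pattern 1gb.log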

@wcamarao changed the title on Sep 21, 2017: Add --offset option to allow for "pagination" of large files → Add --offset option to allow for pagination of large files
@richarson

How about doing this?

rg pattern 1gb.log --max-count 100
rg pattern 1gb.log --max-count 200 | sed -n 101,200p
rg pattern 1gb.log --max-count 300 | sed -n 201,300p
rg pattern 1gb.log --max-count 400 | sed -n 301,400p
...

Or maybe even easier:

rg pattern 1gb.log --max-count 100
rg pattern 1gb.log --max-count 200 | tail -n 100
rg pattern 1gb.log --max-count 300 | tail -n 100
rg pattern 1gb.log --max-count 400 | tail -n 100
...

@wcamarao (Author) commented Sep 22, 2017 via email

@BurntSushi (Owner)

This isn't going to happen, and certainly isn't going to happen in a way that limits memory usage. Consider:

  1. ripgrep makes no guarantees about order of results. You can ask ripgrep to sort the output, but this will force it to be single threaded.
  2. Memory usage on search results isn't really something that I think is worth spending a lot of effort on. There is certainly an implicit assumption that search results are much smaller than the corpus being searched.

Overall, I think this is a relatively niche use case and the given workaround is good enough (you'll want to tell ripgrep to sort the output, though).
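
For example, a sketch of the workaround with stable ordering (assuming the --sort-files flag is available in your version; somedir/ is a placeholder, and sorting forces single-threaded search):

rg --sort-files pattern somedir/ | sed -n 101,200p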

@wcamarao (Author) commented Sep 23, 2017

I understand you don't find this a common use case, and I appreciate you explaining that and the technical challenge.

However, I do have an idea on how to keep multithreading across different files while supporting an offset across the board. So if you ever reconsider, please reopen this issue to let me know.

Thanks

@BurntSushi (Owner)

I'm always happy to entertain ideas, but I'd caution you that suggesting changes to how multithreaded search in ripgrep works should take into account how it works today.

@wcamarao (Author)

Glad to hear you're open to more ideas. Would you mind giving a brief explanation and pointing out the relevant part of the code? That may well invalidate my idea :) but I'll be happy to dig further and think it through if that happens. Thanks

@BurntSushi (Owner)

I don't think a brief explanation is really possible, but I've explained it in other issues:

The actual parallelism is implemented in the ignore crate:

ripgrep/ignore/src/walk.rs (lines 864 to 1297 at 214f2be):

/// WalkParallel is a parallel recursive directory iterator over files paths
/// in one or more directories.
///
/// Only file and directory paths matching the rules are returned. By default,
/// ignore files like `.gitignore` are respected. The precise matching rules
/// and precedence is explained in the documentation for `WalkBuilder`.
///
/// Unlike `Walk`, this uses multiple threads for traversing a directory.
pub struct WalkParallel {
    paths: vec::IntoIter<PathBuf>,
    ig_root: Ignore,
    parents: bool,
    max_filesize: Option<u64>,
    max_depth: Option<usize>,
    follow_links: bool,
    threads: usize,
}

impl WalkParallel {
    /// Execute the parallel recursive directory iterator. `mkf` is called
    /// for each thread used for iteration. The function produced by `mkf`
    /// is then in turn called for each visited file path.
    pub fn run<F>(
        self,
        mut mkf: F,
    ) where F: FnMut() -> Box<FnMut(Result<DirEntry, Error>) -> WalkState + Send + 'static> {
        let mut f = mkf();
        let threads = self.threads();
        let queue = Arc::new(MsQueue::new());
        let mut any_work = false;
        // Send the initial set of root paths to the pool of workers.
        // Note that we only send directories. For files, we send to them the
        // callback directly.
        for path in self.paths {
            let dent =
                if path == Path::new("-") {
                    DirEntry::new_stdin()
                } else {
                    match DirEntryRaw::from_link(0, path) {
                        Ok(dent) => DirEntry::new_raw(dent, None),
                        Err(err) => {
                            if f(Err(err)).is_quit() {
                                return;
                            }
                            continue;
                        }
                    }
                };
            queue.push(Message::Work(Work {
                dent: dent,
                ignore: self.ig_root.clone(),
            }));
            any_work = true;
        }
        // ... but there's no need to start workers if we don't need them.
        if !any_work {
            return;
        }
        // Create the workers and then wait for them to finish.
        let num_waiting = Arc::new(AtomicUsize::new(0));
        let num_quitting = Arc::new(AtomicUsize::new(0));
        let quit_now = Arc::new(AtomicBool::new(false));
        let mut handles = vec![];
        for _ in 0..threads {
            let worker = Worker {
                f: mkf(),
                queue: queue.clone(),
                quit_now: quit_now.clone(),
                is_waiting: false,
                is_quitting: false,
                num_waiting: num_waiting.clone(),
                num_quitting: num_quitting.clone(),
                threads: threads,
                parents: self.parents,
                max_depth: self.max_depth,
                max_filesize: self.max_filesize,
                follow_links: self.follow_links,
            };
            handles.push(thread::spawn(|| worker.run()));
        }
        for handle in handles {
            handle.join().unwrap();
        }
    }

    fn threads(&self) -> usize {
        if self.threads == 0 {
            2
        } else {
            self.threads
        }
    }
}

/// Message is the set of instructions that a worker knows how to process.
enum Message {
    /// A work item corresponds to a directory that should be descended into.
    /// Work items for entries that should be skipped or ignored should not
    /// be produced.
    Work(Work),
    /// This instruction indicates that the worker should start quitting.
    Quit,
}

/// A unit of work for each worker to process.
///
/// Each unit of work corresponds to a directory that should be descended
/// into.
struct Work {
    /// The directory entry.
    dent: DirEntry,
    /// Any ignore matchers that have been built for this directory's parents.
    ignore: Ignore,
}

impl Work {
    /// Returns true if and only if this work item is a directory.
    fn is_dir(&self) -> bool {
        self.dent.file_type().map_or(false, |t| t.is_dir())
    }

    /// Adds ignore rules for parent directories.
    ///
    /// Note that this only applies to entries at depth 0. On all other
    /// entries, this is a no-op.
    fn add_parents(&mut self) -> Option<Error> {
        if self.dent.depth() > 0 {
            return None;
        }
        // At depth 0, the path of this entry is a root path, so we can
        // use it directly to add parent ignore rules.
        let (ig, err) = self.ignore.add_parents(self.dent.path());
        self.ignore = ig;
        err
    }

    /// Reads the directory contents of this work item and adds ignore
    /// rules for this directory.
    ///
    /// If there was a problem with reading the directory contents, then
    /// an error is returned. If there was a problem reading the ignore
    /// rules for this directory, then the error is attached to this
    /// work item's directory entry.
    fn read_dir(&mut self) -> Result<fs::ReadDir, Error> {
        let readdir = match fs::read_dir(self.dent.path()) {
            Ok(readdir) => readdir,
            Err(err) => {
                let err = Error::from(err)
                    .with_path(self.dent.path())
                    .with_depth(self.dent.depth());
                return Err(err);
            }
        };
        let (ig, err) = self.ignore.add_child(self.dent.path());
        self.ignore = ig;
        self.dent.err = err;
        Ok(readdir)
    }
}

/// A worker is responsible for descending into directories, updating the
/// ignore matchers, producing new work and invoking the caller's callback.
///
/// Note that a worker is *both* a producer and a consumer.
struct Worker {
    /// The caller's callback.
    f: Box<FnMut(Result<DirEntry, Error>) -> WalkState + Send + 'static>,
    /// A queue of work items. This is multi-producer and multi-consumer.
    queue: Arc<MsQueue<Message>>,
    /// Whether all workers should quit at the next opportunity. Note that
    /// this is distinct from quitting because of exhausting the contents of
    /// a directory. Instead, this is used when the caller's callback indicates
    /// that the iterator should quit immediately.
    quit_now: Arc<AtomicBool>,
    /// Whether this worker is waiting for more work.
    is_waiting: bool,
    /// Whether this worker has started to quit.
    is_quitting: bool,
    /// The number of workers waiting for more work.
    num_waiting: Arc<AtomicUsize>,
    /// The number of workers waiting to quit.
    num_quitting: Arc<AtomicUsize>,
    /// The total number of workers.
    threads: usize,
    /// Whether to create ignore matchers for parents of caller specified
    /// directories.
    parents: bool,
    /// The maximum depth of directories to descend. A value of `0` means no
    /// descension at all.
    max_depth: Option<usize>,
    /// The maximum size a searched file can be (in bytes). If a file exceeds
    /// this size it will be skipped.
    max_filesize: Option<u64>,
    /// Whether to follow symbolic links or not. When this is enabled, loop
    /// detection is performed.
    follow_links: bool,
}

impl Worker {
    /// Runs this worker until there is no more work left to do.
    ///
    /// The worker will call the caller's callback for all entries that aren't
    /// skipped by the ignore matcher.
    fn run(mut self) {
        while let Some(mut work) = self.get_work() {
            // If the work is not a directory, then we can just execute the
            // caller's callback immediately and move on.
            if !work.is_dir() {
                if (self.f)(Ok(work.dent)).is_quit() {
                    self.quit_now();
                    return;
                }
                continue;
            }
            if self.parents {
                if let Some(err) = work.add_parents() {
                    if (self.f)(Err(err)).is_quit() {
                        self.quit_now();
                        return;
                    }
                }
            }
            let readdir = match work.read_dir() {
                Ok(readdir) => readdir,
                Err(err) => {
                    if (self.f)(Err(err)).is_quit() {
                        self.quit_now();
                        return;
                    }
                    continue;
                }
            };
            let depth = work.dent.depth();
            match (self.f)(Ok(work.dent)) {
                WalkState::Continue => {}
                WalkState::Skip => continue,
                WalkState::Quit => {
                    self.quit_now();
                    return;
                }
            }
            if self.max_depth.map_or(false, |max| depth >= max) {
                continue;
            }
            for result in readdir {
                if self.run_one(&work.ignore, depth + 1, result).is_quit() {
                    self.quit_now();
                    return;
                }
            }
        }
    }

    /// Runs the worker on a single entry from a directory iterator.
    ///
    /// If the entry is a path that should be ignored, then this is a no-op.
    /// Otherwise, the entry is pushed on to the queue. (The actual execution
    /// of the callback happens in `run`.)
    ///
    /// If an error occurs while reading the entry, then it is sent to the
    /// caller's callback.
    ///
    /// `ig` is the `Ignore` matcher for the parent directory. `depth` should
    /// be the depth of this entry. `result` should be the item yielded by
    /// a directory iterator.
    fn run_one(
        &mut self,
        ig: &Ignore,
        depth: usize,
        result: Result<fs::DirEntry, io::Error>,
    ) -> WalkState {
        let fs_dent = match result {
            Ok(fs_dent) => fs_dent,
            Err(err) => {
                return (self.f)(Err(Error::from(err).with_depth(depth)));
            }
        };
        let mut dent = match DirEntryRaw::from_entry(depth, &fs_dent) {
            Ok(dent) => DirEntry::new_raw(dent, None),
            Err(err) => {
                return (self.f)(Err(err));
            }
        };
        let is_symlink = dent.file_type().map_or(false, |ft| ft.is_symlink());
        if self.follow_links && is_symlink {
            let path = dent.path().to_path_buf();
            dent = match DirEntryRaw::from_link(depth, path) {
                Ok(dent) => DirEntry::new_raw(dent, None),
                Err(err) => {
                    return (self.f)(Err(err));
                }
            };
            if dent.file_type().map_or(false, |ft| ft.is_dir()) {
                if let Err(err) = check_symlink_loop(ig, dent.path(), depth) {
                    return (self.f)(Err(err));
                }
            }
        }
        let is_dir = dent.file_type().map_or(false, |ft| ft.is_dir());
        let max_size = self.max_filesize;
        let should_skip_path = skip_path(ig, dent.path(), is_dir);
        let should_skip_filesize = if !is_dir && max_size.is_some() {
            skip_filesize(max_size.unwrap(), dent.path(), &dent.metadata().ok())
        } else {
            false
        };
        if !should_skip_path && !should_skip_filesize {
            self.queue.push(Message::Work(Work {
                dent: dent,
                ignore: ig.clone(),
            }));
        }
        WalkState::Continue
    }

    /// Returns the next directory to descend into.
    ///
    /// If all work has been exhausted, then this returns None. The worker
    /// should then subsequently quit.
    fn get_work(&mut self) -> Option<Work> {
        loop {
            if self.is_quit_now() {
                return None;
            }
            match self.queue.try_pop() {
                Some(Message::Work(work)) => {
                    self.waiting(false);
                    self.quitting(false);
                    return Some(work);
                }
                Some(Message::Quit) => {
                    // We can't just quit because a Message::Quit could be
                    // spurious. For example, it's possible to observe that
                    // all workers are waiting even if there's more work to
                    // be done.
                    //
                    // Therefore, we do a bit of a dance to wait until all
                    // workers have signaled that they're ready to quit before
                    // actually quitting.
                    //
                    // If the Quit message turns out to be spurious, then the
                    // loop below will break and we'll go back to looking for
                    // more work.
                    self.waiting(true);
                    self.quitting(true);
                    while !self.is_quit_now() {
                        let nwait = self.num_waiting();
                        let nquit = self.num_quitting();
                        // If the number of waiting workers dropped, then
                        // abort our attempt to quit.
                        if nwait < self.threads {
                            break;
                        }
                        // If all workers are in this quit loop, then we
                        // can stop.
                        if nquit == self.threads {
                            return None;
                        }
                        // Otherwise, spin.
                    }
                }
                None => {
                    self.waiting(true);
                    self.quitting(false);
                    if self.num_waiting() == self.threads {
                        for _ in 0..self.threads {
                            self.queue.push(Message::Quit);
                        }
                    } else {
                        // You're right to consider this suspicious, but it's
                        // a useful heuristic to permit producers to catch up
                        // to consumers without burning the CPU. It is also
                        // useful as a means to prevent burning the CPU if only
                        // one worker is left doing actual work. It's not
                        // perfect and it doesn't leave the CPU completely
                        // idle, but it's not clear what else we can do. :-/
                        thread::sleep(Duration::from_millis(1));
                    }
                }
            }
        }
    }

    /// Indicates that all workers should quit immediately.
    fn quit_now(&self) {
        self.quit_now.store(true, Ordering::SeqCst);
    }

    /// Returns true if this worker should quit immediately.
    fn is_quit_now(&self) -> bool {
        self.quit_now.load(Ordering::SeqCst)
    }

    /// Returns the total number of workers waiting for work.
    fn num_waiting(&self) -> usize {
        self.num_waiting.load(Ordering::SeqCst)
    }

    /// Returns the total number of workers ready to quit.
    fn num_quitting(&self) -> usize {
        self.num_quitting.load(Ordering::SeqCst)
    }

    /// Sets this worker's "quitting" state to the value of `yes`.
    fn quitting(&mut self, yes: bool) {
        if yes {
            if !self.is_quitting {
                self.is_quitting = true;
                self.num_quitting.fetch_add(1, Ordering::SeqCst);
            }
        } else {
            if self.is_quitting {
                self.is_quitting = false;
                self.num_quitting.fetch_sub(1, Ordering::SeqCst);
            }
        }
    }

    /// Sets this worker's "waiting" state to the value of `yes`.
    fn waiting(&mut self, yes: bool) {
        if yes {
            if !self.is_waiting {
                self.is_waiting = true;
                self.num_waiting.fetch_add(1, Ordering::SeqCst);
            }
        } else {
            if self.is_waiting {
                self.is_waiting = false;
                self.num_waiting.fetch_sub(1, Ordering::SeqCst);
            }
        }
    }
}

@wcamarao (Author) commented Sep 30, 2017 via email

@BurntSushi (Owner)

Ripgrep already does that by using max count, correct?

Yes. I'd strongly recommend reading the search code. max_count in particular is used here:

if self.max_count.map_or(false, |max| match_count >= max) {

E.g. Currently, ripgrep sets one boundary (max count) and searches only the first N lines of each file in a directory. When passing the new offset flag, idea is to limit the search scope even further to a specific line range where you can define a starting point different from the first line of the file.

Max count specifically refers to the number of matches reported, not the number of lines searched.
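
A quick way to see the distinction (sample.txt is a hypothetical file with three matching lines):

printf 'foo\nbar\nfoo\nfoo\n' > sample.txt
rg --max-count 2 foo sample.txt   # reports only the first 2 lines containing 'foo'; it does not limit how many lines are scanned to find them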

@BurntSushi (Owner)

@wcamarao One thing I may have misunderstood was that you probably want your proposed offset flag to be done on a per-file basis like how --max-count works. In that respect, me bringing up parallelism was a red herring.

I still believe you should be using standard command line tools as proposed to solve this problem. The memory usage argument is interesting, but that should only matter when a significant number of lines match in a very large file and you want to page through the entire thing. What is your use case for such a thing?

@wcamarao (Author) commented Oct 2, 2017

Interesting, so max count is used to terminate the search earlier, and it only affects the matches reported, not the lines searched. Did you also mean it still searches the whole file, allocating memory for all lines searched, but without allocating more memory for the matches reported within ripgrep? That would be consistent with what I'm experiencing at the moment (more below).

Oh, no problem. Thanks for digging into this further with me and helping us both understand it better. I really appreciate your continued support.

The use case is a UI tool that, given a root directory, lets you search file contents using ripgrep in the backend and reports results in a "web app way" (meaning we must paginate through the records found, otherwise page load could be really bad). Most cases are like you described earlier, with search results much smaller than the whole file, but sometimes that's not the case, and that's what I'm trying to improve at the moment. For now I'm using only max count, and it's lightning fast on the first request, but it keeps losing performance and increasing memory usage on subsequent requests. That's why I was hoping it would be feasible to introduce something similar to sed -n N,Np.

I'm not aware of other command line tools I could use for this, so I'm considering moving the backend to something like lucene. I'm just amazed by how much I was able to achieve so far with all the speed and simplicity of ripgrep, so I'd like to keep it if possible.

@BurntSushi (Owner) commented Oct 2, 2017

allocating memory for all lines searched

ripgrep's memory usage is proportional to the size of its output, not the size of the corpus it searches. (This is somewhat of a non-truth, in the sense that if you aren't searching with memory maps, then ripgrep requires at least one line to be resident in memory. So if your corpus is a single line---no matter how large---then ripgrep will store it all in memory.)

For now I'm using only max count and it's lightning speed at first request, but it keeps losing performance and increasing memory usage on subsequent requests.

While I agree that a hypothetical --offset flag could reduce memory usage, I don't think it will be capable of reclaiming CPU time. You still need to find the first N matches that you want to skip over.

So all we're talking about here is memory usage. Are you really witnessing memory usage so high that it is producing problems for you? If so, then your users might be executing searches with a large amount of output, and no matter what you do, that's going to cost you. Allowing users to paginate through that either requires more memory or more CPU without more sophisticated data structures (like making searches interruptible with some kind of saveable cursor).

I'm not aware of other command line tools I could use for this, so I'm considering moving the backend to something like lucene.

Lucene is a completely different animal and solves a completely different problem than ripgrep does. ripgrep is basically "line oriented searcher, with some clever filtering." Lucene is "use the full weight of information retrieval to do relevance ranking." In fact, with Lucene, operations like "paginating through the entire result set" are discouraged and won't perform well, because that doesn't correspond to an action a user typically performs on a large result set. (The one exception is if you're interested in creating training data; one method of doing that is paying people to make exhaustive relevance judgments.)

So what's my point? My point is that if Lucene is the better tool for the problem you're trying to solve, then you should just use that.


With all of this said, I do think your feature request is interesting. I'm still not a fan of adding it, because every new feature needs to consider its interaction with other features. But if this feature were simple to implement and came with tests, I might be willing to maintain it. I personally probably won't add it any time soon.

@wcamarao (Author) commented Oct 2, 2017

I don't think it will be capable of reclaiming CPU time

Interesting. I thought it would, and that's my final goal. I also thought it was related to memory usage because I saw them growing proportionally, but no, I'm not too concerned about memory itself.

Lucene is a completely different animal

I know, and for this case I don't need its more sophisticated features like the relevance ranking you mentioned, phonetic searching, and so on.

because every new feature needs to consider its interaction with other features

I understand this very well and I appreciate it.

Changing topics slightly, I'd like to vote for #566 then. It could be used to show a total count alongside the first results, giving the user an idea that there's much more, so they'll probably want to refine the search instead of paging through. Combining this new count with the existing one could give a summary like "N results found in N files".

@BurntSushi (Owner)

You can get a count of matched lines---which is a reasonable approximation of the number of occurrences---using the --count flag.
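
For example (somedir/ is a placeholder; the awk sum is just a rough way to total the per-file counts):

rg --count pattern somedir/                                            # one path:count line per matching file
rg --count pattern somedir/ | awk -F: '{ s += $NF } END { print s }'   # approximate total across all files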

@BurntSushi (Owner)

I thought it would, and that's my final goal.

Think about it: if you say you want M matches after the first N matches, then ripgrep still needs to do the work to find the first N matches in order to know to skip them.

This isn't true of every method of pagination, of course. If you can save some state about where your last search left off and know that the corpus you're searching hasn't changed since that state was saved, then you could "pick up where you left off," which would keep CPU time and memory usage proportional to the output that is being displayed. But such a feature will most definitely never find its way into ripgrep.
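
For what it's worth, something in that spirit can be approximated outside of ripgrep with standard tools, assuming the file does not change between requests and that your ripgrep version has the -b/--byte-offset flag (RESUME_AT is a hypothetical saved cursor, not anything ripgrep manages for you):

# first page: --byte-offset prints the byte offset where each matching line starts
rg --byte-offset --max-count 100 pattern 1gb.log
# later pages: resume from a saved 0-based offset pointing just past the end of the
# last line already shown (e.g. the last match's offset plus that line's length)
tail -c +"$(( RESUME_AT + 1 ))" 1gb.log | rg --max-count 100 pattern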

@wcamarao (Author) commented Oct 3, 2017

ripgrep still needs to do the work to find the first N matches

That's slightly different from what I thought. My idea is not to skip the first N matches; it is to skip the first N lines, no matter how many matches are found within those lines.

Pagination state is kept on the user's side. The first request starts at offset 1 (the first line) and reports up to N matching lines (max count). Say the last matching line was 456; the next request would use offset 457 with the same max count (e.g. 100), meaning the next 100 matching lines starting at line 457.
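
Outside of ripgrep, that scheme can be approximated with standard tools, assuming the file doesn't change between requests (457 is the example offset from above):

# skip the first 456 lines, then report the next 100 matching lines
tail -n +457 1gb.log | rg --line-number --max-count 100 pattern
# note: the reported line numbers are relative to the offset, so add 456 to get absolute line numbers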

I understand this may be too specific for ripgrep, I'm mostly entertained by our discussion at this point.

@BurntSushi (Owner)

@wcamarao oic. I didn't think of it that way since it seems so... not user friendly. :-) I can see how you could make use of it in a higher level UI though, sure.

@wcamarao (Author) commented Oct 4, 2017

Exactly. After digging into this further with you, I discovered new things I hadn't thought of before as well. Looking at the proposed interface now, I'd probably not have suggested this, to be honest, so I appreciate you taking the time and exercising the idea.
