RFC: add ngram indexing support to ripgrep #1497
Comments
Seems like a useful proposal with a sensible scope. Sorry if this is bike shedding minor details, but with my, "what if this was a man page hat on", I don’t think |
I currently solve this with a combination of a slightly changed This RFC describes exactly what I need with a very precise scope. I'd love to see this implemented! |
Cool! The one thing that would really excite me is if this got integrated into cloud-scale search engines like Github's code search or AWS Cloudwatch Logs search. I've always wanted to use regex instead of word search for those things. |
Good question. I certainly think it should. It should also work with the
My intent is to design this such that it should be able to scale "arbitrarily," so it's certainly possible. The main problem is that at that scale, you would need to ban regexes that search too much of the corpus. It is very possible to implement this ban, but whether that's an acceptable user experience, I'm not sure. For example, the regex |
Nice. I've often thought about ngram-indexing in ack, but have never pursued it. I'm glad you did. |
I would love to have this as I have been advocating for such indexing feature in the other issue as well. |
It's great to hear that this is coming to ripgrep. I'm still using
|
@MikeFHay Have you tried using sourcegraph.com instead for searching repos on GitHub? |
For people who index entire directory trees at once, rather than individual files, I'm hoping that indexes can get updated by atomic rename, which would prevent the majority of index corruption. Also, I'm hoping that the index logic is provided as a separate crate that doesn't inherently depend on integration with path searching, so that it would be possible (for instance) to index the history of a git repository. |
Never is a strong word. Is there a reason you wouldn’t want this to be the default at some point (in the far future)? |
These are amazing questions and feedback, folks. I really appreciate it. Keep more like this coming!
Yeah, I noticed this and I've been thinking on what to do about it. It is enticing to adopt a similar heuristic in ripgrep because a single large file could really screw the pooch. That is, a large file probably has a lot of different ngrams in it, so it has a high likelihood of appearing in candidate sets generated by the index. Which would then kind of defeat some of the large gains of using an index in the first place, since you wind up searching that large file---which might take as long as it takes to search thousands of tiny files. The main reason why I'm initially not a fan of having such a limit is because it's an additional filter on files on top of what ripgrep does normally, but it's a filter that only makes sense in the context of indexing. One possible alternative is to make the filter opt-in. After all, ripgrep already has a But yes, if this does wind up being an opt-out feature specific to indexing, then ripgrep would emit a warning message for each large file skipped. And the limit would certainly be configurable.
My hope is that ripgrep will make incremental indexing fast enough where you can just drop this particular behavior. :-) I think if this were to exist, it would probably need to be an opt-in feature. In particular, one of the key benefits that an index will bring is the ability to avoid re-walking the directory tree, which can take non-trivial time. But I agree, some kind of flag to say "use the index but search newer files like normal" seems like a good idea.
Ah yeah interesting. I hadn't really thought of a "blow up the world and re-index" mode, but it probably makes sense to have something like that. I do hope that incremental indexing will negate the need for such things in most cases, but sure, "re-index the world" would indeed likely take the atomic rename strategy.
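As a rough illustration of the atomic rename strategy mentioned above, here is a minimal Python sketch. The function name `write_atomically` and the temp-file prefix are made up for this example; nothing here reflects how ripgrep actually writes its index.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same file system) so the
    # final rename is a single atomic step. The prefix is a hypothetical name.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-segment-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp, path)  # atomic rename over any existing file
    except BaseException:
        # Clean up the partial temp file; the destination is left untouched.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves at most a stray temp file behind, never a half-written index file under the real name.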
Yes, it will be contained in the
This may be trickier to avoid. I believe the implementation path in my head does indeed focus around having a path for each thing you want to search. I'll consider the git history use case though and see if something like it can be supported. Maybe the way to think about it is not "path searching" but rather "searching documents where each document has some unique ID that can be resolved to its content." Making that pluggable at the crate level certainly seems doable. |
I guess it just seems like surprising UX to me. Searching with an index can lead to subtle state synchronization bugs where the index is lagging behind your actual corpus (or some other sync issue). So to me, it seems like it should be opt-in. Also, I think you might be giving more weight to "never" in this context than I intended. It's describing the behavior of a specific release of ripgrep, and not necessarily my future intent. |
You should consider using Bloom filters instead. Totally sidesteps the large file problem, has constant size which is a lot easier to manage, and uses less space. Or XOR filters, which are supposedly more compact: https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/ Edit: That is using a Bloom filter instead of the inverted index, i.e. still a bloom filter over the ngrams. |
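For readers unfamiliar with the idea, a Bloom filter over a file's ngrams could look roughly like the Python sketch below. The class, the bit count and the hash count are arbitrary choices for illustration, not a proposed design.

```python
import hashlib


class BloomFilter:
    """Fixed-size set sketch: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}


def may_contain(bloom, literal):
    """True if every trigram of literal might be in the file behind bloom."""
    return all(gram in bloom for gram in trigrams(literal))


bloom = BloomFilter()
for gram in trigrams("the quick brown fox"):
    bloom.add(gram)
```

The filter has the same size no matter how large the file is, which is the property the comment above points at. The price is that very large files saturate the filter and match almost everything.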
@mpdn I know I dipped into implementation details, but I'd really like to keep this ticket focused on the higher level user story so that folks are more inclined to participate. If we bog it down into implementation details, then I fear the discussion may become too intimidating for most. So with that in mind, I created #1518 so that folks who want to talk details can. :-) I responded to your comment there. |
There is some overlap between the goals to add ngram support to ripgrep. THAT would be phenomenal! Work was started to achieve that in rust. There was a c/c++ implementation called BigGrep, a set of tools, but the one that does the indexing is called bgindex. At Geekweek in Nov. 2019, they started a rust re-implementation called rs-bgindex, but we only had a week there. |
On Mon, Mar 16, 2020 at 06:56:55AM -0700, Andrew Gallant wrote:
@joshtriplett
> For people who index entire directory trees at once, rather than individual files, I'm hoping that indexes can get updated by atomic rename, which would prevent the majority of index corruption.
Ah yeah interesting. I hadn't really thought of a "blow up the world and re-index" mode, but it probably makes sense to have something like that. I do hope that incremental indexing will negate the need for such things in most cases, but sure, "re-index the world" would indeed likely take the atomic rename strategy.
I wasn't necessarily thinking of re-indexing everything, so much as
indexing *enough* that it makes sense to write a new file containing the
new index "segment"; that file could then be written atomically.
> Also, I'm hoping that the index logic is provided as a separate crate
Yes, it will be contained in the [`grep-index`](https://crates.io/crates/grep-index) crate.
Excellent.
> that doesn't inherently depend on integration with path searching, so that it would be possible (for instance) to index the history of a git repository.
This may be trickier to avoid. I believe the implementation path in my head does indeed focus around having a path for each thing you want to search. I'll consider the git history use case though and see if something like it can be supported. Maybe the way to think about it is not "path searching" but rather "searching documents where each document has some unique ID that can be resolved to its content." Making that pluggable at the crate level certainly seems doable.
A unique ID seems fine, sure.
|
Some high-level feedback: I am really delighted with your decision to not support relevance ordering and index synchronisation. I feel this is absolutely the right call as it means that ripgrep can very easily be dropped into a larger system. I'm generally satisfied with all the details mentioned. That said, a query regarding this:
Do you think this metadata would/should be extensible to user-defined metadata? E.g. a user might apply |
@aidansteele Thanks for the feedback. I suspect that the metadata will probably permit the library user to insert an arbitrary blob of bytes that they can use, yes. I doubt that will be exposed at the CLI level though. (Because ripgrep's CLI just really isn't expressive enough for something like that.) |
That sounds ideal. And I think that's a reasonable judgment on the CLI. Thanks for your great work. |
First of all, I really appreciate the thought and care you've put into proposing this feature! I think your decision to make a daemon that updates the index out of scope is very reasonable. One suggestion I have is that I think it might be better to make your If you get a basic implementation of index creation/updating I think it would be straightforward to prototype a continuous index update process using |
@luser Thanks for the feedback. I'll noodle on it, but as of now, I'm still inclined to keep one command. It does feel a little shoe-horned, but |
Another way to slice that would be " |
Can it work like gtags? gtags only provides commands to generate and update the index, but doesn't maintain the index files by itself. Editor plugins can automatically update the index since they know when the user changes files. By the way, if avoiding re-walking the directory tree can yield significant gains, can rg learn from git's tree and blob objects to speed up searching in a git repository? |
You mean ctags? I'm not quite sure what you mean when you say "it generates and updates the index" but also "doesn't maintain the index." These seem like contradictory statements. If all you mean is, "can ripgrep just provide commands to create, update and search the index" while letting external processes run those commands, then yes, that is exactly what's being proposed as an initial version of this feature. Please see the "ripgrep will not synchronize your index for you" section in the first comment of this issue.
This question isn't really on-topic for this thread, but the answer is "not really." ripgrep may search files outside of git's index. For example, an |
I really like this proposal. It sounds like the underlying assumption here is that reading the index into memory for each search will provide acceptable performance. This is a trade-off with a daemon approach which could keep the index in-memory, but has the complication of managing a daemon. I don't have good intuition about the performance trade-off, but if it is substantial I would want to make sure that the daemon experience is good. Maybe the list of places to find the index could be extended to auto-discover such a (maybe third-party) daemon in some way, with a simple protocol? Alternatively, maybe storing the index on For reference, I use ripgrep (often through vscode, sometimes from CLI) for searching code bases that I audit. For very large projects this is usually too slow, so I use an index like OpenGrok or VsChromium. It would be great if I could use ripgrep for everything! |
@nzig The index would be designed to be read via memory maps. So the entire index doesn't need to be read off disk immediately. This is a substantial simplification to the implementation, as it pushes the onus of managing what's on disk and what isn't to the OS. If you want to get more into the weeds on implementation details, I'd recommend #1518 for that. I'm trying to keep this issue more focused on the user experience, which I think is a discussion that more people can participate in. |
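To make the memory map point concrete, here is a small Python sketch that reads one fixed-size record out of a file without loading the rest. The record layout (little-endian 64-bit integers) and the function names are invented for the example and say nothing about the real index format.

```python
import mmap
import struct

RECORD = struct.Struct("<Q")  # hypothetical layout: one unsigned 64-bit value


def write_records(path, values):
    """Write values back to back as fixed-size records."""
    with open(path, "wb") as f:
        for value in values:
            f.write(RECORD.pack(value))


def read_record(path, i):
    """Read record i through a memory map; the OS pages in only what is touched."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return RECORD.unpack_from(m, i * RECORD.size)[0]
```

Repeated searches then benefit from the OS page cache without any process having to stay resident, which is the trade-off discussed above.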
@BurntSushi Thanks for the reply! If I'm understanding correctly, you're saying that you think reading the index via memory-mapped file and relying on OS caching will be fast enough so that a daemon won't be needed? I think this is on the intersection between UX and implementation. If the performance can't be good enough without a daemon, some users will want a daemon and will want the UX for that to be good as well. If the performance is acceptable, then everyone will have a good time with this proposal 😀. |
Yes, I am saying that. There is precedent for this. It's how Lucene works, for example. And yeah, I see how this is at the intersection of the UX and implementation. More or less, if you want to cross deeper into implementation details, then the other issue is better for it. :) |
Hey, I just noticed this thread and thought I'd chime in. I'm a heavy user of rg, and I've used/built many search tools for my various day jobs. Most recently I used codesearch quite extensively for a research project totaling 30gb across 1M files - https://freshchalk.com/blog/150k-small-business-website-teardown-2019. Thank you so much for creating and maintaining ripgrep! It's making our lives better and I appreciate your dedication. As to your original design questions:
This sentence makes me nervous - I anticipate many broken cases where people try to use un-indexed directories with indexed queries, since ripgrep is so wonderfully flexible with paths. This is likely to be more prevalent with a complex index location procedure... Maybe Happy to brainstorm more if that would be helpful. Very excited to start indexing! |
Thanks for working on this! I just wanted to share another usecase in case it helps. I work on case.law, where our core users are academic researchers doing NLP on US caselaw, and I've experimented before with using the regex crate to provide on-demand regex search and extraction of research data sets across the corpus. (Like, let's get a list of every dollar figure discussed in a US court decision along with metadata about the case.) The corpus is about 6 million documents and a couple hundred gigs of text, and we have limited web server budget so we can only share features publicly that can be efficiently implemented. The feature set you described sounds perfect -- e.g. exhaustive unranked search and managing reindexing ourselves are no problem. The design question I'd highlight is, is there anything in the library design that would interfere with standing up a public web server that uses the index? Either anything specific to the index, like adversarial inputs that play on the index design or ability to run multiple queries simultaneously, or interactions with issues that would apply even without an index, like the ability to paginate results or shut down queries that exceed resource limits. Feeling a little out of my depth, so those are just examples -- I'll just say I'm excited to use this to offer researchers an easier time extracting text from large online corpuses, and I hope the library design makes that doable. |
@gurgeous Sorry, I am just seeing your comment now! Sorry I missed it before.
:-) Thanks for the kind words! And thank you for posting feedback. It's exactly the kind of thing I was hoping to get!
Just to clarify here, In terms of ripgrep, its "search by index" functionality would be fairly limited if you don't have the original corpus available. The only thing it could really tell you is the set of files which might contain a match. Otherwise, ripgrep will still need to search the file itself to report matches. So if you're going to put a ripgrep index on a mobile app, you'll need to carry the original corpus with it. There will likely be ways to reduce corpus size. e.g., You could compress each individual file and ripgrep will decompress each before searching them when used with the The reason for doing this is primarily simplicity. With that said, this is the kind of thing that could be added later if there was a compelling motivation to do it. Comparatively speaking, changing the indexing infrastructure to support associating some additional blob of data isn't that big of a deal. The real problem with it, IMO, is that it begets more work. Because when you start including file content in the index itself, you are now also likely responsible for keeping it compressed to keep storage requirements lower. It's also something that will increase indexing time as well.
It's important when you want to minimize the latency between when the content changes and when it becomes available to search. Code is one such example, but really anything that has both frequently changing data and a desire for it to be searchable quickly fits this mold. There's no doubt that this is not all use cases, but I think it's an important quality of life enhancement. I think the cron job approach you describe is perhaps "good enough," but I'd really love it if you didn't have to resort to that. It's also a difference maker in that most search indexing tooling does not support this. (And it doesn't because it's hard!) So if I'm going to go through the trouble of building something like this, I should at least try to do something that makes it obviously better than alternatives. Also, a ripgrep index will be a directory, not a single file. And on top of that, I am currently planning on reversing course with respect to my lack-of-durability idea. I'd like to fully guarantee durability. I am perhaps biting off a bit more than I can chew, and maybe I won't start there, but that's my current plan. I have started work on planning the design for the underlying IR engine for this, and its scope is a bit bigger than just ripgrep's use case. (But not by much.) It's a bit disorganized at the moment and not totally consistent with itself, but if anyone's interested in the details, it's here: https://github.com/BurntSushi/nakala/blob/master/doc/PLAN.md
I'm also a strong proponent of failing fast, and I suspect I'll reverse course on my initial statement here. If only because it's the prudent option. It's much easier to turn an error into a non-error than the other way around. |
I think that in general, ripgrep was not designed with a threat model that allows untrusted input (whether that be regexes or corpora). While its default regex engine promises never to take more than linear time on input (i.e., it is not subject to catastrophic backtracking), that doesn't mean it will always be lightning fast on all inputs. And I think that generally applies to ripgrep-with-an-index as well. Consider what happens when you stand up your web server and a user decides to run the regex ripgrep-with-an-index does have one trick up its sleeve that could help you. Namely, if all searches use the index then it can know, quickly and cheaply, roughly how many files it will need to search to produce a complete set of results. If this number is too high (for example, the regex Now, what this doesn't prevent is the normal case of a regex being slow. Namely, if you have a document in your corpus that is big and the user has crafted a regex that runs particularly slowly on that document, then it wouldn't be hard for it to peg your CPU. Aside from that, no matter how fast ripgrep is, you would need some kind of rate limiting. Overall, exposing regexes to end users is very difficult to do safely in a way that is 100% bulletproof. ripgrep-with-an-index could make it better, but it's not something I would ever advertise as a feature. Or at least, if I did, there would need to be dedicated work invested into ensuring that a single search request couldn't peg the CPU. That would require specific feature development inside of ripgrep.
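A front end could apply the "too many candidate files" idea with a check along these lines. This is only a sketch in Python; the function name and the default threshold are assumptions, and ripgrep does not expose such a check.

```python
def admit_query(candidate_count, corpus_size, max_fraction=0.25):
    """Decide whether a query is selective enough to run.

    candidate_count: files the index says might match (cheap to compute).
    corpus_size: total number of indexed files.
    max_fraction: hypothetical policy knob chosen by the service operator.
    """
    if corpus_size == 0:
        return True
    return candidate_count / corpus_size <= max_fraction
```

A query whose candidate set covers most of the corpus gains little from the index, so rejecting it early costs only the index lookup.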
I think the "adversarial inputs" I covered above. Running multiple queries would be possible, certainly. The index will support multiple concurrent readers and writers, so you can invoke as many ripgrep processes as you like. I don't see result pagination as something that would ever get added to ripgrep. Since this is exhaustive unranked search, you could always just partition your corpus to approximate that. Shutting down queries that exceed limits is what I had in mind when I said "specific feature development inside of ripgrep." That has an outside possibility of happening, but not in the initial implementation. Whether it happens or not depends on how much it complicates ripgrep and whether it is enough to actually unlock compelling use cases. |
This has been tackled, albeit without the indexing part and at a smaller scale, by @simonw recently - see datasette-ripgrep. He may have something to say regarding its performance, security, scale etc. |
@kokes Yes, it looks like it uses a time limit to kill the ripgrep process: https://github.com/simonw/datasette-ripgrep/blob/dd97a44cd77367fa447a520e6cdbb99ef829b77f/datasette_ripgrep/__init__.py#L63-L71 That's probably the best you can do right now. |
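In the same spirit (this is not the code from that repository), a time limit around any child process can be sketched with Python's standard `subprocess` module. The helper name `run_with_time_limit` is made up here.

```python
import subprocess
import sys


def run_with_time_limit(argv, seconds):
    """Run argv and return its stdout, or None if it exceeds the time limit.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=seconds
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


# Demonstrated with the Python interpreter so the sketch runs without rg
# installed; a real caller would pass something like ["rg", pattern, path].
fast = run_with_time_limit([sys.executable, "-c", "print('hit')"], 30)
```

This bounds wall-clock time only. It does nothing about memory use or about many concurrent requests, so rate limiting is still needed.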
Thanks for these detailed thoughts, and the link! The contract on datasette-ripgrep's In case more detail helps understand the usecase: when implementing this I would add parameters to limit my It's also fine with me if this turns out to better handled with lower level tools like the |
The only other thing I can think of is memory usage. And that can come in two ways. Firstly, it's possible for pathological file contents to provoke very high memory usage. Usually these inputs come in the form of binary files. ripgrep has heuristics to prevent this in the default case (it replaces NUL bytes with Secondly, the other type of memory usage problem comes from the regex itself. This can actually be controlled via the I think that's all I've got at the moment. Please consider that I haven't put a ton of thought into this type of threat model, so I may be missing other attack vectors. |
Have you considered using Tantivy (https://github.com/tantivy-search/tantivy) for this use case? The main developer behind Tantivy has also recently started a company "QuickWit" (https://quickwit.io/) which builds on top of Tantivy and can provide direct full-text search of data stored in object storage (S3, etc). |
@BurntSushi, my use case: whenever I sync my local repo with the latest code, it should be able to create a delta index relative to the previous state and merge it into the current local index. |
If it's okay to throw in more examples of prior work, I'd look at plocate (https://plocate.sesse.net/). Even though in its current form it's more of a drop-in replacement for mlocate than a library/CLI for indexing/search, it looks like it can be adapted for a wide scope of use cases, or at least it could be interesting to read about their approach. The title page says:
Not advertised, but it also uses zstd compression. |
I'm somewhat late to the party, but would be really excited for this to land in ripgrep, and think you're taking exactly the right approach by deferring problems about detecting file changes and ranking results to whomever's invoking ripgrep. In terms of interface, my intuitive expectation for the available command line flags would be slightly different than what you proposed, something more like the following:
Logically, what ripgrep is doing is equivalent to...
Allowing this explicit overlay structure makes a few neat things possible:
|
All sounds great. One comment:
Consider storing the last modified date and comparing against it instead. Modification time is fairly well understood as an acceptable indicator of whether a file has changed, but normally you want to compare against the known value rather than just do a less-than check. The reason is that if something restores file modification times (an archive tool, etc.), the less-than check passes but a comparison against the recorded value would not. There is precedent for the original way (make, say), but I think more state-based tools compare against the known modified stamp. Per your comments you may already store this metadata, so it's just a choice to use it. This would also be useful if someone, say, copied or restored an index and its modified time changed in the process. That would cause it to assume it has all changes since it actually last ran. |
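The difference between the two checks can be shown with a few lines of Python. Both function names are invented for this illustration, and timestamps are plain numbers.

```python
def changed_by_order(mtime, indexed_at):
    """Ordering check: stale only if modified after it was indexed."""
    return mtime > indexed_at


def changed_by_equality(mtime, recorded_mtime):
    """Equality check: stale if the mtime differs from the one recorded."""
    return mtime != recorded_mtime


# A file with mtime 900 was indexed at time 1000. Later an archive tool
# restores a different version of it carrying an older mtime of 500.
recorded_mtime, indexed_at, restored_mtime = 900, 1000, 500
```

Here the ordering check reports the restored file as unchanged (500 is not after 1000), while the equality check notices that 500 differs from the recorded 900.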
Another potentially relevant project: https://github.com/Genivia/ugrep-indexer. It is interesting for having a configurable index size/search speed tradeoff. |
An alternative name for this issue is "ripgrep at scale."
I've been thinking about some kind of indexing feature in ripgrep for a long
time. In particular, I created #95 with the hope of adding fulltext support to
ripgrep. But every time I think about it, I'm stymied by the following hurdles:
state of the file system.
I think what I'm coming to realize is that we could just declare that we will
do neither of the above two things. My claim is that we will still end up
with something quite useful!
So I'd like to describe my vision and I would love to get feedback from folks.
I'm going to start by talking about the feature at a very high level and end
with a description of the new flags that would be added to ripgrep. I have
not actually built this yet, so none of this is informed by actually using
the tool and figuring out what works. I have built indexing tools before,
and my day job revolves around information retrieval, so I do have some
relevant background that I'm starting with.
While the below is meant more to describe the UX of the feature, I have added a
few implementation details in places mostly for clarity purposes.
I'd very much appreciate any feedback folks have. I'm especially interested in
whether the overall flow described below makes sense and feels natural (sans
the fact that this initially won't come with anything that keeps the index in
sync). Are there better ways of exposing an indexing feature?
I'd also like to hear feedback if something below doesn't make sense or if a question about how things work isn't answered. In particular, please treat the section that describes flags as if it were docs in a man page. Is what I wrote sufficient? Or does it omit important details?
Search is exhaustive with no ranking
I think my hubris is exposed by ever thinking that I could do code aware
relevance ranking on my own. It's likely something that requires a paid team of
engineers to develop, at least initially. Indeed,
GitHub is supposedly working on this.
The key to simplifying this is to just declare that searching with an index
will do an exhaustive search. There should be no precision/recall trade off and
no ranking. This also very nicely clarifies the target audience for ripgrep
indexing: it is specifically for folks that want searches to go faster and are
willing to put up with creating an index. Typically, this corresponds to
frequently searching a corpus that does not fit into memory. If your corpus
fits into memory, then ripgrep's existing search is probably fast enough.
Use cases like semantic search, ranking results or filtering out noise are
specifically not supported by this simplification. While I think these things
are very useful, I am now fully convinced that they don't belong in ripgrep.
They really require some other kind of tool.
ripgrep will not synchronize your index for you
I personally find the task of getting this correct to be rather daunting. While
I think this could potentially lead to a much better user experience, I think
an initial version of indexing support in ripgrep should avoid trying to do
this. In part to make it easier to ship this feature, and in part so that we
have time to collect feedback about usage patterns.
I do however think that the index should very explicitly support fast
incremental updates that are hard to get wrong. For example:
That is, ripgrep should not allow duplicates in its index.
precedes their indexed date. That is, when you re-index a directory, the only
cost you should pay is the overhead of traversing that directory and the cost
of checking for and indexing files that have been changed since the last time
they were indexed.
overhead as re-indexing many additional files at once. While it's likely
impossible to make the performance identical between these, the big idea here
is that the index process itself should use a type of batching that makes
re-indexing files easier for the user.
If I manage to satisfy these things, then I think it would be fairly
straight-forward to build a quick-n-dirty index synchronizer on top of
something like the notify crate. So, if it's easy, why not just put this
synchronization logic into ripgrep itself?
Because I have a suspicion that I'm wrong, and moreover, I'm not sure if I want
to get into the business of maintaining a daemon.
Indexes are only indexes, they do not contain file contents
An index generated by ripgrep only contains file paths, potentially file meta
data and an inverted index identifying which ngrams occur in each file. In
order for ripgrep to actually execute a search, it will still need to read each
file.
An index does not store any offset information, so the only optimizations that
an index provides over a normal search are the following:
file paths to search from the index.
every file. It only needs to read files that the index believes may
contain a match. An index may report a false positive (but never a false
negative), so ripgrep will still need to confirm the match.
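To illustrate the "false positives but never false negatives" property, here is a toy trigram index in Python. It handles plain literals only and keeps everything in memory. All names are made up for the example, and a real implementation would have to derive ngrams from arbitrary regexes.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return every overlapping 3-character substring of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of paths whose content contains it."""
    index = defaultdict(set)
    for path, content in docs.items():
        for gram in trigrams(content):
            index[gram].add(path)
    return index


def candidates(index, docs, literal):
    """Paths that may contain literal. A true match is never dropped."""
    grams = trigrams(literal)
    if not grams:
        # Shorter than 3 characters: the index cannot narrow anything down.
        return set(docs)
    return set.intersection(*(index.get(g, set()) for g in grams))


def search(index, docs, literal):
    """Confirm each candidate by actually scanning its content."""
    pattern = re.compile(re.escape(literal))
    return sorted(
        path for path in candidates(index, docs, literal)
        if pattern.search(docs[path])
    )


docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "quietly uicksand",  # has qui, uic and ick, but not "quick"
    "c.txt": "nothing relevant here",
}
index = build_index(docs)
```

For the literal `quick`, the index yields `a.txt` and `b.txt` as candidates. Only the confirming scan rules out `b.txt`, and `c.txt` is never read at all.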
Moreover, in terms of incrementally updating the index for a particular
directory tree, ripgrep should only need to read the contents of files that
either weren't indexed previously, or are reported by the file system to have
been modified since the last time it was indexed. (ripgrep will have an option
to forcefully re-index everything, in case the last modified time is for some
reason unreliable.)
Also, as mentioned above, in addition to the file path, ripgrep will associate
some meta data with each file. This will include the time at which the file was
indexed, but might also include other metadata (like a file type, other
timestamps, etc.) that might be useful for post-hoc sorting/filtering.
Indexes will probably not be durable
What I mean by this is that an index should never contain critical data. It
should always be able to be re-generated from the source data. This is not to
say that ripgrep will be reckless about index corruption, but its commitment
to durability will almost certainly not rise to the level of an embedded
database. That is, it will likely be possible to cut the power (or network) at
an inopportune moment that will result in a corrupt index.
It's not clear to me whether it's feasible or worth it to detect index
corruption. Needing to, for example, checksum the contents of indexed files on
disk would very likely eat deeply into the performance budget at search time.
There are certainly some nominal checks we can employ that are cheap but will
not rise to the level of robustness that one gets from a checksum.
This is primarily because ripgrep will not provide a daemon that can amortize
this cost. Instead, ripgrep must be able to quickly read the index off disk and
start searching immediately. However, it may be worthwhile to provide the
option to check the integrity of the index.
The primary implication from an end user level here isn't great, but hopefully
it's rare: it will be possible for ripgrep to panic as part of reading an index
in a way that is not a bug in ripgrep. (It is certainly possible to treat
reading the index as a fallible operation and return a proper error instead
when a corrupt index is found, but the fst crate currently does not do this
and making it do it would be a significant hurdle to overcome.)
In order to prevent routine corruption, ripgrep will adopt Lucene's "segment
indexing" strategy. A more in depth explanation of how this works can be found
elsewhere, but effectively, it represents a partitioning of the index. Once a
segment is written to disk, it is never modified. In order to search the index,
one must consult all of the active segments. Over time, segments can be merged.
(Segmenting is not just done to prevent corruption, but also for decreasing
the latency of incremental index updates.)
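The segment idea can be sketched in a few lines of Python. This is a toy in-memory model, not Lucene's or ripgrep's design: the class `SegmentedIndex` and its methods are hypothetical names, segments live in memory rather than on disk, and deletions and re-indexed files are left out entirely.

```python
from types import MappingProxyType


class SegmentedIndex:
    """Toy model: each segment maps a term to the set of files containing it."""

    def __init__(self):
        self.segments = []  # active segments, oldest first

    def add_segment(self, postings):
        """Freeze a batch of postings; the segment is never modified afterwards."""
        frozen = MappingProxyType({t: frozenset(fs) for t, fs in postings.items()})
        self.segments.append(frozen)

    def search(self, term):
        """A lookup must consult every active segment and union the results."""
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term, frozenset())
        return hits

    def merge(self):
        """Combine all segments into one new segment, then swap it in."""
        merged = {}
        for segment in self.segments:
            for term, files in segment.items():
                merged.setdefault(term, set()).update(files)
        self.segments = []
        self.add_segment(merged)
```

In this model an incremental update only appends a new, small segment, which is where the latency benefit comes from. Merging later trades that write-time saving back for fewer segments to consult at search time.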
Additionally, ripgrep will make use of advisory file locking to synchronize
concurrent operations.
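As a rough illustration of advisory locking, the Python sketch below uses `fcntl.flock`, which is available on Unix only. The lock file name `LOCK` and the helper `index_lock` are assumptions for this example; the RFC does not say which locking primitive or file layout ripgrep would use.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def index_lock(index_dir, exclusive):
    """Hold an advisory lock on a hypothetical `LOCK` file inside the index."""
    fd = os.open(os.path.join(index_dir, "LOCK"), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_EX for writers, LOCK_SH for readers; both block until granted.
        fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        yield fd
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)
```

A writer would wrap its update in `with index_lock(path, exclusive=True):`, while searchers could take the shared variant. Because the lock is advisory, it only coordinates processes that also ask for it.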
Indexes may not have a stable format at first
I expect this feature will initially be released in an "experimental" state.
That is, one should expect newer versions of ripgrep to potentially change the
index format. ripgrep will present an error when this happens, where the remedy
will be to re-index your files.
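One way such a check might look, sketched in Python: the index starts with a version field, and a reader that does not recognize the version reports an error whose remedy is to re-index. The 4-byte field, the constant `FORMAT_VERSION` and the exception `IndexFormatError` are all hypothetical.

```python
import struct

FORMAT_VERSION = 2  # hypothetical: the version this build knows how to read


class IndexFormatError(Exception):
    """Raised when an index was written in a format this build cannot read."""


def check_version(header: bytes) -> None:
    """Read a 4-byte little-endian version field and reject any mismatch."""
    if len(header) < 4:
        raise IndexFormatError("index header is truncated; please re-index")
    (found,) = struct.unpack("<I", header[:4])
    if found != FORMAT_VERSION:
        raise IndexFormatError(
            f"index format v{found} is not supported (expected "
            f"v{FORMAT_VERSION}); please re-index your files"
        )
```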
New flags
I don't think this list of flags is exhaustive, but I do think these cover
the main user interactions with an index. I do anticipate other flags being
added, but I think those will be more along the lines of config knobs that
tweak how indexing works.
I've tried to write these as if they were the docs that will end up in
ripgrep's man page. For now, I've prefixed the flags with --x- since I
suspect there will be a lot of them, and this will help separate them from the
rest of ripgrep's flags.
-X/--index
Enables index search in ripgrep. Without this flag set, ripgrep will never
use an index to search.
ripgrep will find an index to search using the following procedure. Once a
step has found at least one index, subsequent steps are skipped. If an index
could not be found, then ripgrep behaves as if -X was not given.
1. [...] index itself, then all of them are searched, in sequence.
2. If RIPGREP_INDEX_PATH is set to a valid index, then it is searched.
3. If there is a .ripgrep directory in the current working directory and it contains a valid index, then it is searched.
4. [...] directory.
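The "first step that finds an index wins" rule can be sketched as follows. Only the RIPGREP_INDEX_PATH step and the .ripgrep-in-the-current-directory step are modeled here. The helper names are hypothetical, and `is_valid_index` is a stand-in (it merely checks that a directory exists), since what makes an index "valid" is not defined in this RFC.

```python
import os


def find_indexes(steps):
    """Run lookup steps in order; the first step that finds anything wins."""
    for step in steps:
        found = step()
        if found:
            return found
    return []  # the caller then behaves as if -X was not given


def is_valid_index(path):
    """Placeholder validity test: a directory that exists. Purely illustrative."""
    return os.path.isdir(path)


def from_env(environ):
    """Step: use RIPGREP_INDEX_PATH if it points at a valid index."""
    path = environ.get("RIPGREP_INDEX_PATH")
    return [path] if path and is_valid_index(path) else []


def from_cwd(cwd):
    """Step: use a .ripgrep directory in the current working directory."""
    path = os.path.join(cwd, ".ripgrep")
    return [path] if is_valid_index(path) else []
```

A caller would pass the steps in precedence order, for example `find_indexes([lambda: from_env(os.environ), lambda: from_cwd(os.getcwd())])`.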
If -X is given twice, then ripgrep will stop searching if an index is present
and the query makes it unable to use the index.
--x-crud
Indexes one or more directories or files. If a file has already been indexed,
then it is re-indexed. By default, a file is only re-indexed if its last
modified time is more recent than the time it was last indexed. To force
re-indexing, use the --x-force flag. If a file that was previously indexed is
no longer accessible, then it is removed from the index.
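The per-file decision described above could be sketched like this in Python. The function `plan_update` and its return values are hypothetical, and the sketch assumes the index records a "last indexed" Unix timestamp for every path it contains.

```python
import os


def plan_update(path, last_indexed, force=False):
    """Return 'remove', 'reindex' or 'skip' for one previously indexed path.

    `last_indexed` is the Unix timestamp at which the path was last indexed.
    """
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return "remove"  # no longer accessible: drop it from the index
    if force or mtime > last_indexed:
        return "reindex"
    return "skip"
```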
The files that are indexed are determined by ripgrep's normal filtering
options.
The location of the index itself is determined by the following procedure. Once
a step has found an index, subsequent steps are skipped.
1. If --x-index-path is set, then the index is written to that path.
2. If RIPGREP_INDEX_PATH is set, then its value is used as the path.
3. [...] .ripgrep path and is a valid existing index, then it is updated.
4. [...] directory.
5. [...] current working directory at .ripgrep. If .ripgrep already exists and is not a valid index, then ripgrep returns an error.
If another ripgrep process is writing to the same index identified via the
above process, then this process will wait until the other is finished.
As an example, running ripgrep with --x-crud on the current directory will
create an index of all files (recursively) in that directory that pass
ripgrep's normal smart filtering. Running a search with -X from the same
directory will then search that index.
Note that the index created previously can now be updated incrementally. For
example, if you know that foo/bar/quux has changed, then you can run --x-crud
on just that path to tell ripgrep to re-index just that file (or directory).
This works even if foo/bar/quux has been deleted, where ripgrep will remove
that entry from its index.
Prior work
I believe there are three popularish tools that accomplish something similar to
what I'm trying to achieve here. That is, indexed search where the input is a
regex.
- codesearch, by Russ Cox, famously described in his "Regular Expression Matching with a Trigram Index" article. Other tools, like Hound, are based on codesearch.
- qgrep [...] here.
- livegrep [...] nice front-end for searching. It is described in more detail here.
(There are other similarish tools, like Recoll and zoekt, but these appear to
be information retrieval systems, and thus not exhaustive searching like what
I've proposed above.)
ripgrep's index system will most closely resemble Russ Cox's codesearch. That
is, at its core, it is an inverted index with ngrams as its terms. Each ngram
will point to a postings list, which lists all of the files that contain that
ngram. The main downside of codesearch is that it's closer to a proof of
concept of an idea rather than a productionized thing that will scale. Its most
compelling downside is its performance with respect to incremental updates.
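As a minimal sketch of an inverted index with ngrams as terms, the Python below fixes n = 3 and handles only literal queries. Turning a full regex into an ngram query, as Russ Cox's article describes, is out of scope here. The function names are hypothetical and the whole index lives in memory.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to its postings list: sorted names of files containing it."""
    postings = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            postings[gram].add(name)
    return {gram: sorted(names) for gram, names in postings.items()}


def candidates(index, literal):
    """Files that contain every trigram of `literal`.

    These are only candidates: each one must still be searched for real.
    """
    grams = trigrams(literal)
    if not grams:
        raise ValueError("literal is too short to use the index")
    lists = [set(index.get(gram, ())) for gram in grams]
    return sorted(set.intersection(*lists))
```

For example, with `files = {"a.txt": "hello world", "b.txt": "help wanted"}`, a lookup for `"hello"` intersects the postings lists for `hel`, `ell` and `llo`, which leaves only `a.txt` to be searched.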
qgrep and livegrep both represent completely different ways to tackle this
problem. qgrep actually keeps a compressed copy of the data from every file
in its index. This makes its index quite a bit larger than codesearch's, but
it can do well with spinning rust hard drives due to sequential reading. qgrep
does support incremental updates, but will eventually require the entire index
to be rebuilt after too many of them. My plan with ripgrep is to make
incremental updates a core part of the design such that a complete re-index is
never necessary.
livegrep uses suffix arrays. I personally haven't played with this tool yet,
but mostly because it seems like incremental updates are slow or hard here.