Is ORing in a regular expression the fastest way to search in this scenario? #2808

jftuga · 2024-05-15T21:07:55Z

I have multi-gigabyte gzip-compressed files, anywhere from 2 to 6 GB each. I will only be searching one file at a time. Is the command below the fastest way to search them? I am willing to trade more memory for faster searching if that helps. The environment will be an AWS Fargate container running Amazon Linux 2023. I am using ripgrep version 14.1.0.

rg -z 'MySQL dump|Go SQL Dump|Dump completed on|db_schema_information|db_server_url|CREATE DATABASE |SET @@SESSION.SQL_LOG_BIN ?= ?0' file.sql.gz

Interestingly enough, I only care whether each of these expressions appears in the file at least once. If one occurs multiple times, I don't really care. Once an expression has been found, there is no need to keep searching for that individual expression. Is there a way to take advantage of this?

According to the --trace flag (on my M1 MacBook, used for development):

rg: DEBUG|rg::flags::parse|crates/core/flags/parse.rs:97: no extra arguments found from configuration file
rg: DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1260: found hostname for hyperlink configuration: 9d9376737078
rg: DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:1270: hyperlink format: ""
rg: DEBUG|rg::flags::hiargs|crates/core/flags/hiargs.rs:174: using 1 thread(s)
rg: TRACE|grep_regex::matcher|crates/regex/src/matcher.rs:66: final regex: "(?:(?:MySQL dump)|(?:Go SQL Dump)|(?:Dump completed on)|(?:db_schema_information)|(?:jss_server_url)|(?:CREATE DATABASE )|(?:(?:SET @@SESSION)[\0-\t\u{b}-\u{10ffff}](?:SQL_LOG_BIN) ?= ?0))"
rg: TRACE|grep_regex::literal|crates/regex/src/literal.rs:75: skipping inner literal extraction, existing regex is believed to already be accelerated
rg: DEBUG|globset|crates/globset/src/lib.rs:453: built glob set; 0 literals, 0 basenames, 12 extensions, 0 prefixes, 0 suffixes, 0 required extensions, 0 regexes
rg: TRACE|rg::search|crates/core/search.rs:254: <stdin>: binary detection: BinaryDetection(Convert(0))
rg: TRACE|grep_searcher::searcher|crates/searcher/src/searcher/mod.rs:743: generic reader: searching via roll buffer strategy
rg: TRACE|grep_searcher::searcher::core|crates/searcher/src/searcher/core.rs:65: searcher core: will use fast line searcher

Thank you.

Answered by BurntSushi

BurntSushi · 2024-05-15T21:35:14Z

Maintainer

I think it's likely you'll be bottlenecked on gzip decompression. Especially if you're searching the same files multiple times, you might experiment with searching uncompressed files or even files using a different compression algorithm. There may also be much faster gzip decompression implementations than whatever your default gzip -d does (which is what ripgrep will use to do decompression).

Beyond that, yeah I don't see any other improvement.
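One way to check the decompression-bottleneck claim on a given file is to time decompression alone against decompression plus matching. The following is a minimal sketch, assuming Python's standard `gzip` and `re` modules as stand-ins for `gzip -d` and ripgrep. The function names are hypothetical, and the absolute numbers will not match ripgrep's. Only the ratio between the two timings is informative.

```python
import gzip
import re
import time

# Alternation copied from the question as written (the "." in
# "SESSION.SQL_LOG_BIN" is left unescaped, exactly as in the original command).
PATTERN = re.compile(
    rb"MySQL dump|Go SQL Dump|Dump completed on|db_schema_information"
    rb"|db_server_url|CREATE DATABASE |SET @@SESSION.SQL_LOG_BIN ?= ?0"
)

def time_decompress_only(path, chunk_size=1 << 20):
    """Decompress the whole file and discard it; return (seconds, bytes read)."""
    start = time.perf_counter()
    total = 0
    with gzip.open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            total += len(chunk)
    return time.perf_counter() - start, total

def time_decompress_and_search(path):
    """Decompress and test every line; return (seconds, matching line count)."""
    start = time.perf_counter()
    hits = 0
    with gzip.open(path, "rb") as fh:
        for line in fh:
            if PATTERN.search(line):
                hits += 1
    return time.perf_counter() - start, hits
```

If the two timings are close, matching is cheap relative to decompression, and changing the regex is unlikely to help much.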


jftuga · 2024-05-15T22:19:07Z

Author

Thanks for replying. I am only searching any given file once, and the files come from an external source for which gzip is the only option.

I am actually using rapidgzip, which can use multiple cores for decompression. For simplicity's sake, I just mentioned -z above. As a note to others, rapidgzip completed searching a 1.2 GB file in about 2.5 minutes whereas gzip would take close to 9 minutes.

The exact command started with:

rapidgzip -d -c -P 0 file.sql.gz | rg ...

I just want to call this out in case someone else in a similar scenario is looking to decrease their run time.
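The "each expression only needs to be found once" idea from the original question can be sketched as a small filter at the end of such a pipeline. This is an illustration, not something ripgrep provides: the function name and the dictionary labels are hypothetical, and the patterns are copied from the question. The sketch stops reading as soon as every pattern has been seen. When the reading process then exits, the upstream decompressor would normally be terminated by the closed pipe, so the saving comes from decompressing less data.

```python
import re

# Patterns copied from the question; the labels are hypothetical.
PATTERNS = {
    "mysql_dump": rb"MySQL dump",
    "go_sql_dump": rb"Go SQL Dump",
    "dump_completed": rb"Dump completed on",
    "db_schema_information": rb"db_schema_information",
    "db_server_url": rb"db_server_url",
    "create_database": rb"CREATE DATABASE ",
    "sql_log_bin": rb"SET @@SESSION.SQL_LOG_BIN ?= ?0",
}

def find_first_occurrences(stream, patterns=PATTERNS):
    """Return {label: 1-based line number of first match}.

    Stops reading from `stream` once every pattern has matched, and stops
    testing a pattern as soon as it has matched once.
    """
    remaining = {label: re.compile(rx) for label, rx in patterns.items()}
    found = {}
    for lineno, line in enumerate(stream, start=1):
        matched = [label for label, rx in remaining.items() if rx.search(line)]
        for label in matched:
            found[label] = lineno
            del remaining[label]
        if not remaining:
            break  # nothing left to look for; skip the rest of the input
    return found
```

To use it in the pipeline above, one could call `find_first_occurrences(sys.stdin.buffer)` from a small script and pipe the `rapidgzip` output into it. Python's matching is far slower than ripgrep's, so this is only likely to win when all the patterns appear early in the file. If any pattern is absent, the whole stream is still read.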

1 reply

BurntSushi May 15, 2024
Maintainer

Yeah, that still seems like you're blocked on decompression. Idk how big your data is uncompressed, but on my workstation, ripgrep searches through >10GB files in a second or two.
