Is ORing in a regular expression the fastest way to search in this scenario? #2808
I have multi-gigabyte gzip-compressed files, anywhere from 2 to 6 GB each. I will only be searching one file at a time. Will this be the fastest way to search? I am willing to trade more memory for faster searching if that helps. The environment will be an AWS Fargate container running Amazon Linux 2023. I am using version
Interestingly enough, I only care whether each of these expressions appears in the file at least once. If one occurs multiple times, I don't really care. Once an expression is found, there is no need to keep searching for that individual expression. Is there a way to take advantage of this stipulation? According to the
Thank you.
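To make the "stop looking for an expression once it has been seen" idea concrete, here is a minimal Python sketch. It is not how ripgrep works and it will be far slower than a native tool; it only illustrates the logic. The function name `first_hits` is hypothetical, and the sketch assumes the patterns are ASCII-compatible and do not use numbered backreferences or global inline flags, since they are joined into one alternation as a prefilter.

```python
import gzip
import re


def first_hits(path, patterns):
    """Return {pattern: line number of its first match} for a gzip file.

    Hypothetical helper: each pattern is dropped from the search as soon
    as it matches once, and reading stops when none are left.
    """
    remaining = {p: re.compile(p.encode()) for p in patterns}
    found = {}

    def build_prefilter():
        # OR together only the patterns that are still unseen.
        return re.compile(b"|".join(b"(?:" + p.encode() + b")" for p in remaining))

    prefilter = build_prefilter()
    with gzip.open(path, "rb") as stream:
        for lineno, line in enumerate(stream, start=1):
            if not prefilter.search(line):
                continue  # no unseen pattern matches this line
            # A line may satisfy several patterns, so test each one.
            for p in [p for p, rx in remaining.items() if rx.search(line)]:
                found[p] = lineno
                del remaining[p]
            if not remaining:
                break  # everything has been seen; skip the rest of the file
            prefilter = build_prefilter()
    return found
```

The early exit only saves time when every expression shows up before the end of the file; if any expression is absent, the whole file still has to be decompressed and scanned.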
Replies: 2 comments 1 reply
I think it's likely you'll be bottlenecked on gzip decompression. Especially if you're searching the same files multiple times, you might experiment with searching uncompressed files or even files using a different compression algorithm. There may also be much faster gzip decompression implementations than whatever your default `gzip -d` does (which is what ripgrep will use to do decompression). Beyond that, yeah, I don't see any other improvement.
Thanks for replying. I am only searching any given file once, and the files come from an external source. I am actually using rapidgzip, which can use multiple cores for decompression. For simplicity's sake, I just mentioned The exact command started with:
I just want to call this out in case someone else in a similar scenario is looking to decrease their run time.
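For readers who want to try the same approach from a script, the sketch below runs an external decompressor and scans its output as it arrives, stopping the decompressor once every expression has been seen. The exact command used in this thread is not shown above, so the command line here is an assumption: `gzip -dc` is the standard way to decompress to stdout, and a faster decompressor could be substituted after checking its own documentation for the right flags. The function name `search_decompressed` and the file name `big.log.gz` are hypothetical.

```python
import re
import subprocess


def search_decompressed(decompress_cmd, patterns):
    """Run decompress_cmd and return the set of patterns seen in its stdout.

    decompress_cmd is an argv list for any tool that writes decompressed
    data to stdout, e.g. ["gzip", "-dc", "big.log.gz"] (assumed example).
    """
    remaining = {p: re.compile(p.encode()) for p in patterns}
    seen = set()
    proc = subprocess.Popen(decompress_cmd, stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            for p in [p for p, rx in remaining.items() if rx.search(line)]:
                seen.add(p)
                del remaining[p]
            if not remaining:
                break  # every pattern has been seen; stop reading
    finally:
        proc.stdout.close()
        if proc.poll() is None:
            proc.terminate()  # decompressor still running, e.g. we stopped early
        proc.wait()
    return seen
```

Because decompression runs in a separate process, it overlaps with the matching loop, which is the same reason piping a multi-core decompressor into a search tool can beat single-threaded built-in decompression.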