Performance issues #389
I don't have time to investigate, but asking for captures requires running the slower engines (likely the slowest in this case, since the input is large) to resolve the capture groups.
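For example, with a toy pattern (a minimal sketch, not from the benchmark in question), asking only for the match can stay on the fast path, while asking for the groups cannot:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(\w+)@([\w.]+)").unwrap();
    let haystack = "contact: alice@example.com";

    // Fast path: only the overall match is needed, so the faster
    // engines can answer this on their own.
    assert!(re.find(haystack).is_some());

    // Slower path: resolving the individual capture groups requires
    // running a slower engine over the same input.
    let caps = re.captures(haystack).unwrap();
    assert_eq!(&caps[1], "alice");
    assert_eq!(&caps[2], "example.com");
}
```
|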
Any news on this? I'm affected by it. :( |
If there was news, I would have posted it. Instead of posting a "me too" comment, it would be much more helpful to show data. If you're affected, then post a benchmark. |
I created a benchmark to extract data from HTTP server logs in the Common Log Format. You can find the code, including the expression I used, in jeschkies/common-log-parser-bench.
It takes a little less than 4 seconds to run through the 35M log file. This crate takes around 9 seconds, which is kind of surprising to me. nom, a parser combinator, takes just around 50ms. This leads me to the following questions.
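For reference, a line in the Common Log Format looks like this (the canonical example from the Apache docs):

```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
```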
Thanks for your efforts! |
@jeschkies I'd be happy to take a look, but I don't see any way to run your benchmark. Namely, I don't have your log file.
I don't know. ripgrep uses the regex crate, but it does have some tricks it uses to speed things up. I'd have to be able to run your program and ripgrep in order to answer this.
Well, your nom and regex parsers aren't equivalent. Whether that impacts performance, I don't know. Probably not. But just from a quick glance, I see these inconsistencies:
There may be other inconsistencies, but this is just what I could see quickly. Also, you could probably speed things up by working on
Yes, resolving capturing groups requires using a slower regex engine. In this particular case, as long as you fix your regex, it's likely to be amenable to a "one pass" optimization (roughly: the capture positions can be resolved in a single deterministic scan, because at each step at most one alternative can lead to a match). There was an attempt to add that here: #467. If this were easy to fix, it would have been done already. Sorry. nom looks like a fine answer to this problem in my opinion. Although personally, I'd probably just use a regex, and if the performance was not good enough, then I'd hand-write the parser. Also, from your repo:
That's not what I said. |
Thanks for your long answer 🙂
I have to check whether the file includes anything confidential and will upload it after a review.
This is a lot of help. I'll give the regex version another try!
I did not mean to imply it was easy. However, I'll take a look at #467 to gain a better understanding.
That is a very good point and the answer to my original question.
I'll remove that comment. Sorry for my misunderstanding. |
I added the logs for anyone who would like to run the benchmarks or dive a little deeper. |
@jeschkies Thanks! Your regex is still not equivalent to the nom program. Your matching is too greedy. This regex is a bit faster and closer in functionality to your nom parser:
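```
([0-9a-f\.:]+) ([\S\-]+) ([\S\-]+) \[([^\]]+)\] "([^"]*)" ([0-9]{3}) ([0-9]*)
```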
The only change I made was one that I had suggested previously. Making the same change to the Go regex causes it to run faster as well. You also aren't taking Unicode into account, which is a difference between what the Go regex is matching and what the Rust regex is matching: `\S` is Unicode-aware in this crate but ASCII-only in Go's regexp. For example, one way to opt out of Unicode is via the bytes API, since `(?-u)` isn't allowed on a `Regex` where it could match invalid UTF-8. A quick sketch (assuming the same pattern with Unicode disabled):
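```rust
use regex::bytes::Regex;

fn main() {
    // A non-Unicode variant of the regex above (illustrative): with
    // (?-u), \S matches any non-whitespace *byte*, as it does in Go.
    let re = Regex::new(
        r#"(?-u)([0-9a-f\.:]+) ([\S\-]+) ([\S\-]+) \[([^\]]+)\] "([^"]*)" ([0-9]{3}) ([0-9]*)"#,
    )
    .unwrap();
    let line: &[u8] =
        b"127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET / HTTP/1.0\" 200 2326";
    assert!(re.is_match(line));
}
```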
This does run a touch faster for me than the regex above, but not as fast as Go.

Finally, and perhaps most egregiously, your Go program is doing a lot less work than the Rust program! When you measure the performance of two things, you really need to make sure you're doing as close to an apples-to-apples comparison as possible. Your Rust program is iterating over all of the capture groups, extracting the data, parsing it and so on, whereas the Go program is just iterating over the matches and counting them. So if we cut out that extra processing, it might look like this:

```rust
impl<'t> Iterator for CommonLogParser<'t> {
    type Item = ();

    fn next(&mut self) -> Option<()> {
        self.captures.next().map(|_| ())
    }
}
```

This runs a little faster, but still not as fast as Go. Technically speaking, if that were all you wanted to do, then you wouldn't use the capture group APIs at all:

```rust
pub struct CommonLogParser<'t> {
    captures: regex::Matches<'static, 't>,
}

impl<'t> CommonLogParser<'t> {
    pub fn new(i: &'t str) -> Self {
        CommonLogParser {
            captures: RE.find_iter(i),
        }
    }
}

impl<'t> Iterator for CommonLogParser<'t> {
    type Item = ();

    fn next(&mut self) -> Option<()> {
        self.captures.next().map(|_| ())
    }
}
```

And this program runs almost as fast as your nom parser. The point here is that the regex crate's implementation of capturing groups is what is slow. Go's is also quite slow, although it looks to be just a little faster in this case. This performance pitfall is documented. The performance problems with capturing groups have to do with the technique employed and the time complexity guarantees made by this crate. I'm working on this problem, but it's a hard problem and the actual solution isn't totally clear.

With that said, there is actually something you can do to dramatically increase performance with your regex-based approach. That is, instead of looking for matches in the entire file all at once, why not instead iterate over each line? Since most lines match, this doesn't do a ton of extra work, and more importantly, it means you don't need to load your entire log file into memory at once. You can stream the lines from the file. (Although your program isn't set up for that; it would need some refactoring.) From the perspective of the regex engine, this actually lets it use a more efficient regex engine internally, since it searches one line at a time. A single line is much, much smaller than the entire file, and this permits some optimizations internally. Both Go and Rust have this optimization, so it will actually help both. Here's the Go program:
```go
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "regexp"
)

func main() {
    if err := run(); err != nil {
        log.Fatal(err)
    }
}

func run() error {
    re := regexp.MustCompile(
        `^([0-9a-f\.:]+) ([\S\-]+) ([\S\-]+) \[([^\]]+)\] "([^"]*)" ([0-9]{3}) ([0-9]*)`,
    )
    f, err := os.Open("data/small_access.log")
    if err != nil {
        return err
    }

    count := 0
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        caps := re.FindStringSubmatch(line)
        if caps != nil {
            count++
        }
    }
    if err := scanner.Err(); err != nil {
        return err
    }

    expected := 161761
    if count != expected {
        return fmt.Errorf("expected %d matches but found %d", expected, count)
    }
    return nil
}
```

And now here's a new version of the Rust parser that goes line by line:
```rust
use super::Labels;
use regex::Regex;

lazy_static! {
    static ref RE: Regex = Regex::new(
        r#"([0-9a-f\.:]+) ([\S\-]+) ([\S\-]+) \[([^\]]+)\] "([^"]*)" ([0-9]{3}) ([0-9]*)"#,
    )
    .unwrap();
}

pub struct CommonLogParser<'t> {
    lines: std::str::Lines<'t>,
}

impl<'t> CommonLogParser<'t> {
    pub fn new(i: &'t str) -> Self {
        CommonLogParser {
            lines: i.lines(),
        }
    }
}

impl<'t> Iterator for CommonLogParser<'t> {
    type Item = Labels<'t>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let line = self.lines.next()?;
            let cap = match RE.captures(line) {
                None => continue,
                Some(cap) => cap,
            };
            return Some(Labels {
                ip: cap.get(1).unwrap().as_str(),
                user: cap.get(2).unwrap().as_str(),
                frank: cap.get(3).unwrap().as_str(),
                date_time: cap.get(4).unwrap().as_str(),
                request: cap.get(5).unwrap().as_str(),
                response_code: cap[6].parse::<u16>().unwrap(),
                size: cap[7].parse::<u32>().unwrap(),
            });
        }
    }
}
```

And now the runtime results on my machine:
The nom parser is still the fastest, but at least Rust is faster than Go here and it's not as abysmally slow. (One can go further and cause the Go program to use the one-pass optimization I talked about above, but interestingly enough, it actually winds up being a bit slower in this case.) I think this is about as much detail as I could possibly give without going into the mechanics of the regex engine internals themselves. |
@BurntSushi thanks for taking so much time to dive into it. Go's captures return a slice with the position of each match, so I was under the impression that it also did similar work. I guess I was wrong 🙂
I completely failed to pay attention to an important bit in your section on mechanics. Just to give some context: I was investigating different solutions to parse the Common Log Format in Loki (grafana/loki#2783), a log aggregator. I wanted to find out if it was worth using a parser combinator or a hand-written parser. I'd argue the answer is yes and no. A regular expression can be fast enough. However, I'd also argue that not everyone would be able to write an expression that is fast. My attempt is an example. Anyhow, thanks again 🙏 |
To be clear, yes, getting the capture group slices does do similar work. However, your Rust code was doing additional work by parsing some of those groups into integers. |
A quick comparison with oniguruma (the "onig" crate) gives the following results:

example.html is just a saved copy of https://www.bbc.co.uk, but generally the same results can be reproduced on any input. Approximately 60 times slower. Is it by design or is it a bug?
The regular expression above is used in my crate victoria-dom.
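For anyone who wants to reproduce this kind of comparison, a minimal timing harness might look like the sketch below. The pattern here is a placeholder, not the actual expression from victoria-dom:

```rust
// Assumed Cargo dependencies: regex = "1", onig = "6".
use std::time::Instant;

fn main() {
    let haystack = std::fs::read_to_string("example.html").unwrap();
    let pattern = r"<a\s+[^>]*>"; // hypothetical pattern, for illustration only

    // Time the regex crate.
    let re = regex::Regex::new(pattern).unwrap();
    let start = Instant::now();
    let n = re.find_iter(&haystack).count();
    println!("regex: {} matches in {:?}", n, start.elapsed());

    // Time oniguruma via the onig crate.
    let re = onig::Regex::new(pattern).unwrap();
    let start = Instant::now();
    let n = re.find_iter(&haystack).count();
    println!("onig:  {} matches in {:?}", n, start.elapsed());
}
```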