Performance issues #389

Closed · khvzak opened this issue Aug 2, 2017 · 10 comments

khvzak commented Aug 2, 2017

A quick comparison with oniguruma (the "onig" crate) gives the following results:

extern crate regex;
extern crate onig;

use std::io::prelude::*;
use std::fs::File;
use std::time::SystemTime;

#[allow(non_snake_case)]
fn main() {
    // Sub-pattern for HTML attributes, interpolated into TOKEN_RE_STR below.
    let ATTR_RE_STR: String = String::new() +
        r"([^<>=\s/]+|/)" +
        r"(?:" +
            r"\s*=\s*" +
            r"(?s:" +
                r#""(.*?)""# +
            r"|" +
                r"'(.*?)'" +
            r"|" +
                r"([^>\s]*)" +
            r")" +
        r")?\s*";

    // Tokenizer pattern; capture group 12 is the trailing rest of the input.
    let TOKEN_RE_STR: String = String::new() +
        r"(?is)" +
        r"([^<]+)?" +
        r"(?:" +
            r"<(?:" +
                r"!(?:" +
                    r"DOCTYPE(\s+\w+.*?)" +
                r"|" +
                    r"--(.*?)--\s*" +
                r"|" +
                    r"\[CDATA\[(.*?)\]\]" +
                r")" +
            r"|" +
                r"\?(.*?)\?" +
            r"|" +
                r"\s*([^<>\s]+\s*(?:" + &ATTR_RE_STR + r")*)" +
            r")>" +
        r"|" +
            r"(<)" +
        r")?" +
        r"(.*)$";

    let mut f = File::open("example.html").unwrap();
    let mut buffer = String::new();
    f.read_to_string(&mut buffer).unwrap();

    {
        use regex::Regex;

        let re = Regex::new(&TOKEN_RE_STR).unwrap();
        let now = SystemTime::now();
        if let Some(caps) = re.captures(&buffer) {
            let rest = caps.get(12).map(|c| c.as_str()).unwrap_or("");
            let elapsed = now.elapsed().unwrap();
            println!("[rust-regex] elapsed: {:.5} sec; rest = {}", elapsed.as_secs() as f64 + elapsed.subsec_nanos() as f64 * 1e-9, rest.len());
        }
    }

    {
        use onig::{Regex, Syntax, REGEX_OPTION_NONE};

        let re = Regex::with_options(&TOKEN_RE_STR, REGEX_OPTION_NONE, Syntax::perl()).unwrap();
        let now = SystemTime::now();
        if let Some(caps) = re.captures(&buffer) {
            let rest = caps.at(12).unwrap_or("");
            let elapsed = now.elapsed().unwrap();
            println!("[onig] elapsed: {:.5} sec; rest = {}", elapsed.as_secs() as f64 + elapsed.subsec_nanos() as f64 * 1e-9, rest.len());
        }
    }
}

example.html is just a saved copy of https://www.bbc.co.uk, but generally the same results can be reproduced on any input.

# cargo run --release
   ...
   Compiling onig v1.5.0
   Compiling regex v0.2.2
   ...
[rust-regex] elapsed: 0.40805 sec; rest = 292869
[onig] elapsed: 0.00644 sec; rest = 292869

Approximately 60 times slower. Is this by design, or is it a bug?
The regular expression above is used in my crate victoria-dom.

@BurntSushi (Member)

I don't have time to investigate, but asking for captures requires running the slower engines (likely the slowest in this case, since the input is large) to resolve the capture groups.
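
For illustration, here is a minimal sketch of the difference, with a hypothetical pattern and input: Regex::find only has to report the overall match bounds, while Regex::captures must also resolve each group, which forces the slower path.

use regex::Regex;

fn main() {
    // Hypothetical pattern and input, just to contrast the two calls.
    let re = Regex::new(r"(\w+)=(\w+)").unwrap();
    let haystack = "key=value";

    // `find` only reports the overall match bounds; the fast engine
    // can answer this on its own.
    let m = re.find(haystack).unwrap();
    assert_eq!(m.as_str(), "key=value");

    // `captures` must also resolve each group's bounds, which requires
    // running a slower engine over the matched region.
    let caps = re.captures(haystack).unwrap();
    assert_eq!(&caps[1], "key");
    assert_eq!(&caps[2], "value");
}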

Kixunil commented May 15, 2019

Any news on this? I'm affected by it. :(

BurntSushi commented May 15, 2019

If there was news, I would have posted it. Instead of posting a "me too" comment, it would be much more helpful to show data. If you're affected, then post a benchmark.

jeschkies commented Dec 29, 2020

I created a benchmark to extract data from HTTP server logs in the Common Log Format. You can find the code in jeschkies/common-log-parser-bench. I used ripgrep as my baseline.

› /usr/bin/time rg '([\da-f\.:]*) (.*) (.*) \[(.*)\] "(.*)" (\d{3}) (\d*).*' data/small_access.log -r '$1 , $2, $3 ,$4 , $5, $6, $7' > /dev/null
3.68user 0.00system 0:03.70elapsed 99%CPU

It takes a little less than 4 seconds to run through the 35 MB log file.

This crate takes around 9 seconds, which is kind of surprising to me. nom, a parser combinator library, takes just around 50 ms. This leads me to the following questions.

  1. How come the regex implementation is so much slower than rg? Is there something wrong with my benchmark? I'd really appreciate a look at my code. It's just 50 lines without tests.
  2. How can I help? @BurntSushi, you mentioned that captures might use a slower engine. Are there known alternatives I could help implement?

Thanks for your efforts!

BurntSushi commented Dec 29, 2020

@jeschkies I'd be happy to take a look, but I don't see any way to run your benchmark. Namely, I don't have your data/small_access.log input file. Please either share that or reproduce the benchmark on something that we both can see.

How come the regex implementation is so much slower than rg?

I don't know. ripgrep uses the regex crate. But ripgrep does have some tricks it uses to speed things up. I'd have to be able to run your program and ripgrep in order to answer this.

Is there something wrong with my benchmark?

Well, your nom and regex parsers aren't equivalent. Whether that impacts performance, I don't know. Probably not. But just from a quick glance, I see these inconsistencies:

  • In the regex parser, you use \d, which is Unicode aware, but in the nom parser, you only accept hex digits. Thus, the regex parser is more permissive.
  • In the regex parser, you use things like \[(.*)\] while the nom parser uses your surrounded combinator. They look similar, but the regex pattern will match something like [foo]bar], whereas your nom parser would stop matching at the first ]. It's the same thing with the "(.*)" regex.
  • You also use things like (.*) which seem inappropriate when compared to your nom parser. "frank", for example, is parsed by taking as many characters as possible until a space is observed, but your regex greedily matches everything. The equivalent in regex would be something like ([^ ]+).
  • Since your regex permits the size (at the end, via (\d*)) to both contain any Unicode digit and an arbitrary number of them, cap[7].parse::<u32>().unwrap() is not guaranteed to succeed. That is, your parser can panic on some inputs, whereas your nom parser would return an error.

There may be other inconsistencies, but this is just what I could see quickly.

Also, you could probably speed things up by working on &[u8] instead of &str. The regex crate supports this.
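
For reference, a minimal sketch of the &[u8] route via regex::bytes, with a contrived pattern and input:

use regex::bytes::Regex;

fn main() {
    // regex::bytes::Regex searches raw bytes, so the haystack never
    // needs to be validated as UTF-8 first.
    let re = Regex::new(r"([0-9]{3}) ([0-9]+)").unwrap();
    let haystack: &[u8] = b"GET /index.html HTTP/1.0 200 2326";

    if let Some(caps) = re.captures(haystack) {
        // Groups come back as byte slices; decode only what you need.
        let status = std::str::from_utf8(&caps[1]).unwrap();
        assert_eq!(status, "200");
    }
}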

2. How can I help? @BurntSushi, you mentioned that captures might use a slower engine. Are there known alternatives I could help implement?

Yes, resolving capturing groups requires using a slower regex engine. In this particular case, as long as you fix your regex, it's likely to be amenable to a "one pass" optimization. There was an attempt to add that here: #467

If this were easy to fix it would have been done already. Sorry. nom looks like a fine answer to this problem in my opinion. Although personally, I'd probably just use a regex and if the performance was not good enough, then I'd hand-write the parser.

Also, from your repo:

According to a Stackoverflow answer by @BurntSushi, the rg author, regex is not optimized for matches yet.

That's not what I said.

jeschkies commented Dec 29, 2020

Thanks for your long answer 🙂

Namely, I don't have your data/small_access.log input file.

I have to check whether the file includes anything confidential and will upload it after a review.

Well, your nom and regex parsers aren't equivalent.

This is a lot of help. I'll give the regex version another try!
EDIT: I improved the expression a little, and your suggestions almost halved the runtime. Thanks 🙂

If this were easy to fix it would have been done already.

I did not mean to imply it was easy. However, I'll take a look at #467 to gain a better understanding.

nom looks like a fine answer to this problem in my opinion. Although personally, I'd probably just use a regex and if the performance was not good enough, then I'd hand-write the parser.

That is a very good point and the answer to my original intent.

According to a Stackoverflow answer by @BurntSushi, the rg author, regex is not optimized for matches yet.

That's not what I said.

I'll remove that comment. Sorry for my misunderstanding.

@jeschkies

I added the logs for anyone who would like to run the benchmarks or dive a little deeper.

@BurntSushi (Member)

@jeschkies Thanks! Your regex is still not equivalent to the nom program; your matching is too greedy. This regex is a bit faster and closer in functionality to your nom parser:

r#"([0-9a-f\.:]+) ([\S\-]+) ([\S\-]+) \[([^\]]+)\] "([^"]*)" ([0-9]{3}) ([0-9]*).*\n"#

The only change I made was one that I had suggested previously: use "([^"]*)" instead of "(.*)". Yours is arguably incorrect, since as mentioned above (using a different example), it will match something like "foo"bar", whereas the former will not.
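
A tiny sketch of that difference on a contrived input:

use regex::Regex;

fn main() {
    let greedy = Regex::new(r#""(.*)""#).unwrap();
    let strict = Regex::new(r#""([^"]*)""#).unwrap();
    let input = r#""foo"bar" trailing"#;

    // The greedy dot runs past the first closing quote...
    assert_eq!(&greedy.captures(input).unwrap()[1], r#"foo"bar"#);
    // ...while the negated class stops at it.
    assert_eq!(&strict.captures(input).unwrap()[1], "foo");
}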

Making the same change to the Go regex causes it to run faster as well. You also aren't taking Unicode into account, which is a difference between what the Go regex is matching and what the Rust regex is matching. For example, \S matches any non-whitespace ASCII byte in Go's regex engine, but in Rust's, it matches any non-whitespace Unicode scalar value. The latter set is much larger and, at least in theory, would require more overhead at search time. However, fixing this makes things awkward. You either need to switch to bytes::Regex, which deals with &[u8] instead of &str and is less convenient, or you could make the regex more complicated by restricting things to ASCII everywhere:

r#"([0-9a-f\.:]+) ([[\S&&\p{Ascii}]\-]+) ([[\S&&\p{Ascii}]\-]+) \[([[^\]&&\p{Ascii}]]+)\] "([[^"]&&\p{Ascii}]*)" ([0-9]{3}) ([0-9]*).*\n"#

This does run a touch faster for me than the regex above, but not as fast as Go.
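
If you do go the bytes::Regex route, note that it also permits (?-u), which makes classes like \S byte-oriented without spelling out the ASCII restriction everywhere. A minimal sketch, with a contrived pattern and input:

use regex::bytes::Regex;

fn main() {
    // In bytes::Regex, `(?-u)` disables Unicode mode, so `\S` matches
    // a single non-whitespace byte rather than a whole scalar value.
    let re = Regex::new(r"(?-u)(\S+) (\S+)").unwrap();
    let caps = re.captures(b"alpha beta").unwrap();
    assert_eq!(&caps[1], &b"alpha"[..]);
    assert_eq!(&caps[2], &b"beta"[..]);
}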

Finally, and perhaps most egregiously, your Go program is doing a lot less work than the Rust program! When you measure the performance of two things, you really need to make sure you're doing as close to an apples-to-apples comparison as possible. Your Rust program is iterating over all of the capture groups, extracting the data, parsing it and so on, whereas the Go program is just iterating over the matches and counting them. So if we cut out that extra processing, it might look like this:

impl<'t> Iterator for CommonLogParser<'t> {
    type Item = ();

    fn next(&mut self) -> Option<()> {
        self.captures.next().map(|_| ())
    }
}

This runs a little faster, but still not as fast as Go. Technically speaking, if that were all you wanted to do, then you wouldn't use captures_iter here, but find_iter instead:

pub struct CommonLogParser<'t> {
    captures: regex::Matches<'static, 't>,
}

impl<'t> CommonLogParser<'t> {
    pub fn new(i: &'t str) -> Self {
        CommonLogParser{
            captures: RE.find_iter(i),
        }
    }
}

impl<'t> Iterator for CommonLogParser<'t> {
    type Item = ();

    fn next(&mut self) -> Option<()> {
        self.captures.next().map(|_| ())
    }
}

And this program runs almost as fast as your nom parser.

The point here is that the regex crate's implementation of capturing groups is what is slow. Go's is also quite slow, although it looks to be just a little faster in this case. This performance pitfall is documented.

The performance problems with capturing groups have to do with the technique employed and the time complexity guarantees made by this crate. I'm working on this problem, but it's a hard problem and the actual solution isn't totally clear.

With that said, there is actually something you can do to dramatically increase performance with your regex based approach. That is, instead of looking for matches in the entire file all at once, why not instead iterate over each line? Since most lines match, this doesn't do a ton of extra work, and more importantly, it means you don't need to load your entire log file into memory at once. You can stream the lines from the file. (Although your program isn't set up for that, it would need some refactoring.)
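
A minimal streaming sketch along those lines; it only counts matching lines rather than extracting fields, and assumes the benchmark's data/small_access.log path:

use std::fs::File;
use std::io::{BufRead, BufReader};

use regex::Regex;

fn main() -> std::io::Result<()> {
    let re = Regex::new(
        r#"([0-9a-f\.:]+) ([\S\-]+) ([\S\-]+) \[([^\]]+)\] "([^"]*)" ([0-9]{3}) ([0-9]*)"#,
    ).unwrap();

    let mut count = 0;
    // Stream one line at a time instead of reading the whole file
    // into memory up front.
    for line in BufReader::new(File::open("data/small_access.log")?).lines() {
        let line = line?;
        if re.captures(&line).is_some() {
            count += 1;
        }
    }
    println!("{} matching lines", count);
    Ok(())
}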

From the perspective of the regex engine, this actually lets it use a more efficient regex engine internally since it searches one line at a time. A single line is much much smaller than the entire file, and this permits some optimizations internally. Both Go and Rust have this optimization, so it will actually help both. Here's the Go program:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
)

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}

func run() error {
	re := regexp.MustCompile(
		`^([0-9a-f\.:]+) ([\S\-]+) ([\S\-]+) \[([^\]]+)\] "([^"]*)" ([0-9]{3}) ([0-9]*)`,
	)

	f, err := os.Open("data/small_access.log")
	if err != nil {
		return err
	}

	count := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		caps := re.FindStringSubmatch(line)
		if caps != nil {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}

	expected := 161761
	if count != expected {
		return fmt.Errorf("Expected %d matches but found %d", expected, count)
	}
	return nil
}

And now here's a new regex-line module for your setup:

use super::Labels;

use regex::Regex;

lazy_static! {
    static ref RE: Regex = Regex::new(r#"([0-9a-f\.:]+) ([\S\-]+) ([\S\-]+) \[([^\]]+)\] "([^"]*)" ([0-9]{3}) ([0-9]*)"#).unwrap();
}

pub struct CommonLogParser<'t> {
    lines: std::str::Lines<'t>,
}

impl<'t> CommonLogParser<'t> {
    pub fn new(i: &'t str) -> Self {
        CommonLogParser{
            lines: i.lines(),
        }
    }
}

impl<'t> Iterator for CommonLogParser<'t> {
    type Item = Labels<'t>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let line = self.lines.next()?;
            let cap = match RE.captures(line) {
                None => continue,
                Some(cap) => cap,
            };
            return Some(Labels {
                ip: cap.get(1).unwrap().as_str(),
                user: cap.get(2).unwrap().as_str(),
                frank: cap.get(3).unwrap().as_str(),
                date_time: cap.get(4).unwrap().as_str(),
                request: cap.get(5).unwrap().as_str(),
                response_code: cap[6].parse::<u16>().unwrap(),
                size: cap[7].parse::<u32>().unwrap(),
            });
        }
    }
}

And now the runtime results on my machine:

$ time go/regex

real    0.830
user    0.872
sys     0.045
maxmem  8 MB
faults  0

$ time ./target/release/parse nom

real    0.088
user    0.060
sys     0.027
maxmem  35 MB
faults  0

$ time ./target/release/parse regex

real    4.784
user    4.755
sys     0.027
maxmem  36 MB
faults  0

$ time ./target/release/parse regex-line

real    0.375
user    0.351
sys     0.023
maxmem  36 MB
faults  0

The nom parser is still the fastest, but at least Rust is faster than Go here and it's not as abysmally slow.

(One can go further and cause the Go program to use the onepass optimization I talked about above, but it actually winds up being a bit slower in this case, interestingly enough.)

I think this is about as much detail as I could possibly give without going into the mechanics of regex engine internals themselves.

@jeschkies

@BurntSushi thanks for taking so much time to dive into it.

The Go captures return slices with the positions of each match. Thus I was under the impression that it also did similar work. I guess I was wrong 🙂

From the perspective of the regex engine, this actually lets it use a more efficient regex engine internally since it searches one line at a time. A single line is much much smaller than the entire file, and this permits some optimizations internally.

I completely failed to pay attention to an important bit in your section on Mechanics for ripgrep. It says, "This approach is problematic primarily because, in the common case, finding a match is rare." I thought it would be faster to read the whole file at once, but obviously not when there is a match on each line.

Just to give some context: I was investigating different solutions to parse the Common Log Format in Loki (grafana/loki#2783), a log aggregator. I wanted to find out if it was worth using a parser combinator or a hand-written parser. I'd argue the answer is yes and no. A regular expression can be fast enough. However, I'd also argue that not everyone would be able to write an expression that is fast. My attempt is an example.

Anyhow, thanks again 🙏

@BurntSushi (Member)

The Go captures return slices with the positions of each match. Thus I was under the impression that it also did similar work. I guess I was wrong

To be clear, yes, getting the capture group slices does do similar work. However, your Rust code was doing additional work by parsing some of those groups into integers.
