parser uses a surprising amount of memory for simple but long patterns #1090
Comments
I need a reproduction please. The code you've provided is not runnable by me. You can upload data files to GitHub issues.
I generated one. This is not the same as the one I've been using, but it has the same issue (gunzip first).
Okay, there are some knots to untangle here. First is that you aren't checking whether regex compilation succeeds or not. In the case you've given me, it does not, because it exceeds the default size limit. The way this works is that the regex is compiled eagerly, and if compilation exceeds the limit, it is stopped and an error is returned. That means that if you increase the limit and let the regex actually compile, it's likely to use more memory than what you're observing. We should also actually use the regex to check that it works. So let's do that first. I'll use my shell's built-in time to measure things:

use std::time::{Duration, Instant};
fn main() {
env_logger::init();
let (s, elapsed) = timeit(|| std::fs::read_to_string("file").unwrap());
println!("pattern read from file: {:?}", elapsed);
let (r, elapsed) = timeit(|| {
regex::RegexBuilder::new(&s)
.size_limit(100 * (1 << 20))
.build()
.unwrap()
});
println!("regex compiled: {:?}", elapsed);
let (m, elapsed) = timeit(|| {
r.find("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
.expect("a regex match")
});
println!("search finished: {:?}", elapsed);
assert_eq!(45, m.start());
}
fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
let start = Instant::now();
let t = f();
let elapsed = Instant::now().duration_since(start);
(t, elapsed)
}

And to compile and run it:
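One more note before moving on: if you don't raise the size limit, the failure is observable through the regex crate's error type. A minimal sketch, reusing the same pattern string s read above:

fn check(s: &str) {
    match regex::RegexBuilder::new(s).build() {
        // Compilation blew past the configured size limit; the payload
        // is the limit that was in effect.
        Err(regex::Error::CompiledTooBig(limit)) => {
            eprintln!("regex exceeded size limit of {} bytes", limit)
        }
        Err(err) => eprintln!("some other error: {}", err),
        Ok(_) => println!("compiled fine"),
    }
}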
All right, now we need a baseline to compare it to. You made a claim about the memory usage being greater than Python's, but you didn't provide an equivalent Python program. Thankfully, I know Python and how to run Python programs, so I'll do that part for you:

import re
import time
def main():
s, elapsed = timeit(lambda: open('file', 'r').read())
print(f"pattern read from file: {elapsed}s")
r, elapsed = timeit(lambda: re.compile(s))
print(f"regex compiled: {elapsed}s")
m, elapsed = timeit(
lambda: r.search("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
)
print(f"search finished: {elapsed}s")
assert m.start() == 45
def timeit(f):
start = time.time()
t = f()
elapsed = time.time() - start
return t, elapsed
if __name__ == '__main__':
main()

And to run it:
So the Python program uses less memory than the Rust program, but the difference is not as great as you mentioned. Generally speaking, this is not unexpected. Python uses a backtracking regex engine, so I would expect it to use less memory overall. The regex crate uses finite automata to guarantee linear time search, and this generally leads to greater memory usage.

Also, actually looking at your input, you do not have an alternation of literals. Most of your alternates are literals, but some are regexes due to the presence of meta characters. So let's escape each line and join the results with |:

use std::time::{Duration, Instant};
fn main() {
let (s, elapsed) = timeit(|| {
std::fs::read_to_string("file")
.unwrap()
.lines()
.map(regex::escape)
.collect::<Vec<String>>()
.join("|")
});
println!("pattern read from file: {:?}", elapsed);
let (r, elapsed) = timeit(|| {
regex::RegexBuilder::new(&s)
.size_limit(100 * (1 << 20))
.build()
.unwrap()
});
println!("regex compiled: {:?}", elapsed);
let (m, elapsed) = timeit(|| {
r.find("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
.expect("a regex match")
});
println!("search finished: {:?}", elapsed);
assert_eq!(45, m.start());
}
fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
let start = Instant::now();
let t = f();
let elapsed = Instant::now().duration_since(start);
(t, elapsed)
}

And Python:

import re
import time
def main():
s, elapsed = timeit(
lambda: '|'.join(map(re.escape, open('file', 'r').read().splitlines()))
)
print(f"pattern read from file: {elapsed}s")
r, elapsed = timeit(lambda: re.compile(s))
print(f"regex compiled: {elapsed}s")
m, elapsed = timeit(
lambda: r.search("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
)
print(f"search finished: {elapsed}s")
assert m.start() == 45
def timeit(f):
start = time.time()
t = f()
elapsed = time.time() - start
return t, elapsed
if __name__ == '__main__':
main()

And now to run the programs:
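Before the results, one note for posterity: the load-bearing change here is regex::escape, which quotes every meta character so that each alternate really is a literal. A tiny sanity check of what it does:

fn main() {
    // `.`, `|`, and `?` are all meta characters, so escape quotes them
    assert_eq!(regex::escape("a.b|c?"), r"a\.b\|c\?");
}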
Memory usage for the Rust program dropped a little bit, but not by much. However, the search time dropped quite a bit. This is because the regex engine is no longer compiling a full regex, but rather just building an Aho-Corasick automaton, and the search itself is likely using Aho-Corasick as well. To check whether memory usage is what I would expect, I ran the Rust program above under a heap profiler.

I'll leave this bug open for now to see if I can investigate it, but if you really only need to look for an alternation of literals, you can just use the aho-corasick crate directly:

use std::time::{Duration, Instant};
fn main() {
env_logger::init();
let (patterns, elapsed) = timeit(|| {
std::fs::read_to_string("file")
.unwrap()
.lines()
.map(|line| line.to_string())
.collect::<Vec<String>>()
});
println!("pattern read from file: {:?}", elapsed);
let (r, elapsed) = timeit(|| {
aho_corasick::AhoCorasick::builder()
.match_kind(aho_corasick::MatchKind::LeftmostFirst)
.build(&patterns)
.unwrap()
});
println!("aho-corasick compiled: {:?}", elapsed);
let haystack = format!("{}zygote", "!".repeat(10_000));
let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
println!("search finished: {:?}", elapsed);
assert_eq!(10_000, m.start());
}
fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
let start = Instant::now();
let t = f();
let elapsed = Instant::now().duration_since(start);
(t, elapsed)
}

And now to compile and run it, we can see memory usage is way lower, including lower than Python:
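A note on MatchKind::LeftmostFirst in the program above: it makes the automaton report matches with the same leftmost-first semantics as a regex alternation, i.e., the pattern listed first wins at a given starting position. A minimal sketch:

use aho_corasick::{AhoCorasick, MatchKind};

fn main() {
    let ac = AhoCorasick::builder()
        .match_kind(MatchKind::LeftmostFirst)
        .build(["Samwise", "Sam"])
        .unwrap();
    // With leftmost-first semantics, "Samwise" (pattern 0) wins even
    // though "Sam" also matches at the same position.
    let m = ac.find("Samwise gamgee").unwrap();
    assert_eq!(0, m.pattern().as_usize());
}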
I'm sorry I left some issues in the example program, but they weren't present in my actual program (in which I escaped meta characters, got the size_limit error, raised it, and it worked correctly). Btw, I didn't measure max memory. Instead I looked at the resident memory during a long run. I've switched to aho-corasick. Glad to hear the memory usage is lower than Python's now.
Aye. This is the sort of thing that should have been in the original issue. You need to provide the full context for not only how you're producing a problem but how you're measuring it. Notice how I did that in my comment and it enabled you to see a discrepancy between our measurement methods!
Sadly, this looks a little tricky to fix, even with a band-aid. One possible band-aid is to introduce some boxing inside the Ast. It's not clear to me if more fixes will be needed beyond that. If so, it might require deeper changes to the parser, or perhaps even a "fast" path. But if we can get the Ast small enough, that may be sufficient.
I'll do that next time. (I thought it meant looking at whatever *top / process-monitor program reports. Probably I shouldn't have written the issue right after I hurried through coding and debugging.) I have another question: why doesn't the parser free its memory after compilation is done? Such a large memory peak is still an issue for a router, though at least not so much for a desktop / server.
It does. The issue is that looking at memory usage via top (i.e., RSS as reported by the OS) is a flawed measurement. Just because this library frees the memory doesn't mean the allocator frees it back to the OS. The allocator might keep it around for other uses. You should be able to confirm this by inserting a shim via a global allocator to report total memory retained. (I have such a shim, and I will share it next time I have my hands on a keyboard. I'm on mobile at the moment.)
Also, if someone is using this library in a constrained environment such as a router, I would be happy to hear more detailed reports of issues that one runs into.
I see. It's a pity though. This may be a good reason to have swap space, so this kind of never-going-to-be-used-again memory can be swapped out to make room for other uses.
Sorry, I don't have my own router, and I don't feel comfortable messing with the ISP's one. People have asked about using my program on their routers, though, and that's one of the reasons I switched from Python to Rust.
I haven't put swap on any of my machines for over a decade. And yes, Rust isn't magic. Python's regex engine is written in C. And it's a backtracker. So its memory usage is in principle not going to be obviously worse than any other regex engine written in Rust or C.
Here is a program with the allocator shim:

use std::time::{Duration, Instant};
fn main() {
let (s, elapsed) = timeit(|| {
std::fs::read_to_string("file")
.unwrap()
.lines()
.map(regex::escape)
.collect::<Vec<String>>()
.join("|")
});
println!("pattern read from file: {:?}", elapsed);
allocator::reset();
let (r, elapsed) = timeit(|| {
regex::RegexBuilder::new(&s)
.size_limit(100 * (1 << 20))
.build()
.unwrap()
});
allocator::tell();
println!("regex compiled: {:?}", elapsed);
let haystack = format!("{}zygote", "!".repeat(10_000));
let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
println!("search finished: {:?}", elapsed);
assert_eq!(10_000, m.start());
}
fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
let start = Instant::now();
let t = f();
let elapsed = Instant::now().duration_since(start);
(t, elapsed)
}
mod allocator {
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
static TOTAL_ALLOC: AtomicUsize = AtomicUsize::new(0);
static TOTAL_DEALLOC: AtomicUsize = AtomicUsize::new(0);
#[global_allocator]
static ALLOCATOR: TotalAllocator = TotalAllocator;
struct TotalAllocator;
unsafe impl GlobalAlloc for TotalAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
TOTAL_ALLOC.fetch_add(layout.size(), Ordering::Relaxed);
System.alloc(layout)
}
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
TOTAL_DEALLOC.fetch_add(layout.size(), Ordering::Relaxed);
System.dealloc(ptr, layout)
}
}
pub(crate) fn reset() {
TOTAL_ALLOC.store(0, Ordering::Relaxed);
TOTAL_DEALLOC.store(0, Ordering::Relaxed);
}
pub(crate) fn tell() {
let total_alloc = TOTAL_ALLOC.load(Ordering::Relaxed) as isize;
let total_dealloc = TOTAL_DEALLOC.load(Ordering::Relaxed) as isize;
eprintln!(
"allocated: {}, retained: {}",
total_alloc,
total_alloc - total_dealloc,
);
}
}

And then compile and run it:
So there is a lot being allocated overall, but only 1.45MB is retained.
OK, I've got memory usage down below the Python program. Specifically, this Rust program:

use std::time::{Duration, Instant};
fn main() {
let (s, elapsed) = timeit(|| {
std::fs::read_to_string("file")
.unwrap()
.lines()
.map(regex::escape)
.collect::<Vec<String>>()
.join("|")
});
println!("pattern read from file: {:?}", elapsed);
let (r, elapsed) = timeit(|| {
regex::RegexBuilder::new(&s)
.size_limit(100 * (1 << 20))
.build()
.unwrap()
});
println!("regex compiled: {:?}", elapsed);
let haystack = format!("{}zygote", "!".repeat(10_000_000));
let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
println!("search finished: {:?}", elapsed);
assert_eq!(10_000_000, m.start());
}
fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
let start = Instant::now();
let t = f();
let elapsed = Instant::now().duration_since(start);
(t, elapsed)
}

And this Python program:

import re
import time
def main():
s, elapsed = timeit(
lambda: '|'.join(map(re.escape, open('file', 'r').read().splitlines()))
)
print(f"pattern read from file: {elapsed}s")
r, elapsed = timeit(lambda: re.compile(s))
print(f"regex compiled: {elapsed}s")
haystack = ('!' * 10_000_000) + 'zygote'
m, elapsed = timeit(lambda: r.search(haystack))
print(f"search finished: {elapsed}s")
assert m.start() == 10_000_000
def timeit(f):
start = time.time()
t = f()
elapsed = time.time() - start
return t, elapsed
if __name__ == '__main__':
main()

with the data given by the OP:
Note that I'm not 100% sure when this fix is going to be released, since it will require putting out a new release of regex-syntax.
This puts every Ast value behind a box to conserve space. It makes things like Vec<Ast> quite a bit smaller than what they would be otherwise, which is especially beneficial for the representation of concatenations and alternations. This doesn't quite solve the memory usage problems though, since an AstKind is still quite big (over 200 bytes). The next step will be boxing each of the variants of an AstKind which should hopefully resolve the issue. Ref #1090
The AstKind experiment proved unfruitful. I think the issue here is that the savings on Vec<Ast> didn't prove to be enough to offset the extra heap allocation that resulted from the indirection. Boxing each Ast value, but not the variants of AstKind, seems to be a sweet spot. It would be nice to get Ast down below 16 bytes, but it's not clear how to do that (without much larger changes that I don't feel inclined to pursue). Fixes #1090
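To illustrate the mechanism with hypothetical types (not the real regex-syntax definitions): a variant with a large inline payload drags the size of the whole enum up to that payload, and a Vec pays that size for every element; boxing the payload keeps each element pointer-sized.

use std::mem::size_of;

#[allow(dead_code)]
enum Inline { Big([u8; 200]), Empty }
#[allow(dead_code)]
enum Boxed { Big(Box<[u8; 200]>), Empty }

fn main() {
    // Every element of a Vec<Inline> occupies at least 200 bytes...
    println!("Inline: {} bytes per element", size_of::<Inline>());
    // ...while a Vec<Boxed> element is roughly one pointer plus a tag.
    println!("Boxed: {} bytes per element", size_of::<Boxed>());
}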
What version of regex are you using?
1.9.5
Describe the bug at a high level.
I compiled a regex with a lot of literals as alternatives (~700KiB) and it consumed way too much memory (256MiB).
What are the steps to reproduce the behavior?
Prepare a text file with content like
abc|def|text|...
but about 700KiB long, with ~64K |s (it could be generated from a dict file; see the sketch below). And then look at the process's memory usage.
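A sketch of one way to generate such a file, assuming a system word list at /usr/share/dict/words (the path and output file name are illustrative):

use std::io::Write;

fn main() -> std::io::Result<()> {
    // join every dictionary word into one big alternation
    let words = std::fs::read_to_string("/usr/share/dict/words")?;
    let pattern = words.lines().collect::<Vec<&str>>().join("|");
    let mut f = std::fs::File::create("file")?;
    f.write_all(pattern.as_bytes())
}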
What is the actual behavior?
It took up about 256MiB of memory. Python only used 31MiB.
What is the expected behavior?
I expect it to use less memory than Python, not a lot more.