-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
thread local cache may cause memory leaks #362
Comments
Is it possible to write a small reproducible example of this problem? The short answer is no: the thread local cache cannot be disabled. Setting it to a low value should reduce memory increases quite a bit because it will effectively stop the DFA from running. The DFA is the biggest user of the thread local cache, but not the only one. All of the regex engines require some type of mutable state while matching. If the only way to fix this is to disable the thread local cache, then the only way to make that happen is to introduce a lower level API that forces the caller to manage that state. |
@saarw Can you say more about what's happening? Is your program spawning an unbounded number of threads and never terminating them? |
The greenlighted answer to the question below describes how the thread pool in this environment can be trimmed and expanded. Rust thread locals are not destroyed just because the thread exits (from the thread_local crate "Per-thread objects are not destroyed when a thread exits. Instead, objects are only destroyed when the ThreadLocal containing them is destroyed.") so trimming and expanding a thread pool seems to create an unbounded growing number of thread locals, even though the number of actual threads that are alive at any time is bounded. |
Ah I see, then there might be a work-around. Presumably, in your current code (it would be nice to have a reproducible example) you are using either an Let me know if this doesn't make sense and I'll try to explain more with examples. |
I have a Rust lib with basically the following methods that is built as a DLL and runs in IIS
A C# program creates my collection of regexes using the init_regexes method and then runs tasks that match_regexes on the thread pool described above. The tasks run in sequence so the atomic ptr is just there as a memory fence to prevent partial creation before use, but the C# program does not have control of what thread that runs each task. I can certainly create a thread local regex collection for each new thread, but I don't have control over when a thread exits so I wouldn't know when to drop the copy of the regexes (or drop the clones as in your suggestion). |
Yeah, I'm not sure what to say. Regex matching requires some kind of mutable scratch space, so there must be some synchronization that happens if you're sharing a single regex across multiple threads. The alternative is to not share a single regex across multiple threads. It sounds like your environment is pretty constrained, but that's the trade off you have available to you today
If that's true, then I don't see how you could ever avoid a leak. A leak seems inevitable if you can't free the things you allocate. That means that even if there were a lower level API that provided a way for the caller to control allocation of the scratch space, then that also wouldn't be useful to you because you wouldn't know when to free the scratch space. What kind of solution are you expecting here? |
I would know when to free the scratch space - after each match operation (or at least before returning from the function call to my DLL). So either a caller-controlled cache, or a config option that makes the library simply drop all cached data at the end of each match operation would help me. |
@saarw That config option doesn't make sense. If all cached data is dropped after every match, then the performance will be so bad that you might as well use whatever C# regex library is available to you. The cached data is there for really good reasons. :-) In any case, you can do that today. Clone the regex before every match, then drop the regex after you're done with it. You can clone the regex from your global collection. When you do that, the thread local state in your collection will never be used. |
Yes, cloning the regexes before every match is the only workaround I've come up with so far, but agree that it may be too slow. We are really not that performance sensitive and were hoping to use Rust to be able to deploy it on a wide variety of platforms, but guess we may need a special non-native implementation for C#/IIS because we don't control the threading on that platform. I know you have a lot of experience of regex performance, do you perhaps know if there is any engine that does not persist state between match operations and that runs faster than clone-per-match rust-regex instances, even though it may be a lot slower than cached rust-regex? Maybe PCRE? RE2? |
The regex struct internally looks like this: struct Regex {
compiled: Arc<Internal>,
cached: CachedThreadLocal,
} This representation supports two cheap ways to use regexes in multithreaded environments. The first is the normal way: you put a However, the second way is to just clone the So for example, if you do: let re = my_global_regexes.Whatever.clone();
re.is_match(...)
for m in re.find_iter(...) { ... }
re.captures(...)
drop(re) then the same exact scratch space will be used for all calls to the Now, this will actually work just fine and should give you good performance. But this roughly assumes that you can amortize that
One important correction: I'm not advocating clone-per-match, but rather, clone-per-matches. All regex engines I know of require some kind of scratch space. It's inherent in how regex engines work. Rust's regex engine is based on RE2, so it works similarly, although RE2 actually tries to share the same scratch space across multiple threads for a single regex. This fixes your memory issue, but will make searching using the same regex from multiple threads slower. I'm less familiar with PCRE, but it's a backtracking engine, which means it must either use exponential time or space in some cases. And it must use some kind state to keep track of the match, capturing groups, etc etc. |
Yes, it definitely seems like RE2's concurrency model would work better for my current use case where we only have single, but different, threads accessing the regexes at any one time so there isn't any actual contention on the locks. For platforms where we do control the threading and can use the thread local cache, I'm also guessing RE2 may have smaller memory footprint with its single cache compared to the solution with thread local caches (considering a typical web server has about 200 threads). However, I will eventually need to handle the situation with concurrent accesses later as well and don't want RE2 to become a point of contention so I will definitely try the clone-per-matches approach with the rust-regex crate. The rust-regex functionality and focus on safe behavior is really nice! Thanks a lot for all the info! |
No problem! I'll keep noodling on this problem by the way. At some point, there may be more explicit control over allocations, but it won't be happening any time soon! :) |
I put together a benchmark app of using the Rust lib in a threadpool-friendly way with a new clone for every match vs using Java's regex engine in the same way and recreating its mutable parts (the Matcher object) for every match It seems the Rust lib gets about 7-8x slower than Java's implementation. I wonder if it's the thread local cloning operation that may be the culprit or if the Rust lib is just more reliant on its mutable state than other regex implementations. Guessing it would be easier for me to hack in a thread-unsafe cache in a fork of the rust lib than to try to integrate RE2 or other C-libs in my build process, if the threadpool-friendly performance turns out to be unacceptable... |
@saarw If you are literally doing the clone for every match, then yes, this regex engine will get very slow. The reason why is because the DFA itself is lazily generated during matching, and the DFA is itself stored in this scratch space. By clearing the scratch space, you're forcing the DFA to be regenerated, which is quite slow. The key is doing the clone for every group of matches. |
Ok, just out of interest I added the ability to specify batch sizes for the Rust lib where it only clones the regex every X number of matches. Seems the lib is only about 4x slower than Java with a batch size of 10, and stabilizes somewhere around 3x slower regardless of batch size (so the overhead for cloning for each match is just 2-3x vs batching). |
Tested a thread-unsafe cache version in a fork. I first started with an empty cache whenever the regex was cloned, not much effect on clone-per-match timing. Also tested cloning the existing cache when calling the clone operation and tried initializing the original regex cache with some warm-up data before starting the actual matches, but performance got worse. The experiments are in the last commits in this branch of the fork https://github.com/saarw/regex/tree/threadunsafe-perf Unfortunately, since we are using the library in a web app that processes requests in a thread pool, we can only group together data that comes from a single request to run through each regex instance, so we don't reach the batch volumes necessary to make this efficient enough for us but it doesn't seem to matter that the scratch space is allocated in a thread local. |
@saarw Very interesting stuff. I think I'm going to close this because I think we've reached a point at which we understand the fundamental trade offs: this implementation requires mutable scratch space for matching, and allocating a fresh space for every match causes the implementation to regress quite badly. I'm going to close this for now, and although I do see a future where some lower level access to the matching machines/state is available, it will probably be quite some time before that happens. |
OK! Yes, I'm focusing on trying to find some other library with different performance characteristics in relation to scratch space. That thread local use probably justifies a warning somewhere in the docs though, it's not really expected from a regex lib, and a lot of people seem to be trying to push Rust into the web backend space where thread pools aren't uncommon... (can the thread local cause memory leaks in regexes that get evaluated in parallel with Rayon? does it destroy threads when it no longer needs them?) edit: The Rayon docs warn "In the future, however, the default behavior may change to dynamically add or remove threads as needed.", so maybe this won't be causing memory leaks from parallel execution today, but will if Rayon gets extended. |
I don't know of any production grade regex library that doesn't utilize some kind of scratch space... Whether it's implicit on the stack or more explicit on the heap, you really need some kind of space proportional to the regex in order to execute a search. This problem is probably exacerbated in a regex engine based on finite state machines, since a backtracking engine can probably get away with using a lot less space. (But then you lose your linear time guarantee.)
Note that thread pools on their own aren't a problem. Your particular environment is quite a bit more constraining:
I think that if any one of those points weren't true, then there would have been a way to fix this in a satisfying way:
Yeah, it's talked about a bit in the performance guide: https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#using-a-regex-from-multiple-threads --- But I agree, it would be wise to add more details and the lessons learned from this ticket. So I'll morph this into a documentation bug. :-) Thanks for all your patience and helping to illuminate a weak point in the implementation. It will undoubtedly be on mind when I revisit how all this stuff works. |
Adding another note here. I took a look at how Go manages this with their |
Sure, the use of a scratch space is common, it's only the use of a thread local that I think should be documented (I only started suspecting it from seeing the import on the dependencies on regex's crates.io page). After all, it doesn't help to have the ability to act when a thread is created or goes away unless you also know that you need to do so. I guess the need for such documentation especially applies to languages like Rust, since even if GCed language did use a thread local, its GC would clean up the memory when the thread dies and prevent the memory leak. This has really given me something to think about when designing concurrent components in the future in Rust (and I will be keeping an eye out for thread locals in dependency pages for other crates). Thanks again for helping me sort this out! |
After discussing this with @BurntSushi on IRC for a bit, one potential solution would be to change the way thread_local allocates thread IDs to prefer reusing IDs of threads that have exited wherever possible. This would limit the number of entries in the ThreadLocal to the maximum number of threads live at any one point, and avoid unbounded growth. |
That sounds like a great improvement! Anything that helps clean up or limit thread local states of exited threads sounds very useful for thread locals in general. The change would make Rust thread locals a lot safer to use from other languages. It is probably still worth documenting the thread local use in the regex crate though. A non-reactive web server thread pool can often use ~200 threads, and a memory footprint of 2 MB regex cache x 200 threads per regex is probably worth warning about... |
This is now fixed in |
@Amanieu ah that's awesome! Thanks so much! |
Very nice! I guess that fix closes this issue since the memory shouldn't be leaking with that behavior. |
@saarw Nice! Were you able to confirm that? |
Sorry, can't easily reproduce it since I'm not personally working in the .Net integration where it happened and we swapped in Oniguruma to solve the problem there. |
The thread local cache causes a memory leak if you use the regex library in environments where you don't have direct control over threads that can be created and destroyed at the platform's whims, such as when executing tasks inside Microsoft IIS. Setting the dfa_size_limit to 0 does not eliminate the cache. Eliminating the cache should also reduce the memory footprint when you have lots of regexes in lots of threads.
Is it possible to make the thread_local cache optional?
The text was updated successfully, but these errors were encountered: