Spontaneously high CPU usage #890
Alright, happening on another server now, with same versions. Take a look at that --
SecData dir:
And again, nothing in strace output.. But CPU continues to be consumed! |
Compiled apr with debug on.
I should also note this all is happening when running in detectiononly mode. |
Hi @celesteking, thank you for the detailed description. What is your ModSecurity version? Is any public rule set loaded? Any custom rules? Is the version of the APR library that was dynamically loaded the same as the one ModSecurity was compiled with? |
2.8.0, apache 2.2 mpm worker fixed thread count, suphp. Yes, same version of apr library. open crs ruleset with a couple of rules disabled and some minor adjustments:
|
Getting same behaviour with modsec 2.9.0 , same apache version (2.2) & config. |
Anyone having similar problems? Basically, the module is unusable at current state. |
we are currently experiencing the same behavior on a CentOS 7.1 System: |
I ran into a similar problem as well. Processes consuming 100% CPU due to threads becoming stuck. This is on a CentOS 7.1 CPanel server with Apache 2.4.16 running event MPM, mod_security version 2.9.0 with the OWASP ModSecurity Core Rule Set installed. Running in detection only mode. Looking at the thread stacks with gdb revealed the stacks pasted below. I didn't do any further digging.
|
I get strange CPU activity too. #991 |
Please give it a try with release 2.9.2. If the problem persists let us know. Thanks. |
Problem persists. EL6 latest, stock kernel.
|
This might be related to mlogc being specified via |
We're seeing this bug after a big update to our WAF. Today, after deploying it to 35 servers, within an hour Apache'd broken down and tried to use 100% of CPU on 7 of those servers. There's nothing in strace on the broken processes, and ltrace shows why: they're all infinite loops, executing a dummy function over and over again:
which is consistent with the backtrace above. I only see four places in the entire modsec 2.9.2 repo where apr_pool_cleanup_null is registered as a cleanup function, so it should be easy to at least shift blame to Apache. Although I don't have anything like an easy "do this and you see the bug" replication, I can replicate it fairly reliably, and with cgroups CPU limitations I can do that while leaving some CPU to spare for any investigations, so please let me know (within a week, ideally) if there's some investigation you'd like me to perform. We're not doing anything with mlogc and SecAuditLog. |
I see that the issues you're reporting are related with ModSecurity 2.9.2 released in July/2017. Since the change set between 2.9.2 and 2.9.3 is quite large, there's a chance that this is not an issue anymore. Please let us know if the issue persists with 2.9.3 (released Dec/2018). Thanks! |
@victorhora I've tested with 2.9.3 and get the same behavior. The loop itself is in apr:memory/unix/apr_pools.c:
so modsec is giving this a circular list somehow. |
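To illustrate why a circular cleanup list shows up as 100% CPU with nothing in strace, here is a minimal Python sketch. It is not APR's code: the `Cleanup` class, `run_cleanups`, and the `max_steps` guard are hypothetical names invented for this illustration. It only models the general shape of "walk a singly linked list and call each cleanup", with a no-op function standing in for something like apr_pool_cleanup_null.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(eq=False)
class Cleanup:
    """Hypothetical stand-in for one node of a pool's cleanup list."""
    name: str
    fn: Callable[[], None]
    next: Optional["Cleanup"] = None


def has_cycle(head):
    """Floyd's tortoise-and-hare: True if following .next never reaches None."""
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    return False


def run_cleanups(head, max_steps=1000):
    """Walk the list, calling each cleanup; give up after max_steps calls.

    A plain C loop over the list has no such limit, so a cycle would spin
    forever in user space without making any system calls.
    Returns the number of calls made, or None if the limit was hit.
    """
    steps = 0
    node = head
    while node is not None:
        if steps >= max_steps:
            return None
        node.fn()
        node = node.next
        steps += 1
    return steps


def noop():
    """Plays the role of a do-nothing cleanup function."""


# A healthy list: a -> b -> c -> None
c = Cleanup("c", noop)
b = Cleanup("b", noop, c)
a = Cleanup("a", noop, b)
healthy_steps = run_cleanups(a)   # finishes after visiting every node

# Corrupt the list so that c points back at a.
c.next = a
corrupt_steps = run_cleanups(a)   # only stops because of the max_steps guard
```

In this model the healthy walk terminates once it reaches the end of the list, while the corrupted walk only stops because of the artificial step limit. A cycle check such as `has_cycle` is the kind of test one could run from a debugger script against a stuck process's list, assuming the node layout is known.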
I'm not sure if we'll be able to fix it on 2.9.x since this issue doesn't seem to happen on the current version of libModSecurity (3.x). Still I'm reopening this one for further investigation. |
@jrfondren you're not using mlogc, right? Can you provide more information about your environment: the CRS version, any custom rules, custom configuration, custom Apache directives, loaded modules, and anything else particular to the setup?

Given some reports related to cPanel (cc @celesteking), can you confirm whether you're using ModSecurity from cPanel? There have been some known issues with modules that cPanel uses. See #712.

Also, there have been cases in the past where people had issues with apr_pool_cleanup calls due to filesystem problems such as a full disk or a lack of permissions to clean up temporary files (e.g. ip.pag collection data), so please check that too.

In addition, as @zimmerle mentioned, please ensure that Apache and ModSecurity are running against the correct APR (i.e. the one included with Apache and not a system-installed version). Usually apachectl -V will give you this, but you can also check the first few lines of the error_log when Apache starts with ModSecurity. Please provide these log lines too.

You may also want to check whether the optional global mutex configuration introduced in 112ba45 is a valid workaround for you. It's a workaround for v2 due to some known issues with handling collections with APR. See #1224.

If possible, please provide Apache's error_log and the ModSecurity debug_log from when you reproduce the issue, so we can get a better clue as to why this problem is happening only in your environments. Thanks |
We've seen this problem on Apache 2.2 servers with APR 1.5.2 using some of our own builds, and on Apache 2.4 servers with APR 1.6.3 using cPanel's EA4 RPMs. In both cases the APR is the same as what mod_security2.so was compiled with. Mutexes around APR cleanup sound like they might be helpful, but that patch seems to be concerned only with database access. You probably don't put pointers into the database, right?
With ModSecurity 2.9.2:
That PCRE complaint is a concern, but it's not present on the Apache 2.4 servers.
We're not using mlogc. We have these configuration options (with the *Access options temporarily Off to confirm that they aren't related to the bug):
We're upgrading from a few-years-old version of Atomicorp's rules to a current version. The current version is a little larger and, except for using @ipcheck and @ipCheckFromFile instead of @pm-based workarounds, and probably other modernizations, the rules are very similar in complexity and construction. https://wiki.atomicorp.com/wiki/index.php/Atomic_ModSecurity_Rules The performance problem has been observed to start at times when there are no logs at all from modsec. Since the bug is the creation of an infinite loop, which only becomes a problem later when APR tries to clean it up forever, there's not going to be anything timely in the logs. |
Alright, I'm able to reproduce this on 2.9.3 built from v2.9.3 tag against cpanel-provided apr 1.6.3 on EL6, no global mutex patch applied. |
Replication:
The single request lays the trap, and then once some traffic causes Apache to try to expire children, the trap is sprung. |
The following replication works in seconds on vanilla cPanel CentOS 6 and CentOS 7 environments with a worker MPM and suPHP. It was replicated with a very old CentOS 7 image, and then again after massive updates. Replication: add the following to a vanilla cPanel configuration. Don't have any modsec rules in addition to this. More SecRules, as well as steps that slow down Apache's ability to spawn PHP processes, will not fix the bug, but will make it harder to trigger. We've had large buggy configs run for half a day before Apache finally broke down.
And run an ApacheBench with high concurrent connections and a request that will trigger the above SecRule and also spawn a PHP process with suPHP:
A simple timthumb.php in the docroot with "hello world" in it is enough, as long as suPHP is the handler. If ltrace isn't working for you and you're not sure if you're seeing the bug or not, some additional signs are
|
Some more information about this bug: it's a data race(?) within libapr in a highly-threaded environment as with the Worker MPM. In one thread libapr will break a data structure, and then before it can go on to unbreak the data structure, elsewhere libapr will fork() and the "unbreak" steps won't come along. Even though libapr intends to then exec(), it tries to clean up a bit first, and this clean-up step is unreliable with broken data structures. suphp is involved as it's a fork of mod_cgi and causes Apache to fork rapidly. Worker MPM is involved as it brings the threads that can't finish their work around a fork. modsecurity is involved as it registers a cleanup job once per regex compilation and this is the other side of the libapr data race that's causing the bug. The single modsec rule that we can replicate the bug with is involved as it contains a SERVER_NAME and therefore is constantly getting recompiled. |
Great work! I was able to replicate that reliably, too, with the simple regex rule specified above. If apr is compiled with APR_POOL_DEBUG, httpd won't even start because of the error: The error is produced because, apparently, modsecurity is using apr functions in a weird way, and that bites later on, exactly as seen in this whole bug report. Looking at |
This example httpd.conf reproduces the bug with only core Apache modules and ModSecurity. I was able to isolate the corruption of the linked list to the functions in re_operators.c that are using rule->ruleset->mp (a global per-process pool) in the threads that handle individual HTTP requests. These should use storage pools that are assigned to the HTTP request rather than a pool that is shared by all the worker threads in the process. I'll submit a pull request with fixes for the pool usage in re_execute.c in a moment. |
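As a rough illustration of why a pool shared by all worker threads can end up with a circular cleanup list while per-request pools cannot, here is a small Python model. It is a sketch under stated assumptions, not ModSecurity's or APR's code: the `Pool` and `Node` classes, the three step functions, and the particular interleaving are all hypothetical. It assumes a pool that recycles freed cleanup nodes through a free list and registers a cleanup in separate unsynchronized steps, and it replays by hand one unlucky ordering of two threads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical cleanup node."""
    label: str
    next: Optional["Node"] = None


class Pool:
    """Hypothetical pool: an active cleanup list plus a free list of reusable nodes."""

    def __init__(self, recycled_label):
        self.cleanups = None
        self.free_cleanups = Node(recycled_label)


# Registration broken into the steps one thread performs, with no locking.
def read_free(pool):
    """Step 1: look at the head of the free list."""
    return pool.free_cleanups


def unlink_free(pool, node):
    """Step 2: remove that node from the free list."""
    pool.free_cleanups = node.next


def link_active(pool, node):
    """Step 3: push the node onto the active cleanup list."""
    node.next = pool.cleanups
    pool.cleanups = node


def interleaved_register(pool_t1, pool_t2):
    """Replay one interleaving: both threads do step 1 before either continues."""
    n1 = read_free(pool_t1)      # thread 1
    n2 = read_free(pool_t2)      # thread 2
    unlink_free(pool_t1, n1)     # thread 1
    link_active(pool_t1, n1)     # thread 1
    unlink_free(pool_t2, n2)     # thread 2
    link_active(pool_t2, n2)     # thread 2


def length(head, limit=100):
    """Count nodes; return None if the walk exceeds limit (i.e. a cycle)."""
    n = 0
    while head is not None:
        if n >= limit:
            return None
        head = head.next
        n += 1
    return n


# Both threads register on the same process-wide pool.
shared = Pool("recycled")
interleaved_register(shared, shared)
shared_len = length(shared.cleanups)

# Each thread registers on its own per-request pool, same step ordering.
req1, req2 = Pool("r1"), Pool("r2")
interleaved_register(req1, req2)
request_lens = (length(req1.cleanups), length(req2.cleanups))
```

With the shared pool, both threads pick up the same recycled node, and the second push links the node to itself, so the walk never ends. With one pool per request the same step ordering is harmless, because no state is shared between the two threads. Whether this exact sequence is the one that occurred in the bug is not established by the thread; it is only one way such a list could become circular.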
This PR fixed our problem in a very busy cluster |
@lightsey Did you spend quite a while tracking this down? It's burned a lot of our time and we could never reproduce it in our dev environments. |
IIRC, my approach for pinpointing the source of the problem was:
It's definitely a difficult problem to debug because the linked list corruption becoming evident in apr_pool_cleanup_for_exec() happens well after the bug that corrupts the linked list of cleanups. |
fixed as part of #2049 |
This bug is hard to replicate, but I'll try to describe it here.
We've activated modsec on our servers, but with some we're noticing extremely high cpu usage. It manifests itself after running for some time in detection mode.
We have mlogc logging enabled.
It might be caused by graceful restart, although I'm not sure -- no way to replicate reliably.
Here's what happens with a process that's consuming 100% CPU:
strace tells nothing (as far as I tried). apache running perfectly fine without modsec (SecRuleEngine Off).
Hopefully, will have more info as we gather stats.