Skip to content

Spontaneously high CPU usage #890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
celesteking opened this issue May 23, 2015 · 28 comments
Closed

Spontaneously high CPU usage #890

celesteking opened this issue May 23, 2015 · 28 comments
Assignees
Milestone

Comments

@celesteking
Copy link

This bug is hard to replicate, but I'll try to describe it here.
We've activated modsec on our servers, but with some we're noticing extremely high cpu usage. It manifests itself after running for some time in detection mode.
We have mlogc logging enabled.

It might be caused by graceful restart, although I'm not sure -- no way to replicate reliably.

Here's what happens with a process that's consuming 100% CPU:

(gdb) bt
#0  0x00002b36cebad969 in run_child_cleanups () from /usr/local/apache/lib/libapr-1.so.0
#1  0x00002b36cebad9af in cleanup_pool_for_exec () from /usr/local/apache/lib/libapr-1.so.0
#2  0x00002b36cebad9c6 in cleanup_pool_for_exec () from /usr/local/apache/lib/libapr-1.so.0
#3  0x00002b36cebad9c6 in cleanup_pool_for_exec () from /usr/local/apache/lib/libapr-1.so.0
#4  0x00002b36cebad9eb in apr_pool_cleanup_for_exec () from /usr/local/apache/lib/libapr-1.so.0
#5  0x00002b36cebba4e9 in apr_proc_create () from /usr/local/apache/lib/libapr-1.so.0
#6  0x00002b36cf1e86c2 in suphp_script_handler (r=0x1122ceb8) at mod_suphp.c:953
#7  0x00002b36cf1e8ef9 in suphp_handler (r=0x1122ceb8) at mod_suphp.c:569
#8  0x000000000044ac13 in ap_run_handler ()
#9  0x000000000044b4dc in ap_invoke_handler ()
#10 0x00000000004b9ddd in ap_internal_redirect ()
#11 0x00000000004e1f11 in handler_redirect ()
#12 0x000000000044ac13 in ap_run_handler ()
#13 0x000000000044b4dc in ap_invoke_handler ()
#14 0x00000000004b9ddd in ap_internal_redirect ()
#15 0x00000000004b90d3 in ap_die ()
#16 0x00000000004b92ca in ap_process_request ()
#17 0x00000000004b5abd in ap_process_http_connection ()
#18 0x00000000004546cf in ap_run_process_connection ()
#19 0x0000000000454b33 in ap_process_connection ()
#20 0x00000000004e3157 in process_socket ()
#21 0x00000000004e3a82 in worker_thread ()
#22 0x00002b36cebbb3a1 in dummy_worker () from /usr/local/apache/lib/libapr-1.so.0
#23 0x00000037c180683d in start_thread () from /lib64/libpthread.so.0
#24 0x00000037c10d4fcd in clone () from /lib64/libc.so.6

strace tells nothing (as far as I tried). apache running perfectly fine without modsec (SecRuleEngine Off).

# httpd  -V
Server version: Apache/2.2.29 (Unix)
Server built:   May 14 2015 10:34:22
Cpanel::Easy::Apache v3.28.8 rev9999
Server's Module Magic Number: 20051115:36
Server loaded:  APR 1.5.1, APR-Util 1.5.4
Compiled using: APR 1.5.1, APR-Util 1.5.4
Architecture:   64-bit
Server MPM:     Worker
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/worker"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses disabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/usr/local/apache"
 -D SUEXEC_BIN="/usr/local/apache/bin/suexec"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"

Hopefully, will have more info as we gather stats.

@celesteking
Copy link
Author

Alright, happening on another server now, with same versions. Take a look at that --

  PID USER      PR  NI  VIRT  RES  SHR SWAP S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                    
 8248 nobody    20   0 2035m  47m  368    0 R 25.1  0.3   3:42.35 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 7573 nobody    20   0 2035m  47m  368    0 R 24.8  0.3   4:14.22 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 7921 nobody    20   0 2035m  47m  368    0 R 24.8  0.3   3:58.99 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 7984 nobody    20   0 2035m  47m  368    0 R 24.8  0.3   3:43.85 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8012 nobody    20   0 2035m  47m  368    0 R 24.8  0.3   3:44.34 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8177 nobody    20   0 2035m  47m  368    0 R 24.8  0.3   3:39.23 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 7985 nobody    20   0 2035m  47m  368    0 R 24.4  0.3   3:44.45 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8104 nobody    20   0 2035m  47m  368    0 R 24.4  0.3   3:49.61 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8132 nobody    20   0 2035m  47m  368    0 R 24.4  0.3   3:45.99 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 7596 nobody    20   0 2035m  47m  368    0 R 23.5  0.3   4:40.06 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 7947 nobody    20   0 2035m  47m  368    0 R 23.5  0.3   4:05.32 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8019 nobody    20   0 2035m  47m  368    0 R 23.5  0.3   3:44.38 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8159 nobody    20   0 2035m  47m  368    0 R 23.5  0.3   3:36.48 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8360 nobody    20   0 2035m  47m  368    0 R 23.5  0.3   3:36.19 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8409 nobody    20   0 2035m  47m  368    0 R 23.5  0.3   3:29.32 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 8447 nobody    20   0 2035m  47m  368    0 R 23.5  0.3   3:39.56 /usr/local/apache/bin/httpd -k start -DSSL                                                                                                                                 
 7971 nobody    20   0 2035m  47m  368    0 R 23.1  0.3   4:00.09 /usr/local/apache/bin/httpd -k start -DSSL                                         

SecData dir:

-rw-r-----  1 nobody nobody    0 May  6 07:35 global.dir
-rw-r-----  1 nobody nobody    0 May  6 07:35 global.pag
-rw-r-----  1 nobody nobody 4.0K May 23 16:54 ip.dir
-rw-r-----  1 nobody nobody  24M May 23 16:54 ip.pag

And again, nothing in strace output.. But CPU continues to be consumed!

@celesteking
Copy link
Author

Compiled apr with debug on.

gdb) bt
#0  0x00002ad3759cd969 in run_child_cleanups (cref=0x1606f728) at memory/unix/apr_pools.c:2364
#1  0x00002ad3759cd9af in cleanup_pool_for_exec (p=0x1606f708) at memory/unix/apr_pools.c:2372
#2  0x00002ad3759cd9c6 in cleanup_pool_for_exec (p=0x1606f708) at memory/unix/apr_pools.c:2375
#3  0x00002ad3759cd9c6 in cleanup_pool_for_exec (p=0x1606d6f8) at memory/unix/apr_pools.c:2375
#4  0x00002ad3759cd9eb in apr_pool_cleanup_for_exec () at memory/unix/apr_pools.c:2380
#5  0x00002ad3759da4e9 in apr_proc_create (new=0x17b1a770, progname=0x2ad376010769 "/opt/suphp/sbin/suphp", args=0x17b19170, env=0x17b19da0, attr=0x17b1a320, pool=0x17aec408) at threadproc/unix/proc.c:425
#6  0x00002ad37600f6c2 in suphp_script_handler (r=0x17afe1c0) at mod_suphp.c:953
#7  0x00002ad37600fef9 in suphp_handler (r=0x17afe1c0) at mod_suphp.c:569
(gdb) list
2359    static void run_child_cleanups(cleanup_t **cref)
2360    {
2361        cleanup_t *c = *cref;
2362
2363        while (c) {
2364            *cref = c->next;
2365            (*c->child_cleanup_fn)((void *)c->data);
2366            c = *cref;
2367        }
2368    }

(gdb) display **cref
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) next
2366            c = *cref;
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) 
2363        while (c) {
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) 
2364            *cref = c->next;
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) 
2365            (*c->child_cleanup_fn)((void *)c->data);
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) 
2366            c = *cref;
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) 
2363        while (c) {
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) 
2364            *cref = c->next;
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) 
2365            (*c->child_cleanup_fn)((void *)c->data);
1: **cref = {next = 0x161a37c8, data = 0x179b4290, plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup>, child_cleanup_fn = 0x42c3b8 <apr_pool_cleanup_null@plt>}
(gdb) 

(gdb) info threads
* 1 Thread 0x2ad37b11f940 (LWP 13772)  run_child_cleanups (cref=0x1606f728) at memory/unix/apr_pools.c:2365

(gdb) p *c->child_cleanup_fn
$10 = {apr_status_t (void *)} 0x42c3b8 <apr_pool_cleanup_null@plt>

gdb) 
apr_pool_cleanup_null (data=0x179b4290) at memory/unix/apr_pools.c:2405
2405        return APR_SUCCESS;
(gdb) list
2400    #endif /* !defined(WIN32) && !defined(OS2) */
2401
2402    APR_DECLARE_NONSTD(apr_status_t) apr_pool_cleanup_null(void *data)
2403    {
2404        /* do nothing cleanup routine */
2405        return APR_SUCCESS;
2406    }
2407
2408    /* Subprocesses don't use the generic cleanup interface because
2409     * we don't want multiple subprocesses to result in multiple

I should also note this all is happening when running in detectiononly mode.

@zimmerle
Copy link
Contributor

Hi @celesteking,

thank you for the detailed description. What is your ModSecurity version? there are any public rules set loaded? any custom rules?

The version of the apr library which was dynamic loaded is the same that ModSecurity was compiled with ?

@celesteking
Copy link
Author

2.8.0, apache 2.2 mpm worker fixed thread count, suphp. Yes, same version of apr library.

open crs ruleset with a couple of rules disabled and some minor adjustments:

# Disable GeoLookup
SecRuleRemoveById 900050
SecRuleRemoveById 900051

# Disable httprbl lookups
SecRuleRemoveById 981138
SecRuleRemoveById 981139

@celesteking
Copy link
Author

Getting same behaviour with modsec 2.9.0 , same apache version (2.2) & config.

@celesteking
Copy link
Author

Anyone having similar problems? Basically, the module is unusable at current state.

@cp0815
Copy link

cp0815 commented Aug 6, 2015

we are currently experiencing the same behavior on a CentOS 7.1 System:
httpd-2.4.6-31.el7.centos.x86_64
mod_security_crs-2.2.6-6.el7.noarch
mod_security-2.7.3-5.el7.x86_64
It only happens in detectiononly mode and there are increasing amount of pid that are consuming cpu, but strace doesn't show up anything for them

@calebdelnayatgfs
Copy link

I ran into a similar problem as well. Processes consuming 100% CPU due to threads becoming stuck.

This is on a CentOS 7.1 CPanel server with Apache 2.4.16 running event MPM, mod_security version 2.9.0 with the OWASP ModSecurity Core Rule Set installed. Running in detection only mode.

Looking at the thread stacks with gdb revealed the stacks pasted below. I didn't do any further digging.

#0  0x00007fb7339f38c2 in parse_operator2 (sf=0x7fb7287f75c0) at libinjection/libinjection_sqli.c:562
#1  0x00007fb7339f5188 in libinjection_sqli_tokenize (sf=sf@entry=0x7fb7287f75c0) at libinjection/libinjection_sqli.c:1268
#2  0x00007fb7339f5b15 in libinjection_sqli_fold (sf=sf@entry=0x7fb7287f75c0) at libinjection/libinjection_sqli.c:1702
#3  0x00007fb7339f65a6 in libinjection_sqli_fingerprint (sql_state=0x7fb7287f75c0, flags=flags@entry=9)
    at libinjection/libinjection_sqli.c:1903
#4  0x00007fb7339f6b91 in libinjection_is_sqli (sql_state=sql_state@entry=0x7fb7287f75c0) at libinjection/libinjection_sqli.c:2255
#5  0x00007fb7339f6d26 in libinjection_sqli (input=<optimized out>, slen=<optimized out>, fingerprint=fingerprint@entry=0x7fb7287f7840 "")
    at libinjection/libinjection_sqli.c:2313
#6  0x00007fb733a1eda9 in msre_op_detectSQLi_execute (msr=0x7fb670022aa0, rule=0x1536af8, var=0x7fb6a809ebd0, error_msg=0x7fb7287f7910)
    at re_operators.c:2134
#7  0x00007fb733a19140 in execute_operator (var=var@entry=0x7fb6a809ebd0, rule=rule@entry=0x1536af8, msr=msr@entry=0x7fb670022aa0,
    acting_actionset=acting_actionset@entry=0xebfba0, mptmp=mptmp@entry=0x7fb6a809e488) at re.c:2650
#8  0x00007fb733a1a7a9 in msre_rule_process_normal (msr=<optimized out>, rule=0x1536af8) at re.c:3138
#9  msre_rule_process (msr=<optimized out>, rule=0x1536af8) at re.c:3353
#10 msre_ruleset_process_phase (ruleset=<optimized out>, msr=msr@entry=0x7fb670022aa0) at re.c:1773
#11 0x00007fb7339faa80 in modsecurity_process_phase_request_body (msr=0x7fb670022aa0) at modsecurity.c:569
#12 modsecurity_process_phase (msr=msr@entry=0x7fb670022aa0, phase=phase@entry=2) at modsecurity.c:815
#13 0x00007fb7339f9915 in hook_request_late (r=0x7fb670021340) at mod_security2.c:1055
#14 0x00000000004551fd in ap_run_fixups ()
#15 0x0000000000456ec9 in ap_process_request_internal ()
#16 0x00000000004a2936 in ap_process_async_request ()
#17 0x000000000049e281 in ap_process_http_async_connection ()
#18 0x000000000049e499 in ap_process_http_connection ()
#19 0x000000000047153a in ap_run_process_connection ()
#20 0x000000000052c578 in process_socket ()
#21 0x000000000052ee6a in worker_thread ()
#22 0x00007fb7375b6627 in dummy_worker () from /usr/local/apache/lib/libapr-1.so.0
#23 0x00007fb736d2edf5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007fb7368581ad in clone () from /lib64/libc.so.6
#0  0x00007fb7339f3f68 in strlencspn (
    s=s@entry=0x7fb6e00de0a6 "member><name>methodName</name><value><string>wp.getCategories</string></value></member><member><name>params</name><value><array><data><value><string></string></value><value><string>REDACTED</strin"..., len=<optimized out>,
    accept=accept@entry=0x7fb733a580b0 " []{}<>:\\?=@!#~+-*/&|^%(),';\t\n\v\f\r\"\240") at libinjection/libinjection_sqli.c:189
#1  0x00007fb7339f4586 in parse_word (sf=0x7fb7197d95c0) at libinjection/libinjection_sqli.c:898
#2  0x00007fb7339f5188 in libinjection_sqli_tokenize (sf=sf@entry=0x7fb7197d95c0) at libinjection/libinjection_sqli.c:1268
#3  0x00007fb7339f5b15 in libinjection_sqli_fold (sf=sf@entry=0x7fb7197d95c0) at libinjection/libinjection_sqli.c:1702
#4  0x00007fb7339f65a6 in libinjection_sqli_fingerprint (sql_state=0x7fb7197d95c0, flags=flags@entry=9)
    at libinjection/libinjection_sqli.c:1903
#5  0x00007fb7339f6b91 in libinjection_is_sqli (sql_state=sql_state@entry=0x7fb7197d95c0) at libinjection/libinjection_sqli.c:2255
#6  0x00007fb7339f6d26 in libinjection_sqli (input=<optimized out>, slen=<optimized out>, fingerprint=fingerprint@entry=0x7fb7197d9840 "")
    at libinjection/libinjection_sqli.c:2313
#7  0x00007fb733a1eda9 in msre_op_detectSQLi_execute (msr=0x7fb6e804ab18, rule=0xb1e648, var=0x7fb6e0058a40, error_msg=0x7fb7197d9910)
    at re_operators.c:2134
#8  0x00007fb733a19140 in execute_operator (var=var@entry=0x7fb6e0058a40, rule=rule@entry=0xb1e648, msr=msr@entry=0x7fb6e804ab18,
    acting_actionset=acting_actionset@entry=0x12b6b80, mptmp=mptmp@entry=0x7fb6e00582f8) at re.c:2650
#9  0x00007fb733a1a7a9 in msre_rule_process_normal (msr=<optimized out>, rule=0xb1e648) at re.c:3138
#10 msre_rule_process (msr=<optimized out>, rule=0xb1e648) at re.c:3353
#11 msre_ruleset_process_phase (ruleset=<optimized out>, msr=msr@entry=0x7fb6e804ab18) at re.c:1773
#12 0x00007fb7339faa80 in modsecurity_process_phase_request_body (msr=0x7fb6e804ab18) at modsecurity.c:569
#13 modsecurity_process_phase (msr=msr@entry=0x7fb6e804ab18, phase=phase@entry=2) at modsecurity.c:815
#14 0x00007fb7339f9915 in hook_request_late (r=0x7fb6e80493c0) at mod_security2.c:1055
#15 0x00000000004551fd in ap_run_fixups ()
#16 0x0000000000456ec9 in ap_process_request_internal ()
#17 0x00000000004a2936 in ap_process_async_request ()
#18 0x000000000049e281 in ap_process_http_async_connection ()
#19 0x000000000049e499 in ap_process_http_connection ()
#20 0x000000000047153a in ap_run_process_connection ()
#21 0x000000000052c578 in process_socket ()
#22 0x000000000052ee6a in worker_thread ()
#23 0x00007fb7375b6627 in dummy_worker () from /usr/local/apache/lib/libapr-1.so.0
#24 0x00007fb736d2edf5 in start_thread () from /lib64/libpthread.so.0
#25 0x00007fb7368581ad in clone () from /lib64/libc.so.6

@hcanning2014
Copy link

I get strange CPU activity too. #991

@victorhora
Copy link
Contributor

Please give it a try with release 2.9.2.

If the problem persists let us know.

Thanks.

@celesteking
Copy link
Author

Problem persists.

EL6 latest, stock kernel.
ea-apache24-mod_security2-2.9.2-10.10.1.src.rpm

httpd  -V
Server version: Apache/2.4.37 (cPanel)
Server built:   Dec  3 2018 16:16:44
Server's Module Magic Number: 20120211:83
Server loaded:  APR 1.6.3, APR-UTIL 1.6.1
Compiled using: APR 1.6.3, APR-UTIL 1.6.1
Architecture:   64-bit
Server MPM:     worker
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses disabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/sbin/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2/httpd.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"
(gdb) bt
#0  run_child_cleanups (p=0x55b2a39c1138) at memory/unix/apr_pools.c:2642
#1  cleanup_pool_for_exec (p=0x55b2a39c1138) at memory/unix/apr_pools.c:2649
#2  0x00002ba5109da448 in cleanup_pool_for_exec (p=0x55b2a39c1138) at memory/unix/apr_pools.c:2652
#3  0x00002ba5109da448 in cleanup_pool_for_exec (p=0x55b2a39bf128) at memory/unix/apr_pools.c:2652
#4  0x00002ba5109e5227 in apr_proc_create (new=<value optimized out>, progname=0x2ba51b87dc59 "/usr/sbin/suphp", args=0x2ba5744155d8, env=0x2ba574402440, attr=0x2ba574402968, pool=0x2ba574417c98) at threadproc/unix/proc.c:425
#5  0x00002ba51b87c857 in suphp_script_handler (r=0x2ba574417d10) at mod_suphp.c:953
#6  0x00002ba51b87d3b0 in suphp_handler (r=0x2ba574417d10) at mod_suphp.c:569
#7  0x000055b2a18b50d0 in ap_run_handler (r=0x2ba574417d10) at config.c:171
#8  0x000055b2a18b96be in ap_invoke_handler (r=0x2ba574417d10) at config.c:444
#9  0x000055b2a18ceafa in ap_process_async_request (r=0x2ba574417d10) at http_request.c:453
#10 0x000055b2a18cec4e in ap_process_request (r=<value optimized out>) at http_request.c:488
#11 0x000055b2a18caa45 in ap_process_http_sync_connection (c=0x2ba524030078) at http_core.c:210
#12 ap_process_http_connection (c=0x2ba524030078) at http_core.c:251
#13 0x000055b2a18c09b0 in ap_run_process_connection (c=0x2ba524030078) at connection.c:42
#14 0x00002ba5117b7842 in process_socket (thd=0x55b2a65e2b98, dummy=<value optimized out>) at worker.c:479
#15 worker_thread (thd=0x55b2a65e2b98, dummy=<value optimized out>) at worker.c:808
#16 0x00002ba510bfaaa1 in start_thread () from /lib64/libpthread.so.0
#17 0x00002ba510ef8c4d in clone () from /lib64/libc.so.6
(gdb) 

# ps -ww -o cmd,wchan,rss,pcpu -p 4474
CMD                         WCHAN    RSS %CPU
/usr/sbin/httpd -k start    -      77620 10.1

@celesteking
Copy link
Author

This might be related to mlogc being specified via SecAuditLog "|/usr/bin/mlogc /etc/apache2/conf/mlogc.conf". It looks like this is happening after graceful restart has been invoked. Maybe somehow modsec is impeding the pool cleanup? Or maybe there's a stack corruption of some sort?

@jrfondren
Copy link

jrfondren commented Jan 22, 2019

We're seeing this bug after a big update to our WAF. Today, after deploying it to 35 servers, within an hour Apache'd broken down and tried to use 100% of CPU on 7 of those servers. There's nothing in strace on the broken processes, and ltrace shows why: they're all infinite loops, executing a dummy function over and over again:

apr_pool_cleanup_null(0x178f9730, 0, 0x178f9748, 0xf7696c65688d, 0xd81338) = 0
apr_pool_cleanup_null(0x178f9730, 0, 0x178f9748, 0xf7696c65688d, 0xd81338) = 0
apr_pool_cleanup_null(0x178f9730, 0, 0x178f9748, 0xf7696c65688d, 0xd81338) = 0
apr_pool_cleanup_null(0x178f9730, 0, 0x178f9748, 0xf7696c65688d, 0xd81338) = 0
apr_pool_cleanup_null(0x178f9730, 0, 0x178f9748, 0xf7696c65688d, 0xd81338) = 0
apr_pool_cleanup_null(0x178f9730, 0, 0x178f9748, 0xf7696c65688d, 0xd81338) = 0
apr_pool_cleanup_null(0x178f9730, 0, 0x178f9748, 0xf7696c65688d, 0xd81338) = 0
apr_pool_cleanup_null(0x178f9730, 0, 0x178f9748, 0xf7696c65688d, 0xd81338) = 0

which is consistent with the backtrace above.

I only see four places in the entire modsec 2.9.2 repo where apr_pool_cleanup_null is registered as a cleanup function, so it should be easy to at least shift blame to Apache.

Although I don't have anything like an easy "do this and you see the bug" replication, I can replicate it fairly reliably, and with cgroups CPU limitations I can do that while leaving some CPU to spare for any investigations, so please let me know (within a week, ideally) if there's some investigation you'd like me to perform.

We're not doing anything with mlogc and SecAuditLog.

@victorhora
Copy link
Contributor

@celesteking / @jrfondren

I see that the issues you're reporting are related with ModSecurity 2.9.2 released in July/2017.

Since the change set between 2.9.2 and 2.9.3 is quite large, there's a chance that this is not an issue anymore.

Please let us know if the issue persists with 2.9.3 (released Dec/2018).

Thanks!

@jrfondren
Copy link

jrfondren commented Jan 23, 2019

@victorhora I've tested with 2.9.3 and get the same behavior.

The loop itself is in apr:memory/unix/apr_pools.c:

static void run_child_cleanups(cleanup_t **cref)
{
   cleanup_t *c = *cref;

   while (c) {
       *cref = c->next;
       (*c->child_cleanup_fn)((void *)c->data);
       c = *cref;
   }
}

so modsec is giving this a circular list somehow.

@victorhora
Copy link
Contributor

I'm not sure if we'll be able to fix it on 2.9.x since this issue doesn't seem to happen on the current version of libModSecurity (3.x).

Still I'm reopening this one for further investigation.

@victorhora victorhora reopened this Jan 23, 2019
@victorhora victorhora added Platform - Apache 2.x Related to ModSecurity version 2.x labels Jan 23, 2019
@victorhora
Copy link
Contributor

@jrfondren you're not using mLogC right? Can you provide more info about environment? Version of the CRS, any custom rules, custom conf, custom Apache directives, loaded modules, particularity on the environment?

Due to some reports related with cPANEL (cc @celesteking) can you confirm if you're using ModSecurity from cPANEL? There has been some known issues with modules that cPANEL uses. See #712.

Also, there have been cases in the past where people had issues with apr_pool_cleanup calls due filesystem issues such as a full disk or lack of permissions to clean up temporary files (e.g. ip.pag collection data) so please check that too.

In addition, as @zimmerle mentioned, please ensure that Apache and ModSecurity are running against the correct APR (i.e. the Apache included and not a system installed version). Usually apachectl -V" will give you this but you can also check on the first few lines of the error_log when Apache starts with ModSecurity. Please provide these log lines too.

You may also want to check if the optional global mutex configuration introduced at 112ba45 is a valid workaround for you. It's a workaround for v2 due to some known issues with handling the collections with APR. See #1224.

If possible, please provide Apache's error_log and ModSec debug_log when reproducing the issue to see if we can have a better clue on why this problem is happening only on your environments.

Thanks

@jrfondren
Copy link

jrfondren commented Jan 23, 2019

We've seen this problem on Apache 2.2 servers with APR 1.5.2 using some of our own builds, and Apache 2.4 servers with APR 1.6.3, using cPanel's EA4 RPMs, in both cases the APR is the same as what mod_security2.so was compiled with. Mutxes around apr cleanup stuff sounds like it might be helpful, but that patch seems to be concerned only with database access. You probably don't put pointers into the database, right?

Server version: Apache/2.2.34 (Unix)
Server built:   Jan  9 2019 17:32:44
Server's Module Magic Number: 20051115:43
Server loaded:  APR 1.5.2, APR-Util 1.5.4
Compiled using: APR 1.5.2, APR-Util 1.5.4
Architecture:   64-bit
Server MPM:     Worker
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/worker"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT="/usr/local/apache"
 -D SUEXEC_BIN="/usr/local/apache/bin/suexec"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="conf/mime.types"
 -D SERVER_CONFIG_FILE="conf/httpd.conf"

With ModSecurity 2.9.2:

[Wed Jan 23 17:06:26 2019] [notice] ModSecurity for Apache/2.9.2 (http://www.modsecurity.org/) configured.
[Wed Jan 23 17:06:26 2019] [notice] ModSecurity: APR compiled version="1.5.2"; loaded version="1.5.2"
[Wed Jan 23 17:06:26 2019] [notice] ModSecurity: PCRE compiled version="7.8 "; loaded version="8.21 2011-12-12"
[Wed Jan 23 17:06:26 2019] [warn] ModSecurity: Loaded PCRE do not match with compiled!
[Wed Jan 23 17:06:26 2019] [notice] ModSecurity: LUA compiled version="Lua 5.1"
[Wed Jan 23 17:06:26 2019] [notice] ModSecurity: LIBXML compiled version="2.7.6"
[Wed Jan 23 17:06:26 2019] [notice] ModSecurity: Original server signature: Apache
[Wed Jan 23 17:06:26 2019] [notice] ModSecurity: Status engine is currently disabled, enable it by set SecStatusEngine to On.

That PCRE complaint is a concern but it's not present on the Apache 2.4 servers
that also have the problem:

[Tue Jan 22 02:33:59.570231 2019] [:notice] [pid 390024:tid 140428295862400] ModSecurity for Apache/2.9.2 (http://www.modsecurity.org/) configured.
[Tue Jan 22 02:33:59.570241 2019] [:notice] [pid 390024:tid 140428295862400] ModSecurity: APR compiled version="1.6.3"; loaded version="1.6.3"
[Tue Jan 22 02:33:59.570252 2019] [:notice] [pid 390024:tid 140428295862400] ModSecurity: PCRE compiled version="8.32 "; loaded version="8.32 2012-11-30"
[Tue Jan 22 02:33:59.570254 2019] [:notice] [pid 390024:tid 140428295862400] ModSecurity: LUA compiled version="Lua 5.1"
[Tue Jan 22 02:33:59.570256 2019] [:notice] [pid 390024:tid 140428295862400] ModSecurity: LIBXML compiled version="2.9.7"
[Tue Jan 22 02:33:59.570258 2019] [:notice] [pid 390024:tid 140428295862400] ModSecurity: Status engine is currently disabled, enable it by set SecStatusEngine
to On.

We're not using mLogC. We've these configuration options (with the *Access options temporarily Off to confirm that they aren't related to the bug):

 LoadFile /usr/lib64/libxml2.so.2
 LoadFile /usr/lib64/liblua-5.1.so
 LoadModule security2_module  modules/mod_security2.so
 <IfModule mod_security2.c>
 SecRuleEngine On
 # See http://www.modsecurity.org/documentation/ModSecurity-Migration-Matrix.pdf
 #  "Add the rules that will do exactly the same as the directives"
 # SecFilterCheckURLEncoding On
 # SecFilterForceByteRange 0 255
 SecRequestBodyAccess Off
 SecRequestBodyLimit 134217728
 SecRequestBodyNoFilesLimit 1048576
 SecResponseBodyAccess Off
 SecResponseBodyMimeType (null) text/html text/plain text/xml
 SecResponseBodyLimit 2621440
 SecServerSignature Apache
 SecComponentSignature 201001051959
 SecUploadDir /var/asl/data/suspicious
 SecUploadKeepFiles Off
 SecAuditEngine off
# SecAuditEngine RelevantOnly
# SecAuditLogRelevantStatus "^(?:5|4(?!04))"
# SecAuditLogType Concurrent
# SecAuditLog logs/modsec_audit.log
 SecAuditLogParts ABIFHZ
 SecArgumentSeparator "&"
 SecCookieFormat 0
 SecRequestBodyInMemoryLimit 131072
 SecDataDir /var/asl/data/msa
 SecTmpDir /tmp
 SecAuditLogStorageDir /var/asl/data/audit
 SecResponseBodyLimitAction ProcessPartial
 SecDataDir /var/asl/data/msa
 SecDefaultAction "phase:2,deny,log,status:406"
 SecRule REMOTE_ADDR "^127.0.0.1$" nolog,allow,id:1500000
 Include "/usr/local/apache/conf/modsec2.user.conf"

 #SecReadStateLimit 100
 SecPcreMatchLimit 150000
 SecPcreMatchLimitRecursion 150000
 </IfModule>

We're upgrading from a few-years-old version of Atomicorp's to a current version. The current version's a little bit larger and except for using @ipcheck and @ipCheckFromFile instead of @pm-based workarounds, and probably other modernizations, the rules are very similar in complexity and construction. https://wiki.atomicorp.com/wiki/index.php/Atomic_ModSecurity_Rules

The performance problems' been observed to start at times where there are no logs at all from modsec. Since the bug is the creation of an infinite loop, which is then only a problem when APR tries to clean it up forever, there's not going to be anything timely.

@celesteking
Copy link
Author

Alright, I'm able to reproduce this on 2.9.3 built from v2.9.3 tag against cpanel-provided apr 1.6.3 on EL6, no global mutex patch applied.
Same backtrace, same problem.
@victorhora I've got the coredump if you're interested.

@jrfondren
Copy link

jrfondren commented Feb 11, 2019

Replication:

# this rule completely by itself with above-noted config and just some SecMarkers for other (commented-out) rules:
SecRule REQUEST_URI "://%{SERVER_NAME}/" "phase:2,id:333816,pass,t:none,t:urlDecodeUni,t:lowercase,nolog,skipAfter:END_INJECTION_RULES_1"

# logs as such (after s/nolog/log/)
[Sun Feb 10 22:31:06 2019] [error] [client 1.2.3.4] ModSecurity: Warning. Pattern match "\\\\:\\\\/\\\\/example\\\\.com\\\\/" at REQUEST_URI. [file "/etc/httpd/modsecurity.d/10_asl_rules.conf"] [line "516"] [id "333816"] [hostname "example.com"] [uri "/wp-json/oembed/1.0/embed"] [unique_id "XGD6icC5BKMABC@FTXwAAAEs"]

# after a request such as
1.2.3.4 - - [10/Feb/2019:22:30:39 -0600] "GET /wp-content/plugins/justified-image-grid/timthumb.php?src=http%3A%2F%2Fexample.com%2Fwp-content%2Fuploads%2F2015%2F10%2Fexample-image-1024x576.jpg&h=230&q=90&f=.jpg HTTP/1.1" 200 20271 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

# and the bug is reliably caused with a single request:
curl -H "Host: example.com" 'http://1.2.3.4/timthumb.php?src=http%3A%2F%2Fexample.com%2Fwp-content%2Fuploads%2F2015%2F10%2Fexample-image-1024x576.jpg&h=230&q=90&f=.jpg'
# and then putting apache through some work
ab -n 10000 -c 30 -H "Host: example.com" http://1.2.3.4/hi/

The single request lays the trap, and then once some traffic causes Apache to try to expire children, the trap is sprung.

@jrfondren
Copy link

jrfondren commented Feb 11, 2019

The above replication survived several changes I tried, but finally stopped triggering the bug with s/REQUEST_URI/REQUEST_URI_RAW/ Previous replication and fix invalidated by phase of moon.

The following replication works in seconds on a vanilla cPanel CentOS 6 and CentOS 7 environment with a worker MPM and suPHP. It was replicated with a very old CentOS 7 image, and then again after massive updates. It hasn't yet has also been replicated with vanilla CentOS 7, with suphp but without cPanel.

Replication: add the following to a vanilla cPanel configuration. Don't have any modsec rules in addition to this. More SecRules as well as steps to slow down Apache's ability to spawn PHP processes will, not fix the bug, but make it harder for it to happen. We've had large buggy configs run for half a day before Apache finally broke down.

# not required; helps the bug show up faster
MaxRequestWorkers 50
MaxRequestsPerChild 50

LoadModule security2_module  modules/mod_security2.so
SecRuleEngine On

SecRule REQUEST_URI "://%{SERVER_NAME}/" "phase:2,id:333816,pass,t:none,t:urlDecodeUni,t:lowercase,log,skipAfter:END_INJECTION_RULES_1"
SecMarker END_INJECTION_RULES_1

And run an ApacheBench with high concurrent connections and a request that will trigger the above SecRule and also spawn a PHP process with suPHP:

ab -n 50000 -c 90 -H "Host: somedomain.com" 'http://1.2.3.4/timthumb.php?src=http%3A%2F%2Fsomedomain.com%2Fwp-content%2Fuploads%2F2015%2F10%2Fexample-image-1024x576.jpg&h=230&q=90&f=.jpg'

A simple timthumb.php in the docroot with "hello world" in it is enough, as long as suPHP is the handler.

If ltrace isn't working for you and you're not sure if you're seeing the bug or not, some additional signs are

  1. Apache CPU use remains high after traffic ceases entirely.
  2. Apache's CPU use comes from a bunch of child processes that all have unusually low RSS -- because the infinite loop it gets stuck in occurs right after an exec.

@jrfondren
Copy link

Some more information about this bug: it's a data race(?) within libapr in a highly-threaded environment as with the Worker MPM. In one thread libapr will break a data structure, and then before it can go on to unbreak the data structure, elsewhere libapr will fork() and the "unbreak" steps won't come along. Even though libapr intends to then exec(), it tries to clean up a bit first, and this clean-up step is unreliable with broken data structures.

suphp is involved as it's a fork of mod_cgi and causes Apache to fork rapidly. Worker MPM is involved as it brings the threads that can't finish their work around a fork. modsecurity is involved as it registers a cleanup job once per regex compilation and this is the other side of the libapr data race that's causing the bug. The single modsec rule that we can replicate the bug with is involved as it contains a SERVER_NAME and therefore is constantly getting recompiled.

@celesteking
Copy link
Author

Great work! I was able to replicate that reliably, too, with the specified simple regex rule from the above.

If apr is compiled with APR_POOL_DEBUG, httpd won't even start because of the error:
apr_table_copy: t's pool is not an ancestor of p

The error is produced because apparently, modsecurity is using apr functions in a weird way, and that bites in the ass thereafter , as exactly seen in this whole bugreport.

Looking at plain_cleanup_fn = 0x2ad3765ba680 <msc_pcre_cleanup> in backtrace , it might highly possible be related to regex setup/teardown procedure, but I'm not sure. We need modsec devs looking into this bug.

@lightsey
Copy link

This example httpd.conf reproduces the bug with only core Apache modules and ModSecurity.

test.conf.txt

I was able to isolate the corruption of the linked list to the functions in re_operators.c that are using rule->ruleset->mp (a global per-process pool) in the threads that handle individual HTTP requests. These should use storage pools that are assigned to the HTTP request rather than a pool that is shared by all the worker threads in the process.

I'll submit a pull request with fixes for the pool usage in re_execute.c in a moment.

@markblackman
Copy link

This PR fixed our problem in a very busy cluster

@markblackman
Copy link

@lightsey Did you spend quite a while tracking this down? It's burned a lot of our time and we could never reproduce it in our dev environments.

@lightsey
Copy link

@lightsey Did you spend quite a while tracking this down? It's burned a lot of our time and we could never reproduce it in our dev environments.

IIRC, my approach for pinpointing the source of the problem was:

  • Get a conf file with a minimal set of Apache modules and mod_security rules to reproduce the bug reliably.
  • Partially enable the pool allocation debugging logic in APR that's designed to detect race conditions. (Apache won't start with mod_security loaded if it's fully enabled.)
  • Extend the pool allocation debugging functionality to detect race conditions in the manipulations of the free lists.
  • Trigger the bug with a small number of threads running to verify how the linked list corruption happens.
  • Pinpoint the code in mod_security that is triggering the pool changes.
  • Trace through the mod_security code to identify which pools different functions were passing around.

It's definitely a difficult problem to debug because the linked list corruption becoming evident in apr_pool_cleanup_for_exec() happens well after the bug that corrupts the linked list of cleanups.

zimmerle pushed a commit that referenced this issue Jan 14, 2021
@zimmerle
Copy link
Contributor

fixed as part of #2049

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants