
Invalid cheaped workers status (non-zero pid for cheaped worker) #122

Closed
prymitive opened this issue Jan 20, 2013 · 13 comments

@prymitive
Contributor

I've noticed a case where the cheaper busyness plugin was running only a few workers, far fewer than the limit I set with --workers, despite the load and a lot of queued requests.
After restarting my app everything started to work fine. I investigated the logs and I'm quite sure that this is a bug in counting cheaped workers.

[busyness] 15s average busyness is at 8%, cheap one of 4 running workers
# now we have 3

[busyness] 15s average busyness is at 55%, will spawn 1 new worker(s)
# now we have 4

[busyness] 15s average busyness is at 53%, will spawn 1 new worker(s)
# now we have 5

[busyness] 15s average busyness is at 15%, cheap one of 5 running workers
# now we have 4

[busyness] 15s average busyness is at 52%, will spawn 1 new worker(s)
# now we have 5

[busyness] 15s average busyness is at 54%, will spawn 1 new worker(s)
# now we have 6

[busyness] 15s average busyness is at 53%, will spawn 1 new worker(s)
# now we have 7

[busyness] 15s average busyness is at 56%, will spawn 1 new worker(s)
# now we have 8

[busyness] 43 requests in listen queue, spawning 1 emergency worker(s) (1)!
# now we have 9

[busyness] 52 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 62 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 76 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 86 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 84 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 85 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 79 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 61 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 82 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 83 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 68 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 80 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 50 requests in listen queue, spawning 0 emergency worker(s) (1)!
# why 0? it means there were zero workers with cheaped == 1 and pid == 0
# this should not happen

[busyness] 15s average busyness is at 98% but we already started maximum number of workers (20)
# again, we have only 9 running workers but there were zero workers with cheaped == 1 and pid == 0

[busyness] 15s average busyness is at 16%, cheap one of 9 running workers
# this line confirms that we have a bug with tracking cheaped workers

[busyness] 41 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 54 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 59 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 68 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 75 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 90 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 100 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 92 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 100 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 100 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 15s average busyness is at 97% but we already started maximum number of workers (20)
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
# same story, we can't spawn new workers because uWSGI thinks we already have the maximum spawned

The number of cheaped workers is always counted in uWSGI (not only in the cheaper_busyness plugin) using this loop:

for (i = 1; i <= uwsgi.numproc; i++) {
        if (uwsgi.workers[i].cheaped == 1 && uwsgi.workers[i].pid == 0) {
                /* worker is cheaped, do something */
        }
}

I'm 100% sure that the "cheap" status taken from uwsgi.workers[i].cheaped was right when I had this issue, so it must be uwsgi.workers[i].pid that was non-zero for a cheaped worker. Is there any reason why we should check for both in those loops?
The non-zero pid for a cheaped worker will probably take some digging to track down; maybe we should not care about the pid and trust the worker status value alone?
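
For comparison, counting on the status flag alone would just drop the pid check. A sketch only, to illustrate the question (not a proposed patch):

/* Counting sketch that trusts uwsgi.workers[i].cheaped alone and
 * ignores the pid; shown only to illustrate the question above. */
int i, cheaped_workers = 0;
for (i = 1; i <= uwsgi.numproc; i++) {
        if (uwsgi.workers[i].cheaped == 1) {
                cheaped_workers++;
        }
}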

@prymitive
Contributor Author

I can add checks in those loops to look for cheaped == 1 && pid > 0 and print a warning in such a case.

like this
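
Roughly this kind of check (just a sketch against the public uwsgi.workers[] fields, not the exact change linked above):

/* Hypothetical consistency check; field names as in uwsgi.h,
 * logging via uwsgi_log(). */
int i;
for (i = 1; i <= uwsgi.numproc; i++) {
        if (uwsgi.workers[i].cheaped == 1 && uwsgi.workers[i].pid > 0) {
                uwsgi_log("[uwsgi-inconsistency] worker %d is marked as cheaped but still has pid %d\n",
                        i, (int) uwsgi.workers[i].pid);
        }
}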

@prymitive
Contributor Author

Stats from another app on the same server:

"workers":[
                {
                        "id":1,
                        "pid":94,
                        "requests":197,
                        "delta_requests":197,
                        "exceptions":0,
                        "harakiri_count":0,
                        "signals":0,
                        "signal_queue":0,
                        "status":"cheap",
                        "rss":46469120,
                        "vsz":258428928,
                        "running_time":71597280,
                        "last_spawn":1358358710,
                        "respawn_count":1,
                        "tx":1172017,
                        "avg_rt":270923,
                        "apps":[

                        ],
                        "cores":[
                                {
                                        "id":0,
                                        "requests":197,
                                        "static_requests":67,
                                        "routed_requests":0,
                                        "offloaded_requests":67,
                                        "in_request":0
                                }
                        ]
                },

"status":"cheap" and "pid":94

Full stats output: https://gist.github.com/dcd46a04dd9b4f0d32bc
Only workers 14 and 15 are really running, the others are cheaped (verified by running ps on this server)

@prymitive
Contributor Author

It turns out that all those worker processes are still there, but they are <defunct>

@prymitive
Contributor Author

Also, it seems that many (if not all) apps on all of my boxes are affected. This looks weird; maybe I had some network or other global issue that triggered this error? Maybe something bad happened when those processes were stopped/reloaded due to max requests and uWSGI did not notice that there was an error?

@prymitive
Contributor Author

It looks like when uWSGI wants to cheap a worker it will mark it as "cheaped = 1" and then just send the SIGWINCH signal. In the lengthy master_loop uWSGI will check if any worker has died and mark its pid as 0, so this will normally set pid = 0 for cheaped workers once they die. But if a worker for any reason does not die, it will continue to run while uWSGI has it marked as cheaped with pid > 0.

uWSGI will only check if workers are dead while the whole uWSGI instance is stopping or reloading; we have reload mercy there. Maybe we should also add checks in master_loop() to verify that all workers that have cheaped=1 also have pid=0? Think of it as reload mercy, but for individual workers.
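
Something along these lines inside master_loop() (just a sketch: cheaped_at is a hypothetical timestamp that would be set when the worker is asked to cheap, and the 60s mercy window is arbitrary; the real patch may differ):

/* Per-worker "reload mercy" sketch for master_loop():
 * SIGKILL a worker that was asked to cheap but never exited. */
int i;
time_t now = time(NULL);
for (i = 1; i <= uwsgi.numproc; i++) {
        if (uwsgi.workers[i].cheaped == 1 && uwsgi.workers[i].pid > 0) {
                if (now - uwsgi.workers[i].cheaped_at > 60) {
                        uwsgi_log("worker %d is taking too much time to die, sending SIGKILL\n", i);
                        kill(uwsgi.workers[i].pid, SIGKILL);
                }
        }
}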

@prymitive
Contributor Author

I wrote a patch for it, to verify whether this works properly. I've installed the patched uWSGI on one server and will monitor it to check if it kills those defunct cheaped workers.

PS. Travis fails to build uWSGI

@prymitive
Contributor Author

It seems to be working just fine:

worker 1 is taking too much time to die (60s), sending SIGKILL
worker 1 is taking too much time to die (61s), sending SIGKILL
DAMN ! worker 1 (pid: 94) MISTERIOUSLY killed by signal :( trying respawn ...
uWSGI worker 1 cheaped.

I should only get rid of this "MISTERIOUSLY killed by signal" message for such cases and add a more appropriate one, so there is no confusion.

Strace from such worker:

# strace -ff -p 27525
Process 27525 attached with 3 threads - interrupt to quit
[pid 27527] epoll_wait(13,  <unfinished ...>
[pid 27526] epoll_wait(11,  <unfinished ...>
[pid 27525] epoll_wait(14, {{EPOLLIN, {u32=6, u64=6}}}, 1, 4294967295) = 1
[pid 27525] accept(6, 0x7fe1a4620084, [110]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 27525] epoll_wait(14, 7fffc5491370, 1, 4294967295) = -1 EINTR (Interrupted system call)
[pid 27525] --- SIGWINCH (Window changed) @ 0 (0) ---
[pid 27525] rt_sigreturn(0x1c)          = -1 EINTR (Interrupted system call)
[pid 27525] _exit(0)                    = ?
Process 27525 detached
[pid 27526] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 27526 leader 27525
[pid 27527] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 27527 leader 27525

It looks like some thread is keeping it in a defunct state, probably the offload one.

EDIT: yes, it didn't take long; my workers are not shutting down correctly due to enabled offload threads. This seems to happen every time.
So this is the real bug that needs to be fixed. But please review my patch and, if it's ok, I would like it to be merged; it will protect uWSGI from such issues in the future.

@prymitive
Contributor Author

BTW I didn't notice this issue before, so it is possible that it was introduced somewhere along the 1.4 bugfix releases, probably between 1.4.2 and 1.4.4. I upgraded from 1.4.2 to 1.4.4 a week ago and I think this was happening all week; I just didn't notice it until yesterday.

@prymitive
Contributor Author

I've reverted the only change that was made to offload.c after 1.4.2, but it still hangs on worker stop if offload threads are enabled.

@unbit
Owner

unbit commented Jan 21, 2013

The uwsgi_ignition() function in core/uwsgi.c exits with pthread_exit(NULL).
Can you check if using exit(0) solves the problem? (sorry, I have no way to check it myself right now)
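
The difference is easy to reproduce outside uWSGI: pthread_exit() in the main thread only ends that thread, and the process stays alive as long as any other thread (e.g. an offload thread blocked in epoll_wait) keeps running, whereas exit() terminates the whole process. A standalone sketch (not uWSGI code) to illustrate:

/* Standalone illustration, not uWSGI code: compile with
 *   cc -pthread demo.c -o demo
 * With pthread_exit(NULL) the process keeps running because the
 * worker thread is still blocked; with exit(0) it terminates at once. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *blocked_thread(void *arg) {
        /* stands in for an offload thread stuck in epoll_wait() */
        pause();
        return NULL;
}

int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, blocked_thread, NULL);
        printf("main thread exiting\n");
        pthread_exit(NULL);   /* only the main thread ends, process lives on */
        /* exit(0); */        /* the whole process would terminate */
}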

@prymitive
Contributor Author

Replacing pthread_exit(NULL) with exit(0) solves this issue for me.

It's easy to test; just run:

./uwsgi --master --logdate -L --http :8080 --wsgi-file tests/staticfile.py -p 10 --auto-procname --cheaper-algo busyness --cheaper-busyness-verbose --cheaper-overload 5 --cheaper-busyness-multiplier 2 --offload-threads 2

If you have a defunct worker after:

Mon Jan 21 10:53:19 2013 - [busyness] 5s average busyness of 10 worker(s) is at 0%
Mon Jan 21 10:53:19 2013 - [busyness] 5s average busyness is at 0%, cheap one of 10 running workers

and there is no "uWSGI worker 1 cheaped." message, then you have just hit this issue.

@unbit
Owner

unbit commented Jan 21, 2013

ok, applied it to the 1.4 tree (using end_me(0))

@prymitive
Contributor Author

It works fine; I've retested with the latest master and the issue is gone.
I would still like to add this worker reload mercy feature to uWSGI, but I will open a separate issue for it.
Thanks
