
Invalid cheaped workers status (non-zero pid for cheaped worker) #122

Closed
prymitive opened this issue Jan 20, 2013 · 13 comments

@prymitive
Contributor

I've noticed a case where the cheaper busyness plugin was running only a few workers, far fewer than the limit I set with --workers, despite the load and a lot of queued requests.
After restarting my app everything started to work fine. I investigated the logs and I'm quite sure that this is a bug in counting cheaped workers.

[busyness] 15s average busyness is at 8%, cheap one of 4 running workers
# now we have 3

[busyness] 15s average busyness is at 55%, will spawn 1 new worker(s)
# now we have 4

[busyness] 15s average busyness is at 53%, will spawn 1 new worker(s)
# now we have 5

[busyness] 15s average busyness is at 15%, cheap one of 5 running workers
# now we have 4

[busyness] 15s average busyness is at 52%, will spawn 1 new worker(s)
# now we have 5

[busyness] 15s average busyness is at 54%, will spawn 1 new worker(s)
# now we have 6

[busyness] 15s average busyness is at 53%, will spawn 1 new worker(s)
# now we have 7

[busyness] 15s average busyness is at 56%, will spawn 1 new worker(s)
# now we have 8

[busyness] 43 requests in listen queue, spawning 1 emergency worker(s) (1)!
# now we have 9

[busyness] 52 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 62 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 76 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 86 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 84 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 85 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 79 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 61 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 82 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 83 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 68 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 80 requests in listen queue, spawning 0 emergency worker(s) (1)!
[busyness] 50 requests in listen queue, spawning 0 emergency worker(s) (1)!
# why 0? it means there were zero workers with cheaped == 1 and pid == 0
# this should not happen

[busyness] 15s average busyness is at 98% but we already started maximum number of workers (20)
# again, we have only 9 running workers but there were zero workers with cheaped == 1 and pid == 0

[busyness] 15s average busyness is at 16%, cheap one of 9 running workers
# this line confirms that we have a bug with tracking cheaped workers

[busyness] 41 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 54 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 59 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 68 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 75 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 90 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 100 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 92 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 100 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 100 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 15s average busyness is at 97% but we already started maximum number of workers (20)
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
[busyness] 101 requests in listen queue, spawning 0 emergency worker(s) (0)!
# same story, we can't spawn new workers because uWSGI thinks we already have the maximum spawned

The number of cheaped workers is always counted in uWSGI (not only in the cheaper_busyness plugin) using this loop:

for (i = 1; i <= uwsgi.numproc; i++) {
        if (uwsgi.workers[i].cheaped == 1 && uwsgi.workers[i].pid == 0) {
                /* worker is cheaped, do something */
        }
}

I'm 100% sure that the "cheap" status taken from uwsgi.workers[i].cheaped was right when I had this issue, so it must be uwsgi.workers[i].pid that was non-zero for a cheaped worker. Is there any reason why we should check for both in those loops?
The non-zero pid for a cheaped worker will probably take some digging to track down; maybe we should not care about the pid and trust the worker status value alone?
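
For comparison, counting on the status flag alone would just drop the pid check. A sketch only, to illustrate the question (not a proposed patch):

/* Counting sketch that trusts uwsgi.workers[i].cheaped alone and
 * ignores the pid; shown only to illustrate the question above. */
int i, cheaped_workers = 0;
for (i = 1; i <= uwsgi.numproc; i++) {
        if (uwsgi.workers[i].cheaped == 1) {
                cheaped_workers++;
        }
}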

@prymitive
Contributor Author

I can add checks in those loops to look for cheaped == 1 && pid > 0 and print a warning in such a case.

like this
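
Roughly this kind of check (just a sketch against the public uwsgi.workers[] fields, not the exact change linked above):

/* Hypothetical consistency check; field names as in uwsgi.h,
 * logging via uwsgi_log(). */
int i;
for (i = 1; i <= uwsgi.numproc; i++) {
        if (uwsgi.workers[i].cheaped == 1 && uwsgi.workers[i].pid > 0) {
                uwsgi_log("[uwsgi-inconsistency] worker %d is marked as cheaped but still has pid %d\n",
                        i, (int) uwsgi.workers[i].pid);
        }
}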

@prymitive
Contributor Author

Stats from another app on the same server:

"workers":[
                {
                        "id":1,
                        "pid":94,
                        "requests":197,
                        "delta_requests":197,
                        "exceptions":0,
                        "harakiri_count":0,
                        "signals":0,
                        "signal_queue":0,
                        "status":"cheap",
                        "rss":46469120,
                        "vsz":258428928,
                        "running_time":71597280,
                        "last_spawn":1358358710,
                        "respawn_count":1,
                        "tx":1172017,
                        "avg_rt":270923,
                        "apps":[

                        ],
                        "cores":[
                                {
                                        "id":0,
                                        "requests":197,
                                        "static_requests":67,
                                        "routed_requests":0,
                                        "offloaded_requests":67,
                                        "in_request":0
                                }
                        ]
                },

"status":"cheap" and "pid":94

Full stats output: https://gist.github.com/dcd46a04dd9b4f0d32bc
Only workers 14 and 15 are really running, the others are cheaped (verified by running ps on this server)

@prymitive
Contributor Author

It turns out that all those worker processes are still there, but they are <defunct>

@prymitive
Contributor Author

Also, it seems that many (if not all) apps on all of my boxes are affected. This looks weird; maybe I had some network or other global issue that triggered this error? Maybe something bad happened when those processes were stopped/reloaded due to max requests and uWSGI did not notice that there was an error?

@prymitive
Contributor Author

It looks like when uWSGI wants to cheap a worker it will mark it as "cheaped = 1" and then just send the SIGWINCH signal. In the lengthy master_loop uWSGI will check if any worker has died and mark its pid as 0, so this will normally set pid = 0 for cheaped workers once they die. But if a worker for any reason does not die, it will continue to run while uWSGI has it marked as cheaped with pid > 0.

uWSGI will only check if workers are dead while the whole uWSGI instance is stopping or reloading; we have reload mercy there. Maybe we should also add checks in master_loop() to verify that all workers that have cheaped=1 also have pid=0? Think of it as reload mercy, but for individual workers.
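
Something along these lines inside master_loop() (just a sketch: cheaped_at is a hypothetical timestamp that would be set when the worker is asked to cheap, and the 60s mercy window is arbitrary; the real patch may differ):

/* Per-worker "reload mercy" sketch for master_loop():
 * SIGKILL a worker that was asked to cheap but never exited. */
int i;
time_t now = time(NULL);
for (i = 1; i <= uwsgi.numproc; i++) {
        if (uwsgi.workers[i].cheaped == 1 && uwsgi.workers[i].pid > 0) {
                if (now - uwsgi.workers[i].cheaped_at > 60) {
                        uwsgi_log("worker %d is taking too much time to die, sending SIGKILL\n", i);
                        kill(uwsgi.workers[i].pid, SIGKILL);
                }
        }
}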

@prymitive
Contributor Author

I wrote a patch for it, to verify whether this works properly. I've installed the patched uWSGI on one server and will monitor it to check if it kills those defunct cheaped workers.

PS. Travis fails to build uWSGI

@prymitive
Contributor Author

It seems to be working just fine:

worker 1 is taking too much time to die (60s), sending SIGKILL
worker 1 is taking too much time to die (61s), sending SIGKILL
DAMN ! worker 1 (pid: 94) MISTERIOUSLY killed by signal :( trying respawn ...
uWSGI worker 1 cheaped.

I should only get rid of this "MISTERIOUSLY killed by signal" message for such cases and add a more appropriate one, so there is no confusion.

Strace from such worker:

# strace -ff -p 27525
Process 27525 attached with 3 threads - interrupt to quit
[pid 27527] epoll_wait(13,  <unfinished ...>
[pid 27526] epoll_wait(11,  <unfinished ...>
[pid 27525] epoll_wait(14, {{EPOLLIN, {u32=6, u64=6}}}, 1, 4294967295) = 1
[pid 27525] accept(6, 0x7fe1a4620084, [110]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 27525] epoll_wait(14, 7fffc5491370, 1, 4294967295) = -1 EINTR (Interrupted system call)
[pid 27525] --- SIGWINCH (Window changed) @ 0 (0) ---
[pid 27525] rt_sigreturn(0x1c)          = -1 EINTR (Interrupted system call)
[pid 27525] _exit(0)                    = ?
Process 27525 detached
[pid 27526] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 27526 leader 27525
[pid 27527] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 27527 leader 27525

It looks like some thread is keeping it in a defunct state, probably the offload one.

EDIT: yes, it didn't take long; my workers are not shutting down correctly due to enabled offload threads. This seems to happen every time.
So this is the real bug that needs to be fixed. But please review my patch and, if it's ok, I would like it to be merged; it will protect uWSGI from such issues in the future.

@prymitive
Contributor Author

BTW I didn't notice this issue before, so it is possible that it was introduced somewhere along the 1.4 bugfix releases, probably between 1.4.2 and 1.4.4. I upgraded from 1.4.2 to 1.4.4 a week ago and I think this was happening all week; I just didn't notice it until yesterday.

@prymitive
Contributor Author

I've reverted the only change that was made to offload.c after 1.4.2, but it still hangs on worker stop if offload threads are enabled.

@unbit
Owner

unbit commented Jan 21, 2013

The uwsgi_ignition() function in core/uwsgi.c exits with pthread_exit(NULL).
Can you check if using exit(0) solves the problem? (sorry, I have no way to check it myself right now)
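
The difference is easy to reproduce outside uWSGI: pthread_exit() in the main thread only ends that thread, and the process stays alive as long as any other thread (e.g. an offload thread blocked in epoll_wait) keeps running, whereas exit() terminates the whole process. A standalone sketch (not uWSGI code) to illustrate:

/* Standalone illustration, not uWSGI code: compile with
 *   cc -pthread demo.c -o demo
 * With pthread_exit(NULL) the process keeps running because the
 * worker thread is still blocked; with exit(0) it terminates at once. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *blocked_thread(void *arg) {
        /* stands in for an offload thread stuck in epoll_wait() */
        pause();
        return NULL;
}

int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, blocked_thread, NULL);
        printf("main thread exiting\n");
        pthread_exit(NULL);   /* only the main thread ends, process lives on */
        /* exit(0); */        /* the whole process would terminate */
}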

@prymitive
Contributor Author

Replacing pthread_exit(NULL) with exit(0) solves this issue for me.

It's easy to test; just run:

./uwsgi --master --logdate -L --http :8080 --wsgi-file tests/staticfile.py -p 10 --auto-procname --cheaper-algo busyness --cheaper-busyness-verbose --cheaper-overload 5 --cheaper-busyness-multiplier 2 --offload-threads 2

If you have a defunct worker after:

Mon Jan 21 10:53:19 2013 - [busyness] 5s average busyness of 10 worker(s) is at 0%
Mon Jan 21 10:53:19 2013 - [busyness] 5s average busyness is at 0%, cheap one of 10 running workers

and there is no "uWSGI worker 1 cheaped." message, then you have just hit this issue.

@unbit
Owner

unbit commented Jan 21, 2013

ok, applied it to the 1.4 tree (using end_me(0))

@prymitive
Contributor Author

It works fine; I've retested with the latest master and the issue is gone.
I would still like to add this worker reload mercy feature to uWSGI, but I will open a separate issue for it.
Thanks
