Following on from debugging in this issue - collective/haufe.requestmonitoring#15
What we see is waitress switching to 100% CPU and staying there. It happens randomly in production (within a week of starting), and we haven't been able to trace it back to a particular request.
Using a sampling profiler on waitress with 2 threads (in prod), we identified the thread using the CPU as the main thread (via `top -H`); this is the profile. Note that since this is production there are other requests in flight, so not all activity is related to the looping that causes this bug.
```
Austin TUI Wall Time Profile CPU 99% ▇█▇█▇▇▇▇ MEM 263M ████████ 5/5
_________ Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
⎝__⎠ ⎝__⎠ Python 3.9.0 PID 3351466 PID:TID 3351466:11
Samples 1451365 ⏲️ 3'16" Threshold 0%
OWN TOTAL %OWN %TOTAL FUNCTION
00" 3'16" 0.0% 100.0% └─ <module> (/app/parts/instance/bin/interpreter:326)
00" 3'16" 0.0% 100.0% └─ <module> (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:255)
00" 3'16" 0.0% 100.0% └─ main (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:251)
00" 3'16" 0.0% 100.0% └─ run (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:217)
00" 3'16" 0.0% 100.0% └─ serve (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:203)
00" 3'16" 0.0% 100.0% └─ serve (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:942)
00" 3'16" 0.0% 100.0% └─ serve_paste (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:917)
00" 3'16" 0.0% 100.0% └─ serve (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/__init__.py:19)
00" 3'16" 0.0% 100.0% └─ run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322)
05" 3'16" 2.5% 99.9% ├─ loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245)
36" 44" 18.3% 22.4% │ ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:158)
05" 05" 2.4% 2.4% │ │ ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)
01" 02" 0.4% 0.9% │ │ ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:117)
01" 01" 0.4% 0.5% │ │ │ ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)
00" 00" 0.0% 0.0% │ │ │ │ ├─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.0% │ │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:95)
00" 00" 0.0% 0.0% │ │ │ ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:514)
00" 00" 0.0% 0.0% │ │ │ ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:515)
00" 00" 0.0% 0.0% │ │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.0% │ │ │ ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:150)
00" 00" 0.0% 0.0% │ │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.0% │ │ │ └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)
00" 00" 0.0% 0.0% │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.1% │ │ ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:113)
00" 00" 0.1% 0.1% │ │ │ ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)
00" 00" 0.0% 0.0% │ │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.0% │ │ │ └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)
00" 00" 0.1% 0.1% │ │ ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:154)
```
From the profiling it looks like the channel is writable but `channel.connected == False`.
So it goes into a loop without writing or closing, since it never actually does anything to the socket:
https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L98
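The spin can be sketched in miniature (hypothetical `FakeChannel`, not waitress's real class): `select()` reports the socket writable on every pass, but the handler returns early because `connected` is `False`, so nothing is ever written or closed and the readiness never goes away.

```python
import select
import socket

class FakeChannel:
    def __init__(self, sock):
        self.socket = sock
        self.connected = False  # as left by the failed getpeername()

    def handle_write(self):
        if not self.connected:
            return  # no write, no close: the fd stays writable

a, b = socket.socketpair()
ch = FakeChannel(a)

spins = 0
for _ in range(1000):
    # zero timeout stands in for the event loop's poll step
    _, writable, _ = select.select([], [ch.socket], [], 0)
    if writable:
        ch.handle_write()  # does nothing, so the fd is writable again next pass
        spins += 1

print(spins)  # fires on every single iteration
a.close()
b.close()
```

In the real server the loop never exits, which matches the `loop`/`poll` frames dominating the profile above.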
EDIT: My suspicion is that this was triggered by a client that shut down (half-closed) very quickly after connecting, before the dispatcher finished being set up. That causes `getpeername` to fail with EINVAL, leaving `connected = False`.
waitress/src/waitress/wasyncore.py, line 310 in 4f6789b:

```python
try:
    self.addr = sock.getpeername()
except OSError as err:
    if err.args[0] in (ENOTCONN, EINVAL):
        # To handle the case where we got an unconnected
        # socket.
        self.connected = False
    else:
        # The socket is broken in some unknown way, alert
        # the user and remove it from the map (to prevent
        # polling of broken sockets).
        self.del_channel(map)
        raise
```
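The branch that sets `connected = False` is easy to hit in isolation: `getpeername()` on a socket with no peer raises `OSError` (typically ENOTCONN on Linux; the waitress code also accepts EINVAL). A minimal demonstration:

```python
import errno
import socket

# An unconnected TCP socket has no peer, so getpeername() fails with
# the same class of error wasyncore catches above.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
caught = None
try:
    s.getpeername()
except OSError as err:
    caught = err.args[0]
s.close()

print(caught in (errno.ENOTCONN, errno.EINVAL))  # True
```

The dispatcher then stays in the socket map with `connected = False`, which is exactly the state the profile shows looping.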
This could be the same issue as #411, but it's hard to tell.
One fix is proposed in #419, but there may be better approaches?
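One direction (a hypothetical sketch under my own assumptions, not necessarily what #419 implements) is to close the channel instead of returning when it was never connected, so the dispatcher leaves the socket map and `select()` stops reporting it:

```python
import select
import socket

class Channel:
    def __init__(self, sock, sock_map):
        self.socket = sock
        self.connected = False  # as after a failed getpeername()
        self.map = sock_map
        sock_map[sock.fileno()] = self

    def handle_close(self):
        # removing the fd from the map is what actually stops the spin
        del self.map[self.socket.fileno()]
        self.socket.close()

    def handle_write(self):
        if not self.connected:
            # close instead of silently returning
            self.handle_close()
            return

sock_map = {}
a, b = socket.socketpair()
Channel(a, sock_map)

iterations = 0
while sock_map:
    fds = [c.socket for c in sock_map.values()]
    _, writable, _ = select.select([], fds, [], 0)
    for s in writable:
        sock_map[s.fileno()].handle_write()
    iterations += 1

print(iterations)  # the loop drains after one pass instead of spinning
b.close()
```

The trade-off is that a channel which is merely mid-setup must not be closed prematurely, so any real fix needs to distinguish "never connected" from "not connected yet".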