Following on from debugging in this issue - collective/haufe.requestmonitoring#15
What we see is waitress switching to 100% CPU and staying there. It happens randomly in production (within a week of starting), and we haven't been able to trace it back to a particular request.
Using a sampling profiler on waitress with 2 threads (in prod), we identified the thread using the CPU as the main thread (via `top -H`); this is the profile. Note that since this is production there are other requests in flight, so not all activity is related to the looping that causes this bug.
```
Austin TUI Wall Time Profile CPU 99% ▇█▇█▇▇▇▇ MEM 263M ████████ 5/5
_________ Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
⎝__⎠ ⎝__⎠ Python 3.9.0 PID 3351466 PID:TID 3351466:11
Samples 1451365 ⏲️ 3'16" Threshold 0%
OWN TOTAL %OWN %TOTAL FUNCTION
00" 3'16" 0.0% 100.0% └─ <module> (/app/parts/instance/bin/interpreter:326)
00" 3'16" 0.0% 100.0% └─ <module> (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:255)
00" 3'16" 0.0% 100.0% └─ main (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:251)
00" 3'16" 0.0% 100.0% └─ run (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:217)
00" 3'16" 0.0% 100.0% └─ serve (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:203)
00" 3'16" 0.0% 100.0% └─ serve (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:942)
00" 3'16" 0.0% 100.0% └─ serve_paste (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:917)
00" 3'16" 0.0% 100.0% └─ serve (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/__init__.py:19)
00" 3'16" 0.0% 100.0% └─ run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322)
05" 3'16" 2.5% 99.9% ├─ loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245)
36" 44" 18.3% 22.4% │ ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:158)
05" 05" 2.4% 2.4% │ │ ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)
01" 02" 0.4% 0.9% │ │ ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:117)
01" 01" 0.4% 0.5% │ │ │ ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)
00" 00" 0.0% 0.0% │ │ │ │ ├─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.0% │ │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:95)
00" 00" 0.0% 0.0% │ │ │ ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:514)
00" 00" 0.0% 0.0% │ │ │ ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:515)
00" 00" 0.0% 0.0% │ │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.0% │ │ │ ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:150)
00" 00" 0.0% 0.0% │ │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.0% │ │ │ └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)
00" 00" 0.0% 0.0% │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.1% │ │ ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:113)
00" 00" 0.1% 0.1% │ │ │ ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)
00" 00" 0.0% 0.0% │ │ │ │ └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)
00" 00" 0.0% 0.0% │ │ │ └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)
00" 00" 0.1% 0.1% │ │ ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:154)
```
From the profiling it looks like the channel is writable but `channel.connected == False`.
So it goes into a loop without writing or closing, since it never actually does anything to the socket:
https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L98
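The spin can be sketched in miniature (hypothetical `FakeChannel`, not waitress's real class): `select()` reports the socket writable on every pass, but the handler returns early because `connected` is `False`, so nothing is ever written or closed and the readiness never goes away.

```python
import select
import socket

class FakeChannel:
    def __init__(self, sock):
        self.socket = sock
        self.connected = False  # as left by the failed getpeername()

    def handle_write(self):
        if not self.connected:
            return  # no write, no close: the fd stays writable

a, b = socket.socketpair()
ch = FakeChannel(a)

spins = 0
for _ in range(1000):
    # zero timeout stands in for the event loop's poll step
    _, writable, _ = select.select([], [ch.socket], [], 0)
    if writable:
        ch.handle_write()  # does nothing, so the fd is writable again next pass
        spins += 1

print(spins)  # fires on every single iteration
a.close()
b.close()
```

In the real server the loop never exits, which matches the `loop`/`poll` frames dominating the profile above.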
EDIT: My suspicion is that this was triggered by a client that shut down (half-closed) very quickly after connecting, before the dispatcher finished being set up. That causes `getpeername` to fail with EINVAL, leaving `connected = False`.
waitress/src/waitress/wasyncore.py, line 310 in 4f6789b:

```python
try:
    self.addr = sock.getpeername()
except OSError as err:
    if err.args[0] in (ENOTCONN, EINVAL):
        # To handle the case where we got an unconnected
        # socket.
        self.connected = False
    else:
        # The socket is broken in some unknown way, alert
        # the user and remove it from the map (to prevent
        # polling of broken sockets).
        self.del_channel(map)
        raise
```
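The branch that sets `connected = False` is easy to hit in isolation: `getpeername()` on a socket with no peer raises `OSError` (typically ENOTCONN on Linux; the waitress code also accepts EINVAL). A minimal demonstration:

```python
import errno
import socket

# An unconnected TCP socket has no peer, so getpeername() fails with
# the same class of error wasyncore catches above.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
caught = None
try:
    s.getpeername()
except OSError as err:
    caught = err.args[0]
s.close()

print(caught in (errno.ENOTCONN, errno.EINVAL))  # True
```

The dispatcher then stays in the socket map with `connected = False`, which is exactly the state the profile shows looping.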
This could be the same issue as #411, but it's hard to tell.
One fix is proposed in #419, but there may be better approaches?
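One direction (a hypothetical sketch under my own assumptions, not necessarily what #419 implements) is to close the channel instead of returning when it was never connected, so the dispatcher leaves the socket map and `select()` stops reporting it:

```python
import select
import socket

class Channel:
    def __init__(self, sock, sock_map):
        self.socket = sock
        self.connected = False  # as after a failed getpeername()
        self.map = sock_map
        sock_map[sock.fileno()] = self

    def handle_close(self):
        # removing the fd from the map is what actually stops the spin
        del self.map[self.socket.fileno()]
        self.socket.close()

    def handle_write(self):
        if not self.connected:
            # close instead of silently returning
            self.handle_close()
            return

sock_map = {}
a, b = socket.socketpair()
Channel(a, sock_map)

iterations = 0
while sock_map:
    fds = [c.socket for c in sock_map.values()]
    _, writable, _ = select.select([], fds, [], 0)
    for s in writable:
        sock_map[s.fileno()].handle_write()
    iterations += 1

print(iterations)  # the loop drains after one pass instead of spinning
b.close()
```

The trade-off is that a channel which is merely mid-setup must not be closed prematurely, so any real fix needs to distinguish "never connected" from "not connected yet".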