
tests using client & server are flakey #94

Closed
belm0 opened this issue Nov 23, 2018 · 5 comments

belm0 commented Nov 23, 2018

Since adopting trio-websocket for both client and server, my project is having random failures in simple tests which use both together.

The failures are rare, and seem to require use of pytest-trio autojump_clock (though there are often no sleep calls involved in the tests).

Following is the minimal code with which I can reproduce this (on OS X; it needs to be run many times). The test simply starts a server, connects a client to it, and exits.

from functools import partial

import trio
from trio_websocket import serve_websocket, open_websocket_url


class Server:

    def __init__(self, port=0):
        self._port = port

    async def _handleConnection(self, request):
        await request.accept()

    @property
    def port(self):
        return self._port

    async def serve(self, *, task_status=trio.TASK_STATUS_IGNORED):
        serve = partial(serve_websocket, self._handleConnection, '127.0.0.1',
                        self._port, ssl_context=None)
        async with trio.open_nursery() as nursery:
            server = await nursery.start(serve)
            self._port = server.port
            task_status.started()


host = '127.0.0.1'

async def test0(nursery, autojump_clock):
    server = Server()
    await nursery.start(server.serve)
    async with open_websocket_url(f'ws://{host}:{server.port}/') as ws:
        pass

(The function test0 is repeated 9 more times in the file to induce the failure.)

Failure:

================================ test session starts ================================
platform darwin -- Python 3.6.5, pytest-3.9.1, py-1.7.0, pluggy-0.8.0
rootdir: /foo, inifile: pytest.ini
plugins: trio-0.5.1
collected 10 items

test/foo.py ........F.                                                        [100%]

===================================== FAILURES ======================================
_______________________________________ test8 _______________________________________
trio.MultiError: Cancelled(), Cancelled()

Details of embedded exception 1:

  Traceback (most recent call last):
    File "/Users/.../lib/python3.6/site-packages/trio/_timeouts.py", line 116, in fail_at
      yield scope
    File "/Users/.../lib/python3.6/site-packages/trio_websocket/_impl.py", line 66, in open_websocket
      max_message_size=max_message_size)
    File "/Users/.../lib/python3.6/site-packages/trio_websocket/_impl.py", line 112, in connect_websocket
      stream = await trio.open_tcp_stream(host, port)
    File "/Users/.../lib/python3.6/site-packages/trio/_highlevel_open_tcp_stream.py", line 297, in open_tcp_stream
      nursery.start_soon(attempt_connect, nursery, None)
    File "/Users/.../lib/python3.6/site-packages/trio/_core/_run.py", line 397, in __aexit__
      raise combined_error_from_nursery
    File "/Users/.../lib/python3.6/site-packages/trio/_core/_run.py", line 632, in raise_cancel
      raise exc
  trio.Cancelled

Details of embedded exception 2:

  Traceback (most recent call last):
    File "/Users/.../lib/python3.6/site-packages/trio/_timeouts.py", line 116, in fail_at
      yield scope
    File "/Users/.../lib/python3.6/site-packages/trio_websocket/_impl.py", line 66, in open_websocket
      max_message_size=max_message_size)
    File "/Users/.../lib/python3.6/site-packages/trio_websocket/_impl.py", line 112, in connect_websocket
      stream = await trio.open_tcp_stream(host, port)
    File "/Users/.../lib/python3.6/site-packages/trio/_highlevel_open_tcp_stream.py", line 297, in open_tcp_stream
      nursery.start_soon(attempt_connect, nursery, None)
    File "/Users/.../lib/python3.6/site-packages/trio/_core/_run.py", line 397, in __aexit__
      raise combined_error_from_nursery
    File "/Users/.../lib/python3.6/site-packages/trio/_core/_run.py", line 632, in raise_cancel
      raise exc
  trio.Cancelled

During handling of the above exception, another exception occurred:

value = <trio._core._run.Nursery object at 0x10b21a320>

    async def yield_(value=None):
>       return await _yield_(value)

../.../lib/python3.6/site-packages/async_generator/_impl.py:106:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../.../lib/python3.6/site-packages/async_generator/_impl.py:99: in _yield_
    return (yield _wrap(value))
test/foo.py:73: in test8
    async with open_websocket_url(f'ws://{host}:{server.port}/') as ws:
../.../lib/python3.6/site-packages/async_generator/_util.py:34: in __aenter__
    return await self._agen.asend(None)
../.../lib/python3.6/site-packages/async_generator/_impl.py:366: in step
    return await ANextIter(self._it, start_fn, *args)
../.../lib/python3.6/site-packages/async_generator/_impl.py:202: in send
    return self._invoke(self._it.send, value)
../.../lib/python3.6/site-packages/async_generator/_impl.py:209: in _invoke
    result = fn(*args)
../.../lib/python3.6/site-packages/trio_websocket/_impl.py:66: in open_websocket
    max_message_size=max_message_size)
../../../.pyenv/versions/3.6.5/lib/python3.6/contextlib.py:99: in __exit__
    self.gen.throw(type, value, traceback)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

deadline = 60.0

    @contextmanager
    def fail_at(deadline):
        """Creates a cancel scope with the given deadline, and raises an error if it
        is actually cancelled.

        This function and :func:`move_on_at` are similar in that both create a
        cancel scope with a given absolute deadline, and if the deadline expires
        then both will cause :exc:`Cancelled` to be raised within the scope. The
        difference is that when the :exc:`Cancelled` exception reaches
        :func:`move_on_at`, it's caught and discarded. When it reaches
        :func:`fail_at`, then it's caught and :exc:`TooSlowError` is raised in its
        place.

        Raises:
          TooSlowError: if a :exc:`Cancelled` exception is raised in this scope
            and caught by the context manager.

        """

        with move_on_at(deadline) as scope:
            yield scope
        if scope.cancelled_caught:
>           raise TooSlowError
E           trio.TooSlowError

../.../lib/python3.6/site-packages/trio/_timeouts.py:118: TooSlowError
======================== 1 failed, 9 passed in 0.31 seconds =========================
belm0 changed the title from "tests using trio-websocket client & server are flakey" to "tests using client & server are flakey" on Nov 23, 2018

belm0 commented Nov 23, 2018

@njsmith any thoughts on how autojump_clock may be implicated here?


njsmith commented Nov 23, 2018

My first thought is that the problem might be from combining autojump_clock and real sockets. If one task sends data over loopback TCP to another task in the same process, then from your point of view that's effectively instantaneous: send -> receive. But it's possible that at the OS level, what happens is that it's send -> tiny pause while the kernel figures out what to do here and the app sleeps -> receive. And during that tiny pause, the autojump clock might kick in.

You can confirm this using whatever macOS's equivalent of strace is (maybe dtruss?). If you watch the system call trace for a failing run, you'll see one task do something (maybe a connect in your example above), and then trio will call kqueue to ask the kernel whether anything has happened, and the kernel will tell it nope, no new connections to handle or anything.

If this is the problem, then there are two workarounds:

  1. Switch to using a virtual network, so the kernel doesn't get involved and the trio scheduler has a complete picture of what's going on. Ideally this would just mean adding a fake_network fixture and you're done, but no one has implemented that yet (see trio#170, "Mock network for testing", for discussion). You might also be able to hack something together using trio.testing.memory_stream_pair.

  2. Set the autojump clock's trigger threshold to a larger value. By default it's 0, meaning that any time the scheduler has nothing to do, it triggers a jump. But you can set it to something higher; e.g. 0.1 means the clock waits 100 ms for things to settle before deciding that things really are idle. This is a bit less satisfying, because it slows things down slightly and forces you to fiddle with an arbitrary constant. But it does let the autojump clock play along with kernel weirdness like you're (probably) seeing.

To set the threshold, you just set clock.autojump_threshold = 0.1. You can do this manually at the top of your test, or put it inside a custom fixture.


belm0 commented Nov 25, 2018

@njsmith thank you, that appears to be it

I tried different values of autojump_threshold. On my 5-year-old Mac i7 laptop (otherwise unloaded), the highest value at which I could still reproduce the error was 1 ms.

For my use case I'd be fine setting it at 5 ms globally. How about a pytest-trio config option?


njsmith commented Nov 25, 2018

Be aware that a timeout that works locally may not work if/when you start running things in CI on weirdo cloud machines.

We could think about a pytest-trio config option (maybe open an issue there?), but in the meantime you should be able to override pytest-trio's version like:

# Put this in a test file to override pytest-trio's version of the fixture locally,
# or in 'conftest.py' at the root of your test suite to override it for the whole suite.
import pytest
import trio.testing

@pytest.fixture
def autojump_clock():
    clock = trio.testing.MockClock()
    clock.autojump_threshold = 0.005
    return clock

(You can also give it a different name in case that's less confusing...)


belm0 commented Nov 25, 2018

thank you -- yes, the conftest.py override is good enough

@belm0 belm0 closed this as completed Nov 25, 2018