Max outbound streams is 100, 100 open #1414

Closed
valiant1x opened this issue Dec 5, 2020 · 15 comments · Fixed by encode/httpcore#253 or encode/httpcore#440
Labels: concurrency · http/2 · needs confirmation

Comments

@valiant1x

Checklist

  • The bug is reproducible against the latest release and/or master.
  • There are no similar issues or pull requests to fix it yet.

Describe the bug

I looked through the GitHub issues and documentation for a way to increase this limit but did not find an answer. We use httpx.AsyncClient(http2=True) in high-concurrency scenarios, although I have observed the same error with http2=False. We need to keep many streams open when the target website goes into queue mode; at that point, if more than 100 concurrent requests are running, we hit this error. How can we resolve it?

I have tried initializing the AsyncClient with httpx.Limits(max_connections=None, max_keepalive_connections=None) but this did not make a difference.
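
For reference, a minimal sketch of that client setup (the timeout value here is illustrative, not from the report):

import httpx

limits = httpx.Limits(max_connections=None, max_keepalive_connections=None)
client = httpx.AsyncClient(http2=True, limits=limits, timeout=20)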

To reproduce

I will provide one upon request; this seems more of a design/conceptual problem, so a reproducible POC may not be necessary.

Expected behavior

Client can configure the maximum number of outbound streams.

Actual behavior

The client is given an error about the maximum number of streams and the request fails.

Debugging material

n/a

Environment

  • OS: Windows
  • Python version: 3.8.2
  • HTTPX version: 0.16.1
  • Async environment: Only observed in Asyncio, have not tested trio
  • HTTP proxy: Yes - it is not feasible to make this many connections without a proxy; doing so results in HTTP errors
  • Custom certificates: no

Additional context

The issue was somewhat acknowledged in #832, but no resolution was provided.

How can we increase the streams limit or otherwise work around the issue here?

@florimondmanca
Member

@valiant1x Thanks for opening this, could you perhaps share the following... ?

  • Full traceback
  • Simple repro case, if possible. E.g. is this reproducible just by spawning > 100 requests to example.com? How is the parallelism achieved: asyncio.gather, or something else?

florimondmanca added the http/2, concurrency, and needs confirmation labels on Dec 5, 2020
@valiant1x
Author

valiant1x commented Dec 5, 2020

@florimondmanca Thanks for the quick reply.

Traceback:

[2020-12-05 17:00:10] [WORKER f41f826]: <GET> Error making request to 'https://www.redacted.com/site/redacted.html'. Max outbound streams is 100, 100 open
  File "C:\Users\valiant\Documents\LuckySuite\Client\TaskServer\modules\plugin_LuckyScraper.py", line 149, in request
    res = await super().request(method, url, **kwargs)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1371, in request
    response = await self.send(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1406, in send
    response = await self._send_handling_auth(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1444, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1476, in _send_handling_redirects
    response = await self._send_single_request(request, timeout)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1502, in _send_single_request
    (status_code, headers, stream, ext,) = await transport.arequest(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http_proxy.py", line 124, in arequest
    return await self._tunnel_request(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http_proxy.py", line 248, in _tunnel_request
    (status_code, headers, stream, ext) = await connection.arequest(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\connection.py", line 100, in arequest
    return await self.connection.arequest(method, url, headers, stream, ext)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http2.py", line 119, in arequest
    return await h2_stream.arequest(method, url, headers, stream, ext)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http2.py", line 292, in arequest
    await self.send_headers(method, url, headers, has_body, timeout)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http2.py", line 330, in send_headers
    await self.connection.send_headers(self.stream_id, headers, end_stream, timeout)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http2.py", line 227, in send_headers
    self.h2_state.send_headers(stream_id, headers, end_stream=end_stream)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\h2\connection.py", line 761, in send_headers
    raise TooManyStreamsError(
h2.exceptions.TooManyStreamsError: Max outbound streams is 100, 100 open

I'll try to provide a simple repro; it's a little challenging because it seems to depend on our target website being in 'queue' mode. I don't want to share the target website here, but the gist of it is: we have a class that inherits from httpx.AsyncClient and provides the same .get, .post, etc. helpers that httpx provides. We redirect all of these to our own LuckyScraper.request method, which adds automatic retries and a couple of other request handlers that are useful to us.

import asyncio
from collections import OrderedDict

import httpx

class LuckyScraper(httpx.AsyncClient):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    async def get(self, url: str, **kwargs) -> object:
        return await self.request("GET", url, **kwargs)

    async def request(self, method: str, url: str, **kwargs) -> object:
        # Retry until we get a non-error response, sleeping between attempts.
        attempts = 0
        retries = 10
        while attempts < retries:
            res = await super().request(method, url, **kwargs)
            if res and not res.is_error:
                return res
            attempts += 1
            await asyncio.sleep(5)

sslCtx = getCustomSslCtx()
proxies = { ... }
async with LuckyScraper(verify=sslCtx, proxies=proxies, timeout=20, http2=True) as client:
    headers = OrderedDict([
        ('accept', '*/*'),
        # other headers
    ])
    response = await client.get(url, headers=headers)

During queue mode we will see the worker spend a lot of time in await client.get because of the long queue time. At some point during this, once a specific worker thread hits the Max 100 streams error, it never recovers: every subsequent request yields the Max 100 streams error.

Spawning 100 requests does not trigger this, only the retries in the same async context loop trigger the issue.

Batches of tasks are sent to socketio.AsyncClients, which individually fire async/await calls as needed in their own worker threads.

Are there streams opened during this process that are not gracefully cleaned up because the underlying httpx instance is long-lived? We need a persistent instance because we rely on some session cookies that must persist across the refreshes in order for the queue system on the target website to work properly.

@florimondmanca
Member

@valiant1x When you write « queue mode », do you mean the server returning 503 Service Unavailable responses, which causes your custom get method to loop until it gets a proper non-error response? Or does it mean something else?

@florimondmanca
Member

florimondmanca commented Dec 5, 2020

Are there streams opened during this process that are not gracefully cleaned up because the underlying httpx instance is long living?

The intended behavior would be that when .request() returns a response, the corresponding HTTP/2 connection is kept in a keep-alive state, and reused upon the next request. I’m not exactly sure what happens in terms of HTTP/2 streams on that connection, though.

FWIW our networking code for HTTP/2 lives here: https://github.com/encode/httpcore/blob/master/httpcore/_async/http2.py

You’ll notice that there’s error handling around the h2 library telling us « I have no streams left », in which case we end up creating a new connection. The exception class there is NoAvailableStreamIDError, while you see TooManyStreamsError, which seems to be a different case we might just not be handling yet.

Right now what I assume may be happening is that some kind of bug (either us, h2, or the server) causes the stream to « leak » after the response is sent. h2 still considers it to be active, and so we keep opening more and reach the 100 limit.

@valiant1x
Author

valiant1x commented Dec 5, 2020

@valiant1x When you write « queue mode », do you mean the server returning 503 Service Unavailable responses, which causes your custom get method to loop until it gets a proper non-error response? Or does it mean something else?

Exactly, queue mode is 503. We simply retry here, for many attempts (200+)

Are there streams opened during this process that are not gracefully cleaned up because the underlying httpx instance is long living?

The intended behavior would be that when .request() returns a response, the corresponding HTTP/2 connection is kept in a keep-alive state, and reused upon the next request. I’m not exactly sure what happens in terms of HTTP/2 streams on that connection, though.

FWIW our networking code for HTTP/2 lives here: https://github.com/encode/httpcore/blob/master/httpcore/_async/http2.py

You’ll notice that there’s error handling around the h2 library telling us « I have no streams left », in which case we end up creating a new connection. The exception class there is NoAvailableStreamIDError, while you see TooManyStreamsError, which seems to be a different case we might just not be handling yet.

Right now what I assume may be happening is that some kind of bug (either us, h2, or the server) causes the stream to « leak » after the response is sent. h2 still considers it to be active, and so we keep opening more and reach the 100 limit.

Can we disable keep-alive on our retries to mitigate the issue, or forcefully close the underlying stream if the response status code is_error? I would think this is desired behavior, to close a stream on error.

Testing was a mess today and the 503 is no longer triggering, but I thought I saw the same issue when specifying http2=False.

@florimondmanca
Member

florimondmanca commented Dec 5, 2020

I would think this is desired behavior, to close a stream on error.

Well, there's not really an error here — the server responds just fine, it's just that we get a 503 Service Unavailable (which is an error response, but there's no exception occurring, the request/response cycle happened just fine). So streams definitely shouldn't leak in that case — the response was sent, stream should be terminated, end of story. So right now I'd say there's a bug somewhere. :-)

Some next-step things to figure out:

  • Is it related to 503 responses specifically?
  • Is it caused by this particular server implementation? (Not clear that the server couldn't have a faulty HTTP/2 implementation when in queue mode.) Does it reproduce on a local mock server that returns a 503?
  • Is it reproducible against a local nginx server forced into queue mode (e.g. connection_pool_size=0, if nginx has that knob)? A client-side harness for trying these checks is sketched below.
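
A minimal client-side harness for those checks might look like this (a sketch only: the URL points at an assumed local mock server that answers 503, and the concurrency/retry numbers are based on figures mentioned in this thread):

import asyncio
import httpx

URL = "https://localhost:8443/queued"  # assumed local mock endpoint returning 503

async def retry_get(client: httpx.AsyncClient, url: str, retries: int = 200) -> httpx.Response:
    # Keep retrying until the server stops answering with an error status.
    response = None
    for _ in range(retries):
        response = await client.get(url)
        if not response.is_error:
            break
        await asyncio.sleep(1)
    return response

async def main() -> None:
    # verify=False because the local mock would use a self-signed certificate.
    async with httpx.AsyncClient(http2=True, verify=False) as client:
        await asyncio.gather(*(retry_get(client, URL) for _ in range(120)))

asyncio.run(main())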

@florimondmanca
Member

florimondmanca commented Dec 5, 2020

Can we disable keep-alive

@valiant1x Disabling keep-alive would mean not reusing connections, so at this point you might as well create a new client instance each time, i.e. have LuckyScraper.request() instantiate a throw-away AsyncClient to make the request.

(Btw, I wouldn't recommend subclassing from AsyncClient directly — nicely wrapping it using composition, i.e. having a .client attribute, seems to be a more flexible approach.)
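
For what it's worth, the composition approach could look roughly like this (a sketch; the class and method names are illustrative, not from the thread):

import httpx

class Scraper:
    def __init__(self, **kwargs):
        # Wrap the client instead of inheriting from it.
        self.client = httpx.AsyncClient(**kwargs)

    async def get(self, url: str, **kwargs) -> httpx.Response:
        return await self.client.request("GET", url, **kwargs)

    async def aclose(self) -> None:
        await self.client.aclose()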

@valiant1x
Author

Some next-step things to figure out:

  • Is it related to 503 responses specifically?
  • Is it caused by this particular server implementation? (Not clear that the server couldn't have a faulty HTTP/2 implementation when in queue mode.) Does it reproduce on a local mock server that returns a 503?
  • Is it reproducible against a local nginx server forced into queue mode (e.g. connection_pool_size=0, if nginx has that knob)?

Great ideas. I'll get some testing going to reproduce 503s with a local server and see if I can get anywhere.

Can we disable keep-alive

@valiant1x Disabling keep-alive would mean not reusing connections, so at this point you might as well create a new client instance each time, i.e. have LuckyScraper.request() instantiate a throw-away AsyncClient to make the request.

If I did not need to preserve cookies, I would create a new instance every time as a hack workaround to this issue, but I have found cookie importing and exporting to be flaky so I want to avoid this. If you instantiate httpx with a cookie that the server then updates, it seems difficult to avoid duplicate cookies. Duplicate cookies are then sent in subsequent requests and also crash dict() casting of the client cookie jar. I have encountered this a few times. Setting path and domain helps but does not cover every situation.
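
For context, the kind of cookie carry-over I would otherwise try with throw-away clients looks roughly like this (a sketch; as described above, updated or duplicate cookies can still make this unreliable):

import httpx

cookies = httpx.Cookies()

async def one_shot_get(url: str) -> httpx.Response:
    # Create a throw-away client but keep a shared cookie jar across requests.
    async with httpx.AsyncClient(cookies=cookies, http2=True) as client:
        response = await client.get(url)
        cookies.update(client.cookies)  # carry any Set-Cookie updates forward
        return response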

(Btw, I wouldn't recommend subclassing from AsyncClient directly — nicely wrapping it using composition, i.e. having a .client attribute, seems to be a more flexible approach.)

Can you elaborate on why it would be better to use a class attribute instead of subclassing httpx.AsyncClient?

@valiant1x
Author

Reproducing this and finding the root cause has proved to be very time-intensive. The error is sporadic and does not seem to depend much on the server response code, but rather on re-using an AsyncClient over many subsequent requests. Even with re-use, I only succeeded in reproducing the error once. The only conclusive finding I have is that the Max outbound streams errors seem to always follow Connection lost errors.

Here is a stack trace of a Connection lost exception.

  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_exceptions.py", line 326, in map_exceptions
    yield
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1502, in _send_single_request
    (status_code, headers, stream, ext,) = await transport.arequest(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http_proxy.py", line 124, in arequest
    return await self._tunnel_request(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http_proxy.py", line 258, in _tunnel_request
    (status_code, headers, stream, ext) = await connection.arequest(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\connection.py", line 106, in arequest
    return await self.connection.arequest(method, url, headers, stream, ext)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http2.py", line 119, in arequest
    return await h2_stream.arequest(method, url, headers, stream, ext)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http2.py", line 292, in arequest
    await self.send_headers(method, url, headers, has_body, timeout)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http2.py", line 330, in send_headers
    await self.connection.send_headers(self.stream_id, headers, end_stream, timeout)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_async\http2.py", line 230, in send_headers
    await self.socket.write(data_to_send, timeout)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_backends\asyncio.py", line 163, in write
    return await asyncio.wait_for(
  File "C:\Users\valiant\AppData\Local\Programs\Python\Python38\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpcore\_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc) from None
httpcore.WriteError: Connection lost

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\valiant\Documents\LuckySuite\Client\TaskServer\modules\plugin_LuckyScraper.py", line 148, in request
    res = await super().request(method, url, **kwargs)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1371, in request
    response = await self.send(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1406, in send
    response = await self._send_handling_auth(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1444, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1476, in _send_handling_redirects
    response = await self._send_single_request(request, timeout)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_client.py", line 1502, in _send_single_request
    (status_code, headers, stream, ext,) = await transport.arequest(
  File "C:\Users\valiant\AppData\Local\Programs\Python\Python38\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\valiant\Documents\LuckySuite\venv\lib\site-packages\httpx\_exceptions.py", line 343, in map_exceptions
    raise mapped_exc(message, **kwargs) from exc  # type: ignore
httpx.WriteError: Connection lost

We are going to simply revert to HTTP/1.1 for now unfortunately.

@florimondmanca
Member

@valiant1x

The only conclusive finding I have is that the Max outbound streams errors seem to always follow Connection lost errors.

That's quite an interesting observation. Given how infrequent those are, and assuming they correspond to asyncio raising a ConnectionResetError("Connection lost"), they probably correspond to the server sometimes dropping the connection for some reason (e.g. a flaky network).

But… When that happens in the middle of a request, all we do is release the stream semaphore, then bubble up.

https://github.com/encode/httpcore/blob/641a762bee001486cfbbe81c25156bb6a6091abb/httpcore/_async/http2.py#L119-L122

But h2 does its own book-keeping of states too. If we look at the point where that TooManyStreamsError is raised, we can see it happens when h2 checks whether a new stream can be opened:

    def send_headers(self, stream_id, headers, end_stream=False,
                     priority_weight=None, priority_depends_on=None,
                     priority_exclusive=None):
        ...
        # Check we can open the stream.
        if stream_id not in self.streams:
            max_open_streams = self.remote_settings.max_concurrent_streams
            if (self.open_outbound_streams + 1) > max_open_streams:
                raise TooManyStreamsError(
                    "Max outbound streams is %d, %d open" %
                    (max_open_streams, self.open_outbound_streams)
                )

self.open_outbound_streams is computed by iterating over self.streams and checking which ones are open or closed. A stream is considered closed if its state is CLOSED:

    @property
    def closed(self):
        """
        Whether the stream is closed.
        """
        return self.state_machine.state == StreamState.CLOSED

So, since we just give up and propagate the exception without letting h2 know that something went wrong and the connection is unusable, h2 actually carries on since from its POV the connection is fine and the stream has remained open.

So each time that Connection lost error occurs, a stream basically leaks. This builds up, and depending on conditions it explodes sooner or later with the TooManyStreamsError. Well, that's my theory, at least!

So… Maybe we just have to call h2_state.close_connection() before raising exceptions up the stack?
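
The shape of that idea, sketched outside of httpcore (the write callable and the function name here are hypothetical, and this is not the actual patch):

import h2.connection

def send_headers_with_cleanup(h2_state: h2.connection.H2Connection,
                              write, stream_id: int, headers) -> None:
    # Register the HEADERS frame with h2, then try to put it on the wire.
    h2_state.send_headers(stream_id, headers, end_stream=True)
    try:
        write(h2_state.data_to_send())
    except OSError:
        # Tell h2 the connection is done for, so its stream accounting
        # does not keep counting this stream as open, then re-raise.
        h2_state.close_connection()
        raise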

@valiant1x
Author

So each time that Connection lost error occurs, a stream basically leaks. This builds up, and depending on conditions it explodes sooner or later with the TooManyStreamsError. Well, that's my theory, at least!

So… Maybe we just have to call h2_state.close_connection() before raising exceptions up the stack?

Great analysis and synthesis. It seems like a probable solution. I will re-test it at some point in the medium term future. If anyone else encounters the same problem in the meantime, they will have a good starting point for resolution.

@florimondmanca
Member

@valiant1x I pushed encode/httpcore#253 with a naive attempt at resolving this based on my previous answer. You can test it by installing HTTPCore from the branch; would you be able to give it a try and see if it resolves the issue?

pip install "git+https://github.com/encode/httpcore.git@fm/http2-close-conn#egg=httpcore"

@kice

kice commented May 22, 2021

I have also encountered this issue. It looks like it is a server-side limitation: in h2, self.remote_settings.max_concurrent_streams decides whether the exception will be raised or not.

My questions are:

  1. Could httpx wait for an available stream slot, instead of handing the request to h2 and causing an exception? (An application-level version of this idea is sketched below.)
  2. If TooManyStreamsError was raised, is it okay to retry the request using the same AsyncClient?
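
Regarding the first question, an application-level way to stay under the server's advertised limit is to gate requests with an asyncio.Semaphore (a sketch; the cap of 100 comes from the error message above, and this only bounds in-flight requests, so it would not help if streams are leaking as discussed earlier):

import asyncio
import httpx

MAX_STREAMS = 100                               # from "Max outbound streams is 100"
semaphore = asyncio.Semaphore(MAX_STREAMS - 1)  # leave a little headroom

async def gated_get(client: httpx.AsyncClient, url: str) -> httpx.Response:
    # Never have more than MAX_STREAMS - 1 requests in flight at once.
    async with semaphore:
        return await client.get(url)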

@kice

kice commented May 26, 2021

@florimondmanca

I think the internal value of self.max_streams_semaphore does not represent the actual value of self.h2_state.open_outbound_streams in AsyncHTTP2Connection.

Maybe try adding a condition variable (it looks like we would need to add an AsyncCondition) to make sure we do not request more streams than the maximum number of open streams allowed?

@frankwalter1301

Hello there. I get this same error when too many connections get unexpectedly closed, like the OP. @florimondmanca your patch doesn't seem to work with newer versions, since I get an exception when the new line of code is executed.
