router hangs under load #550

Closed · garypen opened this issue Feb 25, 2022 · 7 comments · Fixed by #752
Labels: performance (Performance or scalability issues)

Comments

garypen (Contributor) commented Feb 25, 2022

Describe the bug

Report 1 (from @tomrj)

I did not see this mysterious behaviour in previous builds, but when running some load against the router (I'm using a tool called autocannon), it gets stuck after a while and does not recover.

However, when curling the server with the same headers and the same query, it works just fine.

Report 2

I can replicate this sort of behavior when I blow out concurrency limits on downstream lambdas and they start throttling. If I recall, the router didn't seem to handle the closure of stuck threads all that gracefully and would eventually top out.

To Reproduce

Need more info from users.

Expected behavior

The router shouldn't hang.

garypen added the triage and performance (Performance or scalability issues) labels on Feb 25, 2022
@BrynCooke (Contributor)

Something to ask is whether query variables are being inlined.
That would cause a severe performance issue, since query planning would then happen on every request.

abernix (Member) commented Feb 28, 2022

> Something to ask is whether query variables are being inlined.
> That would cause a severe performance issue, since query planning would then happen on every request.

It's worth noting that this would only be problematic if the operations generated by autocannon were truly dynamic and there was a high cardinality of operations being received. If the same set of inlined variables is sent over and over again, I wouldn't expect a performance hit, since the cache would still suffice. (Worth noting that we also have an article about best practices that's somewhat relevant.)
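
For illustration, a minimal sketch of the two request shapes being discussed; the endpoint, operation, and `product` field are hypothetical, not taken from this benchmark. The first shape produces a new document (and a new query plan) for every distinct value, while the second keeps the document constant:

```rust
// Hypothetical request bodies; operation and field names are made up for illustration.
use serde_json::json;

fn main() {
    // Inlined variable: the literal is baked into the query string, so every
    // distinct value yields a distinct document and a distinct plan-cache entry.
    let inlined = json!({
        "query": r#"query { product(id: "42") { name } }"#
    });

    // Separate variables: the document stays constant across requests, so the
    // query-plan cache keeps hitting the same entry; only `variables` changes.
    let parameterized = json!({
        "query": "query Product($id: ID!) { product(id: $id) { name } }",
        "variables": { "id": "42" }
    });

    println!("{inlined}\n{parameterized}");
}
```

If the benchmark always sends the exact same inlined literal, both shapes should hit the cache, which matches the point above about cardinality.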

tomrj commented Mar 17, 2022

The queries were all static and there was only one query; this was a really simple microbenchmark. The effect was not so much a performance hit as a total inability to run the benchmark after a while, yet curl on the same query would still work reliably, even curl in a shell loop, without missing a beat.

So to go further I need to verify whether it is the pod that is messed up, and whether this scenario occurs with other kinds of benchmarks, such as artillery or some native one.
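
For reference, a rough sketch of what that kind of microbenchmark boils down to, written as a small tokio/reqwest program. The connection count, iteration count, endpoint, and query are placeholders rather than the values used in the report (assumed Cargo dependencies: tokio with `macros`/`rt-multi-thread`, reqwest with the `json` feature, serde_json):

```rust
// Rough stand-in for an autocannon-style run: N concurrent connections, each
// repeatedly POSTing the same static query. All values below are placeholders.
use serde_json::json;

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let body = json!({ "query": "{ me { name } }" });

    let mut tasks = Vec::new();
    for _ in 0..50 {
        let client = client.clone();
        let body = body.clone();
        tasks.push(tokio::spawn(async move {
            for _ in 0..1_000 {
                match client.post("http://localhost:4000/").json(&body).send().await {
                    // Drain the body so the connection can be reused.
                    Ok(resp) => { let _ = resp.bytes().await; }
                    Err(e) => eprintln!("request failed: {e}"),
                }
            }
        }));
    }
    for t in tasks {
        let _ = t.await;
    }
}
```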

abernix removed the triage label on Mar 24, 2022
Geal self-assigned this and unassigned o0Ignition0o on Mar 25, 2022
Geal (Contributor) commented Mar 25, 2022

@tomrj did you look into the pod's behaviour? Could you give me some of the parameters of the bench, like number of concurrent connections, size of query, etc? When you say "total inability to run the benchmark", do you mean that the benchmark never finishes? Or that a benchmark runs entirely but then successive benchmarks fail?

Geal (Contributor) commented Mar 25, 2022

@tomrj I can reproduce the issue; I'll let you know when I have a fix for it.

Geal (Contributor) commented Mar 28, 2022

Here is what I know so far: in a series of subgraph requests happening on a connection, at some point one of them has an issue and no more requests are sent to that subgraph. The router still accepts new requests, but does not answer them if they are waiting for a response from that subgraph.

The timeline, pieced together from logs and wireshark:

  • 2022-03-25T15:58:06.207045290Z wireshark sees a packet sent to the subgraph
  • 2022-03-25T15:58:06.207059Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy } sent the packet (hyper's side)
  • 2022-03-25T15:58:06.207080Z TRACE hyper::proto::h1::io: received 0 bytes epoll notifies that the socket got an event; hyper tries to read from the socket and gets a successful read of 0 bytes, which indicates the other side closed the socket
  • 2022-03-25T15:58:06.207089Z TRACE hyper::proto::h1::conn: found unexpected EOF on busy connection: State { reading: KeepAlive, writing: Init, keep_alive: Busy } since hyper was expecting a response, it considers the connection failed and will close it
  • 2022-03-25T15:58:06.207200201Z wireshark sees the FIN packet (because the connection was closed by hyper)
  • 2022-03-25T15:58:06.207842869Z wireshark sees the response packet from the subgraph (with a RST flag). Hyper believes the subgraph closed the connection, while the packet trace shows the close was initiated by the router
  • from here on we should see hyper either retry the request (but we have not set up retries) or return an error. But the future holding the request on the router's side never returns, and any new request gets stuck in the same way

I debugged hyper's execution a bit and right now I do not believe the problem comes from hyper; it might be in the reqwest library that we use as the HTTP client.
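
Not the fix that eventually landed in #752, but a possible defensive mitigation while this is tracked down would be to give the subgraph HTTP client explicit timeouts, so a request whose response never arrives resolves with an error instead of hanging. A sketch with arbitrary durations, assuming the client is built with reqwest directly:

```rust
// Sketch of a reqwest client with explicit timeouts; durations are arbitrary.
use std::time::Duration;

fn build_subgraph_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Overall per-request deadline: a stuck request errors out instead of
        // waiting forever on a response that never comes.
        .timeout(Duration::from_secs(30))
        // Fail fast if the TCP connection cannot even be established.
        .connect_timeout(Duration::from_secs(5))
        // Drop idle pooled connections quickly rather than reusing ones the
        // subgraph may already have closed on its side.
        .pool_idle_timeout(Duration::from_secs(5))
        .build()
}
```

Since the request timeout wraps the whole in-flight request, it should bound the wait on the router's side even if the underlying future never completes on its own.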

tomrj commented Mar 28, 2022

> @tomrj did you look into the pod's behaviour? Could you give me some of the parameters of the bench, like number of concurrent connections, size of query, etc? When you say "total inability to run the benchmark", do you mean that the benchmark never finishes? Or that a benchmark runs entirely but then successive benchmarks fail?

The run seems to be unable to connect to the router at all; all requests fail.
