router hangs under load #550

Closed · garypen opened this issue Feb 25, 2022 · 7 comments · Fixed by #752
Labels: performance (Performance or scalability issues)

Comments

garypen (Contributor) commented Feb 25, 2022

Describe the bug

Report 1 (from @tomrj)

I did not see this mysterious behaviour in previous builds, but when running some load against the router (I'm using a tool called autocannon), it gets stuck after a while and does not recover.

However, when curling the server with the same headers and the same query, it works just fine.

Report 2

I can replicate this sort of behavior when I blow out concurrency limits on downstream lambdas and they start throttling. If I recall, the router didn't seem to handle the closure of stuck threads all that gracefully and would eventually top out.

To Reproduce

Need more info from users.

Expected behavior

The router shouldn't hang.

garypen added the triage and performance (Performance or scalability issues) labels on Feb 25, 2022
@BrynCooke (Contributor)

Something to ask is whether query variables are being inlined.
That would cause a severe performance issue, since query planning would then happen on every request.

abernix (Member) commented Feb 28, 2022

> Something to ask is whether query variables are being inlined.
> That would cause a severe performance issue, since query planning would then happen on every request.

It's worth noting that this would only be problematic if the operations generated by autocannon were truly dynamic and there was a high cardinality of operations being received. If the same set of inlined variables is sent over and over again, I wouldn't expect a performance hit, since the cache would still suffice. (Worth noting that we also have an article about best practices that's somewhat relevant.)
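
For illustration, a minimal sketch of the two request shapes being discussed; the endpoint, operation, and `product` field are hypothetical, not taken from this benchmark. The first shape produces a new document (and a new query plan) for every distinct value, while the second keeps the document constant:

```rust
// Hypothetical request bodies; operation and field names are made up for illustration.
use serde_json::json;

fn main() {
    // Inlined variable: the literal is baked into the query string, so every
    // distinct value yields a distinct document and a distinct plan-cache entry.
    let inlined = json!({
        "query": r#"query { product(id: "42") { name } }"#
    });

    // Separate variables: the document stays constant across requests, so the
    // query-plan cache keeps hitting the same entry; only `variables` changes.
    let parameterized = json!({
        "query": "query Product($id: ID!) { product(id: $id) { name } }",
        "variables": { "id": "42" }
    });

    println!("{inlined}\n{parameterized}");
}
```

If the benchmark always sends the exact same inlined literal, both shapes should hit the cache, which matches the point above about cardinality.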

tomrj commented Mar 17, 2022

The queries were all static and there was only one query; this was a really simple microbenchmark. The effect was not so much a performance hit as a total inability to run the benchmark after a while, yet curl on the same query would still work reliably, even curl in a shell loop, without missing a beat.

So to go further I need to verify whether it is the pod that is messed up, and whether this scenario occurs with other kinds of benchmarks, such as artillery or some native one.
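
For reference, a rough sketch of what that kind of microbenchmark boils down to, written as a small tokio/reqwest program. The connection count, iteration count, endpoint, and query are placeholders rather than the values used in the report (assumed Cargo dependencies: tokio with `macros`/`rt-multi-thread`, reqwest with the `json` feature, serde_json):

```rust
// Rough stand-in for an autocannon-style run: N concurrent connections, each
// repeatedly POSTing the same static query. All values below are placeholders.
use serde_json::json;

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let body = json!({ "query": "{ me { name } }" });

    let mut tasks = Vec::new();
    for _ in 0..50 {
        let client = client.clone();
        let body = body.clone();
        tasks.push(tokio::spawn(async move {
            for _ in 0..1_000 {
                match client.post("http://localhost:4000/").json(&body).send().await {
                    // Drain the body so the connection can be reused.
                    Ok(resp) => { let _ = resp.bytes().await; }
                    Err(e) => eprintln!("request failed: {e}"),
                }
            }
        }));
    }
    for t in tasks {
        let _ = t.await;
    }
}
```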

abernix removed the triage label on Mar 24, 2022
Geal self-assigned this and unassigned o0Ignition0o on Mar 25, 2022
Geal (Contributor) commented Mar 25, 2022

@tomrj did you look into the pod's behaviour? Could you give me some of the parameters of the bench, like number of concurrent connections, size of query, etc? When you say "total inability to run the benchmark", do you mean that the benchmark never finishes? Or that a benchmark runs entirely but then successive benchmarks fail?

Geal (Contributor) commented Mar 25, 2022

@tomrj I can reproduce the issue; I'll let you know when I have a fix for it.

Geal (Contributor) commented Mar 28, 2022

Here is what I know so far: in a series of subgraph requests happening on a connection, at some point one of them has an issue and no more requests are sent to that subgraph. The router still accepts new requests, but does not answer them if they are waiting for a response from that subgraph.

The timeline, pieced together from logs and wireshark:

  • 2022-03-25T15:58:06.207045290Z wireshark sees a packet sent to the subgraph
  • 2022-03-25T15:58:06.207059Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy } sent the packet (hyper's side)
  • 2022-03-25T15:58:06.207080Z TRACE hyper::proto::h1::io: received 0 bytes epoll notifies that the socket got an event; hyper tries to read from the socket and gets a successful read of 0 bytes, which indicates the other side closed the socket
  • 2022-03-25T15:58:06.207089Z TRACE hyper::proto::h1::conn: found unexpected EOF on busy connection: State { reading: KeepAlive, writing: Init, keep_alive: Busy } since hyper was expecting a response, it considers the connection failed and will close it
  • 2022-03-25T15:58:06.207200201Z wireshark sees the FIN packet (because the connection was closed by hyper)
  • 2022-03-25T15:58:06.207842869Z wireshark sees the response packet from the subgraph (with a RST flag). Hyper believes the subgraph closed the connection, while the packet trace shows the close was initiated by the router
  • from here on we should see hyper either retry the request (but we have not set up retries) or return an error. But the future holding the request on the router's side never returns, and any new request gets stuck in the same way

I debugged hyper's execution a bit and right now I do not believe the problem comes from hyper; it might be in the reqwest library that we use as the HTTP client.
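
Not the fix that eventually landed in #752, but a possible defensive mitigation while this is tracked down would be to give the subgraph HTTP client explicit timeouts, so a request whose response never arrives resolves with an error instead of hanging. A sketch with arbitrary durations, assuming the client is built with reqwest directly:

```rust
// Sketch of a reqwest client with explicit timeouts; durations are arbitrary.
use std::time::Duration;

fn build_subgraph_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Overall per-request deadline: a stuck request errors out instead of
        // waiting forever on a response that never comes.
        .timeout(Duration::from_secs(30))
        // Fail fast if the TCP connection cannot even be established.
        .connect_timeout(Duration::from_secs(5))
        // Drop idle pooled connections quickly rather than reusing ones the
        // subgraph may already have closed on its side.
        .pool_idle_timeout(Duration::from_secs(5))
        .build()
}
```

Since the request timeout wraps the whole in-flight request, it should bound the wait on the router's side even if the underlying future never completes on its own.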

tomrj commented Mar 28, 2022

> @tomrj did you look into the pod's behaviour? Could you give me some of the parameters of the bench, like number of concurrent connections, size of query, etc? When you say "total inability to run the benchmark", do you mean that the benchmark never finishes? Or that a benchmark runs entirely but then successive benchmarks fail?

The run seems to be unable to connect to the router at all; all requests fail.
