
router stopped while loading #2712

Closed
chkp-talron opened this issue Mar 5, 2023 · 10 comments


chkp-talron commented Mar 5, 2023

Describe the bug
We've had a v1.7.0 router setup working on k8s. When we tried to move to v1.8.0–v1.11.0, they all failed during router startup. There is no heavy load involved; we simply start the router container.
Attaching the TRACE log.

To Reproduce
Steps to reproduce the behavior:

  1. Running in k8s using the official image with the following arguments:
    --hot-reload --config router.yaml --supergraph supergraph.graphql
  2. We also have a Node.js sidecar that updates the supergraph.graphql file whenever any service schema changes.

Expected behavior
The router should start and serve traffic as it did in v1.7.0.

Output

router.yaml:

    supergraph:
      path: /graphql
      listen: "0.0.0.0:4000"
      introspection: true

    headers:
      all:
        request:
          - propagate:
              matching: ".*"

    include_subgraph_errors:
      all: true


Geal commented Mar 6, 2023

Could you explain what you mean by "failed to start" here? Is there a panic? Is it hanging?


garypen commented Mar 6, 2023

From the log file, it looks like the router is shutting down normally:
{"timestamp":"2023-03-05T08:05:33.538201Z","level":"INFO","message":"shutting down"}

It's not obvious, however, what is causing the shutdown to occur.

I'll make a guess though...

I note (from the logs) that the router is shutting down approximately 30 seconds after it starts. This is probably a health or readiness probe problem in the Kubernetes deployment.
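For reference, a minimal sketch of the probe arithmetic behind that guess, assuming the kubelet's default probe settings rather than this deployment's actual manifest:

```yaml
# Illustrative only. With the kubelet defaults periodSeconds: 10 and
# failureThreshold: 3, a liveness probe that never succeeds gets the
# container restarted roughly 30 seconds after startup, which matches
# the shutdown seen in the attached log.
livenessProbe:
  httpGet:
    path: "/graphql/?query={__typename}"   # hypothetical probe path
    port: 4000
  periodSeconds: 10
  failureThreshold: 3
```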

chkp-talron (Author) commented:

If I understand it correctly, the process exits after failing to process a request. You can see this in the attached log file:
{"timestamp":"2023-03-05T08:05:33.501371Z","level":"TRACE","message":"Conn::read_head"} {"timestamp":"2023-03-05T08:05:33.501383Z","level":"TRACE","message":"flushed({role=server}): State { reading: Init, writing: Init, keep_alive: Busy }"} {"timestamp":"2023-03-05T08:05:33.501400Z","level":"TRACE","message":"Conn::read_head"} {"timestamp":"2023-03-05T08:05:33.501406Z","level":"TRACE","message":"received 164 bytes"} {"timestamp":"2023-03-05T08:05:33.501442Z","level":"DEBUG","message":"setting h1 header read timeout timer","span":{"name":"parse_headers"},"spans":[{"name":"parse_headers"}]} {"timestamp":"2023-03-05T08:05:33.501458Z","level":"TRACE","message":"Request.parse","bytes":164,"span":{"name":"parse_headers"},"spans":[{"name":"parse_headers"}]} {"timestamp":"2023-03-05T08:05:33.501466Z","level":"TRACE","message":"Request.parse Complete(164)","span":{"name":"parse_headers"},"spans":[{"name":"parse_headers"}]} {"timestamp":"2023-03-05T08:05:33.501503Z","level":"DEBUG","message":"parsed 5 headers"} {"timestamp":"2023-03-05T08:05:33.501513Z","level":"DEBUG","message":"incoming body is empty"} {"timestamp":"2023-03-05T08:05:33.501518Z","level":"TRACE","message":"remote disabling keep-alive"} {"timestamp":"2023-03-05T08:05:33.501559Z","level":"DEBUG","message":"started processing request","span":{"http.flavor":"HTTP/1.1","http.method":"GET","http.route":"/graphql/?query={__typename}","otel.kind":"SERVER","name":"request"},"spans":[{"http.flavor":"HTTP/1.1","http.method":"GET","http.route":"/graphql/?query={__typename}","otel.kind":"SERVER","name":"request"}]} {"timestamp":"2023-03-05T08:05:33.501591Z","level":"DEBUG","message":"finished processing request","latency":"0 ms","status":404,"span":{"http.flavor":"HTTP/1.1","http.method":"GET","http.route":"/graphql/?query={__typename}","otel.kind":"SERVER","name":"request"},"spans":[{"http.flavor":"HTTP/1.1","http.method":"GET","http.route":"/graphql/?query={__typename}","otel.kind":"SERVER","name":"request"}]} {"timestamp":"2023-03-05T08:05:33.501621Z","level":"TRACE","message":"Server::encode status=404, body=Some(Unknown), req_method=Some(GET)","span":{"name":"encode_headers"},"spans":[{"name":"encode_headers"}]} {"timestamp":"2023-03-05T08:05:33.501643Z","level":"TRACE","message":"no more write body allowed, user body is_end_stream = false"} {"timestamp":"2023-03-05T08:05:33.501805Z","level":"DEBUG","message":"flushed 82 bytes"} {"timestamp":"2023-03-05T08:05:33.501816Z","level":"TRACE","message":"State::close()"} {"timestamp":"2023-03-05T08:05:33.501819Z","level":"TRACE","message":"flushed({role=server}): State { reading: Closed, writing: Closed, keep_alive: Disabled }"} {"timestamp":"2023-03-05T08:05:33.501832Z","level":"TRACE","message":"shut down IO complete"} {"timestamp":"2023-03-05T08:05:33.501841Z","level":"TRACE","message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-05T08:05:33.538201Z","level":"INFO","message":"shutting down"} {"timestamp":"2023-03-05T08:05:33.538269Z","level":"TRACE","message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-05T08:05:33.538289Z","level":"TRACE","message":"deregistering event source from 
poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-05T08:05:33.538402Z","level":"TRACE","message":"worker polling for next message"} {"timestamp":"2023-03-05T08:05:33.538426Z","level":"TRACE","message":"buffer already closed"} {"timestamp":"2023-03-05T08:05:33.605717Z","level":"DEBUG","message":"deno runtime shutdown successfully"} {"timestamp":"2023-03-05T08:05:33.609404Z","level":"DEBUG","message":"stopped"}

chkp-talron (Author) commented:

> From the log file, it looks like the router is shutting down normally: {"timestamp":"2023-03-05T08:05:33.538201Z","level":"INFO","message":"shutting down"}
>
> It's not obvious, however, what is causing the shutdown to occur.
>
> I'll make a guess though...
>
> I note (from the logs) that the router is shutting down approximately 30 seconds after it starts. This is probably a health or readiness problem in a kubernetes deployment.

The health check might indeed be the issue.
I can see that the request has these attributes:
"otel.kind":"SERVER"
"http.route":"/graphql/?query={__typename}" <-- this is actually how the health check is configured in the container deployment

"status":404 <-- it is unclear why this fails with 404
With v1.7.0 this succeeded, but it started failing from v1.8 through v1.11.

Has something changed?


garypen commented Mar 6, 2023

@talr1 Things have changed, see the CHANGELOG.md for (many) details.

The router has a health_check built-in: https://www.apollographql.com/docs/router/configuration/health-checks/

So, that's the recommended way to check that the router is healthy.

Our helm chart's values.yaml contains configuration for this:

    health_check:
      listen: 0.0.0.0:8088

If you add this to your router.yaml configuration, then you should be good.


chkp-talron commented Mar 6, 2023

Thanks for the prompt help.
I've updated router.yaml with the new health setting and updated the deployment.
However, the router still exits.
Now I can't even see the logs for the health request.
Attaching the new log.

The end of the log looks like:

{"timestamp":"2023-03-06T15:40:03.273406Z","level":"DEBUG","message":"previous server stopped"} {"timestamp":"2023-03-06T15:40:03.273437Z","level":"TRACE","message":"worker polling for next message"} {"timestamp":"2023-03-06T15:40:03.273686Z","level":"TRACE","message":"buffer already closed"} {"timestamp":"2023-03-06T15:40:03.273814Z","level":"INFO","message":"Health check endpoint exposed at http://0.0.0.0:8088/health"} {"timestamp":"2023-03-06T15:40:03.273957Z","level":"INFO","message":"GraphQL endpoint exposed at http://0.0.0.0:4000/graphql 🚀"} {"timestamp":"2023-03-06T15:40:03.273986Z","level":"DEBUG","message":"extra endpoints the router listens to","tracing_endpoints":"http://0.0.0.0:8088"} {"timestamp":"2023-03-06T15:40:03.274009Z","level":"TRACE","message":"worker polling for next message"} {"timestamp":"2023-03-06T15:40:03.274052Z","level":"DEBUG","message":"restarted on http://0.0.0.0:8088"} {"timestamp":"2023-03-06T15:40:03.274066Z","level":"INFO","message":"reload complete"} {"timestamp":"2023-03-06T15:40:03.304237Z","level":"DEBUG","message":"deno runtime shutdown successfully"} {"timestamp":"2023-03-06T15:40:06.649540Z","level":"INFO","message":"shutting down"} {"timestamp":"2023-03-06T15:40:06.649624Z","level":"TRACE","message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-06T15:40:06.649652Z","level":"TRACE","message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-06T15:40:06.649721Z","level":"TRACE","message":"worker polling for next message"} {"timestamp":"2023-03-06T15:40:06.649732Z","level":"TRACE","message":"buffer already closed"} {"timestamp":"2023-03-06T15:40:06.670656Z","level":"DEBUG","message":"deno runtime shutdown successfully"} {"timestamp":"2023-03-06T15:40:06.671163Z","level":"DEBUG","message":"stopped"}

    supergraph:
      path: /graphql
      listen: "0.0.0.0:4000"
      introspection: true

    headers:
      all:
        request:
          - propagate:
              matching: ".*"

    include_subgraph_errors:
      all: true

    health_check:
      listen: 0.0.0.0:8088
      enabled: true

and the k8s deployment health checks to:

    livenessProbe:
      httpGet:
        path: "/health"
        port: 8080
    readinessProbe:
      httpGet:
        path: "/health"
        port: 8080

Log file:
router_1_11_0_log2.txt


garypen commented Mar 6, 2023

It looks like you are using 8080 in your k8s, but specifying 8088 in your router configuration. Those numbers need to agree.

So, just pick whichever one you prefer. I think 8088 is the default for the router.
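For example, keeping the router default means pointing both probes at 8088. A sketch of the corrected deployment snippet, assuming the health_check config shown above:

```yaml
# Sketch only: probes matching the router's health_check listener on 8088.
livenessProbe:
  httpGet:
    path: "/health"
    port: 8088
readinessProbe:
  httpGet:
    path: "/health"
    port: 8088
```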

chkp-talron (Author) commented:

Nice and quick catch, sorry for the typo.
I'll try the updated setup.

chkp-talron (Author) commented:

OK, so now it seems to work fine. Thanks a lot for the help.

If I may ask another question: I've tried adding OTel telemetry (which was the whole point of upgrading to 1.11.0, due to the fixes in OTel).

  1. HTTP did not work, gRPC works fine. Any idea? (See also the hedged sketch after this list.)
    Working:

        telemetry:
          tracing:
            trace_config:
              service_name: "federation-router"
            otlp:
              endpoint: "http://example-opentelemetry-collector:4317"
              protocol: grpc

    Not working:

        telemetry:
          tracing:
            trace_config:
              service_name: "federation-router"
            otlp:
              endpoint: http://example-opentelemetry-collector:4318
              protocol: http

  2. Is there any way to limit the number of spans on each request? There seem to be ~8-9 spans per request, and most of the time they don't add much useful info beyond taking up storage.
    The span names are:
    request, router, supergraph, query_planning, parse_query, execution, fetch,
    sub_graph, sub_graph_request
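On the HTTP failure in item 1, a hedged guess based on the OTLP spec rather than anything router-specific: OTLP over HTTP posts traces to a per-signal path, conventionally /v1/traces on port 4318, and some collector setups only accept the full URL. A sketch of that variant, assuming the configured endpoint is used verbatim:

```yaml
# Assumption: the collector expects the full OTLP/HTTP traces path.
telemetry:
  tracing:
    trace_config:
      service_name: "federation-router"
    otlp:
      endpoint: "http://example-opentelemetry-collector:4318/v1/traces"
      protocol: http
```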

From my testing it seems that the number of spans is tied to the router log level (APOLLO_ROUTER_LOG). With level 'info' I can see the ~8-9 spans listed above; with level 'warn' no spans are emitted. I haven't checked other levels.

IMO there should be at least one span from each service when I look at a full trace, regardless of that service's log level. As it stands, tracing is configured on the router but nothing is emitted at the 'warn' level.
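On limiting volume, a minimal sketch, assuming the sampler option under the router's trace_config accepts a ratio; note this samples whole traces rather than trimming spans within a trace:

```yaml
# Assumption: trace_config supports a sampler ratio between 0.0 and 1.0.
# This keeps roughly 10% of traces; it does not reduce the number of
# spans inside any single trace.
telemetry:
  tracing:
    trace_config:
      service_name: "federation-router"
      sampler: 0.1
```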

Thanks in advance for any help.

garypen self-assigned this Mar 9, 2023

garypen commented Mar 9, 2023

I'm unclear whether this was actually a bug, but since the suggestion to use the router's built-in health_check has resolved the problem, I'm happy to close this.

@talr1 I'm going to migrate your last section to a discussion question and link it back to this issue.

garypen closed this as completed Mar 9, 2023