
router stopped while loading #2712

Closed
chkp-talron opened this issue Mar 5, 2023 · 10 comments


chkp-talron commented Mar 5, 2023

Describe the bug
We've had a v1.7.0 router setup working on k8s. When we tried to move to v1.8.0–v1.11.0, they all failed during router startup. There is no heavy load involved; we simply start the router container.
Attaching the TRACE log.

To Reproduce
Steps to reproduce the behavior:

  1. Running in k8s using the official image with the following arguments:
    --hot-reload --config router.yaml --supergraph supergraph.graphql
  2. We also have a Node.js sidecar that updates the supergraph.graphql file whenever any service schema changes.

Expected behavior
The router should start and serve traffic as it did in v1.7.0.

Output

router.yaml:

    supergraph:
      path: /graphql
      listen: "0.0.0.0:4000"
      introspection: true

    headers:
      all:
        request:
          - propagate:
              matching: ".*"

    include_subgraph_errors:
      all: true


Geal commented Mar 6, 2023

Could you explain what you mean by "failed to start" here? Is there a panic? Is it hanging?


garypen commented Mar 6, 2023

From the log file, it looks like the router is shutting down normally:
{"timestamp":"2023-03-05T08:05:33.538201Z","level":"INFO","message":"shutting down"}

It's not obvious, however, what is causing the shutdown to occur.

I'll make a guess though...

I note (from the logs) that the router is shutting down approximately 30 seconds after it starts. This is probably a health or readiness probe problem in the Kubernetes deployment.
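For reference, a minimal sketch of the probe arithmetic behind that guess, assuming the kubelet's default probe settings rather than this deployment's actual manifest:

```yaml
# Illustrative only. With the kubelet defaults periodSeconds: 10 and
# failureThreshold: 3, a liveness probe that never succeeds gets the
# container restarted roughly 30 seconds after startup, which matches
# the shutdown seen in the attached log.
livenessProbe:
  httpGet:
    path: "/graphql/?query={__typename}"   # hypothetical probe path
    port: 4000
  periodSeconds: 10
  failureThreshold: 3
```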

chkp-talron (Author) commented:

If I understand it correctly, the process exits after failing to process a request. You can see this in the attached log file:
{"timestamp":"2023-03-05T08:05:33.501371Z","level":"TRACE","message":"Conn::read_head"} {"timestamp":"2023-03-05T08:05:33.501383Z","level":"TRACE","message":"flushed({role=server}): State { reading: Init, writing: Init, keep_alive: Busy }"} {"timestamp":"2023-03-05T08:05:33.501400Z","level":"TRACE","message":"Conn::read_head"} {"timestamp":"2023-03-05T08:05:33.501406Z","level":"TRACE","message":"received 164 bytes"} {"timestamp":"2023-03-05T08:05:33.501442Z","level":"DEBUG","message":"setting h1 header read timeout timer","span":{"name":"parse_headers"},"spans":[{"name":"parse_headers"}]} {"timestamp":"2023-03-05T08:05:33.501458Z","level":"TRACE","message":"Request.parse","bytes":164,"span":{"name":"parse_headers"},"spans":[{"name":"parse_headers"}]} {"timestamp":"2023-03-05T08:05:33.501466Z","level":"TRACE","message":"Request.parse Complete(164)","span":{"name":"parse_headers"},"spans":[{"name":"parse_headers"}]} {"timestamp":"2023-03-05T08:05:33.501503Z","level":"DEBUG","message":"parsed 5 headers"} {"timestamp":"2023-03-05T08:05:33.501513Z","level":"DEBUG","message":"incoming body is empty"} {"timestamp":"2023-03-05T08:05:33.501518Z","level":"TRACE","message":"remote disabling keep-alive"} {"timestamp":"2023-03-05T08:05:33.501559Z","level":"DEBUG","message":"started processing request","span":{"http.flavor":"HTTP/1.1","http.method":"GET","http.route":"/graphql/?query={__typename}","otel.kind":"SERVER","name":"request"},"spans":[{"http.flavor":"HTTP/1.1","http.method":"GET","http.route":"/graphql/?query={__typename}","otel.kind":"SERVER","name":"request"}]} {"timestamp":"2023-03-05T08:05:33.501591Z","level":"DEBUG","message":"finished processing request","latency":"0 ms","status":404,"span":{"http.flavor":"HTTP/1.1","http.method":"GET","http.route":"/graphql/?query={__typename}","otel.kind":"SERVER","name":"request"},"spans":[{"http.flavor":"HTTP/1.1","http.method":"GET","http.route":"/graphql/?query={__typename}","otel.kind":"SERVER","name":"request"}]} {"timestamp":"2023-03-05T08:05:33.501621Z","level":"TRACE","message":"Server::encode status=404, body=Some(Unknown), req_method=Some(GET)","span":{"name":"encode_headers"},"spans":[{"name":"encode_headers"}]} {"timestamp":"2023-03-05T08:05:33.501643Z","level":"TRACE","message":"no more write body allowed, user body is_end_stream = false"} {"timestamp":"2023-03-05T08:05:33.501805Z","level":"DEBUG","message":"flushed 82 bytes"} {"timestamp":"2023-03-05T08:05:33.501816Z","level":"TRACE","message":"State::close()"} {"timestamp":"2023-03-05T08:05:33.501819Z","level":"TRACE","message":"flushed({role=server}): State { reading: Closed, writing: Closed, keep_alive: Disabled }"} {"timestamp":"2023-03-05T08:05:33.501832Z","level":"TRACE","message":"shut down IO complete"} {"timestamp":"2023-03-05T08:05:33.501841Z","level":"TRACE","message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-05T08:05:33.538201Z","level":"INFO","message":"shutting down"} {"timestamp":"2023-03-05T08:05:33.538269Z","level":"TRACE","message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-05T08:05:33.538289Z","level":"TRACE","message":"deregistering event source from 
poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-05T08:05:33.538402Z","level":"TRACE","message":"worker polling for next message"} {"timestamp":"2023-03-05T08:05:33.538426Z","level":"TRACE","message":"buffer already closed"} {"timestamp":"2023-03-05T08:05:33.605717Z","level":"DEBUG","message":"deno runtime shutdown successfully"} {"timestamp":"2023-03-05T08:05:33.609404Z","level":"DEBUG","message":"stopped"}

chkp-talron (Author) commented:

> From the log file, it looks like the router is shutting down normally: {"timestamp":"2023-03-05T08:05:33.538201Z","level":"INFO","message":"shutting down"}
>
> It's not obvious, however, what is causing the shutdown to occur.
>
> I'll make a guess though...
>
> I note (from the logs) that the router is shutting down approximately 30 seconds after it starts. This is probably a health or readiness problem in a kubernetes deployment.

The health check might indeed be the issue.
I can see that the request has these attributes:
"otel.kind":"SERVER"
"http.route":"/graphql/?query={__typename}" <-- this is actually how the health check is configured in the container deployment

"status":404 <-- it is unclear why this fails with 404
With v1.7.0 this succeeded, but it started failing from v1.8 through v1.11.

Has something changed?


garypen commented Mar 6, 2023

@talr1 Things have changed, see the CHANGELOG.md for (many) details.

The router has a health_check built-in: https://www.apollographql.com/docs/router/configuration/health-checks/

So, that's the recommended way to check that the router is healthy.

Our helm chart's values.yaml contains configuration for this:

    health_check:
      listen: 0.0.0.0:8088

If you add this to your router.yaml configuration, then you should be good.


chkp-talron commented Mar 6, 2023

Thanks for the prompt help.
I've updated router.yaml with the new health setting and updated the deployment.
However, the router still exits.
Now I can't even see the logs for the health request.
Attaching the new log.

The end of the log looks like:

{"timestamp":"2023-03-06T15:40:03.273406Z","level":"DEBUG","message":"previous server stopped"} {"timestamp":"2023-03-06T15:40:03.273437Z","level":"TRACE","message":"worker polling for next message"} {"timestamp":"2023-03-06T15:40:03.273686Z","level":"TRACE","message":"buffer already closed"} {"timestamp":"2023-03-06T15:40:03.273814Z","level":"INFO","message":"Health check endpoint exposed at http://0.0.0.0:8088/health"} {"timestamp":"2023-03-06T15:40:03.273957Z","level":"INFO","message":"GraphQL endpoint exposed at http://0.0.0.0:4000/graphql 🚀"} {"timestamp":"2023-03-06T15:40:03.273986Z","level":"DEBUG","message":"extra endpoints the router listens to","tracing_endpoints":"http://0.0.0.0:8088"} {"timestamp":"2023-03-06T15:40:03.274009Z","level":"TRACE","message":"worker polling for next message"} {"timestamp":"2023-03-06T15:40:03.274052Z","level":"DEBUG","message":"restarted on http://0.0.0.0:8088"} {"timestamp":"2023-03-06T15:40:03.274066Z","level":"INFO","message":"reload complete"} {"timestamp":"2023-03-06T15:40:03.304237Z","level":"DEBUG","message":"deno runtime shutdown successfully"} {"timestamp":"2023-03-06T15:40:06.649540Z","level":"INFO","message":"shutting down"} {"timestamp":"2023-03-06T15:40:06.649624Z","level":"TRACE","message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-06T15:40:06.649652Z","level":"TRACE","message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/home/circleci/.cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.5/src/poll.rs","log.line":663} {"timestamp":"2023-03-06T15:40:06.649721Z","level":"TRACE","message":"worker polling for next message"} {"timestamp":"2023-03-06T15:40:06.649732Z","level":"TRACE","message":"buffer already closed"} {"timestamp":"2023-03-06T15:40:06.670656Z","level":"DEBUG","message":"deno runtime shutdown successfully"} {"timestamp":"2023-03-06T15:40:06.671163Z","level":"DEBUG","message":"stopped"}

    supergraph:
      path: /graphql
      listen: "0.0.0.0:4000"
      introspection: true

    headers:
      all:
        request:
          - propagate:
              matching: ".*"

    include_subgraph_errors:
      all: true

    health_check:
      listen: 0.0.0.0:8088
      enabled: true

and the k8s deployment health checks to:

    livenessProbe:
      httpGet:
        path: "/health"
        port: 8080
    readinessProbe:
      httpGet:
        path: "/health"
        port: 8080

Log file:
router_1_11_0_log2.txt


garypen commented Mar 6, 2023

It looks like you are using 8080 in your k8s, but specifying 8088 in your router configuration. Those numbers need to agree.

So, just pick whichever one you prefer. I think 8088 is the default for the router.
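For example, keeping the router default means pointing both probes at 8088. A sketch of the corrected deployment snippet, assuming the health_check config shown above:

```yaml
# Sketch only: probes matching the router's health_check listener on 8088.
livenessProbe:
  httpGet:
    path: "/health"
    port: 8088
readinessProbe:
  httpGet:
    path: "/health"
    port: 8088
```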

chkp-talron (Author) commented:

Nice and quick catch, sorry for the typo.
I'll try the updated setup.

chkp-talron (Author) commented:

OK, so now it seems to work fine. Thanks a lot for the help.

If I may ask another question: I've tried adding OTel telemetry (which was the whole point of upgrading to 1.11.0, due to the fixes in OTel).

  1. HTTP did not work, gRPC works fine. Any idea? (See also the hedged sketch after this list.)
    Working:

        telemetry:
          tracing:
            trace_config:
              service_name: "federation-router"
            otlp:
              endpoint: "http://example-opentelemetry-collector:4317"
              protocol: grpc

    Not working:

        telemetry:
          tracing:
            trace_config:
              service_name: "federation-router"
            otlp:
              endpoint: http://example-opentelemetry-collector:4318
              protocol: http

  2. Is there any way to limit the number of spans on each request? There seem to be ~8-9 spans per request, and most of the time they don't add much useful info beyond taking up storage.
    The span names are:
    request, router, supergraph, query_planning, parse_query, execution, fetch,
    sub_graph, sub_graph_request
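On the HTTP failure in item 1, a hedged guess based on the OTLP spec rather than anything router-specific: OTLP over HTTP posts traces to a per-signal path, conventionally /v1/traces on port 4318, and some collector setups only accept the full URL. A sketch of that variant, assuming the configured endpoint is used verbatim:

```yaml
# Assumption: the collector expects the full OTLP/HTTP traces path.
telemetry:
  tracing:
    trace_config:
      service_name: "federation-router"
    otlp:
      endpoint: "http://example-opentelemetry-collector:4318/v1/traces"
      protocol: http
```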

From my testing it seems that the number of spans is tied to the router log level (APOLLO_ROUTER_LOG). With level 'info' I can see the ~8-9 spans listed above; with level 'warn' no spans are emitted. I haven't checked other levels.

IMO there should be at least one span from each service when I look at a full trace, regardless of that service's log level. As it stands, tracing is configured on the router but nothing is emitted at the 'warn' level.
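On limiting volume, a minimal sketch, assuming the sampler option under the router's trace_config accepts a ratio; note this samples whole traces rather than trimming spans within a trace:

```yaml
# Assumption: trace_config supports a sampler ratio between 0.0 and 1.0.
# This keeps roughly 10% of traces; it does not reduce the number of
# spans inside any single trace.
telemetry:
  tracing:
    trace_config:
      service_name: "federation-router"
      sampler: 0.1
```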

Thanks in advance for any help.

garypen self-assigned this Mar 9, 2023

garypen commented Mar 9, 2023

I'm unclear whether this was actually a bug, but since the suggestion to use the router's built-in health_check has resolved the problem, I'm happy to close this.

@talr1 I'm going to migrate your last section to a discussion question and link it back to this issue.

garypen closed this as completed Mar 9, 2023