conduit-proxy pegging cpu #519

Closed
siggy opened this issue Mar 6, 2018 · 5 comments · Fixed by #537

siggy commented Mar 6, 2018

Building and running Conduit from master, I observe the conduit-proxy in the Grafana pod pegging the CPU, about a minute after startup.

Grafana complains that it cannot connect to grafana.com, which is a manifestation of #155. I confirmed this is what triggers the CPU pegging.

Workarounds

We can work around this in one of two ways.

Ignore outbound 443

Ignore outbound port 443 in the Grafana pod's proxy-init container:

- --outbound-ports-to-ignore
- "443"

Disable update check

Disable the Grafana update check in grafana.ini:

[analytics]
check_for_updates = false

Repro

DOCKER_TRACE=1 bin/docker-build
bin/conduit install --proxy-log-level debug,conduit_proxy=debug | kubectl apply -f -

Note that curl'ing Prometheus from the Grafana container hangs:

$ kubectl -n conduit exec -it grafana-68954c7876-txnsd /bin/bash -c grafana
$ curl -v prometheus.conduit.svc.cluster.local:9090
* Rebuilt URL to: prometheus.conduit.svc.cluster.local:9090/
* Hostname was NOT found in DNS cache
*   Trying 10.104.147.185...
* Connected to prometheus.conduit.svc.cluster.local (10.104.147.185) port 9090 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.38.0
> Host: prometheus.conduit.svc.cluster.local:9090
> Accept: */*
>

(hangs)

CPU

$ docker stats $(docker ps --format={{.Names}})

CONTAINER ID        NAME                                                                                                                    CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
...
dc0e821e277f        k8s_conduit-proxy_grafana-68954c7876-txnsd_conduit_13a09de6-2184-11e8-b369-025000000001_0                               95.08%              984KiB / 11.71GiB     0.01%               0B / 0B             0B / 0B             2
...

Logs

Grafana

$ kubectl -n conduit logs -f  $(kubectl --namespace=conduit get po --selector=conduit.io/control-plane-component=grafana -o jsonpath='{.items[*].metadata.name}') -c grafana
...
t=2018-03-06T22:49:39+0000 lvl=dbug msg="Failed to get plugins repo from grafana.net, Get https://grafana.com/api/plugins/versioncheck?slugIn=&grafanaVersion=5.0.0-beta4: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"

conduit-proxy

$ kubectl -n conduit logs -f  $(kubectl --namespace=conduit get po --selector=conduit.io/control-plane-component=grafana -o jsonpath='{.items[*].metadata.name}') -c conduit-proxy
DBUG tokio_core::reactor loop poll - Duration { secs: 10, nanos: 2779700 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57030, tv_nsec: 420110808 }
DBUG tokio_core::reactor notifying a task handle
DBUG tokio_core::reactor loop process - 0 events, Duration { secs: 0, nanos: 54500 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 12900 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57030, tv_nsec: 420184208 }
DBUG tokio_core::reactor "controller-client", set a timeout: 0
DBUG tokio_core::reactor "controller-client", added a timeout: 3
DBUG tokio_core::reactor "controller-client", updating a timeout: 3
DBUG tokio_core::reactor "controller-client", updating a timeout: 0
DBUG tokio_core::reactor "controller-client", scheduling direction for: 0
DBUG tokio_core::reactor "controller-client", blocking
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 3084100 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 64300 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57030, tv_nsec: 423350508 }
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), scheduling direction for: 2
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), blocking
DBUG h2::codec::framed_write ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), send; frame=Frame::Headers(Headers { stream_id: StreamId(11), stream_dep: None, flags: HeadersFlag { end_stream: false, end_headers: true, padded: false, priority: false } })
DBUG tokio_core::reactor loop process - 2 events, Duration { secs: 0, nanos: 3792400 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 12800 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57030, tv_nsec: 427186708 }
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), scheduling direction for: 2
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), blocking
DBUG h2::codec::framed_write ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), send; frame=Frame::Data(Data { stream_id: StreamId(11), flags: DataFlags { end_stream: true }, pad_len: None })
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 3693400 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 13000 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57030, tv_nsec: 431057108 }
DBUG tokio_core::reactor notifying a task handle
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 15300 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 9200 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57030, tv_nsec: 431089508 }
DBUG h2::codec::framed_read ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), received; frame=Frame::Headers(Headers { stream_id: StreamId(11), stream_dep: None, flags: HeadersFlag { end_stream: false, end_headers: true, padded: false, priority: false } })
DBUG h2::codec::framed_read ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), received; frame=Frame::Data(Data { stream_id: StreamId(11), flags: DataFlags, pad_len: None })
DBUG h2::codec::framed_read ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), received; frame=Frame::Headers(Headers { stream_id: StreamId(11), stream_dep: None, flags: HeadersFlag { end_stream: true, end_headers: true, padded: false, priority: false } })
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), scheduling direction for: 2
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), blocking
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 1272200 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 15000 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57030, tv_nsec: 432384408 }
DBUG tokio_core::reactor::timeout_token "controller-client", cancel timeout 3
DBUG tokio_core::reactor "controller-client", cancel a timeout: 3
DBUG tokio_core::reactor "controller-client", updating a timeout: 0
DBUG tokio_core::reactor "controller-client", scheduling direction for: 0
DBUG tokio_core::reactor "controller-client", blocking
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 101900 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 9500 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57030, tv_nsec: 432500508 }
DBUG tokio_core::reactor loop process - 0 events, Duration { secs: 0, nanos: 9800 }
DBUG tokio_core::reactor loop poll - Duration { secs: 65, nanos: 11767500 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57037, tv_nsec: 544646508 }
DBUG tokio_core::reactor notifying a task handle
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 351500 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 91300 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57037, tv_nsec: 545191508 }
DBUG tokio_core::reactor scheduling direction for: 2
DBUG tokio_core::reactor blocking
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 741600 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 12000 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57037, tv_nsec: 546367708 }
DBUG tokio_core::reactor notifying a task handle
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 552200 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 17400 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57037, tv_nsec: 547072608 }
DBUG tokio_core::reactor loop poll - Duration { secs: 9, nanos: 997204200 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 429721508 }
DBUG tokio_core::reactor notifying a task handle
DBUG tokio_core::reactor loop process - 0 events, Duration { secs: 0, nanos: 49700 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 12000 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 429789108 }
DBUG tokio_core::reactor "controller-client", set a timeout: 0
DBUG tokio_core::reactor "controller-client", added a timeout: 3
DBUG tokio_core::reactor "controller-client", updating a timeout: 3
DBUG tokio_core::reactor "controller-client", updating a timeout: 0
DBUG tokio_core::reactor "controller-client", scheduling direction for: 0
DBUG tokio_core::reactor "controller-client", blocking
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 88500 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 7000 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 429889408 }
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), scheduling direction for: 2
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), blocking
DBUG h2::codec::framed_write ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), send; frame=Frame::Headers(Headers { stream_id: StreamId(13), stream_dep: None, flags: HeadersFlag { end_stream: false, end_headers: true, padded: false, priority: false } })
DBUG tokio_core::reactor loop process - 2 events, Duration { secs: 0, nanos: 123400 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 7100 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 430027108 }
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), scheduling direction for: 2
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), blocking
DBUG h2::codec::framed_write ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), send; frame=Frame::Data(Data { stream_id: StreamId(13), flags: DataFlags { end_stream: true }, pad_len: None })
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 47000 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 4319500 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 434473108 }
DBUG tokio_core::reactor loop process - 0 events, Duration { secs: 0, nanos: 22100 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 3044900 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 437545408 }
DBUG tokio_core::reactor notifying a task handle
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 41900 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 65500 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 437660908 }
DBUG h2::codec::framed_read ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), received; frame=Frame::Headers(Headers { stream_id: StreamId(13), stream_dep: None, flags: HeadersFlag { end_stream: false, end_headers: true, padded: false, priority: false } })
DBUG h2::codec::framed_read ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), received; frame=Frame::Data(Data { stream_id: StreamId(13), flags: DataFlags, pad_len: None })
DBUG h2::codec::framed_read ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), received; frame=Frame::Headers(Headers { stream_id: StreamId(13), stream_dep: None, flags: HeadersFlag { end_stream: true, end_headers: true, padded: false, priority: false } })
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), scheduling direction for: 2
DBUG tokio_core::reactor ("controller-client", "HostAndPort { host: DnsName(\"proxy-api.conduit.svc.cluster.local\"), port: 8086 }"), blocking
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 893300 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 849300 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 439437508 }
DBUG tokio_core::reactor::timeout_token "controller-client", cancel timeout 3
DBUG tokio_core::reactor "controller-client", cancel a timeout: 3
DBUG tokio_core::reactor "controller-client", updating a timeout: 0
DBUG tokio_core::reactor "controller-client", scheduling direction for: 0
DBUG tokio_core::reactor "controller-client", blocking
DBUG tokio_core::reactor loop process - 1 events, Duration { secs: 0, nanos: 139100 }
DBUG tokio_core::reactor loop poll - Duration { secs: 0, nanos: 9000 }
DBUG tokio_core::reactor loop time - Instant { tv_sec: 57040, tv_nsec: 439595408 }
DBUG tokio_core::reactor loop process - 0 events, Duration { secs: 0, nanos: 8200 }
@briansmith briansmith added this to the 0.3.1 milestone Mar 6, 2018
@briansmith

If the cause is #155, then let's fix #155. I am working towards that now.

Disable the Grafana update check in grafana.ini:

[analytics]
check_for_updates = false

Maybe we should do this anyway?

siggy added a commit that referenced this issue Mar 7, 2018
Grafana by default calls out to grafana.com to check for updates. As
users of Conduit do not have direct control over updating Grafana,
this update check is not needed.

Disable Grafana's update check via grafana.ini.

This is also a workaround for #155, root cause of #519.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>

siggy commented Mar 7, 2018

Review to disable update check: #521.

@briansmith

@siggy wrote (elsewhere):

@briansmith i believe #519 goes away when #155 is fixed...
that said, perhaps it's a separate issue that the proxy pegs the cpu when it's given an address it's unable to route?

Yes, we need to find out why the proxy has gone haywire in this situation.

@siggy siggy changed the title conduit-proxy in Grafana pod pegging cpu conduit-proxy pegging cpu Mar 7, 2018
@olix0r olix0r added the priority/P0 Release Blocker label Mar 7, 2018

olix0r commented Mar 7, 2018

Marking this P0 to at least develop a working theory of this issue so that we can assess impact.


hawkw commented Mar 7, 2018

@siggy and I found the bug that was causing this issue and fixed it in e5a7607.

I'd like to try to put together a unit test to guard against a regression here, but would be happy to go ahead and open a PR for review.

hawkw added a commit that referenced this issue Mar 8, 2018
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@hawkw hawkw added the review/ready Issue has a reviewable PR label Mar 8, 2018
@hawkw hawkw closed this as completed in #537 Mar 8, 2018
hawkw added a commit that referenced this issue Mar 8, 2018
An infinite loop exists in the TCP proxy, which could be triggered by any raw TCP connection (including HTTPS requests). The connection will be proxied successfully, but instead of closing, it will remain open, and the proxy's CPU usage will remain extremely high indefinitely.

`Duplex::poll` calls `half_in.copy_into()`/`half_out.copy_into()` repeatedly and only completes once _both_ halves have returned `Ready`, so a half that has already shut down and returned ready may be polled again. Because of the `!dst.is_shutdown` guard, intended to prevent the destination from shutting down twice, the function never returns if it is polled again after it has already returned `Async::Ready` once.

I've fixed this by moving the guard against double shutdowns out of the loop, so that the function will return `Async::Ready` again if it is polled after shutting down the destination.

I've also included a unit test guarding against regressions of this bug. The unit test fails against master.

Fixes #519 

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Co-Authored-By: Andrew Seigner <andrew@sig.gy>
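
For illustration, here is a minimal, self-contained sketch of the guard placement described in the fix above. This is not the proxy's actual code: the real implementation uses futures' `Async`/`Poll` and tokio I/O types, and the names here (`HalfDuplex`, `Duplex`, `copy_into`, `remaining`) are simplified stand-ins. It only shows why hoisting the shutdown guard out of the copy loop lets a finished half keep reporting ready:

#[derive(Clone, Copy, Debug, PartialEq)]
enum Poll {
    Ready,
    NotReady,
}

struct HalfDuplex {
    is_shutdown: bool, // true once this direction has finished and shut down its destination
    remaining: usize,  // stand-in for "bytes left to copy in this direction"
}

impl HalfDuplex {
    fn copy_into(&mut self) -> Poll {
        // The fix: check for an already-completed shutdown *before* the copy
        // loop and report Ready immediately. In the buggy version this guard
        // sat inside the loop, so a half that had already shut down had no
        // path back to Ready when polled again.
        if self.is_shutdown {
            return Poll::Ready;
        }
        while self.remaining > 0 {
            self.remaining -= 1; // pretend to copy some bytes
        }
        self.is_shutdown = true; // shut the destination down exactly once
        Poll::Ready
    }
}

struct Duplex {
    half_in: HalfDuplex,
    half_out: HalfDuplex,
}

impl Duplex {
    fn poll(&mut self) -> Poll {
        // The duplex is done only when *both* halves are done, so a half that
        // already returned Ready can be polled again on a later call.
        let in_done = self.half_in.copy_into() == Poll::Ready;
        let out_done = self.half_out.copy_into() == Poll::Ready;
        if in_done && out_done { Poll::Ready } else { Poll::NotReady }
    }
}

fn main() {
    let mut d = Duplex {
        half_in: HalfDuplex { is_shutdown: false, remaining: 3 },
        half_out: HalfDuplex { is_shutdown: false, remaining: 5 },
    };
    assert_eq!(d.poll(), Poll::Ready);
    // With the guard inside the loop this second poll would never complete;
    // with the guard hoisted out it returns Ready again immediately.
    assert_eq!(d.poll(), Poll::Ready);
    println!("duplex stayed Ready when polled after completion");
}

In the real proxy the second poll is exactly what happens once one direction of a raw TCP connection finishes before the other, which is what kept the event loop spinning at full CPU.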
@hawkw hawkw removed the review/ready Issue has a reviewable PR label Mar 8, 2018
hawkw added a commit that referenced this issue Mar 16, 2018
This reverts commit c2a7f29.

We probably don't want to do this, and it didn't actually fix the bug,
which was just #519; those changes weren't integrated into this branch.
hawkw added a commit that referenced this issue Mar 19, 2018
khappucino pushed a commit to Nordstrom/linkerd2 that referenced this issue Mar 5, 2019
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021