data/aws/vpc: Use HTTPS for load balancer health checks #924

wking · 2018-12-16T07:14:09Z

As suggested by @DanyC97 in #923. When we switched to network load balancers in 16dfbb3 (#594), we replaced things like:

resource "aws_elb" "api_internal" {
  ...
  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    target              = "SSL:6443"
    interval            = 5
  }
  ...
}

with:

resource "aws_lb_target_group" "api_internal" {
  ...
  health_check {
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 10
    port                = 6443
    protocol            = "TCP"
  }
}

This resulted in logs like:

[core@ip-10-0-11-88 ~]$ sudo crictl ps
CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
[core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
I1215 20:23:07.088554       1 api.go:54] launching server
I1215 20:23:07.088571       1 api.go:54] launching server
2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
...

when the health check opens a TCP connection (in this case to the machine-config server on 49500) and then hangs up without completing the TLS handshake. Network load balancers (docs here and here) do not have an analog to the classic load balancers' SSL protocol (docs here and here, although see hashicorp/terraform-provider-aws#6866), so we're using HTTPS.

There's some discussion in kubernetes/kubernetes#43784 about the best way to perform unauthenticated liveness checks on the Kubernetes API server that suggests 401s are possible in some configurations. Checking against a recent cluster:

$ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
curl: (60) Peer's Certificate issuer is not recognized.
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.
$ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
HTTP/1.1 200 OK
Cache-Control: no-store
Date: Sun, 16 Dec 2018 06:18:23 GMT
Content-Length: 2
Content-Type: text/plain; charset=utf-8

I don't know if the network load balancer health checks care about certificate validity or not. I guess we'll see how CI testing handles this.

Ignition is only exposed inside the cluster, and checking that from a master node:

[core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
curl: (60) Peer's Certificate issuer is not recognized.
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.
[core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Sun, 16 Dec 2018 06:30:14 GMT
Content-Length: 19

404 page not found

Unfortunately, setting matcher is not allowed for network load balancers (e.g. see here and here). Setting it leads to errors like:

ERROR  * module.vpc.aws_lb_target_group.api_internal: 1 error occurred:
ERROR  * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol
ERROR  status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c

So I've left it unset here, and we'll just hope the 401s don't start happening.

This is a WIP, because we need a reliable endpoint (likely /healthz) from the machine-config server because we can't set matcher to accept the 404.

Fixes #923.

DanyC97 · 2018-12-16T10:13:22Z

@wking many thanks for being so proactive (you beat me - still getting myself up to speed before i can contribute)

I don't know if the network load balancer health checks care about certificate validity or not.

if i remember correctly it doesn't care

vrutkovs · 2018-12-16T10:44:21Z

In 3.11 GCP installs we were installing a haproxy service, which exposed healthz on 8080 and terminated SSL. This is suboptimal, but might be helpful

crawford · 2019-01-04T20:22:56Z

@wking are you comfortable taking a pass at the health endpoint for MCS or should I ask the team to tackle that?

wking · 2019-01-04T21:04:54Z

are you comfortable taking a pass at the health endpoint for MCS or should I ask the team to tackle that?

It's in flight: openshift/machine-config-operator#239

As suggested by Dani Comnea [1]. When we switched to network load balancers in 16dfbb3 (data/aws: use nlbs instead of elbs, 2018-11-01, openshift#594), we replaced things like: resource "aws_elb" "api_internal" { ... health_check { healthy_threshold = 2 unhealthy_threshold = 2 timeout = 3 target = "SSL:6443" interval = 5 } ... } with: resource "aws_lb_target_group" "api_internal" { ... health_check { healthy_threshold = 3 unhealthy_threshold = 3 interval = 10 port = 6443 protocol = "TCP" } } This resulted in logs like [2]: [core@ip-10-0-11-88 ~]$ sudo crictl ps CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT 1bf4870ea6eea registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56 2 minutes ago Running machine-config-server 0 [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea I1215 20:23:07.088210 1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty I1215 20:23:07.088554 1 api.go:54] launching server I1215 20:23:07.088571 1 api.go:54] launching server 2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF 2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF 2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF ... when the health check opens a TCP connection (in this case to the machine-config server on 49500) and then hangs up without completing the TLS handshake. Network load balancers [3,4] do not have an analog to the classic load balancers' SSL protocol [5,6,7], so we're using HTTPS. There's some discussion in [8] about the best way to perform unauthenticated liveness checks on the Kubernetes API server that suggests 401s are possible in some configurations. Checking against a recent cluster: $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz curl: (60) Peer's Certificate issuer is not recognized. More details here: http://curl.haxx.se/docs/sslcerts.html curl performs SSL certificate verification by default, using a "bundle" of Certificate Authority (CA) public keys (CA certs). If the default bundle file isn't adequate, you can specify an alternate file using the --cacert option. If this HTTPS server uses a certificate signed by a CA represented in the bundle, the certificate verification probably failed due to a problem with the certificate (it might be expired, or the name might not match the domain name in the URL). If you'd like to turn off curl's verification of the certificate, use the -k (or --insecure) option. $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz HTTP/1.1 200 OK Cache-Control: no-store Date: Sun, 16 Dec 2018 06:18:23 GMT Content-Length: 2 Content-Type: text/plain; charset=utf-8 ok I don't know if the network load balancer health checks care about certificate validity or not. I guess we'll see how CI testing handles this. Ignition is only exposed inside the cluster, and checking that from a master node: [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/ curl: (60) Peer's Certificate issuer is not recognized. More details here: http://curl.haxx.se/docs/sslcerts.html curl performs SSL certificate verification by default, using a "bundle" of Certificate Authority (CA) public keys (CA certs). If the default bundle file isn't adequate, you can specify an alternate file using the --cacert option. If this HTTPS server uses a certificate signed by a CA represented in the bundle, the certificate verification probably failed due to a problem with the certificate (it might be expired, or the name might not match the domain name in the URL). If you'd like to turn off curl's verification of the certificate, use the -k (or --insecure) option. [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/ HTTP/1.1 404 Not Found Content-Type: text/plain; charset=utf-8 X-Content-Type-Options: nosniff Date: Sun, 16 Dec 2018 06:30:14 GMT Content-Length: 19 404 page not found So we're checking the new /healthz from openshift/machine-config-operator@d0a7ae21 (server: Add /healthz, 2019-01-04, openshift/machine-config-operator#267) instead. Unfortunately, setting matcher [9] is not allowed for network load balancers (e.g. see [10,11]). Setting it leads to errors like: ERROR * module.vpc.aws_lb_target_group.api_internal: 1 error occurred: ERROR * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol ERROR status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c So I've left it unset here, and we'll just hope the 401s don't start happening. [1]: openshift#923 [2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ [3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html [4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol [5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html [6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target [7]: hashicorp/terraform-provider-aws#6866 [8]: kubernetes/kubernetes#43784 [9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher [10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612 [11]: hashicorp/terraform-provider-aws#2708 (comment)

wking · 2019-01-09T23:30:10Z

Rebased onto master and updated to use openshift/machine-config-operator#267's /healthz with 2d5101f -> 069ba26.

crawford · 2019-01-09T23:45:39Z

/lgtm

openshift-ci-robot · 2019-01-09T23:45:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [crawford,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

DanyC97 · 2019-01-10T16:51:08Z

thank you @wking for working on it

openshift-ci-robot requested review from abhinavdahiya and tomassedovic December 16, 2018 07:14

wking mentioned this pull request Dec 16, 2018

pkg/server/api: Add /healthz for load-balancer health checks openshift/machine-config-operator#239

Closed

wking force-pushed the https-load-balancer-health-checks branch from 2d5101f to 069ba26 Compare January 9, 2019 23:28

wking changed the title ~~WIP: data/aws/vpc: Use HTTPS for load balancer health checks~~ data/aws/vpc: Use HTTPS for load balancer health checks Jan 9, 2019

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2019

openshift-ci-robot assigned crawford Jan 9, 2019

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2019

openshift-merge-robot merged commit 86a4db1 into openshift:master Jan 10, 2019

wking deleted the https-load-balancer-health-checks branch January 10, 2019 00:34

wking mentioned this pull request Jan 11, 2019

Multiple machine-config server restarts after 'http: TLS handshake error from 10.0.29.128:17205: EOF' openshift/machine-config-operator#233

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data/aws/vpc: Use HTTPS for load balancer health checks #924

data/aws/vpc: Use HTTPS for load balancer health checks #924

wking commented Dec 16, 2018

DanyC97 commented Dec 16, 2018

vrutkovs commented Dec 16, 2018

crawford commented Jan 4, 2019

wking commented Jan 4, 2019 •

edited

Loading

wking commented Jan 9, 2019

crawford commented Jan 9, 2019

openshift-ci-robot commented Jan 9, 2019

DanyC97 commented Jan 10, 2019

data/aws/vpc: Use HTTPS for load balancer health checks #924

data/aws/vpc: Use HTTPS for load balancer health checks #924

Conversation

wking commented Dec 16, 2018

DanyC97 commented Dec 16, 2018

vrutkovs commented Dec 16, 2018

crawford commented Jan 4, 2019

wking commented Jan 4, 2019 • edited Loading

wking commented Jan 9, 2019

crawford commented Jan 9, 2019

openshift-ci-robot commented Jan 9, 2019

DanyC97 commented Jan 10, 2019

wking commented Jan 4, 2019 •

edited

Loading