Change the LB health check from TCP to HTTPS to avoid TLS handshake errors #923

Closed
DanyC97 opened this issue Dec 15, 2018 · 0 comments

DanyC97 (Contributor) commented Dec 15, 2018

As mentioned here, there is a lot of noise in the logs around TLS handshake errors, and it is coming from the LB health check.

Would it be good to move to HTTPS and check /healthz? That is what we had/have in openshift-ansible anyway.

wking added a commit to wking/openshift-installer that referenced this issue Dec 16, 2018
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.
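
Concretely, the HTTPS variant of the health check looks something like
this (a sketch, not the exact landed diff; path is the standard
aws_lb_target_group health_check argument):

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      # The target group itself stays TCP; only the health check
      # speaks HTTPS and probes a path.
      protocol            = "HTTPS"
      path                = "/healthz"
    }
  }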

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server.  For
now, I'm assuming that both 200 and 401 responses to /healthz requests
indicate a functional server, and we can evaluate other response
status codes as necessary.  Checking against a recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

So I've allowed 200, 401, and 404 there just to be generous.
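
Expressed via the matcher argument, that would look something like the
following (illustrative only; as the later commit message in this
thread shows, NLB target groups reject a custom matcher, so this does
not survive):

  health_check {
    ...
    protocol = "HTTPS"
    # Hypothetical: accept every status observed above.  NLBs turn
    # out to reject custom matchers, so this is dropped later.
    matcher  = "200,401,404"
  }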

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
wking added a commit to wking/openshift-installer that referenced this issue Dec 16, 2018
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server.  For
now, I'm assuming that both 200 and 401 responses to /healthz requests
indicate a functional server, and we can evaluate other response
status codes as necessary.  Checking against a recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

Unfortunately, setting matcher [9] is not allowed for network load
balancers (e.g. see [10,11]).  Setting it leads to errors like:

  ERROR  * module.vpc.aws_lb_target_group.api_internal: 1 error occurred:
  ERROR  * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol
  ERROR  status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c

So I've left it unset here, and we'll just hope the 401s don't start
happening.
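
With matcher unset, the health check falls back to the load balancer's
defaults; AWS documents network load balancer HTTP/HTTPS health checks
as treating any 200-399 response as healthy [3], which is why a 401
would flip targets to unhealthy.  The shape that works is just:

  health_check {
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 10
    port                = 6443
    protocol            = "HTTPS"
    path                = "/healthz"
    # no matcher: the NLB default success codes (200-399) apply
  }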

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
[9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher
[10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612
[11]: hashicorp/terraform-provider-aws#2708 (comment)
wking added a commit to wking/openshift-installer that referenced this issue Jan 9, 2019
As suggested by Dani Comnea [1].  When we switched to network load
balancers in 16dfbb3 (data/aws: use nlbs instead of elbs,
2018-11-01, openshift#594), we replaced things like:

  resource "aws_elb" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 2
      unhealthy_threshold = 2
      timeout             = 3
      target              = "SSL:6443"
      interval            = 5
    }
    ...
  }

with:

  resource "aws_lb_target_group" "api_internal" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 6443
      protocol            = "TCP"
    }
  }

This resulted in logs like [2]:

  [core@ip-10-0-11-88 ~]$ sudo crictl ps
  CONTAINER ID        IMAGE                                                                                                                                           CREATED             STATE               NAME                    ATTEMPT
  1bf4870ea6eea       registry.svc.ci.openshift.org/openshift/origin-v4.0-2018-12-15-160933@sha256:97eac256dde260e8bee9a5948efce5edb879dc6cb522a0352567010285378a56   2 minutes ago       Running             machine-config-server   0
  [core@ip-10-0-11-88 ~]$ sudo crictl logs 1bf4870ea6eea
  I1215 20:23:07.088210       1 bootstrap.go:37] Version: 3.11.0-356-gb7ffe0c7-dirty
  I1215 20:23:07.088554       1 api.go:54] launching server
  I1215 20:23:07.088571       1 api.go:54] launching server
  2018/12/15 20:24:17 http: TLS handshake error from 10.0.20.86:28372: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.20.86:38438: EOF
  2018/12/15 20:24:18 http: TLS handshake error from 10.0.47.69:26320: EOF
  ...

when the health check opens a TCP connection (in this case to the
machine-config server on 49500) and then hangs up without completing
the TLS handshake.  Network load balancers [3,4] do not have an analog
to the classic load balancers' SSL protocol [5,6,7], so we're using
HTTPS.

There's some discussion in [8] about the best way to perform
unauthenticated liveness checks on the Kubernetes API server that
suggests 401s are possible in some configurations.  Checking against a
recent cluster:

  $ curl -i https://wking-api.devcluster.openshift.com:6443/healthz
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  $ curl -ik https://wking-api.devcluster.openshift.com:6443/healthz
  HTTP/1.1 200 OK
  Cache-Control: no-store
  Date: Sun, 16 Dec 2018 06:18:23 GMT
  Content-Length: 2
  Content-Type: text/plain; charset=utf-8

  ok

I don't know if the network load balancer health checks care about
certificate validity or not.  I guess we'll see how CI testing handles
this.

Ignition is only exposed inside the cluster, and checking that from a
master node:

  [core@ip-10-0-26-134 ~]$ curl -i https://wking-api.devcluster.openshift.com:49500/
  curl: (60) Peer's Certificate issuer is not recognized.
  More details here: http://curl.haxx.se/docs/sslcerts.html

  curl performs SSL certificate verification by default, using a "bundle"
   of Certificate Authority (CA) public keys (CA certs). If the default
   bundle file isn't adequate, you can specify an alternate file
   using the --cacert option.
  If this HTTPS server uses a certificate signed by a CA represented in
   the bundle, the certificate verification probably failed due to a
   problem with the certificate (it might be expired, or the name might
   not match the domain name in the URL).
  If you'd like to turn off curl's verification of the certificate, use
   the -k (or --insecure) option.
  [core@ip-10-0-26-134 ~]$ curl -ik https://wking-api.devcluster.openshift.com:49500/
  HTTP/1.1 404 Not Found
  Content-Type: text/plain; charset=utf-8
  X-Content-Type-Options: nosniff
  Date: Sun, 16 Dec 2018 06:30:14 GMT
  Content-Length: 19

  404 page not found

So we're checking the new /healthz from
openshift/machine-config-operator@d0a7ae21 (server: Add /healthz,
2019-01-04, openshift/machine-config-operator#267) instead.
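
For the machine-config-server target group that means something like
the following (a sketch; the resource name is illustrative, not
necessarily the one in the repository):

  resource "aws_lb_target_group" "services" {
    ...
    health_check {
      healthy_threshold   = 3
      unhealthy_threshold = 3
      interval            = 10
      port                = 49500
      protocol            = "HTTPS"
      # /healthz exists as of openshift/machine-config-operator#267
      path                = "/healthz"
    }
  }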

Unfortunately, setting matcher [9] is not allowed for network load
balancers (e.g. see [10,11]).  Setting it leads to errors like:

  ERROR  * module.vpc.aws_lb_target_group.api_internal: 1 error occurred:
  ERROR  * aws_lb_target_group.api_internal: Error creating LB Target Group: InvalidConfigurationRequest: Custom health check matchers are not supported for health checks for target groups with the TCP protocol
  ERROR  status code: 400, request id: 25a53d63-00fe-11e9-80c5-59885e191c9c

So I've left it unset here, and we'll just hope the 401s don't start
happening.

[1]: openshift#923
[2]: https://groups.google.com/d/msg/openshift-4-dev-preview/Jmt6AK0EJR4/Ed3W7yZyBQAJ
[3]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
[4]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#protocol
[5]: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-healthchecks.html
[6]: https://www.terraform.io/docs/providers/aws/r/elb.html#target
[7]: hashicorp/terraform-provider-aws#6866
[8]: kubernetes/kubernetes#43784
[9]: https://www.terraform.io/docs/providers/aws/r/lb_target_group.html#matcher
[10]: https://github.com/terraform-providers/terraform-provider-aws/pull/2906/files#diff-375aea487c27a6ada86edfd817ba2401R612
[11]: hashicorp/terraform-provider-aws#2708 (comment)