Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancements/authentication: detect invalid certificates #980

Merged
merged 1 commit into from
Mar 21, 2022

Conversation

s-urbaniak
Copy link
Contributor

- kube-apiserver and aggregated API endpoints
- oauth-server and its external route
- oauth-server and external identity providers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

custom serving certs for kube-apiserver

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts do you have a place where I can hook into the verification of those? Currently, I think we'd need a separate controller in cluster-kube-apiserver-operator that checks config.openshift.io.APIServer#servingInfo.namedCertificates, parses them, and verifies for invalid certs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @sanchezl worked on that. He can link the code here.

enhancements/authentication/invalid-certs.md Outdated Show resolved Hide resolved
enhancements/authentication/invalid-certs.md Outdated Show resolved Hide resolved

In case of having invalid custom certificates configured for the external OAuth route
we propose to extend the existing `RouterCertsDomainValidationController`.
Formally we propose to add additional logic in `validateRouterCertificates` https://github.com/openshift/cluster-authentication-operator/blob/ff156ab2bdfbdd68b49d76547fea5ec28d9c3639/pkg/controllers/routercerts/controller.go#L129.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is going to end up by what, a degraded condition on an "invalid" certificate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

augmenting https://github.com/openshift/cluster-authentication-operator/blob/d2f6218a6ab2daccbec43f71a495de1c80529ef6/pkg/controllers/customroute/custom_route_controller.go#L55 is a good idea.

However we have to be careful: this concrete logic must not be backported. If users configured legacy certificates in the previous releases we have to block upgrades.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a degraded condition on an "invalid" certificate?

most likely yes, a degraded condition of i.e. authentication-operator will appear, as the connection will fail to verify.

enhancements/authentication/invalid-certs.md Outdated Show resolved Hide resolved

### Non-Goals

Other core workloads are out of scope for this enhancement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we at least know what the other workloads are that are likely impacted by this?

does every component that exposes a TLS metrics endpoint need to do work, for example?

Copy link
Contributor Author

@s-urbaniak s-urbaniak Dec 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware that all components expose a TLS metric like this. Our finding refers concretely to a contribution that was made upstream to Kubernetes.

An initial search on the openshift org reveals quite a few usages: https://github.com/search?q=org%3Aopenshift+x509ignoreCN%3D0&type=code

maybe @eparis has more insight as I believe he had touch points with the GODEBUG=x509ignoreCN=0 setting.


This enhancment therefore proposes:
- a way to detect invalid certificates in kube-apiserver and oauth-server
- a way to prevent upgrades from OpenShift 4.9 to OpenShift 4.10 in the face of invalid certificates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we planning to backport an Upgradeable=False guard for this, or rely on the alert, or start with the alert and maybe follow up with Upgradeable=False if we see a lot of alerts, or...?

Copy link
Contributor Author

@s-urbaniak s-urbaniak Dec 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking the suggestion is to backport an Upgradeable=False guard for this. An alert would be feasible if, after the upgrade, the cluster would be functional. However, in the presence of invalid certificates, a cluster will simply be dysfunctional after the upgrade to 4.10.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a way to prevent upgrades from OpenShift 4.9 to OpenShift 4.10 in the face of invalid certificates
For cases where the platform is a client, this is probably valid. For cases where the platform is a server, but not a client, I think it's too strong to block upgrades. Can you separate the cases?

Copy link
Contributor Author

@s-urbaniak s-urbaniak Dec 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think all the cases enumerated here are platform clients:

  • custom serving certificates for kube-apiserver: openshift workloads invoking api calls
  • custom API webhooks: kube-apiserver being the client
  • custom aggregated API endpoints: kube-apiserver being the client
  • custom certificates for route endpoints: openshift oauth-proxy workloads being the client
  • certificates of external auth identity providers: oauth-server being the client

@s-urbaniak
Copy link
Contributor Author

Also submitted https://bugzilla.redhat.com/show_bug.cgi?id=2031839 to track urgency with respect to 4.10.

x509: certificate relies on legacy Common Name field, use SANs instead
```

Verification of server certificates is executed during TLS client handshakes,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably mean during "TLS Handshake procedure" or when a client verified "Server Hello".

Copy link

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit, rest lgtm

In Kubernetes detection of invalid certificates has been added as part of https://github.com/kubernetes/kubernetes/pull/95396.
The introduced Prometheus `apiserver_webhooks_x509_missing_san_total` and `apiserver_kube_aggregator_x509_missing_san_total` metrics provides the means to detect invalid certificates against API webhooks and aggregated API endpoints.

To detect invalid certificates this enhancement proposes to add a new controller in `cluster-kube-apiserver` that:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cluster-kube-apiserver-operator

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2021

For oauth-server this enhancement proposes the introduction of the following new metric:

- `openshift_auth_external_x509_missing_san_total`: a metric capturing the count of invalid certificates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What clients of this external server do we have?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What clients of this external server do we have?
If it is just the operator, could the operator build a custom verifier for the check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is just the operator, could the operator build a custom verifier for the check?

The client, in this case, is oauth-server, not authentication-operator.

#### kube-apiserver

In Kubernetes detection of invalid certificates has been added as part of https://github.com/kubernetes/kubernetes/pull/95396.
The introduced Prometheus `apiserver_webhooks_x509_missing_san_total` and `apiserver_kube_aggregator_x509_missing_san_total` metrics provides the means to detect invalid certificates against API webhooks and aggregated API endpoints.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alerting on this as well seems appropriate


For oauth-server this enhancement proposes the introduction of the following new metric:

- `openshift_auth_external_x509_missing_san_total`: a metric capturing the count of invalid certificates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding the metric for an alert seems valid regardless

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, good idea.

- custom API webhooks
- custom aggregated API endpoints
- custom certificates for route endpoints
- certificates of external auth identity providers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add certificates securing infrastructure API as a target case here?

For example, an OpenStack cloud exposing its API endpoints with a custom certificate. We have several platform-specific touchpoints (eg cluster-api-provider-openstack) that would need to expose the metric, if I understand this enhancement correctly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd consider this as a separate enhancement proposal or separate bugzilla. This proposal was not meant to be catch-all to focus on scope for the auth team.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +169 to +176
### Upgrade / Downgrade Strategy

The above proposed changes **must** be backported to OpenShift 4.9.
Otherwise, a detection of invalid certificates will not be possible before an upgrade to OpenShift 4.10.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be interested in what happens in future releases.

Scenario: a new 4.11 feature triggers calls to a new HTTPS endpoint. The feature, that doesn't exist in 4.10, is active in 4.11 independently of user intervention. The new HTTPS endpoint is served with what you defined as "invalid certificate".

In this scenario, upgrading from 4.10 to 4.11 potentially brings the cluster to an unstable state.

In general terms: for each new HTTPS endpoint we call in 4.y, do we commit on adding an HTTPS-cert check in 4.y-1.z? Do we commit to doing that until a specific date?

For context, Chrome has dropped support for CN in v58 in April 2017.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting from 4.10 we're not having a go runtime any more that is capable to accept legacy cert based endpoints in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed to me that library-go's IsHostnameError was specifically designed for checking invalid certificates under Go v1.17+. What am I missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, IsHostnameError will only work with <=1.16.

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2022
@openshift-bot
Copy link

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 17, 2022
@openshift-bot
Copy link

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Feb 24, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 24, 2022

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sdodson
Copy link
Member

sdodson commented Feb 24, 2022

/reopen
It's my understanding that this has already been implemented. Can we do a quick review to see where this diverges from implementation and get this merged?

@openshift-ci openshift-ci bot reopened this Feb 24, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 24, 2022

@sdodson: Reopened this PR.

In response to this:

/reopen
It's my understanding that this has already been implemented. Can we do a quick review to see where this diverges from implementation and get this merged?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Mar 3, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 3, 2022

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@s-urbaniak
Copy link
Contributor Author

/reopen

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 4, 2022

@s-urbaniak: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot reopened this Mar 4, 2022
@pierreprinetti
Copy link
Member

/remove-lifecycle rotten

I am supporting @s-urbaniak in finalising this proposal.

I have addressed what I believe are the outstanding comments:

  • @slaskawi please verify the response to your comment (TLS Handshake procedure vs TSL client handshakes)
  • @soltysh please verify the response to your comment (cluster-kube-apiserver-operator vs cluster-kube-apiserver)
  • @deads2k I have added "alterting" as a non-goal, for the purpose of finalising the enhancement focusing on the minimum viable logic useful to prevent catastrophic failures on upgrade. WDYT?

I will squash before merge as soon as we converge to consensus.

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 4, 2022
@pierreprinetti
Copy link
Member

@slaskawi @soltysh @deads2k @s-urbaniak
Based on the absence of angry negative feedback, I am going to squash so that the PR is ready to merge.

Co-authored-by: Sergiusz Urbaniak <sergiusz.urbaniak@gmail.com>
Co-authored-by: Pierre Prinetti <pierreprinetti@redhat.com>
@pierreprinetti
Copy link
Member

Minor fix: correcting the "created" and "updated" dates at the top of the document.

@pierreprinetti
Copy link
Member

I believe this PR is ready for you to consider LGTM'ing, @mfojtik.

@mfojtik
Copy link
Contributor

mfojtik commented Mar 21, 2022

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 21, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 21, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 21, 2022

@s-urbaniak: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 161e330 into openshift:master Mar 21, 2022
@pierreprinetti pierreprinetti deleted the invalid-certs branch March 21, 2022 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.