Commit 4772fdf

Enhance PKI configuration proposal with improved metric coverage, clarified upgrade scenarios, and refined operational guidance.
1 parent d72d81d commit 4772fdf

1 file changed

+177
-77
lines changed

enhancements/security/internal-pki-config.md

Lines changed: 177 additions & 77 deletions
@@ -68,7 +68,7 @@ Currently, OpenShift provides no mechanism to configure these parameters for int
6868
### Non-Goals
6969

7070
- Modifying certificate lifetimes or rotation schedules (this is handled by existing mechanisms)
71-
- Supporting external CA integration or certificate injection (this is covered by existing user-provided certificate features)
71+
- Supporting external CA integration or certificate injection (this is covered by existing user-provided certificate features such as cert-manager and custom CA bundles)
7272
- Automatic rotation of existing certificates to new cryptographic parameters (rotation happens on natural certificate expiry or forced rotation events)
7373
- Supporting algorithms beyond RSA and ECDSA in the initial implementation (e.g., Ed25519, RSA-PSS)
7474
- Configuring signature algorithms separately from key algorithms (signature algorithm is derived from key type)
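The last non-goal (signature algorithm derived from key type) can be sketched as a simple lookup. This is a hypothetical illustration rather than the proposal's implementation: the function name is invented, and the curve-to-hash pairing (P-256 with SHA-256, P-384 with SHA-384, P-521 with SHA-512) is the conventional one and is assumed here.

```python
# Hypothetical sketch: derive the signature algorithm from the key parameters.
# The ECDSA curve-to-hash pairing below is an assumption (conventional pairing).
RSA_SIGNATURE = "RSA-SHA256"
ECDSA_SIGNATURE_BY_CURVE = {
    "P256": "ECDSA-SHA256",
    "P384": "ECDSA-SHA384",
    "P521": "ECDSA-SHA512",
}


def derive_signature_algorithm(algorithm, curve=None):
    """Return the signature algorithm implied by a key algorithm and curve."""
    if algorithm == "RSA":
        return RSA_SIGNATURE
    if algorithm == "ECDSA":
        if curve not in ECDSA_SIGNATURE_BY_CURVE:
            raise ValueError(f"unsupported ECDSA curve: {curve!r}")
        return ECDSA_SIGNATURE_BY_CURVE[curve]
    raise ValueError(f"unsupported key algorithm: {algorithm!r}")
```

Because the mapping is total over the supported key types, no separate signature-algorithm field is needed in the configuration.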
@@ -87,7 +87,7 @@ At a high level, the changes include:
8787
2. **Feature Gate**: `ConfigurablePKI` to enable the functionality (TechPreviewNoUpgrade during development, enabled by default at GA)
8888
3. **Installer Integration**: Limited Day-1 configuration support for signer certificate cryptographic parameters
8989
4. **Operator Updates**: Modifications to certificate-generating operators to watch and consume the PKI configuration independently
90-
5. **Certificate Rotation**: Integration with existing rotation mechanisms to apply new parameters
90+
5. **Certificate Rotation**: Operators apply PKI configuration parameters during existing certificate rotation cycles (no changes to rotation mechanisms themselves)
9191
6. **Metrics and Observability**: Expose metrics for certificate generation events and configuration compliance
9292

9393
Note: There is **no central PKI controller**. Each certificate-generating operator watches the PKI resource directly and applies configuration to its own certificates.
@@ -186,9 +186,6 @@ spec:
186186
```promql
187187
# Verify certificates are being generated with correct parameters
188188
openshift_pki_certificate_generated_total{algorithm="ECDSA",curve="P384",category="ServingCertificate"}
189-
190-
# Check for any generation failures
191-
rate(openshift_pki_certificate_generation_errors_total[5m])
192189
```
193190

194191
#### Forced Certificate Rotation with New Parameters
@@ -200,13 +197,13 @@ Use pre-existing workflow to force certificate rotation using the current PKI co
200197

201198
1. A cluster running OpenShift 4.N is upgraded to 4.N+1 which includes this feature.
202199

203-
2. The upgrade will create a `PKI` resource with an empty spec.
200+
2. The upgrade installs the PKI CRD (API definition) and creates a PKI resource instance with an empty spec.
204201

205-
2. While no `PKI` resource exists (i.e. during the upgrade), or `PKI.spec` is empty (i.e. after the upgrade), all operators continue using their existing hardcoded defaults (typically RSA 2048).
202+
3. With an empty PKI spec, all operators continue using their existing hardcoded defaults (typically RSA 2048).
206203

207-
3. The cluster administrator can update the `PKI` resource post-upgrade, which will apply on the next certificate rotation cycle.
204+
4. The cluster administrator can update the PKI resource post-upgrade to configure cryptographic parameters, which will apply on the next certificate rotation cycle.
208205

209-
4. Existing certificates continue to function until their natural rotation.
206+
5. Existing certificates continue to function until their natural rotation.
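The fallback behavior in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the spec is modeled as a plain dictionary keyed by certificate category, and the field names (`algorithm`, `keySize`, `curve`) and the single `HARDCODED_DEFAULT` are hypothetical; real operators each carry their own defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeyParams:
    algorithm: str
    key_size: Optional[int] = None  # RSA only
    curve: Optional[str] = None     # ECDSA only


# Assumption: one operator's existing hardcoded default ("typically RSA 2048").
HARDCODED_DEFAULT = KeyParams(algorithm="RSA", key_size=2048)


def resolve_key_params(pki_spec, category):
    """Pick parameters for a category, falling back when the spec is empty."""
    if not pki_spec:  # missing resource or empty spec
        return HARDCODED_DEFAULT
    entry = pki_spec.get(category)
    if not entry:     # nothing configured for this category
        return HARDCODED_DEFAULT
    return KeyParams(entry["algorithm"], entry.get("keySize"), entry.get("curve"))
```

With this shape, an upgraded cluster whose PKI spec is empty keeps generating certificates exactly as before until an administrator fills in the spec.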
210207

211208
### API Extensions
212209

@@ -656,9 +653,102 @@ The choice of algorithm and key size has performance implications:
656653

657654
These tradeoffs will be documented in user-facing documentation to help administrators make informed choices.
658655

656+
#### Metrics and Observability
657+
658+
Each certificate-generating component (operators and installer) will expose Prometheus metrics about certificate generation events and properties.
659+
660+
**Metric Exposure Patterns:**
661+
662+
1. **Long-running components** (operators): Expose metrics directly via existing Prometheus endpoints
663+
2. **Short-lived components** (installer): Write metrics to node-exporter textfile collector at `/var/lib/node_exporter/textfile_collector/openshift_pki_installer.prom`
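For the short-lived case, a minimal sketch of writing a textfile-collector file is shown here. It assumes the Prometheus text exposition format and an atomic write via a temporary file and rename so the collector never reads a partial file; the helper names are hypothetical and the installer's real implementation may differ.

```python
import os
import tempfile


def escape_label_value(value):
    """Escape backslash, double quote, and newline for the text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_certificate_info(certificates):
    """Render one gauge sample per certificate (each a dict of labels)."""
    lines = [
        "# HELP openshift_pki_certificate_info Certificate inventory.",
        "# TYPE openshift_pki_certificate_info gauge",
    ]
    for labels in certificates:
        rendered = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"openshift_pki_certificate_info{{{rendered}}} 1")
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write to a temp file in the same directory, then rename into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The temporary file uses a `.tmp` suffix so that a collector matching only `*.prom` files skips it until the rename completes.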
664+
665+
**Metrics:**
666+
667+
1. **Certificate Information (Gauge):**
668+
```promql
669+
openshift_pki_certificate_info{
670+
certificate_name="kube-apiserver-to-kubelet-signer",
671+
category="SignerCertificate",
672+
algorithm="RSA",
673+
key_size="4096", # only present for RSA
674+
curve="", # only present for ECDSA
675+
component="kube-apiserver-operator",
676+
namespace="openshift-kube-apiserver-operator"
677+
} 1
678+
```
679+
- **Purpose:** Info metric providing certificate inventory with all cryptographic properties
680+
- **Cardinality:** ~50-100 certificates cluster-wide
681+
- **Use case:** Query which certificates exist and their parameters
682+
683+
2. **Certificate Generation Events (Counter):**
684+
```promql
685+
openshift_pki_certificate_generated_total{
686+
certificate_name="kube-apiserver-to-kubelet-signer",
687+
category="SignerCertificate",
688+
algorithm="RSA",
689+
key_size="4096",
690+
curve="",
691+
component="kube-apiserver-operator",
692+
result="success" # or "failure"
693+
} 42
694+
```
695+
- **Purpose:** Track certificate generation and rotation events
696+
- **Use case:** Monitor rate of certificate generation, detect rotation patterns
697+
698+
3. **Certificate Generation Duration (Histogram):**
699+
```promql
700+
openshift_pki_certificate_generation_duration_seconds{
701+
certificate_name="kube-apiserver-to-kubelet-signer",
702+
algorithm="RSA",
703+
key_size="4096",
704+
curve="",
705+
component="kube-apiserver-operator"
706+
}
707+
```
708+
- **Purpose:** Track performance of certificate generation operations
709+
- **Buckets:** [0.01, 0.1, 0.5, 1, 2, 5, 10] seconds
710+
- **Use case:** Identify performance issues with RSA 4096 vs ECDSA, detect slow generation
711+
712+
**Example Queries:**
713+
714+
- Find all RSA certificates not using 4096-bit keys:
715+
```promql
716+
openshift_pki_certificate_info{algorithm="RSA",key_size!="4096"}
717+
```
718+
719+
- RSA 4096 generation performance (95th percentile):
720+
```promql
721+
histogram_quantile(0.95,
722+
sum by (le) (rate(openshift_pki_certificate_generation_duration_seconds_bucket{algorithm="RSA",key_size="4096"}[5m]))
723+
)
724+
```
725+
726+
- Certificate inventory by algorithm:
727+
```promql
728+
count by (algorithm) (openshift_pki_certificate_info)
729+
```
730+
731+
**Cardinality Estimate:**
732+
733+
Assuming ~50 well-known certificates cluster-wide:
734+
- `openshift_pki_certificate_info`: ~50 time series
735+
- `openshift_pki_certificate_generated_total`: ~100 time series (success/failure)
736+
- `openshift_pki_certificate_generation_duration_seconds`: ~50 histograms × 10 series each (7 finite buckets + `+Inf` bucket + `_sum` + `_count`) = 500 time series
737+
738+
**Total estimated cardinality: ~650 time series** - manageable overhead for cluster-wide certificate monitoring.
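The estimate above can be reproduced with a short calculation. This sketch assumes the standard Prometheus histogram exposition (one series per finite bucket, plus the `+Inf` bucket, `_sum`, and `_count`) and one series per `result` value for the counter; the function name is illustrative.

```python
# Finite bucket boundaries listed for the duration histogram.
FINITE_BUCKETS = [0.01, 0.1, 0.5, 1, 2, 5, 10]


def estimate_series(num_certificates, results=("success", "failure")):
    """Estimate time series per metric for a given certificate count."""
    info = num_certificates
    generated = num_certificates * len(results)
    # Each histogram exposes its finite buckets plus +Inf, _sum, and _count.
    series_per_histogram = len(FINITE_BUCKETS) + 1 + 2
    duration = num_certificates * series_per_histogram
    return {
        "info": info,
        "generated": generated,
        "duration": duration,
        "total": info + generated + duration,
    }


print(estimate_series(50))
```

Doubling the certificate count to 100 doubles every term, so the total scales linearly with the number of tracked certificates.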
739+
659740
### Risks and Mitigations
660741

661-
**Risk: Invalid configuration causes certificate generation failures**
742+
**Risk: Invalid PKI configuration in install-config.yaml prevents cluster installation**
743+
744+
*Mitigation:*
745+
- Installer validates PKI configuration schema before starting cluster creation
746+
- Clear error messages indicate which PKI parameters are invalid
747+
- Installation fails fast with actionable error message before any resources are created
748+
- Documentation provides validated examples for common configurations
749+
- Install-config validation can be tested with `openshift-install create manifests` without committing to full installation
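A fail-fast validation step of the kind described in this mitigation might look like the following sketch. The allowed values mirror the RSA key sizes and ECDSA curves named elsewhere in the proposal; the field names (`algorithm`, `keySize`, `curve`) and the function itself are hypothetical and do not represent the installer's actual schema.

```python
ALLOWED_RSA_KEY_SIZES = {2048, 3072, 4096}
ALLOWED_ECDSA_CURVES = {"P256", "P384", "P521"}


def validate_key_config(cfg):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    algorithm = cfg.get("algorithm")
    key_size = cfg.get("keySize")
    curve = cfg.get("curve")
    if algorithm == "RSA":
        if key_size not in ALLOWED_RSA_KEY_SIZES:
            allowed = sorted(ALLOWED_RSA_KEY_SIZES)
            errors.append(f"keySize must be one of {allowed} for RSA, got {key_size!r}")
        if curve:
            errors.append("curve must not be set for RSA")
    elif algorithm == "ECDSA":
        if curve not in ALLOWED_ECDSA_CURVES:
            allowed = sorted(ALLOWED_ECDSA_CURVES)
            errors.append(f"curve must be one of {allowed} for ECDSA, got {curve!r}")
        if key_size is not None:
            errors.append("keySize must not be set for ECDSA")
    else:
        errors.append(f"algorithm must be RSA or ECDSA, got {algorithm!r}")
    return errors
```

Collecting all errors before failing lets the installer report every invalid parameter in a single message instead of one per attempt.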
750+
751+
**Risk: Invalid configuration causes certificate generation failures (Day-2)**
662752

663753
*Mitigation:*
664754
- Comprehensive CEL validation rules prevent most invalid configurations at admission time
@@ -750,28 +840,51 @@ Automatically regenerate all certificates when PKI configuration changes.
750840
- Normal rotation will apply changes naturally over time
751841
- Forced rotation annotation provides escape hatch if immediate re-keying needed
752842

753-
## Open Questions
843+
### Alternative 5: Additional Certificate Metrics
844+
845+
Include metrics for certificate expiry, not-before timestamps, generation errors, and configuration compliance.
846+
847+
**Considered metrics:**
848+
- `openshift_pki_certificate_expiry_timestamp_seconds` - Certificate expiration timestamp
849+
- `openshift_pki_certificate_not_before_timestamp_seconds` - Certificate validity start timestamp
850+
- `openshift_pki_certificate_generation_errors_total` - Counter for certificate generation failures
851+
- `openshift_pki_certificate_config_compliant` - Whether certificate matches current PKI configuration
754852

755-
> 1. Should we support configuration of signature algorithms separately from key algorithms?
756-
>
757-
> *Resolution*: No, signature algorithm will be derived from key algorithm (RSA key → RSA-SHA256, ECDSA key → ECDSA-SHA256/384/512 based on curve). This is standard practice and reduces configuration complexity.
853+
**Not selected because:**
758854

759-
> 2. Should we provide a way to query which certificates exist in the cluster and their current parameters?
760-
>
761-
> *Resolution*: Yes, through metrics. Operators will expose metrics showing certificate names, algorithms, key sizes/curves, and expiry. A future enhancement could add a discovery API.
855+
1. **Certificate expiry and not-before metrics:**
856+
- Certificate expiry monitoring will be handled in a separate enhancement
857+
- Existing certificate monitoring solutions already track expiry
858+
- Keeping scope focused on PKI configuration feature
859+
860+
2. **Certificate generation errors metric:**
861+
- Certificate generation code is straightforward with minimal error paths
862+
- Errors are primarily programmatic or I/O-related (out of disk, permission errors)
863+
- Automatic retries handle transient failures
864+
- The `openshift_pki_certificate_generated_total` metric already tracks success/failure via the `result` label
865+
- Detailed error categorization adds complexity without significant operational value
866+
867+
3. **Configuration compliance metric:**
868+
- No central component to implement compliance checking
869+
- Each operator would need to duplicate config resolution logic (for generation AND compliance)
870+
- Ambiguity about "expected" values when PKI config changes after certificate generation
871+
- Certificate might be old but was compliant when generated
872+
- Adds significant implementation complexity to every operator
873+
- Users can query compliance via PromQL using the `openshift_pki_certificate_info` metric:
874+
```promql
875+
# Find RSA certificates not using 4096-bit keys
876+
openshift_pki_certificate_info{algorithm="RSA",key_size!="4096"}
877+
```
878+
- Compliance checking could be addressed in a future enhancement with a dedicated compliance-checker component if needed
762879

763-
> 3. Should we support gradual rollout of new parameters (e.g., blue-green rotation)?
764-
>
765-
> *Resolution*: Not in initial implementation. Certificates naturally rotate gradually based on their expiry times. This provides inherent gradual rollout.
880+
## Open Questions
766881

767-
> 4. How do we handle certificates that are generated by components we don't control (e.g., upstream Kubernetes components)?
768-
>
769-
> *Resolution*: This enhancement only covers certificates generated by OpenShift operators. Upstream components' certificates remain at their defaults. Future enhancements could extend coverage.
882+
None at this time.
770883

771884
## Test Plan
772885

773886
**Unit Tests:**
774-
- PKI API validation (CRD webhooks)
887+
- PKI API validation (CEL validation rules)
775888
- Configuration resolution logic (precedence rules)
776889
- Certificate generation with different algorithms and parameters
777890
- Upgrade path (empty config → defaults)
@@ -795,7 +908,6 @@ Automatically regenerate all certificates when PKI configuration changes.
795908

796909
**Performance Tests:**
797910
- Measure certificate generation time for different algorithms/sizes
798-
- Measure impact on cluster upgrade time (certificate rotation during upgrade)
799911
- Validate that ECDSA provides expected performance improvements for TLS handshakes
800912

801913
**Compatibility Tests:**
@@ -864,12 +976,13 @@ This enhancement does not deprecate or remove any existing features. It adds new
864976

865977
When upgrading from a version without this feature to a version with it:
866978

867-
1. The PKI CRD is created during upgrade
868-
2. If no PKI resource exists (first upgrade), operators use their existing hardcoded defaults
869-
3. Existing certificates continue to function unchanged
870-
4. Certificate rotation uses existing defaults until a PKI resource is created
871-
5. Administrators can create a PKI resource post-upgrade
872-
6. New parameters apply on the next rotation cycle after PKI resource is created
979+
1. The PKI CRD (API definition) is installed during upgrade
980+
2. An empty PKI resource instance (with empty spec) is automatically created
981+
3. With an empty spec, operators use their existing hardcoded defaults (typically RSA 2048)
982+
4. Existing certificates continue to function unchanged
983+
5. Certificate rotation uses existing defaults until the PKI resource is updated
984+
6. Administrators can update the PKI resource post-upgrade to configure cryptographic parameters
985+
7. New parameters apply on the next rotation cycle after PKI resource is updated
873986

874987
This approach ensures zero disruption during upgrade and preserves backward compatibility.
875988

@@ -888,44 +1001,28 @@ No manual intervention is required for downgrade. Certificates generated with no
8881001
**Version Skew:**
8891002

8901003
During rolling upgrades, different operator versions will coexist:
891-
- Old operator versions ignore the PKI resource
892-
- New operator versions honor the PKI resource
893-
- Certificates generated during upgrade use parameters based on operator version
894-
- This is safe because certificate rotation is gradual and asynchronous
895-
- Mixed algorithms (RSA and ECDSA) are explicitly supported
1004+
- An empty PKI resource instance (with empty spec) is automatically created during upgrade
1005+
- Old operator versions don't know about PKI configuration and use hardcoded defaults
1006+
- New operator versions check for PKI resource, find it has an empty spec, and use hardcoded defaults
1007+
- All operators use the same default parameters (typically RSA 2048), ensuring consistency
1008+
- Administrators can update the PKI resource after upgrade completes to configure parameters
1009+
- Certificate rotation is gradual and asynchronous
1010+
- Mixed algorithms (RSA and ECDSA) are explicitly supported when intentionally configured
8961011

8971012
## Version Skew Strategy
8981013

899-
**Control Plane Skew:**
900-
901-
During control plane upgrades, different kube-apiserver instances may be running different versions:
902-
- Old kube-apiserver: Continues serving with existing certificates
903-
- New kube-apiserver: May rotate certificates using PKI configuration
904-
- Both can serve simultaneously (certificate verification doesn't change)
905-
- Clients validate certificates based on CA trust, not algorithm/size
906-
907-
**Operator Skew:**
908-
909-
Different operators update at different times during upgrade:
910-
- Some operators support PKI configuration, others don't yet
911-
- Each operator handles its own certificates independently
912-
- No coordination required between operators
913-
- Cluster continues to function with mixed certificate parameters
1014+
Version skew is not a concern for this feature because:
9141015

915-
**Kubelet Skew:**
1016+
- All supported OpenShift component versions can validate and use both RSA (2048/3072/4096) and ECDSA (P-256/P-384/P-521) certificates
1017+
- Certificate verification is based on CA trust, not on specific algorithms or key sizes
1018+
- Components communicate using standard TLS, which transparently handles different certificate types
1019+
- Each operator independently manages its own certificates without coordination
9161020

917-
Kubelets on different nodes may be at different versions:
918-
- All supported kubelet versions can validate RSA and ECDSA certificates
919-
- Certificate generation on kubelets (kubelet-serving) happens independently per node
920-
- Mixed algorithms across nodes is explicitly supported
921-
- No coordination required between kubelets
922-
923-
**External Component Skew:**
924-
925-
Components external to OpenShift (load balancers, monitoring systems) may connect to the cluster:
926-
- All modern TLS libraries support RSA 2048/3072/4096 and ECDSA P-256/P-384/P-521
927-
- Administrators are responsible for ensuring external components support configured algorithms
928-
- Documentation will note minimum TLS library versions for ECDSA support (Go 1.13+, OpenSSL 1.1.1+, etc.)
1021+
During upgrades:
1022+
- The empty PKI resource created during upgrade ensures all operators (old and new) use consistent defaults
1023+
- Administrators can update PKI configuration after upgrade completes
1024+
- Certificate rotation is gradual and asynchronous per existing mechanisms
1025+
- Mixed certificate parameters across the cluster are explicitly supported
9291026

9301027
## Operational Aspects of API Extensions
9311028

@@ -1025,9 +1122,9 @@ However, there are some considerations:
10251122
oc get events -n openshift-kube-apiserver-operator --field-selector reason=CertificateGenerationFailed
10261123
```
10271124

1028-
2. Review certificate generation metrics:
1029-
```promql
1030-
rate(openshift_pki_certificate_generation_errors_total[5m])
1125+
2. Check operator logs for certificate generation errors:
1126+
```bash
1127+
oc logs -n openshift-kube-apiserver-operator deployment/kube-apiserver-operator | grep -i "certificate.*error"
10311128
```
10321129

10331130
3. Verify cryptographic libraries are functioning:
@@ -1090,9 +1187,11 @@ However, there are some considerations:
10901187
# Remove or fix invalid configuration
10911188
```
10921189

1093-
3. Force rotation of affected certificates:
1190+
3. Wait for natural certificate rotation, or force rotation by deleting certificate secrets:
10941191
```bash
1095-
oc patch pki cluster --type merge -p '{"metadata":{"annotations":{"pki.config.openshift.io/force-rotation":"true"}}}'
1192+
# Each operator regenerates certificates when secrets are deleted
1193+
# Example for kube-apiserver serving certificate:
1194+
oc delete secret -n openshift-kube-apiserver kube-apiserver-serving-cert
10961195
```
10971196

10981197
4. Monitor rotation progress:
@@ -1119,16 +1218,16 @@ However, there are some considerations:
11191218
# Change to more compatible algorithm (e.g., ECDSA P-521 → RSA 2048)
11201219
```
11211220

1122-
4. Force rotation of affected certificate:
1221+
4. Force rotation of affected certificate by deleting the secret:
11231222
```bash
1124-
# Annotation triggers immediate rotation
1125-
oc patch pki cluster --type merge -p '{"metadata":{"annotations":{"pki.config.openshift.io/force-rotation-certificate":"kube-apiserver-serving"}}}'
1223+
# Delete the secret to force regeneration
1224+
oc delete secret -n openshift-kube-apiserver kube-apiserver-serving-cert
11261225
```
11271226

11281227
5. Verify new certificate is generated and working:
11291228
```bash
11301229
oc get secret -n openshift-kube-apiserver kube-apiserver-serving-cert -o jsonpath='{.metadata.creationTimestamp}'
1131-
# Should show recent timestamp
1230+
# Should show recent timestamp after regeneration
11321231
```
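The "recent timestamp" check in this step can be automated. The sketch below assumes the secret's `creationTimestamp` is in the usual `YYYY-MM-DDTHH:MM:SSZ` form returned by the command above; the function name and the 10-minute window are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def was_recently_created(creation_timestamp, now=None, max_age=timedelta(minutes=10)):
    """Return True if the timestamp is at most max_age old and not in the future."""
    created = datetime.strptime(creation_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    created = created.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return timedelta(0) <= now - created <= max_age
```

A script could feed the `jsonpath` output into this function and retry until it returns true, which indicates that the operator has regenerated the secret.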
11331232

11341233
**Scenario: Need to revert all certificates to defaults**
@@ -1138,11 +1237,12 @@ However, there are some considerations:
11381237
oc delete pki cluster
11391238
```
11401239

1141-
2. Wait for natural certificate rotation, or force rotation:
1240+
2. Wait for natural certificate rotation, or force rotation by deleting certificate secrets:
11421241
```bash
1143-
# Each operator has its own forced rotation mechanism
1144-
# Example for kube-apiserver:
1145-
oc patch kubeapiserver cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"pki-reset-$(date +%s)\"}}"
1242+
# Delete certificate secrets to force regeneration
1243+
# Example for kube-apiserver serving certificate:
1244+
oc delete secret -n openshift-kube-apiserver kube-apiserver-serving-cert
1245+
# Repeat for other certificates as needed
11461246
```
11471247

11481248
3. Certificates will be regenerated with platform defaults
