@joseorpa commented Oct 27, 2025

Enhancement: Ingress Operator Resource Configuration via v1alpha1 API
This enhancement proposes adding the ability to configure resource limits
and requests for the ingress-operator deployment containers via a new
v1alpha1 API field in the IngressController custom resource.

This addresses the need for:

  • Setting resource limits for QoS guarantees
  • Compliance requirements for resource constraints
  • Scaling operator resources for large deployments

Relates to: RFE-1476

jortizpa and others added 3 commits October 14, 2025 13:45
@openshift-ci bot requested review from Miciah and rfredette October 27, 2025 15:32
@openshift-ci bot added the needs-ok-to-test label Oct 27, 2025

openshift-ci bot commented Oct 27, 2025

Hi @joseorpa. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Comment on lines 574 to 588
### Alternative 2: Modify v1 API directly

Add `operatorResourceRequirements` field directly to stable v1 API.

**Pros**:
- No need for v1alpha1 version
- Simpler for users (one API version)

**Cons**:
- Changes stable API (breaking compatibility promise)
- Cannot iterate on design easily
- Difficult to remove if issues found
- Against OpenShift API stability guarantees

**Decision**: Rejected - Use v1alpha1 for new features as per OpenShift conventions

Contributor

Where is this v1alpha1 convention coming from? Can we introduce v1alpha1 when we already have v1?

The usual approach is to add the field directly to the existing v1 API:

  1. Define a new featuregate, initially in the TPNU feature set (but not Default).
  2. Add a field to the v1 API, using the new featuregate (as you've done using the // +openshift:enable:FeatureGate marker).
  3. Implement the feature and write tests.
  4. Add the featuregate to the Default feature set when it's ready.
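
For illustration, step 2 of that workflow might look roughly like the sketch below. It is only a sketch: the gate name IngressRouterResourceLimits and the field name resources are taken from later parts of this enhancement, not from a merged API.

```go
// Illustrative only, not the actual openshift/api change: a feature-gated
// field declared on the existing v1 IngressControllerSpec.
package v1

import (
	corev1 "k8s.io/api/core/v1"
)

type IngressControllerSpec struct {
	// ... existing fields elided ...

	// resources defines resource requests and limits for the router container.
	// When omitted, the operator keeps its current built-in defaults.
	// +openshift:enable:FeatureGate=IngressRouterResourceLimits
	// +optional
	Resources *corev1.ResourceRequirements `json:"resources,omitempty"`
}
```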

Author (@joseorpa, Oct 28, 2025)

This v1alpha1 convention comes from openshift/api#2485 (review)

Author

@JoelSpeed can you help here?

Contributor

There is a difference between adding a field to an already stable API (which Miciah has pointed out) and adding a completely new API.

The PR I reviewed, and left feedback on, was introducing a completely new API type, and as such, starting as alpha is correct per our latest guidelines.

If you think this should just be a field on an existing v1 API then that's a different discussion

Comment on lines 105 to 107
Create a new v1alpha1 API version for IngressController in the
`operator.openshift.io` group, following the pattern established, for example, by
[cluster monitoring v1alpha1 configuration](https://github.com/openshift/api/blob/94481d71bb6f3ce6019717ea7900e6f88f42fa2c/config/v1alpha1/types_cluster_monitoring.go#L172-L193).

Contributor

Can we use a shared type for all operators?

Author

You mean the core Kubernetes corev1.ResourceRequirements type? I've seen that there are a lot of types in the operator.openshift.io group.
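
For reference, a shared type could be little more than a named wrapper around the core type; the sketch below is purely illustrative (the type name and json tags are assumptions), and whether such a type is worth having is exactly the question above.

```go
// Illustrative only: an operator-agnostic shape that any operator.openshift.io
// API could embed, pairing a container name with the upstream core type so
// requests, limits, and claims behave exactly as they do on pods.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
)

type ContainerResourceRequirements struct {
	// name is the container whose resources are being configured.
	Name string `json:"name"`

	// resources reuses the core Kubernetes type directly.
	Resources corev1.ResourceRequirements `json:"resources"`
}
```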

- Maintain backward compatibility with existing IngressController v1 API
- Use v1alpha1 API version for this Tech Preview feature
- Provide sensible defaults that work for most deployments
- Support both the ingress-operator and kube-rbac-proxy containers

Contributor

Why kube-rbac-proxy? Is that only for QoS?

Author

I'm correcting this as well. It was included for QoS, but I agree it's not directly related to the router pods.

Comment on lines 258 to 266
A new controller (`operator-deployment-controller`) in the cluster-ingress-operator
watches the default IngressController CR and reconciles the operator's own deployment
when `operatorResourceRequirements` is specified.

**Controller responsibilities:**
1. Watch IngressController resources (v1alpha1)
2. Reconcile `ingress-operator` Deployment in `openshift-ingress-operator` namespace
3. Update container resource specifications
4. Handle error cases gracefully (invalid values, conflicts, etc.)

Contributor

This won't work; CVO manages the ingress-operator deployment. You can't have cluster-ingress-operator update its own deployment.

Author

I'm updating this as well; the ingress operator would control just the deployment of the router pods.

**Mitigation**:
- Controller reconciliation loop detects and corrects drift
- Document that configuration should be via IngressController CR, not direct deployment edits
- Admission webhooks prevent direct deployment modifications

Contributor

Are you proposing adding an admission webhook to block updates to the ingress-operator deployment?

Author (@joseorpa, Oct 28, 2025)

I'm correcting this and changing it to a conversion webhook for the different API versions.


This enhancement proposes adding the ability to configure resource limits and
requests for the ingress-operator deployment containers via a new v1alpha1 API
field in the IngressController custom resource.

Contributor

If this is for the ingress-operator deployment, it doesn't make sense to put this in the IngressController CRD, which describes configuration for router pods.

Author

I agree, I will update the content of this part of the enhancement as well.

Comment on lines 351 to 352
1. **Q**: Should we support auto-scaling (VPA) in the future?
- **A**: Out of scope for initial implementation, but API should not preclude it

Contributor

Autoscaling the operator?

Author

This should be the router pods for sure; updating this too.

Comment on lines 357 to 358
3. **Q**: Should this apply to all IngressControllers or only the default?
- **A**: Initial implementation only default, but API supports any IngressController

Contributor

Does the configuration apply to IngressControllers (router) pods at all, or only to the ingress-operator pod?

If you mean it applies only to the ingress-operator pod, are you saying that resource requests and limits for the ingress-operator pod are read from the "default" IngressController, and resource requests and limits specified on other IngressController CRs are ignored? Putting configuration for the operator in the IngressController CRD is confusing (see #1877 (comment)).

If you actually mean resource requests and limits for router pods, then it seems to me that it is simplest and least surprising to respect the configuration for all IngressControllers, not only for the default. Does respecting configuration for other IngressControllers pose some problem?

Author

It will be for all router pods.

Comment on lines 360 to 361
4. **Q**: How do we handle the operator modifying its own deployment safely?
- **A**: Use owner references carefully, reconcile loop with backoff

Contributor

Can you elaborate on this point? How do you avoid conflicts with CVO?

Author

Changed it to the router pods controlled by the ingress controller.

Comment on lines 426 to 427
- [ ] Sufficient field testing (2+ minor releases in Tech Preview)
- [ ] No major bugs reported for 2 consecutive releases

Contributor

This is an unusual requirement for OpenShift. For a feature like this, we would usually introduce it as Tech Preview and graduate it to GA in the same release development cycle.

- [ ] No major bugs reported for 2 consecutive releases
- [ ] Performance impact assessed and documented
- [ ] API design validated by diverse user scenarios
- [ ] At least 10 production users providing positive feedback

Contributor

Do you believe you will be able to find 10 production users of this feature?

Comment on lines +562 to +564
- Simpler to implement
- No API version changes needed
- Easy to update without CRD changes

Contributor

You would need a CRD change to add a reference to the ConfigMap... unless you would have the operator just check for a ConfigMap in openshift-config with some hard-coded name?
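
For comparison, the hard-coded-name variant mentioned here is a small amount of code; in the sketch below the ConfigMap name ingress-operator-resources, the openshift-config namespace choice, and the function name are assumptions.

```go
// Illustrative sketch of the "well-known ConfigMap" variant: the operator
// looks for a ConfigMap with a hard-coded name in openshift-config and falls
// back to its built-in defaults when it is absent. Names are hypothetical.
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func lookupResourceOverrides(ctx context.Context, client kubernetes.Interface) (map[string]string, error) {
	cm, err := client.CoreV1().ConfigMaps("openshift-config").Get(ctx, "ingress-operator-resources", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil, nil // no override configured; keep built-in defaults
	}
	if err != nil {
		return nil, err
	}
	return cm.Data, nil
}
```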

Comment on lines 590 to 604
### Alternative 3: Separate CRD for operator configuration

Create a new OperatorConfiguration CRD (similar to how cluster monitoring works).

**Pros**:
- Separation of concerns
- Can configure multiple operators uniformly

**Cons**:
- Increases API surface unnecessarily
- IngressController is the logical place for ingress-operator configuration
- More CRDs to manage
- Inconsistent with how other operators handle self-configuration

**Decision**: Rejected - IngressController CR is the appropriate configuration location

Contributor

If you really do mean for this EP to be specifically for the ingress-operator pod (and not router pods), then I really like this alternative. Have you considered a variant: adding configuration for resource requests and limits to the ClusterVersion CRD (alongside the existing component overrides)? This makes a lot of sense for a few reasons:

  • CVO is the thing that manages the deployment right now; trying to have cluster-ingress-operator update the deployment that CVO manages is asking for trouble.
  • The resource requests and limits configuration logically fits under CVO configuration, not the IngressController API.
  • The configuration logically fits in with component overrides.
  • The resource requests and limits configuration could apply to any operator, not just cluster-ingress-operator; putting the configuration under the ClusterVersion CRD would provide a centralized, consistent way to configure it for multiple operators.
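
To make that variant concrete, a purely hypothetical shape is sketched below; ClusterVersion has no such field today, and the type and field names are invented for illustration only.

```go
// Hypothetical sketch only: it shows how per-component resource configuration
// might sit alongside ClusterVersion's existing component overrides, selecting
// a CVO-managed workload and one of its containers.
package v1

import (
	corev1 "k8s.io/api/core/v1"
)

// ComponentResourceOverride selects a CVO-managed workload and supplies
// resource requirements for one of its containers.
type ComponentResourceOverride struct {
	// kind, group, namespace, and name mirror the selector fields of the
	// existing ComponentOverride type.
	Kind      string `json:"kind"`
	Group     string `json:"group"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`

	// container is the container within the selected workload to configure.
	Container string `json:"container"`

	// resources are the requests and limits CVO would apply when rendering
	// the manifest for the selected workload.
	Resources corev1.ResourceRequirements `json:"resources"`
}
```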

Comment on lines 614 to 620
**Cons**:
- Not GitOps friendly
- Requires direct deployment modification
- Not discoverable via API
- Doesn't follow OpenShift declarative configuration patterns
- Difficult to audit and version control

Contributor

Also, it would require a CVO override.


## Design Details

### Open Questions

Contributor

Can you address this point from https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#resources-and-limits?

We do not want cluster components to be restarted based on their resource consumption (for example, being killed due to an out-of-memory condition). We need to detect and handle those cases more gracefully, without degrading cluster performance.

Miciah commented Oct 28, 2025

/assign

openshift-ci bot commented Oct 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from miciah. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joseorpa changed the title from "Ingress operator resource config" to "Ingress router resource config" Oct 29, 2025

rikatz commented Nov 5, 2025

/cc @alebedev87

@openshift-ci bot requested a review from alebedev87 November 5, 2025 15:37

candita commented Nov 25, 2025

Sorry, I forgot to assign this.
/assign @alebedev87

@yuqi-zhang left a comment

Some general questions from the API perspective inline

TechPreviewNoUpgrade feature set, and will be promoted to the Default feature set
once the feature graduates to GA.

**Enabling the Feature Gate:**

Contributor

(non-blocking): feature gating and tech preview are generally described in the OpenShift docs and processes. I don't think we need to describe the workflow in detail here (to keep the enhancement more concise) but fine to keep as is as well if you prefer.

// router pods (HAProxy containers). This field allows setting resource limits
// to achieve Guaranteed QoS class for router pods.
//
// When this field is set, it takes precedence over spec.nodePlacement.resources

Contributor

Is this referring to https://github.com/openshift/api/blob/bfa868a224015e94456731c1b5b0c849f901b417/operator/v1/types_ingress.go#L435? I don't see a resources field in nodePlacement. Could you help me understand where that is being set?


// tuning defines parameters for tuning the performance of ingress controller pods.
// +optional
Tuning *IngressControllerTuning `json:"tuning,omitempty"`

// metricsContainer specifies resource requirements for the metrics sidecar
// container in router pods.
//
// If not specified, uses Kubernetes default behavior (no requests or limits).

Contributor

Curious if it's worth considering adding defaults to these as well?
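
If defaults were added, one option is operator-side defaulting when the field is unset; the sketch below is illustrative only, and the helper name and quantities are placeholders rather than proposed values.

```go
// Placeholder sketch: apply built-in defaults for the metrics sidecar when
// the field is unset. The specific quantities here are illustrative only.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func metricsResourcesOrDefault(spec *corev1.ResourceRequirements) corev1.ResourceRequirements {
	if spec != nil {
		return *spec
	}
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("10m"),
			corev1.ResourceMemory: resource.MustParse("40Mi"),
		},
	}
}
```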

5. The ingress-operator reconciles the router deployment with the specified resources
6. Kubernetes performs a rolling restart of the router pods with the new resource configuration
7. Router pods achieve Guaranteed QoS class (when limits == requests)
8. Platform administrator verifies the changes with `oc describe deployment router-default -n openshift-ingress`

Contributor

Alternatively, we could have IngressControllerStatus also introduce a subfield for this to reflect whether it was properly applied, but maybe that's not needed given that it's relatively easy to view the deployment as well.
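
For context, step 5 of the quoted flow reduces to copying the configured requirements onto the router container during reconciliation; the sketch below assumes the function and container names, and Guaranteed QoS follows only when every container's limits equal its requests.

```go
// Minimal sketch (names assumed): during reconciliation the operator copies
// the configured requirements onto the router container of the generated
// deployment. Guaranteed QoS results only when every container's limits
// equal its requests.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func applyRouterResources(deploy *appsv1.Deployment, res *corev1.ResourceRequirements) {
	if res == nil {
		return // field unset: keep the operator's existing defaults
	}
	containers := deploy.Spec.Template.Spec.Containers
	for i := range containers {
		if containers[i].Name == "router" {
			containers[i].Resources = *res
		}
	}
}
```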

// limits: none
//
// +optional
RouterContainer *corev1.ResourceRequirements `json:"routerContainer,omitempty"`

Contributor

ResourceRequirements also has a Claims subfield: https://pkg.go.dev/k8s.io/api/core/v1#ResourceRequirements. Would we plan on allowing the user to set that?

1. **Resource limits must be >= requests**: Kubernetes standard validation enforced by API server
2. **Feature gate check**: If `IngressRouterResourceLimits` feature gate is disabled,
the `resources` field will be ignored (with a warning event logged)
3. **Minimum values** (recommendations, not hard limits):

Contributor

Would we be proposing that the API itself has no validation, but the ingress controller would check the spec and emit those events? I think it may be best to have some type of min/max validation on the API itself, which would also help with documentation by listing suggested values for the user.
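
For reference, the operator-side path the quoted text describes (warning events rather than rejection) is a small check; the sketch below uses assumed helper and event-reason names and the recommended minimums quoted above, and also flags limits lower than requests, a check the enhancement currently defers to pod-level validation.

```go
// Hedged sketch (helper and event reason names assumed): warn, but do not
// reject, when router requests fall below the recommended minimums, and flag
// any limit that is lower than its corresponding request.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

var (
	minRouterCPU    = resource.MustParse("100m")
	minRouterMemory = resource.MustParse("128Mi")
)

func warnOnRouterResources(recorder record.EventRecorder, obj runtime.Object, res corev1.ResourceRequirements) {
	if cpu, ok := res.Requests[corev1.ResourceCPU]; ok && cpu.Cmp(minRouterCPU) < 0 {
		recorder.Eventf(obj, corev1.EventTypeWarning, "LowResourceRequest",
			"cpu request %s is below the recommended minimum of 100m", cpu.String())
	}
	if mem, ok := res.Requests[corev1.ResourceMemory]; ok && mem.Cmp(minRouterMemory) < 0 {
		recorder.Eventf(obj, corev1.EventTypeWarning, "LowResourceRequest",
			"memory request %s is below the recommended minimum of 128Mi", mem.String())
	}
	for name, limit := range res.Limits {
		if req, ok := res.Requests[name]; ok && limit.Cmp(req) < 0 {
			recorder.Eventf(obj, corev1.EventTypeWarning, "InvalidResourceLimit",
				"limit for %s (%s) is lower than its request (%s)", name, limit.String(), req.String())
		}
	}
}
```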

3. **Minimum values** (recommendations, not hard limits):
- Router container: cpu >= 100m, memory >= 128Mi recommended for production
- Values below recommendations will generate warning events but not block the request
4. **Precedence validation**: When both `spec.resources` and `spec.nodePlacement.resources`

Contributor

Same question as above, where is this field located and how would we validate it?

Labels: needs-ok-to-test

7 participants