Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud API component for egress IP #571

Conversation

alexanderConstantinescu
Copy link
Contributor

This enhancement proposal discusses and presents the problem of supporting OpenShift egress IP on conventional clouds, such as: GCP, AWS, Azure. It presents the problem that exists today and the different models possible for solving it. It leaves and presents a lot of open-ended questions, for which it hopes to get a discussion going during the review of this proposal and ultimately a clearer path and joint decision. It thus encourages anyone familiar with Kubernetes / OpenShift networking to join in the discussion.

/assign @danwinship @tssurya @squeed @abhat @trozet

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign squeed after the PR has been reviewed.
You can assign the PR to them by writing /assign @squeed in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
@squeed
Copy link
Contributor

squeed commented Jan 13, 2021

One thought: right now, (at least with ovn-k), assigning an egress IP to a node can't fail. That is no longer the case, so we'll have to think about how to handle that.

Because the possibility of cascading failures is higher here, I'm leaning towards a separate CR for ip assignment request, rather than an annotation.

That way, ovn-k could easily distill status and indicate to the end-user where to get more debugging information. Additionally, we could set up a proper OwnerReference, making deletion more automatic.

@alexanderConstantinescu
Copy link
Contributor Author

/assign @vpickard

@alexanderConstantinescu
Copy link
Contributor Author

One thought: right now, (at least with ovn-k), assigning an egress IP to a node can't fail. That is no longer the case, so we'll have to think about how to handle that.

But this is outside the network plugin's responsibility, right? So this is effectively not its failure, it would be this cloud components failure.

Because the possibility of cascading failures is higher here, I'm leaning towards a separate CR for ip assignment request, rather than an annotation.

Are we fine with introducing the possible dependencies mentioned in this doc though, given the selection of such a design?

That way, ovn-k could easily distill status and indicate to the end-user where to get more debugging information

Again, I don't think it's up to OVN-Kubernetes or openshift-sdn to indicate anything to the end-user except what falls under its responsibility. This cloud assignment would be the responsibility of this component.

@openshift-bot
Copy link

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Aug 14, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 14, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alexanderConstantinescu
Copy link
Contributor Author

/reopen

@openshift-ci openshift-ci bot reopened this Aug 18, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 18, 2021

@alexanderConstantinescu: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Sep 17, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 17, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alexanderConstantinescu
Copy link
Contributor Author

/reopen

@openshift-ci openshift-ci bot reopened this Sep 17, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 17, 2021

@alexanderConstantinescu: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@raffaelespazzoli
Copy link

some considerations from my experience developing the egressip-ipam-operator[1].

There seems to be three macro functionalities w.r.t to egressip:

  1. assignment of egress ips to namespaces
  2. assignment of egress ips to nodes
  3. configuration of the underlying infrastructure to acknowledge the assigned egress ips to nodes.

I think as of today in OCP we expect #1 to be manual or automated externally.
It's unclear to me what the expectation is for #2.
And #3 in the cloud is clearly the focus of the enhancement.

I'd like to point out that #2 and #3 cannot be thought of disjointed functionality when running in the cloud because each cloud provider sets limits to the number of secondary IPs that can be carried by a NIC. The rules change by cloud provider, but essentially each node has a secondary ip capacity. That should be taken into account in #2 when assigning an IP to a node.
That should also be taken in consideration when a node dies and secondary IPs must be redistributed to other surviving nodes. It's unclear to me at this point what the new egressIP capability will look like in this space.
Metrics should be emitted and collected by prometheus to allow the customer understand the capacity and consumption situations. This feature is not present in the egressip-ipam-operator and customers which issue lots of egressips had to develop something custom.

Other, possibly minor, considerations are that the Infrastructure object does not always provide all the information needed to call the underlying cloud api, sometime the information can be derived from the machine object. But care must be taken in this space if the intention is to support UBI deployments or IPI deployments to pre-existing networks. Also for Azure with internal facing load balancers there were some issues that we never fully understood, so that use case should be probably tested.

@alexanderConstantinescu
Copy link
Contributor Author

Thank you for your comment @raffaelespazzoli !

I'd like to point out that #2 and #3 cannot be thought of disjointed functionality when running in the cloud because each cloud provider sets limits to the number of secondary IPs that can be carried by a NIC. The rules change by cloud provider, but essentially each node has a secondary ip capacity. That should be taken into account in #2 when assigning an IP to a node.

I was not aware of that! This will need change the model presented in this enhancement proposal a bit. I just saw that on GCP that the alias IP limit is set to 10 per NIC, see: https://cloud.google.com/vpc/docs/quota#per_instance this means that this controller will need to convey the maximum amount to the network plugins, so that they are aware of the limits per node when performing the assignment. I will update the enhancement proposal with a solution for this.

Metrics should be emitted and collected by prometheus to allow the customer understand the capacity and consumption situations. This feature is not present in the egressip-ipam-operator and customers which issue lots of egressips had to develop something custom.

Yes, this controller will probably expose some high level metrics. Given what you are describing I assume a "count of IPs per node" is a good start.

Also for Azure with internal facing load balancers there were some issues that we never fully understood, so that use case should be probably tested.

OK, I will have a look at that during the testing phase. I will make sure to note it in this proposal.

@alexanderConstantinescu alexanderConstantinescu force-pushed the cloud-api-component branch 2 times, most recently from 2c1ca70 to 3213da4 Compare September 27, 2021 09:31
@danwinship
Copy link
Contributor

danwinship commented Sep 28, 2021

There seems to be three macro functionalities w.r.t to egressip:

  1. assignment of egress ips to namespaces

  2. assignment of egress ips to nodes

  3. configuration of the underlying infrastructure to acknowledge the assigned egress ips to nodes.

I think as of today in OCP we expect #1 to be manual or automated externally.

Yes. People occasionally ask for it to be fully automated within OCP but I don't think that really makes sense. The possibilities here seem to be:

1a. ("manual") When the admin wants to assign an egress IP to a namespace, they choose an egress IP themselves, tell OCP, and update their firewall (or whatever other external system it is that cares about the egress IPs).
1b. ("automated externally") The admin creates some external service that has credentials with both OCP and their firewall. When they want to assign an egress IP to a namespace, they ask this service to do it, and it allocates an IP and updates OCP and the firewall accordingly.
1c. ("automated internally"?) The admin first tells OCP the complete set of egress IPs that are available. Then later when they want to assign an egress IP to a namespace, they tell OCP to do it, then wait for it to happen, then check to see what IP got allocated, then update the firewall.

1c really feels just as "manual" as 1a, you're just moving the manual steps around. And there's no way we could provide a generic version of 1b, so...

It's unclear to me what the expectation is for #2.

openshift-sdn has two modes; the older mode where the admin assigns IPs to nodes manually, and the newer mode where they get assigned automatically. ovn-kubernetes only supports the latter.

@raffaelespazzoli
Copy link

@danwinship in the egressip-ipam-operator we automated the assignment of egress IPs to namespaces. This works fine as long as you don't care what IP is assigned to a namespace. In fact, the thing that really matters to enterprises is that once an IP is assigned to a namespace it does never change, so that a network admin (or another operator) can configure the firewall.
That said the operator also support manual assignment (which can be automated externally) or a combination of manual assignments and automatic assignment.
The value of having an IPAM service for egress IPs should be self-evident, especially when you think about those scenario in which each namespace gets one (and namespaces come and go frequently).
I have not stressed the need for that capability in my previous comment because this is not the right PR. But my hope is that at some point we will support it.

enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Outdated Show resolved Hide resolved
enhancements/network/cloud-egress-ip.md Show resolved Hide resolved
@danwinship
Copy link
Contributor

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 28, 2021
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
@danwinship
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2021
Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2021
@openshift-merge-robot openshift-merge-robot merged commit ab44af9 into openshift:master Oct 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.