
Port/Trunk Re-use Logic Bugs #834

Closed
iamemilio opened this issue Apr 6, 2021 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@iamemilio
Contributor

/kind bug

The port and trunk re-use logic in the instance create function of CAPO is buggy in all versions, and even if it were not, it would still be subject to a race condition.
Example 1: ports created for instances
Bugs:

  • ports found could be in use
  • ports found could be in an error state
  • if the user needs 2 NICs from the same network, it will re-use the same port twice and cause a Nova error
  • port found may not match the requirements specified by the user

Race:

  • It's possible that another service could take the port CAPO has found before CAPO is able to add it to its instance as an interface

The same points apply to trunk ports and the upcoming port logic.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 6, 2021
@sbueringer
Member

@iamemilio Just out of curiosity, what other services could take the ports created by CAPO?

What would be the correct way to implement this to avoid all these problems?

@iamemilio
Contributor Author

iamemilio commented Apr 6, 2021

It might be the case that the user is deploying on a shared OpenStack tenant, and another service takes a port that CAPO was planning to attach to an instance. I think this is a relatively unlikely case though, and it might be resolvable with just a simple reconcile loop. I think it's much more likely that CAPO would mistakenly take a port that was set up for infrastructure purposes in a given network. For example, we create ports in the same subnet as our master and worker nodes that are assigned VIPs in the OpenShift platform. While OpenStack views these ports as being in "DOWN" state, they are actually in use. This is a very tricky situation to resolve. In my opinion, there are two ways to go about this:

The Easy Way: CAPO always creates and destroys the ports it consumes for interfaces.
Benefits:

  • no conflicts of interest
  • simple

Cost:

  • more neutron overhead

Example implementation in OpenShift's v1alpha1 fork: openshift#175

The Hard Way: we figure out all the nuances to how to safely re-use ports in a way that supports all users
Benefits:

  • infra re-use = less neutron overhead

Costs:

  • complexity
  • requires a nuanced solution that may take multiple iterations to perfect

@sbueringer
Member

sbueringer commented Apr 6, 2021

Okay, I think I mostly got it. And those ports are created with the same name as the server?

If I understand our code correctly, we are creating / listing the ports with the same name as the server instances.

I'm just asking because, if we cannot even be reasonably sure that ports with the server names were created by us, I'm not sure we can 100% avoid leaking ports without storing all resources we create somewhere (similar to how Terraform does it with its state). Even then we're not 100% safe, as the controller can theoretically be killed between OpenStack resource creation and storing the state (e.g. in a Kubernetes CRD).

So right now the only way to connect OpenStack resources to their counterparts in Kubernetes/CAPO is their names. I think we have a huge problem if we cannot rely on that.

@iamemilio
Contributor Author

Hmm, yeah, I overlooked that to be honest 😃. I see where you are coming from, and I think it's probably okay to continue listing the ports the way that we have been. That being said, I think we should be defensive about which ports we select to use and check that:

  • the port is in "DOWN" state
  • the port hasn't already been selected for another interface on the current instance

Then we could check or apply an update to that port to make sure that it meets the user's requirements.
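A minimal sketch of what those defensive checks could look like with gophercloud's ports API (a hypothetical helper, not the actual CAPO code; the service client wiring and the alreadyClaimed tracking are assumptions for illustration):

```go
import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// findReusablePort looks up ports by the expected name and only returns one
// that is safe to re-use: status DOWN, not attached to any device, and not
// already claimed for another interface of the instance being built.
func findReusablePort(client *gophercloud.ServiceClient, name string, alreadyClaimed map[string]bool) (*ports.Port, error) {
	allPages, err := ports.List(client, ports.ListOpts{Name: name}).AllPages()
	if err != nil {
		return nil, err
	}
	existing, err := ports.ExtractPorts(allPages)
	if err != nil {
		return nil, err
	}
	for i := range existing {
		p := existing[i]
		if p.Status != "DOWN" || p.DeviceID != "" || alreadyClaimed[p.ID] {
			continue
		}
		return &p, nil
	}
	return nil, fmt.Errorf("no re-usable port named %q found", name)
}
```

After picking a port, the caller would still need to verify (or update) its attributes against the user's spec, as mentioned above.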

@sbueringer
Member

@iamemilio Sounds good to me :)

@iamemilio iamemilio changed the title Port/Trunk Re-use Logic Race Condition and Bugs Port/Trunk Re-use Logic Bugs Apr 7, 2021
@iamemilio
Contributor Author

What is the value of having this port re-use logic, out of curiosity?

@sbueringer
Member

Compared to deleting and creating the port? I'm happy as long as we're not leaking ports.

@macaptain
Contributor

@sbueringer @iamemilio I'd be interested in seeing what could be done to both tighten up the port re-use logic, and potentially also make it more useful.

A few comments on the discussion so far:

If the user needs 2 NICs from the same network, it will re-use the same port twice and cause a Nova error

This is fixed with #876. Multiple NICs on the same network now get different indices, and hence different port names, so they don't collide.

The outstanding issues are:

  • ports found could be in use
  • ports found could be in an error state

We could check these conditions in code where the port name is matched.

  • port found may not match the requirements specified by the user

Could we copy the 'filter' pattern used for networks and subnets and ensure that if we re-use a port, it matches at least what is specified in the filter?
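For illustration, a check in the spirit of the network/subnet filters could look something like this; the portFilter type and its fields are assumptions, not existing CAPO API:

```go
import "github.com/gophercloud/gophercloud/openstack/networking/v2/ports"

// portFilter mirrors the idea of the existing NetworkFilter/SubnetFilter for
// ports; the fields here are illustrative only.
type portFilter struct {
	NetworkID string
	FixedIP   string
}

// portMatchesFilter reports whether an existing Neutron port satisfies the
// user's filter, so a re-used port is never weaker than what was requested.
func portMatchesFilter(p ports.Port, f portFilter) bool {
	if f.NetworkID != "" && p.NetworkID != f.NetworkID {
		return false
	}
	if f.FixedIP == "" {
		return true
	}
	for _, ip := range p.FixedIPs {
		if ip.IPAddress == f.FixedIP {
			return true
		}
	}
	return false
}
```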

  • It's possible that another service could take the port CAPO has found before CAPO is able to add it to its instance as an interface

This is interesting. I think port creation and port attachment will never be atomic in OpenStack, so, as @iamemilio suggests, a retry in the reconcile loop would make sense.
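A rough sketch of that retry, assuming a controller-runtime style reconciler and a hypothetical attach callback (not how CAPO structures this today):

```go
import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// attachPortWithRetry requeues the reconcile if attaching the port fails,
// e.g. because another consumer grabbed it between lookup and attach, so the
// next reconcile can re-discover or re-create the port instead of failing hard.
func attachPortWithRetry(ctx context.Context, attach func(context.Context) error) (ctrl.Result, error) {
	if err := attach(ctx); err != nil {
		// Port lookup and attachment are not atomic in Neutron; back off and retry.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```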

What is the value of having this port re-use logic, out of curiosity?

I'd actually like to have the option to use a port that already exists for my CAPO instance, so I'm glad we have getOrCreatePort already. One of my use cases involves handling exactly which fixed IPs in a subnet (which may not be the whole subnet) are allocated to instances. I could solve this by creating all the ports with the IPs I want in my network up front, and have CAPO re-use these ports.

The biggest concern here seems to be leaking ports. One idea is a deleteOnDetach: field on the port options, defaulting to true, which would guarantee that detached ports are cleaned up and not leaked. In the use case above, deleteOnDetach: false would return a re-used port to my 'pool' of ports.
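To make that concrete, a hypothetical API sketch; the field name, default, and surrounding struct are only this proposal, nothing here exists in CAPO today:

```go
// PortOpts is sketched only to show where the proposed field would sit; the
// other fields are illustrative, not CAPO's actual type definition.
type PortOpts struct {
	NetworkID  string `json:"networkId,omitempty"`
	NameSuffix string `json:"nameSuffix,omitempty"`

	// DeleteOnDetach controls whether CAPO deletes the port when it is detached
	// from the instance. Defaulting to true keeps the "never leak ports"
	// behaviour; false returns a pre-created port to the user's pool.
	DeleteOnDetach *bool `json:"deleteOnDetach,omitempty"`
}
```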

Let me know if you'd want a more thorough design document detailing these ideas.

@sbueringer
Member

@macaptain Sounds good to me, although I don't have these use cases, so how exactly we implement this isn't that relevant to me :)

I think it's important to reach consensus with the OpenShift folks. Let's see what @iamemilio thinks about it.

@macaptain
Contributor

Hey @iamemilio, what do you think about the port re-use suggestions above? (#834 (comment))

@iamemilio
Contributor Author

Sorry for the long delay in getting back to you. I like the idea and would be on board with that.

@iamemilio
Contributor Author

I think it would be valuable to draft a design for this, since it would be a fairly large set of features that could impact the user experience. I would be happy to help out if you want.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 28, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 28, 2021
@macaptain
Contributor

I am still thinking over this ticket and what we can do about port-reuse.

/assign
/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 29, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2022
@macaptain
Contributor

I think this is still a valid issue, although a lot of work on ports has been done. I'd support a simpler solution than I proposed above. We could better handle the error cases outlined in the issue when we get an existing port, but I haven't made progress on this and won't in the near future.

/unassign

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
