
Port/Trunk Re-use Logic Bugs #834

Closed
iamemilio opened this issue Apr 6, 2021 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@iamemilio
Contributor

/kind bug

The port and trunk re-use logic in the instance create function of CAPO is buggy in all versions, and even if it were not, it would still be subject to a race condition.
Example 1: ports created for instances
Bugs:

  • ports found could be in use
  • ports found could be in an error state
  • if the user needs 2 NICs from the same network, it will re-use the same port twice and cause a Nova error
  • port found may not match the requirements specified by the user

Race:

  • It's possible that another service could take the port CAPO has found before CAPO is able to add it to its instance as an interface

The same points apply to trunk ports and the upcoming port logic.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 6, 2021
@sbueringer
Member

@iamemilio Just out of curiosity, what other services could take the ports created by CAPO?

What would be the correct way to implement this to avoid all these problems?

@iamemilio
Contributor Author

iamemilio commented Apr 6, 2021

It might be the case that the user is deploying on a shared OpenStack tenant, and another service takes a port that CAPO was planning to attach to an instance. I think this is a relatively unlikely case though, and it might be resolvable with just a simple reconcile loop. I think it's much more likely that CAPO would mistakenly take a port that was set up for infrastructure purposes in a given network. For example, we create ports in the same subnet as our master and worker nodes that are assigned VIPs in the OpenShift platform. While OpenStack views these ports as being in "DOWN" state, they are actually in use. This is a very tricky situation to resolve. In my opinion, there are two ways to go about this:

The Easy Way: CAPO always creates and destroys the ports it consumes for interfaces.
Benefits:

  • no conflicts of interest
  • simple

Cost:

  • more neutron overhead

Example implementation in OpenShift's v1alpha1 fork: openshift#175

The Hard Way: we figure out all the nuances to how to safely re-use ports in a way that supports all users
Benefits:

  • infra re-use = less neutron overhead

Costs:

  • complexity
  • requires a nuanced solution that may take multiple iterations to perfect

@sbueringer
Member

sbueringer commented Apr 6, 2021

Okay, I think I mostly got it. And those ports are created with the same name as the server?

If I understand our code correctly, we are creating / listing the ports with the same name as the server instances.

I'm just asking because, if we cannot even be reasonably sure that ports with the server names were created by us, I'm not sure we can 100% avoid leaking ports without storing all resources we create somewhere (similar to how Terraform does it with its state). Even then we're not 100% safe, as the controller can theoretically be killed between OpenStack resource creation and storing the state (e.g. in a Kubernetes CRD).

So right now the only way to connect OpenStack resources to their counterparts in Kubernetes/CAPO is their names. I think we have a huge problem if we cannot rely on that.

@iamemilio
Contributor Author

Hmm, yeah, I overlooked that to be honest 😃. I see where you are coming from, and I think it's probably okay to continue listing the ports the way that we have been. That being said, I think we should be defensive about which ports we select to use and check that:

  • the port is in "DOWN" state
  • the port hasn't already been selected for another interface on the current instance

Then we could check or apply an update to that port to make sure that it meets the user's requirements.
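A minimal sketch of what those defensive checks could look like with gophercloud's ports API (a hypothetical helper, not the actual CAPO code; the service client wiring and the alreadyClaimed tracking are assumptions for illustration):

```go
import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// findReusablePort looks up ports by the expected name and only returns one
// that is safe to re-use: status DOWN, not attached to any device, and not
// already claimed for another interface of the instance being built.
func findReusablePort(client *gophercloud.ServiceClient, name string, alreadyClaimed map[string]bool) (*ports.Port, error) {
	allPages, err := ports.List(client, ports.ListOpts{Name: name}).AllPages()
	if err != nil {
		return nil, err
	}
	existing, err := ports.ExtractPorts(allPages)
	if err != nil {
		return nil, err
	}
	for i := range existing {
		p := existing[i]
		if p.Status != "DOWN" || p.DeviceID != "" || alreadyClaimed[p.ID] {
			continue
		}
		return &p, nil
	}
	return nil, fmt.Errorf("no re-usable port named %q found", name)
}
```

After picking a port, the caller would still need to verify (or update) its attributes against the user's spec, as mentioned above.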

@sbueringer
Member

@iamemilio Sounds good to me :)

@iamemilio iamemilio changed the title Port/Trunk Re-use Logic Race Condition and Bugs Port/Trunk Re-use Logic Bugs Apr 7, 2021
@iamemilio
Contributor Author

What is the value of having this port re-use logic, out of curiosity?

@sbueringer
Member

Compared to deleting and creating the port? I'm happy as long as we're not leaking ports.

@macaptain
Contributor

@sbueringer @iamemilio I'd be interested in seeing what could be done to both tighten up the port re-use logic, and potentially also make it more useful.

A few comments on the discussion so far:

If the user needs 2 NICs from the same network, it will re-use the same port twice and cause a Nova error

This is fixed with #876. Multiple NICs on the same network now get different indices, and hence different port names, so they don't collide.

The outstanding issues are:

  • ports found could be in use
  • ports found could be in an error state

We could check these conditions in code where the port name is matched.

  • port found may not match the requirements specified by the user

Could we copy the 'filter' pattern used for networks and subnets and ensure that if we re-use a port, it matches at least what is specified in the filter?
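For illustration, a check in the spirit of the network/subnet filters could look something like this; the portFilter type and its fields are assumptions, not existing CAPO API:

```go
import "github.com/gophercloud/gophercloud/openstack/networking/v2/ports"

// portFilter mirrors the idea of the existing NetworkFilter/SubnetFilter for
// ports; the fields here are illustrative only.
type portFilter struct {
	NetworkID string
	FixedIP   string
}

// portMatchesFilter reports whether an existing Neutron port satisfies the
// user's filter, so a re-used port is never weaker than what was requested.
func portMatchesFilter(p ports.Port, f portFilter) bool {
	if f.NetworkID != "" && p.NetworkID != f.NetworkID {
		return false
	}
	if f.FixedIP == "" {
		return true
	}
	for _, ip := range p.FixedIPs {
		if ip.IPAddress == f.FixedIP {
			return true
		}
	}
	return false
}
```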

  • It's possible that another service could take the port CAPO has found before CAPO is able to add it to its instance as an interface

This is interesting. I think port creation and port attachment will never be atomic in OpenStack, so, as @iamemilio suggests, a retry in the reconcile loop would make sense.
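A rough sketch of that retry, assuming a controller-runtime style reconciler and a hypothetical attach callback (not how CAPO structures this today):

```go
import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// attachPortWithRetry requeues the reconcile if attaching the port fails,
// e.g. because another consumer grabbed it between lookup and attach, so the
// next reconcile can re-discover or re-create the port instead of failing hard.
func attachPortWithRetry(ctx context.Context, attach func(context.Context) error) (ctrl.Result, error) {
	if err := attach(ctx); err != nil {
		// Port lookup and attachment are not atomic in Neutron; back off and retry.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```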

What is the value of having this port re-use logic, out of curiosity?

I'd actually like to have the option to use a port that already exists for my CAPO instance, so I'm glad we have getOrCreatePort already. One of my use cases involves handling exactly which fixed IPs in a subnet (which may not be the whole subnet) are allocated to instances. I could solve this by creating all the ports with the IPs I want in my network up front, and have CAPO re-use these ports.

The biggest concern here seems to be leaking ports. One idea is a deleteOnDetach: field on the port options, defaulting to true, which would guarantee that detached ports are cleaned up and not leaked. In the use case above, deleteOnDetach: false would return a re-used port to my 'pool' of ports.
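To make that concrete, a hypothetical API sketch; the field name, default, and surrounding struct are only this proposal, nothing here exists in CAPO today:

```go
// PortOpts is sketched only to show where the proposed field would sit; the
// other fields are illustrative, not CAPO's actual type definition.
type PortOpts struct {
	NetworkID  string `json:"networkId,omitempty"`
	NameSuffix string `json:"nameSuffix,omitempty"`

	// DeleteOnDetach controls whether CAPO deletes the port when it is detached
	// from the instance. Defaulting to true keeps the "never leak ports"
	// behaviour; false returns a pre-created port to the user's pool.
	DeleteOnDetach *bool `json:"deleteOnDetach,omitempty"`
}
```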

Let me know if you'd want a more thorough design document detailing these ideas.

@sbueringer
Member

@macaptain Sounds good to me, although I don't have these use cases, so how exactly we implement this isn't that relevant to me :)

I think it's important to reach consensus with the OpenShift folks. Let's see what @iamemilio thinks about it.

@macaptain
Contributor

Hey @iamemilio, what do you think about the port re-use suggestions above? (#834 (comment))

@iamemilio
Contributor Author

Sorry for the long delay in getting back to you. I like the idea and would be on board with that.

@iamemilio
Contributor Author

I think it would be valuable to draft a design for this, since it would be a fairly large set of features that could impact the user experience. I would be happy to help out if you want.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 28, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 28, 2021
@macaptain
Contributor

I am still thinking over this ticket and what we can do about port-reuse.

/assign
/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 29, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2022
@macaptain
Contributor

I think this is still a valid issue, although a lot of work on ports has been done. I'd support a simpler solution than I proposed above. We could better handle the error cases outlined in the issue when we get an existing port, but I haven't made progress on this and won't in the near future.

/unassign

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
