
Fix context issue during cleanup of kind clusters #6771

Open
jainpulkit22 wants to merge 1 commit into main from cleanup-function-fix

Conversation

jainpulkit22 (Contributor)

Fix context issue during cleanup of kind clusters.

Fixes #6768.

rajnkamr added this to the Antrea v2.3 release milestone · Oct 25, 2024
rajnkamr added the area/test/infra label (Issues or PRs related to test infrastructure: Jenkins configuration, Ansible playbook, Kind wrappers) · Nov 14, 2024
jainpulkit22 requested review from XinShuYang and antoninbas, and removed the review request for antoninbas · December 13, 2024 05:56
jainpulkit22 requested review from antoninbas, and removed the review request for antoninbas · January 24, 2025 06:07
done
done
)200>>"$LOCK_FILE"
rm -rf $LOCK_FILE
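
For context, the snippet above is the diff hunk this review thread is attached to: apparently the tail of a subshell guarded by flock on file descriptor 200, followed by removal of the lock file. A minimal sketch of that shape, with hypothetical paths and loop body (not the PR's actual diff), might look like:

```bash
# Minimal sketch of the assumed pattern; paths and the loop body are illustrative only.
LOCK_FILE="$HOME/.antrea/kind.lock"          # hypothetical lock file location
mkdir -p "$(dirname "$LOCK_FILE")"
(
  flock 200                                  # block until an exclusive lock is held on fd 200
  for cluster in $(kind get clusters); do
    kind delete cluster --name "$cluster"    # clean up each Kind cluster under the lock
  done
) 200>>"$LOCK_FILE"
rm -rf "$LOCK_FILE"                          # deleting the lock file afterwards is what the comment below questions
```
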
Contributor

this is not a good idea IMO. It feels like there can be a race condition where we delete the file while another job is holding the lock?
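
(An illustrative timeline of the race being described; this is not taken from the PR itself:)

```bash
# Job A: ( flock 200; ... critical section ... ) 200>>"$LOCK_FILE"   # A holds the lock
# Job B: rm -rf "$LOCK_FILE"            # unlinks the inode that A's descriptor still points to
# Job C: ( flock 200; ... ) 200>>"$LOCK_FILE"
#        # the >> redirection recreates the path as a brand-new file, so C's lock is
#        # unrelated to A's, and both jobs can run the "exclusive" section at once
```
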

jainpulkit22 (Contributor, Author) · Jan 30, 2025

But until we delete config.lock, the other process will not be able to acquire the lock and may panic.
Also, since flock is not compatible with the Go-based locking mechanism, what's your opinion: should we use the alternative approach of writing the cluster names to a file and acquiring a lock over that, or should we move ahead with acquiring a lock over the kubeconfig and use the approach below?

Before invoking the kind create cluster command, we can use flock to wait for the other process to release the lock and then trigger the command. In that case flock will not interfere with the Go-based locking mechanism; it's just that we would be introducing a couple of unnecessary flocks in the code.

@antoninbas what's your opinion on this?
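
(A hypothetical sketch of the "flock around kind" idea described above; the variable names and paths are assumptions, not the PR's actual code:)

```bash
LOCK_FILE="$HOME/.antrea/kind.lock"     # assumed lock file owned by the script, not ~/.kube/config.lock
mkdir -p "$(dirname "$LOCK_FILE")"

# flock blocks until no other job holds the lock, then runs the command, so kind's
# own Go-based kubeconfig locking is never entered concurrently from this script.
flock "$LOCK_FILE" kind create cluster --name "$CLUSTER_NAME" --config "$KIND_CONFIG"
```
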

Contributor

We should not use ~/.kube/config.lock, I think that's a given. We can pretend it doesn't exist.
So we need our own lock file. After that, we have 2 options, and you can choose which one you want to use:

  1. have our own state file to keep track of cluster names and creation timestamps (which I was originally proposing)
  2. only rely on kubectl / kind, and do not introduce our own state file. With this option we use flock to acquire a lock before calling kubectl / kind, as appropriate

I think both approaches will work.
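
(A rough sketch of what option 1 could look like, with hypothetical file names and helpers; the PR may implement it differently:)

```bash
# Hypothetical state file of "<cluster-name> <creation-epoch>" lines, updated under flock.
STATE_FILE="$HOME/.antrea/kind-clusters"
LOCK_FILE="$STATE_FILE.lock"
mkdir -p "$(dirname "$STATE_FILE")"
touch "$STATE_FILE"

record_cluster() {   # record_cluster <name>: remember a cluster and when it was created
  ( flock 200; echo "$1 $(date +%s)" >> "$STATE_FILE" ) 200>"$LOCK_FILE"
}

stale_clusters() {   # stale_clusters <max-age-seconds>: print clusters older than the cutoff
  local cutoff=$(( $(date +%s) - $1 ))
  ( flock -s 200; awk -v cutoff="$cutoff" '$2 < cutoff { print $1 }' "$STATE_FILE" ) 200>"$LOCK_FILE"
}
```
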

jainpulkit22 (Contributor, Author) · Feb 11, 2025

So for the 2nd approach we again rely on ~/.kube/config.lock, right? I think we can go ahead with this approach.

Contributor

No, we don't rely on ~/.kube/config.lock for either approach, as we have already discussed why this is not a viable option.

jainpulkit22 (Contributor, Author)

Ideally we have it on all our testbeds, so this should not be an issue; @XinShuYang and @KMAnju-2021 can confirm.

Contributor

Yes, iirc we have this /var/lib/jenkins directory on all kind testbeds.

Contributor

Actually it's good that you highlight this. /var/lib/jenkins/ is not an appropriate directory to use here, as ci/kind/kind-setup.sh is used for local development, GitHub CI, etc. Please use a more "universal" directory, such as ~/.kube/antrea/ (you can create the directory if it doesn't exist). I guess ~/.antrea/ would also be a good choice if we don't want to write anything to the ~/.kube directory.

jainpulkit22 (Contributor, Author)

I guess ~/.antrea would be a better option here.
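
(For illustration, the directory handling could be as simple as the following; the exact lock file name is an assumption:)

```bash
ANTREA_DIR="$HOME/.antrea"
mkdir -p "$ANTREA_DIR"                     # create the directory if it does not exist
LOCK_FILE="$ANTREA_DIR/kind-setup.lock"    # hypothetical name for the script's own lock file
```
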

jainpulkit22 (Contributor, Author)

@antoninbas can you have a look at the code changes, thanks.

jainpulkit22 force-pushed the cleanup-function-fix branch 2 times, most recently from ebc163b to 4fd57fe · February 12, 2025 07:35
rajnkamr mentioned this pull request · Feb 14, 2025
Signed-off-by: Pulkit Jain <pulkit.jain@broadcom.com>
Labels
area/test/infra (Issues or PRs related to test infrastructure: Jenkins configuration, Ansible playbook, Kind wrappers)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cleanup of kind cluster (#6768)
5 participants