nsfs | wait for endpoint startup before namespace monitor registration #8474

Merged: 1 commit, Oct 28, 2024

@alphaprinz alphaprinz commented Oct 19, 2024

Explain the changes

Wait for endpoint startup before registering namespace resource monitor.

Issues: Fixed #xxx / Gap #xxx

An nsr (namespace resource) can enter "Rejected" status if an endpoint is deleted by Kubernetes before it reaches "Ready" state.
https://bugzilla.redhat.com/show_bug.cgi?id=2284585.
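
As a rough illustration of the change described above, the sketch below defers a registration callback by a fixed delay. The helper name `register_after_startup`, its injectable `set_timeout` parameter, and the `STARTUP_DELAY_MS` constant are assumptions made for this example and are not taken from the noobaa codebase.

```javascript
// Hypothetical helper: not part of the noobaa codebase.
// Defers a registration callback so the endpoint has time to either
// become Ready or be deleted by Kubernetes before it starts monitoring.
function register_after_startup(register_monitor, delay_ms, set_timeout = setTimeout) {
    // An endpoint that is deleted during the delay never runs the callback,
    // so it never reports on a namespace resource it has not mounted.
    return set_timeout(() => register_monitor(), delay_ms);
}

// The 60 second value mirrors the delay discussed in this PR.
const STARTUP_DELAY_MS = 60 * 1000;
const timer = register_after_startup(
    () => console.log('namespace monitor registered'),
    STARTUP_DELAY_MS
);
// unref() keeps this demo from holding the process open for a full minute.
timer.unref();
```

Passing the timer function as a parameter is only a convenience for testing the sketch without waiting for the real delay.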

Testing Instructions:

Reproduction
Sagie details the scenario for ODS in the bz.

I've reproduced it with this scenario on minikube:
Start with an nsfs nsr on a pvc. A single endpoint A is in "Ready" state.
- Delete the nsr. A new endpoint B is spun up.
- While endpoint B is done creating but NOT ready yet (endpoint A is still in "Ready" state), create the nsr.
- Kubernetes will delete endpoint B and leave endpoint A running.
- Endpoint B loads the nsr from the system store, but the nsr is not mounted on it. Endpoint B issues a NO_SUCH_BUCKET report on the nsr (and is then deleted by Kubernetes).

@alphaprinz alphaprinz force-pushed the 2284585_nsfs_nsr_rejected branch 3 times, most recently from 02e747f to b4ee9d0 Compare October 20, 2024 01:28
        client: internal_rpc_client,
        should_monitor: nsr => Boolean(nsr.nsfs_config),
    }));
    setTimeout(() => {
Contributor

Why not add a retry in the namespace monitor if the error was ENOENT and the endpoint started less than 60 seconds ago? Also, why 60 seconds?

Contributor Author

Sounds like more code for an equivalent solution.

60 seconds allows the pod to stabilize (reach 'Ready' or be deleted).
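
For comparison, the retry alternative raised in this thread might look roughly like the sketch below. It only shows the decision of whether a failed check should be retried rather than reported. The names `should_retry`, `GRACE_PERIOD_MS`, and `endpoint_start_ms` are hypothetical, and the sketch assumes the monitor sees a Node.js-style error object with a `code` property.

```javascript
// Hypothetical sketch of the retry alternative; all names are illustrative.
const GRACE_PERIOD_MS = 60 * 1000;

// Returns true when a failed check should be retried instead of reported:
// the error is ENOENT and the endpoint started less than 60 seconds ago.
function should_retry(err, endpoint_start_ms, now_ms = Date.now()) {
    const is_missing_path = Boolean(err) && err.code === 'ENOENT';
    const is_young_endpoint = now_ms - endpoint_start_ms < GRACE_PERIOD_MS;
    return is_missing_path && is_young_endpoint;
}
```

This approach needs the endpoint's start time and a retry loop in the monitor, which is the extra code the author refers to.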

Contributor

The reason is understood, but how do you know it takes a minute? Is it always a minute? Should it be configurable?

Contributor Author

AFAIK, there's no way for node to know the status of the pod.
I assume we don't want to introduce such a dependency.

60 seconds is more than enough on my minikube and is low enough not to bother other deployments in other envs, but I can make it an env variable.
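
A minimal sketch of what making the delay an env variable could look like follows. The variable name `NSFS_MONITOR_START_DELAY_MS` and the helper `get_monitor_delay_ms` are assumptions for illustration only; the thread does not name either.

```javascript
// Hypothetical variable and helper names; the PR thread does not name them.
const DEFAULT_DELAY_MS = 60 * 1000;

// Reads the startup delay in milliseconds from the environment.
function get_monitor_delay_ms(env = process.env) {
    const parsed = Number.parseInt(env.NSFS_MONITOR_START_DELAY_MS, 10);
    // Fall back to the default on a missing, non-numeric, or negative value.
    return Number.isInteger(parsed) && parsed >= 0 ? parsed : DEFAULT_DELAY_MS;
}
```

Keeping 60 seconds as the fallback preserves the behavior discussed above for deployments that do not set the variable.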

Signed-off-by: Amit Prinz Setter <alphaprinz@gmail.com>
@alphaprinz alphaprinz force-pushed the 2284585_nsfs_nsr_rejected branch from b4ee9d0 to 7ec7799 Compare October 28, 2024 03:02
@alphaprinz alphaprinz merged commit 2789d60 into noobaa:master Oct 28, 2024
10 checks passed
3 participants