Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsfs - monitor only nsrs that are mounted. DFBUGS-153 #8561

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

alphaprinz
Copy link
Contributor

@alphaprinz alphaprinz commented Nov 28, 2024

Explain the changes

  1. Namespace monitor should not monitor nsfs nsrs that are not mounted on the endpoint.
    Use the new NSFS_NSR_ env variable to test whether nsr should be mounted.
    (see nsfs - add mounted nsr name to env. DFBUGS-153 noobaa-operator#1481)

This commit reverts 2789d60.

Issues: Fixed #xxx / Gap #xxx

  1. https://issues.redhat.com/browse/DFBUGS-153

Testing Instructions:

Reduce config.NAMESPACE_MONITOR_DELAY to 1000ms.
Create nsfs nsr
nsr should not be rejected (by endpoints existing before its creation).

  • Doc added/updated
  • Tests added

@alphaprinz alphaprinz force-pushed the 153_nsfs_nsr_rejected_take2 branch 2 times, most recently from 3234a46 to a7c5ed9 Compare December 2, 2024 19:14
background_scheduler.register_bg_worker(new NamespaceMonitor({
name: 'namespace_fs_monitor',
client: internal_rpc_client,
should_monitor: nsr => Boolean(nsr.nsfs_config && process.env['NSFS_NSR_' + nsr.name]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions -

  1. If the endpoint got up before the namespace resource was mounted, when is the next time we will get to this flow for start monitoring? why not add a retry after 60 seconds -
    nsfs | wait for endpoint startup before namespace monitor registration #8474 (comment)
  2. Why avoid start monitoring instead of externalizing that the value of process.env['NSFS_NSR_' + nsr.name] is undefined which means that the PV was not mounted yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. An endpoint that was started before the mount will be deleted after a new endpoint is created with the new mount.
    The retry will not help as the endpoint that opened the report is removed when the new nsfs nsr mount is added (after nsfs nsr was created in kubernetes cluster).
  2. There could be other nsfs nsrs that should be monitored (correct me if I'm wrong).

Maybe I'll make the scenario more concrete-

  1. Operator install a system in a cluster.
  2. There is endpoint A. It does NOT have any nsfs nsr mounts.
  3. At some point, an nsfs nsr is created in the cluster.
  4. In reconcile:
  5. a. operator adds a mount for the nsfs nsr to endpoints' container.
  6. b. operator creates an nsr object in system store.
  7. A new endpoint B with the new mount is created by kubernetes.
  8. While B is being created, A updates its system store, reads the nsfs nsr. The new nsfs nsr is NOT mounted in A. A reports NOENT on the nsfs nsr. Note since default interval for nsfs nsr monitoring is less than creating a new endpoint, this doesn't necessarily happen. Reducing config.NAMESPACE_MONITOR_DELAY will ensure bug reproduction.
  9. Endpoint B is ready. Endpoint A is deleted. Nsr status is stuck in rejected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned on Slack, I think that the correct path is not to avoid monitoring a namespace resource that is still not mounted but add this check to the monitoring process.

Copy link
Contributor Author

@alphaprinz alphaprinz Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nsr will be monitored by the new endpoint.
The old endpoint is about to be deleted.
The only difference this commit makes is that old endpoints won't mistakenly report nsr as rejected.

If you think that the about-to-be-deleted endpoint should do something about the mount it will never have (or anything else, for that matter) please specify it explicitly. The current "add this to monitoring process" is too vague. Also specify explicitly if this is an enhancement or part of the bug fix.

Copy link
Contributor

@romayalon romayalon Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"About-to-be-deleted" is the happy path :)
There is also the sad path where there is an issue with the mounting and it takes a while/never happens - that's exactly why I think it's important and avoiding monitoring it if it's not mounted is a partial solution from my prespective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not trying to solve monitoring, but rather to fix a bug in monitoring.
I'm not removing any feature that we currently have.

Again, I would like a more specific way to proceed.
If you think a different fix or an enhancement to the monitoring is needed, please specify it explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alphaprinz
I already explained it in the above comment, but I'll be happy to summarize my comments -

My specific idea for solving it -
Instead of not monitoring unmounted namespace resources, I think you should move the new condition you added inside the monitoring check, and externalize that this is the current issue that the namespace resource has.
Comment 1, bullet 2
Comment 3

Why I think that my suggestion is a better behavior / user experience -
It will behave better in cases where the re-start of the endpoint takes time/won't happen at all.
Comment 5

How to proceed -
The above summary of comments is my opinion/suggestion/how I would fix it.
IMO, You shall proceed from here as how you see it, fix it, open an issue and call it enhancement, document this gap or anything else you feel appropriate.

Signed-off-by: Amit Prinz Setter <alphaprinz@gmail.com>
@alphaprinz alphaprinz force-pushed the 153_nsfs_nsr_rejected_take2 branch from a7c5ed9 to 50c80bf Compare December 4, 2024 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants