Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safeguard ctrl-mgr cancel func in ExtensionController #4937

Conversation

jnummelin
Copy link
Member

Currently we blindly call the cancel func and there are some cases when it actually can be nil.

Description

Fixes #4930

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@jnummelin jnummelin marked this pull request as ready for review September 4, 2024 13:40
@jnummelin jnummelin requested a review from a team as a code owner September 4, 2024 13:40
@jnummelin jnummelin force-pushed the fix/nil-cancel-func-in-extension-controller branch from 8a27de9 to f9d138c Compare September 4, 2024 13:52
@@ -515,7 +522,13 @@ func (ec *ExtensionsController) startControllerManager(ctx context.Context) {
// Stop
func (ec *ExtensionsController) Stop() error {
ec.L.Info("Stopping extensions controller")
ec.mgrCancelFn()
// We have no guarantees on concurrency here, so use mutex
ec.mux.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this lock has any actual purpose. We do this because if we call ec.mgrCancelFn it can be nil, but if I understand correctly once it's set it can never become nil.

There are the following scenarios possible scenarios:
1- ec.mgrCancelFn is nil, and remains nil when we call it, the lock doesn't serve any actual purpose
2- ec.mgrCancelFn is not nil, remains unchanged when when we call the function.
3- ec.mgrCancelFn is nil, and it gets some value later. Doesn't matter because the if checks false.
4- ec.mgrCancelFn is not nil, changes to a different cancel function for whatever reason and we call the old cancel function.

Correct me if I'm wrong, but if we're doing mgrCancelFn := ec.mgrCancelFn the lock serves no purpose at all. We should either do:

	mgrCancelFn := ec.mgrCancelFn
	if mgrCancelFn != nil {
		mgrCancelFn()
	}

or

	ec.mux.Lock()
	if ec.mgrCancelFn != nil {
		ec.mgrCancelFn()
	}
	ec.mux.Unlock()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is not nil, changes to a different cancel function for whatever reason and we call the old cancel function.

That's the culprit IMO. In that case, the new context might not get cancelled. I agree we can get away without a mutex, if we prefer, but then we'd need atomics to swap values and cancel the swapped-out values (if not nil).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but if we're doing mgrCancelFn := ec.mgrCancelFn the lock serves no purpose at all.

It does: It prevents data races when reading ec.mgrCancelFn, i.e. it ensures that we see the current cancel function, not some other value that was just overwritten concurrently. It's not necessary to hold the lock for longer than the variable read, as the actual cancel function implementations are concurrency-safe and we don't need to block until they returned. However, holding the lock until they returned doesn't harm either here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does: It prevents data races when reading ec.mgrCancelFn, i.e. it ensures that we see the current cancel function, not some other value that was just overwritten concurrently.

I'm not very convinced... If this goroutine that setsmgrCancelFn := ec.mgrCancelFn acquires the lock BEFORE the goroutine that sets a new ec.mgrCancelFn you have effectively the same problem and I don't see how can a lock help here...

The only thing that would prevent this would be:

	ec.mux.Lock()
	if ec.mgrCancelFn != nil {
		ec.mgrCancelFn()
	}
	ec.mux.Unlock()

However at the end of the day it doesn't do much either because you don't have any guarantees that when a function checks if the context is done the new context may have been overwritten.

Anyway it doesn't do any harm so I guess we can ignore this for now and worry about these details in #4733 and subsequent work...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very convinced... If this goroutine that setsmgrCancelFn := ec.mgrCancelFn acquires the lock BEFORE the goroutine that sets a new ec.mgrCancelFn you have effectively the same problem and I don't see how can a lock help here...

That's right. The lock here only prevents the data race. The places where the cancel func gets written also need to ensure that the previous cancel func gets called.

I don't see how this could be fixed by holding the lock longer here, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The places where the cancel func gets written also need to ensure that the previous cancel func gets called.

Added this

Currently we blindly call the cancel func and there are some cases when it actually can be nil.

Signed-off-by: Jussi Nummelin <jnummelin@mirantis.com>
@jnummelin jnummelin force-pushed the fix/nil-cancel-func-in-extension-controller branch from f9d138c to 6e94f76 Compare September 5, 2024 08:19
@jnummelin jnummelin added backport/release-1.28 PR that needs to be backported/cherrypicked to release-1.28 branch backport/release-1.29 PR that needs to be backported/cherrypicked to the release-1.29 branch backport/release-1.30 PR that needs to be backported/cherrypicked to the release-1.30 branch labels Sep 5, 2024
@jnummelin jnummelin merged commit 9c1c02b into k0sproject:main Sep 5, 2024
89 checks passed
@k0s-bot
Copy link

k0s-bot commented Sep 5, 2024

Backport failed for release-1.28, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-1.28
git worktree add -d .worktree/backport-4937-to-release-1.28 origin/release-1.28
cd .worktree/backport-4937-to-release-1.28
git switch --create backport-4937-to-release-1.28
git cherry-pick -x 6e94f763825c93ca7a5fa0899f79251398ee2d1b

@k0s-bot
Copy link

k0s-bot commented Sep 5, 2024

Backport failed for release-1.29, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release-1.29
git worktree add -d .worktree/backport-4937-to-release-1.29 origin/release-1.29
cd .worktree/backport-4937-to-release-1.29
git switch --create backport-4937-to-release-1.29
git cherry-pick -x 6e94f763825c93ca7a5fa0899f79251398ee2d1b

@k0s-bot
Copy link

k0s-bot commented Sep 5, 2024

Successfully created backport PR for release-1.30:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/release-1.28 PR that needs to be backported/cherrypicked to release-1.28 branch backport/release-1.29 PR that needs to be backported/cherrypicked to the release-1.29 branch backport/release-1.30 PR that needs to be backported/cherrypicked to the release-1.30 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NLLB test is flaky
4 participants