Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of CNI: use check command when restoring from restart into release/1.9.x #24796

Merged

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #24658 to be assessed for backporting due to the inclusion of the label backport/1.9.x.

The below text is copied from the body of the original PR.


When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. So we check the plugin fingerprint before performing the check.

If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail).

This should fix the gap left #24650 for Docker task drivers and any other drivers with the MustInitiateNetwork capability.

Fixes: #24292
Ref: #24650
Ref: https://hashicorp.atlassian.net/browse/NET-11869

Testing & Reproduction steps

Run a cluster on a set of VMs, with at least one client. This can't be a server+client because we need to reboot the hosts. You should probably set the server.heartbeat_grace = "5m" to give yourself time to work.

  • Run a Docker task with network.mode = "bridge". Wait for it to be healthy.
  • Restart the agent.
  • Make sure the alloc is untouched and networking works.
  • Reboot the client host.
  • Make sure the alloc is restored, the tasks are restarted, and networking works.

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

Overview of commits

@vercel vercel bot temporarily deployed to Preview – nomad-ui January 7, 2025 14:41 Inactive
@vercel vercel bot temporarily deployed to Preview – nomad January 7, 2025 14:45 Inactive
@tgross tgross merged commit f27c521 into release/1.9.x Jan 7, 2025
23 of 25 checks passed
@tgross tgross deleted the backport/b-24292-netns-restore/miserably-rapid-ostrich branch January 7, 2025 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants