fix cpuset cgroup manager initialization #14229

Closed
shoenig opened this issue Aug 23, 2022 · 1 comment · Fixed by #14230

shoenig commented Aug 23, 2022

While tracking down another bug, I got really confused by log messages that you'll only see on the first run of a Nomad agent on a new system, caused by some spaghetti code in how we bootstrap the cpuset manager on the client.

(the error content is from a branch)

    2022-08-22T21:30:21.284-0500 [WARN]  client: failed to lookup cpuset from cgroup parent, and not set as reservable_cores: parent=nomad.slice error="openat2 /sys/fs/cgroup/nomad.slice/cpuset.cpus.effective: no such file or directory"
    2022-08-22T21:30:21.284-0500 [DEBUG] client.cpuset.v2: initializing with: cores=[]
    2022-08-22T21:30:21.300-0500 [DEBUG] client.cpuset.v2: establish cgroup hierarchy: parent=nomad.slice

The block below attempts to retrieve the effective cpuset (when reservable cores are not set) before ensuring that the parent cgroup for Nomad exists. As a result, on first run we get the warning message above and an empty set of usable cores.

// Ensure cgroups are created on linux platform
if runtime.GOOS == "linux" && c.cpusetManager != nil {
	// use the client configuration for reservable_cores if set
	cores := conf.ReservableCores
	if len(cores) == 0 {
		// otherwise lookup the effective cores from the parent cgroup
		cores, err = cgutil.GetCPUsFromCgroup(conf.CgroupParent)
		if err != nil {
			c.logger.Warn("failed to lookup cpuset from cgroup parent, and not set as reservable_cores", "parent", conf.CgroupParent)
			// will continue with a disabled cpuset manager
		}
	}
	if cpuErr := c.cpusetManager.Init(cores); cpuErr != nil {
		// If the client cannot initialize the cgroup then reserved cores will not be reported and the cpuset manager
		// will be disabled. this is common when running in dev mode under a non-root user for example.
		c.logger.Warn("failed to initialize cpuset cgroup subsystem, cpuset management disabled", "error", cpuErr)
		c.cpusetManager = new(cgutil.NoopCpusetManager)
	}
}
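
The ordering problem can be sketched in isolation. The following is a minimal illustration, not the code from the fix: `resolveCores`, `ensureParent`, `readEffective`, and `errMissingParent` are hypothetical names, with the two function arguments standing in for "create the parent cgroup" and "read `cpuset.cpus.effective`". The only point is that the parent is ensured before anything is read from it.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingParent is a hypothetical error mimicking the "no such file or
// directory" failure seen when the parent cgroup has not been created yet.
var errMissingParent = errors.New("parent cgroup does not exist")

// resolveCores returns the cores a cpuset manager would be initialized with.
// ensureParent and readEffective are hypothetical hooks, not Nomad APIs.
func resolveCores(
	reservable []uint16,
	ensureParent func() error,
	readEffective func() ([]uint16, error),
) ([]uint16, error) {
	// Create the parent cgroup first; its interface files do not exist before this.
	if err := ensureParent(); err != nil {
		return nil, fmt.Errorf("ensure parent cgroup: %w", err)
	}
	// Explicitly configured cores take precedence over the effective cpuset.
	if len(reservable) > 0 {
		return reservable, nil
	}
	cores, err := readEffective()
	if err != nil {
		return nil, fmt.Errorf("read effective cpuset: %w", err)
	}
	return cores, nil
}

func main() {
	parentExists := false
	ensure := func() error { parentExists = true; return nil }
	read := func() ([]uint16, error) {
		if !parentExists {
			return nil, errMissingParent
		}
		return []uint16{0, 1, 2, 3}, nil
	}

	cores, err := resolveCores(nil, ensure, read)
	fmt.Println(cores, err)
}
```

With the two calls swapped, the simulated read would fail on a fresh system in the same way as the warning in the log above.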

The empty set of usable cores means that tasks configured to run will not have a value set in their cpuset.cpus cgroup interface file.

cgroup/nomad.slice/594f8fba-c4e1-0e6a-38c3-b056db697bfa.py1.scope 
➜ cat cpuset.cpus


cgroup/nomad.slice/594f8fba-c4e1-0e6a-38c3-b056db697bfa.py1.scope
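
For checking the contents of such a file programmatically, here is a rough sketch of a parser for the cpuset list format (comma-separated CPU numbers and ranges such as `0-3,8`). `parseCPUList` is a hypothetical helper written for illustration and is not Nomad's `cgutil` implementation; an empty file like the one above parses to zero cores.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList parses a cpuset list such as "0-3,8" into individual core IDs.
// An empty or whitespace-only input yields an empty slice and no error.
func parseCPUList(s string) ([]uint16, error) {
	s = strings.TrimSpace(s)
	cores := []uint16{}
	if s == "" {
		return cores, nil
	}
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.ParseUint(lo, 10, 16)
		if err != nil {
			return nil, fmt.Errorf("invalid cpu %q: %w", lo, err)
		}
		last := first
		if isRange {
			if last, err = strconv.ParseUint(hi, 10, 16); err != nil {
				return nil, fmt.Errorf("invalid cpu %q: %w", hi, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("invalid range %q", part)
		}
		// Expand the inclusive range first..last into single core IDs.
		for c := first; c <= last; c++ {
			cores = append(cores, uint16(c))
		}
	}
	return cores, nil
}

func main() {
	healthy, _ := parseCPUList("0-3,8\n")
	empty, _ := parseCPUList("\n")
	fmt.Println(len(healthy), len(empty))
}
```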

After any restart of the Nomad agent the bug is gone - the parent cgroup will already exist and the initialization will work as intended.

@shoenig shoenig self-assigned this Aug 23, 2022
shoenig added a commit that referenced this issue Aug 25, 2022
This PR refactors the code path in Client startup for setting up the cpuset
cgroup manager (non-linux systems not affected).

Before, there was a logic bug where we would try to read the cpuset.cpus.effective
cgroup interface file before ensuring nomad's parent cgroup existed. Therefore that
file would not exist, and the list of usable cpus would be empty. Tasks started
thereafter would not have a value set for their cpuset.cpus.

The refactoring fixes some less than ideal coding style. Instead we now bootstrap
each cpuset manager type (v1/v2) within its own constructor. If something goes
awry during bootstrap (e.g. cgroups not enabled), the constructor returns the
noop implementation and logs a warning.

Fixes #14229
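
The constructor pattern described in the commit message might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from #14230: the `CpusetManager` interface is pared down to a single hypothetical `Enabled` method, and `newV2Manager`, `bootstrap`, `warn`, and `errCgroupsUnavailable` are made-up names.

```go
package main

import (
	"errors"
	"fmt"
)

// CpusetManager is a pared-down, hypothetical stand-in for the client's
// cpuset manager interface.
type CpusetManager interface {
	Enabled() bool
}

// noopManager is returned when cpuset management cannot be set up.
type noopManager struct{}

func (noopManager) Enabled() bool { return false }

// v2Manager represents a successfully bootstrapped cgroups v2 manager.
type v2Manager struct{ cores []uint16 }

func (*v2Manager) Enabled() bool { return true }

// errCgroupsUnavailable is a hypothetical bootstrap failure.
var errCgroupsUnavailable = errors.New("cgroups not enabled")

// newV2Manager performs the bootstrap inside the constructor. If bootstrap
// fails, it logs a warning and returns the noop implementation, so callers
// never need their own fallback logic.
func newV2Manager(bootstrap func() ([]uint16, error), warn func(msg string)) CpusetManager {
	cores, err := bootstrap()
	if err != nil {
		warn("cpuset management disabled: " + err.Error())
		return noopManager{}
	}
	return &v2Manager{cores: cores}
}

func main() {
	warn := func(msg string) { fmt.Println("[WARN]", msg) }

	ok := newV2Manager(func() ([]uint16, error) { return []uint16{0, 1}, nil }, warn)
	bad := newV2Manager(func() ([]uint16, error) { return nil, errCgroupsUnavailable }, warn)

	fmt.Println(ok.Enabled(), bad.Enabled())
}
```

Keeping the fallback inside the constructor means the client startup path only ever receives a usable manager value, which is the simplification the commit message describes.
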
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 24, 2022