Scheduler enhancements #7703

magik6k · 2021-11-29T15:22:33Z

This is a rebased version of #7269 with some cleanup

TLDR:

Make resource accounting cgroup-aware on Linux
Allow overriding the default resource table on a per-worker basis through env-vars
- All envvars with defaults -> https://gist.github.com/magik6k/c0e1c7cd73c1241a9acabc30bf469a43 (generated with lotus-worker resources --default)
Add proper support for multiple GPUs per worker
Separate cpu thread counts in the resource table depending on whether or not workers have GPUs
Added some tests to the original PR

codecov · 2021-11-29T15:25:58Z

Codecov Report

Merging #7703 (330cfc3) into master (19e808f) will increase coverage by 0.72%.
The diff coverage is 54.98%.

@@            Coverage Diff             @@
##           master    #7703      +/-   ##
==========================================
+ Coverage   38.83%   39.56%   +0.72%     
==========================================
  Files         638      640       +2     
  Lines       68122    68325     +203     
==========================================
+ Hits        26456    27031     +575     
+ Misses      37126    36654     -472     
- Partials     4540     4640     +100

Impacted Files	Coverage Δ
api/version.go	`80.00% <ø> (ø)`
cmd/lotus-seal-worker/info.go	`12.69% <0.00%> (-0.64%)`	⬇️
cmd/lotus-seal-worker/main.go	`0.00% <0.00%> (ø)`
cmd/lotus-seal-worker/resources.go	`0.00% <0.00%> (ø)`
extern/sector-storage/manager.go	`64.73% <ø> (-0.21%)`	⬇️
extern/sector-storage/cgroups_linux.go	`28.78% <28.78%> (ø)`
extern/sector-storage/worker_local.go	`59.66% <45.23%> (-2.42%)`	⬇️
extern/sector-storage/storiface/resources.go	`75.38% <73.58%> (ø)`
cmd/lotus-miner/sealing.go	`42.10% <82.35%> (+0.58%)`	⬆️
extern/sector-storage/sched_resources.go	`86.25% <85.71%> (+0.95%)`	⬆️
... and 54 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 19e808f...330cfc3. Read the comment docs.

arajasek · 2021-11-29T19:14:54Z

api/version.go

@@ -58,7 +58,7 @@ var (
 	FullAPIVersion1 = newVer(2, 1, 0)

 	MinerAPIVersion0  = newVer(1, 2, 0)
-	WorkerAPIVersion0 = newVer(1, 1, 0)
+	WorkerAPIVersion0 = newVer(1, 5, 0)


why jump so big

Testing with existing miners. Should be ok to keep like this.

cmd/lotus-seal-worker/resources.go

Worker processes may have memory limitations imposed by Systemd. But /proc/meminfo shows the entire system memory regardless of these limits. This results in the scheduler believing the worker has the entire system memory avaliable and the worker being allocated too many tasks. This change attempts to read cgroup memory limits for the worker process. It supports cgroups v1 and v2, and compares cgroup limits against the system memory and returns the most conservative values to prevent the worker from being allocated too many tasks and potentially triggering an OOM event.

Attempting to report "memory used by other processes" in the MemReserved field fails to take into account the fact that the system's memory used includes memory used by ongoing tasks. To properly account for this, worker should report the memory and swap used, then the scheduler that is aware of the memory requirements for a task can determine if there is sufficient memory available for a task.

Before this change workers can only be allocated one GPU task, regardless of how much of the GPU resources that task uses, or how many GPUs are in the system. This makes GPUUtilization a float which can represent that a task needs a portion, or multiple GPUs. GPUs are accounted for like RAM and CPUs so that workers with more GPUs can be allocated more tasks. A known issue is that PC2 cannot use multiple GPUs. And even if the worker has multiple GPUs and is allocated multiple PC2 tasks, those tasks will only run on the first GPU. This could result in unexpected behavior when a worker with multiple GPUs is assigned multiple PC2 tasks. But this should not suprise any existing users who upgrade, as any existing users who run workers with multiple GPUs should already know this and be running a worker per GPU for PC2. But now those users have the freedom to customize the GPU utilization of PC2 to be less than one and effectively run multiple PC2 processes in a single worker. C2 is capable of utilizing multiple GPUs, and now workers can be customized for C2 accordingly.

In an environment with heterogenious worker nodes, a universal resource table for all workers does not allow effective scheduling of tasks. Some workers may have different proof cache settings, changing the required memory for different tasks. Some workers may have a different count of CPUs per core-complex, changing the max parallelism of PC1. This change allows workers to customize these parameters with environment variables. A worker could set the environment variable PC1_MIN_MEMORY for example to customize the minimum memory requirement for PC1 tasks. If no environment variables are specified, the resource table on the miner is used, except for PC1 parallelism. If PC1_MAX_PARALLELISM is not specified, and FIL_PROOFS_USE_MULTICORE_SDR is set, PC1_MAX_PARALLELSIM will automatically be set to FIL_PROOFS_MULTICORE_SDR_PRODUCERS + 1.

Co-authored-by: Aayush Rajasekaran <arajasek94@gmail.com>

magik6k · 2021-11-30T01:07:24Z

(rebased on latest master)

ZenGround0

This is all new to me and some of the scheduling logic (canHandleRequest for example) is still a little hard for me to follow. But overall I understand what you are doing and this looks good.

ZenGround0 · 2021-11-29T20:01:09Z

extern/sector-storage/testworker_test.go


 	return storiface.WorkerInfo{
 		Hostname: "testworkerer",
 		Resources: storiface.WorkerResources{
 			MemPhysical: res.MinMemory * 3,
+			MemUsed:     res.MinMemory,


Should MemSwapUsed be added too?

ZenGround0 · 2021-11-30T02:13:46Z

extern/sector-storage/cgroups_linux.go

+)
+
+func cgroupV2MountPoint() (string, error) {
+	f, err := os.Open("/proc/self/mountinfo")


Do we expect lotus will have permission to open this path? I guess users of cgroups will set up these directories to have the right permissions? People using this feature probably know what they're doing but maybe worth calling out in documentation?

Normal users have access to this by default

ZenGround0 · 2021-11-30T02:25:43Z

extern/sector-storage/cgroups_linux.go

+	}
+	defer f.Close() //nolint
+
+	scanner := bufio.NewScanner(f)


Kinda overkill but consider parsing with something nice

ZenGround0 · 2021-11-30T16:24:52Z

extern/sector-storage/cgroups_linux.go

+		return 0, 0, 0, 0, err
+	}
+
+	for path != "/" {


I know almost nothing about Cgroups but this loop structure confuses me. I am wondering why doesn't the output of cgroupv2.PidGroupPath return the correct path for getting memory limit information

ZenGround0 · 2021-11-30T17:51:57Z

extern/sector-storage/manager_test.go

+	defer cleanup()
+
+	localTasks := []sealtasks.TaskType{
+		sealtasks.TTAddPiece, sealtasks.TTPreCommit1, sealtasks.TTCommit1, sealtasks.TTFinalize, sealtasks.TTFetch,


nit: for more clarify you could restrict operations to what the test uses, TTAddPiece and TTFetch iiuc

ZenGround0 · 2021-11-30T17:53:53Z

extern/sector-storage/manager_test.go

+			if w.MemUsedMax > 0 {
+				break l
+			}
+			time.Sleep(time.Millisecond)


Any chance this could hang? Maybe adding a break signal on the AddPiece goroutine completing would guard against weird hangs if for some reason memory usage is not propagating to worker correctly?

It shouldn't (if it does for whatever reason, the test will just time out in 30min); (Normally this test takes 0.1s to run)

ZenGround0 · 2021-11-30T18:03:12Z

extern/sector-storage/sched_resources.go

-	maxNeedMem := res.MemReserved + a.memUsedMax + needRes.MaxMemory + needRes.BaseMinMemory
+	vmemNeeded := needRes.MaxMemory + needRes.BaseMinMemory
+	vmemUsed := a.memUsedMax
+	if vmemUsed < res.MemUsed+res.MemSwapUsed {


nit: giving res.MemUsed + res.MemSwapUsed a name would make following along a bit easier

ZenGround0 · 2021-11-30T18:08:02Z

extern/sector-storage/storiface/resources.go

-	MinMemory uint64 // What Must be in RAM for decent perf
-	MaxMemory uint64 // Memory required (swap + ram)
+	MinMemory uint64 `envname:"MIN_MEMORY"` // What Must be in RAM for decent perf
+	MaxMemory uint64 `envname:"MAX_MEMORY"` // Memory required (swap + ram)


"What Must be in RAM for decent perf" makes sense to me, but "Memory required (swap + ram)" is confusing me a bit given that the name is MAX. It sounds like this is MIN_MEMORY + MIN_SWAP? I think clarifying this comment would be helpful.

ZenGround0 · 2021-11-30T18:21:10Z

extern/sector-storage/storiface/resources_test.go

+	require.Equal(t, 1, ResourceTable[sealtasks.TTUnseal][stabi.RegisteredSealProof_StackedDrg2KiBV1_1].MaxParallelism)
+}
+
+func TestListResourceSDRMulticoreOverride(t *testing.T) {


ZenGround0 · 2021-11-30T18:23:24Z

extern/sector-storage/storiface/resources.go

+				envval, found := lookup(taskType.Short()+"_"+shortSize+"_"+envname, fmt.Sprint(rr.Elem().Field(i).Interface()))
+				if !found {
+					// special multicore SDR handling
+					if (taskType == sealtasks.TTPreCommit1 || taskType == sealtasks.TTUnseal) && envname == "MAX_PARALLELISM" {


This isn't that bad but it makes me wonder if we are hoping to deprecate FIL_PROOFS_USE_MULTICORE_SDR. Other than supporting old workflows is there a reason to keep this old pattern around?

FIL_PROOFS_USE_MULTICORE_SDR is used internally inside proofs, and we don't have a better way to pass that into proofs right now.

Address Scheduler enhancements (#7703) review

magik6k requested a review from a team as a code owner November 29, 2021 15:22

magik6k mentioned this pull request Nov 29, 2021

Scheduler Enhancements #7269

Closed

jennijuju modified the milestones: v1.13.1, v1.13.2 Nov 29, 2021

arajasek reviewed Nov 29, 2021

View reviewed changes

cmd/lotus-seal-worker/resources.go Outdated Show resolved Hide resolved

jennijuju added area/api P1 P1: Must be resolved labels Nov 29, 2021

magik6k force-pushed the feat/scheduler-enhancements branch from 2368358 to 04c016d Compare November 29, 2021 19:36

clinta and others added 13 commits November 30, 2021 02:06

sched: C2 is not all-core load

36868a8

sched resources: Separate Parallelism defaults depending on GPU presence

b961e1a

cleanup worker resource overrides

c9a2ff4

Fix docsgen

6d52d85

worker: Test resource table overrides

f25efec

fix sched tests

a597b07

fix lint

001ecbb

worker: Command to print resource-table env vars

cf20b0b

worker: Typo in resources cmd usage

330cfc3

Co-authored-by: Aayush Rajasekaran <arajasek94@gmail.com>

magik6k force-pushed the feat/scheduler-enhancements branch from 04c016d to 330cfc3 Compare November 30, 2021 01:07

jennijuju approved these changes Nov 30, 2021

View reviewed changes

magik6k merged commit 73f16f0 into master Nov 30, 2021

magik6k deleted the feat/scheduler-enhancements branch November 30, 2021 18:14

ZenGround0 approved these changes Nov 30, 2021

View reviewed changes

magik6k added a commit that referenced this pull request Nov 30, 2021

Address Scheduler enhancements (#7703) review

71329f6

magik6k added a commit that referenced this pull request Nov 30, 2021

Merge pull request #7714 from filecoin-project/feat/sched-review

26c9120

Address Scheduler enhancements (#7703) review

jennijuju mentioned this pull request Jan 10, 2022

Add docs for scheduler cgroup settings filecoin-project/lotus-docs#52

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler enhancements #7703

Scheduler enhancements #7703

magik6k commented Nov 29, 2021

codecov bot commented Nov 29, 2021 •

edited

Loading

arajasek Nov 29, 2021

magik6k Nov 29, 2021

magik6k commented Nov 30, 2021

ZenGround0 left a comment

ZenGround0 Nov 29, 2021

ZenGround0 Nov 30, 2021

magik6k Nov 30, 2021

ZenGround0 Nov 30, 2021

ZenGround0 Nov 30, 2021

ZenGround0 Nov 30, 2021

ZenGround0 Nov 30, 2021

magik6k Nov 30, 2021

ZenGround0 Nov 30, 2021

ZenGround0 Nov 30, 2021

ZenGround0 Nov 30, 2021

ZenGround0 Nov 30, 2021

magik6k Nov 30, 2021

Scheduler enhancements #7703

Scheduler enhancements #7703

Conversation

magik6k commented Nov 29, 2021

codecov bot commented Nov 29, 2021 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

magik6k commented Nov 30, 2021

ZenGround0 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Nov 29, 2021 •

edited

Loading