
Scheduler enhancements #7703

Merged
magik6k merged 13 commits into master from feat/scheduler-enhancements on Nov 30, 2021

Conversation


@magik6k magik6k commented Nov 29, 2021

This is a rebased version of #7269 with some cleanup

TLDR:

  • Make resource accounting cgroup-aware on Linux
  • Allow overriding the default resource table on a per-worker basis through env-vars (see the sketch after this list)
  • Add proper support for multiple GPUs per worker
  • Separate CPU thread counts in the resource table depending on whether workers have GPUs
  • Add some tests on top of the original PR
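As a rough illustration of the per-worker override (the sketch referenced above): the real parsing lives in extern/sector-storage/storiface/resources.go and derives variable names from the task type and sector size, so treat the variable name and default below as placeholders.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// overrideUint64 returns the value of envName parsed as an unsigned integer,
// falling back to the resource-table default when the variable is unset or
// malformed.
func overrideUint64(envName string, def uint64) uint64 {
	v, ok := os.LookupEnv(envName)
	if !ok {
		return def
	}
	n, err := strconv.ParseUint(v, 10, 64)
	if err != nil {
		return def
	}
	return n
}

func main() {
	// PC1_MIN_MEMORY and the 112 GiB default are placeholders for this sketch.
	pc1MinMemory := overrideUint64("PC1_MIN_MEMORY", 112<<30)
	fmt.Println("PC1 MinMemory:", pc1MinMemory)
}
```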

@magik6k magik6k requested a review from a team as a code owner November 29, 2021 15:22

codecov bot commented Nov 29, 2021

Codecov Report

Merging #7703 (330cfc3) into master (19e808f) will increase coverage by 0.72%.
The diff coverage is 54.98%.


@@            Coverage Diff             @@
##           master    #7703      +/-   ##
==========================================
+ Coverage   38.83%   39.56%   +0.72%     
==========================================
  Files         638      640       +2     
  Lines       68122    68325     +203     
==========================================
+ Hits        26456    27031     +575     
+ Misses      37126    36654     -472     
- Partials     4540     4640     +100     
| Impacted Files | Coverage Δ |
|---|---|
| api/version.go | 80.00% <ø> (ø) |
| cmd/lotus-seal-worker/info.go | 12.69% <0.00%> (-0.64%) ⬇️ |
| cmd/lotus-seal-worker/main.go | 0.00% <0.00%> (ø) |
| cmd/lotus-seal-worker/resources.go | 0.00% <0.00%> (ø) |
| extern/sector-storage/manager.go | 64.73% <ø> (-0.21%) ⬇️ |
| extern/sector-storage/cgroups_linux.go | 28.78% <28.78%> (ø) |
| extern/sector-storage/worker_local.go | 59.66% <45.23%> (-2.42%) ⬇️ |
| extern/sector-storage/storiface/resources.go | 75.38% <73.58%> (ø) |
| cmd/lotus-miner/sealing.go | 42.10% <82.35%> (+0.58%) ⬆️ |
| extern/sector-storage/sched_resources.go | 86.25% <85.71%> (+0.95%) ⬆️ |
| ... and 54 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 19e808f...330cfc3.

@magik6k magik6k mentioned this pull request Nov 29, 2021
@jennijuju jennijuju modified the milestones: v1.13.1, v1.13.2 Nov 29, 2021
@@ -58,7 +58,7 @@ var (
 	FullAPIVersion1 = newVer(2, 1, 0)

 	MinerAPIVersion0  = newVer(1, 2, 0)
-	WorkerAPIVersion0 = newVer(1, 1, 0)
+	WorkerAPIVersion0 = newVer(1, 5, 0)
Contributor:

Why such a big jump?

magik6k (Author):

Testing with existing miners. Should be ok to keep like this.

@jennijuju jennijuju added area/api P1 P1: Must be resolved labels Nov 29, 2021
clinta and others added 13 commits November 30, 2021 02:06
Worker processes may have memory limitations imposed by Systemd. But
/proc/meminfo shows the entire system memory regardless of these limits.
This results in the scheduler believing the worker has the entire system
memory available and the worker being allocated too many tasks.

This change attempts to read cgroup memory limits for the worker
process. It supports cgroups v1 and v2, compares cgroup limits
against the system memory, and returns the most conservative values to
prevent the worker from being allocated too many tasks and potentially
triggering an OOM event.
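A minimal sketch of the "most conservative value wins" idea described in this commit, assuming a hypothetical effectiveMemLimit helper (the actual code also handles cgroups v1 and swap limits):

```go
package main

import "fmt"

// effectiveMemLimit reconciles the physical memory reported by /proc/meminfo
// with a cgroup limit: whichever bound is tighter is what the scheduler should
// see. A zero cgroup limit means "no limit imposed".
func effectiveMemLimit(systemMem, cgroupLimit uint64) uint64 {
	if cgroupLimit > 0 && cgroupLimit < systemMem {
		return cgroupLimit
	}
	return systemMem
}

func main() {
	system := uint64(256 << 30) // 256 GiB of physical memory from /proc/meminfo
	cgroup := uint64(64 << 30)  // 64 GiB limit imposed by systemd via cgroups
	fmt.Println(effectiveMemLimit(system, cgroup)) // the worker reports 64 GiB
}
```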
Attempting to report "memory used by other processes" in the MemReserved
field fails to take into account that the system's used memory already
includes memory used by ongoing tasks.

To properly account for this, the worker should report the memory and
swap it is using; the scheduler, which knows the memory requirements of
each task, can then determine whether sufficient memory is available.
Before this change workers can only be allocated one GPU task,
regardless of how much of the GPU resources that task uses, or how many
GPUs are in the system.

This makes GPUUtilization a float, which can express that a task needs a
portion of a GPU or multiple GPUs. GPUs are accounted for like RAM and
CPUs, so workers with more GPUs can be allocated more tasks.

A known issue is that PC2 cannot use multiple GPUs. And even if the
worker has multiple GPUs and is allocated multiple PC2 tasks, those
tasks will only run on the first GPU.

This could result in unexpected behavior when a worker with multiple
GPUs is assigned multiple PC2 tasks. It should not surprise existing
users who upgrade, though: anyone running workers with multiple GPUs
should already know this and be running one worker per GPU for PC2.
Those users now have the freedom to set the GPU utilization of PC2 to
less than one and effectively run multiple PC2 processes in a single
worker.

C2 is capable of utilizing multiple GPUs, and now workers can be
customized for C2 accordingly.
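A sketch of what fractional GPU accounting means in practice; the field and function names are illustrative, not the ones used in sched_resources.go:

```go
package main

import "fmt"

// canFitGPU checks whether a task that needs needGPU worth of GPU (0.5 for
// half a GPU, 2.0 for two GPUs) fits on a worker, given the worker's GPU count
// and how much GPU is already allocated.
func canFitGPU(gpusInWorker int, gpuUsed, needGPU float64) bool {
	return gpuUsed+needGPU <= float64(gpusInWorker)
}

func main() {
	fmt.Println(canFitGPU(2, 1.0, 1.0)) // true: a second full-GPU task fits
	fmt.Println(canFitGPU(1, 0.5, 0.5)) // true: two half-GPU PC2 tasks share one GPU
	fmt.Println(canFitGPU(1, 1.0, 0.5)) // false: the single GPU is fully allocated
}
```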
In an environment with heterogeneous worker nodes, a universal resource
table for all workers does not allow effective scheduling of tasks. Some
workers may have different proof cache settings, changing the required
memory for different tasks. Some workers may have a different count of
CPUs per core-complex, changing the max parallelism of PC1.

This change allows workers to customize these parameters with
environment variables. A worker could set the environment variable
PC1_MIN_MEMORY for example to customize the minimum memory requirement
for PC1 tasks. If no environment variables are specified, the resource
table on the miner is used, except for PC1 parallelism.

If PC1_MAX_PARALLELISM is not specified, and
FIL_PROOFS_USE_MULTICORE_SDR is set, PC1_MAX_PARALLELISM will
automatically be set to FIL_PROOFS_MULTICORE_SDR_PRODUCERS + 1.
Co-authored-by: Aayush Rajasekaran <arajasek94@gmail.com>
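For reference, a sketch of the precedence described in the commit above; the fallback producer count is a placeholder, since the real default comes from the proofs library:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// pc1MaxParallelism applies the precedence described above:
//  1. an explicit PC1_MAX_PARALLELISM override wins,
//  2. otherwise, with FIL_PROOFS_USE_MULTICORE_SDR enabled, use
//     FIL_PROOFS_MULTICORE_SDR_PRODUCERS + 1,
//  3. otherwise fall back to the resource-table default.
func pc1MaxParallelism(tableDefault int) int {
	if v, ok := os.LookupEnv("PC1_MAX_PARALLELISM"); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	if os.Getenv("FIL_PROOFS_USE_MULTICORE_SDR") == "1" {
		producers := 3 // placeholder default for this sketch
		if v, ok := os.LookupEnv("FIL_PROOFS_MULTICORE_SDR_PRODUCERS"); ok {
			if n, err := strconv.Atoi(v); err == nil {
				producers = n
			}
		}
		return producers + 1
	}
	return tableDefault
}

func main() {
	fmt.Println("PC1 max parallelism:", pc1MaxParallelism(1))
}
```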

magik6k commented Nov 30, 2021

(rebased on latest master)

@magik6k magik6k merged commit 73f16f0 into master Nov 30, 2021
@magik6k magik6k deleted the feat/scheduler-enhancements branch November 30, 2021 18:14
@ZenGround0 (Contributor) left a comment:

This is all new to me and some of the scheduling logic (canHandleRequest for example) is still a little hard for me to follow. But overall I understand what you are doing and this looks good.


	return storiface.WorkerInfo{
		Hostname: "testworkerer",
		Resources: storiface.WorkerResources{
			MemPhysical: res.MinMemory * 3,
			MemUsed:     res.MinMemory,
Contributor:

Should MemSwapUsed be added too?

)

func cgroupV2MountPoint() (string, error) {
	f, err := os.Open("/proc/self/mountinfo")
Contributor:

Do we expect lotus to have permission to open this path? I guess users of cgroups will set up these directories to have the right permissions? People using this feature probably know what they're doing, but maybe worth calling out in documentation?

magik6k (Author):

Normal users have access to this by default

	}
	defer f.Close() //nolint

	scanner := bufio.NewScanner(f)
Contributor:

Kinda overkill but consider parsing with something nice
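For what it's worth, "parsing with something nice" could look roughly like this, splitting each mountinfo line on its " - " separator and checking the filesystem type field (a sketch, not the code in cgroups_linux.go):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// findCgroup2Mount scans /proc/self/mountinfo for a cgroup2 filesystem and
// returns its mount point. Each mountinfo line has the form
// "<id> <parent> <maj:min> <root> <mount point> ... - <fstype> <source> <opts>",
// so the line is split on the " - " separator and the fstype checked.
func findCgroup2Mount() (string, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return "", err
	}
	defer f.Close() //nolint

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), " - ", 2)
		if len(parts) != 2 {
			continue
		}
		pre, post := strings.Fields(parts[0]), strings.Fields(parts[1])
		if len(post) > 0 && post[0] == "cgroup2" && len(pre) >= 5 {
			return pre[4], nil // field 5 before the separator is the mount point
		}
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("cgroup2 mount not found")
}

func main() {
	fmt.Println(findCgroup2Mount())
}
```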

		return 0, 0, 0, 0, err
	}

	for path != "/" {
Contributor:

I know almost nothing about cgroups, but this loop structure confuses me. I'm wondering why the output of cgroupv2.PidGroupPath doesn't already give the correct path for reading the memory limit information.
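On the question above: the path for the process is its leaf cgroup, but any ancestor cgroup can impose a tighter memory limit, so the loop walks from the leaf up to the root and keeps the most restrictive value. A sketch of that idea for cgroups v2 (illustrative, not the actual cgroups_linux.go code; the unit path in main is hypothetical):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// minMemoryLimit walks from a process's leaf cgroup up to the cgroup root and
// keeps the smallest memory.max it encounters; "max" means unlimited at that
// level. Any ancestor can be more restrictive than the leaf, which is why
// reading only the leaf group is not sufficient.
func minMemoryLimit(cgroupRoot, leafPath string) uint64 {
	limit := uint64(1<<63 - 1) // effectively unlimited
	for path := leafPath; ; path = filepath.Dir(path) {
		data, err := os.ReadFile(filepath.Join(cgroupRoot, path, "memory.max"))
		if err == nil {
			if s := strings.TrimSpace(string(data)); s != "max" {
				if v, err := strconv.ParseUint(s, 10, 64); err == nil && v < limit {
					limit = v
				}
			}
		}
		if path == "/" {
			break
		}
	}
	return limit
}

func main() {
	// "/system.slice/lotus-worker.service" is a hypothetical unit path.
	fmt.Println(minMemoryLimit("/sys/fs/cgroup", "/system.slice/lotus-worker.service"))
}
```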

	defer cleanup()

	localTasks := []sealtasks.TaskType{
		sealtasks.TTAddPiece, sealtasks.TTPreCommit1, sealtasks.TTCommit1, sealtasks.TTFinalize, sealtasks.TTFetch,
Contributor:

nit: for more clarity you could restrict operations to what the test uses, TTAddPiece and TTFetch iiuc

		if w.MemUsedMax > 0 {
			break l
		}
		time.Sleep(time.Millisecond)
Contributor:

Any chance this could hang? Maybe adding a break signal on the AddPiece goroutine completing would guard against weird hangs if, for some reason, memory usage is not propagating to the worker correctly?

magik6k (Author):

It shouldn't; if it does for whatever reason, the test will just time out in 30 min. (Normally this test takes 0.1s to run.)
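For reference, the kind of guard suggested above could look roughly like this; all names are illustrative, and the current test simply relies on the overall test timeout instead:

```go
package main

import (
	"fmt"
	"time"
)

// waitForMemUsed polls a MemUsedMax-style getter until it reports a non-zero
// value, but also stops if the task signals completion or a deadline passes,
// so a broken propagation path cannot hang the whole test.
func waitForMemUsed(memUsedMax func() uint64, done <-chan struct{}, timeout time.Duration) error {
	deadline := time.After(timeout)
	for {
		if memUsedMax() > 0 {
			return nil
		}
		select {
		case <-done:
			return fmt.Errorf("task finished but no memory usage was reported")
		case <-deadline:
			return fmt.Errorf("timed out waiting for memory usage to propagate")
		default:
			time.Sleep(time.Millisecond)
		}
	}
}

func main() {
	done := make(chan struct{})
	go func() { time.Sleep(5 * time.Millisecond); close(done) }()
	fmt.Println(waitForMemUsed(func() uint64 { return 0 }, done, time.Second))
}
```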

	maxNeedMem := res.MemReserved + a.memUsedMax + needRes.MaxMemory + needRes.BaseMinMemory
	vmemNeeded := needRes.MaxMemory + needRes.BaseMinMemory
	vmemUsed := a.memUsedMax
	if vmemUsed < res.MemUsed+res.MemSwapUsed {
Contributor:

nit: giving res.MemUsed + res.MemSwapUsed a name would make following along a bit easier
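A small sketch of the naming suggested in this nit; the function name and the "take the larger value" behaviour are illustrative rather than copied from sched_resources.go:

```go
package main

import "fmt"

// effectiveVmemUsed names the worker-reported total (mem used + swap used)
// before comparing it with the scheduler's tracked maximum, so the comparison
// reads as "tracked max vs. what the worker itself reports".
func effectiveVmemUsed(trackedMax, memUsed, memSwapUsed uint64) uint64 {
	workerReported := memUsed + memSwapUsed
	if trackedMax < workerReported {
		return workerReported
	}
	return trackedMax
}

func main() {
	fmt.Println(effectiveVmemUsed(8<<30, 6<<30, 4<<30)) // worker-reported 10 GiB wins
}
```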

-	MinMemory uint64 // What Must be in RAM for decent perf
-	MaxMemory uint64 // Memory required (swap + ram)
+	MinMemory uint64 `envname:"MIN_MEMORY"` // What Must be in RAM for decent perf
+	MaxMemory uint64 `envname:"MAX_MEMORY"` // Memory required (swap + ram)
Contributor:

"What Must be in RAM for decent perf" makes sense to me, but "Memory required (swap + ram)" is confusing me a bit given that the name is MAX. It sounds like this is MIN_MEMORY + MIN_SWAP? I think clarifying this comment would be helpful.

	require.Equal(t, 1, ResourceTable[sealtasks.TTUnseal][stabi.RegisteredSealProof_StackedDrg2KiBV1_1].MaxParallelism)
}

func TestListResourceSDRMulticoreOverride(t *testing.T) {
Contributor:

Nice test

		envval, found := lookup(taskType.Short()+"_"+shortSize+"_"+envname, fmt.Sprint(rr.Elem().Field(i).Interface()))
		if !found {
			// special multicore SDR handling
			if (taskType == sealtasks.TTPreCommit1 || taskType == sealtasks.TTUnseal) && envname == "MAX_PARALLELISM" {
Contributor:

This isn't that bad but it makes me wonder if we are hoping to deprecate FIL_PROOFS_USE_MULTICORE_SDR. Other than supporting old workflows is there a reason to keep this old pattern around?

magik6k (Author):

FIL_PROOFS_USE_MULTICORE_SDR is used internally inside proofs, and we don't have a better way to pass that into proofs right now.

magik6k added a commit that referenced this pull request Nov 30, 2021
magik6k added a commit that referenced this pull request Nov 30, 2021