
resource: mapping user-request resources to different underlying system resources #634

Open
SteVwonder opened this issue Mar 28, 2020 · 1 comment


In certain scenarios, the resource set that the user requests in their jobspec may not be sufficient to execute the job. Consider the following examples:

  • the user requests slot[1]->gpu[1]. With current technology, at least one CPU core is required to execute a task on the GPU. Rather than rejecting the job at submission, the datacenter wants to automatically add the core as a quality-of-life improvement for the user (a sketch of this kind of expansion follows this list).
  • the user requests slot[1]->core[1], and the datacenter wants to enforce that for each core the user gets, they also get a certain amount of memory, to prevent OOM'ing other jobs on a shared node.
  • the user requests 1TB of global storage on a burst buffer or IO device, but they also requested redundancy (e.g., duplication, RAID). In order to deliver 1TB of usable capacity, the system actually needs to allocate a larger amount of raw space to cover the redundancy overhead.
  • the user requests 1TB of storage for an IO library like LABIOS or UnifyFS. In addition to the previous use-case (overallocating the IO resource to cover overheads), we also want to allocate additional compute resources for these libraries to perform their tasks (CC @Keith-Bateman).
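
To make the first two examples concrete, here is a minimal sketch of the kind of expansion a match/select plugin could apply to the requested resource set before matching. The dict-based representation and the function names are hypothetical; they are not Fluxion's actual data structures or API.

```python
# Hypothetical sketch only: the resource representation and function names
# below are illustrative, not Fluxion data structures or APIs.

def add_core_per_gpu(resources):
    """Policy: every GPU needs at least one core to drive it."""
    out = []
    for res in resources:
        res = dict(res, children=add_core_per_gpu(res.get("children", [])))
        out.append(res)
        if res["type"] == "gpu":
            out.append({"type": "core", "count": res["count"], "children": []})
    return out

def add_memory_per_core(resources):
    """Policy: 2 GB of memory per core, to avoid OOM'ing co-located jobs."""
    out = []
    for res in resources:
        res = dict(res, children=add_memory_per_core(res.get("children", [])))
        out.append(res)
        if res["type"] == "core":
            out.append({"type": "memory", "count": 2 * res["count"],
                        "unit": "GB", "children": []})
    return out

# User request: slot[1]->gpu[1]
request = [{"type": "slot", "count": 1,
            "children": [{"type": "gpu", "count": 1, "children": []}]}]

expanded = add_memory_per_core(add_core_per_gpu(request))
# The slot now contains gpu[1], core[1], and memory[2 GB], even though the
# user only asked for the GPU.
print(expanded)
```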

In all of these examples, the R (concrete resource set) that the scheduler outputs will contain a superset of the resources requested in the J (jobspec).

In a more extreme example, assume the user requests some amount of storage bandwidth. The scheduler then converts that request into different resources (e.g., X TB of capacity spread across Y storage devices) and matches based on those. In this case, R isn't even a superset of J: it contains resources not present in J and omits some resources that are present in J.
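
As a rough illustration of this bandwidth->capacity translation (again using invented names and per-device numbers, not any real Fluxion interface):

```python
# Hypothetical sketch only: per-device numbers and names are invented.
import math

DEVICE_BW_GBPS = 2   # assumed bandwidth of one node-local SSD
DEVICE_CAP_TB = 4    # assumed capacity of one node-local SSD

def translate_bandwidth(request_gbps):
    """Translate a bandwidth request (J) into the devices/capacity (R)
    actually needed to satisfy it."""
    n_devices = math.ceil(request_gbps / DEVICE_BW_GBPS)
    # R is expressed in terms of ssd/capacity while J asked for bandwidth,
    # so R is neither equal to nor a superset of J.
    return [{"type": "ssd", "count": n_devices,
             "children": [{"type": "capacity",
                           "count": DEVICE_CAP_TB, "unit": "TB"}]}]

# J: "give me 10 GB/s of storage bandwidth"
print(translate_bandwidth(10))   # -> 5 SSDs, each with 4 TB of capacity
```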

After discussing with @dongahn and others, it seems like this could be handled either at job submission time via front-end tool plugins (flux-framework/flux-core#2875) or at scheduling time by a scheduler match/select plugin. This issue is to track the design and implementation progress of the latter.

Advantages of doing it at scheduling time:

  • The same jobspec will be more portable across different systems (assuming the systems all have a match/select plugin that supports the relevant use-case)

Complications of doing it at scheduling time:

  • Since a direct 1:1 match is no longer occurring, R isn't necessarily going to be strictly equal to J (and in the last example it isn't even guaranteed to be a superset of J).
  • If we want to avoid large, monolithic scheduler plugins that must handle every possible use-case, we will need to figure out a way to compose multiple plugins together, one per use-case (see the composition sketch after this list).
  • We need to figure out how to handle the case where resources "expand" into different places in the hierarchy, as in the bandwidth->capacity example above: the bandwidth might be requested at the global level, but it gets satisfied by allocating many node-local SSDs. If this is done at match time, it could cause issues since we cannot control traversal of the jobspec or system resources at match time.
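
One possible, purely illustrative shape for that composition is a pipeline of small expansion passes, one per policy, applied to the requested resource set before matching:

```python
# Hypothetical sketch only: the pass interface is invented for illustration
# and is not a proposal for Fluxion's actual plugin API.
from typing import Callable, Dict, List

ResourceSet = List[Dict]
ExpansionPass = Callable[[ResourceSet], ResourceSet]

def compose(passes: List[ExpansionPass]) -> ExpansionPass:
    """Chain expansion passes: the output of one pass feeds the next."""
    def run(resources: ResourceSet) -> ResourceSet:
        for p in passes:
            resources = p(resources)
        return resources
    return run

# Reusing the policy functions sketched above, a site might register:
#   site_policy = compose([add_core_per_gpu, add_memory_per_core])
#   expanded_request = site_policy(request)
```

Note that pass ordering matters: the memory-per-core pass has to run after the core-per-GPU pass if cores added by the first policy should also receive memory.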

dongahn commented Jul 12, 2020

I think it would be best to have a few real-world use cases, which can help guide the design and implementation. The concept may be a "minimum complete resource set"?
