Jobspec "intents" Discussion #1153

vsoch · 2024-03-09T03:35:10Z

vsoch
Mar 9, 2024
Maintainer

Summary

We are wanting to extend jobspec to be able to hold generic metadata about application needs. In flux terms, the main jobspec is for the dominant subsystem, typically asking for nodes, cores, socket, etc. This is on the level of these base resources, hence why it's dominant.

Now we add to the discussion the registration of different subsystems where we might be talking about resources such as IO, drivers, power, a more basic variant of IO that is just "kinds of storage" (e.g., like a simple IO case for Kubernetes storage operators), or noodles. It doesn't matter - it's some kind of resource specification that an application needs, and thus needs to be accounted for in a request for work. In simpler terms, if my application needs noodles, the scheduler better match me up with nodes that have noodles! Note that the first part of this post will be a thinking exercise (with too much detail) and I'll try to finish with my early summary thoughts.

Design thinking through Example: IOR

As an early prototype, @hariharan-devarajan had suggested the idea of an intent, which belongs under a task and can describe this more nuanced request. And Hari - feel free to add more detail about the choice of term or definition, but (high level) I see it as the application saying "this is my intent or purpose, please ensure that it's satisfied." We care about intents because when we don't get what we need, we get suboptimal performance. We will be demonstrating that empirically (and likely there are already examples in the wild or from experience).

Since the intents are paired with an application, in the current jobspec they would go under tasks. E.g.,:

resources:
- count: 1
  type: rack
  with:
  - count: 2
    type: node
    with:
    - count: 1
      label: default
      type: slot
      with:
      - count: 2
        type: socket
        with:
        - count: 24
          type: core
        - count: 128
          type: memory
tasks:
- command:
  - ior
  count:
    per_slot: 1
  intents:
    ....

In the above, we are running the application "ior" to assess something about IO, and likely our intents will be related to that. Actually, we have a full example of that (kept separately because it's massive!) that makes the assumption that we care about metadata down to the level of the file. To start my thinking, I am looking at this list for IOR and seeing duplication. For this one use case, could simplify it down to something like (this is only tasks from above):

tasks:
- command:
  - ior
  count:
    per_slot: 1
  intents:
    files:
      - access_pattern: random
        file_size: 0
        filename: /p/lustre2/haridev/iflux//test_ior.000000[00-79]
        priority: 1
        process_access_pattern: fpp
      - access_pattern: random
        file_size: 33554432
        filename: /p/lustre2/haridev/iflux//test_ior.000000[00-79].[0-4]
        priority: 1
        process_access_pattern: fpp

It's (mostly, I think) the same information, but using expansion for the filenames, assuming matrix expansion for the second (and note to Hari, this is likely hugely wrong, but I want to use as a short hand for this thought experiment)!

So for the above, we are saying (high level) that this information about access patterns needs to inform the scheduler. What I'm realizing now is that this too is still an intermediate representation - because what the scheduler sees is this, where the vertices in that graph (JGF) correspond to different storage types available for this specific system. They certainly won't understand a path that is in haridev if it's run by anyone other than Hari! I think I need to try to walk through this to think more about it, and I'll do that now.

IOR through running with mapping

Originally I was thinking of a jobspec with intents as "the thing we stick in GitHub to reproduce the analysis" but I'm not sure that makes sense here. The intents that are generated for a specific application on one system might be different for the same application on another, and (based on the example above) are even unique to the user, unless output can be described more generally. Providing that level of detail to then run on a generic system seems like some kind of overfitting. I'm going to try to walk through how the above was generated (please correct me I am likely getting something wrong).

We start with a job to run IOR. We don't know what our application is going to do, IO wise and need to map out the patterns. That jobspec tasks section might look like this:

tasks:
  - command:
      - ior
    count:
      per_slot: 1
    slot: default

In this case, we are doing a trace with LD_PRELOAD and we dump out a bunch of intermediate files that have entries like this:

{"id":"130","name":"write","cat":"POSIX","pid":"67826","tid":"135652","ts":"1708734201508017","dur":"702","ph":"X","args":{"hostname":"66d21f2f3d09","ret":"1048576","count":"1048576","fd":15,"fname":"/home/vscode/iflux/test_ior.00000000.1"}}

This is classic performance analysis - tracing something and getting out metrics for what it did. Then we have a python script that generates the intents from those entries, and there is a mapping done to generate placements, and the prototype flux job run again with these placements in mind. I walked through this entire process once and I still find it really complex - it's a lot of work to do just to run one thing. The process is really complicated, and it's scoped to a specific system and user.

For my own thinking, I am going to describe this at a high level. I (think) the idea is that we want to understand IO patterns for an application, and then map that into a resource request that will optimally suit the application. The problem I see with this design is that it isn't just relying on the scheduler to provide some specific node with storage, it's also relying on the presence of a plugin or library on the node to ensure that the placements are honored. In the example here, that looks like:

export LD_LIBRARY_PATH=/workspaces/iflux/build/lib:/workspaces/iflux/build/lib64
export IFLUX_PLACEMENT_FILE=/home/vscode/iflux/data/resource/placements/ior_fpp_1_4_wo.yaml
LD_PRELOAD=/workspaces/iflux/build/lib/libiflux_executor.so flux run -N 1 -n 4 ior -m -k -F -w -a=POSIX -b=32m -t=1m -i=5 -o=/home/vscode/iflux/test_ior

Notice in the above we need to somehow generate and get a placement file there, and then have a plugin available to use it.
I'm not sure in practice we can guarantee that. Let me try to unpack the entities involved here.

The application developer writes the application. They probably don't assess it for IO.
The application user (another developer, likely) then assesses the patterns for a specific system.
Some fine grained mapping is written for the optimal running of said application on those specific systems. This would have to be done by the application user that has access / permissions to those systems. They could be informed by the patterns provided in the previous step (e.g., testing on the specific storage type). The output is a fine grained mapping.
The scheduler then needs to select the systems that can support that mapping.
The mapping needs to be enforced by some other plugin on the system (in this case, called iflux that is using LD_PRELOAD

In an experimental setup where we control all of that? it would be very easy to demonstrate "when we give the application exactly what it needs, it runs better." But I'm struggling to think through how this maps to real life - a scientist that chooses to run lammps, and maybe has a tiny incentive to do it optimally, but probably won't go beyond selecting a jobspec (or similar) definition file someone else has made to do it. Let me try to unwrap these thoughts more.

Problems with generalization of the above

In practice I'm having a hard time seeing how this generalizes. Here are some things that will go wrong:

The scientist just wants to take a job definition file that someone else has created, and run it. The above would require someone else to do the mapping for them, and generate a custom file for the space of resources they have access to. Is that reasonable?
The success of the run is also dependent on a node having some plugin (or similar) that reads in the mapping and enforces it. This means that for many cases, the scheduler would need to get info about plugins available, ask nodes ("Do you have that plugin?" and if they do not, fall back to just run the application on the storage type (without it).
Arguably adding in stuff "on the fly" is easier and (security wise) less severe than on a shared HPC resource. But adding in plugins that require LD_PRELOAD even in Kubernetes (akin to hpctoolkit) adds an entire universe of complexity that no reasonable developer wants to make a standard pattern of operation because of the ABI compatibility of the wrapper to the application. It's like, really hard. I've done it a few times and have not enjoyed it and do not advocate for it as a common pattern because the compatibility issues are enormous.
No HPC system that doesn't have the plugins would want / trust some third party plugin to be pulled in for this use case, at runtime.

In the perfect experimental setup, everything would work, and a paper could be published that shows that. But I don't have vision for anything beyond that. Maybe someone can help me with seeing this or ideas?

Generalization

My task is currently to design something for rainbow that looks like an intent, and generally maps a subsystem resource request to something in the graph to schedule work to nodes. What makes sense to me (to start) is to separate responsibility.

jobspec: intents take the high level mapping of where you want an application to run, from an IO standpoint
"how that gets done" is placed on the user.

There are also two different levels of development here that we are tangling up:

scheduling or image selection compatibility is a guarantee that subsystem resources are selected optimally for it to run. This is scoped to what we can put in the scheduler resource graph. There should be no additional data files or plugins needed.
application compatibility is about optimization of that, and much finer grained detail.

For this early work, we are going to get in trouble by starting with trying to satisfy the needs of extra data files / plugin execution to get a finer grained mapping of an application to a chosen subsystem resource. I'm not saying we can't eventually get there, but it's not step one. It smells like something more under the jurisdiction of a workflow tool. I think we need to focus (and am going to suggest we do) on the first bullet - the very straight forward resources offered by a node (containment subsystem plus high level kinds of storage, power, gpu, etc) and focus on that first. In summary:

Level 1: scheduling and image selection for a node comes down to matching high level application needs to resources in the graph.
Level 2: application optimization that requires additional data and plugins belongs at the level of a workflow tool.

Update: moving my prototype to a different discussion item that can be linked to! I can't link to headers here it seems.

vsoch · 2024-03-09T03:40:16Z

vsoch
Mar 9, 2024
Maintainer Author

Suggested Prototype

Based on the thinking above, this is what I am going to try. I understand people won't be happy with me to say that this level of detail is not what we are ready to do, at least for the generalized use cases (something that runs across different HPC or cloud clusters for which we only know the high level storage options, and cannot measure things in advance). Let me talk about what I think makes sense to do, as a first step. Here is what I'd want this request for an application resources (subsystem resources) to look like.

Instead of intents, tasks have resources, because technically speaking, a subsystem is another kind of resource. It's just the needs specific to a task.
Each resource entry is scoped to the equivalently named subsystem. The subsystem can control the algorithms and metadata provided in this section. In simpler terms, the "io" subsystem controls the sections and key value pairs provided, and they can be very different from power.

I would want the jobspec intent to match what the scheduler sees, and that reflects what metadata is extracted from the node directly.

tasks:
- command:
  - ior
  count:
    per_slot: 1
  resources:
    io:
      storage:
      - priority: 1
        storage: mtl2unit
      - priority: 2
        storage: shm

We would also give the control of the algorithm (how the match is done and structure of the metadata) to the subsystem. E.g., in the above, when I register "io" I might say something like:

Select the "storage" algorithm (io->storage in the above)
Go through the list, try to first match to nodes with exact string mtl2unit. If that's not possible, try for shm.

And we would come up with some cool / simple design that can support writing out this logic into a form that can be run across the graph or other. For the prototype I've generated algorithms as another kind of plugin here and I think we could have node selection algorithms that follow specific patterns and can take metadata from a subsystem and then run the logic. I'll think more about that later.

So I think this is what I can develop now - what is shown above and described in use case 1. To summarize, it makes sense that a user can request a kind of resource defined in a subsystem that is present on the cluster (e.g., the storage type, power, GPU, etc.) without knowing about additional plugins / data files that are needed. The more complex scheduling of applications (and their needs) likely is a next step that I'm not sure belongs here, but maybe better with a workflow tool.

0 replies

vsoch · 2024-03-10T18:06:50Z

vsoch
Mar 10, 2024
Maintainer Author

Updating discussion from slack with @trws. We don't need a design for a scout job (two separately submit jobs) or requiring some saving of intermediate data (e.g., for intents) because we can submit a job from an initial job (that performs the same task to do a mapping, etc.) E.g.,

T: you can always submit new work from inside existing jobs
V: so you can submit a job from another job?
T: absolutely
V: oh I see - I was implying an existing job extending itself.
T: if there's another instance inside the new allocation you have to ask for the parent to do it, but yeah that's always been an option

So we would just need to have some kind of software subsystem (as we already might) that ensures that this initial job hits the nodes that it needs to run the "sniffing" step, likely having some application to be ld preloaded alongside the main application.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jobspec "intents" Discussion #1153

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Jobspec "intents" Discussion #1153

vsoch Mar 9, 2024 Maintainer

Summary

Design thinking through Example: IOR

IOR through running with mapping

Problems with generalization of the above

Generalization

Replies: 2 comments

vsoch Mar 9, 2024 Maintainer Author

Suggested Prototype

vsoch Mar 10, 2024 Maintainer Author

vsoch
Mar 9, 2024
Maintainer

vsoch
Mar 9, 2024
Maintainer Author

vsoch
Mar 10, 2024
Maintainer Author