Replies: 2 comments
---
Suggested Prototype
Based on the thinking above, this is what I am going to try. I understand people won't be happy with me for saying that this level of detail is not something we are ready for, at least for the generalized use cases (something that runs across different HPC or cloud clusters for which we only know the high-level storage options, and cannot measure things in advance). Let me talk about what I think makes sense as a first step. Here is what I'd want this request for application resources (subsystem resources) to look like.
I would want the jobspec intent to match what the scheduler sees, which reflects the metadata extracted directly from the node:

```yaml
tasks:
  - command:
      - ior
    count:
      per_slot: 1
    resources:
      io:
        storage:
          - priority: 1
            storage: mtl2unit
          - priority: 2
            storage: shm
```

We would also give control of the algorithm (how the match is done and the structure of the metadata) to the subsystem. E.g., in the above, when I register "io" I might say something like:
And we would come up with some cool, simple design that can support writing this logic in a form that can be run across the graph (or elsewhere). For the prototype I've implemented algorithms as another kind of plugin here, and I think we could have node selection algorithms that follow specific patterns: they take metadata from a subsystem and then run the matching logic. I'll think more about that later.

So I think this is what I can develop now: what is shown above and described in use case 1. To summarize, it makes sense that a user can request a kind of resource defined in a subsystem that is present on the cluster (e.g., a storage type, power, GPU, etc.) without knowing about additional plugins or data files that are needed. The more complex scheduling of applications (and their needs) is likely a next step, and I'm not sure it belongs here; it may be better suited to a workflow tool.
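To make the plugin idea concrete, here is a minimal Python sketch of a priority-ordered node selection algorithm. The function name, the request shape, and the node metadata shape are all my own assumptions for illustration, not rainbow's actual plugin API:

```python
# Hedged sketch of a priority-ordered node selection algorithm.
# The function signature and metadata shapes are assumptions for
# illustration, not part of any actual rainbow/flux API.

def select_nodes(request, nodes):
    """Return the first storage kind (by priority) that some node satisfies.

    request: list of {"priority": int, "storage": str} entries
    nodes:   dict of node name -> set of storage kinds on that node
    """
    for entry in sorted(request, key=lambda e: e["priority"]):
        matched = [name for name, storage in nodes.items()
                   if entry["storage"] in storage]
        if matched:
            return entry["storage"], matched
    return None, []

# Toy data mirroring the jobspec example above
request = [
    {"priority": 1, "storage": "mtl2unit"},
    {"priority": 2, "storage": "shm"},
]
nodes = {
    "node-1": {"shm"},
    "node-2": {"shm", "mtl2unit"},
}
storage, selected = select_nodes(request, nodes)
```

The point of the sketch is only that the subsystem owns this function: the scheduler hands it the request and the candidate metadata, and the plugin decides the match.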
---
Updating discussion from Slack with @trws. We don't need a design for a scout job (two separately submitted jobs), or a way to save intermediate data (e.g., for intents), because we can submit a job from within an initial job (one that performs the same task to do a mapping, etc.). E.g.,
So we would just need some kind of software subsystem (as we may already have) that ensures this initial job hits the nodes it needs for the "sniffing" step, likely with some library LD_PRELOADed alongside the main application.
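As a rough sketch of that initial job, the structure below follows the jobspec shape from the example earlier; the environment-attribute layout and the library path are my own assumptions, not a confirmed Flux convention:

```python
# Hedged sketch: build a jobspec-like dict for an initial "sniffing" job
# that LD_PRELOADs a tracing library alongside the real application.
# The attribute layout and library path are assumptions for illustration.

def sniffing_jobspec(command, preload_lib):
    return {
        "version": 1,
        "tasks": [
            {
                "command": command,
                "count": {"per_slot": 1},
            }
        ],
        "attributes": {
            "system": {
                "environment": {
                    # Trace I/O calls made by the application
                    "LD_PRELOAD": preload_lib,
                }
            }
        },
    }

spec = sniffing_jobspec(["ior"], "/usr/lib64/libtracer.so")
```

The sniffing job would then submit the real job itself, carrying forward whatever mapping it derived.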
---
Summary
We want to extend the jobspec to be able to hold generic metadata about application needs. In Flux terms, the main jobspec is for the dominant subsystem, typically asking for nodes, cores, sockets, etc. It operates at the level of these base resources, which is why it's dominant.
Now we add to the discussion the registration of different subsystems, where we might be talking about resources such as IO, drivers, power, a more basic variant of IO that is just "kinds of storage" (e.g., like a simple IO case for Kubernetes storage operators), or noodles. It doesn't matter: it's some kind of resource specification that an application needs, and thus needs to be accounted for in a request for work. In simpler terms, if my application needs noodles, the scheduler had better match me up with nodes that have noodles! Note that the first part of this post is a thinking exercise (with too much detail); I'll try to finish with my early summary thoughts.
Design thinking through Example: IOR
As an early prototype, @hariharan-devarajan had suggested the idea of an intent, which belongs under a task and can describe this more nuanced request. And Hari - feel free to add more detail about the choice of term or definition, but (high level) I see it as the application saying "this is my intent or purpose, please ensure that it's satisfied." We care about intents because when we don't get what we need, we get suboptimal performance. We will be demonstrating that empirically (and likely there are already examples in the wild or from experience).
Since the intents are paired with an application, in the current jobspec they would go under tasks. E.g.,:
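The original intents example is not shown here, so as a hedged placeholder, a minimal sketch of the idea might look like the YAML below. Every field name under `intents` is my guess at the concept, not Hari's actual schema:

```yaml
# Hypothetical intents-under-tasks sketch (not the actual schema)
tasks:
  - command:
      - ior
    count:
      per_slot: 1
    intents:
      - name: io-access-pattern   # hypothetical intent kind
        access: sequential-write
        files:
          - test_ior.*
```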
In the above, we are running the application "ior" to assess something about IO, and likely our intents will be related to that. Actually, we have a full example of that (kept separately because it's massive!) that makes the assumption that we care about metadata down to the level of the file. To start my thinking, I am looking at this list for IOR and seeing duplication. For this one use case, we could simplify it down to something like (this is only the tasks section from above):
It's (mostly, I think) the same information, but using expansion for the filenames, assuming matrix expansion for the second (and note to Hari, this is likely hugely wrong, but I want to use it as shorthand for this thought experiment)!
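To show what I mean by matrix expansion, here is a hedged sketch; the template syntax is invented for this thought experiment, not an agreed format:

```python
# Hedged sketch: matrix-expand a filename template plus numeric ranges
# into the explicit per-file list that the verbose intents spelled out.
# The template syntax here is invented for illustration.

from itertools import product

def expand_files(template, *ranges):
    """Expand the cartesian product of ranges into concrete filenames."""
    return [template % combo for combo in product(*ranges)]

# e.g., 2 write phases x segments 1..2
files = expand_files("test_ior.%08d.%d", range(2), range(1, 3))
```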
So for the above, we are saying (high level) that this information about access patterns needs to inform the scheduler. What I'm realizing now is that this too is still an intermediate representation, because what the scheduler sees is this, where the vertices in that graph (JGF) correspond to the different storage types available for this specific system. Those vertices certainly won't understand a path that contains `haridev` if it's run by anyone other than Hari! I think I need to try to walk through this to think more about it, and I'll do that now.

IOR through running with mapping
Originally I was thinking of a jobspec with intents as "the thing we stick in GitHub to reproduce the analysis," but I'm not sure that makes sense here. The intents generated for a specific application on one system might be different for the same application on another, and (based on the example above) are even unique to the user, unless the output can be described more generally. Providing that level of detail to then run on a generic system seems like some kind of overfitting. I'm going to try to walk through how the above was generated (please correct me; I am likely getting something wrong).
In this case, we are doing a trace with `LD_PRELOAD`, and we dump out a bunch of intermediate files that have entries like this:

```json
{"id":"130","name":"write","cat":"POSIX","pid":"67826","tid":"135652","ts":"1708734201508017","dur":"702","ph":"X","args":{"hostname":"66d21f2f3d09","ret":"1048576","count":"1048576","fd":15,"fname":"/home/vscode/iflux/test_ior.00000000.1"}}
```
This is classic performance analysis: trace something and get out metrics for what it did. Then a Python script generates the intents from those entries, a mapping is done to generate placements, and the prototype Flux job is run again with these placements in mind. I walked through this entire process once and I still find it really complex; it's a lot of work just to run one thing, and it's scoped to a specific system and user.
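As a hedged illustration of the aggregation idea behind that Python step (this is not the actual intent-generation script), we can parse trace entries like the one above and sum bytes written per file:

```python
# Hedged sketch: aggregate bytes written per file from trace entries
# like the one shown above. Not the actual intent-generation script,
# only an illustration of the aggregation step.

import json
from collections import defaultdict

def bytes_per_file(lines):
    totals = defaultdict(int)
    for line in lines:
        entry = json.loads(line)
        if entry.get("name") == "write" and entry.get("cat") == "POSIX":
            totals[entry["args"]["fname"]] += int(entry["args"]["count"])
    return dict(totals)

# One trace line, trimmed to the fields this sketch uses
trace = [
    '{"id":"130","name":"write","cat":"POSIX","args":'
    '{"count":"1048576","fname":"/home/vscode/iflux/test_ior.00000000.1"}}',
]
totals = bytes_per_file(trace)
```

An intent generator could then translate per-file totals like these into storage requests.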
For my own thinking, I am going to describe this at a high level. I think the idea is that we want to understand the IO patterns of an application, and then map those into a resource request that optimally suits the application. The problem I see with this design is that it isn't just relying on the scheduler to provide some specific node with storage; it's also relying on the presence of a plugin or library on the node to ensure that the placements are honored. In the example here, that looks like:
Notice that in the above we need to somehow generate a placement file and get it there, and then have a plugin available to use it.
I'm not sure in practice we can guarantee that. Let me try to unpack the entities involved here.
`LD_PRELOAD`
In an experimental setup where we control all of that, it would be very easy to demonstrate that "when we give the application exactly what it needs, it runs better." But I'm struggling to think through how this maps to real life: a scientist chooses to run LAMMPS, maybe has a tiny incentive to do it optimally, but probably won't go beyond selecting a jobspec (or similar definition file) someone else has made. Let me try to unwrap these thoughts more.
Problems with generalization of the above
In practice I'm having a hard time seeing how this generalizes. Here are some things that will go wrong:
In the perfect experimental setup everything would work, and a paper could be published that shows that. But I don't have vision for anything beyond that. Maybe someone can help me see this, or has ideas?
Generalization
My task is currently to design something for rainbow that looks like an intent, and generally maps a subsystem resource request to something in the graph to schedule work to nodes. What makes sense to me (to start) is to separate responsibility.
There are also two different levels of development here that we are tangling up:
For this early work, we are going to get in trouble if we start by trying to satisfy the needs of extra data files and plugin execution to get a finer-grained mapping of an application to a chosen subsystem resource. I'm not saying we can't eventually get there, but it's not step one; it smells like something under the jurisdiction of a workflow tool. I think we need to focus (and am going to suggest we do) on the first bullet: the very straightforward resources offered by a node (the containment subsystem plus high-level kinds of storage, power, GPU, etc.) and get that right first. In summary:
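To pin down what "straightforward resources offered by a node" means at the cluster-selection level, here is a toy Python sketch; the advertised-resource strings and the function are my own model, not rainbow's actual matching implementation:

```python
# Hedged sketch: filter clusters by the high-level resource kinds they
# advertise. A toy model only, not rainbow's actual matcher.

def clusters_with(resource_kind, clusters):
    """Return names of clusters advertising a given resource kind."""
    return [name for name, kinds in clusters.items()
            if resource_kind in kinds]

# Toy registry of clusters and the resource kinds they advertise
clusters = {
    "cluster-a": {"storage:shm", "gpu"},
    "cluster-b": {"storage:mtl2unit", "storage:shm", "power"},
}
candidates = clusters_with("storage:mtl2unit", clusters)
```

The key simplification is that a user only names a kind of resource; no trace files, placement files, or node-side plugins are required for this first step.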
Update: moving my prototype to a different discussion item that can be linked to! It seems I can't link to headers here.