Start here, initial implementation #1
base: main
deitch
left a comment
Let's start with these high-level comments. I won't go into the details of each logical component until we have agreement on the overview.
| @@ -0,0 +1,320 @@ | |||
| # ClowdControl Design Notes | |||
|
|
|||
| ClowdControl controller is a software used to control Clowder | |||
"is software" - no need for "a". You might even be more explicit, like "ClowdControl controller is the key controller for the Clowder GenAI distributed inference cluster".
|
|
||
| ClowdControl controller is a software used to control Clowder | ||
| GenAI distributed inference cluster, from managing models, | ||
| to making privisioning and load scheduling decisions. |
"provisioning" - typo
| ClowdControl controller is a software used to control Clowder | ||
| GenAI distributed inference cluster, from managing models, | ||
| to making privisioning and load scheduling decisions. | ||
| Effectively it is a control plane part of the cluster. |
Get rid of this line. What you wrote above covers it?
|
|
||
| ## Architecture | ||
|
|
||
| Clowder cluster is a system comprised of several parts. |
This is a bit redundant. Maybe, something like:
"The architecture of a Clowder system is composed of components in several layers:"
And then you can get rid of "There are several layers..."
| Clowder cluster is a system comprised of several parts. | ||
|
|
||
| There are several layers to it: | ||
| - physical, representing individual computers (can be VMs with dedicated |
I like the physical/cluster/logical split, though we can tighten the language up a bit. I don't think it matters so much, but I am putting on a marketing hat and thinking about how people will react.
First, physical and logical make sense. I like those.
"cluster" on the other hand, is a word we already use. What is the purpose of the k8s cluster here? Distributed workload orchestration, I believe? So maybe we should call it the "workload orchestration" layer? Or the "workload platform" layer (I like that even better)? That is its job.
Only minor suggestion for the "physical" line.
Can we take out the brackets, and tighten it up? So something like:
- physical layer: compute nodes on which workloads run. Can be VMs or bare metal, with specific hardware configuration, such as RAM, CPU, accelerators etc.
| 5. Scheduler (evaluates request according to system state and decides on the | ||
| optimal route). Pluggable (add adapters to support llm-d filters and | ||
| sorters). | ||
| 6. Inventory manager: keeps state of all available resources. |
When you say "all available resources", you mean physical resources? Stuff at the physical layer? So then let's be explicit about it.
| optimal route). Pluggable (add adapters to support llm-d filters and | ||
| sorters). | ||
| 6. Inventory manager: keeps state of all available resources. | ||
| 7. Model manager: manages collection of model metadata. |
What is the job of the model manager? When we say, "manages collection of model metadata", is that how it does it? What is its job? I think it is "manages state of all models", to parallel the inventory manager?
| sorters). | ||
| 6. Inventory manager: keeps state of all available resources. | ||
| 7. Model manager: manages collection of model metadata. | ||
| 8. Scaler: analyzes demand (from analytics) and modifies cluster accordingly. |
I think we can set this as "analyzes inference utilization and modifies cluster accordingly." No need to write here where it comes from. Also, you might want to mark that this too is pluggable for algorithms.
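To make the "pluggable for algorithms" idea concrete, here is a minimal sketch of what it could look like. Nothing below comes from the PR: `Utilization`, `ScalingAlgorithm`, `ThresholdAlgorithm`, `Scaler`, and the threshold values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Utilization:
    """Hypothetical snapshot of inference utilization for one model."""
    model: str
    replicas: int
    busy_fraction: float  # 0.0 .. 1.0, averaged across replicas


class ScalingAlgorithm(Protocol):
    """Plug-in point: any object with this method can drive the scaler."""

    def desired_replicas(self, u: Utilization) -> int: ...


class ThresholdAlgorithm:
    """Illustrative algorithm: step up or down by one replica at fixed thresholds."""

    def __init__(self, high: float = 0.8, low: float = 0.3, min_replicas: int = 1):
        self.high = high
        self.low = low
        self.min_replicas = min_replicas

    def desired_replicas(self, u: Utilization) -> int:
        if u.busy_fraction > self.high:
            return u.replicas + 1
        if u.busy_fraction < self.low and u.replicas > self.min_replicas:
            return u.replicas - 1
        return u.replicas


class Scaler:
    """Analyzes utilization and produces a plan; the algorithm is injected."""

    def __init__(self, algorithm: ScalingAlgorithm):
        self.algorithm = algorithm

    def plan(self, snapshot: list[Utilization]) -> dict[str, int]:
        """Return model -> replica delta, only for models that need a change."""
        changes: dict[str, int] = {}
        for u in snapshot:
            delta = self.algorithm.desired_replicas(u) - u.replicas
            if delta != 0:
                changes[u.model] = delta
        return changes
```

In this sketch the scaler only computes a plan; handing that plan to the provisioning manager would be a separate step, which keeps the algorithm testable in isolation.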
| Can call provisioning manager. | ||
| 9. API service: exposes operational APIs to internal components (e.g., Scaler, Model Manager, Inventory Manager) and for direct system manipulation by administrators. | ||
| 10. Storage manager - fetches and stores model data. | ||
| 11. Runtime engine(-s): performs inference using local model and exposes |
"Runtime engine(s)" (remove the -)
I think you can stop after "using local model and exposes an API." Whether that is the OpenAI API or something else doesn't matter much, since we have other layers there that can mediate.
| 10. Storage manager - fetches and stores model data. | ||
| 11. Runtime engine(-s): performs inference using local model and exposes | ||
| standard inference API (eg OpenAI API compatible chat completions). | ||
| 12. Aggregated API service: provides a user-facing facade for standard APIs (eg /v1/models) by aggregating data from the whole cluster, offering cluster-wide views. |
What is the difference between this and the "API Service" (item 9)?
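For reference, one possible reading of item 12 is that the aggregated service merges per-engine responses into a single cluster-wide view. The sketch below assumes each runtime engine returns an OpenAI-style `/v1/models` payload (`{"object": "list", "data": [...]}`); the function name `aggregate_models` and the de-duplication by `id` are assumptions for illustration, not something the design doc states.

```python
from typing import Iterable


def aggregate_models(per_engine: Iterable[dict]) -> dict:
    """Merge per-engine model lists into one de-duplicated, cluster-wide list.

    Assumes each input looks like an OpenAI-style /v1/models response.
    The first entry seen for a given model id wins.
    """
    merged: dict[str, dict] = {}
    for response in per_engine:
        for model in response.get("data", []):
            merged.setdefault(model["id"], model)
    # Sort by id so the cluster-wide view is stable across calls.
    return {"object": "list", "data": [merged[k] for k in sorted(merged)]}
```

Under that reading, item 9 would cover operational APIs for internal components and administrators, while item 12 would only re-expose standard user-facing endpoints with cluster-wide data.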
Initial commit of the Clowd Control software. Squashed from a relatively finished tree of very experimental commits just to have something shared. (The commit tree was cut at a somewhat arbitrary point; there are a few more changesets in different states of "wip", which I will push later, if ever.)
The important part is `docs/DESIGN.md`.
Namespacing still references the `aifoundry-org` GitHub organization.
The AI was used as a "typing assistant" to write code, but all the design decisions, and therefore all architectural kludges and bugs, are purely mine.