Conversation
@vidas vidas commented Jun 24, 2025

Initial commit of the ClowdControl software. Squashed from a relatively finished tree of very experimental commits, just to have something to share. (The commit tree was cut at a somewhat arbitrary point; there are a few more changesets in various states of "wip", which I will push later, if ever.)

The important part is docs/DESIGN.md.

Namespacing still references aifoundry-org github organization.

AI was used as a "typing assistant" to write the code, but all the design decisions, and therefore all architectural kludges and bugs, are purely mine.

@vidas vidas self-assigned this Jun 24, 2025

@deitch deitch left a comment

Let's start with these comments at the high level. I won't go into the details of each logical component yet, until we have agreement on the overview.

@@ -0,0 +1,320 @@
# ClowdControl Design Notes

ClowdControl controller is a software used to control Clowder

"is software" - no need for "a". You might even be more explicit, like "ClowdControl controller is the key controller for the Clowder GenAI distributed inference cluster".


ClowdControl controller is a software used to control Clowder
GenAI distributed inference cluster, from managing models,
to making privisioning and load scheduling decisions.

"provisioning" - typo

ClowdControl controller is a software used to control Clowder
GenAI distributed inference cluster, from managing models,
to making privisioning and load scheduling decisions.
Effectively it is a control plane part of the cluster.

Get rid of this line; what you wrote above already covers it.


## Architecture

Clowder cluster is a system comprised of several parts.

This is a bit redundant. Maybe, something like:

"The architecture of a Clowder system is composed of components in several layers:"

And then you can get rid of "There are several layers..."

Clowder cluster is a system comprised of several parts.

There are several layers to it:
- physical, representing individual computers (can be VMs with dedicated

I like the physical/cluster/logical split, though we can tighten the language up a bit. I don't think it matters so much, but I am putting on a marketing hat and thinking about how people will react.

First, physical and logical make sense. I like those.

"cluster" on the other hand, is a word we already use. What is the purpose of the k8s cluster here? Distributed workload orchestration, I believe? So maybe we should call it the "workload orchestration" layer? Or the "workload platform" layer (I like that even better)? That is its job.

Only minor suggestion for the "physical" line.

Can we take out the brackets, and tighten it up? So something like:

- physical layer: compute nodes on which workloads run. Can be VMs or bare metal, with specific hardware configuration, such as RAM, CPU, accelerators etc.

5. Scheduler (evaluates request according to system state and decides on the
optimal route). Pluggable (add adapters to support llm-d filters and
sorters).
6. Inventory manager: keeps state of all available resources.

When you say "all available resources", you mean physical resources? Stuff at the physical layer? So then let's be explicit about it.

optimal route). Pluggable (add adapters to support llm-d filters and
sorters).
6. Inventory manager: keeps state of all available resources.
7. Model manager: manages collection of model metadata.

What is the job of the model manager? When we say, "manages collection of model metadata", is that how it does it? What is its job? I think it is "manages state of all models", to parallel the inventory manager?
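To make "manages state of all models" concrete, here is a minimal sketch of what such a registry could look like; every name and field below is a hypothetical illustration, not the actual ClowdControl schema:

```python
from dataclasses import dataclass, field

# Hypothetical per-model state the Model Manager might keep,
# paralleling how the Inventory Manager keeps resource state.
@dataclass
class ModelState:
    name: str                   # model identifier, e.g. "org/model"
    revision: str               # pinned version of the weights
    size_bytes: int             # useful for storage/placement decisions
    nodes: list[str] = field(default_factory=list)  # where it is loaded

registry: dict[str, ModelState] = {}

def upsert(state: ModelState) -> None:
    """Insert or replace a model's state in the registry."""
    registry[state.name] = state

upsert(ModelState("example/model", "rev-1", 7_000_000_000))
print(len(registry))  # prints 1
```

The point of the sketch is only that the Model Manager's job reads naturally as "keeps the state of all models", with metadata management as the mechanism rather than the job itself.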

sorters).
6. Inventory manager: keeps state of all available resources.
7. Model manager: manages collection of model metadata.
8. Scaler: analyzes demand (from analytics) and modifies cluster accordingly.

I think we can set this as "analyzes inference utilization and modifies cluster accordingly." No need to write here where it comes from. Also, you might want to mark that this too is pluggable for algorithms.
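The pluggable filter/sorter idea mentioned for the Scheduler (and suggested here for the Scaler's algorithms) can be sketched minimally; everything below, including the two example plugins, is a hypothetical illustration rather than the actual ClowdControl code:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    name: str
    free_vram_gb: float
    queue_len: int

# A "filter" drops nodes that cannot serve the request;
# a "sorter" ranks the survivors. Both are pluggable, so
# llm-d-style filters/sorters could be wrapped as adapters.
Filter = Callable[[Node], bool]
Sorter = Callable[[Node], float]

class Scheduler:
    def __init__(self, filters: List[Filter], sorter: Sorter):
        self.filters = filters
        self.sorter = sorter

    def route(self, nodes: List[Node]) -> Optional[Node]:
        # Keep only nodes that pass every filter, then pick the
        # best-ranked candidate (lowest sorter score wins).
        candidates = [n for n in nodes if all(f(n) for f in self.filters)]
        if not candidates:
            return None
        return min(candidates, key=self.sorter)

# Example plugins: require 8 GB free VRAM, prefer the shortest queue.
sched = Scheduler(
    filters=[lambda n: n.free_vram_gb >= 8.0],
    sorter=lambda n: n.queue_len,
)
nodes = [Node("a", 4.0, 0), Node("b", 16.0, 3), Node("c", 24.0, 1)]
print(sched.route(nodes).name)  # prints "c"
```

The same filter-then-rank shape would let the Scaler accept pluggable demand-analysis algorithms without changing its interface.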

Can call provisioning manager.
9. API service: exposes operational APIs to internal components (e.g., Scaler, Model Manager, Inventory Manager) and for direct system manipulation by administrators.
10. Storage manager - fetches and stores model data.
11. Runtime engine(-s): performs inference using local model and exposes

"Runtime engine(s)" (remove the -)

I think you can stop after "using local model and exposes an API." If that is OpenAPI or something else doesn't matter much, since we have other layers there that can mediate.

10. Storage manager - fetches and stores model data.
11. Runtime engine(-s): performs inference using local model and exposes
standard inference API (eg OpenAI API compatible chat completions).
12. Aggregated API service: provides a user-facing facade for standard APIs (eg /v1/models) by aggregating data from the whole cluster, offering cluster-wide views.

What is the difference between this and the "API Service" (item 9)?
