Conversation
@vidas vidas commented Jun 24, 2025

Initial commit of the ClowdControl software. Squashed from a relatively finished tree of very experimental commits, just to have something to share. (The commit tree was cut at a somewhat arbitrary point; there are a few more changesets in various states of "wip", which I will push later, if ever.)

The important part is docs/DESIGN.md.

Namespacing still references aifoundry-org github organization.

AI was used as a "typing assistant" to write the code, but all the design decisions, and therefore all architectural kludges and bugs, are purely mine.

@vidas vidas self-assigned this Jun 24, 2025

@deitch deitch left a comment

Let's start with these comments at the high level. I won't go into the details of each logical component yet, until we have agreement on the overview.

@@ -0,0 +1,320 @@
# ClowdControl Design Notes

ClowdControl controller is a software used to control Clowder

"is software" - no need for "a". You might even be more explicit, like "ClowdControl controller is the key controller for the Clowder GenAI distributed inference cluster".


ClowdControl controller is a software used to control Clowder
GenAI distributed inference cluster, from managing models,
to making privisioning and load scheduling decisions.

"provisioning" - typo

ClowdControl controller is a software used to control Clowder
GenAI distributed inference cluster, from managing models,
to making privisioning and load scheduling decisions.
Effectively it is a control plane part of the cluster.

Get rid of this line; what you wrote above already covers it.


## Architecture

Clowder cluster is a system comprised of several parts.

This is a bit redundant. Maybe, something like:

"The architecture of a Clowder system is composed of components in several layers:"

And then you can get rid of "There are several layers..."

Clowder cluster is a system comprised of several parts.

There are several layers to it:
- physical, representing individual computers (can be VMs with dedicated

I like the physical/cluster/logical split, though we can tighten the language up a bit. I don't think it matters so much, but I am putting on a marketing hat and thinking about how people will react.

First, physical and logical make sense. I like those.

"cluster" on the other hand, is a word we already use. What is the purpose of the k8s cluster here? Distributed workload orchestration, I believe? So maybe we should call it the "workload orchestration" layer? Or the "workload platform" layer (I like that even better)? That is its job.

Only minor suggestion for the "physical" line.

Can we take out the brackets, and tighten it up? So something like:

- physical layer: compute nodes on which workloads run. Can be VMs or bare metal, with specific hardware configuration, such as RAM, CPU, accelerators etc.

5. Scheduler (evaluates request according to system state and decides on the
optimal route). Pluggable (add adapters to support llm-d filters and
sorters).
6. Inventory manager: keeps state of all available resources.

When you say "all available resources", you mean physical resources? Stuff at the physical layer? So then let's be explicit about it.

optimal route). Pluggable (add adapters to support llm-d filters and
sorters).
6. Inventory manager: keeps state of all available resources.
7. Model manager: manages collection of model metadata.

What is the job of the model manager? When we say, "manages collection of model metadata", is that how it does it? What is its job? I think it is "manages state of all models", to parallel the inventory manager?
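To make "manages state of all models" concrete, here is a minimal sketch of what such a registry could look like; every name and field below is a hypothetical illustration, not the actual ClowdControl schema:

```python
from dataclasses import dataclass, field

# Hypothetical per-model state the Model Manager might keep,
# paralleling how the Inventory Manager keeps resource state.
@dataclass
class ModelState:
    name: str                   # model identifier, e.g. "org/model"
    revision: str               # pinned version of the weights
    size_bytes: int             # useful for storage/placement decisions
    nodes: list[str] = field(default_factory=list)  # where it is loaded

registry: dict[str, ModelState] = {}

def upsert(state: ModelState) -> None:
    """Insert or replace a model's state in the registry."""
    registry[state.name] = state

upsert(ModelState("example/model", "rev-1", 7_000_000_000))
print(len(registry))  # prints 1
```

The point of the sketch is only that the Model Manager's job reads naturally as "keeps the state of all models", with metadata management as the mechanism rather than the job itself.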

sorters).
6. Inventory manager: keeps state of all available resources.
7. Model manager: manages collection of model metadata.
8. Scaler: analyzes demand (from analytics) and modifies cluster accordingly.

I think we can set this as "analyzes inference utilization and modifies cluster accordingly." No need to write here where it comes from. Also, you might want to mark that this too is pluggable for algorithms.
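The pluggable filter/sorter idea mentioned for the Scheduler (and suggested here for the Scaler's algorithms) can be sketched minimally; everything below, including the two example plugins, is a hypothetical illustration rather than the actual ClowdControl code:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    name: str
    free_vram_gb: float
    queue_len: int

# A "filter" drops nodes that cannot serve the request;
# a "sorter" ranks the survivors. Both are pluggable, so
# llm-d-style filters/sorters could be wrapped as adapters.
Filter = Callable[[Node], bool]
Sorter = Callable[[Node], float]

class Scheduler:
    def __init__(self, filters: List[Filter], sorter: Sorter):
        self.filters = filters
        self.sorter = sorter

    def route(self, nodes: List[Node]) -> Optional[Node]:
        # Keep only nodes that pass every filter, then pick the
        # best-ranked candidate (lowest sorter score wins).
        candidates = [n for n in nodes if all(f(n) for f in self.filters)]
        if not candidates:
            return None
        return min(candidates, key=self.sorter)

# Example plugins: require 8 GB free VRAM, prefer the shortest queue.
sched = Scheduler(
    filters=[lambda n: n.free_vram_gb >= 8.0],
    sorter=lambda n: n.queue_len,
)
nodes = [Node("a", 4.0, 0), Node("b", 16.0, 3), Node("c", 24.0, 1)]
print(sched.route(nodes).name)  # prints "c"
```

The same filter-then-rank shape would let the Scaler accept pluggable demand-analysis algorithms without changing its interface.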

Can call provisioning manager.
9. API service: exposes operational APIs to internal components (e.g., Scaler, Model Manager, Inventory Manager) and for direct system manipulation by administrators.
10. Storage manager - fetches and stores model data.
11. Runtime engine(-s): performs inference using local model and exposes

"Runtime engine(s)" (remove the -)

I think you can stop after "using local model and exposes an API." If that is OpenAPI or something else doesn't matter much, since we have other layers there that can mediate.

10. Storage manager - fetches and stores model data.
11. Runtime engine(-s): performs inference using local model and exposes
standard inference API (eg OpenAI API compatible chat completions).
12. Aggregated API service: provides a user-facing facade for standard APIs (eg /v1/models) by aggregating data from the whole cluster, offering cluster-wide views.

What is the difference between this and the "API Service" (item 9)?
