A custom Kubernetes scheduler extension that optimizes GPU workload placement based on network topology constraints, designed for high-performance GPU clusters. The scheduler respects the physical network topology of a leaf-spine architecture, improving workload performance by up to 30% through smart placement decisions.
- 🎯 Topology-aware scheduling for GPU workloads
- 🔄 Smart domain selection based on job size
- 🔁 Automatic recovery with topology constraints
- 🧩 Anti-fragmentation mechanisms
- 📊 Real-time cluster state monitoring
- Scheduler: Optimizes GPU workload placement considering network topology
- Domain Manager: Manages network domains and node relationships
- Plugin: Kubernetes scheduler plugin implementation
- Metrics: Prometheus metrics for monitoring
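The Plugin component builds on the standard kube-scheduler framework. As a rough illustration only (not the repo's actual code; the `TopologyScore` type and its domain map are hypothetical), a topology-aware `Score` plugin might look like this:

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// TopologyScore is a hypothetical sketch of the Plugin component: it scores
// nodes higher when they sit in the pod's preferred network domain.
type TopologyScore struct {
	// nodeDomain maps node name -> leaf domain; in the real scheduler this
	// state would come from the Domain Manager.
	nodeDomain map[string]string
}

var _ framework.ScorePlugin = &TopologyScore{}

func (t *TopologyScore) Name() string { return "TopologyScore" }

// Score gives the maximum score to nodes in the pod's preferred domain
// (taken from the topology.scheduler/preferred-domain annotation).
func (t *TopologyScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	preferred := pod.Annotations["topology.scheduler/preferred-domain"]
	if preferred != "" && t.nodeDomain[nodeName] == preferred {
		return framework.MaxNodeScore, nil
	}
	return 0, nil
}

func (t *TopologyScore) ScoreExtensions() framework.ScoreExtensions { return nil }
```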
```bash
# Clone the repository
git clone https://github.com/nod-ai/topology-aware-scheduler

# Build
make build

# Deploy
kubectl apply -f deploy/
```
- `scheduler.go`: Core scheduling logic
- `domain.go`: Domain management
- `topology.go`: Network topology handling
- `recovery.go`: Failure recovery mechanisms
To use the scheduler, set `schedulerName: topology-aware-scheduler` in your pod spec.

Configuration:
```yaml
apiVersion: topology.scheduler/v1alpha1
kind: TopologySchedulerConfig
metadata:
  name: topology-scheduler-config
spec:
  scoringWeights:
    resourceAvailability: 0.4
    topologyAlignment: 0.3
    domainUtilization: 0.2
    historicalPerformance: 0.1
```
Metrics are available at `/metrics`:

- `topology_scheduler_latency_seconds`
- `topology_domain_utilization_ratio`
- `topology_gpu_allocation_ratio`
- `topology_placement_decisions_total`
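The endpoint is served the usual way for Go services, via the standard Prometheus client. A minimal sketch, assuming the default registry (the port here is an assumption, not the scheduler's documented default):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Expose the default Prometheus registry at /metrics.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil) // port is an assumption
}
```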
```bash
# Start the scheduler
./bin/scheduler --kubeconfig=config --scheduler-name=topology-aware-scheduler

# Start the controller
./bin/controller --kubeconfig=config
```
```bash
# Unit tests
go test ./pkg/...

# Integration tests
go test ./test/integration

# Coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```
```bash
make build
make docker-build
make deploy
```
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: single-gpu-job
  annotations:
    topology.scheduler/gpu-count: "1"
spec:
  template:
    spec:
      schedulerName: topology-aware-scheduler
      containers:
        - name: gpu-job
          image: gpu-workload:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never  # required for Jobs
```
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: multi-gpu-job
  annotations:
    topology.scheduler/gpu-count: "8"
    topology.scheduler/preferred-domain: "leaf-1"
spec:
  template:
    spec:
      schedulerName: topology-aware-scheduler
      containers:
        - name: gpu-job
          image: gpu-workload:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 8
      restartPolicy: Never  # required for Jobs
```
- Kubernetes 1.24+
- Go 1.20+
- Docker
- Access to a GPU cluster
- `kubectl` configured with cluster access
- Clone the repository:

  ```bash
  git clone https://github.com/nod-ai/topology-aware-scheduler
  cd topology-aware-scheduler
  ```

- Install dependencies:

  ```bash
  go mod download
  ```

- Build:

  ```bash
  make build
  ```

- Deploy using the provided script:

  ```bash
  ./scripts/deploy.sh
  ```
The scheduler configuration is managed through a ConfigMap. Here's an example configuration:
```yaml
apiVersion: topology.scheduler/v1alpha1
kind: TopologySchedulerConfig
metadata:
  name: topology-scheduler-config
spec:
  scoringWeights:
    resourceAvailability: 0.4
    topologyAlignment: 0.3
    domainUtilization: 0.2
    historicalPerformance: 0.1
  topologyConstraints:
    maxNodesPerLeaf: 4
    maxGPUsPerLeaf: 32
```
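The scoring weights combine four per-node sub-scores into one final score. A minimal sketch of the arithmetic, assuming each sub-score is normalized to [0, 1] (the function and field names here are illustrative, not the repo's API):

```go
package main

import "fmt"

// Weights mirrors the scoringWeights block above.
type Weights struct {
	ResourceAvailability  float64
	TopologyAlignment     float64
	DomainUtilization     float64
	HistoricalPerformance float64
}

// finalScore combines four sub-scores, each assumed to be in [0, 1],
// into a single weighted node score.
func finalScore(w Weights, resource, topology, domain, history float64) float64 {
	return w.ResourceAvailability*resource +
		w.TopologyAlignment*topology +
		w.DomainUtilization*domain +
		w.HistoricalPerformance*history
}

func main() {
	w := Weights{0.4, 0.3, 0.2, 0.1}
	// A node with plenty of free GPUs and perfect topology alignment:
	fmt.Printf("score: %.2f\n", finalScore(w, 0.9, 1.0, 0.5, 0.7)) // score: 0.83
}
```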
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
  annotations:
    topology.scheduler/gpu-count: "8"
    topology.scheduler/preferred-domain: "leaf-1"
spec:
  template:
    spec:
      schedulerName: topology-aware-scheduler
      containers:
        - name: gpu-container
          image: gpu-workload:latest
          resources:
            limits:
              nvidia.com/gpu: 8
      restartPolicy: Never  # required for Jobs
```
The scheduler enforces the following placement rules:
- 2 nodes → Same leaf domain
- 4 nodes → Complete leaf domain
- 8 nodes → Two adjacent leaves
- 16 nodes → Four adjacent leaves
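These rules follow directly from the 4-nodes-per-leaf constraint in the configuration above; a sketch of the arithmetic (the helper name is illustrative):

```go
package main

import "fmt"

// leavesNeeded sketches the placement rules above: with maxNodesPerLeaf = 4,
// jobs of up to 4 nodes stay inside one leaf, and larger jobs span
// ceil(nodes / 4) adjacent leaves (8 -> 2, 16 -> 4).
func leavesNeeded(nodes int) int {
	const nodesPerLeaf = 4
	return (nodes + nodesPerLeaf - 1) / nodesPerLeaf
}

func main() {
	for _, n := range []int{2, 4, 8, 16} {
		fmt.Printf("%2d nodes -> %d leaf domain(s)\n", n, leavesNeeded(n))
	}
}
```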
- Scheduling latency: < 500ms
- Recovery time: < 30s
- Placement accuracy: 99.99%
The scheduler exports Prometheus metrics at `/metrics`:

- `topology_scheduler_placement_duration_seconds`
- `topology_scheduler_recovery_duration_seconds`
- `topology_scheduler_domain_fragmentation_ratio`
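A sketch of how a placement decision could be timed into the first histogram, using the standard Prometheus Go client (the wrapper function is illustrative, not the scheduler's actual code):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// placementDuration matches the first metric listed above.
var placementDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "topology_scheduler_placement_duration_seconds",
	Help: "Time taken to compute a placement decision.",
})

func init() { prometheus.MustRegister(placementDuration) }

// timedPlacement records how long a placement computation takes.
func timedPlacement(place func()) {
	timer := prometheus.NewTimer(placementDuration)
	defer timer.ObserveDuration()
	place()
}

func main() {
	timedPlacement(func() { time.Sleep(10 * time.Millisecond) }) // stand-in for real work
}
```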
For latency-sensitive inference services:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-service
  annotations:
    topology.scheduler/gpu-count: "4"
    topology.scheduler/latency-sensitive: "true"
spec:
  replicas: 1
  selector:  # selector/labels added so the Deployment is valid
    matchLabels:
      app: gpu-inference-service
  template:
    metadata:
      labels:
        app: gpu-inference-service
    spec:
      schedulerName: topology-aware-scheduler
      containers:
        - name: inference
          image: tensorflow/serving:latest
          resources:
            limits:
              nvidia.com/gpu: 4
```
For large-scale distributed training:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: multi-node-training
  annotations:
    topology.scheduler/gpu-count: "16"
    topology.scheduler/preferred-domain: "spine-1"
    topology.scheduler/network-bandwidth: "200Gb"
    topology.scheduler/placement-strategy: "consolidated"
spec:
  parallelism: 4
  completions: 4
  template:  # template completed for illustration
    spec:
      schedulerName: topology-aware-scheduler
      containers:
        - name: training
          image: gpu-training:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 4  # 4 GPUs per pod x 4 pods = 16 total
      restartPolicy: OnFailure
```
The scheduler supports various annotations to optimize placement:
| Annotation | Description | Example Value |
|---|---|---|
| `topology.scheduler/gpu-count` | Number of GPUs required | `"8"` |
| `topology.scheduler/preferred-domain` | Preferred network domain | `"leaf-1"` |
| `topology.scheduler/network-bandwidth` | Minimum network bandwidth | `"100Gb"` |
| `topology.scheduler/latency-sensitive` | Marks a latency-sensitive workload | `"true"` |
| `topology.scheduler/placement-strategy` | Placement strategy | `"consolidated"` |
The scheduler supports several placement strategies:
- **Consolidated** (`consolidated`)
  - Places all GPUs as close together as possible
  - Optimizes for inter-GPU communication
  - Best for training workloads

- **Distributed** (`distributed`)
  - Spreads GPUs across nodes
  - Optimizes for fault tolerance
  - Best for inference workloads

- **Balanced** (`balanced`)
  - Balances consolidation and distribution
  - Default strategy
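Since `balanced` is the default, the strategy lookup presumably falls back to it when the annotation is missing or unrecognized; a sketch under that assumption (the function name is illustrative):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// strategyFor maps the placement-strategy annotation to one of the three
// strategies, falling back to the default "balanced".
func strategyFor(pod *v1.Pod) string {
	switch s := pod.Annotations["topology.scheduler/placement-strategy"]; s {
	case "consolidated", "distributed", "balanced":
		return s
	default:
		return "balanced"
	}
}

func main() {
	pod := &v1.Pod{} // no annotations set
	fmt.Println(strategyFor(pod)) // balanced
}
```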
- **Deep Learning Training**
  - Use consolidated placement
  - Request high network bandwidth
  - Specify GPU count based on model size

- **Inference Services**
  - Use distributed placement
  - Enable the latency-sensitive flag
  - Consider using node anti-affinity

- **Research Workloads**
  - Use balanced placement
  - Specify a preferred domain if needed
  - Adjust based on experiment requirements
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Common issues and solutions:
- **Scheduler not starting**: Check the logs:

  ```bash
  kubectl logs -n kube-system -l app=topology-scheduler
  ```

- **Jobs not being scheduled**: Verify the scheduler configuration:

  ```bash
  kubectl get configmap -n kube-system topology-scheduler-config -o yaml
  ```
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
For support, please:
- Check the documentation
- Open an issue
- Join our Slack channel