Model Hub Benchmarks for Cortex
Models to Test
- Unsloth DeepSeek-R1 distill series (Qwen 1.5B / 7B / 14B / 32B, Llama 70B)
- Tulu
- Mistral Small
- Llama 3.3 series
Specs
- Single GPU
  - Consumer-level hardware
  - Constrained hardware environments
    - Raspberry Pi, Orange Pi, Turing Pi (possibly as a cluster)
    - Arduino
- Multi-GPU on a single machine
- Different GPUs
- RAM variations
- VRAM variations
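To keep results comparable across this hardware matrix, every run should record the specs it executed on alongside its metrics. A minimal capture sketch, assuming `psutil` and PyTorch are available on the test host (neither is required by Cortex itself):

```python
import json
import platform

import psutil  # assumed available, for RAM/CPU introspection
import torch   # assumed available, for GPU introspection

def capture_hardware_specs() -> dict:
    """Record the hardware configuration a benchmark run executed on."""
    specs = {
        "os": platform.platform(),
        "cpu": platform.processor(),
        "cpu_cores": psutil.cpu_count(logical=False),
        "ram_gb": round(psutil.virtual_memory().total / 1e9, 1),
        "gpus": [],
    }
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            specs["gpus"].append({
                "name": props.name,
                "vram_gb": round(props.total_memory / 1e9, 1),
            })
    return specs

if __name__ == "__main__":
    print(json.dumps(capture_hardware_specs(), indent=2))
```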
1. Model Initialization
Loading Performance
- Disk to RAM loading time
- Single GPU loading time
- Multi-GPU loading time and scaling efficiency
- Model switching overhead
- Initialization memory spike
- Warm vs cold start times
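The loading metrics above could be collected by timing repeated load/unload cycles. A sketch, assuming `cortex models start` / `stop` subcommands in this form (substitute the actual CLI invocation):

```python
import statistics
import subprocess
import time

# Hypothetical CLI invocations; substitute the actual Cortex commands.
LOAD_CMD = ["cortex", "models", "start", "MODEL_ID"]
UNLOAD_CMD = ["cortex", "models", "stop", "MODEL_ID"]

def timed_load() -> float:
    """Time one load from command start to completion."""
    t0 = time.perf_counter()
    subprocess.run(LOAD_CMD, check=True)
    return time.perf_counter() - t0

def benchmark_loading(runs: int = 5) -> dict:
    cold = timed_load()                    # first load: weights read from disk
    subprocess.run(UNLOAD_CMD, check=True)
    warm = []
    for _ in range(runs):                  # later loads: OS page cache is warm
        warm.append(timed_load())
        subprocess.run(UNLOAD_CMD, check=True)
    return {"cold_s": cold,
            "warm_mean_s": statistics.mean(warm),
            "warm_stdev_s": statistics.stdev(warm)}
```

Note that a true cold start also requires the OS page cache to be dropped (or a reboot) between runs; the first-load number here only approximates it.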
2. Runtime Performance
Core Inference Metrics
- Time to first token (latency)
- Tokens per second (throughput)
- Token generation consistency (variance)
- Streaming performance
- Response quality vs speed tradeoffs
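Time to first token and throughput both fall out of a single streaming request. A sketch against an OpenAI-compatible chat completions endpoint; the URL and port are assumptions, and treating one SSE chunk as one token is an approximation:

```python
import time

import requests  # assumed; any HTTP client works

URL = "http://localhost:39281/v1/chat/completions"  # assumed host/port

def measure_stream(model: str, prompt: str) -> dict:
    """Measure time-to-first-token and generation tokens/sec for one request."""
    body = {"model": model, "stream": True,
            "messages": [{"role": "user", "content": prompt}]}
    t0 = time.perf_counter()
    ttft, chunks = None, 0
    with requests.post(URL, json=body, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line[6:] == b"[DONE]":
                break
            if ttft is None:
                ttft = time.perf_counter() - t0  # first streamed chunk
            chunks += 1
    gen_time = (time.perf_counter() - t0) - (ttft or 0.0)
    # One SSE chunk ~ one token, so chunks/gen_time approximates tok/s.
    return {"ttft_s": ttft,
            "tokens_per_s": chunks / gen_time if gen_time > 0 else 0.0}
```

Running this repeatedly and reporting the variance across runs covers the generation-consistency metric as well.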
Context Handling
- Context window utilization impact
- KV cache efficiency
- Context length scaling
- Memory usage per token
- Context window fill rate impact
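Context scaling could be probed by sweeping prompt length and sampling the server's resident memory at each step. A sketch, assuming the same endpoint as above and that the server process can be located by name (the name `cortex` is an assumption):

```python
import time

import psutil    # assumed, for resident-memory sampling
import requests  # assumed HTTP client

URL = "http://localhost:39281/v1/chat/completions"  # assumed endpoint

def server_rss_gb(name: str = "cortex") -> float:
    """Sum resident memory of processes whose name contains `name`."""
    total = 0
    for p in psutil.process_iter(["name", "memory_info"]):
        if name in (p.info["name"] or ""):
            total += p.info["memory_info"].rss
    return total / 1e9

def sweep_context(model: str, lengths=(512, 2048, 8192, 32768)):
    for n in lengths:
        prompt = "word " * n  # rough stand-in for ~n tokens of context
        t0 = time.perf_counter()
        requests.post(URL, json={"model": model, "max_tokens": 1,
                                 "messages": [{"role": "user", "content": prompt}]},
                      timeout=600).raise_for_status()
        dt = time.perf_counter() - t0
        print(f"{n:>6} tokens  prefill {dt:6.2f}s  rss {server_rss_gb():5.2f} GB")
```

The memory delta between successive lengths gives an estimate of KV-cache cost per token.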
Batching Performance
- Standard batch processing
- Continuous batching efficiency
- Batch size impact on throughput
- Optimal batch size determination
- Queue management efficiency
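From the client side, batch behavior can be approximated by firing N concurrent requests and watching aggregate throughput; the optimal batch size is where tok/s stops growing. A sketch, assuming the server reports OpenAI-style `usage` counts:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed HTTP client

URL = "http://localhost:39281/v1/chat/completions"  # assumed endpoint

def one_request(model: str) -> int:
    """Fire one request and return the completion-token count it reports."""
    r = requests.post(URL, json={"model": model, "max_tokens": 128,
                                 "messages": [{"role": "user",
                                               "content": "Count to 50."}]},
                      timeout=600)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

def sweep_batch_sizes(model: str, sizes=(1, 2, 4, 8, 16)):
    for n in sizes:
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            tokens = sum(pool.map(lambda _: one_request(model), range(n)))
        dt = time.perf_counter() - t0
        print(f"batch {n:>2}: {tokens / dt:6.1f} aggregate tok/s")
```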
3. Resource Utilization
Memory Management
- Peak memory usage
- Memory usage over time
- Memory growth patterns
- Cache efficiency
- Memory cleanup behavior
- Memory fragmentation
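Peak usage, growth patterns, and cleanup behavior can all be read off one memory-over-time curve produced by a background sampler attached to the server process. A sketch using `psutil` (assumed available; the server PID is assumed known):

```python
import threading
import time

import psutil  # assumed, for process-memory sampling

def sample_memory(pid: int, interval_s: float = 0.5):
    """Record (elapsed_s, rss_gb) for a process until the caller sets `stop`."""
    proc = psutil.Process(pid)
    samples, stop = [], threading.Event()

    def loop():
        t0 = time.perf_counter()
        while not stop.is_set():
            samples.append((time.perf_counter() - t0,
                            proc.memory_info().rss / 1e9))
            time.sleep(interval_s)

    threading.Thread(target=loop, daemon=True).start()
    return samples, stop

# Usage: start sampling, run a benchmark workload, then inspect the curve.
# samples, stop = sample_memory(pid=CORTEX_PID)  # CORTEX_PID: assumed known
# ... run workload ...
# stop.set()
# peak = max(rss for _, rss in samples)          # peak memory usage
# growth = samples[-1][1] - samples[0][1]        # residual growth (leak hint)
```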
Hardware Utilization
- CPU core scaling efficiency
- GPU memory bandwidth
- PCIe bandwidth usage
- CPU-GPU transfer overhead
- Multi-GPU scaling
- Temperature monitoring
- Power consumption patterns
- Hardware-specific optimizations
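On NVIDIA hardware, utilization, temperature, and power draw can be polled through NVML while a benchmark runs. A sketch using `pynvml` (assumed installed; AMD and Apple silicon need different tooling):

```python
import time

import pynvml  # assumed: NVIDIA-only; other vendors need different tooling

def poll_gpu(seconds: int = 10, interval_s: float = 1.0):
    """Print GPU utilization, temperature, and power draw while a benchmark runs."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    for _ in range(int(seconds / interval_s)):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle,
                                               pynvml.NVML_TEMPERATURE_GPU)
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # reported in mW
        print(f"gpu {util.gpu:3d}%  mem {util.memory:3d}%  {temp}C  {watts:.0f}W")
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
```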
4. Advanced Processing Scenarios
Model Sharing
- Multi-model GPU sharing (layer allocation)
- Resource isolation effectiveness
- Inter-model interference
- Memory sharing efficiency
Concurrency
- Multi-user performance
- Request queuing behavior
- Thread scaling efficiency
- Resource contention handling
- Load balancing effectiveness
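Multi-user behavior can be simulated by ramping the number of concurrent clients and tracking how per-request latency degrades. A sketch reusing the assumed endpoint:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed HTTP client

URL = "http://localhost:39281/v1/chat/completions"  # assumed endpoint

def timed_request(model: str) -> float:
    t0 = time.perf_counter()
    requests.post(URL, json={"model": model, "max_tokens": 64,
                             "messages": [{"role": "user", "content": "Hello"}]},
                  timeout=600).raise_for_status()
    return time.perf_counter() - t0

def simulate_users(model: str, users=(1, 4, 8, 16), requests_per_user: int = 4):
    """Report per-request latency as simulated concurrent users increase."""
    for n in users:
        with ThreadPoolExecutor(max_workers=n) as pool:
            lat = sorted(pool.map(lambda _: timed_request(model),
                                  range(n * requests_per_user)))
        p95 = lat[int(0.95 * (len(lat) - 1))]
        print(f"{n:>2} users: median {statistics.median(lat):.2f}s  p95 {p95:.2f}s")
```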
5. Runtime Environment Comparison
Implementation Comparison
- llama.cpp vs Python-based runtimes
- Framework overhead comparison
- API efficiency
- Integration overhead
Precision and Quantization
- Full precision performance
- Various quantization levels
- Accuracy vs speed tradeoffs
- Memory savings vs performance impact
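Quantization levels can be compared by running the same throughput probe across variants of one base model; the tags below are hypothetical and depend on how the hub names its GGUF builds:

```python
import time

import requests  # assumed HTTP client

URL = "http://localhost:39281/v1/chat/completions"  # assumed endpoint

# Hypothetical IDs: the actual tags depend on the hub's naming scheme.
VARIANTS = ["llama3.3:fp16", "llama3.3:q8_0", "llama3.3:q4_k_m"]

def throughput(model: str) -> float:
    """Tokens/sec for a fixed prompt, derived from reported usage counts."""
    t0 = time.perf_counter()
    r = requests.post(URL, json={"model": model, "max_tokens": 256,
                                 "messages": [{"role": "user",
                                               "content": "Explain KV caching."}]},
                      timeout=600)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"] / (time.perf_counter() - t0)

def compare_quantizations():
    for tag in VARIANTS:
        print(f"{tag:>20}: {throughput(tag):6.1f} tok/s")
    # Accuracy tradeoffs need a separate quality eval (e.g. a fixed QA set).
```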
6. Workload-Specific Performance
Task-Specific Metrics
- Short vs long prompt handling
- Code generation performance
- Mathematical computation speed
- Multi-language performance
- System prompt impact
Real-World Patterns
- Mixed workload handling
- Interruption recovery
- Context switching efficiency
- Session management
- Long-running stability
- Error handling and recovery
7. System Integration
Network Performance
- API latency
- Bandwidth utilization
- Connection management
- WebSocket performance
- Request queue behavior
Infrastructure Integration
- Inter-process communication
- Plugin/extension impact
- System integration efficiency
- Monitoring overhead
- Logging impact
8. Reliability and Stability
Long-Term Performance
- Performance degradation patterns
- Memory leak detection
- Error rates and types
- Recovery behavior
- Thermal throttling impact
Energy Efficiency
- Inference power consumption
- Idle power usage
- CPU-only vs GPU power profiles
- Performance per watt
- Cooling requirements
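Performance per watt is throughput divided by mean power draw over the same window. A sketch combining the NVML power reading with a generation request (the endpoint and NVIDIA-only power readings are assumptions, as above):

```python
import threading
import time

import pynvml    # assumed: NVIDIA-only power readings
import requests  # assumed HTTP client

URL = "http://localhost:39281/v1/chat/completions"  # assumed endpoint

def perf_per_watt(model: str) -> float:
    """Tokens per joule: mean tok/s divided by mean watts over one request."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    watts, stop = [], threading.Event()

    def sample():
        while not stop.is_set():
            watts.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000)
            time.sleep(0.2)

    threading.Thread(target=sample, daemon=True).start()
    t0 = time.perf_counter()
    r = requests.post(URL, json={"model": model, "max_tokens": 256,
                                 "messages": [{"role": "user", "content": "Hi"}]},
                      timeout=600)
    stop.set()
    r.raise_for_status()
    tok_s = r.json()["usage"]["completion_tokens"] / (time.perf_counter() - t0)
    pynvml.nvmlShutdown()
    return tok_s / (sum(watts) / len(watts))  # tokens generated per joule
```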
9. Scaling Characteristics
Resource Scaling
- Model size vs performance
- Quantization level impact
- Context length scaling
- Batch size optimization
- Multi-instance behavior
- Resource allocation efficiency
Task
- [ ] Reformat this ticket to reflect the requirements
- [ ] Cortex benchmark command