
Benchmark module for models and hardware #1937

@ramonpzg

Description

Model Hub Benchmarks for Cortex

Models to Test

  1. DeepSeek R1 Unsloth distills: Qwen (1.5B, 7B, 14B, 32B) and Llama (70B)
  2. Tulu
  3. Mistral Small
  4. Llama 3.3 series

Specs

  • Single GPU
    • consumer-level hardware
    • constrained hardware environments
      • Raspberry Pi, Orange Pi, Turing Pi (possibly as a cluster)
      • Arduino
  • Multi-GPU setup on a single machine
  • Different GPUs
  • RAM variations
  • VRAM variations

1. Model Initialization

Loading Performance

  • Disk to RAM loading time
  • Single GPU loading time
  • Multi-GPU loading time and scaling efficiency
  • Model switching overhead
  • Initialization memory spike
  • Warm vs cold start times
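
As a first pass, cold vs. warm load times could be captured by timing the CLI, as in the sketch below. It assumes the `cortex models start` / `cortex models stop` verbs and a placeholder model ID; check `cortex --help` on your build.

```python
import subprocess
import time

MODEL = "llama3.3:70b"  # placeholder model ID

def timed(cmd: list[str]) -> float:
    """Run a Cortex CLI command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# Cold start: weights come off disk into RAM/VRAM.
cold = timed(["cortex", "models", "start", MODEL])
timed(["cortex", "models", "stop", MODEL])

# Warm start: the OS page cache likely still holds the weights.
warm = timed(["cortex", "models", "start", MODEL])
print(f"cold load: {cold:.2f}s   warm load: {warm:.2f}s")
```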

2. Runtime Performance

Core Inference Metrics

  • Time to first token (latency)
  • Tokens per second (throughput)
  • Token generation consistency (variance)
  • Streaming performance
  • Response quality vs speed tradeoffs
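
Most of the metrics above can be derived from one streamed request against Cortex's OpenAI-compatible chat endpoint. A minimal sketch (the port and model ID are assumptions for illustration):

```python
import json
import time

import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # assumed local Cortex server; adjust to your setup
DEFAULT_MODEL = "llama3.3:70b"          # placeholder model ID

def stream_metrics(prompt: str, model: str = DEFAULT_MODEL) -> tuple[float, float]:
    """Return (time_to_first_token_s, tokens_per_second) for one streamed completion."""
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={"model": model, "stream": True,
              "messages": [{"role": "user", "content": prompt}]},
        stream=True,
        timeout=600,
    )
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        if chunk["choices"][0]["delta"].get("content"):
            n_chunks += 1
            if first_token_at is None:
                first_token_at = time.perf_counter()
    end = time.perf_counter()
    if first_token_at is None:  # no tokens came back
        return float("nan"), 0.0
    # Each SSE chunk approximates one token; use the usage field or a
    # tokenizer for exact counts.
    tps = n_chunks / (end - first_token_at) if end > first_token_at else 0.0
    return first_token_at - start, tps

if __name__ == "__main__":
    ttft, tps = stream_metrics("Explain KV caching in two sentences.")
    print(f"TTFT: {ttft * 1000:.0f} ms   decode: {tps:.1f} tok/s")
```

Run-to-run variance falls out of the same helper: repeat the call N times and report the spread of the two numbers.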

Context Handling

  • Context window utilization impact
  • KV cache efficiency
  • Context length scaling
  • Memory usage per token
  • Context window fill rate impact
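
Context-length scaling could then be probed by growing the prompt and recording latency and throughput at each size, reusing `stream_metrics()` from the sketch above (filler text stands in for a real long-context workload):

```python
# Reuses stream_metrics() from the inference-metrics sketch above.
FILLER = "The quick brown fox jumps over the lazy dog. "  # 9 words per repeat

# Keep the largest size within the model's context window.
for target_words in (100, 1_000, 5_000, 20_000):
    prompt = FILLER * (target_words // 9) + "\nSummarize the text above in one sentence."
    ttft, tps = stream_metrics(prompt)
    print(f"~{target_words:>6} words  TTFT {ttft:6.2f}s  decode {tps:5.1f} tok/s")
```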

Batching Performance

  • Standard batch processing
  • Continuous batching efficiency
  • Batch size impact on throughput
  • Optimal batch size determination
  • Queue management efficiency
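
Since batching is handled server-side behind an OpenAI-compatible API, its client-visible effect could be probed by sweeping the number of simultaneous requests and watching aggregate throughput. A rough sketch, again reusing `stream_metrics()`:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Reuses stream_metrics() from the inference-metrics sketch above.
PROMPT = "Write a haiku about benchmarks."

for n_concurrent in (1, 2, 4, 8, 16):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        results = list(pool.map(stream_metrics, [PROMPT] * n_concurrent))
    elapsed = time.perf_counter() - start
    # Summing per-request decode rates roughly approximates aggregate throughput
    # while the requests overlap.
    agg_tps = sum(tps for _, tps in results)
    print(f"{n_concurrent:2d} concurrent: {elapsed:5.1f}s wall, ~{agg_tps:.0f} tok/s aggregate")
```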

3. Resource Utilization

Memory Management

  • Peak memory usage
  • Memory usage over time
  • Memory growth patterns
  • Cache efficiency
  • Memory cleanup behavior
  • Memory fragmentation
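
Peak memory and growth over time could be sampled from the server process with `psutil` while a workload runs; matching on the process name is an assumption, so adjust it to however the Cortex server shows up on your system:

```python
import time

import psutil

def find_server() -> psutil.Process:
    """Locate the Cortex server process by name (adjust to your install)."""
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] and "cortex" in proc.info["name"].lower():
            return proc
    raise RuntimeError("cortex process not found")

server = find_server()
samples = []
for _ in range(60):  # sample RSS once a second for a minute
    samples.append(server.memory_info().rss)
    time.sleep(1)

mb = 1024 * 1024
print(f"peak RSS:  {max(samples) / mb:.0f} MiB")
print(f"growth:    {(samples[-1] - samples[0]) / mb:+.0f} MiB over the window")
```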

Hardware Utilization

  • CPU core scaling efficiency
  • GPU memory bandwidth
  • PCIe bandwidth usage
  • CPU-GPU transfer overhead
  • Multi-GPU scaling
  • Temperature monitoring
  • Power consumption patterns
  • Hardware-specific optimizations
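
On NVIDIA hardware, utilization, VRAM, temperature, and power draw can be polled through `nvidia-smi`'s query interface while a benchmark runs; a small sampler:

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"]

def gpu_sample() -> dict:
    """Return one telemetry sample for GPU 0 as a dict of floats.
    Note: some GPUs report [N/A] for power.draw, which would need handling."""
    out = subprocess.check_output(QUERY, text=True).splitlines()[0]
    util, mem, temp, power = (float(v) for v in out.split(", "))
    return {"util_pct": util, "vram_mib": mem, "temp_c": temp, "power_w": power}

print(gpu_sample())
```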

4. Advanced Processing Scenarios

Model Sharing

  • Multi-model GPU sharing (layer allocation)
  • Resource isolation effectiveness
  • Inter-model interference
  • Memory sharing efficiency

Concurrency

  • Multi-user performance
  • Request queuing behavior
  • Thread scaling efficiency
  • Resource contention handling
  • Load balancing effectiveness
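
Multi-user behavior could be summarized as latency percentiles under a fixed number of simulated users, e.g. (reusing `stream_metrics()`):

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

# Reuses stream_metrics() from the inference-metrics sketch above.
N_USERS, REQUESTS_PER_USER = 8, 5
prompts = ["Summarize RAID levels briefly."] * (N_USERS * REQUESTS_PER_USER)

with ThreadPoolExecutor(max_workers=N_USERS) as pool:
    ttfts = [ttft for ttft, _ in pool.map(stream_metrics, prompts)]

# statistics.quantiles with n=20 yields 19 cut points; the last is the p95 boundary.
print(f"TTFT p50: {statistics.median(ttfts):.2f}s  "
      f"p95: {statistics.quantiles(ttfts, n=20)[-1]:.2f}s")
```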

5. Runtime Environment Comparison

Implementation Comparison

  • llama.cpp vs Python runtime
  • Framework overhead comparison
  • API efficiency
  • Integration overhead

Precision and Quantization

  • Full precision performance
  • Various quantization levels
  • Accuracy vs speed tradeoffs
  • Memory savings vs performance impact
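
The speed side of the quantization tradeoff could be swept by re-running the same probe against differently quantized builds of one model; the tags below are illustrative, not real Cortex model IDs:

```python
# Reuses stream_metrics() from the inference-metrics sketch above.
PROMPT = "Explain quicksort in one paragraph."

for variant in ("llama3.3:70b-q4_K_M", "llama3.3:70b-q8_0", "llama3.3:70b-fp16"):
    ttft, tps = stream_metrics(PROMPT, model=variant)
    print(f"{variant:25s} TTFT {ttft:5.2f}s  decode {tps:5.1f} tok/s")
```

Accuracy would need a separate evaluation harness (e.g. perplexity or task suites); this only captures throughput and latency.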

6. Workload-Specific Performance

Task-Specific Metrics

  • Short vs long prompt handling
  • Code generation performance
  • Mathematical computation speed
  • Multi-language performance
  • System prompt impact
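
Task sensitivity could be captured with a small labelled prompt suite pushed through the same probe:

```python
# Reuses stream_metrics() from the inference-metrics sketch above.
SUITE = {
    "short":        "What is 2 + 2?",
    "long":         "Write a 500-word essay on the history of the transistor.",
    "code":         "Write a Python function that merges two sorted lists.",
    "math":         "Compute the derivative of x**3 * sin(x) step by step.",
    "multilingual": "Translate 'good morning' into French, German, and Japanese.",
}

for task, prompt in SUITE.items():
    ttft, tps = stream_metrics(prompt)
    print(f"{task:12s} TTFT {ttft:5.2f}s  decode {tps:5.1f} tok/s")
```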

Real-World Patterns

  • Mixed workload handling
  • Interruption recovery
  • Context switching efficiency
  • Session management
  • Long-running stability
  • Error handling and recovery

7. System Integration

Network Performance

  • API latency
  • Bandwidth utilization
  • Connection management
  • WebSocket performance
  • Request queue behavior

Infrastructure Integration

  • Inter-process communication
  • Plugin/extension impact
  • System integration efficiency
  • Monitoring overhead
  • Logging impact

8. Reliability and Stability

Long-Term Performance

  • Performance degradation patterns
  • Memory leak detection
  • Error rates and types
  • Recovery behavior
  • Thermal throttling impact
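
Leak detection can be reduced to checking whether server RSS keeps climbing across many identical requests (a roughly flat trend is the healthy case); this reuses `find_server()` and `stream_metrics()` from the sketches above:

```python
import statistics

# Reuses find_server() and stream_metrics() from the sketches above.
server = find_server()
rss = []
for _ in range(200):  # many identical requests
    stream_metrics("Ping.")
    rss.append(server.memory_info().rss)

# Simple drift check: compare mean RSS of the first and last quarters of the run.
q = len(rss) // 4
drift = statistics.mean(rss[-q:]) - statistics.mean(rss[:q])
print(f"RSS drift over run: {drift / (1024 * 1024):+.1f} MiB")
```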

Energy Efficiency

  • Inference power consumption
  • Idle power usage
  • CPU-only vs GPU power profiles
  • Performance per watt
  • Cooling requirements
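
Performance per watt could be approximated by sampling power draw during a generation and dividing tokens by joules; a rough sketch reusing `gpu_sample()` and `stream_metrics()`:

```python
import threading
import time

# Reuses gpu_sample() and stream_metrics() from the sketches above.
power_samples = []
stop = threading.Event()

def poll_power():
    while not stop.is_set():
        power_samples.append(gpu_sample()["power_w"])
        time.sleep(0.2)

t = threading.Thread(target=poll_power)
t.start()
start = time.perf_counter()
_, tps = stream_metrics("Write a short story about a lighthouse keeper.")
elapsed = time.perf_counter() - start
stop.set()
t.join()

avg_watts = sum(power_samples) / len(power_samples)
tokens = tps * elapsed  # rough: decode rate * wall time, overcounts the prefill phase
print(f"~{tokens / (avg_watts * elapsed):.2f} tokens per joule "
      f"({avg_watts:.0f} W average draw)")
```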

9. Scaling Characteristics

Resource Scaling

  • Model size vs performance
  • Quantization level impact
  • Context length scaling
  • Batch size optimization
  • Multi-instance behavior
  • Resource allocation efficiency

Task

  • Reformat this ticket to reflect the requirements
  • Add a `cortex benchmark` command
  • [ ]
