
perf: optimize cosine similarity with SIMD #404

@bug-ops

Description


Problem

The hand-rolled cosine similarity loop runs 44,800 inner-loop iterations per skill-match query (50 skills × 896-dim embeddings), each performing three multiply-accumulates, with no SIMD optimization.

File: crates/zeph-skills/src/matcher.rs lines 145-167

Current code:

pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let mut dot = 0.0_f32;
    let mut norm_a = 0.0_f32;
    let mut norm_b = 0.0_f32;
    for (x, y) in a.iter().zip(b.iter()) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

Impact

  • CPU: measurable per-query overhead once more than 20 skills are matched
  • LLVM auto-vectorization of the current scalar loop is not guaranteed (see the sketch after this list)
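
For context on why auto-vectorization often fails here: the three running scalar sums create loop-carried dependencies, and LLVM will not reorder float additions without fast-math flags. A dependency-free rewrite that splits each reduction into independent lanes vectorizes more reliably. This is only a hypothetical sketch for comparison (the function name and lane width are made up), not the proposed fix:

pub fn cosine_similarity_chunked(a: &[f32], b: &[f32]) -> f32 {
    const LANES: usize = 8;
    let len = a.len().min(b.len());
    let chunked = len - len % LANES;

    // Independent per-lane accumulators let LLVM keep the three reductions
    // in vector registers instead of serializing on a single scalar sum.
    let mut dot = [0.0_f32; LANES];
    let mut norm_a = [0.0_f32; LANES];
    let mut norm_b = [0.0_f32; LANES];

    for (xs, ys) in a[..chunked]
        .chunks_exact(LANES)
        .zip(b[..chunked].chunks_exact(LANES))
    {
        for i in 0..LANES {
            dot[i] += xs[i] * ys[i];
            norm_a[i] += xs[i] * xs[i];
            norm_b[i] += ys[i] * ys[i];
        }
    }

    let mut d: f32 = dot.iter().sum();
    let mut na: f32 = norm_a.iter().sum();
    let mut nb: f32 = norm_b.iter().sum();

    // Scalar tail for lengths that are not a multiple of LANES.
    for (x, y) in a[chunked..len].iter().zip(b[chunked..len].iter()) {
        d += x * y;
        na += x * x;
        nb += y * y;
    }

    d / (na.sqrt() * nb.sqrt())
}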

Solution

Use a SIMD-optimized library:

[dependencies]
simsimd = "6"  # or ndarray = "0.16"

Expected speedup: 5-10× on large vectors
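
A rough sketch of the simsimd route, assuming the crate's SpatialSimilarity trait where f32::cosine returns the cosine distance (1 - similarity) as an Option<f64>; the exact signature should be verified against the crate docs:

use simsimd::SpatialSimilarity;

pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    // Assumption: simsimd computes the cosine *distance* and returns None
    // when the slices differ in length.
    let distance = f32::cosine(a, b).expect("embedding dimensions must match");
    (1.0 - distance) as f32
}

If ndarray is preferred instead, ArrayView1::from(a).dot(&ArrayView1::from(b)) covers the dot product, and the two norms can be built the same way. Either way, a test comparing the new result against the existing scalar implementation should gate the switch.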

Priority: P1
Effort: Medium (3-4 hours, testing required)
Related to #391

Metadata

Labels: performance (Performance optimization)
