Description
The similarity score calculation in src/search.ts line 139 produces incorrect results (negative percentages, scores that carry no meaning) because it treats the Euclidean (L2) distance returned by sqlite-vec as if it were cosine distance.
Current code (src/search.ts, line 139):
similarity: mode === 'text' ? undefined : 1 - row.distance,
Problem:
sqlite-vec returns L2 (Euclidean) distance, not cosine distance. For normalized embedding vectors, the relationship between L2 distance d and cosine similarity s is:
s = 1 - (d^2 / 2)
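For reference, this identity follows from expanding the squared distance between two unit-length vectors. The symbols a and b below are introduced here for illustration and stand for the two normalized embeddings being compared:

```latex
\begin{aligned}
d^2 &= \lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b \\
    &= 1 + 1 - 2s = 2 - 2s \\
\Rightarrow\quad s &= 1 - \frac{d^2}{2}
\end{aligned}
```

For unit vectors d lies in [0, 2], so s covers the full cosine range [-1, 1], and d = 1 corresponds to s = 0.5. The identity only holds if the embeddings are actually normalized.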
The current formula 1 - d goes negative for any distance > 1, that is, whenever the true cosine similarity is below 0.5. Distances in that range are common for 384-dimensional embedding vectors, so in practice almost every search result shows a negative or near-zero similarity score.
Fix:
similarity: mode === 'text' ? undefined : 1 - (row.distance * row.distance / 2),
Before fix: Scores like 0%, -5%, -11%, -16%
After fix: Scores like 50%, 45%, 39%, 32%
The ranking order of results was unaffected (L2 and cosine ordering are monotonically related for normalized vectors), but the scores were meaningless and confusing to users.
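A minimal sketch of the conversion as a standalone helper, for comparison with the old formula. The names l2ToCosineSimilarity and naiveSimilarity are hypothetical and not part of the project's code. The sample distances are illustrative values chosen to roughly match the scores above, not measured output. The clamp is an added safeguard against floating-point drift, not something the fix itself requires.

```typescript
// Hypothetical helper, not taken from src/search.ts.
// Assumes both vectors are unit-length, so the L2 distance d lies in [0, 2].
function l2ToCosineSimilarity(distance: number): number {
  const s = 1 - (distance * distance) / 2;
  // Clamp to [-1, 1] in case rounding pushes d slightly outside [0, 2].
  return Math.max(-1, Math.min(1, s));
}

// The old behaviour, kept only for side-by-side comparison.
function naiveSimilarity(distance: number): number {
  return 1 - distance;
}

// Illustrative distances (assumed, not measured).
const sampleDistances: number[] = [1.0, 1.05, 1.11, 1.16];

for (const d of sampleDistances) {
  const before = (naiveSimilarity(d) * 100).toFixed(0);
  const after = (l2ToCosineSimilarity(d) * 100).toFixed(0);
  console.log(`d=${d}  before=${before}%  after=${after}%`);
}
```

Both functions decrease as d grows for d >= 0, which is why the result order is the same under either formula and only the displayed scores change.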
Tested on v1.0.15 with all-MiniLM-L6-v2 embeddings against ~235 indexed conversations.