-
Notifications
You must be signed in to change notification settings - Fork 568
fix: improve face clustering accuracy with similarity threshold and p… #771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: improve face clustering accuracy with similarity threshold and p… #771
Conversation
…ost-clustering merge - Added similarity_threshold parameter (0.85) to filter dissimilar faces before clustering - Implemented post-clustering merge to combine duplicate clusters of same person - Added quality filtering for embeddings (std and norm checks) - Improved distance calculation using precomputed cosine distances - Enhanced logging for better debugging and monitoring - Prevents side-face false positives while maintaining high accuracy
|
|
📝 WalkthroughWalkthroughAdds embedding validation, cosine-distance precomputed matrices for DBSCAN, similarity and merge thresholds, a post-clustering cluster-merge step, and expanded logging; updates clustering/assignment public signatures. Also adds Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/utils/face_clusters.py(4 hunks)
🔇 Additional comments (4)
backend/app/utils/face_clusters.py (4)
211-229: LGTM - Precomputed distance approach is sound.The similarity threshold filtering correctly marks dissimilar faces (distance > 0.15) with distance 1.0, which exceeds
eps=0.75and prevents them from clustering together.Consider moving the import to the top of the file for consistency, though it's acceptable here.
264-266: LGTM - Post-clustering merge step.The merge step with threshold 0.7 is appropriate for consolidating clusters that DBSCAN may have over-segmented.
373-391: Transitive merging not handled - verify if intentional.The merge logic skips clusters already in
merge_mapping, but this misses transitive relationships. For example: if A≈B and B≈C, then B→A is recorded, but C won't merge to A because B is skipped when checking C.If transitive merging is desired, resolve the target cluster:
for i, uuid1 in enumerate(cluster_uuids): if uuid1 in merge_mapping: - continue # Already merged + continue for j in range(i + 1, len(cluster_uuids)): uuid2 = cluster_uuids[j] if uuid2 in merge_mapping: continue # Already merged # Calculate similarity between cluster means - emb1 = cluster_means[uuid1].reshape(1, -1) + # Resolve target for uuid1 in case of prior merges + target_uuid1 = merge_mapping.get(uuid1, uuid1) + emb1 = cluster_means[target_uuid1].reshape(1, -1) emb2 = cluster_means[uuid2].reshape(1, -1)Alternatively, if the current behavior is intentional (to limit merge chains), add a comment explaining the design choice.
270-272: LGTM - Stricter threshold for cluster assignment.The increased
similarity_thresholddefault of 0.8 aligns with the PR objective to prevent side-face false positives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/utils/face_clusters.py (1)
258-275: Docstring default doesn’t match the function default.
Signature defaultssimilarity_thresholdto0.8(Line 259) but the doc says “Default: 0.7” (Line 274). Please sync them.- Higher values = more strict assignment. Default: 0.7 + Higher values = more strict assignment. Default: 0.8
🧹 Nitpick comments (3)
backend/app/utils/face_clusters.py (3)
9-11: Combine the sklearn pairwise imports (small cleanup).
Not required, but you can dofrom sklearn.metrics.pairwise import cosine_distances, cosine_similarityin one line for readability.
200-207: O(n²) distance matrix can blow up memory for large face sets.
cosine_distances(embeddings_array)allocates an NxN float matrix. If this can run on thousands+ of faces, consider chunking/approximate neighbors, blocking, or capping inputs per run.
361-400: Greedy merge can miss transitive merges; consider union-find/connected components.
Current mapping merges only when (uuid1, uuid2) directly passes threshold. If AB and BC but A !~ C, C won’t merge (depending on iteration). If the intent is “merge any clusters connected by similarity>=threshold”, use a DSU/union-find (or build a similarity graph and take connected components).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/utils/face_clusters.py(5 hunks)
🔇 Additional comments (1)
backend/app/utils/face_clusters.py (1)
200-218: DBSCAN precomputed distances: verify API + noteepsmay become ineffective.
After thresholding, distances are essentially{<= (1-similarity_threshold)} ∪ {1.0}; ifepsis much larger than(1-similarity_threshold), changingepswon’t meaningfully change neighbors. Also please verify scikit-learn 1.5.1 supportsDBSCAN(metric="precomputed", n_jobs=-1)as used here.
| # Post-clustering merge: merge similar clusters based on representative faces | ||
| results = _merge_similar_clusters(results, merge_threshold=0.7) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded merge_threshold=0.7 conflicts with other defaults and likely harms accuracy.
You’re clustering with similarity_threshold=0.85, but then merging clusters at 0.7 (and the helper default is 0.85). This looks accidental and can over-merge. Suggest: use merge_threshold=similarity_threshold (or introduce an explicit parameter/plumb it through).
- results = _merge_similar_clusters(results, merge_threshold=0.7)
+ results = _merge_similar_clusters(results, merge_threshold=similarity_threshold)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Post-clustering merge: merge similar clusters based on representative faces | |
| results = _merge_similar_clusters(results, merge_threshold=0.7) | |
| # Post-clustering merge: merge similar clusters based on representative faces | |
| results = _merge_similar_clusters(results, merge_threshold=similarity_threshold) |
🤖 Prompt for AI Agents
In backend/app/utils/face_clusters.py around lines 252 to 254, the call uses a
hardcoded merge_threshold=0.7 which conflicts with the clustering
similarity_threshold (0.85) and the helper default, causing potential
over-merge; update the call to pass the active similarity_threshold (e.g.,
merge_threshold=similarity_threshold) or add a merge_threshold parameter to the
enclosing function and thread it through so the merge uses the same threshold
used for clustering (or an explicitly provided value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/utils/face_clusters.py (1)
305-323: Docstring default is stale (0.7vs actual0.8).
Line 306 setssimilarity_threshold: float = 0.8, but the docstring says “Default: 0.7”.
♻️ Duplicate comments (1)
backend/app/utils/face_clusters.py (1)
297-301: Bug/behavior mismatch: merge default ignoressimilarity_thresholdand hardcodes0.7.
Your docstring says “default None, uses similarity_threshold”, but the implementation uses0.7. This can easily over-merge relative tosimilarity_threshold=0.85and contradicts the PR intent.Apply this diff:
- effective_merge_threshold = merge_threshold if merge_threshold is not None else 0.7 + effective_merge_threshold = ( + merge_threshold if merge_threshold is not None else similarity_threshold + ) results = _merge_similar_clusters(results, merge_threshold=effective_merge_threshold)
🧹 Nitpick comments (5)
.gitignore (1)
34-34: Ignoringenv/seems fine; consider whether you also want.venv/explicitly.
You already ignorevenv/; addingenv/helps cover common virtualenv folder names. If the repo standardizes on.venv/, consider adding that too (optional).backend/app/utils/face_clusters.py (4)
165-186:_validate_embeddingis good; consider making it robust to non-ndarray inputs.
If DB returns lists/JSON,np.isfinite(embedding)can behave unexpectedly. A small hardening is toembedding = np.asarray(embedding)and guardembedding.ndim == 1.
188-202: Signature/docs look directionally right; useOptional[float]formerge_threshold.
Right now it’s annotated asfloat = None; prefermerge_threshold: Optional[float] = Nonefor type-checkers.
239-252: Distance thresholding works, but consider marking “too far” as2.0(max cosine distance) instead of1.0.
cosine_distancesranges[0, 2]. Setting “dissimilar” pairs to1.0is fine with today’seps=0.75, but ifepsis increased above1.0, those “forced-far” pairs could re-enter neighborhoods unexpectedly.Also applies to: 261-263
404-529: Merge logic is reasonable; watch O(N²) similarity computation for large cluster counts.
If cluster counts can grow large, consider vectorizing similarities (matrix) or adding a cheap prefilter (e.g., only compare candidates with nearby centroids via ANN).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
frontend/package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (3)
.gitignore(1 hunks)backend/app/utils/face_clusters.py(7 hunks)docs/backend/backend_python/openapi.json(1 hunks)
🔇 Additional comments (6)
docs/backend/backend_python/openapi.json (1)
1120-1128:allOfwrapper around$refis valid; double-check client generator compatibility.
This is OpenAPI 3.1–legal, but some generators handle “simple enum$ref” more cleanly thanallOf([$ref]). If this file is generated, this is fine; if it’s hand-maintained, ensure downstream clients still renderInputTypeas the expected enum.backend/app/utils/face_clusters.py (5)
9-11: Good: import moves to module scope (no per-call import overhead).
209-235: Embedding validation + clear logging is a solid improvement.
This should prevent NaN/zero-vector corruption from silently degrading clustering.
337-359: Validation of cluster means before assignment is a good guardrail.
364-400: Good: skips invalid face embeddings + NaN distance guard prevents bad assignments.
534-565: Safer cosine distance computation (normalization + clipping) looks solid.
|
@keshaviscool |
- Added _validate_embedding() to check for NaN/inf/zero vectors - Guard against invalid embeddings in clustering and assignment - Safe division in cosine distance calculations - Format code with Black for lint compliance
|
@rahulharpal1603 fixed lint issues. kindly review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/app/utils/face_clusters.py (1)
308-313: Hardcoded merge threshold conflicts with comment and clustering threshold.Line 310 uses a hardcoded fallback of
0.7whenmerge_thresholdisNone, but:
- The comment on Line 309 states it should use
similarity_threshold- The clustering uses
similarity_threshold=0.85by default- Merging at 0.7 (lower threshold) after clustering at 0.85 will over-merge distinct clusters, degrading accuracy
This issue was flagged in previous reviews but the fix is incomplete.
🔎 Proposed fix
- # Post-clustering merge: merge similar clusters based on representative faces - # Use similarity_threshold if merge_threshold not explicitly provided - effective_merge_threshold = merge_threshold if merge_threshold is not None else 0.7 + # Post-clustering merge: merge similar clusters based on mean embeddings + # Use similarity_threshold if merge_threshold not explicitly provided + effective_merge_threshold = merge_threshold if merge_threshold is not None else similarity_threshold results = _merge_similar_clusters( results, merge_threshold=effective_merge_threshold )
🧹 Nitpick comments (1)
backend/app/utils/face_clusters.py (1)
9-10: Combine imports from the same module.Both imports are from
sklearn.metrics.pairwiseand can be combined into a single import statement for cleaner code.🔎 Suggested refactor
-from sklearn.metrics.pairwise import cosine_distances -from sklearn.metrics.pairwise import cosine_similarity +from sklearn.metrics.pairwise import cosine_distances, cosine_similarity
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
backend/app/utils/face_clusters.py
🔇 Additional comments (8)
backend/app/utils/face_clusters.py (8)
165-186: Excellent addition for robustness.This validation function effectively guards against NaN, infinite, and zero-vector embeddings that would corrupt distance calculations. The
min_norm=1e-6threshold is appropriate for detecting effectively-zero vectors.
195-201: LGTM - docstring now matches implementation.The docstring correctly reflects the actual default values for all parameters. Good fix addressing the previous review.
213-237: LGTM - proper validation implemented.The validation loop now correctly filters out invalid embeddings and handles the empty case appropriately. This properly addresses the previous review about the no-op filter loop.
244-260: Robust distance calculation with proper guards.The NaN guards and similarity threshold application are correctly implemented. Converting similarity to distance (
1 - similarity_threshold) and marking dissimilar pairs as completely different (distance 1.0) is the right approach.
350-373: Good defensive validation of cluster means.Properly validates cluster mean embeddings before using them for assignment, with appropriate logging and early exit handling.
379-398: Robust validation with NaN guards in assignment.Properly validates both face embeddings and resulting distances, preventing invalid data from corrupting cluster assignments.
421-554: Well-implemented merge with proper safeguards.The function correctly:
- Validates cluster means before merging (lines 445-467)
- Guards against NaN similarities (lines 488-493)
- Resolves transitive merge chains (lines 505-511)
- Applies majority voting for consistent cluster names (lines 527-546)
This properly addresses the previous review concern about inconsistent cluster names after merging.
556-593: Excellent defensive handling of edge cases.The safe normalization (lines 570-582) and result clipping (lines 590-591) properly guard against zero vectors and numerical errors, ensuring finite, valid distance values.
| eps: float = 0.75, | ||
| min_samples: int = 2, | ||
| similarity_threshold: float = 0.85, | ||
| merge_threshold: float = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix type hint for optional parameter.
The type hint merge_threshold: float = None is incorrect since None is not a valid float. Use Optional[float] instead.
🔎 Proposed fix
- merge_threshold: float = None,
+ merge_threshold: Optional[float] = None,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| merge_threshold: float = None, | |
| merge_threshold: Optional[float] = None, |
🤖 Prompt for AI Agents
In backend/app/utils/face_clusters.py around line 192, the parameter declaration
uses the invalid type hint "merge_threshold: float = None"; change it to use
Optional[float] (i.e. "merge_threshold: Optional[float] = None") and ensure
Optional is imported from typing at the top of the file (add "from typing import
Optional" if not already present).
rahulharpal1603
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @keshaviscool!
This was one of the most critical improvements to our project.
…ost-clustering merge
#722 Issue fixed
Summary by CodeRabbit
Improvements
Stability
Chores
✏️ Tip: You can customize this high-level summary in your review settings.