fix: improve face clustering accuracy with similarity threshold and p… #771

keshaviscool · 2025-12-13T22:15:20Z

…ost-clustering merge

#722 Issue fixed

- Added similarity_threshold parameter (0.85) to filter dissimilar faces before clustering
- Implemented post-clustering merge to combine duplicate clusters of the same person
- Improved distance calculation using precomputed cosine distances
- Enhanced logging for better debugging and monitoring
- Prevents side-face false positives while maintaining high accuracy

Summary by CodeRabbit

Improvements
- Enhanced face clustering with cosine-based similarity thresholds, exposed configurable similarity/merge thresholds, and optional post-cluster merging to reduce duplicates and improve grouping.
- Better assignment of previously unclustered faces with stricter validation and more accurate matching.
Stability
- Robust validation to skip invalid embeddings, safer distance calculations, and expanded logging for numerical edge cases.
Chores
- Ignored local environment directory.
- Tightened image metadata schema to disallow arbitrary extra properties.

…ost-clustering merge
- Added similarity_threshold parameter (0.85) to filter dissimilar faces before clustering
- Implemented post-clustering merge to combine duplicate clusters of same person
- Added quality filtering for embeddings (std and norm checks)
- Improved distance calculation using precomputed cosine distances
- Enhanced logging for better debugging and monitoring
- Prevents side-face false positives while maintaining high accuracy

github-actions · 2025-12-13T22:15:36Z

⚠️ No issue was linked in the PR description.
Please make sure to link an issue (e.g., 'Fixes #issue_number')

coderabbitai · 2025-12-13T22:15:42Z

Walkthrough

Adds embedding validation, cosine-distance precomputed matrices for DBSCAN, similarity and merge thresholds, a post-clustering cluster-merge step, and expanded logging; updates clustering/assignment public signatures. Also adds env/ to .gitignore and tightens an OpenAPI metadata schema.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Face clustering core & helpers `backend/app/utils/face_clusters.py` | Added `similarity_threshold` and `merge_threshold` parameters; validate embeddings (NaN/inf/near-zero) via `_validate_embedding`; compute cosine-distance precomputed matrices, clip distances to reflect similarity thresholds, run DBSCAN with `metric='precomputed'`; add `_merge_similar_clusters(results, merge_threshold)` to merge clusters by mean-embedding similarity and update cluster UUIDs/names; update `cluster_util_assign_cluster_to_faces_without_clusterId` to use cosine distances and validate inputs. Note: duplicate `_merge_similar_clusters` declaration present. |
| Repository ignore `.gitignore` | Added `env/` to ignored paths. |
| API schema `docs/backend/backend_python/openapi.json` | Wrapped two query parameter input schemas in `allOf` with a `title` and removed `additionalProperties: true` from `ImageInCluster.Metadata`, tightening allowed metadata properties. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

feat: add global face reclustering option in settings with backend API and UI support #560 — Directly related modifications to face clustering utilities and recluster behavior.
feat : Centralized Logging System #548 — (if present) prior changes to embedding normalization or distance helpers that this update depends on.
Fix the Hover state issue (#531) #532 — (if present) earlier changes to cluster assignment endpoints interacting with these utilities.

Suggested reviewers

rahulharpal1603

Poem

🐰 I hopped through cosine fields at dawn,
checked each vector, skipped the wan,
I nudged close clusters into one bright throng,
logged my hops and hummed a tidy song,
— Rabbit, pleased the faces belong.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately reflects the main change: improving face clustering accuracy through similarity thresholds and post-clustering merge functionality. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and ea8d62b.

📒 Files selected for processing (1)

backend/app/utils/face_clusters.py (4 hunks)

🔇 Additional comments (4)

backend/app/utils/face_clusters.py (4)
211-229: LGTM - Precomputed distance approach is sound.

The similarity threshold filtering correctly marks dissimilar faces (distance > 0.15) with distance 1.0, which exceeds eps=0.75 and prevents them from clustering together.

Consider moving the import to the top of the file for consistency, though it's acceptable here.
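
The thresholding described in this comment can be sketched roughly as follows. This is a minimal NumPy illustration, not the PR's actual code: the function name `threshold_cosine_distances` and the sample vectors are assumptions made for the example, while the 0.85 similarity threshold and the 1.0 "dissimilar" marker are taken from the comment above.

```python
import numpy as np


def threshold_cosine_distances(embeddings, similarity_threshold=0.85):
    """Cosine-distance matrix with dissimilar pairs forced to 1.0 (hypothetical helper)."""
    x = np.asarray(embeddings, dtype=float)
    # Normalise rows so that the dot product equals cosine similarity.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    distances = np.clip(1.0 - unit @ unit.T, 0.0, 2.0)
    # Pairs farther apart than 1 - similarity_threshold are marked as 1.0.
    max_distance = 1.0 - similarity_threshold
    distances[distances > max_distance] = 1.0
    np.fill_diagonal(distances, 0.0)
    return distances


# Two near-identical faces and one unrelated face (made-up 2-D vectors).
faces = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
matrix = threshold_cosine_distances(faces)
```

A matrix built this way is the kind of input that scikit-learn's `DBSCAN(eps=0.75, metric="precomputed")` accepts; with `eps=0.75`, the pairs marked 1.0 fall outside every neighbourhood.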

264-266: LGTM - Post-clustering merge step.

The merge step with threshold 0.7 is appropriate for consolidating clusters that DBSCAN may have over-segmented.

373-391: Transitive merging not handled - verify if intentional.

The merge logic skips clusters already in merge_mapping, but this misses transitive relationships. For example: if A≈B and B≈C, then B→A is recorded, but C won't merge to A because B is skipped when checking C.

If transitive merging is desired, resolve the target cluster:
     for i, uuid1 in enumerate(cluster_uuids):
         if uuid1 in merge_mapping:
-            continue  # Already merged
+            continue
             
         for j in range(i + 1, len(cluster_uuids)):
             uuid2 = cluster_uuids[j]
             if uuid2 in merge_mapping:
                 continue  # Already merged
             
             # Calculate similarity between cluster means
-            emb1 = cluster_means[uuid1].reshape(1, -1)
+            # Resolve target for uuid1 in case of prior merges
+            target_uuid1 = merge_mapping.get(uuid1, uuid1)
+            emb1 = cluster_means[target_uuid1].reshape(1, -1)
             emb2 = cluster_means[uuid2].reshape(1, -1)
Alternatively, if the current behavior is intentional (to limit merge chains), add a comment explaining the design choice.
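
One way to resolve merge targets transitively is to follow the mapping until it reaches a cluster that has not itself been merged away. The sketch below assumes `merge_mapping` is a plain dict from merged cluster ID to target cluster ID, as the comment implies; the helper name `resolve_merge_target` is hypothetical.

```python
def resolve_merge_target(merge_mapping, cluster_id):
    """Follow merge_mapping until reaching a cluster that is not merged away."""
    seen = set()
    while cluster_id in merge_mapping and cluster_id not in seen:
        seen.add(cluster_id)  # guard against accidental cycles in the mapping
        cluster_id = merge_mapping[cluster_id]
    return cluster_id


# C was merged into B, and B into A, so all three should resolve to A.
merge_mapping = {"B": "A", "C": "B"}
resolved = {cid: resolve_merge_target(merge_mapping, cid) for cid in ("A", "B", "C")}
```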

270-272: LGTM - Stricter threshold for cluster assignment.

The increased similarity_threshold default of 0.8 aligns with the PR objective to prevent side-face false positives.

backend/app/utils/face_clusters.py

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/app/utils/face_clusters.py (1)
258-275: Docstring default doesn’t match the function default.
Signature defaults similarity_threshold to 0.8 (Line 259) but the doc says “Default: 0.7” (Line 274). Please sync them.
-            Higher values = more strict assignment. Default: 0.7
+            Higher values = more strict assignment. Default: 0.8

🧹 Nitpick comments (3)

backend/app/utils/face_clusters.py (3)

9-11: Combine the sklearn pairwise imports (small cleanup).
Not required, but you can do from sklearn.metrics.pairwise import cosine_distances, cosine_similarity in one line for readability.

200-207: O(n²) distance matrix can blow up memory for large face sets.
cosine_distances(embeddings_array) allocates an NxN float matrix. If this can run on thousands+ of faces, consider chunking/approximate neighbors, blocking, or capping inputs per run.
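
To put a rough number on the memory concern, the arithmetic can be sketched as below. It assumes 8-byte (float64) entries and a fully dense matrix; the helper name `dense_matrix_bytes` is made up for the illustration.

```python
def dense_matrix_bytes(n_faces, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix (8 bytes per value assumes float64)."""
    return n_faces * n_faces * bytes_per_value


for n in (1_000, 10_000, 50_000):
    gigabytes = dense_matrix_bytes(n) / 1e9
    print(f"{n:>6} faces -> {gigabytes:.2f} GB")
```

Because the cost grows quadratically, a tenfold increase in faces means roughly a hundredfold increase in memory.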

361-400: Greedy merge can miss transitive merges; consider union-find/connected components.
Current mapping merges only when (uuid1, uuid2) directly passes threshold. If A≈B and B≈C but A≉C, C won’t merge (depending on iteration order). If the intent is “merge any clusters connected by similarity >= threshold”, use a DSU/union-find (or build a similarity graph and take connected components).
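
A union-find version of the merge could look roughly like this. It is a sketch under the assumption that the pairs passing the threshold have already been collected as `(id, id)` tuples; `merge_groups` and the sample IDs are hypothetical and not part of the PR.

```python
def merge_groups(cluster_ids, similar_pairs):
    """Group cluster IDs into connected components using union-find."""
    parent = {cid: cid for cid in cluster_ids}

    def find(cid):
        # Walk up to the root, halving the path along the way.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for cid in cluster_ids:
        groups.setdefault(find(cid), []).append(cid)
    return sorted(groups.values())


# A is similar to B and B to C; A and C end up together even with no direct pair.
groups = merge_groups(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

Note the trade-off: connected components can chain together clusters whose endpoints are quite dissimilar, which may or may not be desirable for faces.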

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea8d62b and 9a07952.

📒 Files selected for processing (1)

backend/app/utils/face_clusters.py (5 hunks)

🔇 Additional comments (1)

backend/app/utils/face_clusters.py (1)

200-218: DBSCAN precomputed distances: verify API + note eps may become ineffective.
After thresholding, distances are essentially {<= (1-similarity_threshold)} ∪ {1.0}; if eps is much larger than (1-similarity_threshold), changing eps won’t meaningfully change neighbors. Also please verify scikit-learn 1.5.1 supports DBSCAN(metric="precomputed", n_jobs=-1) as used here.
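
The point about `eps` can be illustrated with a tiny hand-written matrix. This is a pure-Python sketch assuming neighbourhoods are defined as "distance <= eps"; the matrix values and the helper `neighbor_counts` are made up for the example.

```python
def neighbor_counts(distances, eps):
    """Count, per row, how many entries lie within eps (self included)."""
    return [sum(1 for d in row if d <= eps) for row in distances]


# Thresholded matrix: kept distances are <= 0.15, everything else is 1.0.
distances = [
    [0.00, 0.05, 1.00],
    [0.05, 0.00, 1.00],
    [1.00, 1.00, 0.00],
]

counts_small = neighbor_counts(distances, eps=0.20)
counts_large = neighbor_counts(distances, eps=0.75)
counts_one = neighbor_counts(distances, eps=1.00)
```

Any `eps` between the similarity cut-off and just below 1.0 yields identical neighbourhoods, so in that range only `similarity_threshold` actually controls the clustering.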

backend/app/utils/face_clusters.py

coderabbitai · 2025-12-14T04:18:15Z

backend/app/utils/face_clusters.py

+    # Post-clustering merge: merge similar clusters based on representative faces
+    results = _merge_similar_clusters(results, merge_threshold=0.7)
+


⚠️ Potential issue | 🟠 Major

Hardcoded merge_threshold=0.7 conflicts with other defaults and likely harms accuracy.
You’re clustering with similarity_threshold=0.85, but then merging clusters at 0.7 (and the helper default is 0.85). This looks accidental and can over-merge. Suggest: use merge_threshold=similarity_threshold (or introduce an explicit parameter/plumb it through).

-    results = _merge_similar_clusters(results, merge_threshold=0.7)
+    results = _merge_similar_clusters(results, merge_threshold=similarity_threshold)

🤖 Prompt for AI Agents

In backend/app/utils/face_clusters.py around lines 252 to 254, the call uses a hardcoded merge_threshold=0.7 which conflicts with the clustering similarity_threshold (0.85) and the helper default, causing potential over-merge; update the call to pass the active similarity_threshold (e.g., merge_threshold=similarity_threshold) or add a merge_threshold parameter to the enclosing function and thread it through so the merge uses the same threshold used for clustering (or an explicitly provided value).

backend/app/utils/face_clusters.py

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/app/utils/face_clusters.py (1)

305-323: Docstring default is stale (0.7 vs actual 0.8).
Line 306 sets similarity_threshold: float = 0.8, but the docstring says “Default: 0.7”.

♻️ Duplicate comments (1)

backend/app/utils/face_clusters.py (1)
297-301: Bug/behavior mismatch: merge default ignores similarity_threshold and hardcodes 0.7.
Your docstring says “default None, uses similarity_threshold”, but the implementation uses 0.7. This can easily over-merge relative to similarity_threshold=0.85 and contradicts the PR intent.

Apply this diff:
-    effective_merge_threshold = merge_threshold if merge_threshold is not None else 0.7
+    effective_merge_threshold = (
+        merge_threshold if merge_threshold is not None else similarity_threshold
+    )
     results = _merge_similar_clusters(results, merge_threshold=effective_merge_threshold)
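
The fallback in this diff can be isolated into a small helper, shown here as a sketch. The function name `resolve_merge_threshold` is hypothetical; the parameter names and the 0.85 value come from the review.

```python
from typing import Optional


def resolve_merge_threshold(
    similarity_threshold: float, merge_threshold: Optional[float] = None
) -> float:
    """Use merge_threshold when given, otherwise fall back to similarity_threshold."""
    # Compare against None explicitly so that an explicit 0.0 is still honoured.
    return similarity_threshold if merge_threshold is None else merge_threshold


default_case = resolve_merge_threshold(0.85)
explicit_case = resolve_merge_threshold(0.85, merge_threshold=0.7)
```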

🧹 Nitpick comments (5)

.gitignore (1)

34-34: Ignoring env/ seems fine; consider whether you also want .venv/ explicitly.
You already ignore venv/; adding env/ helps cover common virtualenv folder names. If the repo standardizes on .venv/, consider adding that too (optional).

backend/app/utils/face_clusters.py (4)

165-186: _validate_embedding is good; consider making it robust to non-ndarray inputs.
If DB returns lists/JSON, np.isfinite(embedding) can behave unexpectedly. A small hardening is to embedding = np.asarray(embedding) and guard embedding.ndim == 1.
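
A hardened validator along these lines might look like the following. This is a sketch, not the PR's `_validate_embedding`: the name `validate_embedding` and the exact ordering of checks are assumptions, while the NaN/inf/near-zero checks and the `min_norm=1e-6` default are taken from the review.

```python
import numpy as np


def validate_embedding(embedding, min_norm=1e-6):
    """Return True only for finite, non-empty, 1-D vectors with a non-trivial norm."""
    try:
        arr = np.asarray(embedding, dtype=float)
    except (TypeError, ValueError):
        return False  # ragged or non-numeric input
    if arr.ndim != 1 or arr.size == 0:
        return False
    if not np.all(np.isfinite(arr)):
        return False
    return bool(np.linalg.norm(arr) >= min_norm)
```

Converting with `np.asarray` first means plain lists (for example, values decoded from JSON) are treated the same way as arrays.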

188-202: Signature/docs look directionally right; use Optional[float] for merge_threshold.
Right now it’s annotated as float = None; prefer merge_threshold: Optional[float] = None for type-checkers.

239-252: Distance thresholding works, but consider marking “too far” as 2.0 (max cosine distance) instead of 1.0.
cosine_distances ranges [0, 2]. Setting “dissimilar” pairs to 1.0 is fine with today’s eps=0.75, but if eps is increased above 1.0, those “forced-far” pairs could re-enter neighborhoods unexpectedly.

Also applies to: 261-263
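
The difference between the two marker values only shows up once `eps` exceeds 1.0, as this small sketch illustrates. The helpers `mask_far_pairs` and `is_neighbour` and the sample distances are hypothetical; the 0.15 cut-off corresponds to `1 - similarity_threshold` with the 0.85 default.

```python
MAX_COSINE_DISTANCE = 2.0  # cosine distance ranges over [0, 2]


def mask_far_pairs(distances, max_distance, sentinel):
    """Replace every distance above max_distance with the sentinel value."""
    return [[d if d <= max_distance else sentinel for d in row] for row in distances]


def is_neighbour(distances, i, j, eps):
    """DBSCAN-style neighbourhood test on a precomputed matrix."""
    return distances[i][j] <= eps


raw = [[0.0, 0.6], [0.6, 0.0]]
with_one = mask_far_pairs(raw, max_distance=0.15, sentinel=1.0)
with_two = mask_far_pairs(raw, max_distance=0.15, sentinel=MAX_COSINE_DISTANCE)

# With eps raised to 1.2, the pair masked as 1.0 becomes a neighbour again.
reenters = is_neighbour(with_one, 0, 1, eps=1.2)
stays_out = is_neighbour(with_two, 0, 1, eps=1.2)
```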

404-529: Merge logic is reasonable; watch O(N²) similarity computation for large cluster counts.
If cluster counts can grow large, consider vectorizing similarities (matrix) or adding a cheap prefilter (e.g., only compare candidates with nearby centroids via ANN).
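
The vectorised variant could be sketched as a single matrix product over the cluster means. This assumes the means have already been validated as non-zero; `similar_cluster_pairs` and the sample data are illustrative names, not the PR's code.

```python
import numpy as np


def similar_cluster_pairs(cluster_means, merge_threshold):
    """Return (id_a, id_b) pairs whose mean embeddings meet the similarity threshold."""
    ids = list(cluster_means)
    means = np.array([cluster_means[cid] for cid in ids], dtype=float)
    # Normalise rows once, then one matrix product gives all pairwise similarities.
    unit = means / np.linalg.norm(means, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Only the upper triangle is needed: each unordered pair appears once.
    rows, cols = np.triu_indices(len(ids), k=1)
    keep = similarity[rows, cols] >= merge_threshold
    return [(ids[i], ids[j]) for i, j in zip(rows[keep], cols[keep])]


means = {"a": [1.0, 0.0], "b": [0.98, 0.1], "c": [0.0, 1.0]}
pairs = similar_cluster_pairs(means, merge_threshold=0.85)
```

This is still quadratic in the number of clusters, but it replaces the Python-level double loop with array operations; an approximate-neighbour prefilter would be a separate step.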

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a07952 and e7dc6a2.

⛔ Files ignored due to path filters (1)

frontend/package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (3)

.gitignore (1 hunks)
backend/app/utils/face_clusters.py (7 hunks)
docs/backend/backend_python/openapi.json (1 hunks)

🔇 Additional comments (6)

docs/backend/backend_python/openapi.json (1)

1120-1128: allOf wrapper around $ref is valid; double-check client generator compatibility.
This is OpenAPI 3.1–legal, but some generators handle “simple enum $ref” more cleanly than allOf([$ref]). If this file is generated, this is fine; if it’s hand-maintained, ensure downstream clients still render InputType as the expected enum.

backend/app/utils/face_clusters.py (5)

9-11: Good: import moves to module scope (no per-call import overhead).

209-235: Embedding validation + clear logging is a solid improvement.
This should prevent NaN/zero-vector corruption from silently degrading clustering.

337-359: Validation of cluster means before assignment is a good guardrail.

364-400: Good: skips invalid face embeddings + NaN distance guard prevents bad assignments.

534-565: Safer cosine distance computation (normalization + clipping) looks solid.

rahulharpal1603 · 2025-12-24T05:57:53Z

@keshaviscool
Kindly fix the Lint issues.

- Added _validate_embedding() to check for NaN/inf/zero vectors
- Guard against invalid embeddings in clustering and assignment
- Safe division in cosine distance calculations
- Format code with Black for lint compliance

keshaviscool · 2025-12-24T06:14:28Z

@rahulharpal1603 fixed lint issues. kindly review

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

backend/app/utils/face_clusters.py (1)
308-313: Hardcoded merge threshold conflicts with comment and clustering threshold.

Line 310 uses a hardcoded fallback of 0.7 when merge_threshold is None, but:

The comment on Line 309 states it should use similarity_threshold

The clustering uses similarity_threshold=0.85 by default

Merging at 0.7 (lower threshold) after clustering at 0.85 will over-merge distinct clusters, degrading accuracy

This issue was flagged in previous reviews but the fix is incomplete.
🔎 Proposed fix
-    # Post-clustering merge: merge similar clusters based on representative faces
-    # Use similarity_threshold if merge_threshold not explicitly provided
-    effective_merge_threshold = merge_threshold if merge_threshold is not None else 0.7
+    # Post-clustering merge: merge similar clusters based on mean embeddings
+    # Use similarity_threshold if merge_threshold not explicitly provided
+    effective_merge_threshold = merge_threshold if merge_threshold is not None else similarity_threshold
     results = _merge_similar_clusters(
         results, merge_threshold=effective_merge_threshold
     )

🧹 Nitpick comments (1)

backend/app/utils/face_clusters.py (1)
9-10: Combine imports from the same module.

Both imports are from sklearn.metrics.pairwise and can be combined into a single import statement for cleaner code.
🔎 Suggested refactor
-from sklearn.metrics.pairwise import cosine_distances
-from sklearn.metrics.pairwise import cosine_similarity
+from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7dc6a2 and 1d2b7b1.

📒 Files selected for processing (1)

backend/app/utils/face_clusters.py

🔇 Additional comments (8)

backend/app/utils/face_clusters.py (8)

165-186: Excellent addition for robustness.

This validation function effectively guards against NaN, infinite, and zero-vector embeddings that would corrupt distance calculations. The min_norm=1e-6 threshold is appropriate for detecting effectively-zero vectors.

195-201: LGTM - docstring now matches implementation.

The docstring correctly reflects the actual default values for all parameters. Good fix addressing the previous review.

213-237: LGTM - proper validation implemented.

The validation loop now correctly filters out invalid embeddings and handles the empty case appropriately. This properly addresses the previous review about the no-op filter loop.

244-260: Robust distance calculation with proper guards.

The NaN guards and similarity threshold application are correctly implemented. Converting similarity to distance (1 - similarity_threshold) and marking dissimilar pairs as completely different (distance 1.0) is the right approach.

350-373: Good defensive validation of cluster means.

Properly validates cluster mean embeddings before using them for assignment, with appropriate logging and early exit handling.

379-398: Robust validation with NaN guards in assignment.

Properly validates both face embeddings and resulting distances, preventing invalid data from corrupting cluster assignments.

421-554: Well-implemented merge with proper safeguards.

The function correctly:

Validates cluster means before merging (lines 445-467)

Guards against NaN similarities (lines 488-493)

Resolves transitive merge chains (lines 505-511)

Applies majority voting for consistent cluster names (lines 527-546)

This properly addresses the previous review concern about inconsistent cluster names after merging.
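
Majority voting over cluster names can be sketched with `collections.Counter`. How the PR treats unnamed clusters is not shown in this thread, so skipping empty names here is an assumption, as is the helper name `majority_name`.

```python
from collections import Counter


def majority_name(names):
    """Pick the most frequent non-empty name, or None if there is none."""
    candidates = [name for name in names if name]  # assumption: ignore unnamed faces
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


merged_name = majority_name(["Alice", None, "Alice", "Bob"])
```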

556-593: Excellent defensive handling of edge cases.

The safe normalization (lines 570-582) and result clipping (lines 590-591) properly guard against zero vectors and numerical errors, ensuring finite, valid distance values.
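
A guarded cosine distance in this spirit might look like the sketch below. The function name `safe_cosine_distance` and the choice to return 1.0 for near-zero vectors are assumptions for the illustration; the thread only states that normalization is guarded and results are clipped.

```python
import numpy as np


def safe_cosine_distance(a, b, min_norm=1e-6):
    """Cosine distance in [0, 2] that never divides by a near-zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a < min_norm or norm_b < min_norm:
        return 1.0  # assumption: treat a degenerate vector as unrelated
    cosine = np.dot(a, b) / (norm_a * norm_b)
    # Rounding can push the value slightly outside [-1, 1]; clip before converting.
    cosine = np.clip(cosine, -1.0, 1.0)
    return float(1.0 - cosine)
```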

coderabbitai · 2025-12-24T06:16:39Z

backend/app/utils/face_clusters.py

+    eps: float = 0.75,
+    min_samples: int = 2,
+    similarity_threshold: float = 0.85,
+    merge_threshold: float = None,


⚠️ Potential issue | 🟡 Minor

Fix type hint for optional parameter.

The type hint merge_threshold: float = None is incorrect since None is not a valid float. Use Optional[float] instead.

🔎 Proposed fix

-    merge_threshold: float = None,
+    merge_threshold: Optional[float] = None,

🤖 Prompt for AI Agents

In backend/app/utils/face_clusters.py around line 192, the parameter declaration uses the invalid type hint "merge_threshold: float = None"; change it to use Optional[float] (i.e. "merge_threshold: Optional[float] = None") and ensure Optional is imported from typing at the top of the file (add "from typing import Optional" if not already present).

rahulharpal1603

Thank you so much @keshaviscool!

This was one of the most critical improvements to our project.

github-actions bot added the backend, bug, and medium labels Dec 13, 2025

coderabbitai bot reviewed Dec 13, 2025

keshaviscool mentioned this pull request Dec 13, 2025

BUG: Incorrect clustering for a folder with many images #722

Closed

7 tasks

coderabbit minor changes

9a07952

coderabbitai bot reviewed Dec 14, 2025

coderabbit review minor changes

e7dc6a2

coderabbitai bot reviewed Dec 14, 2025

rahulharpal1603 mentioned this pull request Dec 24, 2025

fix: improve face clustering parameters to prevent incorrect grouping… #742

Closed

4 tasks

fix: add NaN/zero-vector guards and format code

1d2b7b1

coderabbitai bot reviewed Dec 24, 2025

rahulharpal1603 approved these changes Dec 25, 2025

rahulharpal1603 merged commit 7649865 into AOSSIE-Org:main Dec 25, 2025
9 checks passed

This was referenced Dec 29, 2025

Add pipeline summary metrics to face clustering logs #878

Open

Improve Face Clustering Accuracy & Enable Automatic Clustering on Folder Add #914

Closed
