Skip to content

Conversation

@keshaviscool
Copy link
Contributor

@keshaviscool keshaviscool commented Dec 13, 2025

…ost-clustering merge

#722 Issue fixed

  • Added similarity_threshold parameter (0.85) to filter dissimilar faces before clustering
  • Implemented post-clustering merge to combine duplicate clusters of same person
  • Improved distance calculation using precomputed cosine distances
  • Enhanced logging for better debugging and monitoring
  • Prevents side-face false positives while maintaining high accuracy

Summary by CodeRabbit

  • Improvements

    • Enhanced face clustering with cosine-based similarity thresholds, exposed configurable similarity/merge thresholds, and optional post-cluster merging to reduce duplicates and improve grouping.
    • Better assignment of previously unclustered faces with stricter validation and more accurate matching.
  • Stability

    • Robust validation to skip invalid embeddings, safer distance calculations, and expanded logging for numerical edge cases.
  • Chores

    • Ignored local environment directory.
    • Tightened image metadata schema to disallow arbitrary extra properties.

✏️ Tip: You can customize this high-level summary in your review settings.

…ost-clustering merge

- Added similarity_threshold parameter (0.85) to filter dissimilar faces before clustering
- Implemented post-clustering merge to combine duplicate clusters of same person
- Added quality filtering for embeddings (std and norm checks)
- Improved distance calculation using precomputed cosine distances
- Enhanced logging for better debugging and monitoring
- Prevents side-face false positives while maintaining high accuracy
@github-actions
Copy link
Contributor

⚠️ No issue was linked in the PR description.
Please make sure to link an issue (e.g., 'Fixes #issue_number')

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 13, 2025

📝 Walkthrough

Walkthrough

Adds embedding validation, cosine-distance precomputed matrices for DBSCAN, similarity and merge thresholds, a post-clustering cluster-merge step, and expanded logging; updates clustering/assignment public signatures. Also adds env/ to .gitignore and tightens an OpenAPI metadata schema.

Changes

Cohort / File(s) Summary
Face clustering core & helpers
backend/app/utils/face_clusters.py
Added similarity_threshold and merge_threshold parameters; validate embeddings (NaN/inf/near-zero) via _validate_embedding; compute cosine-distance precomputed matrices, clip distances to reflect similarity thresholds, run DBSCAN with metric='precomputed'; add _merge_similar_clusters(results, merge_threshold) to merge clusters by mean-embedding similarity and update cluster UUIDs/names; update cluster_util_assign_cluster_to_faces_without_clusterId to use cosine distances and validate inputs. Note: duplicate _merge_similar_clusters declaration present.
Repository ignore
.gitignore
Added env/ to ignored paths.
API schema
docs/backend/backend_python/openapi.json
Wrapped two query parameter input schemas in allOf with a title and removed additionalProperties: true from ImageInCluster.Metadata, tightening allowed metadata properties.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • rahulharpal1603

Poem

🐰 I hopped through cosine fields at dawn,
checked each vector, skipped the wan,
I nudged close clusters into one bright throng,
logged my hops and hummed a tidy song,
— Rabbit, pleased the faces belong.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately reflects the main change: improving face clustering accuracy through similarity thresholds and post-clustering merge functionality.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added backend bug Something isn't working medium labels Dec 13, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and ea8d62b.

📒 Files selected for processing (1)
  • backend/app/utils/face_clusters.py (4 hunks)
🔇 Additional comments (4)
backend/app/utils/face_clusters.py (4)

211-229: LGTM - Precomputed distance approach is sound.

The similarity threshold filtering correctly marks dissimilar faces (distance > 0.15) with distance 1.0, which exceeds eps=0.75 and prevents them from clustering together.

Consider moving the import to the top of the file for consistency, though it's acceptable here.


264-266: LGTM - Post-clustering merge step.

The merge step with threshold 0.7 is appropriate for consolidating clusters that DBSCAN may have over-segmented.


373-391: Transitive merging not handled - verify if intentional.

The merge logic skips clusters already in merge_mapping, but this misses transitive relationships. For example: if A≈B and B≈C, then B→A is recorded, but C won't merge to A because B is skipped when checking C.

If transitive merging is desired, resolve the target cluster:

     for i, uuid1 in enumerate(cluster_uuids):
         if uuid1 in merge_mapping:
-            continue  # Already merged
+            continue
             
         for j in range(i + 1, len(cluster_uuids)):
             uuid2 = cluster_uuids[j]
             if uuid2 in merge_mapping:
                 continue  # Already merged
             
             # Calculate similarity between cluster means
-            emb1 = cluster_means[uuid1].reshape(1, -1)
+            # Resolve target for uuid1 in case of prior merges
+            target_uuid1 = merge_mapping.get(uuid1, uuid1)
+            emb1 = cluster_means[target_uuid1].reshape(1, -1)
             emb2 = cluster_means[uuid2].reshape(1, -1)

Alternatively, if the current behavior is intentional (to limit merge chains), add a comment explaining the design choice.


270-272: LGTM - Stricter threshold for cluster assignment.

The increased similarity_threshold default of 0.8 aligns with the PR objective to prevent side-face false positives.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/utils/face_clusters.py (1)

258-275: Docstring default doesn’t match the function default.
Signature defaults similarity_threshold to 0.8 (Line 259) but the doc says “Default: 0.7” (Line 274). Please sync them.

-            Higher values = more strict assignment. Default: 0.7
+            Higher values = more strict assignment. Default: 0.8
🧹 Nitpick comments (3)
backend/app/utils/face_clusters.py (3)

9-11: Combine the sklearn pairwise imports (small cleanup).
Not required, but you can do from sklearn.metrics.pairwise import cosine_distances, cosine_similarity in one line for readability.


200-207: O(n²) distance matrix can blow up memory for large face sets.
cosine_distances(embeddings_array) allocates an NxN float matrix. If this can run on thousands+ of faces, consider chunking/approximate neighbors, blocking, or capping inputs per run.


361-400: Greedy merge can miss transitive merges; consider union-find/connected components.
Current mapping merges only when (uuid1, uuid2) directly passes threshold. If AB and BC but A !~ C, C won’t merge (depending on iteration). If the intent is “merge any clusters connected by similarity>=threshold”, use a DSU/union-find (or build a similarity graph and take connected components).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea8d62b and 9a07952.

📒 Files selected for processing (1)
  • backend/app/utils/face_clusters.py (5 hunks)
🔇 Additional comments (1)
backend/app/utils/face_clusters.py (1)

200-218: DBSCAN precomputed distances: verify API + note eps may become ineffective.
After thresholding, distances are essentially {<= (1-similarity_threshold)} ∪ {1.0}; if eps is much larger than (1-similarity_threshold), changing eps won’t meaningfully change neighbors. Also please verify scikit-learn 1.5.1 supports DBSCAN(metric="precomputed", n_jobs=-1) as used here.

Comment on lines 252 to 254
# Post-clustering merge: merge similar clusters based on representative faces
results = _merge_similar_clusters(results, merge_threshold=0.7)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Hardcoded merge_threshold=0.7 conflicts with other defaults and likely harms accuracy.
You’re clustering with similarity_threshold=0.85, but then merging clusters at 0.7 (and the helper default is 0.85). This looks accidental and can over-merge. Suggest: use merge_threshold=similarity_threshold (or introduce an explicit parameter/plumb it through).

-    results = _merge_similar_clusters(results, merge_threshold=0.7)
+    results = _merge_similar_clusters(results, merge_threshold=similarity_threshold)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Post-clustering merge: merge similar clusters based on representative faces
results = _merge_similar_clusters(results, merge_threshold=0.7)
# Post-clustering merge: merge similar clusters based on representative faces
results = _merge_similar_clusters(results, merge_threshold=similarity_threshold)
🤖 Prompt for AI Agents
In backend/app/utils/face_clusters.py around lines 252 to 254, the call uses a
hardcoded merge_threshold=0.7 which conflicts with the clustering
similarity_threshold (0.85) and the helper default, causing potential
over-merge; update the call to pass the active similarity_threshold (e.g.,
merge_threshold=similarity_threshold) or add a merge_threshold parameter to the
enclosing function and thread it through so the merge uses the same threshold
used for clustering (or an explicitly provided value).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/utils/face_clusters.py (1)

305-323: Docstring default is stale (0.7 vs actual 0.8).
Line 306 sets similarity_threshold: float = 0.8, but the docstring says “Default: 0.7”.

♻️ Duplicate comments (1)
backend/app/utils/face_clusters.py (1)

297-301: Bug/behavior mismatch: merge default ignores similarity_threshold and hardcodes 0.7.
Your docstring says “default None, uses similarity_threshold”, but the implementation uses 0.7. This can easily over-merge relative to similarity_threshold=0.85 and contradicts the PR intent.

Apply this diff:

-    effective_merge_threshold = merge_threshold if merge_threshold is not None else 0.7
+    effective_merge_threshold = (
+        merge_threshold if merge_threshold is not None else similarity_threshold
+    )
     results = _merge_similar_clusters(results, merge_threshold=effective_merge_threshold)
🧹 Nitpick comments (5)
.gitignore (1)

34-34: Ignoring env/ seems fine; consider whether you also want .venv/ explicitly.
You already ignore venv/; adding env/ helps cover common virtualenv folder names. If the repo standardizes on .venv/, consider adding that too (optional).

backend/app/utils/face_clusters.py (4)

165-186: _validate_embedding is good; consider making it robust to non-ndarray inputs.
If DB returns lists/JSON, np.isfinite(embedding) can behave unexpectedly. A small hardening is to embedding = np.asarray(embedding) and guard embedding.ndim == 1.


188-202: Signature/docs look directionally right; use Optional[float] for merge_threshold.
Right now it’s annotated as float = None; prefer merge_threshold: Optional[float] = None for type-checkers.


239-252: Distance thresholding works, but consider marking “too far” as 2.0 (max cosine distance) instead of 1.0.
cosine_distances ranges [0, 2]. Setting “dissimilar” pairs to 1.0 is fine with today’s eps=0.75, but if eps is increased above 1.0, those “forced-far” pairs could re-enter neighborhoods unexpectedly.

Also applies to: 261-263


404-529: Merge logic is reasonable; watch O(N²) similarity computation for large cluster counts.
If cluster counts can grow large, consider vectorizing similarities (matrix) or adding a cheap prefilter (e.g., only compare candidates with nearby centroids via ANN).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a07952 and e7dc6a2.

⛔ Files ignored due to path filters (1)
  • frontend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (3)
  • .gitignore (1 hunks)
  • backend/app/utils/face_clusters.py (7 hunks)
  • docs/backend/backend_python/openapi.json (1 hunks)
🔇 Additional comments (6)
docs/backend/backend_python/openapi.json (1)

1120-1128: allOf wrapper around $ref is valid; double-check client generator compatibility.
This is OpenAPI 3.1–legal, but some generators handle “simple enum $ref” more cleanly than allOf([$ref]). If this file is generated, this is fine; if it’s hand-maintained, ensure downstream clients still render InputType as the expected enum.

backend/app/utils/face_clusters.py (5)

9-11: Good: import moves to module scope (no per-call import overhead).


209-235: Embedding validation + clear logging is a solid improvement.
This should prevent NaN/zero-vector corruption from silently degrading clustering.


337-359: Validation of cluster means before assignment is a good guardrail.


364-400: Good: skips invalid face embeddings + NaN distance guard prevents bad assignments.


534-565: Safer cosine distance computation (normalization + clipping) looks solid.

@rahulharpal1603
Copy link
Contributor

@keshaviscool
Kindly fix the Lint issues.

- Added _validate_embedding() to check for NaN/inf/zero vectors
- Guard against invalid embeddings in clustering and assignment
- Safe division in cosine distance calculations
- Format code with Black for lint compliance
@keshaviscool
Copy link
Contributor Author

@rahulharpal1603 fixed lint issues. kindly review

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/app/utils/face_clusters.py (1)

308-313: Hardcoded merge threshold conflicts with comment and clustering threshold.

Line 310 uses a hardcoded fallback of 0.7 when merge_threshold is None, but:

  1. The comment on Line 309 states it should use similarity_threshold
  2. The clustering uses similarity_threshold=0.85 by default
  3. Merging at 0.7 (lower threshold) after clustering at 0.85 will over-merge distinct clusters, degrading accuracy

This issue was flagged in previous reviews but the fix is incomplete.

🔎 Proposed fix
-    # Post-clustering merge: merge similar clusters based on representative faces
-    # Use similarity_threshold if merge_threshold not explicitly provided
-    effective_merge_threshold = merge_threshold if merge_threshold is not None else 0.7
+    # Post-clustering merge: merge similar clusters based on mean embeddings
+    # Use similarity_threshold if merge_threshold not explicitly provided
+    effective_merge_threshold = merge_threshold if merge_threshold is not None else similarity_threshold
     results = _merge_similar_clusters(
         results, merge_threshold=effective_merge_threshold
     )
🧹 Nitpick comments (1)
backend/app/utils/face_clusters.py (1)

9-10: Combine imports from the same module.

Both imports are from sklearn.metrics.pairwise and can be combined into a single import statement for cleaner code.

🔎 Suggested refactor
-from sklearn.metrics.pairwise import cosine_distances
-from sklearn.metrics.pairwise import cosine_similarity
+from sklearn.metrics.pairwise import cosine_distances, cosine_similarity
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7dc6a2 and 1d2b7b1.

📒 Files selected for processing (1)
  • backend/app/utils/face_clusters.py
🔇 Additional comments (8)
backend/app/utils/face_clusters.py (8)

165-186: Excellent addition for robustness.

This validation function effectively guards against NaN, infinite, and zero-vector embeddings that would corrupt distance calculations. The min_norm=1e-6 threshold is appropriate for detecting effectively-zero vectors.


195-201: LGTM - docstring now matches implementation.

The docstring correctly reflects the actual default values for all parameters. Good fix addressing the previous review.


213-237: LGTM - proper validation implemented.

The validation loop now correctly filters out invalid embeddings and handles the empty case appropriately. This properly addresses the previous review about the no-op filter loop.


244-260: Robust distance calculation with proper guards.

The NaN guards and similarity threshold application are correctly implemented. Converting similarity to distance (1 - similarity_threshold) and marking dissimilar pairs as completely different (distance 1.0) is the right approach.


350-373: Good defensive validation of cluster means.

Properly validates cluster mean embeddings before using them for assignment, with appropriate logging and early exit handling.


379-398: Robust validation with NaN guards in assignment.

Properly validates both face embeddings and resulting distances, preventing invalid data from corrupting cluster assignments.


421-554: Well-implemented merge with proper safeguards.

The function correctly:

  • Validates cluster means before merging (lines 445-467)
  • Guards against NaN similarities (lines 488-493)
  • Resolves transitive merge chains (lines 505-511)
  • Applies majority voting for consistent cluster names (lines 527-546)

This properly addresses the previous review concern about inconsistent cluster names after merging.


556-593: Excellent defensive handling of edge cases.

The safe normalization (lines 570-582) and result clipping (lines 590-591) properly guard against zero vectors and numerical errors, ensuring finite, valid distance values.

eps: float = 0.75,
min_samples: int = 2,
similarity_threshold: float = 0.85,
merge_threshold: float = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix type hint for optional parameter.

The type hint merge_threshold: float = None is incorrect since None is not a valid float. Use Optional[float] instead.

🔎 Proposed fix
-    merge_threshold: float = None,
+    merge_threshold: Optional[float] = None,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
merge_threshold: float = None,
merge_threshold: Optional[float] = None,
🤖 Prompt for AI Agents
In backend/app/utils/face_clusters.py around line 192, the parameter declaration
uses the invalid type hint "merge_threshold: float = None"; change it to use
Optional[float] (i.e. "merge_threshold: Optional[float] = None") and ensure
Optional is imported from typing at the top of the file (add "from typing import
Optional" if not already present).

Copy link
Contributor

@rahulharpal1603 rahulharpal1603 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @keshaviscool!

This was one of the most critical improvements to our project.

@rahulharpal1603 rahulharpal1603 merged commit 7649865 into AOSSIE-Org:main Dec 25, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants