@SiddharthJiyani SiddharthJiyani commented Dec 18, 2025

Overview

Enhances face clustering accuracy with quality-based filtering, adaptive parameter selection, and robust cluster representatives. Includes comprehensive unit tests.

Changes Made

1. Face Quality Assessment (face_quality.py)

  • Scores faces based on sharpness, brightness, and size
  • Filters low-quality faces (threshold: 0.4) to improve embedding quality
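A rough sketch of this scoring idea follows. The function names match the new `face_quality.py` module, but the internals (the Laplacian proxy, the 500.0 normalizer, the 80-pixel size target, and the default weights) are illustrative assumptions, not the PR's exact code:

```python
import numpy as np

def assess_face_sharpness(gray: np.ndarray) -> float:
    # Variance of a discrete Laplacian as a sharpness proxy (the real module
    # may use cv2.Laplacian; this is a NumPy stand-in, and the 500.0
    # normalization constant is a guess).
    g = gray.astype(np.float64)
    lap = (-4 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return min(1.0, float(lap.var()) / 500.0)

def assess_face_brightness(gray: np.ndarray) -> float:
    # Peaks at mid-gray (128) and falls off toward under-/over-exposure.
    return 1.0 - abs(float(gray.mean()) - 128.0) / 128.0

def assess_face_size(gray: np.ndarray, min_side: int = 80) -> float:
    # Full credit once the shorter side reaches min_side pixels.
    return min(1.0, min(gray.shape[:2]) / min_side)

def calculate_face_quality(gray: np.ndarray,
                           weights=(0.4, 0.3, 0.3)) -> float:
    # Weighted aggregation; sharpness weighted highest, as in the PR.
    ws, wb, wz = weights
    return (ws * assess_face_sharpness(gray)
            + wb * assess_face_brightness(gray)
            + wz * assess_face_size(gray))

flat = np.full((100, 100), 128, dtype=np.uint8)   # no edges: low sharpness
stripes = np.zeros((100, 100), dtype=np.uint8)
stripes[::2] = 255                                # hard edges: high sharpness
print(calculate_face_quality(flat) < calculate_face_quality(stripes))
```

A blurry (flat) crop loses the sharpness component of the score, so it falls below the 0.4-weighted contribution a sharp crop earns.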

2. Adaptive DBSCAN Parameters (clustering_advanced.py)

  • Auto-selects epsilon from the k-NN distance curve via knee detection
  • Adapts to dataset characteristics instead of hardcoded values
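As a sketch of the idea (the PR's `auto_select_epsilon` uses the `kneed` package; here the knee is approximated as the point farthest from the chord joining the sorted k-distance curve's endpoints, with a 90th-percentile fallback):

```python
import numpy as np

def auto_select_epsilon(embeddings: np.ndarray, k: int = 5) -> float:
    """Pick a DBSCAN epsilon from the sorted k-NN cosine-distance curve.

    Simplified stand-in for kneed-based knee detection.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = np.clip(1.0 - normed @ normed.T, 0.0, None)  # cosine distances
    kth = np.sort(np.sort(dists, axis=1)[:, k])          # column 0 is self-distance
    span = kth[-1] - kth[0]
    if span < 1e-9:
        return float(np.percentile(kth, 90))             # flat curve: fallback
    x = np.linspace(0.0, 1.0, len(kth))
    y = (kth - kth[0]) / span
    knee = int(np.argmax(np.abs(y - x)))                 # farthest from diagonal
    return float(kth[knee])

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.01, (40, 8)) + 1.0   # dense collection
loose = rng.normal(0.0, 1.0, (40, 8))          # much more spread out
print(auto_select_epsilon(tight) < auto_select_epsilon(loose))
```

Denser collections yield smaller k-distances, hence a smaller epsilon, which is the adaptive behavior the PR describes.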

3. Hierarchical Clustering Option

  • Adds Agglomerative clustering as DBSCAN alternative
  • Configurable linkage methods (average, complete, single)
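A rough equivalent of this option (the PR adds scikit-learn's AgglomerativeClustering; this sketch uses SciPy's hierarchy API, which supports the same linkage methods on cosine distances):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_hierarchical(embeddings: np.ndarray,
                         distance_threshold: float = 0.3,
                         method: str = "average") -> np.ndarray:
    # Agglomerative clustering on cosine distances; clusters are cut where
    # the linkage distance exceeds the threshold.
    Z = linkage(embeddings, method=method, metric="cosine")
    return fcluster(Z, t=distance_threshold, criterion="distance") - 1  # 0-based

rng = np.random.default_rng(1)
a = rng.normal(0.0, 0.05, (10, 16)) + np.eye(16)[0]  # faces near one identity
b = rng.normal(0.0, 0.05, (10, 16)) + np.eye(16)[1]  # faces near another
labels = cluster_hierarchical(np.vstack([a, b]))
print(len(set(labels)))  # the two identities separate into two clusters
```

Unlike DBSCAN, this assigns every face a cluster (no noise label), which is why it is offered as an alternative rather than a replacement.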

4. Quality-Weighted Medoid

  • Uses medoid (most representative actual face) instead of mean
  • More robust to outliers and low-quality faces
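The idea can be sketched as follows (function name from the new module; the PR's exact weighting formula may differ — here a candidate's total distance is divided by its own quality, so central, high-quality faces win):

```python
import numpy as np

def calculate_quality_weighted_medoid(embeddings: np.ndarray,
                                      quality: np.ndarray) -> int:
    # Medoid: the actual face minimizing total cosine distance to the rest.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T
    # Dividing each candidate's cost by its quality penalizes low-quality
    # faces, biasing the representative toward sharp, well-lit crops.
    cost = dist.sum(axis=1) / (quality + 0.1)
    return int(np.argmin(cost))

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
qual = np.array([0.9, 0.5, 0.2])
idx = calculate_quality_weighted_medoid(emb, qual)
print(idx)  # the high-quality face at index 0 beats the slightly more central index 1
```

Because the medoid is an actual embedding rather than a mean, a single blurry outlier cannot drag the representative off the face manifold.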

5. Database Schema Update

  • Added quality column to faces table

Files Changed

New Files:

  • app/utils/face_quality.py
  • app/utils/clustering_advanced.py
  • migrate_add_quality.py

Modified Files:

  • app/utils/face_clusters.py
  • app/models/FaceDetector.py
  • app/database/faces.py
  • requirements.txt (added kneed>=0.8.5)

Tests:

  • tests/test_face_quality.py (24 tests)
  • tests/test_clustering_improvements.py (25 tests)

Benefits

  • Reduces mis-clustered faces by filtering poor quality detections
  • Automatically adapts to different photo collections
  • More robust cluster representatives

Demo

Before

face_clustering_before.mp4

After

face_clustering_after.mp4

Summary by CodeRabbit

  • New Features

    • Per-face quality scores (sharpness, brightness, size) are recorded with embeddings and returned in face and cluster results.
    • Quality-aware clustering: auto-tuned clustering, hierarchical option, quality-weighted cluster representatives, and quality-based filtering/assignment.
  • Tests

    • New unit tests for clustering, quality-weighted medoid/mean logic, and face quality utilities.
  • Chores

    • Added dependency to support automatic epsilon selection for clustering.


Copilot AI review requested due to automatic review settings December 18, 2025 04:25
coderabbitai bot commented Dec 18, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

A face-quality metric was added end-to-end: computed during detection, persisted in the faces DB schema, returned by retrieval queries, and used by upgraded clustering routines (adaptive DBSCAN/hierarchical and quality-weighted medoid/means).

Changes

  • Database schema & access (backend/app/database/faces.py): Added quality column (REAL DEFAULT 0.5), updated FaceData typing, extended insert APIs (db_insert_face_embeddings, db_insert_face_embeddings_by_image_id) to accept quality, and updated getters (db_get_faces_unassigned_clusters, db_get_all_faces_with_cluster_names, db_get_cluster_mean_embeddings) to select and propagate per-face quality and per-cluster quality arrays.
  • Face detection pipeline (backend/app/models/FaceDetector.py): Computes per-face quality via calculate_face_quality, collects qualities, logs quality stats, and passes quality=qualities into db_insert_face_embeddings_by_image_id.
  • Face quality utilities (backend/app/utils/face_quality.py): New module providing assess_face_sharpness, assess_face_brightness, assess_face_size, calculate_face_quality (weighted aggregation), should_include_face, and filter_quality_faces.
  • Advanced clustering algorithms (backend/app/utils/clustering_advanced.py): New module with auto_select_epsilon, cluster_dbscan_adaptive, cluster_hierarchical, calculate_quality_weighted_medoid, calculate_cluster_mean (supports quality-weighted mean/medoid), and a unified cluster_faces interface.
  • Clustering orchestration & config (backend/app/utils/face_clusters.py): Switched to the advanced clustering flow (cluster_faces), added clustering config constants, filters faces by quality before clustering, uses quality-weighted representatives for assignments, and adds more verbose logging.
  • Tests (backend/tests/test_clustering_improvements.py, backend/tests/test_face_quality.py): New unit tests covering adaptive epsilon, DBSCAN/hierarchical clustering, medoid/quality-weighted mean logic, and face-quality functions with synthetic data.
  • Dependencies & test tweaks (backend/requirements.txt, backend/test.py): Added kneed>=0.8.5 to requirements; tweaked DBSCAN parameters in backend/test.py (eps 0.3→0.2, min_samples 2→1).

Sequence Diagram(s)

sequenceDiagram
    participant Image as Image Input
    participant Detector as FaceDetector
    participant Quality as face_quality
    participant DB as Database
    participant ClustMgr as face_clusters
    participant Advanced as clustering_advanced

    Image->>Detector: detect_faces(image)
    Detector->>Detector: extract face patches, compute embeddings
    loop per face
        Detector->>Quality: calculate_face_quality(face_img)
        Quality-->>Detector: quality_score
    end
    Detector->>DB: db_insert_face_embeddings_by_image_id(embeddings, quality=qualities)
    DB-->>Detector: face IDs

    ClustMgr->>DB: db_get_all_faces_with_cluster_names()
    DB-->>ClustMgr: faces + embeddings + quality
    ClustMgr->>ClustMgr: filter_quality_faces(min_threshold)
    ClustMgr->>Advanced: cluster_faces(embeddings, quality_scores)
    Advanced->>Advanced: auto_select_epsilon / cluster_dbscan_adaptive / cluster_hierarchical
    Advanced->>Advanced: calculate_quality_weighted_medoid / calculate_cluster_mean
    Advanced-->>ClustMgr: labels + representatives
    ClustMgr->>DB: update cluster assignments
    DB-->>ClustMgr: ack

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Files needing focused review:
    • backend/app/database/faces.py — schema migration, SQL queries, COALESCE defaults, grouping semantics.
    • backend/app/utils/clustering_advanced.py — adaptive epsilon, medoid/weighted-mean math, edge-case handling.
    • backend/app/utils/face_clusters.py — quality filtering, representative computation, assignment thresholds.
    • backend/app/models/FaceDetector.py — integration of quality computation and propagation.
    • Tests — ensure synthetic cases and fallbacks adequately exercise logic.

Suggested labels

backend, enhancement, bug

Suggested reviewers

  • rahulharpal1603

Poem

🐰 I measured whisker-spark and light,
Each patch rated by day and night.
Sharp, bright, or small — a number true,
I hop and sort the faces through.
Clusters bloom where quality grew.

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title directly describes the main improvement: face clustering enhanced with quality assessment and adaptive parameters, matching the primary changes.
  • Linked Issues Check ✅ Passed: The PR implements quality-based filtering and adaptive clustering to prevent mis-clustering and person splits, addressing core objectives from #793.
  • Out of Scope Changes Check ✅ Passed: All changes are focused on clustering improvements: face quality assessment, adaptive parameters, and clustering logic. Test files and dependency additions support the main objectives.
  • Docstring Coverage ✅ Passed: Docstring coverage is 95.77%, which is sufficient; the required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/database/faces.py (1)

36-50: No migration script has been integrated into the startup flow.

The codebase uses CREATE TABLE IF NOT EXISTS in the FastAPI lifespan context manager (backend/main.py, lines 47-54) as the sole database initialization mechanism. While this works for new installations, existing databases with the faces table already created will not receive the new quality column. The SQL statements in faces.py directly access this column (INSERT on line 88, SELECT operations on lines 242, 276, 363), which will fail with "table faces has no column named quality" errors for existing installations. The migrate_add_quality.py file mentioned in the PR description does not exist in the repository, nor is any migration logic referenced anywhere in the codebase. To resolve this, either implement a proper migration that executes ALTER TABLE faces ADD COLUMN quality REAL DEFAULT 0.5 on startup for existing databases, or establish a database versioning system.

🧹 Nitpick comments (9)
backend/test.py (1)

43-45: Consider the implications of min_samples=1.

Setting min_samples=1 means every point with at least one neighbor within eps becomes a core point. This significantly changes clustering behavior compared to the production code in face_clusters.py which uses CLUSTERING_MIN_SAMPLES = 2.

If this test file is meant to validate production behavior, consider aligning these parameters with the actual production configuration, or clarify in comments that this is intentional for a specific test scenario.
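To make the min_samples semantics concrete, here is a minimal hand-rolled DBSCAN (kept self-contained for illustration; scikit-learn behaves the same way because a point's eps-neighborhood includes the point itself):

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal Euclidean DBSCAN illustrating min_samples semantics."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]  # self counts
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        stack, labels[i] = [i], cid
        while stack:                        # expand the cluster from core points
            j = stack.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cid
                    if core[k]:
                        stack.append(k)
        cid += 1
    return labels

X = np.array([[0.0], [0.1], [5.0]])        # isolated point at 5.0
print(dbscan(X, eps=0.3, min_samples=2))   # isolated point stays noise (-1)
print(dbscan(X, eps=0.3, min_samples=1))   # every point is core: no noise
```

With min_samples=1 every point is a core point, so no point is ever labeled noise; the isolated face becomes a singleton cluster instead. That is the behavioral gap versus the production CLUSTERING_MIN_SAMPLES = 2.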

backend/tests/test_clustering_improvements.py (2)

91-101: Strengthen the cluster count assertion for separated data.

The test creates two well-separated clusters but only asserts len(unique_labels) >= 1. With data offset by ±2 and variance 0.1, these clusters should be clearly distinguishable.

🔎 Suggested fix
-        unique_labels = set(labels) - {-1}  # Exclude noise
-        assert len(unique_labels) >= 1  # At least some clusters
+        unique_labels = set(labels) - {-1}  # Exclude noise
+        assert len(unique_labels) >= 2, f"Expected 2 distinct clusters for separated data, got {len(unique_labels)}"

155-163: Consider numpy integer type compatibility.

The assertion isinstance(idx, int) may fail if the implementation returns np.int64 or another NumPy integer type. NumPy integer types are not subclasses of Python 3's int.

🔎 Suggested fix
-        assert 0 <= idx < 10
-        assert isinstance(idx, int)
+        assert 0 <= idx < 10
+        assert isinstance(idx, (int, np.integer))
backend/app/utils/face_quality.py (2)

15-41: Consider handling edge case of empty or invalid images.

The sharpness function will raise an error if face_image is empty or has zero dimensions. Consider adding a guard clause for robustness, especially since face crops from edge regions might occasionally produce degenerate results.

🔎 Suggested defensive check
 def assess_face_sharpness(face_image: NDArray) -> float:
+    # Guard against empty images
+    if face_image.size == 0:
+        return 0.0
+
     # Convert to grayscale if needed
     if len(face_image.shape) == 3:

187-205: Consider removing or guarding the __main__ block for production.

The CLI testing block is useful during development but may be unnecessary in production. If this module is only imported (never run directly), consider removing it or moving the test code to the test suite.

backend/app/database/faces.py (1)

140-145: Use keyword arguments for clarity in function call.

The call to db_insert_face_embeddings uses positional arguments which is fragile and harder to read. Using keyword arguments would improve maintainability.

🔎 Suggested fix
-            face_id = db_insert_face_embeddings(image_id, emb, conf, bb, cid, qual)
+            face_id = db_insert_face_embeddings(
+                image_id=image_id,
+                embeddings=emb,
+                confidence=conf,
+                bbox=bb,
+                cluster_id=cid,
+                quality=qual,
+            )
backend/app/utils/face_clusters.py (2)

35-44: Consider externalizing clustering configuration.

The clustering parameters are hardcoded as module-level constants. For production flexibility, consider moving these to the application settings/config module to allow runtime configuration or environment-based overrides.


165-177: Function parameters may conflict with configuration constants.

The function accepts eps and min_samples parameters but uses CLUSTERING_AUTO_EPSILON and CLUSTERING_FIXED_EPSILON internally. The eps parameter is only used when CLUSTERING_AUTO_EPSILON is False. Consider documenting this behavior or removing unused parameters to avoid confusion.

🔎 Clarify parameter usage in docstring
 def cluster_util_cluster_all_face_embeddings(
     eps: float = 0.3, min_samples: int = 2
 ) -> List[ClusterResult]:
     """
     Cluster face embeddings using advanced clustering algorithms with quality filtering.

     Args:
-        eps: DBSCAN epsilon parameter (used only if auto_epsilon=False)
+        eps: DBSCAN epsilon parameter (used only if CLUSTERING_AUTO_EPSILON=False,
+             otherwise ignored in favor of adaptive selection)
         min_samples: Minimum samples parameter for core points
backend/app/utils/clustering_advanced.py (1)

253-288: Consider exposing k_neighbors parameter.

The cluster_faces function doesn't expose the k_neighbors parameter for epsilon auto-selection, defaulting to 5. Consider adding this as an optional parameter for users who want to tune the adaptive epsilon behavior.

🔎 Suggested enhancement:
 def cluster_faces(
     embeddings: NDArray,
     algorithm: Literal["dbscan", "hierarchical"] = "dbscan",
     min_samples: int = 2,
     auto_eps: bool = True,
     fixed_eps: float = 0.3,
+    k_neighbors: int = 5,
     distance_threshold: float = 0.3,
     linkage: str = "average",
 ) -> NDArray:
     """
     Main clustering function with algorithm selection.
 
     Args:
         embeddings: Array of face embeddings
         algorithm: Clustering algorithm choice ('dbscan' or 'hierarchical')
         min_samples: Minimum samples for DBSCAN (default: 2)
         auto_eps: Auto-select epsilon for DBSCAN (default: True)
         fixed_eps: Fixed epsilon if auto_eps=False (default: 0.3)
+        k_neighbors: K for epsilon auto-selection (default: 5)
         distance_threshold: Distance threshold for hierarchical (default: 0.3)
         linkage: Linkage for hierarchical ('average', 'complete', 'single')
 
     Returns:
         Array of cluster labels
     """
     if algorithm == "dbscan":
         return cluster_dbscan_adaptive(
-            embeddings, min_samples=min_samples, auto_eps=auto_eps, fixed_eps=fixed_eps
+            embeddings, min_samples=min_samples, auto_eps=auto_eps, 
+            fixed_eps=fixed_eps, k_neighbors=k_neighbors
         )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and 7910d8c.

📒 Files selected for processing (9)
  • backend/app/database/faces.py (13 hunks)
  • backend/app/models/FaceDetector.py (4 hunks)
  • backend/app/utils/clustering_advanced.py (1 hunks)
  • backend/app/utils/face_clusters.py (6 hunks)
  • backend/app/utils/face_quality.py (1 hunks)
  • backend/requirements.txt (1 hunks)
  • backend/test.py (1 hunks)
  • backend/tests/test_clustering_improvements.py (1 hunks)
  • backend/tests/test_face_quality.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/tests/test_face_quality.py (1)
backend/app/utils/face_quality.py (6)
  • assess_face_sharpness (15-41)
  • assess_face_brightness (44-73)
  • assess_face_size (76-102)
  • calculate_face_quality (105-156)
  • should_include_face (159-170)
  • filter_quality_faces (173-184)
backend/app/utils/face_clusters.py (5)
backend/app/utils/clustering_advanced.py (2)
  • cluster_faces (253-288)
  • calculate_cluster_mean (207-250)
backend/app/utils/face_quality.py (1)
  • filter_quality_faces (173-184)
backend/app/logging/setup_logging.py (1)
  • get_logger (199-209)
sync-microservice/app/logging/setup_logging.py (1)
  • get_logger (192-202)
backend/app/database/faces.py (3)
  • db_get_all_faces_with_cluster_names (261-301)
  • db_get_faces_unassigned_clusters (230-258)
  • db_get_cluster_mean_embeddings (349-406)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Backend Tests
🔇 Additional comments (12)
backend/requirements.txt (1)

74-74: LGTM!

The kneed dependency is appropriately added with a minimum version constraint for the knee-detection functionality used in adaptive epsilon selection for DBSCAN clustering.

backend/tests/test_clustering_improvements.py (1)

1-20: Well-structured test module.

Good coverage of the advanced clustering functionality with tests for all major components: epsilon selection, DBSCAN, hierarchical clustering, medoid calculation, and integration workflows. The test organization by class is clean and maintainable.

backend/tests/test_face_quality.py (2)

25-35: Test creates sparse checkerboard, not full checkerboard.

The pattern img[::2, ::2] = 255 sets pixels at (0,0), (0,2), (2,0), (2,2), etc. - only 25% of pixels. This still produces sharp edges for Laplacian detection, but a full checkerboard (img[::2, ::2] = 255 and img[1::2, 1::2] = 255) would be more representative.

The test passes regardless since sparse patterns still have high-frequency content, so this is a minor observation.
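For reference, the difference between the two patterns in a small array:

```python
import numpy as np

img = np.zeros((4, 4), dtype=np.uint8)
img[::2, ::2] = 255               # sparse: even-row, even-col pixels only
sparse_fraction = (img > 0).mean()

img[1::2, 1::2] = 255             # full checkerboard: odd-row, odd-col added
full_fraction = (img > 0).mean()
print(sparse_fraction, full_fraction)  # 0.25 vs 0.5 of pixels lit
```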


1-265: Comprehensive test coverage for face quality module.

Excellent test coverage including:

  • Boundary conditions (exact threshold, empty lists)
  • Edge cases (missing quality keys, grayscale images)
  • Custom parameters (weights, thresholds)
  • Error conditions (invalid weight sums)
backend/app/models/FaceDetector.py (1)

53-72: Quality assessment integration looks correct.

The quality assessment is properly computed on the raw cropped face image before the FaceNet preprocessing step, which ensures the quality metrics reflect the actual input quality rather than the normalized/preprocessed version. Good use of debug-level logging for per-face metrics.

backend/app/utils/face_quality.py (1)

105-156: Well-designed quality scoring function.

Good implementation with:

  • Weight validation using np.isclose for floating-point tolerance
  • Clear separation of individual metrics
  • Comprehensive return dictionary for debugging
  • Appropriate default weights (sharpness prioritized at 0.4)
backend/app/database/faces.py (1)

24-24: Quality field type annotation is correct.

The Optional[float] type correctly reflects that quality may be None for faces inserted before this feature was added.

backend/app/utils/face_clusters.py (1)

306-319: Quality-weighted cluster representative calculation is well implemented.

Good use of the calculate_cluster_mean function with use_medoid=CLUSTERING_USE_QUALITY_WEIGHTING to leverage quality scores when computing cluster representatives. This should improve assignment accuracy by using higher-quality faces as reference points.

backend/app/utils/clustering_advanced.py (4)

1-17: LGTM! Clean imports and clear module documentation.

The imports are appropriate for the clustering functionality, and the module docstring clearly outlines the provided features.


88-125: LGTM! Efficient DBSCAN wrapper with good defaults.

The adaptive epsilon selection and parallel processing (n_jobs=-1) are well-implemented. The cosine metric is appropriate for face embeddings.


128-163: LGTM! Correct mutual exclusivity and parameter handling.

The hierarchical clustering implementation correctly handles the mutual exclusivity between n_clusters and distance_threshold (line 156), and uses appropriate defaults for face embedding clustering.


291-323: LGTM! Useful example code for testing.

The example usage in the __main__ block provides a good quick test of the clustering functionality with synthetic data.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
backend/tests/test_clustering_improvements.py (4)

24-55: Consider adding random seeds for reproducibility.

Tests using np.random.randn without seeds are non-deterministic, which can occasionally cause test flakiness. While not critical here, adding seeds improves reproducibility during debugging.

🔎 Example fix for one test:
 def test_returns_float(self):
     """Test that function returns a float."""
+    np.random.seed(42)
     embeddings = np.random.randn(50, 128)

189-232: Add test for quality-weighted mean calculation.

The test suite covers simple mean and quality-weighted medoid, but misses the quality-weighted mean path (when use_medoid=False and quality_scores are provided). This code path is exercised in the implementation but not validated by tests.

🔎 Add this test:
def test_quality_weighted_mean(self):
    """Test quality-weighted mean calculation."""
    embeddings = np.array([
        [1.0, 0.0],
        [0.0, 1.0],
    ])
    quality_scores = np.array([0.8, 0.2])  # First embedding has higher quality
    
    result = calculate_cluster_mean(
        embeddings, use_medoid=False, quality_scores=quality_scores
    )
    
    # Result should be closer to first embedding due to quality weighting
    weights = quality_scores / quality_scores.sum()
    expected_raw = np.average(embeddings, axis=0, weights=weights)
    expected = expected_raw / np.linalg.norm(expected_raw)
    np.testing.assert_array_almost_equal(result, expected)

234-270: Add test for invalid algorithm error handling.

The cluster_faces function raises a ValueError for unknown algorithms, but this error path isn't tested. Adding a negative test improves coverage of error handling.

🔎 Add this test:
def test_invalid_algorithm(self):
    """Test that invalid algorithm raises ValueError."""
    embeddings = np.random.randn(30, 128)
    
    with pytest.raises(ValueError, match="Unknown algorithm"):
        cluster_faces(embeddings, algorithm="invalid")

301-315: Consider strengthening epsilon adaptation assertion.

The test creates dense (variance 0.01) and sparse (variance 10.0) data but only asserts both epsilons are positive. To better validate adaptive behavior, consider asserting that eps_sparse > eps_dense, which aligns with the intuition that sparser data should yield larger epsilon values.

🔎 Apply this diff:
     # Both should return valid positive epsilon
     assert eps_dense > 0
     assert eps_sparse > 0
+    # Sparse data should yield larger epsilon
+    assert eps_sparse > eps_dense
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7910d8c and cba7a9f.

📒 Files selected for processing (3)
  • backend/app/models/FaceDetector.py (3 hunks)
  • backend/tests/test_clustering_improvements.py (1 hunks)
  • backend/tests/test_face_quality.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/tests/test_face_quality.py
  • backend/app/models/FaceDetector.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/test_clustering_improvements.py (2)
backend/app/utils/clustering_advanced.py (6)
  • auto_select_epsilon (19-85)
  • cluster_dbscan_adaptive (88-125)
  • cluster_hierarchical (128-163)
  • calculate_quality_weighted_medoid (166-204)
  • calculate_cluster_mean (207-250)
  • cluster_faces (253-288)
backend/test.py (1)
  • main (35-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests
🔇 Additional comments (2)
backend/tests/test_clustering_improvements.py (2)

1-19: LGTM! Well-structured test module.

The imports are clean and comprehensive, covering all the clustering functions that need testing. The module docstring clearly describes the scope.


1-319: Comprehensive test coverage with room for edge cases.

This test suite provides solid coverage of the advanced clustering module with 25 tests spanning basic functionality, parameter variations, and integration workflows. The structure is clear and tests are well-documented.

A few edge cases could be added for even better coverage:

  • Empty cluster handling (should raise ValueError per implementation)
  • Single-element clusters
  • Error handling for invalid inputs

Overall, this is a well-crafted test module that validates the core clustering functionality effectively.


Copilot AI left a comment


Pull request overview

This PR enhances face clustering accuracy by introducing quality-based filtering, adaptive parameter selection, and robust cluster representatives. The implementation adds comprehensive quality assessment for detected faces and improves clustering algorithms with automatic epsilon selection and quality-weighted medoid calculation.

Key Changes:

  • New face quality assessment module scoring faces on sharpness, brightness, and size
  • Adaptive DBSCAN with automatic epsilon selection using k-NN and knee detection
  • Quality-weighted medoid calculation for more representative cluster centers

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 19 comments.

Summary per file

  • backend/tests/test_face_quality.py: Comprehensive unit tests (24 tests) for face quality assessment functions
  • backend/tests/test_clustering_improvements.py: Unit tests (25 tests) for advanced clustering algorithms including adaptive DBSCAN and hierarchical clustering
  • backend/app/utils/face_quality.py: New module implementing sharpness, brightness, and size assessment with configurable weighting
  • backend/app/utils/clustering_advanced.py: New module with adaptive DBSCAN, hierarchical clustering, and quality-weighted medoid calculation
  • backend/app/utils/face_clusters.py: Integration of quality filtering and advanced clustering into the main clustering workflow
  • backend/app/models/FaceDetector.py: Face detection enhanced with quality score calculation and storage
  • backend/app/database/faces.py: Database schema updated with quality column; queries modified to include quality scores
  • backend/test.py: Test script updated with new DBSCAN parameters
  • backend/requirements.txt: Added kneed>=0.8.5 dependency for knee detection in epsilon selection
Comments suppressed due to low confidence (1)

backend/app/models/FaceDetector.py:99

  • This assignment to 'close' is unnecessary as it is redefined before this value is used.
    def close(self):


Comment on lines 78 to 79
# Fallback heuristic: Use 90th percentile of k-distances
# This is a reasonable estimate for dense regions

Copilot AI Dec 18, 2025


The fallback heuristic using the 90th percentile is chosen without explanation. Consider adding a comment explaining why 90th percentile is preferred over other percentiles (e.g., 75th, 95th) for better maintainability and understanding of this design decision.

Suggested change
# Fallback heuristic: Use 90th percentile of k-distances
# This is a reasonable estimate for dense regions
# Fallback heuristic: use 90th percentile of k-distances.
# Rationale: lower percentiles (e.g., 75th) tend to under-estimate epsilon
# and can fragment dense clusters, while very high percentiles (e.g., 95th)
# tend to over-estimate epsilon and merge distinct identities. Empirically,
# the 90th percentile provides a reasonable balance for typical face-embedding
# densities when knee detection is unavailable or unreliable.

Copilot uses AI. Check for mistakes.
Comment on lines 83 to 85
except Exception:
# Any error in epsilon selection, use fallback
return fallback_eps

Copilot AI Dec 18, 2025


The bare except Exception catches all exceptions without logging what went wrong. This makes debugging difficult. Consider logging the exception or at least adding a comment explaining why silent failure is acceptable here.
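The pattern the review asks for, keeping the fallback behavior but recording why knee detection failed, could look like this (names and the stand-in failure are illustrative, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def auto_select_epsilon_safe(distances, fallback_eps: float = 0.3) -> float:
    """Return an auto-selected epsilon, logging the reason on fallback."""
    try:
        # ... knee detection on the k-distance curve would go here ...
        raise RuntimeError("knee not found")  # stand-in failure for the demo
    except Exception as exc:
        # Log instead of failing silently so fallbacks are debuggable.
        logger.warning("Epsilon auto-selection failed (%s); using fallback %.3f",
                       exc, fallback_eps)
        return fallback_eps

print(auto_select_epsilon_safe([0.1, 0.2]))  # falls back to 0.3
```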

Comment on lines 192 to 194
# Weight distances inversely by quality (lower quality = higher penalty)
# Add small constant to avoid division by zero
quality_weights = 1.0 / (quality_scores + 0.1)

Copilot AI Dec 18, 2025


The quality weighting logic is inverted. Using 1.0 / (quality_scores + 0.1) means higher quality scores result in higher penalties (larger weights in the distance calculation), which is opposite to the intended behavior. This should multiply distances by quality_scores directly, or use quality_scores without inversion to prioritize high-quality faces.

Suggested change
# Weight distances inversely by quality (lower quality = higher penalty)
# Add small constant to avoid division by zero
quality_weights = 1.0 / (quality_scores + 0.1)
# Weight distances by quality scores so higher-quality faces have greater influence
# (assumes larger quality_scores indicate better quality)
quality_weights = quality_scores

Comment on lines 166 to 150
eps: float = 0.3, min_samples: int = 2
) -> List[ClusterResult]:

Copilot AI Dec 18, 2025


The function signature still includes eps and min_samples parameters with default values, but the function description states that eps is only used if auto_epsilon=False. However, the function doesn't accept an auto_epsilon parameter - it uses the global CLUSTERING_AUTO_EPSILON constant. This creates a confusing API where parameters don't match the configuration mechanism. Consider either accepting auto_eps as a parameter or removing the eps parameter entirely.

backend/test.py Outdated

dbscan = DBSCAN(eps=0.3, min_samples=2, metric="cosine")
# Updated parameters to match production improvements
# eps=0.2 for tighter clustering, min_samples=1 to include all faces

Copilot AI Dec 18, 2025


The comment states "min_samples=1 to include all faces", which is imprecise. In scikit-learn's DBSCAN the eps-neighborhood of a point includes the point itself, so min_samples=1 makes every point a core point and no face is labeled noise (-1); "include" here means every face lands in some cluster, possibly a singleton, not that faces are merged into existing clusters. Consider clarifying this comment.

Suggested change
# eps=0.2 for tighter clustering, min_samples=1 to include all faces
# eps=0.2 for tighter clustering; min_samples=1 makes every point a core point, so no face is labeled noise (each face joins a cluster, possibly a singleton)

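The min_samples semantics are easy to check directly against scikit-learn's DBSCAN (which counts a point in its own neighborhood, so min_samples=1 makes every point a core point):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two nearby faces plus one isolated outlier far beyond eps.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0]])

labels_ms2 = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)  # outlier labeled noise (-1)
labels_ms1 = DBSCAN(eps=0.3, min_samples=1).fit_predict(X)  # outlier becomes a singleton cluster
print(labels_ms2)  # → [ 0  0 -1]
print(labels_ms1)  # → [0 0 1]
```

So with min_samples=1 no face is dropped as noise, but isolated faces form clusters of one rather than joining an existing identity.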
# Calculate mean embeddings for each cluster
cluster_means = []
for cluster_id, embeddings_list in cluster_embeddings.items():
# Stack all embeddings for this cluster and calculate mean
Copilot AI Dec 18, 2025

The comment "Stack all embeddings for this cluster and calculate mean" is now inaccurate. The function stacks embeddings but no longer calculates the mean here. Remove or update this comment to reflect the actual behavior.

Suggested change
-# Stack all embeddings for this cluster and calculate mean
+# Stack all embeddings for this cluster

# Calculate pairwise cosine distances
dist_matrix = cosine_distances(embeddings)

# Weight distances inversely by quality (lower quality = higher penalty)
Copilot AI Dec 18, 2025

The comment states "Weight distances inversely by quality (lower quality = higher penalty)" but this contradicts the intended behavior described in the docstring which says "Quality weighting prioritizes high-quality faces." The comment should be corrected to match the intended implementation once the weighting logic is fixed.

Suggested change
-# Weight distances inversely by quality (lower quality = higher penalty)
+# Weight distances so that lower-quality faces incur a higher penalty,
+# biasing the medoid selection toward higher-quality embeddings.


import pytest
import numpy as np

Copilot AI Dec 18, 2025

Import of 'Tuple' is not used.

Suggested change: remove the unused `Tuple` import.


import pytest
import numpy as np

Copilot AI Dec 18, 2025

Import of 'cv2' is not used.

Suggested change: remove the unused `cv2` import.

psutil>=5.9.5
pytest-asyncio>=1.0.0
setuptools==66.1.1
kneed>=0.8.5
Copilot AI Dec 18, 2025

The new dependency kneed>=0.8.5 is specified with an open-ended version range, so builds will automatically pull whatever latest version satisfies the constraint. This increases supply-chain risk compared to the other fully pinned runtime dependencies in this file: if an attacker compromises the kneed package or its distribution channel, a future malicious release could be installed and executed in your backend environment with access to application data and secrets. To reduce this risk, pin kneed to a specific, vetted version (and optionally enforce hashes) so that builds are reproducible and cannot silently upgrade to untrusted code.

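A pinned entry might look like the following (0.8.5 is used here only because it is the existing lower bound; the actual vetted version is a project decision, and pip-tools' `pip-compile --generate-hashes` is one common way to add hash enforcement):

```
# requirements.txt — pin instead of an open-ended range
kneed==0.8.5
# optionally regenerate the full file with hashes: pip-compile --generate-hashes
```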
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/utils/face_clusters.py (1)

165-228: API inconsistency between function parameters and module-level configuration.

The function signature includes eps and min_samples parameters, but the implementation primarily uses module-level constants (CLUSTERING_AUTO_EPSILON, CLUSTERING_FIXED_EPSILON, etc.). The docstring references auto_epsilon which isn't a function parameter. This creates confusion about what controls clustering behavior.

Consider either:

  1. Making these function parameters and removing module constants, OR
  2. Removing the function parameters and using only module constants
🔎 Option 1: Use function parameters consistently:
 def cluster_util_cluster_all_face_embeddings(
-    eps: float = 0.3, min_samples: int = 2
+    algorithm: str = "dbscan",
+    auto_eps: bool = True,
+    fixed_eps: float = 0.3,
+    min_samples: int = 2,
 ) -> List[ClusterResult]:
     """
-    Cluster face embeddings using advanced clustering algorithms with quality filtering.
+    Cluster face embeddings using advanced algorithms with quality filtering.
 
     Args:
+        algorithm: Clustering algorithm ('dbscan' or 'hierarchical')
+        auto_eps: Whether to auto-select epsilon for DBSCAN
-        eps: DBSCAN epsilon parameter (used only if auto_epsilon=False)
+        fixed_eps: Fixed epsilon if auto_eps=False
         min_samples: Minimum samples parameter for core points
 
     Returns:
         List of ClusterResult objects
     """
     # ... quality filtering ...
     
     # Perform clustering using provided parameters
     cluster_labels = cluster_faces(
         embeddings_array,
-        algorithm=CLUSTERING_ALGORITHM,
+        algorithm=algorithm,
         min_samples=min_samples,
-        auto_eps=CLUSTERING_AUTO_EPSILON,
+        auto_eps=auto_eps,
-        fixed_eps=eps,
+        fixed_eps=fixed_eps,
-        distance_threshold=HIERARCHICAL_DISTANCE_THRESHOLD,
-        linkage=HIERARCHICAL_LINKAGE,
+        distance_threshold=HIERARCHICAL_DISTANCE_THRESHOLD,  # or add as param
+        linkage=HIERARCHICAL_LINKAGE,  # or add as param
     )

Option 2: Remove function parameters entirely:

-def cluster_util_cluster_all_face_embeddings(
-    eps: float = 0.3, min_samples: int = 2
-) -> List[ClusterResult]:
+def cluster_util_cluster_all_face_embeddings() -> List[ClusterResult]:
     """
     Cluster face embeddings using advanced clustering algorithms with quality filtering.
+    
+    Configuration is controlled via module-level constants:
+    - CLUSTERING_ALGORITHM, CLUSTERING_AUTO_EPSILON, etc.
 
-    Args:
-        eps: DBSCAN epsilon parameter (used only if auto_epsilon=False)
-        min_samples: Minimum samples parameter for core points
-
     Returns:
         List of ClusterResult objects
     """
🧹 Nitpick comments (2)
backend/tests/test_clustering_improvements.py (1)

304-318: Consider asserting relative epsilon values to verify adaptive behavior.

The test creates dense and sparse data but only verifies both epsilon values are positive. To better validate the adaptive epsilon selection, consider asserting that sparse data yields a larger epsilon than dense data.

🔎 Suggested enhancement:
     # Very sparse cluster (large variance)
     sparse = np.random.randn(50, 128) * 10.0
     eps_sparse = auto_select_epsilon(sparse)
 
     # Both should return valid positive epsilon
     assert eps_dense > 0
     assert eps_sparse > 0
+    # Sparse data should yield larger epsilon than dense data
+    assert eps_sparse > eps_dense
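The adaptive behavior being tested can be sketched without the PR's internals. In this simplified stand-in, a high percentile over the sorted k-NN distance curve replaces kneed's knee detection (the function name and defaults are illustrative, not the PR's API):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_epsilon(embeddings: np.ndarray, k: int = 4, percentile: float = 90.0) -> float:
    """Approximate a good DBSCAN eps from the k-NN distance curve."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(embeddings)
    dists, _ = nn.kneighbors(embeddings)
    kth = np.sort(dists[:, -1])  # distance to each point's k-th neighbor, ascending
    return float(np.percentile(kth, percentile))

rng = np.random.default_rng(0)
dense = rng.normal(size=(50, 128)) * 0.01 + 1.0   # tight cluster -> small eps
sparse = rng.normal(size=(50, 128)) * 10.0         # diffuse cloud -> large eps
eps_dense = knn_epsilon(dense)
eps_sparse = knn_epsilon(sparse)
```

Under this heuristic the sparse cloud yields a clearly larger epsilon than the tight cluster, which is exactly the relative assertion the test could add.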
backend/app/utils/face_clusters.py (1)

35-44: Consider organizing configuration constants for better maintainability.

The module-level constants are functional but could be better organized. As the number of configuration options grows, consider grouping them into a configuration dataclass or dictionary.

🔎 Example refactoring approach:
from dataclasses import dataclass

@dataclass
class ClusteringConfig:
    algorithm: Literal["dbscan", "hierarchical"] = "dbscan"
    auto_epsilon: bool = True
    fixed_epsilon: float = 0.3
    min_samples: int = 2
    use_quality_weighting: bool = True
    quality_filter_enabled: bool = True
    quality_min_threshold: float = 0.4
    hierarchical_linkage: str = "average"
    hierarchical_distance_threshold: float = 0.3

CLUSTERING_CONFIG = ClusteringConfig()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cba7a9f and ec4c7dd.

📒 Files selected for processing (3)
  • backend/app/utils/clustering_advanced.py (1 hunks)
  • backend/app/utils/face_clusters.py (6 hunks)
  • backend/tests/test_clustering_improvements.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/utils/clustering_advanced.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/utils/face_clusters.py (3)
backend/app/utils/clustering_advanced.py (2)
  • cluster_faces (255-290)
  • calculate_cluster_mean (209-252)
backend/app/utils/face_quality.py (1)
  • filter_quality_faces (173-184)
backend/app/database/faces.py (3)
  • db_get_all_faces_with_cluster_names (261-301)
  • db_get_faces_unassigned_clusters (230-258)
  • db_get_cluster_mean_embeddings (349-406)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Backend Tests
🔇 Additional comments (8)
backend/tests/test_clustering_improvements.py (6)

21-56: LGTM! Comprehensive coverage of epsilon selection.

The test cases appropriately validate return types, value constraints, fallback behavior with insufficient samples, and custom parameter handling.


86-99: Well done! Previous weak assertion has been strengthened.

The test now correctly asserts exactly 2 clusters and verifies no noise points, which is the expected behavior for well-separated data with fixed epsilon. The fixed seed ensures reproducibility.


101-137: LGTM! Tests appropriately validate hierarchical clustering behavior.

The tests correctly verify that hierarchical clustering assigns all points to clusters (no noise), supports configurable cluster counts, and works with all linkage methods.


171-190: Excellent fix! Zero-vector issue resolved.

The test now uses a proper unit vector (np.ones(128) / np.sqrt(128)) instead of zeros, which prevents undefined cosine distance calculations while maintaining the test's intent to verify quality-based medoid selection.


192-235: LGTM! Tests comprehensively validate cluster representative calculation.

The tests correctly verify normalized mean calculation, medoid selection with quality weighting, and the important property that medoids are actual data points rather than computed averages.


237-273: LGTM! Tests appropriately cover the main clustering interface.

The tests validate both DBSCAN and hierarchical algorithm paths, verify all embeddings receive labels, and support custom parameters.

backend/app/utils/face_clusters.py (2)

186-200: LGTM! Quality filtering is well-implemented.

The filtering logic correctly applies the quality threshold, logs diagnostic information about filtered faces, and handles the edge case where all faces are filtered out with an early return.
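For readers unfamiliar with the scoring being filtered on, here is a hedged sketch of the general approach. The weights, scale constants, and component definitions are assumptions for illustration, not the PR's exact face_quality.py implementation; input is a float grayscale crop in 0–255:

```python
import numpy as np

def face_quality(gray: np.ndarray, min_side: int = 40) -> float:
    """Score a float grayscale face crop in [0, 1] from sharpness, brightness, size."""
    # Sharpness: variance of a discrete Laplacian; blurry crops score near 0.
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    sharpness = min(lap.var() / 1000.0, 1.0)  # scale constant is an assumption
    # Brightness: best near mid-gray, worst at pure black or white.
    brightness = 1.0 - abs(gray.mean() / 255.0 - 0.5) * 2.0
    # Size: crops smaller than min_side pixels on a side are unreliable.
    size = min(min(gray.shape) / min_side, 1.0)
    return float(0.5 * sharpness + 0.25 * brightness + 0.25 * size)

rng = np.random.default_rng(0)
textured = rng.uniform(0.0, 255.0, size=(64, 64))  # sharp, detailed crop
flat = np.full((64, 64), 128.0)                    # featureless (e.g. defocused) crop
```

A featureless crop loses the entire sharpness component and lands well below a sharp one, which is the kind of gap a 0.4 threshold exploits.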


305-361: LGTM! Quality-aware face assignment is well-implemented.

The assignment logic effectively integrates quality assessment at multiple levels:

  • Uses quality-weighted cluster representatives (medoid or mean)
  • Filters low-quality faces during assignment
  • Provides debug logging for non-assignments and skipped faces

This should significantly improve clustering accuracy by preventing low-quality faces from being incorrectly assigned.
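The assignment pattern described above can be sketched compactly (threshold values and the function shape are hypothetical, not the PR's exact code):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def assign_faces(face_embs, face_qualities, cluster_reps,
                 max_dist: float = 0.3, min_quality: float = 0.4):
    """Assign each face to its nearest cluster representative, or -1 if it is
    too low-quality or too far from every cluster."""
    dists = cosine_distances(face_embs, cluster_reps)
    labels = []
    for quality, row in zip(face_qualities, dists):
        if quality < min_quality or row.min() > max_dist:
            labels.append(-1)  # leave unassigned rather than force a bad match
        else:
            labels.append(int(row.argmin()))
    return labels

reps = np.array([[1.0, 0.0], [0.0, 1.0]])
faces = np.array([[0.99, 0.1], [0.0, 1.0], [-1.0, -1.0]])
qualities = [0.9, 0.2, 0.9]
print(assign_faces(faces, qualities, reps))  # → [0, -1, -1]
```

The second face is rejected on quality despite matching a representative exactly, and the third on distance, illustrating the two skip paths the review describes.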


Development

Successfully merging this pull request may close these issues.

BUG: Face Clustering Quality Issues
