feat: Improve face clustering algorithm #833

SiddharthJiyani · 2025-12-24T06:33:10Z

Fixes BUG: Face Clustering Quality Issues #793

Problem

Face clustering was merging different people into the same cluster due to bridge-point chaining, where a single intermediate face could connect two distinct people's clusters.
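The chaining effect can be reproduced on toy data. The sketch below is not the PR's code: it uses 1-D numbers as stand-ins for embeddings and a hypothetical `link_components` helper that links any two points within `eps`. That transitive linking is what lets one intermediate point join two otherwise separate groups.

```python
from itertools import combinations


def link_components(points, eps):
    """Group 1-D points into connected components, linking pairs within eps."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if abs(points[i] - points[j]) <= eps:
            parent[find(i)] = find(j)

    groups = {}
    for i, value in enumerate(points):
        groups.setdefault(find(i), []).append(value)
    return sorted(groups.values())


# Two tight groups, toy stand-ins for two different people.
person_a = [0.00, 0.05]
person_b = [0.60, 0.65]

without_bridge = link_components(person_a + person_b, eps=0.3)
# A single in-between point is within eps of both groups and chains them.
with_bridge = link_components(person_a + [0.32] + person_b, eps=0.3)

print(len(without_bridge), len(with_bridge))
```

Without the intermediate point the two groups stay apart; with it, every point ends up in one component even though the two groups are never within `eps` of each other.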

Solution

Implemented a two-phase clustering approach:

1. DBSCAN with min_samples=2 - Prevents single faces from acting as bridges between clusters
2. Post-cluster mean merge - Merges clusters with similar mean embeddings (threshold: 0.28) to fix same-person splits caused by pose/lighting variation
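The post-cluster mean merge can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code in face_clusters.py: the helper names (`cosine_distance`, `mean_vector`, `merge_by_mean`) and the toy 2-D embeddings are assumptions, and only the 0.28 threshold comes from the description above. As in the diff excerpt quoted in the review, each cluster's mean is computed once and is not updated as clusters are folded in.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; a zero-length vector is treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def merge_by_mean(clusters, threshold=0.28):
    """Greedily fold cluster j into cluster i when their mean embeddings are close.

    clusters maps a label to a list of embedding vectors.
    """
    items = list(clusters.items())
    merged, used = {}, set()
    for i, (label_i, faces_i) in enumerate(items):
        if label_i in used:
            continue
        merged[label_i] = list(faces_i)  # copy so the input is not mutated
        mean_i = mean_vector(faces_i)
        for label_j, faces_j in items[i + 1:]:
            if label_j in used:
                continue
            if cosine_distance(mean_i, mean_vector(faces_j)) < threshold:
                merged[label_i].extend(faces_j)
                used.add(label_j)
    return merged


toy_clusters = {
    0: [[1.0, 0.0], [0.9, 0.1]],  # person A
    1: [[0.95, 0.05]],            # person A again, split off by DBSCAN
    2: [[0.0, 1.0]],              # person B
}
merged_clusters = merge_by_mean(toy_clusters)
print(sorted(merged_clusters))
```

In this toy input, clusters 0 and 1 have the same mean direction and are merged, while cluster 2 is nearly orthogonal and stays separate.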

Changes

- face_clusters.py: Added post-merge logic and _calculate_cosine_distance() helper
- clustering_conservative.py: Code cleanup (removed unused functions)
- clustering_advanced.py: Code cleanup (removed unused functions)
- test_clustering_algorithm.py: Added 16 comprehensive unit tests

Configuration

CLUSTERING_ALGORITHM = "dbscan"
CLUSTERING_MIN_SAMPLES = 2  # Prevents bridge chaining
CLUSTERING_FIXED_EPSILON = 0.35
POST_MERGE_ENABLED = True
POST_MERGE_MEAN_DISTANCE_THRESHOLD = 0.28

Benefits

- Reduces mis-clustered faces by filtering poor quality detections
- Automatically adapts to different photo collections
- More robust cluster representatives

Demo

Before

face_clustering_before.mp4

After

To see the clustering results directly, please jump to 00:40.

face_clustering_after.mp4

Summary by CodeRabbit

Release Notes

New Features
- Face detection now evaluates image quality based on sharpness, brightness, and size metrics
- Improved face clustering with advanced algorithms and quality-based filtering to exclude low-quality detections
- Added automatic detection and consolidation of near-duplicate face groups post-clustering
- Face quality scores are now tracked and retrievable with each stored face record

coderabbitai · 2025-12-24T06:33:33Z

Caution

Review failed

The pull request is closed.

📝 Walkthrough

This PR introduces face quality assessment and integrates it into the face clustering pipeline. New modules provide quality computation based on sharpness, brightness, and size metrics. Advanced clustering strategies (conservative, DBSCAN, hierarchical) are implemented with quality-based filtering to exclude low-quality faces before clustering. Quality scores are persisted in the database and propagated through detection and storage layers.
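A weighted composite score of the kind described above might look like the sketch below. The weights, the 0.3 cut-off, and the function names (`composite_quality`, `keep_face`) are illustrative assumptions and are not taken from face_quality.py; the sketch also assumes each metric has already been scaled to the range 0 to 1.

```python
def clamp01(value):
    """Clamp a score into the [0, 1] range."""
    return max(0.0, min(1.0, value))


def composite_quality(sharpness, brightness, size, weights=(0.4, 0.3, 0.3)):
    """Weighted average of three per-face scores (weights are assumed values)."""
    scores = (clamp01(sharpness), clamp01(brightness), clamp01(size))
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


def keep_face(quality, min_threshold=0.3):
    """Decide whether a face is good enough to enter clustering (assumed cut-off)."""
    return quality >= min_threshold


sharp_face = composite_quality(0.9, 0.8, 0.7)
blurry_face = composite_quality(0.1, 0.5, 0.2)
print(keep_face(sharp_face), keep_face(blurry_face))
```

Dividing by the sum of the weights keeps the result in the same 0 to 1 range even if the weights are later retuned and no longer sum to one.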

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Database Schema and Retrieval `backend/app/database/faces.py` | Added `quality` field to FaceData TypedDict and faces table schema with default 0.5. Updated `db_insert_face_embeddings*` functions to accept and store quality parameter. Modified retrieval functions (`db_get_faces_unassigned_clusters`, `db_get_all_faces_with_cluster_names`, `db_get_cluster_mean_embeddings`) to return quality scores with COALESCE defaults. |
| Face Quality Assessment `backend/app/utils/face_quality.py` | NEW: Multi-criteria face quality assessment module. Provides `assess_face_sharpness`, `assess_face_brightness`, `assess_face_size`, and `calculate_face_quality` (weighted composite score). Includes utility filters: `should_include_face`, `filter_quality_faces`. |
| Conservative Clustering Implementation `backend/app/utils/clustering_conservative.py` | NEW: Implements conservative face clustering with validation and safe merging. Core components: `compute_pairwise_distances`, `select_conservative_epsilon`, `validate_cluster`, `split_loose_cluster`, and `ConservativeFaceClustering` class with `fit_predict`, cluster validation/splitting, and merge logic. |
| Advanced Clustering Strategies `backend/app/utils/clustering_advanced.py` | NEW: Multi-algorithm clustering interface routing to conservative, DBSCAN, or hierarchical methods. Provides `cluster_faces` (central API), `cluster_faces_dbscan`, `cluster_faces_hierarchical`, `calculate_adaptive_eps`, `get_cluster_stats`, `calculate_cluster_mean`. |
| Face Detection Integration `backend/app/models/FaceDetector.py` | Integrated quality assessment into face detection pipeline. Added `import` for `calculate_face_quality`, accumulates per-face quality scores during extraction, and logs detailed metrics (quality, sharpness, brightness, size). Passes `quality=qualities` to `db_insert_face_embeddings_by_image_id` along with daily quality statistics. |
| Clustering System Integration `backend/app/utils/face_clusters.py` | Replaced direct DBSCAN with advanced `cluster_faces` workflow supporting multiple algorithms. Added quality filtering configuration (`CLUSTERING_QUALITY_FILTER_ENABLED`, `CLUSTERING_QUALITY_MIN_THRESHOLD`) and post-merge step (`POST_MERGE_ENABLED`, `POST_MERGE_MEAN_DISTANCE_THRESHOLD`). Updated `cluster_util_cluster_all_face_embeddings` and `cluster_util_assign_cluster_to_faces_without_clusterId` signatures; added clustering algorithm selection and metrics logging. |
| Dependencies and Configuration `backend/requirements.txt`, `backend/test.py` | Added `kneed>=0.8.5` dependency. Updated DBSCAN parameter from `eps=0.3` to `eps=0.35` with alignment comment. |
| Test Coverage `backend/tests/test_clustering_algorithm.py`, `backend/tests/test_face_quality.py` | NEW: Comprehensive unit tests for clustering (cosine distance, configuration validation, no-faces/single-face/general clustering, noise handling, post-merge behavior, quality filtering, cluster naming) and face quality assessment (sharpness, brightness, size, weighted composite, threshold filtering). |

Sequence Diagram

sequenceDiagram
    participant User
    participant FaceDetector
    participant QualityModule as Face Quality<br/>Module
    participant ClusteringModule as Clustering<br/>Module
    participant Database
    
    User->>FaceDetector: detect_faces(image)
    activate FaceDetector
    FaceDetector->>FaceDetector: extract face regions
    
    loop for each face
        FaceDetector->>QualityModule: calculate_face_quality(face_image)
        activate QualityModule
        QualityModule->>QualityModule: assess_sharpness()
        QualityModule->>QualityModule: assess_brightness()
        QualityModule->>QualityModule: assess_size()
        QualityModule->>QualityModule: weighted composite score
        QualityModule-->>FaceDetector: quality_score
        deactivate QualityModule
        
        FaceDetector->>FaceDetector: accumulate embeddings + quality
    end
    
    FaceDetector->>Database: db_insert_face_embeddings_by_image_id<br/>(embeddings, quality=qualities)
    activate Database
    Database->>Database: store faces + quality scores
    Database-->>FaceDetector: inserted face IDs
    deactivate Database
    
    User->>ClusteringModule: trigger clustering
    activate ClusteringModule
    ClusteringModule->>Database: get_all_face_embeddings()
    activate Database
    Database-->>ClusteringModule: embeddings + quality scores
    deactivate Database
    
    ClusteringModule->>ClusteringModule: filter by quality_threshold
    rect rgb(200, 220, 240)
        Note over ClusteringModule: Quality Filter: exclude low-quality faces
    end
    
    ClusteringModule->>ClusteringModule: cluster_faces(filtered_embeddings<br/>algorithm="conservative"|"dbscan")
    rect rgb(240, 220, 200)
        alt Conservative Algorithm
            ClusteringModule->>ClusteringModule: normalize embeddings
            ClusteringModule->>ClusteringModule: select_conservative_epsilon()
            ClusteringModule->>ClusteringModule: run DBSCAN + validation
            ClusteringModule->>ClusteringModule: split loose clusters
            ClusteringModule->>ClusteringModule: safe_merge nearby clusters
        else DBSCAN/Hierarchical
            ClusteringModule->>ClusteringModule: adaptive epsilon or fixed params
            ClusteringModule->>ClusteringModule: apply selected algorithm
        end
        Note over ClusteringModule: Clustering Algorithms
    end
    
    ClusteringModule->>ClusteringModule: post-merge if enabled
    
    ClusteringModule->>Database: update cluster assignments
    activate Database
    Database->>Database: store cluster_id for each face
    Database-->>ClusteringModule: success
    deactivate Database
    
    ClusteringModule-->>User: clustering complete
    deactivate ClusteringModule
    deactivate FaceDetector

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

GSoC 2025 Backend Revamp #466: Adds per-face quality assessment and integrates it into clustering—directly overlaps with quality computation, FaceDetector integration, and face_clusters workflow.
Fix PR #524 Lint and Build Errors #525: Modifies same database modules (backend/app/database/faces.py) and face-related APIs affecting detection and search routing.
Fix: Critical database connection leaks in all database functions #547: Overlaps in backend/app/database/faces.py with changes to face schema and retrieval functions (db_create_faces_table, db_get_cluster_mean_embeddings).

Suggested labels

backend, enhancement

Suggested reviewers

rahulharpal1603

Poem

🐰 Whiskers twitching with delight
Quality faces shine so bright!
Clustering now sees sharp from blurred,
Each person's photos finally heard.
Conservative wisdom guides the way,
Better groupings here to stay! 🎨✨

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and 50bb3b1.

📒 Files selected for processing (10)

backend/app/database/faces.py
backend/app/models/FaceDetector.py
backend/app/utils/clustering_advanced.py
backend/app/utils/clustering_conservative.py
backend/app/utils/face_clusters.py
backend/app/utils/face_quality.py
backend/requirements.txt
backend/test.py
backend/tests/test_clustering_algorithm.py
backend/tests/test_face_quality.py

Copilot

Pull request overview

This PR improves the face clustering algorithm to fix bridge-point chaining issues where single intermediate faces incorrectly merge distinct people into the same cluster. The solution implements a two-phase approach: DBSCAN clustering with min_samples=2 to prevent single-face bridges, followed by a post-cluster mean merge step (threshold: 0.28) to rejoin same-person clusters split by pose/lighting variations.

Key changes:

Introduced comprehensive face quality assessment based on sharpness, brightness, and size metrics
Implemented conservative and advanced clustering algorithms with configurable parameters
Added extensive unit test coverage (261 lines for quality tests, 368 lines for clustering tests)

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `backend/app/utils/face_quality.py` | New module for assessing face quality based on sharpness, brightness, and size |
| `backend/app/utils/clustering_conservative.py` | New conservative clustering implementation with validation and safe merging |
| `backend/app/utils/clustering_advanced.py` | New advanced clustering entry point supporting multiple algorithms |
| `backend/app/utils/face_clusters.py` | Updated main clustering logic with post-merge functionality and quality filtering |
| `backend/app/models/FaceDetector.py` | Integrated quality assessment during face detection |
| `backend/app/database/faces.py` | Added `quality` column to faces table with default value 0.5 |
| `backend/tests/test_face_quality.py` | Comprehensive unit tests for face quality assessment functions |
| `backend/tests/test_clustering_algorithm.py` | Comprehensive unit tests for clustering algorithm and post-merge logic |
| `backend/test.py` | Updated test script to match production clustering parameters |
| `backend/requirements.txt` | Added `kneed>=0.8.5` dependency |

Copilot · 2025-12-24T06:37:59Z

backend/app/utils/face_clusters.py

+            mean_i = np.mean([f["embedding"] for f in faces_i], axis=0)
+            merged[label_i] = list(faces_i)  # Copy the list
+
+            for j in range(i + 1, len(cluster_items)):
+                label_j, faces_j = cluster_items[j]
+                if label_j in used:
+                    continue
+
+                mean_j = np.mean([f["embedding"] for f in faces_j], axis=0)
+                dist = _calculate_cosine_distance(mean_i, mean_j)


The mean embeddings used in the post-merge logic are not normalized before calculating cosine distance. This could lead to incorrect distance calculations. Consider normalizing mean_i and mean_j before passing them to _calculate_cosine_distance, or normalize within the distance calculation function itself.

Copilot · 2025-12-24T06:37:59Z

backend/app/utils/face_clusters.py

+    norm_a = embedding_a / np.linalg.norm(embedding_a)
+    norm_b = embedding_b / np.linalg.norm(embedding_b)
+    similarity = np.dot(norm_a, norm_b)
+    return 1 - similarity


The _calculate_cosine_distance function doesn't handle the case where either embedding has zero norm, which would cause a division by zero error. Consider adding a check or using np.maximum(np.linalg.norm(embedding_a), 1e-10) similar to how normalization is handled elsewhere in the codebase.
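One way to guard the zero-norm case is to clamp each norm to a small floor before dividing, mirroring the `np.maximum(norm, 1e-10)` pattern mentioned above. The sketch below uses plain Python lists so it has no dependencies; `safe_cosine_distance` is a hypothetical name, not the helper in face_clusters.py. Note that with NumPy float arrays a zero norm typically produces NaN plus a RuntimeWarning rather than an exception, so the failure can pass unnoticed in a later `dist < threshold` comparison.

```python
import math

NORM_FLOOR = 1e-10  # same floor value the review comment suggests


def safe_cosine_distance(embedding_a, embedding_b):
    """Cosine distance that stays finite when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(embedding_a, embedding_b))
    norm_a = max(math.sqrt(sum(x * x for x in embedding_a)), NORM_FLOOR)
    norm_b = max(math.sqrt(sum(y * y for y in embedding_b)), NORM_FLOOR)
    return 1.0 - dot / (norm_a * norm_b)


same_direction = safe_cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = safe_cosine_distance([1.0, 0.0], [0.0, 1.0])
zero_vector = safe_cosine_distance([0.0, 0.0], [1.0, 0.0])
print(same_direction, orthogonal, zero_vector)
```

With the floor in place a zero vector has similarity 0 to everything, so its distance is 1.0 and it cannot trigger a merge at a threshold such as 0.28.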

Copilot · 2025-12-24T06:37:59Z

backend/app/database/faces.py

                embeddings TEXT,
                confidence REAL,
                bbox TEXT,
+                quality REAL DEFAULT 0.5,


The database schema adds a new quality column, but there's no migration script to handle existing databases. Existing rows will use the DEFAULT 0.5 value, but this doesn't address the case where the table already exists without the quality column. Consider adding a migration script or ALTER TABLE logic to handle schema updates for existing installations.
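A guarded ALTER TABLE along the lines suggested above could look like this. It is a sketch under the assumption that the project stores faces in SQLite (the TEXT/REAL column types suggest it, but the excerpt does not confirm it); `ensure_quality_column` and the `face_id` column in the demo table are hypothetical names.

```python
import sqlite3


def ensure_quality_column(conn, default=0.5):
    """Add faces.quality if it is missing; return True when a migration ran."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(faces)")}
    if "quality" in columns:
        return False
    conn.execute(
        f"ALTER TABLE faces ADD COLUMN quality REAL DEFAULT {float(default)}"
    )
    conn.commit()
    return True


# Demo: a pre-existing table created before the quality column was introduced.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE faces (face_id INTEGER PRIMARY KEY, embeddings TEXT, "
    "confidence REAL, bbox TEXT)"
)
conn.execute("INSERT INTO faces (embeddings, confidence, bbox) VALUES ('[]', 0.9, '{}')")

first_run = ensure_quality_column(conn)   # adds the column
second_run = ensure_quality_column(conn)  # no-op on an already migrated table
old_row_quality = conn.execute("SELECT quality FROM faces").fetchone()[0]
print(first_run, second_run, old_row_quality)
```

Because the check reads the live schema, the function is safe to call on every startup: fresh databases that already have the column are left alone, and rows that existed before the migration read back the default.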

Copilot · 2025-12-24T06:38:00Z

backend/app/utils/clustering_advanced.py

+def cluster_faces(
+    embeddings: NDArray,
+    algorithm: str = "conservative",
+    eps: float = 0.25,
+    min_samples: int = 2,
+    max_cluster_diameter: float = 0.60,
+    auto_eps: bool = True,
+    distance_threshold: float = 0.5,
+    n_clusters: Optional[int] = None,
+    merge_close_clusters: bool = True,
+    merge_threshold: float = 0.40,
+    **kwargs,
+) -> NDArray:


cluster_faces is called with a fixed_eps argument (face_clusters.py line 262), but the signature defines eps, not fixed_eps. Because the signature ends in **kwargs, the call does not raise a TypeError at this point; unless kwargs is forwarded to something that rejects it, fixed_eps is silently ignored and eps keeps its default of 0.25. Either rename the parameter to fixed_eps or update the caller to pass eps=.
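How a signature ending in `**kwargs` treats an unrecognized keyword can be checked with a small stand-in. `cluster_faces_stub` below is hypothetical and only mirrors the shape of the signature in the diff; it reports which values it would actually use.

```python
def cluster_faces_stub(embeddings, eps=0.25, min_samples=2, **kwargs):
    """Stand-in with the same signature shape; reports the parameters in effect."""
    return {
        "eps": eps,
        "min_samples": min_samples,
        "ignored": sorted(kwargs),  # keywords that matched no named parameter
    }


# The caller intends eps=0.35 but spells the keyword fixed_eps.
result = cluster_faces_stub([], fixed_eps=0.35)
print(result)
```

The call succeeds, `eps` stays at its default, and the intended value only shows up among the ignored keywords, which is why this kind of mismatch tends to surface as a quality regression rather than an error.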

Copilot · 2025-12-24T06:38:00Z

backend/app/utils/clustering_advanced.py

+def cluster_faces(
+    embeddings: NDArray,
+    algorithm: str = "conservative",
+    eps: float = 0.25,
+    min_samples: int = 2,
+    max_cluster_diameter: float = 0.60,
+    auto_eps: bool = True,
+    distance_threshold: float = 0.5,
+    n_clusters: Optional[int] = None,
+    merge_close_clusters: bool = True,
+    merge_threshold: float = 0.40,
+    **kwargs,
+) -> NDArray:


cluster_faces is called with a density_refinement argument (face_clusters.py line 265), but this parameter is not defined in the function signature and is not handled in any of the algorithm branches. It is silently ignored because of **kwargs; it should either be added as an explicit parameter or removed from the calling code.
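If `**kwargs` has to stay for compatibility, one option is to fail loudly on keywords that nothing consumes. The helper and the trimmed-down function below are hypothetical illustrations, not part of clustering_advanced.py.

```python
def reject_unknown_kwargs(func_name, kwargs, allowed=()):
    """Raise TypeError for keyword arguments that no code path consumes."""
    unknown = sorted(set(kwargs) - set(allowed))
    if unknown:
        raise TypeError(
            f"{func_name}() got unexpected keyword argument(s): {', '.join(unknown)}"
        )


def cluster_faces_strict(embeddings, algorithm="conservative", eps=0.25, **kwargs):
    """Trimmed-down stand-in that validates its extra keywords up front."""
    reject_unknown_kwargs("cluster_faces_strict", kwargs)
    return algorithm, eps


accepted = cluster_faces_strict([], eps=0.35)

try:
    cluster_faces_strict([], density_refinement=True)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(accepted, error_message)
```

The `allowed` tuple leaves room for keywords that are deliberately forwarded to a specific algorithm, so the check can be tightened gradually instead of removing `**kwargs` in one step.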

Copilot · 2025-12-24T06:38:00Z

backend/app/utils/clustering_advanced.py

+def cluster_faces_hierarchical(
+    embeddings: NDArray,
+    n_clusters: Optional[int] = None,
+    distance_threshold: float = 0.5,
+) -> NDArray:
+    """
+    Cluster face embeddings using hierarchical clustering.
+
+    Uses complete linkage which ensures all pairs in a cluster
+    are within the distance threshold (conservative).
+
+    Args:
+        embeddings: Face embeddings (n_faces, embedding_dim)
+        n_clusters: Number of clusters (mutually exclusive with distance_threshold)
+        distance_threshold: Max distance within cluster (if n_clusters is None)
+
+    Returns:
+        Cluster labels
+    """
+    n_samples = len(embeddings)
+
+    if n_samples < 2:
+        return np.zeros(n_samples, dtype=int)
+
+    # Normalize
+    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
+    normalized = embeddings / np.maximum(norms, 1e-10)
+
+    # Setup clustering
+    if n_clusters is not None:
+        clustering = AgglomerativeClustering(
+            n_clusters=n_clusters,
+            metric="cosine",
+            linkage="complete",  # Conservative: all pairs must be similar
+        )
+    else:
+        clustering = AgglomerativeClustering(
+            n_clusters=None,
+            distance_threshold=distance_threshold,
+            metric="cosine",
+            linkage="complete",
+        )
+
+    labels = clustering.fit_predict(normalized)
+    return labels


The cluster_faces function is called with a linkage parameter (line 264 in face_clusters.py), but hierarchical clustering doesn't accept this parameter - it's hardcoded to "complete". This parameter will be silently ignored due to **kwargs. Either add linkage as a parameter to cluster_faces_hierarchical or remove it from the calling code.
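If linkage were exposed as a parameter, the keyword arguments for scikit-learn's `AgglomerativeClustering` could be assembled and validated in one place. The builder below is a hypothetical sketch that only constructs the argument dictionary, so it runs without scikit-learn installed. It excludes "ward" because that linkage only supports the Euclidean metric, and it keeps exactly one of `n_clusters` and `distance_threshold` set, as the estimator requires.

```python
COSINE_COMPATIBLE_LINKAGES = ("complete", "average", "single")


def agglomerative_params(n_clusters=None, distance_threshold=0.5, linkage="complete"):
    """Build keyword arguments for AgglomerativeClustering with a cosine metric."""
    if linkage not in COSINE_COMPATIBLE_LINKAGES:
        raise ValueError(
            f"linkage={linkage!r} is not usable with metric='cosine'; "
            f"choose one of {COSINE_COMPATIBLE_LINKAGES}"
        )
    params = {"metric": "cosine", "linkage": linkage}
    if n_clusters is not None:
        # A fixed cluster count and a distance threshold are mutually exclusive.
        params.update(n_clusters=n_clusters, distance_threshold=None)
    else:
        params.update(n_clusters=None, distance_threshold=distance_threshold)
    return params


default_params = agglomerative_params()
average_params = agglomerative_params(n_clusters=5, linkage="average")
print(default_params, average_params)
```

A caller would then write `AgglomerativeClustering(**agglomerative_params(...))`, which keeps "complete" as the conservative default while making any other choice explicit.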

feat: Improve face clustering algorithm - prevent bridge-point chaini…

50bb3b1

…ng and same-person splits

Copilot AI review requested due to automatic review settings December 24, 2025 06:33

Copilot started reviewing on behalf of SiddharthJiyani December 24, 2025 06:33

SiddharthJiyani closed this Dec 24, 2025

SiddharthJiyani deleted the feature/face-clustering-improvements branch December 24, 2025 06:33

Copilot AI reviewed Dec 24, 2025
