Skip to content

Conversation

@SiddharthJiyani
Copy link
Contributor

@SiddharthJiyani SiddharthJiyani commented Dec 24, 2025

Problem

Face clustering was merging different people into the same cluster due to bridge-point chaining, where a single intermediate face could connect two distinct people's clusters.

Solution

Implemented a two-phase clustering approach:

  1. DBSCAN with min_samples=2 - Prevents single faces from acting as bridges between clusters
  2. Post-cluster mean merge - Merges clusters with similar mean embeddings (threshold: 0.28) to fix same-person splits caused by pose/lighting variation

Changes

  • face_clusters.py: Added post-merge logic and _calculate_cosine_distance() helper
  • clustering_conservative.py: Code cleanup (removed unused functions)
  • clustering_advanced.py: Code cleanup (removed unused functions)
  • test_clustering_algorithm.py: Added 16 comprehensive unit tests

Configuration

CLUSTERING_ALGORITHM = "dbscan"
CLUSTERING_MIN_SAMPLES = 2  # Prevents bridge chaining
CLUSTERING_FIXED_EPSILON = 0.35
POST_MERGE_ENABLED = True
POST_MERGE_MEAN_DISTANCE_THRESHOLD = 0.28

Benefits

  • Reduces mis-clustered faces by filtering poor quality detections
  • Automatically adapts to different photo collections
  • More robust cluster representatives

Demo

Before

face_clustering_before.mp4

After

  • To see the clustering results directly, please jump to 00:40.
face_clustering_after.mp4

Summary by CodeRabbit

Release Notes

  • New Features
    • Face detection now evaluates image quality based on sharpness, brightness, and size metrics
    • Improved face clustering with advanced algorithms and quality-based filtering to exclude low-quality detections
    • Added automatic detection and consolidation of near-duplicate face groups post-clustering
    • Face quality scores are now tracked and retrievable with each stored face record

✏️ Tip: You can customize this high-level summary in your review settings.

Copilot AI review requested due to automatic review settings December 24, 2025 06:33
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 24, 2025

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

This PR introduces face quality assessment and integrates it into the face clustering pipeline. New modules provide quality computation based on sharpness, brightness, and size metrics. Advanced clustering strategies (conservative, DBSCAN, hierarchical) are implemented with quality-based filtering to exclude low-quality faces before clustering. Quality scores are persisted in the database and propagated through detection and storage layers.

Changes

Cohort / File(s) Summary
Database Schema and Retrieval
backend/app/database/faces.py
Added quality field to FaceData TypedDict and faces table schema with default 0.5. Updated db_insert_face_embeddings* functions to accept and store quality parameter. Modified retrieval functions (db_get_faces_unassigned_clusters, db_get_all_faces_with_cluster_names, db_get_cluster_mean_embeddings) to return quality scores with COALESCE defaults.
Face Quality Assessment
backend/app/utils/face_quality.py
NEW: Multi-criteria face quality assessment module. Provides assess_face_sharpness, assess_face_brightness, assess_face_size, and calculate_face_quality (weighted composite score). Includes utility filters: should_include_face, filter_quality_faces.
Conservative Clustering Implementation
backend/app/utils/clustering_conservative.py
NEW: Implements conservative face clustering with validation and safe merging. Core components: compute_pairwise_distances, select_conservative_epsilon, validate_cluster, split_loose_cluster, and ConservativeFaceClustering class with fit_predict, cluster validation/splitting, and merge logic.
Advanced Clustering Strategies
backend/app/utils/clustering_advanced.py
NEW: Multi-algorithm clustering interface routing to conservative, DBSCAN, or hierarchical methods. Provides cluster_faces (central API), cluster_faces_dbscan, cluster_faces_hierarchical, calculate_adaptive_eps, get_cluster_stats, calculate_cluster_mean.
Face Detection Integration
backend/app/models/FaceDetector.py
Integrated quality assessment into face detection pipeline. Added import for calculate_face_quality, accumulates per-face quality scores during extraction, and logs detailed metrics (quality, sharpness, brightness, size). Passes quality=qualities to db_insert_face_embeddings_by_image_id along with daily quality statistics.
Clustering System Integration
backend/app/utils/face_clusters.py
Replaced direct DBSCAN with advanced cluster_faces workflow supporting multiple algorithms. Added quality filtering configuration (CLUSTERING_QUALITY_FILTER_ENABLED, CLUSTERING_QUALITY_MIN_THRESHOLD) and post-merge step (POST_MERGE_ENABLED, POST_MERGE_MEAN_DISTANCE_THRESHOLD). Updated cluster_util_cluster_all_face_embeddings and cluster_util_assign_cluster_to_faces_without_clusterId signatures; added clustering algorithm selection and metrics logging.
Dependencies and Configuration
backend/requirements.txt, backend/test.py
Added kneed>=0.8.5 dependency. Updated DBSCAN parameter from eps=0.3 to eps=0.35 with alignment comment.
Test Coverage
backend/tests/test_clustering_algorithm.py, backend/tests/test_face_quality.py
NEW: Comprehensive unit tests for clustering (cosine distance, configuration validation, no-faces/single-face/general clustering, noise handling, post-merge behavior, quality filtering, cluster naming) and face quality assessment (sharpness, brightness, size, weighted composite, threshold filtering).

Sequence Diagram

sequenceDiagram
    participant User
    participant FaceDetector
    participant QualityModule as Face Quality<br/>Module
    participant ClusteringModule as Clustering<br/>Module
    participant Database
    
    User->>FaceDetector: detect_faces(image)
    activate FaceDetector
    FaceDetector->>FaceDetector: extract face regions
    
    loop for each face
        FaceDetector->>QualityModule: calculate_face_quality(face_image)
        activate QualityModule
        QualityModule->>QualityModule: assess_sharpness()
        QualityModule->>QualityModule: assess_brightness()
        QualityModule->>QualityModule: assess_size()
        QualityModule->>QualityModule: weighted composite score
        QualityModule-->>FaceDetector: quality_score
        deactivate QualityModule
        
        FaceDetector->>FaceDetector: accumulate embeddings + quality
    end
    
    FaceDetector->>Database: db_insert_face_embeddings_by_image_id<br/>(embeddings, quality=qualities)
    activate Database
    Database->>Database: store faces + quality scores
    Database-->>FaceDetector: inserted face IDs
    deactivate Database
    
    User->>ClusteringModule: trigger clustering
    activate ClusteringModule
    ClusteringModule->>Database: get_all_face_embeddings()
    activate Database
    Database-->>ClusteringModule: embeddings + quality scores
    deactivate Database
    
    ClusteringModule->>ClusteringModule: filter by quality_threshold
    rect rgb(200, 220, 240)
        Note over ClusteringModule: Quality Filter: exclude low-quality faces
    end
    
    ClusteringModule->>ClusteringModule: cluster_faces(filtered_embeddings<br/>algorithm="conservative"|"dbscan")
    rect rgb(240, 220, 200)
        alt Conservative Algorithm
            ClusteringModule->>ClusteringModule: normalize embeddings
            ClusteringModule->>ClusteringModule: select_conservative_epsilon()
            ClusteringModule->>ClusteringModule: run DBSCAN + validation
            ClusteringModule->>ClusteringModule: split loose clusters
            ClusteringModule->>ClusteringModule: safe_merge nearby clusters
        else DBSCAN/Hierarchical
            ClusteringModule->>ClusteringModule: adaptive epsilon or fixed params
            ClusteringModule->>ClusteringModule: apply selected algorithm
        end
        Note over ClusteringModule: Clustering Algorithms
    end
    
    ClusteringModule->>ClusteringModule: post-merge if enabled
    
    ClusteringModule->>Database: update cluster assignments
    activate Database
    Database->>Database: store cluster_id for each face
    Database-->>ClusteringModule: success
    deactivate Database
    
    ClusteringModule-->>User: clustering complete
    deactivate ClusteringModule
    deactivate FaceDetector
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

backend, enhancement

Suggested reviewers

  • rahulharpal1603

Poem

🐰 Whiskers twitching with delight
Quality faces shine so bright!
Clustering now sees sharp from blurred,
Each person's photos finally heard.
Conservative wisdom guides the way,
Better groupings here to stay! 🎨✨

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and 50bb3b1.

📒 Files selected for processing (10)
  • backend/app/database/faces.py
  • backend/app/models/FaceDetector.py
  • backend/app/utils/clustering_advanced.py
  • backend/app/utils/clustering_conservative.py
  • backend/app/utils/face_clusters.py
  • backend/app/utils/face_quality.py
  • backend/requirements.txt
  • backend/test.py
  • backend/tests/test_clustering_algorithm.py
  • backend/tests/test_face_quality.py

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@SiddharthJiyani SiddharthJiyani deleted the feature/face-clustering-improvements branch December 24, 2025 06:33
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the face clustering algorithm to fix bridge-point chaining issues where single intermediate faces incorrectly merge distinct people into the same cluster. The solution implements a two-phase approach: DBSCAN clustering with min_samples=2 to prevent single-face bridges, followed by a post-cluster mean merge step (threshold: 0.28) to rejoin same-person clusters split by pose/lighting variations.

Key changes:

  • Introduced comprehensive face quality assessment based on sharpness, brightness, and size metrics
  • Implemented conservative and advanced clustering algorithms with configurable parameters
  • Added extensive unit test coverage (261 lines for quality tests, 368 lines for clustering tests)

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
backend/app/utils/face_quality.py New module for assessing face quality based on sharpness, brightness, and size
backend/app/utils/clustering_conservative.py New conservative clustering implementation with validation and safe merging
backend/app/utils/clustering_advanced.py New advanced clustering entry point supporting multiple algorithms
backend/app/utils/face_clusters.py Updated main clustering logic with post-merge functionality and quality filtering
backend/app/models/FaceDetector.py Integrated quality assessment during face detection
backend/app/database/faces.py Added quality column to faces table with default value 0.5
backend/tests/test_face_quality.py Comprehensive unit tests for face quality assessment functions
backend/tests/test_clustering_algorithm.py Comprehensive unit tests for clustering algorithm and post-merge logic
backend/test.py Updated test script to match production clustering parameters
backend/requirements.txt Added kneed>=0.8.5 dependency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +299 to +308
mean_i = np.mean([f["embedding"] for f in faces_i], axis=0)
merged[label_i] = list(faces_i) # Copy the list

for j in range(i + 1, len(cluster_items)):
label_j, faces_j = cluster_items[j]
if label_j in used:
continue

mean_j = np.mean([f["embedding"] for f in faces_j], axis=0)
dist = _calculate_cosine_distance(mean_i, mean_j)
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mean embeddings used in the post-merge logic are not normalized before calculating cosine distance. This could lead to incorrect distance calculations. Consider normalizing mean_i and mean_j before passing them to _calculate_cosine_distance, or normalize within the distance calculation function itself.

Copilot uses AI. Check for mistakes.
Comment on lines +448 to +451
norm_a = embedding_a / np.linalg.norm(embedding_a)
norm_b = embedding_b / np.linalg.norm(embedding_b)
similarity = np.dot(norm_a, norm_b)
return 1 - similarity
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _calculate_cosine_distance function doesn't handle the case where either embedding has zero norm, which would cause a division by zero error. Consider adding a check or using np.maximum(np.linalg.norm(embedding_a), 1e-10) similar to how normalization is handled elsewhere in the codebase.

Copilot uses AI. Check for mistakes.
embeddings TEXT,
confidence REAL,
bbox TEXT,
quality REAL DEFAULT 0.5,
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The database schema adds a new quality column, but there's no migration script to handle existing databases. Existing rows will use the DEFAULT 0.5 value, but this doesn't address the case where the table already exists without the quality column. Consider adding a migration script or ALTER TABLE logic to handle schema updates for existing installations.

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +176
def cluster_faces(
embeddings: NDArray,
algorithm: str = "conservative",
eps: float = 0.25,
min_samples: int = 2,
max_cluster_diameter: float = 0.60,
auto_eps: bool = True,
distance_threshold: float = 0.5,
n_clusters: Optional[int] = None,
merge_close_clusters: bool = True,
merge_threshold: float = 0.40,
**kwargs,
) -> NDArray:
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cluster_faces function accepts a fixed_eps parameter (used in face_clusters.py line 262), but this parameter is not defined in the function signature. The function uses eps instead. This will cause a TypeError when called with fixed_eps=. Either rename the parameter to fixed_eps or update the caller to use eps=.

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +176
def cluster_faces(
embeddings: NDArray,
algorithm: str = "conservative",
eps: float = 0.25,
min_samples: int = 2,
max_cluster_diameter: float = 0.60,
auto_eps: bool = True,
distance_threshold: float = 0.5,
n_clusters: Optional[int] = None,
merge_close_clusters: bool = True,
merge_threshold: float = 0.40,
**kwargs,
) -> NDArray:
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cluster_faces function accepts a density_refinement parameter (used in face_clusters.py line 265), but this parameter is not defined in the function signature and is not handled in any of the algorithm branches. This will be silently ignored due to **kwargs, but it should either be added as an explicit parameter or removed from the calling code.

Copilot uses AI. Check for mistakes.
Comment on lines +117 to +161
def cluster_faces_hierarchical(
embeddings: NDArray,
n_clusters: Optional[int] = None,
distance_threshold: float = 0.5,
) -> NDArray:
"""
Cluster face embeddings using hierarchical clustering.

Uses complete linkage which ensures all pairs in a cluster
are within the distance threshold (conservative).

Args:
embeddings: Face embeddings (n_faces, embedding_dim)
n_clusters: Number of clusters (mutually exclusive with distance_threshold)
distance_threshold: Max distance within cluster (if n_clusters is None)

Returns:
Cluster labels
"""
n_samples = len(embeddings)

if n_samples < 2:
return np.zeros(n_samples, dtype=int)

# Normalize
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / np.maximum(norms, 1e-10)

# Setup clustering
if n_clusters is not None:
clustering = AgglomerativeClustering(
n_clusters=n_clusters,
metric="cosine",
linkage="complete", # Conservative: all pairs must be similar
)
else:
clustering = AgglomerativeClustering(
n_clusters=None,
distance_threshold=distance_threshold,
metric="cosine",
linkage="complete",
)

labels = clustering.fit_predict(normalized)
return labels
Copy link

Copilot AI Dec 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cluster_faces function is called with a linkage parameter (line 264 in face_clusters.py), but hierarchical clustering doesn't accept this parameter - it's hardcoded to "complete". This parameter will be silently ignored due to **kwargs. Either add linkage as a parameter to cluster_faces_hierarchical or remove it from the calling code.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Face Clustering Quality Issues

1 participant