-
Notifications
You must be signed in to change notification settings - Fork 566
feat: Improve face clustering algorithm #833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Improve face clustering algorithm #833
Conversation
…ng and same-person splits
|
Caution Review failedThe pull request is closed. 📝 WalkthroughWalkthroughThis PR introduces face quality assessment and integrates it into the face clustering pipeline. New modules provide quality computation based on sharpness, brightness, and size metrics. Advanced clustering strategies (conservative, DBSCAN, hierarchical) are implemented with quality-based filtering to exclude low-quality faces before clustering. Quality scores are persisted in the database and propagated through detection and storage layers. Changes
Sequence DiagramsequenceDiagram
participant User
participant FaceDetector
participant QualityModule as Face Quality<br/>Module
participant ClusteringModule as Clustering<br/>Module
participant Database
User->>FaceDetector: detect_faces(image)
activate FaceDetector
FaceDetector->>FaceDetector: extract face regions
loop for each face
FaceDetector->>QualityModule: calculate_face_quality(face_image)
activate QualityModule
QualityModule->>QualityModule: assess_sharpness()
QualityModule->>QualityModule: assess_brightness()
QualityModule->>QualityModule: assess_size()
QualityModule->>QualityModule: weighted composite score
QualityModule-->>FaceDetector: quality_score
deactivate QualityModule
FaceDetector->>FaceDetector: accumulate embeddings + quality
end
FaceDetector->>Database: db_insert_face_embeddings_by_image_id<br/>(embeddings, quality=qualities)
activate Database
Database->>Database: store faces + quality scores
Database-->>FaceDetector: inserted face IDs
deactivate Database
User->>ClusteringModule: trigger clustering
activate ClusteringModule
ClusteringModule->>Database: get_all_face_embeddings()
activate Database
Database-->>ClusteringModule: embeddings + quality scores
deactivate Database
ClusteringModule->>ClusteringModule: filter by quality_threshold
rect rgb(200, 220, 240)
Note over ClusteringModule: Quality Filter: exclude low-quality faces
end
ClusteringModule->>ClusteringModule: cluster_faces(filtered_embeddings<br/>algorithm="conservative"|"dbscan")
rect rgb(240, 220, 200)
alt Conservative Algorithm
ClusteringModule->>ClusteringModule: normalize embeddings
ClusteringModule->>ClusteringModule: select_conservative_epsilon()
ClusteringModule->>ClusteringModule: run DBSCAN + validation
ClusteringModule->>ClusteringModule: split loose clusters
ClusteringModule->>ClusteringModule: safe_merge nearby clusters
else DBSCAN/Hierarchical
ClusteringModule->>ClusteringModule: adaptive epsilon or fixed params
ClusteringModule->>ClusteringModule: apply selected algorithm
end
Note over ClusteringModule: Clustering Algorithms
end
ClusteringModule->>ClusteringModule: post-merge if enabled
ClusteringModule->>Database: update cluster assignments
activate Database
Database->>Database: store cluster_id for each face
Database-->>ClusteringModule: success
deactivate Database
ClusteringModule-->>User: clustering complete
deactivate ClusteringModule
deactivate FaceDetector
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (10)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR improves the face clustering algorithm to fix bridge-point chaining issues where single intermediate faces incorrectly merge distinct people into the same cluster. The solution implements a two-phase approach: DBSCAN clustering with min_samples=2 to prevent single-face bridges, followed by a post-cluster mean merge step (threshold: 0.28) to rejoin same-person clusters split by pose/lighting variations.
Key changes:
- Introduced comprehensive face quality assessment based on sharpness, brightness, and size metrics
- Implemented conservative and advanced clustering algorithms with configurable parameters
- Added extensive unit test coverage (261 lines for quality tests, 368 lines for clustering tests)
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
backend/app/utils/face_quality.py |
New module for assessing face quality based on sharpness, brightness, and size |
backend/app/utils/clustering_conservative.py |
New conservative clustering implementation with validation and safe merging |
backend/app/utils/clustering_advanced.py |
New advanced clustering entry point supporting multiple algorithms |
backend/app/utils/face_clusters.py |
Updated main clustering logic with post-merge functionality and quality filtering |
backend/app/models/FaceDetector.py |
Integrated quality assessment during face detection |
backend/app/database/faces.py |
Added quality column to faces table with default value 0.5 |
backend/tests/test_face_quality.py |
Comprehensive unit tests for face quality assessment functions |
backend/tests/test_clustering_algorithm.py |
Comprehensive unit tests for clustering algorithm and post-merge logic |
backend/test.py |
Updated test script to match production clustering parameters |
backend/requirements.txt |
Added kneed>=0.8.5 dependency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| mean_i = np.mean([f["embedding"] for f in faces_i], axis=0) | ||
| merged[label_i] = list(faces_i) # Copy the list | ||
|
|
||
| for j in range(i + 1, len(cluster_items)): | ||
| label_j, faces_j = cluster_items[j] | ||
| if label_j in used: | ||
| continue | ||
|
|
||
| mean_j = np.mean([f["embedding"] for f in faces_j], axis=0) | ||
| dist = _calculate_cosine_distance(mean_i, mean_j) |
Copilot
AI
Dec 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mean embeddings used in the post-merge logic are not normalized before calculating cosine distance. This could lead to incorrect distance calculations. Consider normalizing mean_i and mean_j before passing them to _calculate_cosine_distance, or normalize within the distance calculation function itself.
| norm_a = embedding_a / np.linalg.norm(embedding_a) | ||
| norm_b = embedding_b / np.linalg.norm(embedding_b) | ||
| similarity = np.dot(norm_a, norm_b) | ||
| return 1 - similarity |
Copilot
AI
Dec 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _calculate_cosine_distance function doesn't handle the case where either embedding has zero norm, which would cause a division by zero error. Consider adding a check or using np.maximum(np.linalg.norm(embedding_a), 1e-10) similar to how normalization is handled elsewhere in the codebase.
| embeddings TEXT, | ||
| confidence REAL, | ||
| bbox TEXT, | ||
| quality REAL DEFAULT 0.5, |
Copilot
AI
Dec 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The database schema adds a new quality column, but there's no migration script to handle existing databases. Existing rows will use the DEFAULT 0.5 value, but this doesn't address the case where the table already exists without the quality column. Consider adding a migration script or ALTER TABLE logic to handle schema updates for existing installations.
| def cluster_faces( | ||
| embeddings: NDArray, | ||
| algorithm: str = "conservative", | ||
| eps: float = 0.25, | ||
| min_samples: int = 2, | ||
| max_cluster_diameter: float = 0.60, | ||
| auto_eps: bool = True, | ||
| distance_threshold: float = 0.5, | ||
| n_clusters: Optional[int] = None, | ||
| merge_close_clusters: bool = True, | ||
| merge_threshold: float = 0.40, | ||
| **kwargs, | ||
| ) -> NDArray: |
Copilot
AI
Dec 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cluster_faces function accepts a fixed_eps parameter (used in face_clusters.py line 262), but this parameter is not defined in the function signature. The function uses eps instead. This will cause a TypeError when called with fixed_eps=. Either rename the parameter to fixed_eps or update the caller to use eps=.
| def cluster_faces( | ||
| embeddings: NDArray, | ||
| algorithm: str = "conservative", | ||
| eps: float = 0.25, | ||
| min_samples: int = 2, | ||
| max_cluster_diameter: float = 0.60, | ||
| auto_eps: bool = True, | ||
| distance_threshold: float = 0.5, | ||
| n_clusters: Optional[int] = None, | ||
| merge_close_clusters: bool = True, | ||
| merge_threshold: float = 0.40, | ||
| **kwargs, | ||
| ) -> NDArray: |
Copilot
AI
Dec 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cluster_faces function accepts a density_refinement parameter (used in face_clusters.py line 265), but this parameter is not defined in the function signature and is not handled in any of the algorithm branches. This will be silently ignored due to **kwargs, but it should either be added as an explicit parameter or removed from the calling code.
| def cluster_faces_hierarchical( | ||
| embeddings: NDArray, | ||
| n_clusters: Optional[int] = None, | ||
| distance_threshold: float = 0.5, | ||
| ) -> NDArray: | ||
| """ | ||
| Cluster face embeddings using hierarchical clustering. | ||
|
|
||
| Uses complete linkage which ensures all pairs in a cluster | ||
| are within the distance threshold (conservative). | ||
|
|
||
| Args: | ||
| embeddings: Face embeddings (n_faces, embedding_dim) | ||
| n_clusters: Number of clusters (mutually exclusive with distance_threshold) | ||
| distance_threshold: Max distance within cluster (if n_clusters is None) | ||
|
|
||
| Returns: | ||
| Cluster labels | ||
| """ | ||
| n_samples = len(embeddings) | ||
|
|
||
| if n_samples < 2: | ||
| return np.zeros(n_samples, dtype=int) | ||
|
|
||
| # Normalize | ||
| norms = np.linalg.norm(embeddings, axis=1, keepdims=True) | ||
| normalized = embeddings / np.maximum(norms, 1e-10) | ||
|
|
||
| # Setup clustering | ||
| if n_clusters is not None: | ||
| clustering = AgglomerativeClustering( | ||
| n_clusters=n_clusters, | ||
| metric="cosine", | ||
| linkage="complete", # Conservative: all pairs must be similar | ||
| ) | ||
| else: | ||
| clustering = AgglomerativeClustering( | ||
| n_clusters=None, | ||
| distance_threshold=distance_threshold, | ||
| metric="cosine", | ||
| linkage="complete", | ||
| ) | ||
|
|
||
| labels = clustering.fit_predict(normalized) | ||
| return labels |
Copilot
AI
Dec 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cluster_faces function is called with a linkage parameter (line 264 in face_clusters.py), but hierarchical clustering doesn't accept this parameter - it's hardcoded to "complete". This parameter will be silently ignored due to **kwargs. Either add linkage as a parameter to cluster_faces_hierarchical or remove it from the calling code.
Problem
Face clustering was merging different people into the same cluster due to bridge-point chaining, where a single intermediate face could connect two distinct people's clusters.
Solution
Implemented a two-phase clustering approach:
min_samples=2- Prevents single faces from acting as bridges between clustersChanges
face_clusters.py: Added post-merge logic and_calculate_cosine_distance()helperclustering_conservative.py: Code cleanup (removed unused functions)clustering_advanced.py: Code cleanup (removed unused functions)test_clustering_algorithm.py: Added 16 comprehensive unit testsConfiguration
Benefits
Demo
Before
face_clustering_before.mp4
After
face_clustering_after.mp4
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.