fix: Improve face clustering with quality assessment and adaptive parameters #794

Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

A face-quality metric was added end-to-end: computed during detection, persisted in the faces DB schema, returned by retrieval queries, and used by upgraded clustering routines (adaptive DBSCAN/hierarchical and quality-weighted medoid/means).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Image as Image Input
    participant Detector as FaceDetector
    participant Quality as face_quality
    participant DB as Database
    participant ClustMgr as face_clusters
    participant Advanced as clustering_advanced
    Image->>Detector: detect_faces(image)
    Detector->>Detector: extract face patches, compute embeddings
    loop per face
        Detector->>Quality: calculate_face_quality(face_img)
        Quality-->>Detector: quality_score
    end
    Detector->>DB: db_insert_face_embeddings_by_image_id(embeddings, quality=qualities)
    DB-->>Detector: face IDs
    ClustMgr->>DB: db_get_all_faces_with_cluster_names()
    DB-->>ClustMgr: faces + embeddings + quality
    ClustMgr->>ClustMgr: filter_quality_faces(min_threshold)
    ClustMgr->>Advanced: cluster_faces(embeddings, quality_scores)
    Advanced->>Advanced: auto_select_epsilon / cluster_dbscan_adaptive / cluster_hierarchical
    Advanced->>Advanced: calculate_quality_weighted_medoid / calculate_cluster_mean
    Advanced-->>ClustMgr: labels + representatives
    ClustMgr->>DB: update cluster assignments
    DB-->>ClustMgr: ack
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/database/faces.py (1)

36-50: No migration script has been integrated into the startup flow.

The codebase uses `CREATE TABLE IF NOT EXISTS` in the FastAPI lifespan context manager (`backend/main.py`, lines 47-54) as the sole database initialization mechanism. While this works for new installations, existing databases with the `faces` table already created will not receive the new `quality` column. The SQL statements in `faces.py` directly access this column (INSERT on line 88, SELECT operations on lines 242, 276, 363), which will fail with "table faces has no column named quality" errors for existing installations. The `migrate_add_quality.py` file mentioned in the PR description does not exist in the repository, nor is any migration logic referenced anywhere in the codebase. To resolve this, either implement a proper migration that executes `ALTER TABLE faces ADD COLUMN quality REAL DEFAULT 0.5` on startup for existing databases, or establish a database versioning system.
🧹 Nitpick comments (9)
backend/test.py (1)

43-45: Consider the implications of `min_samples=1`.

Setting `min_samples=1` makes every point trivially a core point (the point itself counts toward `min_samples` in scikit-learn's DBSCAN), so no point is ever labelled noise and isolated faces form singleton clusters. This significantly changes clustering behavior compared to the production code in `face_clusters.py`, which uses `CLUSTERING_MIN_SAMPLES = 2`. If this test file is meant to validate production behavior, consider aligning these parameters with the actual production configuration, or clarify in comments that this is intentional for a specific test scenario.
backend/tests/test_clustering_improvements.py (2)

91-101: Strengthen the cluster count assertion for separated data.

The test creates two well-separated clusters but only asserts `len(unique_labels) >= 1`. With data offset by ±2 and variance 0.1, these clusters should be clearly distinguishable.

🔎 Suggested fix:

```diff
-    unique_labels = set(labels) - {-1}  # Exclude noise
-    assert len(unique_labels) >= 1  # At least some clusters
+    unique_labels = set(labels) - {-1}  # Exclude noise
+    assert len(unique_labels) >= 2, f"Expected 2 distinct clusters for separated data, got {len(unique_labels)}"
```
155-163: Consider numpy integer type compatibility.

The assertion `isinstance(idx, int)` may fail if the implementation returns `np.int64` or another NumPy integer type. NumPy integer types are not subclasses of Python's `int` in recent Python versions.

🔎 Suggested fix:

```diff
-    assert 0 <= idx < 10
-    assert isinstance(idx, int)
+    assert 0 <= idx < 10
+    assert isinstance(idx, (int, np.integer))
```

backend/app/utils/face_quality.py (2)
15-41: Consider handling edge case of empty or invalid images.

The sharpness function will raise an error if `face_image` is empty or has zero dimensions. Consider adding a guard clause for robustness, especially since face crops from edge regions might occasionally produce degenerate results.

🔎 Suggested defensive check:

```diff
 def assess_face_sharpness(face_image: NDArray) -> float:
+    # Guard against empty images
+    if face_image.size == 0:
+        return 0.0
+
     # Convert to grayscale if needed
     if len(face_image.shape) == 3:
```
187-205: Consider removing or guarding the `__main__` block for production.

The CLI testing block is useful during development but may be unnecessary in production. If this module is only imported (never run directly), consider removing it or moving the test code to the test suite.
backend/app/database/faces.py (1)

140-145: Use keyword arguments for clarity in function call.

The call to `db_insert_face_embeddings` uses positional arguments, which is fragile and harder to read. Using keyword arguments would improve maintainability.

🔎 Suggested fix:

```diff
-    face_id = db_insert_face_embeddings(image_id, emb, conf, bb, cid, qual)
+    face_id = db_insert_face_embeddings(
+        image_id=image_id,
+        embeddings=emb,
+        confidence=conf,
+        bbox=bb,
+        cluster_id=cid,
+        quality=qual,
+    )
```

backend/app/utils/face_clusters.py (2)
35-44: Consider externalizing clustering configuration.

The clustering parameters are hardcoded as module-level constants. For production flexibility, consider moving these to the application settings/config module to allow runtime configuration or environment-based overrides.

165-177: Function parameters may conflict with configuration constants.

The function accepts `eps` and `min_samples` parameters but uses `CLUSTERING_AUTO_EPSILON` and `CLUSTERING_FIXED_EPSILON` internally. The `eps` parameter is only used when `CLUSTERING_AUTO_EPSILON` is False. Consider documenting this behavior or removing unused parameters to avoid confusion.

🔎 Clarify parameter usage in docstring:

```diff
 def cluster_util_cluster_all_face_embeddings(
     eps: float = 0.3, min_samples: int = 2
 ) -> List[ClusterResult]:
     """
     Cluster face embeddings using advanced clustering algorithms with quality filtering.

     Args:
-        eps: DBSCAN epsilon parameter (used only if auto_epsilon=False)
+        eps: DBSCAN epsilon parameter (used only if CLUSTERING_AUTO_EPSILON=False,
+             otherwise ignored in favor of adaptive selection)
         min_samples: Minimum samples parameter for core points
```

backend/app/utils/clustering_advanced.py (1)
253-288: Consider exposing the k_neighbors parameter.

The `cluster_faces` function doesn't expose the `k_neighbors` parameter for epsilon auto-selection, defaulting to 5. Consider adding this as an optional parameter for users who want to tune the adaptive epsilon behavior.

🔎 Suggested enhancement:

```diff
 def cluster_faces(
     embeddings: NDArray,
     algorithm: Literal["dbscan", "hierarchical"] = "dbscan",
     min_samples: int = 2,
     auto_eps: bool = True,
     fixed_eps: float = 0.3,
+    k_neighbors: int = 5,
     distance_threshold: float = 0.3,
     linkage: str = "average",
 ) -> NDArray:
     """
     Main clustering function with algorithm selection.

     Args:
         embeddings: Array of face embeddings
         algorithm: Clustering algorithm choice ('dbscan' or 'hierarchical')
         min_samples: Minimum samples for DBSCAN (default: 2)
         auto_eps: Auto-select epsilon for DBSCAN (default: True)
         fixed_eps: Fixed epsilon if auto_eps=False (default: 0.3)
+        k_neighbors: K for epsilon auto-selection (default: 5)
         distance_threshold: Distance threshold for hierarchical (default: 0.3)
         linkage: Linkage for hierarchical ('average', 'complete', 'single')

     Returns:
         Array of cluster labels
     """
     if algorithm == "dbscan":
         return cluster_dbscan_adaptive(
-            embeddings, min_samples=min_samples, auto_eps=auto_eps, fixed_eps=fixed_eps
+            embeddings, min_samples=min_samples, auto_eps=auto_eps,
+            fixed_eps=fixed_eps, k_neighbors=k_neighbors
         )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- backend/app/database/faces.py (13 hunks)
- backend/app/models/FaceDetector.py (4 hunks)
- backend/app/utils/clustering_advanced.py (1 hunk)
- backend/app/utils/face_clusters.py (6 hunks)
- backend/app/utils/face_quality.py (1 hunk)
- backend/requirements.txt (1 hunk)
- backend/test.py (1 hunk)
- backend/tests/test_clustering_improvements.py (1 hunk)
- backend/tests/test_face_quality.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/tests/test_face_quality.py (1)
- backend/app/utils/face_quality.py (6): assess_face_sharpness (15-41), assess_face_brightness (44-73), assess_face_size (76-102), calculate_face_quality (105-156), should_include_face (159-170), filter_quality_faces (173-184)

backend/app/utils/face_clusters.py (5)
- backend/app/utils/clustering_advanced.py (2): cluster_faces (253-288), calculate_cluster_mean (207-250)
- backend/app/utils/face_quality.py (1): filter_quality_faces (173-184)
- backend/app/logging/setup_logging.py (1): get_logger (199-209)
- sync-microservice/app/logging/setup_logging.py (1): get_logger (192-202)
- backend/app/database/faces.py (3): db_get_all_faces_with_cluster_names (261-301), db_get_faces_unassigned_clusters (230-258), db_get_cluster_mean_embeddings (349-406)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Backend Tests
🔇 Additional comments (12)
backend/requirements.txt (1)
74-74: LGTM!

The `kneed` dependency is appropriately added with a minimum version constraint for the knee-detection functionality used in adaptive epsilon selection for DBSCAN clustering.

backend/tests/test_clustering_improvements.py (1)

1-20: Well-structured test module.

Good coverage of the advanced clustering functionality with tests for all major components: epsilon selection, DBSCAN, hierarchical clustering, medoid calculation, and integration workflows. The test organization by class is clean and maintainable.
backend/tests/test_face_quality.py (2)
25-35: Test creates sparse checkerboard, not full checkerboard.

The pattern `img[::2, ::2] = 255` sets pixels at (0,0), (0,2), (2,0), (2,2), and so on, i.e. only 25% of pixels. This still produces sharp edges for Laplacian detection, but a full checkerboard (`img[::2, ::2] = 255` and `img[1::2, 1::2] = 255`) would be more representative.

The test passes regardless, since sparse patterns still have high-frequency content, so this is a minor observation.
1-265: Comprehensive test coverage for face quality module.

Excellent test coverage including:
- Boundary conditions (exact threshold, empty lists)
- Edge cases (missing quality keys, grayscale images)
- Custom parameters (weights, thresholds)
- Error conditions (invalid weight sums)
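The "invalid weight sums" error condition above is worth validating with a floating-point tolerance. A hypothetical sketch of the weighted-combination pattern (the 0.4 sharpness weight is mentioned in this review; the 0.3/0.3 split for brightness and size, and the function name, are assumptions, not the module's actual API):

```python
import numpy as np

def combine_quality(sharpness: float, brightness: float, size: float,
                    weights: tuple = (0.4, 0.3, 0.3)) -> float:
    """Combine per-metric scores (each in [0, 1]) into one quality score.

    Illustrative re-implementation of the weighted-sum pattern described in
    the review; not the project's actual calculate_face_quality.
    """
    # Validate with a floating-point tolerance rather than `sum(weights) == 1.0`,
    # which can spuriously fail due to binary rounding (e.g. 0.4 + 0.3 + 0.3).
    if not np.isclose(sum(weights), 1.0):
        raise ValueError(f"Quality weights must sum to 1.0, got {sum(weights)}")
    score = weights[0] * sharpness + weights[1] * brightness + weights[2] * size
    # Clip to guard against accumulated rounding pushing the score past 1.0
    return float(np.clip(score, 0.0, 1.0))

print(combine_quality(1.0, 1.0, 1.0))   # 1.0 (after clipping)
```

An exact `== 1.0` check on the weights would reject the default `(0.4, 0.3, 0.3)` on most platforms, which is precisely the pitfall `np.isclose` avoids.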
backend/app/models/FaceDetector.py (1)

53-72: Quality assessment integration looks correct.

The quality assessment is properly computed on the raw cropped face image before the FaceNet preprocessing step, which ensures the quality metrics reflect the actual input quality rather than the normalized/preprocessed version. Good use of debug-level logging for per-face metrics.
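The ordering matters because standardization destroys the image statistics that quality metrics rely on. A toy sketch (function names and the exact preprocessing are illustrative assumptions, not FaceDetector's actual API):

```python
import numpy as np

def preprocess_for_facenet(face: np.ndarray) -> np.ndarray:
    """Per-image standardization, as FaceNet-style pipelines typically apply."""
    face = face.astype(np.float32)
    return (face - face.mean()) / (face.std() + 1e-8)

def brightness_score(face: np.ndarray) -> float:
    """Mean intensity scaled to [0, 1]; only meaningful on the raw crop."""
    return float(face.mean() / 255.0)

raw_crop = np.full((160, 160), 200, dtype=np.uint8)  # a uniformly bright crop

print(round(brightness_score(raw_crop), 3))                       # 0.784 on raw pixels
print(round(float(preprocess_for_facenet(raw_crop).mean()), 3))   # 0.0 after standardization
```

After standardization, every crop has roughly zero mean and unit variance, so a brightness (or similar) metric computed post-preprocessing would be uninformative; assessing quality on the raw crop first is the correct order.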
backend/app/utils/face_quality.py (1)

105-156: Well-designed quality scoring function.

Good implementation with:
- Weight validation using `np.isclose` for floating-point tolerance
- Clear separation of individual metrics
- Comprehensive return dictionary for debugging
- Appropriate default weights (sharpness prioritized at 0.4)

backend/app/database/faces.py (1)
24-24: Quality field type annotation is correct.

The `Optional[float]` type correctly reflects that quality may be `None` for faces inserted before this feature was added.

backend/app/utils/face_clusters.py (1)

306-319: Quality-weighted cluster representative calculation is well implemented.

Good use of the `calculate_cluster_mean` function with `use_medoid=CLUSTERING_USE_QUALITY_WEIGHTING` to leverage quality scores when computing cluster representatives. This should improve assignment accuracy by using higher-quality faces as reference points.

backend/app/utils/clustering_advanced.py (4)
1-17: LGTM! Clean imports and clear module documentation.

The imports are appropriate for the clustering functionality, and the module docstring clearly outlines the provided features.

88-125: LGTM! Efficient DBSCAN wrapper with good defaults.

The adaptive epsilon selection and parallel processing (`n_jobs=-1`) are well-implemented. The cosine metric is appropriate for face embeddings.

128-163: LGTM! Correct mutual exclusivity and parameter handling.

The hierarchical clustering implementation correctly handles the mutual exclusivity between `n_clusters` and `distance_threshold` (line 156), and uses appropriate defaults for face embedding clustering.

291-323: LGTM! Useful example code for testing.

The example usage in the `__main__` block provides a good quick test of the clustering functionality with synthetic data.
Actionable comments posted: 2
🧹 Nitpick comments (4)
backend/tests/test_clustering_improvements.py (4)

24-55: Consider adding random seeds for reproducibility.

Tests using `np.random.randn` without seeds are non-deterministic, which can occasionally cause test flakiness. While not critical here, adding seeds improves reproducibility during debugging.

🔎 Example fix for one test:

```diff
     def test_returns_float(self):
         """Test that function returns a float."""
+        np.random.seed(42)
         embeddings = np.random.randn(50, 128)
```
189-232: Add test for quality-weighted mean calculation.

The test suite covers simple mean and quality-weighted medoid, but misses the quality-weighted mean path (when `use_medoid=False` and `quality_scores` are provided). This code path is exercised in the implementation but not validated by tests.

🔎 Add this test:

```python
def test_quality_weighted_mean(self):
    """Test quality-weighted mean calculation."""
    embeddings = np.array([
        [1.0, 0.0],
        [0.0, 1.0],
    ])
    quality_scores = np.array([0.8, 0.2])  # First embedding has higher quality

    result = calculate_cluster_mean(
        embeddings, use_medoid=False, quality_scores=quality_scores
    )

    # Result should be closer to first embedding due to quality weighting
    weights = quality_scores / quality_scores.sum()
    expected_raw = np.average(embeddings, axis=0, weights=weights)
    expected = expected_raw / np.linalg.norm(expected_raw)
    np.testing.assert_array_almost_equal(result, expected)
```
234-270: Add test for invalid algorithm error handling.

The `cluster_faces` function raises a `ValueError` for unknown algorithms, but this error path isn't tested. Adding a negative test improves coverage of error handling.

🔎 Add this test:

```python
def test_invalid_algorithm(self):
    """Test that invalid algorithm raises ValueError."""
    embeddings = np.random.randn(30, 128)

    with pytest.raises(ValueError, match="Unknown algorithm"):
        cluster_faces(embeddings, algorithm="invalid")
```
301-315: Consider strengthening the epsilon adaptation assertion.

The test creates dense (variance 0.01) and sparse (variance 10.0) data but only asserts both epsilons are positive. To better validate adaptive behavior, consider asserting that `eps_sparse > eps_dense`, which aligns with the intuition that sparser data should yield larger epsilon values.

🔎 Apply this diff:

```diff
     # Both should return valid positive epsilon
     assert eps_dense > 0
     assert eps_sparse > 0
+    # Sparse data should yield larger epsilon
+    assert eps_sparse > eps_dense
```
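The suggested assertion rests on k-distances scaling with data spread. A quick NumPy sketch of that intuition, using a simplified stand-in for `auto_select_epsilon` (Euclidean distances and a 90th-percentile heuristic instead of the module's cosine metric and knee detection):

```python
import numpy as np

def kdist_epsilon(points: np.ndarray, k: int = 5, percentile: float = 90.0) -> float:
    """Percentile of k-th nearest-neighbour Euclidean distances.

    Simplified proxy for adaptive epsilon selection; the real module is
    described as using cosine distances and knee detection via `kneed`.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)            # exclude self-distance
    kth = np.sort(dists, axis=1)[:, k - 1]     # k-th nearest neighbour per point
    return float(np.percentile(kth, percentile))

rng = np.random.default_rng(0)
dense = rng.normal(scale=0.1, size=(50, 16))
sparse = rng.normal(scale=10.0, size=(50, 16))

print(kdist_epsilon(sparse) > kdist_epsilon(dense))  # True
```

Because pairwise distances scale linearly with the data's spread, the sparse cloud's k-distances (and hence any percentile-based epsilon) are roughly 100x larger here, so the `eps_sparse > eps_dense` assertion should hold robustly.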
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- backend/app/models/FaceDetector.py (3 hunks)
- backend/tests/test_clustering_improvements.py (1 hunk)
- backend/tests/test_face_quality.py (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (2)
- backend/tests/test_face_quality.py
- backend/app/models/FaceDetector.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/test_clustering_improvements.py (2)
- backend/app/utils/clustering_advanced.py (6): auto_select_epsilon (19-85), cluster_dbscan_adaptive (88-125), cluster_hierarchical (128-163), calculate_quality_weighted_medoid (166-204), calculate_cluster_mean (207-250), cluster_faces (253-288)
- backend/test.py (1): main (35-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (2)
backend/tests/test_clustering_improvements.py (2)

1-19: LGTM! Well-structured test module.

The imports are clean and comprehensive, covering all the clustering functions that need testing. The module docstring clearly describes the scope.

1-319: Comprehensive test coverage with room for edge cases.

This test suite provides solid coverage of the advanced clustering module with 25 tests spanning basic functionality, parameter variations, and integration workflows. The structure is clear and tests are well-documented.
A few edge cases could be added for even better coverage:
- Empty cluster handling (should raise ValueError per implementation)
- Single-element clusters
- Error handling for invalid inputs
Overall, this is a well-crafted test module that validates the core clustering functionality effectively.
Pull request overview
This PR enhances face clustering accuracy by introducing quality-based filtering, adaptive parameter selection, and robust cluster representatives. The implementation adds comprehensive quality assessment for detected faces and improves clustering algorithms with automatic epsilon selection and quality-weighted medoid calculation.
Key Changes:
- New face quality assessment module scoring faces on sharpness, brightness, and size
- Adaptive DBSCAN with automatic epsilon selection using k-NN and knee detection
- Quality-weighted medoid calculation for more representative cluster centers
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 19 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/tests/test_face_quality.py | Comprehensive unit tests (24 tests) for face quality assessment functions |
| backend/tests/test_clustering_improvements.py | Unit tests (25 tests) for advanced clustering algorithms including adaptive DBSCAN and hierarchical clustering |
| backend/app/utils/face_quality.py | New module implementing sharpness, brightness, and size assessment with configurable weighting |
| backend/app/utils/clustering_advanced.py | New module with adaptive DBSCAN, hierarchical clustering, and quality-weighted medoid calculation |
| backend/app/utils/face_clusters.py | Integration of quality filtering and advanced clustering into the main clustering workflow |
| backend/app/models/FaceDetector.py | Face detection enhanced with quality score calculation and storage |
| backend/app/database/faces.py | Database schema updated with quality column; queries modified to include quality scores |
| backend/test.py | Test script updated with new DBSCAN parameters |
| backend/requirements.txt | Added kneed>=0.8.5 dependency for knee detection in epsilon selection |
Comments suppressed due to low confidence (1)

backend/app/models/FaceDetector.py:99

- This assignment to 'close' is unnecessary as it is redefined before this value is used.

  ```python
  def close(self):
  ```
```python
# Fallback heuristic: Use 90th percentile of k-distances
# This is a reasonable estimate for dense regions
```

Copilot AI (Dec 18, 2025)

The fallback heuristic using the 90th percentile is chosen without explanation. Consider adding a comment explaining why the 90th percentile is preferred over other percentiles (e.g., 75th, 95th), for better maintainability and understanding of this design decision.

🔎 Suggested change:

```diff
-# Fallback heuristic: Use 90th percentile of k-distances
-# This is a reasonable estimate for dense regions
+# Fallback heuristic: use 90th percentile of k-distances.
+# Rationale: lower percentiles (e.g., 75th) tend to under-estimate epsilon
+# and can fragment dense clusters, while very high percentiles (e.g., 95th)
+# tend to over-estimate epsilon and merge distinct identities. Empirically,
+# the 90th percentile provides a reasonable balance for typical face-embedding
+# densities when knee detection is unavailable or unreliable.
```
```python
except Exception:
    # Any error in epsilon selection, use fallback
    return fallback_eps
```

Copilot AI (Dec 18, 2025)

The bare `except Exception` catches all exceptions without logging what went wrong, which makes debugging difficult. Consider logging the exception, or at least adding a comment explaining why silent failure is acceptable here.
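A minimal sketch of the logged-fallback pattern (the function name, logger setup, and simplified percentile body are illustrative, not the module's actual API):

```python
import logging

logger = logging.getLogger(__name__)

def select_epsilon_safe(k_distances: list, fallback_eps: float = 0.3) -> float:
    """Return an epsilon estimate, logging failures instead of swallowing them.

    Illustrative stand-in for the review's suggestion; the real function is
    described as using knee detection on k-NN distances.
    """
    try:
        if not k_distances:
            raise ValueError("no k-distances available")
        k_sorted = sorted(k_distances)
        # 90th-percentile fallback heuristic from the code under review
        return k_sorted[int(0.9 * (len(k_sorted) - 1))]
    except Exception:
        # logger.exception records the full traceback at ERROR level,
        # so the failure is visible while the caller still gets a usable value.
        logger.exception("Epsilon auto-selection failed; using fallback_eps=%s",
                         fallback_eps)
        return fallback_eps

print(select_epsilon_safe([]))             # logs the failure, returns 0.3
print(select_epsilon_safe([0.1, 0.2, 0.4]))  # 0.2
```

The key point is `logger.exception` inside the handler: the control flow is identical to the silent version, but the root cause is preserved in the logs.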
```python
# Weight distances inversely by quality (lower quality = higher penalty)
# Add small constant to avoid division by zero
quality_weights = 1.0 / (quality_scores + 0.1)
```

Copilot AI (Dec 18, 2025)

The direction of this weighting deserves scrutiny. With `1.0 / (quality_scores + 0.1)`, lower quality scores produce larger weights; whether that biases medoid selection toward or away from high-quality faces depends on which axis of the distance matrix the weights are applied along. If they scale each candidate's own distance row, low-quality candidates are penalized as the comment intends; if they scale the contributions of neighbouring points instead, the bias is reversed and high-quality faces are effectively down-weighted. Verify the application axis matches the docstring's claim that quality weighting prioritizes high-quality faces, and align the comment with the actual behavior, e.g.:

```diff
-# Weight distances inversely by quality (lower quality = higher penalty)
-# Add small constant to avoid division by zero
-quality_weights = 1.0 / (quality_scores + 0.1)
+# Weight distances by quality scores so higher-quality faces have greater influence
+# (assumes larger quality_scores indicate better quality)
+quality_weights = quality_scores
```
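Whether inverse-quality weights favor or disfavor high-quality faces depends on which axis of the distance matrix they scale. A hand-rolled toy sketch (not the module's code) makes this concrete:

```python
import numpy as np

def weighted_medoid(dist: np.ndarray, quality: np.ndarray,
                    penalize_candidate: bool) -> int:
    """Pick a medoid index from a pairwise distance matrix.

    penalize_candidate=True scales each candidate's total distance by
    1/(quality + 0.1): low-quality candidates get a larger cost, biasing
    the medoid toward high-quality points. penalize_candidate=False
    applies the same factor to neighbour contributions instead, which
    produces the opposite bias.
    """
    w = 1.0 / (quality + 0.1)
    if penalize_candidate:
        cost = w * dist.sum(axis=1)               # scale candidate i's own row
    else:
        cost = (dist * w[None, :]).sum(axis=1)    # weight neighbours j
    return int(np.argmin(cost))

# Three mutually equidistant points; point 0 has much higher quality.
dist = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
quality = np.array([0.9, 0.2, 0.2])

print(weighted_medoid(dist, quality, penalize_candidate=True))   # 0
print(weighted_medoid(dist, quality, penalize_candidate=False))  # 1
```

The same `1/(q + 0.1)` formula selects the high-quality point when it penalizes candidates, and a low-quality point when it weights neighbours, which is why the review's concern hinges on the application axis rather than the formula alone.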
```python
    eps: float = 0.3, min_samples: int = 2
) -> List[ClusterResult]:
```

Copilot AI (Dec 18, 2025)

The function signature still includes `eps` and `min_samples` parameters with default values, and the function description states that `eps` is only used if `auto_epsilon=False`. However, the function doesn't accept an `auto_epsilon` parameter; it uses the global `CLUSTERING_AUTO_EPSILON` constant. This creates a confusing API where parameters don't match the configuration mechanism. Consider either accepting `auto_eps` as a parameter or removing the `eps` parameter entirely.
backend/test.py (outdated)

```python
dbscan = DBSCAN(eps=0.3, min_samples=2, metric="cosine")
# Updated parameters to match production improvements
# eps=0.2 for tighter clustering, min_samples=1 to include all faces
```

Copilot AI (Dec 18, 2025)

The comment "min_samples=1 to include all faces" is misleading as written. With scikit-learn's DBSCAN, `min_samples` counts the point itself, so `min_samples=1` makes every point a core point: no face is ever labelled noise (-1), and isolated faces form singleton clusters. Consider clarifying the comment:

```diff
-# eps=0.2 for tighter clustering, min_samples=1 to include all faces
+# eps=0.2 for tighter clustering; min_samples=1 makes every point a core point,
+# so every face joins a cluster (possibly a singleton) and none are labelled noise
```
```python
# Calculate mean embeddings for each cluster
cluster_means = []
for cluster_id, embeddings_list in cluster_embeddings.items():
    # Stack all embeddings for this cluster and calculate mean
```

Copilot AI (Dec 18, 2025)

The comment "Stack all embeddings for this cluster and calculate mean" is now inaccurate. The function stacks embeddings but no longer calculates the mean here. Remove or update this comment to reflect the actual behavior.

```diff
-    # Stack all embeddings for this cluster and calculate mean
+    # Stack all embeddings for this cluster
```
```python
# Calculate pairwise cosine distances
dist_matrix = cosine_distances(embeddings)

# Weight distances inversely by quality (lower quality = higher penalty)
```

Copilot AI (Dec 18, 2025)

The comment "Weight distances inversely by quality (lower quality = higher penalty)" contradicts the intended behavior described in the docstring, which says "Quality weighting prioritizes high-quality faces." The comment should be corrected to match the intended implementation once the weighting logic is settled.

```diff
-# Weight distances inversely by quality (lower quality = higher penalty)
+# Weight distances so that lower-quality faces incur a higher penalty,
+# biasing the medoid selection toward higher-quality embeddings.
```
```python
import pytest
import numpy as np
```

Copilot AI (Dec 18, 2025)

Import of `Tuple` is not used.
backend/tests/test_face_quality.py (outdated)

```python
import pytest
import numpy as np
```

Copilot AI (Dec 18, 2025)

Import of `cv2` is not used.
backend/requirements.txt (outdated)

```text
psutil>=5.9.5
pytest-asyncio>=1.0.0
setuptools==66.1.1
kneed>=0.8.5
```

Copilot AI (Dec 18, 2025)

The new dependency `kneed>=0.8.5` is specified with an open-ended version range, so builds will automatically pull whatever latest version satisfies the constraint. This increases supply-chain risk compared to the other fully pinned runtime dependencies in this file: if an attacker compromises the `kneed` package or its distribution channel, a future malicious release could be installed and executed in your backend environment with access to application data and secrets. To reduce this risk, pin `kneed` to a specific, vetted version (and optionally enforce hashes) so that builds are reproducible and cannot silently upgrade to untrusted code.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/utils/face_clusters.py (1)

165-228: API inconsistency between function parameters and module-level configuration.

The function signature includes `eps` and `min_samples` parameters, but the implementation primarily uses module-level constants (`CLUSTERING_AUTO_EPSILON`, `CLUSTERING_FIXED_EPSILON`, etc.). The docstring references `auto_epsilon`, which isn't a function parameter. This creates confusion about what controls clustering behavior.

Consider either:

- Making these function parameters and removing module constants, OR
- Removing the function parameters and using only module constants

🔎 Option 1, use function parameters consistently:

```diff
 def cluster_util_cluster_all_face_embeddings(
-    eps: float = 0.3, min_samples: int = 2
+    algorithm: str = "dbscan",
+    auto_eps: bool = True,
+    fixed_eps: float = 0.3,
+    min_samples: int = 2,
 ) -> List[ClusterResult]:
     """
-    Cluster face embeddings using advanced clustering algorithms with quality filtering.
+    Cluster face embeddings using advanced algorithms with quality filtering.

     Args:
+        algorithm: Clustering algorithm ('dbscan' or 'hierarchical')
+        auto_eps: Whether to auto-select epsilon for DBSCAN
-        eps: DBSCAN epsilon parameter (used only if auto_epsilon=False)
+        fixed_eps: Fixed epsilon if auto_eps=False
         min_samples: Minimum samples parameter for core points

     Returns:
         List of ClusterResult objects
     """
     # ... quality filtering ...

     # Perform clustering using provided parameters
     cluster_labels = cluster_faces(
         embeddings_array,
-        algorithm=CLUSTERING_ALGORITHM,
+        algorithm=algorithm,
         min_samples=min_samples,
-        auto_eps=CLUSTERING_AUTO_EPSILON,
+        auto_eps=auto_eps,
-        fixed_eps=eps,
+        fixed_eps=fixed_eps,
-        distance_threshold=HIERARCHICAL_DISTANCE_THRESHOLD,
-        linkage=HIERARCHICAL_LINKAGE,
+        distance_threshold=HIERARCHICAL_DISTANCE_THRESHOLD,  # or add as param
+        linkage=HIERARCHICAL_LINKAGE,  # or add as param
     )
```

Option 2, remove the function parameters entirely:

```diff
-def cluster_util_cluster_all_face_embeddings(
-    eps: float = 0.3, min_samples: int = 2
-) -> List[ClusterResult]:
+def cluster_util_cluster_all_face_embeddings() -> List[ClusterResult]:
     """
     Cluster face embeddings using advanced clustering algorithms with quality filtering.
+
+    Configuration is controlled via module-level constants:
+    - CLUSTERING_ALGORITHM, CLUSTERING_AUTO_EPSILON, etc.

-    Args:
-        eps: DBSCAN epsilon parameter (used only if auto_epsilon=False)
-        min_samples: Minimum samples parameter for core points
-
     Returns:
         List of ClusterResult objects
     """
```
🧹 Nitpick comments (2)
backend/tests/test_clustering_improvements.py (1)
304-318: Consider asserting relative epsilon values to verify adaptive behavior.

The test creates dense and sparse data but only verifies both epsilon values are positive. To better validate the adaptive epsilon selection, consider asserting that sparse data yields a larger epsilon than dense data.

🔎 Suggested enhancement:

```diff
     # Very sparse cluster (large variance)
     sparse = np.random.randn(50, 128) * 10.0
     eps_sparse = auto_select_epsilon(sparse)

     # Both should return valid positive epsilon
     assert eps_dense > 0
     assert eps_sparse > 0
+    # Sparse data should yield larger epsilon than dense data
+    assert eps_sparse > eps_dense
```

backend/app/utils/face_clusters.py (1)
35-44: Consider organizing configuration constants for better maintainability.The module-level constants are functional but could be better organized. As the number of configuration options grows, consider grouping them into a configuration dataclass or dictionary.
🔎 Example refactoring approach:
```python
from dataclasses import dataclass

@dataclass
class ClusteringConfig:
    algorithm: Literal["dbscan", "hierarchical"] = "dbscan"
    auto_epsilon: bool = True
    fixed_epsilon: float = 0.3
    min_samples: int = 2
    use_quality_weighting: bool = True
    quality_filter_enabled: bool = True
    quality_min_threshold: float = 0.4
    hierarchical_linkage: str = "average"
    hierarchical_distance_threshold: float = 0.3

CLUSTERING_CONFIG = ClusteringConfig()
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `backend/app/utils/clustering_advanced.py` (1 hunks)
- `backend/app/utils/face_clusters.py` (6 hunks)
- `backend/tests/test_clustering_improvements.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/utils/clustering_advanced.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/utils/face_clusters.py (3)
backend/app/utils/clustering_advanced.py (2)
- `cluster_faces` (255-290)
- `calculate_cluster_mean` (209-252)

backend/app/utils/face_quality.py (1)

- `filter_quality_faces` (173-184)

backend/app/database/faces.py (3)

- `db_get_all_faces_with_cluster_names` (261-301)
- `db_get_faces_unassigned_clusters` (230-258)
- `db_get_cluster_mean_embeddings` (349-406)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Backend Tests
🔇 Additional comments (8)
backend/tests/test_clustering_improvements.py (6)
21-56: LGTM! Comprehensive coverage of epsilon selection.

The test cases appropriately validate return types, value constraints, fallback behavior with insufficient samples, and custom parameter handling.
86-99: Well done! Previous weak assertion has been strengthened.

The test now correctly asserts exactly 2 clusters and verifies no noise points, which is the expected behavior for well-separated data with fixed epsilon. The fixed seed ensures reproducibility.
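The behavior this assertion relies on is easy to reproduce with scikit-learn's `DBSCAN` directly. The data below is invented for illustration, not the test's actual values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated blobs with a fixed seed for reproducibility
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=0.05, size=(20, 8))
b = rng.normal(loc=5.0, scale=0.05, size=(20, 8))
X = np.vstack([a, b])

# With eps comfortably above within-blob spacing and far below the
# inter-blob gap, every point becomes a core or border point.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
n_clusters = len(set(labels) - {-1})
print(n_clusters, -1 in labels)  # 2 False
```

Asserting both "exactly 2 clusters" and "no `-1` labels" catches regressions where a changed epsilon turns boundary points into noise.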
101-137: LGTM! Tests appropriately validate hierarchical clustering behavior.

The tests correctly verify that hierarchical clustering assigns all points to clusters (no noise), supports configurable cluster counts, and works with all linkage methods.
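The no-noise property called out here follows from how agglomerative clustering works: every point is merged into some cluster. A minimal sketch using SciPy, assuming cosine distance and average linkage as the module constants suggest (the data is invented):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_hierarchical_sketch(embeddings, distance_threshold=0.3, method="average"):
    """Agglomerative clustering cut at a cosine-distance threshold.

    Sketch only: unlike DBSCAN there is no -1 noise label; every
    embedding receives a cluster assignment.
    """
    Z = linkage(embeddings, method=method, metric="cosine")
    # fcluster labels start at 1; shift to 0-based like DBSCAN's output
    return fcluster(Z, t=distance_threshold, criterion="distance") - 1

rng = np.random.default_rng(1)
a = rng.standard_normal((20, 32)) * 0.01 + np.array([1.0] + [0.0] * 31)
b = rng.standard_normal((20, 32)) * 0.01 + np.array([0.0, 1.0] + [0.0] * 30)
labels = cluster_hierarchical_sketch(np.vstack([a, b]))
print(sorted(set(labels)))  # two clusters, no noise label
```

The two groups sit near orthogonal unit vectors, so their cosine distance (about 1.0) is well above the 0.3 cut while within-group distances are tiny.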
171-190: Excellent fix! Zero-vector issue resolved.

The test now uses a proper unit vector (`np.ones(128) / np.sqrt(128)`) instead of zeros, which prevents undefined cosine distance calculations while maintaining the test's intent to verify quality-based medoid selection.
192-235: LGTM! Tests comprehensively validate cluster representative calculation.

The tests correctly verify normalized mean calculation, medoid selection with quality weighting, and the important property that medoids are actual data points rather than computed averages.
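The medoid-vs-mean property these tests assert can be illustrated with a small quality-weighted medoid sketch. The weighting scheme below is one plausible choice, not necessarily what `calculate_quality_weighted_medoid` does:

```python
import numpy as np

def quality_weighted_medoid_sketch(embeddings, qualities):
    """Pick the cluster member minimizing quality-weighted cosine distance.

    Sketch only: distances to high-quality neighbors count more, and
    high-quality candidates are favored by dividing by their own quality.
    The key property matches the tests above: the result is an index into
    the cluster, i.e. an actual data point, never a synthetic average.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos_dist = 1.0 - unit @ unit.T            # pairwise cosine distances
    w = np.asarray(qualities, dtype=float)
    scores = (cos_dist * w[None, :]).sum(axis=1) / w
    return int(np.argmin(scores))

embs = np.array([[1.0, 0, 0, 0], [1.0, 0, 0, 0], [0, 1.0, 0, 0]])
print(quality_weighted_medoid_sketch(embs, [1.0, 1.0, 1.0]))  # 0: the outlier (index 2) never wins
print(quality_weighted_medoid_sketch(embs, [0.2, 0.9, 0.2]))  # 1: quality breaks the tie
```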
237-273: LGTM! Tests appropriately cover the main clustering interface.

The tests validate both DBSCAN and hierarchical algorithm paths, verify all embeddings receive labels, and support custom parameters.
backend/app/utils/face_clusters.py (2)
186-200: LGTM! Quality filtering is well-implemented.

The filtering logic correctly applies the quality threshold, logs diagnostic information about filtered faces, and handles the edge case where all faces are filtered out with an early return.
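A minimal sketch of this threshold-filtering step with diagnostic output, in the spirit of `filter_quality_faces` (the field names and print-based logging here are assumptions, not the PR's code):

```python
def filter_quality_faces_sketch(faces, min_threshold=0.4):
    """Keep only faces whose quality score meets the threshold.

    Sketch only: reports how many faces were dropped, mirroring the
    diagnostic logging described in the review comment above.
    """
    kept = [f for f in faces if f.get("quality", 0.0) >= min_threshold]
    dropped = len(faces) - len(kept)
    if dropped:
        print(f"Filtered out {dropped} low-quality faces (< {min_threshold})")
    return kept

faces = [
    {"id": 1, "quality": 0.9},
    {"id": 2, "quality": 0.1},   # below threshold, filtered out
    {"id": 3, "quality": 0.5},
]
kept = filter_quality_faces_sketch(faces)
print([f["id"] for f in kept])  # [1, 3]
```

The early-return edge case the comment mentions corresponds to `kept` being empty: the caller should skip clustering entirely rather than cluster zero embeddings.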
305-361: LGTM! Quality-aware face assignment is well-implemented.

The assignment logic effectively integrates quality assessment at multiple levels:
- Uses quality-weighted cluster representatives (medoid or mean)
- Filters low-quality faces during assignment
- Provides debug logging for non-assignments and skipped faces
This should significantly improve clustering accuracy by preventing low-quality faces from being incorrectly assigned.
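The assignment flow described above (skip low-quality faces, then assign only when the face is similar enough to some cluster representative) can be sketched as follows; the thresholds and signature are assumptions for illustration:

```python
import numpy as np

def assign_face_sketch(face_emb, face_quality, reps,
                       sim_threshold=0.7, min_quality=0.4):
    """Assign a face to the nearest cluster representative, or return None.

    Sketch only: low-quality faces are skipped outright, and a face is
    assigned only when its cosine similarity to the best representative
    clears the threshold, preventing forced bad assignments.
    """
    if face_quality < min_quality:
        return None                      # skipped: quality too low to trust
    face = face_emb / np.linalg.norm(face_emb)
    units = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = units @ face                  # cosine similarity to each representative
    best = int(np.argmax(sims))
    return best if sims[best] >= sim_threshold else None

reps = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(assign_face_sketch(np.array([0.9, 0.1, 0.0]), 0.8, reps))  # 0: near cluster 0
print(assign_face_sketch(np.array([0.9, 0.1, 0.0]), 0.2, reps))  # None: low quality
```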
Force-pushed from ec4c7dd to 8980d72
Overview
Enhances face clustering accuracy with quality-based filtering, adaptive parameter selection, and robust cluster representatives. Includes comprehensive unit tests.
Changes Made
1. Face Quality Assessment (`face_quality.py`)
2. Adaptive DBSCAN Parameters (`clustering_advanced.py`)
3. Hierarchical Clustering Option
4. Quality-Weighted Medoid
5. Database Schema Update: added `quality` column to `faces` table

Files Changed

New Files:

- `app/utils/face_quality.py`
- `app/utils/clustering_advanced.py`
- `migrate_add_quality.py`

Modified Files:

- `app/utils/face_clusters.py`
- `app/models/FaceDetector.py`
- `app/database/faces.py`
- `requirements.txt` (added `kneed>=0.8.5`)

Tests:

- `tests/test_face_quality.py` (24 tests)
- `tests/test_clustering_improvements.py` (25 tests)

Benefits
Demo
Before
face_clustering_before.mp4
After
face_clustering_after.mp4
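The quality scoring in `face_quality.py` is not reproduced on this page; as a rough illustration of the kind of sharpness signal such a score typically builds on, here is a variance-of-Laplacian sketch in plain NumPy (real implementations often combine this with face size and brightness, and the synthetic "face crops" below are just arrays):

```python
import numpy as np

def sharpness_score_sketch(gray: np.ndarray) -> float:
    """Variance of a 5-point Laplacian: a standard blur/sharpness proxy."""
    lap = (
        -4.0 * gray[1:-1, 1:-1]
        + gray[:-2, 1:-1] + gray[2:, 1:-1]
        + gray[1:-1, :-2] + gray[1:-1, 2:]
    )
    return float(lap.var())

def box_blur(img: np.ndarray) -> np.ndarray:
    """3x3 box filter (valid region only), simulating a blurry face crop."""
    return sum(
        img[i:img.shape[0] - 2 + i, j:img.shape[1] - 2 + j]
        for i in range(3) for j in range(3)
    ) / 9.0

rng = np.random.default_rng(3)
sharp = rng.random((64, 64))          # stand-in for a detailed face crop
blurry = box_blur(sharp)
print(sharpness_score_sketch(sharp) > sharpness_score_sketch(blurry))  # True
```

Blurring suppresses high-frequency content, so the Laplacian variance drops, which is what lets a threshold like `quality_min_threshold` separate usable faces from blurry ones once scores are normalized.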
Summary by CodeRabbit
New Features
Tests
Chores