@kartikeyg0104 kartikeyg0104 commented Dec 13, 2025

Fixes #722

📋 Description

This PR addresses the bug where face clustering incorrectly groups unrelated images together when processing folders with many images. The cause was overly permissive DBSCAN clustering parameters: with eps = 0.3 on cosine distance, faces with a cosine similarity as low as 0.7 were treated as neighbors and grouped together.

🔧 Changes Made

1. DBSCAN Clustering Parameters (backend/app/utils/face_clusters.py)

  • eps parameter: Reduced from 0.3 to 0.15

    • With cosine distance, this now requires similarity > 0.85 (previously > 0.7)
    • Ensures only genuinely similar faces are clustered together
  • min_samples parameter: Increased from 2 to 3

    • Requires at least 3 similar faces to form a cluster
    • Reduces noise and prevents false groupings from face detection errors
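The link between eps and the similarity requirement can be sketched as follows. This is a minimal illustration, assuming cosine distance is defined as 1 minus cosine similarity; the helper names and the toy 2-D vectors are hypothetical and are not taken from face_clusters.py.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

def min_similarity_for_eps(eps):
    """Smallest cosine similarity that still counts as a neighbor for a given eps."""
    return 1.0 - eps

def are_neighbors(a, b, eps):
    """Two points are neighbors when their cosine distance does not exceed eps."""
    return cosine_distance(a, b) <= eps

# Toy stand-ins for face embeddings: two unit vectors 40 degrees apart,
# i.e. cosine similarity of about 0.77.
face_a = (1.0, 0.0)
face_b = (math.cos(math.radians(40)), math.sin(math.radians(40)))

old_result = are_neighbors(face_a, face_b, eps=0.3)   # neighbors under the old eps
new_result = are_neighbors(face_a, face_b, eps=0.15)  # not neighbors under the new eps
```

A pair with similarity around 0.77 passes the old setting and fails the new one, which is the kind of borderline match the change is meant to exclude.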

2. Incremental Assignment Threshold (backend/app/utils/face_clusters.py)

  • similarity_threshold: Increased from 0.7 to 0.85
    • Corresponds to the new DBSCAN eps of 0.15 (1 − 0.85), keeping both code paths consistent
    • Prevents new faces from being incorrectly assigned to existing clusters
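A rough sketch of how a similarity threshold gates incremental assignment is shown below. It assumes each existing cluster is represented by a mean embedding and that a new face goes to the most similar cluster only if the similarity reaches the threshold; the function name `assign_to_cluster`, the cluster ids, and the 2-D vectors are hypothetical, and the actual logic in cluster_util_assign_cluster_to_faces_without_clusterId may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_to_cluster(embedding, cluster_means, similarity_threshold=0.85):
    """Return the id of the most similar cluster, or None if no cluster
    reaches the similarity threshold (hypothetical helper)."""
    best_id, best_sim = None, -1.0
    for cluster_id, mean in cluster_means.items():
        sim = cosine_similarity(embedding, mean)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id if best_sim >= similarity_threshold else None

def unit(angle_degrees):
    """Toy 2-D unit vector standing in for a face embedding."""
    rad = math.radians(angle_degrees)
    return (math.cos(rad), math.sin(rad))

cluster_means = {"cluster_a": unit(0), "cluster_b": unit(90)}

borderline_face = unit(40)  # similarity to cluster_a is about 0.77
close_face = unit(20)       # similarity to cluster_a is about 0.94

old_assignment = assign_to_cluster(borderline_face, cluster_means, 0.7)
new_assignment = assign_to_cluster(borderline_face, cluster_means, 0.85)
close_assignment = assign_to_cluster(close_face, cluster_means, 0.85)
```

With the old threshold the borderline face is attached to cluster_a; with the new one it stays unassigned, while the clearly similar face is still assigned.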

3. Test Script Update (backend/test.py)

  • Updated DBSCAN parameters to match the main implementation

4. Documentation Update (docs/backend/backend_python/image-processing.md)

  • Updated parameter values and descriptions to reflect new settings
  • Added explanation of what the parameters mean for face similarity

🎯 Impact

Before

  • Folders with many images would create clusters with 314+ unrelated images
  • Face similarity threshold of ~70% was too low for reliable clustering
  • Any 2 faces with moderate similarity would form a cluster

After

  • Stricter similarity requirement (>85%) ensures accurate face grouping
  • Minimum of 3 faces required prevents spurious clusters from noise
  • Better clustering quality with fewer false positives
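The before/after behavior can be illustrated on toy data. The sketch below is a simplified, pure-Python DBSCAN written only for illustration (it is not the project's implementation), and the 2-D unit vectors are assumed stand-ins for face embeddings. Two tight groups are joined by a single in-between point: the old parameters chain everything into one cluster, while the new ones keep the groups apart and mark the bridge as noise.

```python
import math

def cosine_distance(a, b):
    """Cosine distance, taken as 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(points)
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
        cluster_id += 1
    return labels

def unit(angle_degrees):
    rad = math.radians(angle_degrees)
    return (math.cos(rad), math.sin(rad))

# Group A (0, 5, 10 degrees), one bridge point (45), group B (85, 90, 95).
points = [unit(a) for a in (0, 5, 10, 45, 85, 90, 95)]

before = dbscan(points, eps=0.3, min_samples=2)
after = dbscan(points, eps=0.15, min_samples=3)

print(before)  # [0, 0, 0, 0, 0, 0, 0]
print(after)   # [0, 0, 0, -1, 1, 1, 1]
```

Note that in the "before" run the points at 0 and 95 degrees end up in the same cluster despite being nearly orthogonal, because DBSCAN links points transitively through intermediate neighbors.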

✅ Testing

  • Syntax validation passed (no Python compilation errors)
  • No linting errors introduced
  • Code changes follow existing patterns
  • Documentation updated to match code changes

Testing Instructions for Reviewers

  1. Reset the database: python backend/reset_database.py
  2. Start backend and frontend
  3. Add a folder with many diverse images
  4. Enable AI Tagging for the folder
  5. Navigate to AI-Tagging page
  6. Verify that clusters now contain only related faces (no large clusters of 314+ unrelated images)

Note: This fix improves clustering accuracy but may result in fewer clusters being formed initially. This is the desired behavior as it prevents incorrect groupings. As more images of the same person are added, proper clusters will form with the higher confidence threshold.

Summary by CodeRabbit

  • Bug Fixes

    • Improved face clustering accuracy by implementing stricter matching parameters.
  • Documentation

    • Updated documentation to reflect refined clustering behavior and requirements.


…AOSSIE-Org#722)

- Reduce DBSCAN eps parameter from 0.3 to 0.15 for stricter clustering
  (now requires cosine similarity > 0.85 instead of > 0.7)
- Increase min_samples from 2 to 3 to require at least 3 similar faces
  to form a cluster, reducing noise and false groupings
- Update similarity_threshold in incremental assignment from 0.7 to 0.85
  to match the DBSCAN parameters
- Update documentation to reflect new parameters and their impact

This fix addresses the issue where folders with many images would have
one cluster containing 314+ unrelated images. The stricter parameters
ensure that only genuinely similar faces are grouped together.

Fixes AOSSIE-Org#722
@github-actions github-actions bot added the backend, bug, and medium labels Dec 13, 2025
@coderabbitai
Contributor

coderabbitai bot commented Dec 13, 2025

Walkthrough

DBSCAN clustering parameters are tightened across the codebase: epsilon reduced from 0.3 to 0.15 and minimum samples increased from 2 to 3. Similarity threshold for cluster assignment is raised from 0.7 to 0.85. Documentation is updated to reflect the new parameter values and their effects on clustering behavior.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Parameter Tuning: backend/app/utils/face_clusters.py | Updated DBSCAN defaults in cluster_util_cluster_all_face_embeddings (eps: 0.3→0.15, min_samples: 2→3) and similarity threshold in cluster_util_assign_cluster_to_faces_without_clusterId (0.7→0.85). Expanded docstrings to document new values. |
| Test Updates: backend/test.py | Updated DBSCAN clustering parameters in main flow to match new defaults (eps: 0.3→0.15, min_samples: 2→3). |
| Documentation: docs/backend/backend_python/image-processing.md | Updated Face Clustering parameter table to reflect new DBSCAN values and their interpretations regarding similarity thresholds and cluster formation requirements. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Parameter changes are consistent and homogeneous across files, reducing reasoning overhead
  • Documentation updates are straightforward mappings of the new parameter values
  • Logic flow remains unchanged; only default values and thresholds are adjusted

Poem

🐰 Clusters tighten, epsilon shrinks low,
Three faces now dance where two used to flow,
Similarity rises, the threshold's at eighty-five,
No more false companions—true groups now thrive! 🎯

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly identifies the main fix: improving face clustering parameters to prevent incorrect grouping, which directly addresses the core issue. |
| Linked Issues check | ✅ Passed | The PR implementation fully addresses issue #722 by tightening DBSCAN parameters (eps: 0.3→0.15, min_samples: 2→3) and similarity threshold (0.7→0.85), with supporting updates to code and documentation. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the face clustering issue: parameter adjustments in production code, test updates, and documentation changes. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07d817 and 062f970.

📒 Files selected for processing (3)
  • backend/app/utils/face_clusters.py (3 hunks)
  • backend/test.py (1 hunks)
  • docs/backend/backend_python/image-processing.md (1 hunks)
🔇 Additional comments (4)
docs/backend/backend_python/image-processing.md (1)

84-88: Documentation accurately reflects the updated parameters.

The DBSCAN parameter documentation has been updated correctly to match the implementation changes. The explanations clearly convey the impact of the stricter thresholds on clustering behavior.

backend/test.py (1)

43-43: Test script parameters correctly synchronized with production code.

The DBSCAN parameters in the test script now match the updated defaults in backend/app/utils/face_clusters.py, ensuring consistent clustering behavior between testing and production.

backend/app/utils/face_clusters.py (2)

246-263: [rewritten review comment]
[classification tag]


162-172: DBSCAN parameters correctly tightened to reduce false groupings.

The updated defaults (eps=0.15, min_samples=3) align with the PR objective. The docstring accurately explains the impact: with cosine distance, eps=0.15 requires similarity > 0.85, and min_samples=3 ensures clusters form only with sufficient evidence. The only call to this function uses the new defaults, so no external code depends on the previous parameter values.



@rahulharpal1603
Contributor

Hi, sorry, but only tuning parameters won't help in this case. You just tuned the parameters to solve the problem for the given dataset; we need a more general solution.

PR #771 has proposed the best solution for this issue.


Labels

backend, bug, medium

Development

Successfully merging this pull request may close these issues.

BUG: Incorrect clustering for a folder with many images

2 participants