[Critical] Unscalable Global Face Re-Clustering Causing OOM & Service Blocking #810

@Nithin9585

Description

The current face clustering implementation performs a full global re-clustering of all face embeddings every 24 hours (or when the number of unassigned faces exceeds a threshold, currently 100).
This process loads all face embeddings into memory, runs DBSCAN on the entire dataset, deletes all existing clusters, and reinserts new ones.

This design does not scale and will inevitably crash or freeze the application as the user’s photo library grows.


Location

  • File: backend/app/utils/face_clusters.py
  • Function: cluster_util_face_clusters_sync
  • Trigger:
    • Automatic 24-hour re-clustering
    • Manual API call: /global-recluster

Current Logic

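 # Re-cluster everything if 24 hours (86400 s) have passed or more than 100 faces are unassigned
 if time_since_last_reclustering > 86400 or unassigned_faces > 100:
     results = cluster_util_cluster_all_face_embeddings()  # Loads ALL embeddings into RAM
     db_delete_all_clusters(cursor)                        # Deletes all clusters
     db_insert_clusters_batch(results)                     # Reinserts the newly computed clusters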

Impact

1. Out of Memory (OOM) Risk

Loading all embeddings (e.g., 50,000+ faces) into memory causes large RAM spikes.

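DBSCAN can require O(N²) pairwise distance computations in the worst case (and O(N²) memory if a full distance matrix is materialized), so both runtime and memory grow quadratically with the library size, making crashes inevitable at scale.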
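
As a rough back-of-the-envelope illustration of this growth, the sketch below estimates the size of the embedding array and of a dense pairwise distance matrix. The 512-dimensional float32 embedding and the float64 distance matrix are assumptions chosen for illustration, not values taken from the codebase, and whether a full matrix is ever materialized depends on how DBSCAN is invoked.

```python
def embedding_bytes(n_faces: int, dim: int = 512, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold n_faces embeddings (dim and dtype size are assumed)."""
    return n_faces * dim * bytes_per_value


def pairwise_matrix_bytes(n_faces: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense N x N distance matrix (float64 assumed)."""
    return n_faces * n_faces * bytes_per_value


for n in (1_000, 10_000, 50_000):
    emb_mb = embedding_bytes(n) / 1e6
    mat_gb = pairwise_matrix_bytes(n) / 1e9
    print(f"{n:>6} faces: embeddings ~{emb_mb:.1f} MB, dense distance matrix ~{mat_gb:.2f} GB")
```

Under these assumptions, 50,000 faces need roughly 100 MB for the embeddings themselves, while a dense distance matrix would need about 20 GB. The quadratic term, not the embeddings, dominates.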

2. Service Blocking / Self-DoS

The clustering runs synchronously, blocking background workers.

API requests can time out while the backend is busy with the clustering computation.
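
A minimal sketch of how the heavy job could be moved off the request path, using only the standard library. `ReclusterRunner` and `slow_recluster` are hypothetical names introduced for illustration, and the sleep stands in for the real clustering call. A worker thread only helps if the clustering releases the GIL for most of its work; a separate process or task queue may be a better fit in practice.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class ReclusterRunner:
    """Runs at most one re-clustering job at a time on a worker thread."""

    def __init__(self, job: Callable[[], int]) -> None:
        self._job = job
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()
        self._current: Optional[Future] = None

    def trigger(self) -> bool:
        """Start a job unless one is still running. Returns True if started."""
        with self._lock:
            if self._current is not None and not self._current.done():
                return False  # a job is already in flight; do not queue another
            self._current = self._executor.submit(self._job)
            return True

    def wait(self, timeout: Optional[float] = None) -> int:
        """Block until the current job finishes (used here only for the demo)."""
        if self._current is None:
            raise RuntimeError("no job has been triggered")
        return self._current.result(timeout=timeout)


def slow_recluster() -> int:
    # Stand-in for the real clustering call; sleeps instead of computing.
    time.sleep(0.2)
    return 42  # pretend number of clusters written


runner = ReclusterRunner(slow_recluster)
started = runner.trigger()   # returns immediately; work continues in the background
rejected = runner.trigger()  # a second request while the first is running is refused
result = runner.wait()
print(started, rejected, result)
```

An API handler for /global-recluster could call `trigger()` and return right away (for example with an "accepted" or "already running" response) instead of waiting for the computation to finish.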

3. Data & UX Instability

Daily full re-clustering causes clusters to shift, split, or merge unexpectedly.

User-assigned names or manual merges can be lost, damaging user trust.
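
One possible way to keep user-assigned names across a re-clustering run is to match each new cluster to the old cluster it shares the most faces with and carry the name over. The sketch below assumes face IDs are stable between runs; `carry_over_names`, its arguments, and the 0.5 overlap threshold are hypothetical and not part of the existing code.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable


def carry_over_names(
    old_assign: Dict[Hashable, Hashable],   # face_id -> old cluster id
    new_assign: Dict[Hashable, Hashable],   # face_id -> new cluster id
    old_names: Dict[Hashable, str],         # old cluster id -> user-assigned name
    min_overlap: float = 0.5,               # assumed threshold, tune as needed
) -> Dict[Hashable, str]:
    """Map new cluster ids to names inherited from the best-matching old cluster."""
    overlap = defaultdict(Counter)  # new cluster -> Counter of old clusters
    new_sizes = Counter()
    for face_id, new_c in new_assign.items():
        new_sizes[new_c] += 1
        old_c = old_assign.get(face_id)
        if old_c is not None:
            overlap[new_c][old_c] += 1

    candidates = []
    for new_c, counts in overlap.items():
        old_c, shared = counts.most_common(1)[0]
        # Only inherit a name if enough of the new cluster came from the old one.
        if old_c in old_names and shared / new_sizes[new_c] >= min_overlap:
            candidates.append((shared, new_c, old_c))

    names: Dict[Hashable, str] = {}
    used_old = set()
    # Largest overlap first, so that after a split the bigger part keeps the name.
    for shared, new_c, old_c in sorted(candidates, key=lambda c: c[0], reverse=True):
        if old_c in used_old:
            continue
        names[new_c] = old_names[old_c]
        used_old.add(old_c)
    return names


old_assign = {"f1": "A", "f2": "A", "f3": "A", "f4": "B", "f5": "B"}
old_names = {"A": "Alice", "B": "Bob"}
new_assign = {"f1": 0, "f2": 0, "f3": 1, "f4": 2, "f5": 2, "f6": 2}
inherited = carry_over_names(old_assign, new_assign, old_names)
print(inherited)
```

In this example the old "Alice" cluster has split in two; the larger part keeps the name and the smaller part is left unnamed rather than silently duplicating the label.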


Proposed Improvements

  • Incremental Clustering
    Assign new faces to existing clusters instead of reprocessing the entire dataset.

  • Background Execution
    Move /global-recluster to an async/background task to avoid blocking the API.

  • Batch / Chunk Processing
    Process embeddings in chunks to avoid RAM spikes.

  • Cluster Stability Guarantees
    Preserve existing clusters and user labels wherever possible.
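
The incremental and chunked ideas above can be combined in a small sketch: new embeddings are compared, one chunk at a time, against one centroid per existing cluster, and faces that are not close enough to any centroid stay unassigned. The function name, the cosine-distance threshold of 0.4, and the use of centroids are assumptions for illustration; the existing code may represent clusters differently.

```python
import numpy as np


def assign_to_existing_clusters(
    new_embeddings: np.ndarray,   # shape (n_new, dim)
    centroids: np.ndarray,        # shape (n_clusters, dim), one row per existing cluster
    max_distance: float = 0.4,    # assumed cosine-distance threshold
    chunk_size: int = 1024,
) -> np.ndarray:
    """Return, per new face, the index of the nearest centroid or -1 if none is close."""
    labels = np.full(len(new_embeddings), -1, dtype=np.int64)
    if len(centroids) == 0:
        return labels  # nothing to assign to; every face stays unassigned

    def _normalize(x: np.ndarray) -> np.ndarray:
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.clip(norms, 1e-12, None)

    centroids_n = _normalize(centroids)
    for start in range(0, len(new_embeddings), chunk_size):
        chunk = _normalize(new_embeddings[start:start + chunk_size])
        # Cosine distance between every face in the chunk and every centroid.
        dist = 1.0 - chunk @ centroids_n.T
        nearest = dist.argmin(axis=1)
        close = dist[np.arange(len(chunk)), nearest] <= max_distance
        labels[start:start + len(chunk)] = np.where(close, nearest, -1)
    return labels


centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
new_faces = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]])
labels = assign_to_existing_clusters(new_faces, centroids, chunk_size=2)
print(labels)
```

Peak memory here is bounded by `chunk_size × n_clusters` distances rather than by the square of the total number of faces. Faces labeled -1 could be collected and clustered among themselves later, without touching the existing clusters or their user labels.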

Severity

Severity: 1 (Critical)
This is an architectural scalability issue that can:

  • Crash the application (OOM)
  • Block services for extended periods
  • Corrupt user organization over time

Metadata

    Labels

    needs-triage: The maintainer needs time to review this issue. Please do not begin working on it.

