Is your feature request related to a problem? Please describe.
Currently, the reindex-from-snapshot process migrates all docs in a shard with a single worker and marks the shard as completed only once every doc has been migrated in a single attempt. For larger shards this increases both migration time and risk: when an attempt fails, the docs at the start of the shard may be re-sent several times before one attempt succeeds. Relying on a single long-running worker for a large shard also raises the chance of failure, which further extends the time to complete the migration.
Describe the solution you'd like
Implement the ability to regularly checkpoint and resume migration of a shard, which limits the number of times a doc is migrated more than once, particularly for large shards. Implement a ceiling on the duration that a lease for a work item can reach.
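A minimal sketch of the idea, not the project's actual implementation: a worker persists a checkpoint after each batch so that a successor resumes from there instead of from the first doc, and the lease length is capped. The names `WorkItem`, `lease_duration`, and `run_worker`, the doubling of the lease per attempt, and the default durations are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class WorkItem:
    """Hypothetical work-item record that would live in a coordination store."""
    shard_id: str
    checkpoint: int = 0       # index of the next doc to migrate
    completed: bool = False


def lease_duration(attempt: int, base_s: float = 600.0,
                   ceiling_s: float = 14400.0) -> float:
    """Assumed policy: the lease doubles per attempt but never exceeds the ceiling."""
    return min(base_s * (2 ** attempt), ceiling_s)


def run_worker(item: WorkItem,
               docs: Sequence[object],
               send_batch: Callable[[Sequence[object]], None],
               batch_size: int = 2,
               fail_after_batches: Optional[int] = None) -> int:
    """Migrate docs starting at item.checkpoint and return how many were sent.

    The checkpoint advances after every successful batch, so a later worker
    picking up the same item skips everything already migrated.
    fail_after_batches simulates a lease expiring partway through the shard.
    """
    sent = 0
    batches = 0
    while item.checkpoint < len(docs):
        if fail_after_batches is not None and batches >= fail_after_batches:
            return sent  # simulated lease expiry; progress so far is kept
        batch = docs[item.checkpoint:item.checkpoint + batch_size]
        send_batch(batch)
        item.checkpoint += len(batch)  # persist progress (in memory here)
        sent += len(batch)
        batches += 1
    item.completed = True
    return sent
```

In this sketch a second call to `run_worker` on the same `WorkItem` sends only the docs after the stored checkpoint, so at most one in-flight batch would ever be repeated after a failure rather than the whole shard.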
Describe alternatives you've considered
We could split each shard up front into sub-shard work items that are migrated in parallel, but this adds complexity and makes work distribution in the target cluster more uneven: when an index keeps the same shard count during a migration, every sub-shard worker for a given source shard hits a single node/shard in the target cluster.
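The unevenness can be illustrated with a small sketch. It assumes source and target route a doc by hashing its ID modulo the shard count; `zlib.crc32` is only a stand-in for the cluster's real routing hash, and `shard_for` and `target_shards_hit` are hypothetical helpers.

```python
import zlib
from collections import Counter
from typing import Iterable


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stand-in routing: hash of the doc ID modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def target_shards_hit(source_shard: int, num_source: int, num_target: int,
                      doc_ids: Iterable[str]) -> Counter:
    """Count, per target shard, the docs that came from one source shard."""
    in_source = [d for d in doc_ids if shard_for(d, num_source) == source_shard]
    return Counter(shard_for(d, num_target) for d in in_source)


doc_ids = [f"doc-{i}" for i in range(1000)]

# Same shard count: every doc from source shard 0 lands on target shard 0,
# so all sub-shard workers for that source shard write to one target shard.
same_count = target_shards_hit(0, 5, 5, doc_ids)

# Different shard count: the same docs spread over several target shards.
different_count = target_shards_hit(0, 5, 7, doc_ids)
```

Under these assumptions, parallel sub-shard workers for one source shard would add concurrency without spreading load across the target cluster whenever the shard count is unchanged.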
sumobrian changed the title from "[FEATURE] Optimized Reindex-from-Snapshot with Direct S3 Ingestion and Sub-Shard Parallelization" to "Optimized Reindex-from-Snapshot with Direct S3 Ingestion and Sub-Shard Parallelization" on Oct 23, 2024
sumobrian changed the title from "Optimized Reindex-from-Snapshot with Direct S3 Ingestion and Sub-Shard Parallelization" to "Optimized Reindex-from-Snapshot with Shard Parallelization" on Nov 5, 2024
sumobrian changed the title from "Optimized Reindex-from-Snapshot with Shard Parallelization" to "Optimized Reindex-from-Snapshot with Sub-shard checkpoints" on Dec 7, 2024