Optimized Reindex-from-Snapshot with Sub-shard checkpoints #1095

sumobrian · 2024-10-23T03:28:51Z

Is your feature request related to a problem? Please describe.

Currently, the reindex-from-snapshot process migrates all docs in a shard with a single worker and marks the shard as completed only once all docs have been migrated in a single attempt. For larger shards this increases both migration time and risk: if an attempt fails partway through, the docs at the start of the shard may be migrated several times before an attempt succeeds. Relying on one long-running worker per large shard also raises the chance of failure, which further extends the time needed to complete the migration.

Describe the solution you'd like

Implement the ability to regularly checkpoint the migration of a shard and to resume from the last checkpoint, which limits the number of times a doc is re-migrated, particularly for large shards. Implement a ceiling on the duration that the lease for a work item can reach.
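A minimal sketch of the idea, not the project's implementation. It assumes docs in a shard can be read in a stable order so that a checkpoint can be a plain doc offset, and that leases grow on each retry until they hit the ceiling. `WorkItem`, `next_lease_seconds`, `migrate_with_checkpoints`, and the default durations are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WorkItem:
    """Hypothetical work item: one source shard plus its saved progress."""
    shard_id: str
    checkpoint: int = 0      # number of docs already confirmed as migrated
    completed: bool = False


def next_lease_seconds(attempt: int, base: float = 600.0,
                       ceiling: float = 14400.0) -> float:
    """Assumed policy: double the lease on each attempt, capped at a ceiling."""
    return min(base * (2 ** attempt), ceiling)


def migrate_with_checkpoints(item: WorkItem,
                             docs: Sequence,
                             send_batch: Callable[[List], None],
                             batch_size: int,
                             lease_seconds: float,
                             clock: Callable[[], float]) -> WorkItem:
    """Migrate docs from the saved checkpoint until done or the lease expires."""
    deadline = clock() + lease_seconds
    position = item.checkpoint           # resume point: skip docs already sent
    while position < len(docs):
        if clock() >= deadline:
            return item                  # lease expired; checkpoint is kept
        batch = list(docs[position:position + batch_size])
        send_batch(batch)
        position += len(batch)
        item.checkpoint = position       # record progress after each batch
    item.completed = True
    return item
```

A later worker that picks up the same `WorkItem` starts from `item.checkpoint` rather than from the first doc, so a failed or expired attempt only repeats the batch that was in flight, not the whole shard.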

Describe alternatives you've considered

We could split each shard up front into sub-shard work items that can be migrated in parallel, but this introduces complexity and makes work distribution in the target cluster more uneven: when an index's shard count remains the same during a migration, every sub-shard worker for a given source shard will hit a single node/shard in the target cluster.
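The hot-spot effect can be illustrated with a toy model. This sketch assumes docs are routed by a hash of their ID modulo the shard count, with the same routing and shard count on source and target and no custom routing. `route` uses CRC32 purely as a stand-in hash, and `sub_shard_targets` is a hypothetical helper.

```python
import zlib
from typing import List, Sequence, Set


def route(doc_id: str, num_shards: int) -> int:
    """Stand-in for hash-based routing: hash of the doc ID modulo shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def sub_shard_targets(doc_ids: Sequence[str], source_shard: int,
                      num_shards: int, parts: int) -> List[Set[int]]:
    """Split one source shard into sub-shards; return the target shards each hits."""
    shard_docs = [d for d in doc_ids if route(d, num_shards) == source_shard]
    size = max(1, -(-len(shard_docs) // parts))   # ceiling division
    sub_shards = [shard_docs[i:i + size]
                  for i in range(0, len(shard_docs), size)]
    return [{route(d, num_shards) for d in sub} for sub in sub_shards]


doc_ids = [f"doc-{i}" for i in range(1000)]
targets = sub_shard_targets(doc_ids, source_shard=0, num_shards=4, parts=3)
print(targets)   # every set contains only shard 0
```

Under these assumptions, adding more sub-shard workers for a source shard adds more concurrent writers to the same target shard rather than spreading load across the target cluster.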

Additional context

Jira Epic(s)

https://opensearch.atlassian.net/issues/MIGRATIONS-2103

sumobrian added the enhancement and untriaged labels Oct 23, 2024

sumobrian added this to OpenSearch Migrations - Roadmap Oct 23, 2024

github-project-automation bot moved this to Not Committed in OpenSearch Migrations - Roadmap Oct 23, 2024

sumobrian moved this from Not Committed to 3-6 Months in OpenSearch Migrations - Roadmap Oct 23, 2024

sumobrian removed the untriaged label Oct 23, 2024

sumobrian changed the title ~~[FEATURE] Optimized Reindex-from-Snapshot with Direct S3 Ingestion and Sub-Shard Parallelization~~ Optimized Reindex-from-Snapshot with Direct S3 Ingestion and Sub-Shard Parallelization Oct 23, 2024

sumobrian added the MAv2.x label Nov 5, 2024

sumobrian changed the title ~~Optimized Reindex-from-Snapshot with Direct S3 Ingestion and Sub-Shard Parallelization~~ Optimized Reindex-from-Snapshot with Shard Parallelization Nov 5, 2024

sumobrian added MAv2.1 and removed MAv2.x labels Nov 5, 2024

sumobrian changed the title ~~Optimized Reindex-from-Snapshot with Shard Parallelization~~ Optimized Reindex-from-Snapshot with Sub-shard checkpoints Dec 7, 2024

sumobrian added MAv2.2 and removed MAv2.1 labels Jan 5, 2025
