
[RFC] Optimized Prefix Pattern for Shard-Level Files for Efficient Snapshots #15146

Closed

ashking94 opened this issue Aug 7, 2024 · 4 comments · Fixed by #15426
Labels
discuss (Issues intended to help drive brainstorming and decision making), enhancement (Enhancement or improvement to existing feature or request), RFC (Issues requesting major changes), Storage:Snapshots

Comments

@ashking94
Member

ashking94 commented Aug 7, 2024

TL;DR - This RFC, inspired by #12567, proposes an optimized prefix pattern for shard-level files in a snapshot.

Problem statement

Snapshots are backups of a cluster's indexes and state, including cluster settings, node information, index metadata, and shard allocation information. They are used to recover from failures, such as a red cluster, or to move data between clusters without loss. Snapshots are stored in a repository in a hierarchical manner that represents the composition of shards, indexes, and the cluster. However, this structure poses a scaling challenge when there are numerous shards due to limitations on concurrent operations over a fixed prefix in a remote store. In this RFC, we discuss various aspects to achieve a solution that scales well with a high number of shards.

Current repository structure

Below is the current structure of the snapshot:

[Image: snapshot repository layout, from https://opensearch.org/blog/snapshot-operations/]

The files created once per snapshot, or once per index per snapshot, are largely immune to throttling: there are few of them, and they are uploaded by the active cluster manager using only five snapshot threads. These files include:

  1. index-N (RepositoryData)
  2. index.latest (latest generation of RepositoryData to refer to)
  3. incompatible-snapshots (list of snapshot IDs that are no longer compatible with the current OpenSearch version)
  4. snap-uuid.dat (SnapshotInfo for the snapshot)
  5. meta-uuid.dat (Metadata for the snapshot)
  6. indices/Xy1234-z_x/meta-uuid.dat (IndexMetadata for the index)

Files susceptible to throttling are created on data nodes, generally per primary shard per snapshot:

  1. indices/Xy1234-z_x/0/__VP05oDMVT & more (Files mapped to real segment files)
  2. indices/Xy1234-z_x/0/snap-uuid.dat (BlobStoreIndexShardSnapshot)
  3. indices/Xy1234-z_x/0/index-uuid (BlobStoreIndexShardSnapshots)
  4. Similar folders for different shards

These paths belong to the index with UUID Xy1234-z_x; similar folders exist for every other index.

Issue with current structure

The existing structure leads to throttling in clusters with a high shard count, resulting in longer snapshot creation and deletion times due to retries. In worst-case scenarios, this can lead to partial or failed snapshots.

Requirements

  1. Ensure smooth snapshot operations (create, delete) without throttling for high shard counts.
  2. Implement a scalable prefix strategy to handle increasing concurrency on the remote store as shard count grows.

Proposed solution

Introduce a prefix pattern, accepted by multiple repository providers (e.g., AWS S3, GCP Storage, Azure Blob Storage), that maximizes the spread of data across prefixes. The providers' general recommendation is to spread data across as many prefixes as possible, which allows their request-rate scaling to work better. I propose to apply this prefix strategy to shard-level files; #12567 has already introduced it for remote store shard-level data and metadata files.

Key changes in the proposed structure:

  • The shard-level path now starts with a hash prefix: hash(index-name-or-uuid, shard-id)/<base-path>/indices/<shard-id>/<index-uuid>/
  • With this prefix pattern, each shard gets a unique prefix prepended at the start of its path.
  • If a base path is present, it appears between the hash prefix and "indices" (see the sketch after this list).
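
To make the derivation concrete, here is a minimal, hypothetical Java sketch of how such a prefix could be computed and the shard path assembled. This is an illustration only, not the actual implementation from #12567: the choice of FNV-1a, the 15-bit binary prefix (picked to resemble the prefixes in the example tree below), and all names are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch, NOT the actual OpenSearch implementation: derive a stable
// hash prefix per (index, shard) pair so each shard's files land under a
// distinct top-level prefix in the repository.
public final class ShardPathPrefix {

    // FNV-1a 64-bit constants (one common hash choice; an assumption here)
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    private static long fnv1a64(String input) {
        long hash = FNV_OFFSET_BASIS;
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= FNV_PRIME;
        }
        return hash;
    }

    // Keep the low `bits` bits of the hash as a binary string, loosely
    // mirroring the 15-character prefixes in the example tree below.
    static String hashPrefix(String indexUuid, int shardId, int bits) {
        long h = fnv1a64(indexUuid + "/" + shardId);
        StringBuilder sb = new StringBuilder(bits);
        for (int i = bits - 1; i >= 0; i--) {
            sb.append((h >>> i) & 1L);
        }
        return sb.toString();
    }

    // Assemble the proposed shard-level path; the base path (if any) sits
    // between the hash prefix and "indices".
    static String shardPath(String basePath, String indexUuid, int shardId) {
        String prefix = hashPrefix(indexUuid, shardId, 15);
        String middle = basePath.isEmpty() ? "" : basePath + "/";
        return prefix + "/" + middle + "indices/" + shardId + "/" + indexUuid + "/";
    }

    public static void main(String[] args) {
        // prints something like 101100101101010/indices/1/U-DBzYuhToOfmgahZonF0w/
        System.out.println(shardPath("", "U-DBzYuhToOfmgahZonF0w", 1));
    }
}
```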

High level approach

Store the path type in customData within IndexMetadata, which is already persisted during snapshot creation. Restore and delete operations can then read the path type back from the snapshotted IndexMetadata and resolve the correct shard-level paths for each index.
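
As a minimal sketch of that bookkeeping (the key name, field name, enum values, and helpers below are illustrative assumptions, not the actual OpenSearch constants):

```java
import java.util.Map;

// Minimal sketch of the bookkeeping idea: record which path scheme an index's
// snapshot files were written with, inside the custom data that IndexMetadata
// already carries. Key, field, and enum names are illustrative assumptions.
public final class SnapshotPathTypeCustomData {

    static final String CUSTOM_KEY = "snapshot_path";    // hypothetical key
    static final String PATH_TYPE_FIELD = "path_type";   // hypothetical field

    enum PathType { FIXED, HASHED_PREFIX }

    // Written at snapshot-creation time alongside the index metadata.
    static Map<String, String> pathTypeCustomData(PathType type) {
        return Map.of(PATH_TYPE_FIELD, type.name());
    }

    // Read back at restore/delete time; missing custom data implies the
    // legacy fixed path, which keeps older snapshots readable.
    static PathType resolvePathType(Map<String, Map<String, String>> customData) {
        Map<String, String> entry = customData.get(CUSTOM_KEY);
        if (entry == null || !entry.containsKey(PATH_TYPE_FIELD)) {
            return PathType.FIXED;
        }
        return PathType.valueOf(entry.get(PATH_TYPE_FIELD));
    }
}
```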

Cloud neutral solution

This approach is supported by multiple cloud providers:

  1. GCP Storage - https://cloud.google.com/storage/docs/request-rate#ramp-up
  2. AWS S3 - https://repost.aws/knowledge-center/http-5xx-errors-s3
  3. Azure Blob Storage - https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist

Proposed repository structure

<ROOT>
├── 210101010101101
│   └── indices
│       └── 1
│           └── U-DBzYuhToOfmgahZonF0w
│               ├── __C-Fa1S-NSbqEWJU3h1tdVQ
│               ├── __EeKqbgKCRHuHn-EF5GzlJA
│               ├── __JS_z4nUURmSsyFVsmgmBcA
│               ├── __a2cczAV8Sg2hcqs1uQBryA
│               ├── __d0CF9LqhTN6REWmuLzO96g
│               ├── __gGvHzcveQwiXknAWxor2IQ
│               ├── __jlv0IqxNRGqbmXZNQ3cr4A
│               ├── __oEfbUhM3QvOh7EDg1lzlnQ
│               ├── index-Y99a4XQjRwmVM5s5uIrFRw
│               ├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
│               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
├── 910010111101110
│   └── indices
│       └── 0
│           └── LsKqUfn2ROmjAkW6Rec1lQ
│               ├── __8FggEzgeRXaXUQY3U6Xlmg
│               ├── __F1iiaK-uTcmUD608gQ54Ng
│               ├── __JW0cFZiVTzaAOdzK03bsEw
│               ├── __N19wYbk0TVOqkU8NJw6ptA
│               ├── __dE6opnQ6QLmChNwQ4i3L6w
│               ├── __hylfOCALSPiLwRNfR27wbA
│               ├── __rNFtqVvuRnOomXxeF0hLfg
│               ├── __t8fnoBiATiqR8N9T2Qiejw
│               ├── index-cEnUDNHXTveF6OR779cKSA
│               ├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
│               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
├── M11010111001101
│   └── indices
│       └── 0
│           └── U-DBzYuhToOfmgahZonF0w
│               ├── __EKjl8lZlTCuUADJKlCM1GQ
│               ├── __Gd_-r-qlTve8B-VwxKicWA
│               ├── __f9qc9DWARWu6cw41q0pbOw
│               ├── __g_dEX3n3TISxMmO_SDmjYg
│               ├── __jjBTwJCrQKuT6gPxfNRbag
│               ├── __neVNkcaOSjuNAW_J9XiPag
│               ├── __ubsXbHN5QMi6dX0ragf-TQ
│               ├── __yzFDy62NTuildlyJc0DknA
│               ├── index-5RTtDql2Q9CBrBVcKgfI5Q
│               ├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
│               └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
├── index-3
├── index.latest
├── indices
│   ├── LsKqUfn2ROmjAkW6Rec1lQ
│   │   └── meta-eNNJLJEBNa6hCAn8tcSn.dat
│   └── U-DBzYuhToOfmgahZonF0w
│       └── meta-d9NJLJEBNa6hCAn8tcSn.dat
├── meta-L7HfWiUYSj6as4jSi3yKeg.dat
├── meta-QajL9GX7Tj2Itdbm5GzAqQ.dat
├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
├── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
└── w10110111010101
    └── indices
        └── 1
            └── LsKqUfn2ROmjAkW6Rec1lQ
                ├── __ABzT7uaYQq2_WpF22ewuJg
                ├── __PTGQRYT_QyeAZ7llHCHP_Q
                ├── __ROWGAvOQRWarW9K6c_8BuQ
                ├── __RYbp_hw2Qb2T9rEECYZM5w
                ├── __YFbMl9ZNRMqm9i7ItDkz4Q
                ├── __bGaQqDsPR--08ifzAOA23A
                ├── __dTPVxsjCTfqeUg37B8_zUg
                ├── __z1N6Uz2MR26GoNjLuU_LQA
                ├── index-Mv_2sClsS1Wf0ynx8c697g
                ├── snap-L7HfWiUYSj6as4jSi3yKeg.dat
                └── snap-QajL9GX7Tj2Itdbm5GzAqQ.dat
  • In the above example, the shard-level path is composed of: first the hash (e.g., 210101010101101), second "indices", third the shard ID, and fourth the index UUID (generated per index name by the snapshot), followed by all of the shard's files.
  • In the above example the base path is empty. If a base path were configured, it would appear between the hash and "indices", e.g., 210101010101101/<base-path>/indices/1/U-DBzYuhToOfmgahZonF0w/.

Appendix

Sample current repository structure

<BASE_PATH>
├── index-1
├── index.latest
├── indices
│   ├── OQb-77UPRLmek3IzfRR3Fg
│   │   ├── 0
│   │   │   ├── __3lX2qNtkShSMD2zhbW2hww
│   │   │   ├── __5ILryCGbSd-ojWvIxAK7Mw
│   │   │   ├── __UHpuUxuoTyqrc5kzCmSAHA
│   │   │   ├── __c7Tuu2zRSROfiy7UCq7KOQ
│   │   │   ├── __d1jt099nQSycUnO4mx9XXQ
│   │   │   ├── __oUVAcviJRdu0kWqYQyLrnA
│   │   │   ├── index-GoobOcTzQmqG5LI67Y7dvg
│   │   │   ├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
│   │   │   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
│   │   ├── 1
│   │   │   ├── __DRjMuHHKRa-FluuZl94qLQ
│   │   │   ├── __EDoFw1gqQQastQjG8bat1Q
│   │   │   ├── __GypZv7z0SUu6LNiML4sefQ
│   │   │   ├── __ZP1jsyQnTQOMIO9Vx2n5Eg
│   │   │   ├── __nrm3d_o9Skqw3tAyyPnr7A
│   │   │   ├── __qI-jax0aTaycxgZkzsKmRw
│   │   │   ├── index-WsEwZHu6QZ-_Ub5h8p0AGA
│   │   │   ├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
│   │   │   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
│   │   └── meta-UV9P7pABMtEAgBnTfhV6.dat
│   └── gYpI7EgrQlqGqiKBWnp1bw
│       ├── 0
│       │   ├── __ATpweBS4RV2e7-iWmOItfg
│       │   ├── __Kv_a0dSNQfyzSlkiONETnw
│       │   ├── __Q3EuuawlQQq2Cm2rHZpYRw
│       │   ├── __kHrmxfa5RAuHhjEEeP8czA
│       │   ├── index--0B07yZUQr63TPhdStkDzQ
│       │   ├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
│       │   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
│       ├── 1
│       │   ├── __N2Imh37BSQGwFVmg5GQAsg
│       │   ├── __OVbkh2p-QxGjFfMM-XuQ-g
│       │   ├── __r2989-oaR_S--xOC-1Mlgw
│       │   ├── __r5yJ8QY-TdOK9Uy5cmozHw
│       │   ├── index-oqu0C2_PTXamVfS7BCpH1A
│       │   ├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
│       │   └── snap-I0esTsK9Qiy114LI1xFD4Q.dat
│       └── meta-UF9P7pABMtEAgBnTfhV6.dat
├── meta-4JlPnU7nT5O1dzW7SP318Q.dat
├── meta-I0esTsK9Qiy114LI1xFD4Q.dat
├── snap-4JlPnU7nT5O1dzW7SP318Q.dat
└── snap-I0esTsK9Qiy114LI1xFD4Q.dat
ashking94 added the enhancement and untriaged labels on Aug 7, 2024
github-project-automation bot moved this to Issues and PR's in OpenSearch Roadmap on Aug 7, 2024
ashking94 added the discuss and RFC labels and removed the untriaged label on Aug 7, 2024
@sachinpkale
Member

Thanks @ashking94 for the proposal.

In the above example the base path is empty. If a base path were configured, it would appear between the hash and "indices".

Currently, the entire snapshot contents (data + metadata) are stored within the provided base path. With the new approach, this is no longer true (correct me if I am wrong).

This may change the way users organise snapshots. For example, I might run 5 clusters, each using the repository-s3 plugin for snapshots against the bucket s3://snapshot_bucket with base paths cluster1/2/3/4/5, and clean up the corresponding base path once a particular cluster is no longer needed and has been deleted.

To keep the existing behaviour consistent, does it make sense to introduce the proposed changes as a new type of snapshot?

@ashking94
Member Author

Currently, the entire snapshot contents (data + metadata) are stored within the provided base path. With the new approach, this is no longer true (correct me if I am wrong).

That's right.

This may change the way users organise snapshots. For example, I might run 5 clusters, each using the repository-s3 plugin for snapshots against the bucket s3://snapshot_bucket with base paths cluster1/2/3/4/5, and clean up the corresponding base path once a particular cluster is no longer needed and has been deleted.

We will keep a fixed, well-known location that records all the different paths under which a cluster's data is stored. Also, this mode will be disabled by default on a cluster so as not to break backward compatibility.

To keep the existing behaviour consistent, does it make sense to introduce the proposed changes as a new type of snapshot?

I am still debating how to support the new mode alongside data that has already been uploaded under the fixed path. Apart from that, we would also need the capability to enable the new mode for repository r1 but not for repository r2. Let me cover these in the PRs.

@harishbhakuni
Contributor

harishbhakuni commented Aug 22, 2024

Thanks @ashking94 for the proposal. Along similar lines to what Sachin mentioned: today an admin-level user can grant path-level access to different users within a bucket, and those users can supply those paths as base paths and use them across multiple clusters. It looks like this feature would not work for those use cases; to use it, a user must have root-level access on the bucket.

@ashking94
Member Author

Thanks @ashking94 for the proposal. Along similar lines to what Sachin mentioned: today an admin-level user can grant path-level access to different users within a bucket, and those users can supply those paths as base paths and use them across multiple clusters. It looks like this feature would not work for those use cases; to use it, a user must have root-level access on the bucket.

Thanks for your comment, @harishbhakuni. Access would indeed need to be granted at the bucket level to the cluster. Having more clusters share the bucket would allow the provider's autoscaling to work even better. We definitely need to build some mechanisms to segregate access to cluster-level paths based on the base-path substring within the key path.
