
[Feature Proposal] Add Remote Storage Options for Improved Durability #1968

Closed
sachinpkale opened this issue Jan 24, 2022 · 23 comments
Labels
>breaking (Identifies a breaking change), discuss (Issues intended to help drive brainstorming and decision making), feature (New feature or request), Storage:Remote

Comments

@sachinpkale
Member

sachinpkale commented Jan 24, 2022

Feature Proposal : Add Remote Storage Options for Improved Durability

Overview

OpenSearch is the search and analytics suite powering popular use cases such as application search, log analytics, and more. While OpenSearch is part of critical business and application workflows, it is seldom used as a primary data store because it provides no strong guarantees on data durability: a cluster is susceptible to data loss in case of hardware failure. In this document, we propose providing strong durability guarantees in OpenSearch using remote storage.

Challenges

Today, data indexed in OpenSearch clusters is stored on local disk. To achieve durability, that is, to ensure that committed operations are not lost on hardware failure, users either back segments up using snapshots or add more replica copies. However, snapshots do not fully guarantee durability: data indexed since the last snapshot can be lost in case of failure. Adding replicas, on the other hand, is cost-prohibitive. We propose a more robust durability mechanism below.

Proposed Solution

In addition to storing data on a local disk, we propose providing an option to store indexed data in a remote durable data store. These new APIs will enable building integrations with remote durable storage options like Amazon S3, OCI Object Storage, Azure Blob Storage and GCP Cloud Storage. This will give users the flexibility to choose the storage option that best fits their requirements.

Indexed data is of two types: committed and uncommitted. After an indexing operation succeeds, and before OpenSearch invokes a commit, the data is stored in the translog. After a commit, the data becomes part of a segment on disk. To make indexed data durable, both committed and uncommitted data need to be saved durably. We propose storing uncommitted data in a remote translog store and committed data in a remote segment store. This will also pave the way to support PITR (Point In Time Recovery) in the future.

Remote Translog Storage

The translog (transaction log) contains data that has been indexed successfully but not yet committed. Each successful indexing operation adds an entry to the translog, which is a write-ahead transaction log. With remote translog storage, we keep a copy of the translog in a remote store as well. OpenSearch invokes a "flush" operation to perform a Lucene commit; this process starts a new translog. We need to checkpoint the remote store with the committed changes in order to keep its data consistent with the primary (see the related Feature Proposal - Pluggable Translog).
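
To make the flow concrete, here is a minimal sketch. It is illustrative only: the interface names and method signatures are assumptions for this proposal discussion, not the actual OpenSearch API. It shows a remote translog store that is appended to synchronously and then checkpointed and trimmed when a flush commits the data.

```java
import java.io.IOException;
import java.util.List;

/** Durable, append-only store for translog operations on a remote backend (hypothetical). */
interface RemoteTranslogStore {
    /** Durably persists a batch of translog operations; returns only after the remote write is acknowledged. */
    void append(long generation, List<byte[]> operations) throws IOException;

    /** Records the last committed checkpoint so that older generations can be trimmed. */
    void markCommitted(long committedGeneration) throws IOException;

    /** Deletes generations at or below the committed checkpoint. */
    void trim() throws IOException;
}

/** Hook invoked as part of the flush ("Lucene commit") path in this sketch. */
final class FlushHook {
    private final RemoteTranslogStore remote;

    FlushHook(RemoteTranslogStore remote) {
        this.remote = remote;
    }

    /** After a successful Lucene commit, checkpoint the remote translog and trim old generations. */
    void onFlush(long committedGeneration) throws IOException {
        remote.markCommitted(committedGeneration);
        remote.trim();
    }
}
```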

Remote Segment Storage

Segments represent committed indexed data. With remote segment storage, all segments newly created on the primary will be stored in the configured remote store. A new segment is created by an OpenSearch "flush" operation, by a "refresh" operation (which writes segments to the page cache), or by merging existing segments. Segment deletions on the primary will mark the corresponding remote segments for deletion; these marked segments will be kept for a configurable period.
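
As an illustration, the following sketch (hypothetical class and method names, not the eventual implementation) shows segments being mirrored to the remote store after refresh, flush, or merge, and deleted segments being retained remotely for a configurable period before cleanup.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Mirrors newly created segment files to a remote store and defers remote deletes (hypothetical). */
final class RemoteSegmentSync {
    private final Duration retention;                          // how long deleted segments are kept remotely
    private final Map<String, Instant> markedForDeletion = new HashMap<>();

    RemoteSegmentSync(Duration retention) {
        this.retention = retention;
    }

    /** Called after refresh, flush, or merge: upload any segment file not yet present remotely. */
    void onNewSegments(Set<String> localSegmentFiles, Set<String> remoteSegmentFiles, Path shardDir) throws IOException {
        for (String file : localSegmentFiles) {
            if (!remoteSegmentFiles.contains(file)) {
                upload(shardDir.resolve(file));                 // segment files are immutable, so a simple copy suffices
            }
        }
    }

    /** Segments deleted locally are only marked remotely; actual deletion happens after the retention window. */
    void onSegmentsDeleted(Set<String> deletedFiles) {
        Instant now = Instant.now();
        deletedFiles.forEach(f -> markedForDeletion.putIfAbsent(f, now));
    }

    /** Background cleanup: forget (and remotely delete) segments whose retention period has expired. */
    void purgeExpired() {
        Instant cutoff = Instant.now().minus(retention);
        markedForDeletion.entrySet().removeIf(e -> e.getValue().isBefore(cutoff));
    }

    private void upload(Path segmentFile) throws IOException {
        // Placeholder for the actual remote store client call (e.g. an object-store PUT).
    }
}
```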

While remote storage enables durability, there are tradeoffs for both remote translog store and remote segment store.

Considerations

Durability

Durability of indexed data depends on the durability of the configured remote store. The user can choose a remote store based on their durability requirements.

Availability

Data must be consistent between the remote store and the primary. Write availability of indexed data depends on the availability of the data node and the remote translog store: OpenSearch writes will fail if either the data node or the remote translog store is unavailable. Read operations can still be supported. However, in case of a failover, data availability for reads will depend on the availability of the remote store. The user can choose a remote data store based on their availability requirements.

Performance

We need a synchronous call to the remote translog store during each indexing operation to guarantee that all acknowledged operations have been durably persisted. This increases the average response time of the write API; however, there should not be any significant throughput degradation, which we will cover in more detail in the design doc. Because indexed data becomes durable, users who do not need the extra availability provided by replicas can opt out of them. This removes the synchronous network call from primary to replica, which decreases the average response time of the write API. The overhead of the network call to the remote store depends on the performance of the remote store.
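
The write path described above could look roughly like this sketch (illustrative only; the names and structure are assumptions): the indexing thread acknowledges the client only after the synchronous remote translog call succeeds, which is where the extra per-request latency comes from.

```java
import java.io.IOException;

final class WritePath {
    /** Hypothetical synchronous remote translog client. */
    interface RemoteTranslog {
        void appendDurably(byte[] operation) throws IOException;   // one remote round trip per write
    }

    private final RemoteTranslog remoteTranslog;

    WritePath(RemoteTranslog remoteTranslog) {
        this.remoteTranslog = remoteTranslog;
    }

    /** Index locally, then persist to the remote translog before acknowledging the client. */
    boolean index(byte[] operation) {
        indexIntoLucene(operation);                       // local in-memory indexing, as today
        try {
            remoteTranslog.appendDurably(operation);      // synchronous durability call
        } catch (IOException e) {
            return false;                                 // write fails if the remote translog store is unavailable
        }
        return true;                                      // acknowledged: the operation is durably persisted
    }

    private void indexIntoLucene(byte[] operation) { /* omitted */ }
}
```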

We will make one more network call, to the remote segment store, when segments are created. As segment creation happens in the background, there won't be any direct impact on the performance of search and indexing. However, the response times of the dedicated APIs that create segments (flush, refresh) will increase because of the extra network call to the remote store.

Consistency

Data in the remote segment store will be eventually consistent with the data on the local disk. This does not impact data visibility for read operations, as queries will continue to be served from the local disk.

Cost

Any change in cost is determined by two factors: the configured remote stores and the current cluster configuration. In particular, the cost of an OpenSearch cluster depends on whether the number of replica nodes changes and on which remote stores are used for the translog and the segments.

As an illustration, consider a user who has configured replicas only for durability and who uses snapshots for periodic backup. Replacing replicas with remote storage will lower cost. The cost of storing remote segments will remain the same because the store used for snapshots can also be used for storing remote segments. The remote translog adds storage cost, but since durable storage options like Amazon S3 are inexpensive, the overall increase is minimal.

Recovery

In the current OpenSearch implementation of peer recovery, data is copied from the primary to replicas, which consumes the primary's system resources (disk, network). By copying indexed data to replicas from the remote store instead, the scalability of the primary improves, since the data sync work is performed by the replica; however, the overall recovery time may increase.
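
For illustration, a replica recovering from the remote store might follow a flow like the sketch below (hypothetical interface and method names; the real recovery path is more involved).

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

final class ReplicaRecoveryFromRemote {
    /** Hypothetical read-side view of the remote segment store. */
    interface RemoteSegmentStore {
        Set<String> listSegmentFiles() throws IOException;
        void download(String fileName, Path target) throws IOException;
    }

    /** Copy any segment file missing locally from the remote store; the primary is not involved. */
    void recover(RemoteSegmentStore remote, Set<String> localFiles, Path shardDir) throws IOException {
        Set<String> missing = new HashSet<>(remote.listSegmentFiles());
        missing.removeAll(localFiles);
        for (String file : missing) {
            remote.download(file, shardDir.resolve(file));   // each download is a remote read, so recovery may take longer
        }
        // Remaining (uncommitted) operations would then be replayed from the remote translog.
    }
}
```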

Next steps

Based on the feedback from this Feature Proposal, we’ll be drafting a design (link to be added). We look forward to collaborating with the community.

FAQs

1. With remote storage option, will the indexed data be stored to local disk as well?

Yes, we will continue to write data to local disk as well. This is to ensure that search performance is not impacted. Data in the remote store will be used for restore and recovery operations.

2. What would be the durability SLA?

The durability SLA would be the same as that of the configured remote store. If two different remote stores are used for the translog and the segments, the durability of OpenSearch is the minimum of the two. For example, if the translog store offers eleven nines of durability and the segment store offers four nines, the effective durability is four nines.

3. Is it possible to use one remote data store for translog and segments?

Yes, it is possible. The translog is an append-only log, whereas a segment is an immutable file. Write throughput to the translog is the same as the indexing throughput, whereas segment creation is a background process and is not directly affected by the indexing throughput. Due to these differences, we will have different sets of APIs for the translog store and the segment store. If one data store can be integrated with both sets of APIs, it can be used as both the remote translog store and the remote segment store, as sketched below.
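
A minimal sketch of that idea (the interfaces below are illustrative assumptions, not the proposed API): two separate API sets with different write characteristics, and a single object-store-backed client implementing both.

```java
import java.io.IOException;
import java.nio.file.Path;

/** High-frequency, append-only writes (tracks indexing throughput). */
interface RemoteTranslogApi {
    void append(byte[] operation) throws IOException;
}

/** Infrequent, large, immutable uploads (background segment creation). */
interface RemoteSegmentApi {
    void uploadSegment(Path segmentFile) throws IOException;
}

/** A single object-store-backed client that satisfies both API sets. */
final class ObjectStoreBackedStore implements RemoteTranslogApi, RemoteSegmentApi {
    @Override
    public void append(byte[] operation) throws IOException {
        // e.g. PUT to a translog prefix in the object store
    }

    @Override
    public void uploadSegment(Path segmentFile) throws IOException {
        // e.g. (multipart) PUT of the immutable segment file to a segments prefix
    }
}
```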

4. Do I still need to use snapshots for backup and restore?

The snapshot and remote storage architectures will be unified. If you configure the remote store option, taking snapshots would effectively be a no-op. If the durability provided by snapshots meets your requirements, you can disable remote storage and continue using snapshots.

5. Can the remote data be used across different OpenSearch versions?

Version compatibility would be similar to what OpenSearch supports with snapshots.

Open Questions

  1. Will it be possible to enable remote storage for only a subset of indexes?
  2. Will it be possible to change the remote store, or disable the remote storage option, on a live cluster?
@sachinpkale sachinpkale added enhancement Enhancement or improvement to existing feature or request untriaged labels Jan 24, 2022
@sachinpkale sachinpkale changed the title Feature Proposal : OpenSearch Durability using Remote Storage [Feature Proposal] OpenSearch Durability using Remote Storage Jan 24, 2022
@nknize
Collaborator

nknize commented Jan 24, 2022

Can we change the title to Add Remote Storage Options for Improved Persistence? Remote storage isn't the silver bullet for durability; it's one mechanism. It's also not just about the durability story; it's just as applicable to availability, so we should be careful not to imply that remote storage support is being added solely to solve all durability issues.

Let's also change "tradeoffs" to "considerations". This isn't a design document so we need to make sure we aren't capturing the technical tradeoffs here without communicating the technical design and concrete implementation (which will illuminate the tradeoffs upon further prototyping and concrete benchmarking).

@nknize nknize added discuss Issues intended to help drive brainstorming and decision making feature New feature or request >breaking Identifies a breaking change. distributed framework and removed enhancement Enhancement or improvement to existing feature or request labels Jan 24, 2022
@Bukhtawar
Collaborator

It's also not just about the durability story; it's just as applicable to availability, so we should be careful not to imply that remote storage support is being added solely to solve all durability issues.

While it does improve the mean time to recovery when there are no backing in-sync replicas, it isn't directly improving availability, since (for now) we still need sufficient replica copies on disk for high availability.

@TheAlgo
Member

TheAlgo commented Jan 25, 2022

These new APIs will enable building integrations with remote durable storage options like Amazon S3, OCI Object Storage, Azure Blob Storage and GCP Cloud Storage. This will give users the flexibility to choose the storage option that best fits their requirements.

Will we be able to choose the cloud provider on a per-index basis?

@sachinpkale
Member Author

Will we be able to choose the cloud provider on a per-index basis?

Not the cloud provider, but the remote store. The proposal is to integrate with a remote store for the indexed data. There are no assumptions about how the remote store is hosted; for example, if the remote store is HDFS, it could be either self-hosted or a managed service.

@sachinpkale sachinpkale changed the title [Feature Proposal] OpenSearch Durability using Remote Storage [Feature Proposal] Add Remote Storage Options for Improved Durability Jan 25, 2022
@jayeshathila
Contributor

jayeshathila commented Jan 25, 2022

Each successful indexing operation adds an entry to the translog, which is a write-ahead transaction log.

In transactional DBs, the first entry goes to the WAL (write-ahead log), which guarantees that the data won't be lost once it is written to the WAL. But here, we are first indexing and then writing to the WAL, so there is still a chance of data loss between indexing and the WAL append.

@jayeshathila
Contributor

We propose storing uncommitted data in a remote translog store and committed data in a remote segment store.

Won't storing data in remote storage (the fastest being some streaming service) itself be a major latency bottleneck?

@Bukhtawar
Collaborator

Bukhtawar commented Jan 25, 2022

In transactional DBs, the first entry goes to the WAL (write-ahead log), which guarantees that the data won't be lost once it is written to the WAL. But here, we are first indexing and then writing to the WAL, so there is still a chance of data loss between indexing and the WAL append.

In OpenSearch, all operations are written to the translog after being processed by the internal Lucene index, so that any unexpected document failure is caught upfront. The guarantee with the remote translog store is that no acknowledged write would ever be lost.

Won't storing data in remote storage (the fastest being some streaming service) itself be a major latency bottleneck?

No, this probably is not a major bottleneck, since at any point only one indexing thread would be writing the translog to the remote store, while other threads would still be processing indexing requests. Per-request latencies might increase slightly, but overall throughput shouldn't see a major impact.
Moreover, this would relax the need for replica copies that exist solely for durability while indexing (replication pushes overall ingestion latencies up by as much as 2x). Users might even want to turn replica copies down while doing bulk indexing and turn them back on as needed for search availability.
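
A minimal sketch of that single-uploader pattern (illustrative only, not code from this proposal): indexing threads enqueue operations and are acknowledged once the dedicated uploader thread has written their batch to the remote store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

final class SingleUploaderTranslog {
    private static final class Pending {
        final byte[] op;
        final CompletableFuture<Void> ack = new CompletableFuture<>();
        Pending(byte[] op) { this.op = op; }
    }

    private final BlockingQueue<Pending> queue = new LinkedBlockingQueue<>();

    SingleUploaderTranslog() {
        Thread uploader = new Thread(this::uploadLoop, "remote-translog-uploader");
        uploader.setDaemon(true);
        uploader.start();
    }

    /** Called by indexing threads: returns a future completed once the operation is durable remotely. */
    CompletableFuture<Void> append(byte[] operation) {
        Pending p = new Pending(operation);
        queue.add(p);
        return p.ack;
    }

    private void uploadLoop() {
        List<Pending> batch = new ArrayList<>();
        while (true) {
            try {
                batch.add(queue.take());            // block for at least one operation
                queue.drainTo(batch);               // then batch whatever else is queued
                uploadToRemoteStore(batch);         // one remote call per batch, not per operation
                batch.forEach(p -> p.ack.complete(null));
            } catch (Exception e) {
                batch.forEach(p -> p.ack.completeExceptionally(e));
            } finally {
                batch.clear();
            }
        }
    }

    private void uploadToRemoteStore(List<Pending> batch) {
        // Placeholder for the actual remote write (e.g. an object-store PUT or a stream append).
    }
}
```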

I recommend you read through #1319 for more details.

@nknize
Collaborator

nknize commented Jan 25, 2022

it isn't directly improving availability

It improves the availability story as new replicas can be spun up w/ an rsync of segments on remote storage. Besides, remote storage is used as a mechanism in the segment replication implementation, which is not about durability, so we need to loosely couple the durability story from this issue.

@sachinpkale
Member Author

It improves the availability story as new replicas can be spun up w/ an rsync of segments on remote storage.

Agreed, but we should not see remote storage as the option to increase availability. Even though overall availability may increase by using a remote store, we can't guarantee availability, as it depends on the availability of the configured remote store as well as the cluster configuration.
On the other hand, we can guarantee durability using a remote store, as it directly depends on the durability of the remote store.

Besides, remote storage is used as a mechanism in the segment replication implementation, which is not about durability, so we need to loosely couple the durability story from this issue.

There are two use cases of remote storage.

  1. As you explained, segment replication using remote storage is one use case, where the primary node need not get involved in syncing to replicas.
  2. The second use case is storing indexed data in a remote store in order to minimize the possibility of data loss.

The proposal in this doc revolves around the second use case. It is independent of the replication strategy used (even with two different use cases, the implementation details can be the same).

If we start storing committed and uncommitted data to the remote store, wouldn't that improve durability? What else is required to make durability guarantees?

@dblock
Member

dblock commented Feb 2, 2022

This is a good proposal 👍 I understand the limitations of the existing system (translog, then Lucene), but if we're talking potential redesign I would at least try to step back and think bigger.

The three features I think I'd really enjoy are 1) use S3 as my only store, 2) the ability to have multiple nodes use the same store, and 3) the ability to specify durability requirements per write.

Longer version:

I like that we are thinking about durable storage replacing the need for replicas. In that spirit, I'd want a consistent way of configuring durability intent both in global configuration (I want all writes to be committed to the majority) and per write, a la MongoDB (e.g. I want write operation A to be a transaction and return only after it was committed, while another write returns immediately, is written optimistically, and accepts eventual loss), regardless of what storage I use. Ideally I'd want the current local store and replica implementation to become just the default storage with certain promises and capabilities, thus "replica" is not "another feature", but a "feature of (durable) storage".

If you do the above, then data does not need to be written to local disk, but local disk may be the default storage, or an optimization of certain durable storage (given that S3 is able to sustain hundreds of MB/s read throughput I wouldn't be so sure it's always an optimization). Ideally, I'd like to be able to run nodes without disk, and to be able to run multiple nodes on top of the same remote durable storage.

All that said, I wouldn't want to feature creep this proposal, and wouldn't hold anything back as designed because it's a step forward in the right direction.

@asafm

asafm commented Feb 6, 2022

I like the idea of this proposal and how it might lead us, as @dblock mentioned, to be able to query data directly from S3 (with caching on local disk if needed).
My biggest concern here is ingest performance when using a remote translog, say S3. At the end of the day, as you said, you have one thread taking buffered writes and writing them to S3 instead of the local SSD. If those writes, which happen sequentially, take more time per write (and I'm a bit more pessimistic about the word "slightly" that has been used), then ingestion throughput will take a hit, since you only ack once writes have been persisted. Yes, you can increase the buffer size to hold those writes until it's their turn to be written, but if writes keep arriving at the same rate as before and the queue isn't drained at the same rate, it will only grow. The question remains to be POCed: by what percentage will the rate decrease?

On the other hand, when only committed data is uploaded to remote storage, i.e. relying on the durability of replica copies and their respective translogs, it's a different story altogether: here you're uploading much bigger, unmodified data, which may allow parallel (chunked) uploads of the same file to S3 or a compatible remote store, and that may reach the same throughput as before, or very close to it. The same caveat applies here, I guess: if you can't keep up with the upload rate, the backlog will keep growing until you're lagging too far behind, so the rate will be lower; the question is, by how much.

Final question: are you aware of any similar systems using remote storage (i.e. S3) for the WAL?

@Bukhtawar
Collaborator

Bukhtawar commented Feb 7, 2022

@asafm Yes, there are trade-offs, and we are working on clearly documenting them. For example, with a remote store you don't need to replicate operations to a replica for segment-based replication, or for document-based replication where high search availability isn't needed (you can achieve high throughput by replicating only to the remote store).

With a remote store, we need to ensure it meets the required availability, consistency, and performance characteristics. That said, today's shard-local copies can lag as well (network-backed disk IO, etc.), and hence we have a backpressure mechanism built in.

We will optimize cases as needed (batching, streaming, etc.) once we have a baseline performance. "Premature optimization is the root of all evil" :)

S3 is not the only option for the WAL; it could, for instance, support streaming services like Kafka.

@asafm

asafm commented Feb 7, 2022

I meant other databases that use remote storage for the WAL.

@sachinpkale
Member Author

Hi @llermaly, we keep only one copy of the data in the remote store, so from this feature's point of view there is no difference between using 0 or 1 replicas.

Setting replicas to 0 will still do remote storage ?

Yes

setting more than 1 will create more copies in the remote storage?

No

My guess was setting 1 is enough and the object storage will take care of the redundancy. Is that correct?

You need to set replicas based on your availability requirements. Durability will be taken care of by this feature.

@anasalkouz
Member

@rramachand21 @sachinpkale are you still tracking this for 2.8?

@Bukhtawar
Collaborator

We moved this to 2.9.0

andrross added a commit to andrross/OpenSearch that referenced this issue Jun 12, 2023
I have nominated and maintainers have agreed to invite Sachin Kale
(@sachinpkale) to be a co-maintainer. Sachin has kindly accepted.

Sachin has led the design and implementation of the remote backed
storage feature in OpenSearch. This feature was introduced as
experimental in OpenSearch 2.3 and is planned for general availability
in 2.9. Some significant issues and PRs authored by Sachin for this
effort are as follows:

Feature proposal: opensearch-project#1968
Upload segments to remote store post refresh: opensearch-project#3460
Add rest endpoint for remote store restore: opensearch-project#3576
Add RemoteSegmentStoreDirectory to interact with remote segment store: opensearch-project#4020

In total, Sachin has authored 57 PRs going back to May 2022. He also
frequently reviews contributions from others and has reviewed nearly 100
PRs in the same time frame.

Signed-off-by: Andrew Ross <andrross@amazon.com>
@DarshitChanpura
Member

@anasalkouz Should this be moved to 2.11?

@anasalkouz
Member

This is tracking green for 2.10.
@sachinpkale, is there anything pending on this proposal? Can you close it?

@anasalkouz
Member

Closing this, since it has already been released in 2.10.

@github-project-automation github-project-automation bot moved this to 2.10.0 (Launched) in OpenSearch Project Roadmap Aug 30, 2024