EDU-3271: Adding RTO and RPO page under Temporal Cloud #3247
base: main
Temporal Cells are deployed in three Availability Zones (AZs) in the same region.
Our data provider is deployed with the same topology in three AZs in the same region.\
**All writes to storage are synchronously replicated across AZs**, including our writes to ElasticSearch (ES).
ES is eventually consistent, but this does not impact our RPO (there is no data loss).\
> this does not impact our RPO (there is no data loss).
Might change it to "losing an AZ will not result in data loss or unavailability of Temporal service"
Do we need the backslashes here?
2. The eight-hour RPO/RTO Temporal Cloud reports for _regional_ failures for single-region namespaces
3. The RPO/RTO Temporal Cloud guarantees for _availability zone_ failure.

Which objective is relevant to your organization is driven by whether you map data center loss to a _regional_ loss or a _zonal_ loss.
Readers might not know the difference between regional and zonal.
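
To make the regional versus zonal distinction raised above concrete, the choice of objective could be sketched as a small lookup. This is only an illustration: `FailureScope` and `applicable_objective` are hypothetical names, not part of any Temporal API, and the returned strings simply restate the figures quoted in this PR's hunks.

```python
from enum import Enum


class FailureScope(Enum):
    """Hypothetical label for how much infrastructure is lost."""
    REGION = "region"
    AVAILABILITY_ZONE = "availability zone"


def applicable_objective(scope: FailureScope, multi_region_namespace: bool) -> str:
    """Return the objective wording from the PR text for a given failure scope."""
    # Per the PR, the AZ-failure objective applies to both single-region
    # and multi-region Namespaces.
    if scope is FailureScope.AVAILABILITY_ZONE:
        return "RPO 0"
    # Regional failure: the objective depends on the Namespace type.
    if multi_region_namespace:
        return "near-zero RPO, RTO of 20 minutes or less"
    return "eight-hour RPO/RTO"


print(applicable_objective(FailureScope.REGION, multi_region_namespace=False))
```

An organization that maps "data center loss" to a zonal loss would read the first branch; one that maps it to a regional loss would read one of the other two.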

## Scenario: Multi Region Namespace, Regional Failure

Temporal Cloud offers a "Multi Region Namespace" option in private preview.
- Mentioning the release stage here makes it very hard to find and update when MRN goes out of private preview.
- I think you should consider adding a link to cloud/multi-region
- Recovery Time Objective
---

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for Temporal Cloud can be considered within three scenarios:
The wording feels like it can be simplified for readability. Consider throwing it into a scanner to identify opportunities for clarity.

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for Temporal Cloud can be considered within three scenarios:

1. The near-zero RPO/20 minutes or less RTO for Temporal Cloud with Multi-Region Namespaces
You jump right in here with how RTO/RPO are used but not what they are. I think if you expand the introduction it would be a great place to set the scene for visitors to better understand the acronyms beyond what they stand for.

## Scenario: Single Region Namespace, Regional Failure

Temporal Cloud Namespace data is backed up by our data provider.
Many readers might want to know the policy for this. How often? Is it automatic? Is it triggered by data change?
**Recovery Point Objective (RPO) - 8 hours**

- Our data provider “snapshot” duration which is _4 hours_
- The time window of _4 hours_ allocated to detection of corruption point before we mitigate.
What does mitigation involve here?
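
For readers checking the numbers in this hunk, the eight-hour RPO appears to be the sum of the two four-hour windows listed under it. A minimal sketch of that arithmetic, assuming the two windows simply add in the worst case (the PR text implies this but does not state it), with `worst_case_rpo` and the constant names being hypothetical:

```python
from datetime import timedelta

# Figures taken from the PR text; the names are illustrative only.
SNAPSHOT_INTERVAL = timedelta(hours=4)   # data provider "snapshot" duration
DETECTION_WINDOW = timedelta(hours=4)    # time allocated to detect the corruption point


def worst_case_rpo(snapshot_interval: timedelta, detection_window: timedelta) -> timedelta:
    """Assumed worst case: data written since the last usable snapshot,
    plus data written while the corruption point is being detected."""
    return snapshot_interval + detection_window


rpo = worst_case_rpo(SNAPSHOT_INTERVAL, DETECTION_WINDOW)
print(rpo)  # 8:00:00
```

If either window changes, the stated RPO would need to change with it, which may be worth noting in the page itself.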

## Scenario: Availability Zone Failure

Temporal Cells are deployed in three Availability Zones (AZs) in the same region.
This would be a great point to explain why this is done
This applies for both single region Namespaces and multi region Namespaces.\
This leads to the following objectives for availability zone failure:

### Recovery Point Objective (RPO) \- 0\.
This and the following header could be misinterpreted. If this means no data loss and instant recovery, consider clarifying. Kapa says "the RTO is stated to be zero, meaning there should be no downtime in such scenarios."
What does this PR do?
Notes to reviewers
https://temporalio.atlassian.net/browse/EDU-3271