EDU-3861: Landing page prototype phase 2
- Adds inline definitions
- Adds "under the hood"
- Simplifies and focuses content
- Adds explanatory images
fairlydurable committed Feb 6, 2025
1 parent cf6c2c1 commit 11f280d
Showing 6 changed files with 148 additions and 75 deletions.
@@ -65,7 +65,7 @@ Trigger testing can:

- **Assess recovery time**:
Manual testing helps you measure actual recovery time.
You can check if it meets your expected Recovery Time Objective (RTO) of 20 minutes or less, as stated in the [High availability Namespace SLA](/cloud/high-availability#sla).
You can check if it meets your expected Recovery Time Objective (RTO) of 20 minutes or less, as stated in the [High availability Namespace SLA](/cloud/sla).

- **Identify potential issues**:
Failover testing uncovers problems not visible during normal operation.
151 changes: 78 additions & 73 deletions docs/production-deployment/cloud/high-availability/index.mdx
@@ -21,116 +21,121 @@ keywords:
---

import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead';
import ExpandableDefinition from '@site/src/components/definitions/ExpandableDefinition';

Temporal Cloud's replicated Namespaces provide disaster-tolerant deployment for workloads where availability is critical to your operations.
When you enable high availability, Temporal Cloud automatically synchronizes your data between a primary and a fallback Namespace, keeping them in sync.
Should an incident occur, Temporal will [failover](/glossary#failover) your Namespace.
This allows your Workflow Executions and Schedules to seamlessly shift from the active availability zone to the synchronized replica in the fallback availability zone.
Temporal Cloud offers "Reliability-as-a-Service" to support mission-critical deployment when applications must be highly available.
Data loss and disruptions to Workflows can severely impact business.
Replication, the critical component of highly available Namespaces, protects applications against outages and downtime.
High availability creates a fallback "replica"<ExpandableDefinition label="Definition" definition="A replica is a synchronized copy of your active Namespace used to provide high availability" /> that can take over Namespace duties during service incidents.
This keeps your Workflows running and your data available, even when your normal Namespace is unavailable.

Advantages of using Temporal Cloud’s High Availability features:
## Replication and failovers

Temporal Cloud’s replicated Namespaces provide disaster-tolerant availability for critical workloads.
When you enable replication, Temporal Cloud syncs your data and Workflows between an active and a replica Namespace.
If an incident occurs, Temporal automatically **fails over**<ExpandableDefinition label="Definition" definition="A failover shifts Workflow Execution processing from an active Temporal Namespace to a standby Temporal Namespace replica during outages or other incidents. Temporal Cloud uses replication to duplicate data to the replica Namespace and prevent data loss during failover." /> your Namespace.

During a failover, your Workflow Executions and Schedules seamlessly transition from the active Namespace to a standby domain.
This standby domain is called a **replica**, as it replicates the Workflows and data of the active Namespace.
Once the incident resolves, the Namespaces reconcile and control returns to the original.

<div style={{ display: 'flex', justifyContent: 'center', alignItems: 'center' }}>
<img
src="/img/cloud/high-availability/failover.png"
alt="Data and Workflows replicate to the Replica. Failover transfers control to the Replica that becomes the active Namespace. After resolution, control fails back to the original"
style={{ width: '50%', border: '1px solid #ddd', borderRadius: '4px', margin: '20px' }}
/>
</div>

A high availability Namespace creates a single logical Namespace that operates across two domains: one active and one standby.
Replicated Namespaces combine access for both domains to a unified Namespace endpoint.
As Workflows progress in the active Namespace, Temporal Cloud replicates History events to the standby domain, ensuring continuity and data integrity.

During an incident or outage in the active domain, Temporal Cloud seamlessly fails over to your replica.
Failovers allow existing Workflow Executions to continue running and new Workflow Executions to be started.
Once failover occurs, the replica becomes active.
After the issue is resolved, the active replica "fails back" and the original Namespace resumes being "active".
Temporal resumes replication from the original active Namespace to the replica.
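
The role swap described above can be pictured as a small state model. This is a conceptual sketch only, not Temporal Cloud's implementation: the function names (`createReplicatedNamespace`, `failover`, `failback`) and the domain labels are hypothetical and chosen purely for illustration.

```javascript
// Conceptual model only. Every name here is hypothetical, not a Temporal API.
function createReplicatedNamespace(originalDomain, replicaDomain) {
  return { original: originalDomain, active: originalDomain, standby: replicaDomain };
}

// Failover: the standby replica takes on the active role.
function failover(ns) {
  return { ...ns, active: ns.standby, standby: ns.active };
}

// Failback: once the incident resolves, the original resumes the active role.
function failback(ns) {
  if (ns.active === ns.original) return ns; // already on the original, nothing to do
  return { ...ns, active: ns.original, standby: ns.active };
}

const ns = createReplicatedNamespace('domain-a', 'domain-b');
const duringIncident = failover(ns);
const afterResolution = failback(duringIncident);

console.log(duringIncident.active);  // domain-b
console.log(afterResolution.active); // domain-a
```

The model keeps a single logical Namespace object throughout, mirroring the idea that clients see one unified endpoint while the active role moves between domains.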

<div style={{ display: 'flex', justifyContent: 'center', alignItems: 'center' }}>
<img
src="/img/cloud/high-availability/logical-namespace.png"
alt="Data and Workflows replicate to the Replica. Failover transfers control to the Replica that becomes the active Namespace. After resolution, control fails back to the original"
style={{ width: '70%', border: '1px solid #ddd', borderRadius: '4px', margin: '20px' }}
/>
</div>


<details>

<summary>
Under the hood
</summary>

An **isolation domain** is a physically isolated data center within a deployment region for a given cloud provider.
**Regions** consist of multiple isolation domains.
Isolation domains provide redundancy and fault tolerance.

A **replicated Namespace** consists of an **active Namespace** and a passive, fallback **replica**.
Depending on your setup, your replica may reside in the same region as your active Namespace (standard replication), or it may be located in an entirely different region (multi-region replication).

After a **failover**, the replica takes on the active role until the incident is resolved.
Afterward, the replica **fails back** and the original Namespace resumes the active role.

</details>

**Temporal Cloud’s high availability features:**

- No manual deployment or configuration needed, just simple push-button operations.
- Existing Workflows resume seamlessly in the replica with minimal interruption and data loss.
- No changes needed for Worker and Workflow code during setup or failover.
- 99.99% contractual [SLA](#sla).
- 99.99% contractual [SLA](/cloud/sla).

## High availability options
## Types of high availability

Temporal currently offers the following high availability features, which you configure at a Namespace level:
Temporal currently offers the following high availability features.
Configure these from your Namespace:

- **Replication**:
Workflows are seamlessly replicated to a different isolation domain within the same region as the Namespace, such as "us-east-1".
Workflows are seamlessly replicated to a different isolation domain within the same region as the Namespace, such as "us-east-1".<ExpandableDefinition label="Info" definition="The us-east-1 region was Amazon's original and largest region. Many people consider it the 'default' region." />
Choose this option for applications architected for a single region.
You will fail over within the same region to a separate isolation domain.
Your Namespaces fail over to an isolation domain within the same region.
- **Multi-region replication**:
Workflows are seamlessly replicated to a different region that you choose.
Choose this option when your business requires multi-regional availability and the higher level of resilience that separate locations offer.
You will fail over from one region to a separate region.

:::note

Please note that replication charges apply when enabling high availability features.
For pricing details, visit Temporal Cloud's [Pricing](/cloud/pricing) page.
Replication charges apply when you enable high availability.
For pricing details, visit the Temporal Cloud [Pricing](/cloud/pricing) page.

:::

## Replication and replicas

High Availability features in Temporal Cloud simplify deployment, ensuring operational continuity and data integrity even during unexpected events impacting Namespace operations.
It uses a process called replication.
Replication asynchronously replicates Workflow Executions from an active Namespace to its replica, which is physically located in another isolation domain within the same region or another region in the same continent.
In the event of incidents in the active Namespaces, your replica is ready to take over.
Temporal Cloud smoothly transitions control from the active to the replica via a "failover".

## Isolation domains and replicas

An isolation domain is a physically isolated data center within a deployment region for a given cloud provider.
Regions consist of multiple isolation domains, providing redundancy and fault tolerance.
In some cases, the fallback domain may be in the same region as the primary, or it may be in a different region altogether, depending on your deployment configuration.

High availability simplifies deployment, ensuring operational continuity and data integrity even during unexpected events.
Incidents that affect the data centers within a specific isolation domain may occur.
High availability allows processing to shift from the affected domain to an already-synchronized fallback domain.

This synchronized domain is called a "**replica**."
The process of duplicating all Workflow data ensures that your replica, which serves as the standby Namespace, is always available and ready to take on the active role.
When necessary, Temporal Cloud smoothly transitions control from the active to the standby using a process called "[failover](/glossary#failover)".

## High availability and business continuity {#high-availability-intro}

For many organizations, ensuring high availability is critical to maintaining business continuity.
Temporal Cloud's high availability Namespace feature includes a 99.99% contractual Service Level Agreement ([SLA](https://docs.temporal.io/cloud/sla)).
It provides 99.99% availability and 99.99% guarantee against service errors.

A high availability Namespace creates a single logical Namespace that operates across two physical isolation domains: one active and one standby.
Replicated Namespaces streamline access for both domains to a unified Namespace endpoint.
As Workflows progress in the active Namespace, history events are asynchronously replicated to the standby zone, ensuring continuity and data integrity.

In the event of an incident or outage in the active isolation domain, Temporal Cloud will seamlessly fail over to your standby replica.
Failovers allow existing Workflow Executions to continue running and new Workflow Executions to be started.
Once failover occurs, the roles of the active and standby domains switch.
The standby zone becomes active, and the previous active zone becomes the standby.
After the issue is resolved, the domain "fails back" from the replica to the original.

## Should you choose high availability?

Should you be using high availability Namespaces? It depends on your availability requirements:

- High availability Namespaces offer a 99.99% contractual SLA for workloads with strict high availability needs.
- **High availability Namespaces** offer a 99.99% Service Level Agreement ([SLA](/cloud/sla)) for workloads with strict high availability needs.
They use two Namespaces in two isolation domains to support standby recovery.
In the event of an incident, Temporal Cloud automatically fails over the Namespace to the standby replica.
- Namespaces without high availability include a 99.9% contractual Service Level Agreement ([SLA](/cloud/sla)).
High availability Namespaces' 99.99% availability is enforced by Temporal Cloud's [service error rates SLA](https://docs.temporal.io/cloud/sla).
- **Namespaces without high availability** include a 99.9% contractual Service Level Agreement ([SLA](/cloud/sla)).
In this use, Temporal clients connect to a single Namespace in one deployment domain.
For many applications, this offers sufficient availability.

Temporal Cloud provides 99.99% service availability for all Namespaces, both single-region and high availability.
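
To get a feel for the gap between 99.9% and 99.99%, the percentages can be converted into a downtime budget. The sketch below assumes the figure is read as plain uptime over a 30-day month, which is a simplification: the SLA page defines how availability is actually measured. The `allowedDowntimeMinutes` helper is a hypothetical name.

```javascript
// Assumption: availability is treated as simple uptime over a 30-day month.
const MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60; // 43,200 minutes

// Hypothetical helper: converts an availability percentage into a downtime budget.
function allowedDowntimeMinutes(availabilityPercent, periodMinutes = MINUTES_PER_30_DAY_MONTH) {
  const unavailableFraction = (100 - availabilityPercent) / 100;
  // Round to two decimals to hide floating-point noise.
  return Number((periodMinutes * unavailableFraction).toFixed(2));
}

console.log(allowedDowntimeMinutes(99.9));  // 43.2
console.log(allowedDowntimeMinutes(99.99)); // 4.32
```

Under that reading, the extra "nine" shrinks the monthly budget from roughly 43 minutes to a little over 4 minutes.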

## SLA guarantees {#sla}

High availability Namespaces offer 99.99% availability, enforced by Temporal Cloud's [service error rates SLA](https://docs.temporal.io/cloud/sla).
Our system is designed to limit data loss after recovery when the incident triggering the failover is resolved.

Our recovery point objective ([RPO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Point_Objective)) is near-zero.
There may be a short period of time during an incident or forced failover when some data is unavailable in the replica.
Some Workflow History data won't arrive until network issues are fixed, enabling the History to finish replicating and the divergent History branches to reconcile.

Temporal Cloud proactively responds to incidents by triggering failovers.
Our recovery time objective ([RTO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Time_Objective)) is 20 minutes or less per incident.
- Our recovery point objective (RPO)<ExpandableDefinition label="Definition" definition="RPO is the maximum acceptable duration during which transactional data might be lost from the service." /> is near-zero.
There may be a short period of time during an incident or forced failover when some data is unavailable in the replica.
Some Workflow History data won't arrive until network issues are fixed, enabling the History to finish replicating and the divergent History branches to reconcile.
- Temporal Cloud proactively responds to incidents by triggering failovers.
Our recovery time objective (RTO)<ExpandableDefinition label="Definition" definition="RTO sets a maximum time and service level objective, after which, service must be restored following a disruption." /> is 20 minutes or less per incident.
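
When you run a failover test, one way to compare the measured recovery time against the 20-minute RTO is a small check like the following. It is a minimal sketch that assumes you record the two timestamps yourself as ISO 8601 strings; `recoveryMinutes` and `meetsRto` are hypothetical helpers, not part of any Temporal tooling.

```javascript
const RTO_MINUTES = 20; // from the text: "20 minutes or less per incident"

// Hypothetical helper: elapsed minutes between two ISO 8601 timestamps.
function recoveryMinutes(incidentStartIso, serviceRestoredIso) {
  const elapsedMs = Date.parse(serviceRestoredIso) - Date.parse(incidentStartIso);
  if (Number.isNaN(elapsedMs) || elapsedMs < 0) {
    throw new RangeError('timestamps must be valid and in chronological order');
  }
  return elapsedMs / 60000;
}

// True when the measured recovery time is within the objective.
function meetsRto(incidentStartIso, serviceRestoredIso, rtoMinutes = RTO_MINUTES) {
  return recoveryMinutes(incidentStartIso, serviceRestoredIso) <= rtoMinutes;
}

console.log(meetsRto('2025-02-06T10:00:00Z', '2025-02-06T10:12:30Z')); // true
console.log(meetsRto('2025-02-06T10:00:00Z', '2025-02-06T10:25:00Z')); // false
```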

:::info

During a disaster scenario in which the data on the hard drives in the active Namespace cannot be recovered, the duration of data loss may be as high as the [replication lag](/cloud/high-availability/best-practices#metrics) at the time of disaster.

:::

## Regional availability {#regional-availability}

Multi-region Namespaces are one of the high availability options you can choose.
They are available in all existing [Temporal Cloud regions](/cloud/service-availability#regions).

:::tip

Namespace pairing is currently limited to regions within the same continent.
South America is excluded as only one region is available.

:::
12 changes: 11 additions & 1 deletion docs/production-deployment/cloud/high-availability/work-file.txt
@@ -1,8 +1,18 @@
<div style={{backgroundColor: '#ffff00',padding: '0px 15px',borderRadius: '5px',border: '1px solid #cccccc',display: 'inline-block'}}>**STOPPED HERE. Considering whether this should be its own page**</div>

AFFECTED COVERAGE:
<ExpandableDefinition label="Definition" definition="RTO sets a maximum time and service level objective, after which, service must be restored following a disruption." />

## Regional availability {#regional-availability}

Multi-region Namespaces are one of the high availability options you can choose.
They are available in all existing [Temporal Cloud regions](/cloud/service-availability#regions).

:::tip

Namespace pairing is currently limited to regions within the same continent.
South America is excluded as only one region is available.

:::


:::warning
58 changes: 58 additions & 0 deletions src/components/definitions/ExpandableDefinition.js
@@ -0,0 +1,58 @@
import React, { useState } from 'react';

const ExpandableDefinition = ({ label = "Definition", definition }) => {
  const [isOpen, setIsOpen] = useState(false);

  const toggleDefinition = () => setIsOpen(!isOpen);

  return (
    <span>
      <button
        onClick={toggleDefinition}
        style={{
          background: 'none',
          border: 'none',
          color: '#007acc',
          cursor: 'pointer',
          fontSize: '0.9rem',
          margin: '-5px',
          verticalAlign: 'super',
          transition: 'transform 0.3s ease-in-out',
          transform: isOpen ? 'rotate(45deg)' : 'none',
          // textDecoration: 'underline',
        }}
      >
        ⊕ {/* Circled plus */}
      </button>
      {isOpen && (
        <span
          style={{
            display: 'block',
            marginTop: '5px',
            fontStyle: 'italic',
            textAlign: 'center',
            marginLeft: 'auto',
            marginRight: 'auto',
            maxWidth: '80%',
            // color: '#4F4F4F',
            fontFamily: 'Georgia, serif',
            padding: '10px',
            border: '2px solid #B0B0B0',
            borderRadius: '8px',
            // backgroundColor: '#f9f9f9',
            transition: 'max-height 0.3s ease-out',
            maxHeight: isOpen ? '500px' : '0',
            overflow: 'hidden',
          }}
        >
          <span style={{ fontWeight: 'bold', /*color: '#333'*/ }}>
            {label}: {/* Dynamically change the label */}
          </span>
          {definition}
        </span>
      )}
    </span>
  );
};

export default ExpandableDefinition;
Binary file added static/img/cloud/high-availability/failover.png