From e8eca50988bd28ab3dca0885ea0f033ecb5708c0 Mon Sep 17 00:00:00 2001 From: mich-elle-luna Date: Fri, 23 May 2025 13:53:20 -0700 Subject: [PATCH 01/22] DOC-832: Improve readability and clarity of Active-Active failover documentation - Restructured content with clear overview and prerequisites - Added comprehensive code examples with proper labeling - Improved section organization and flow - Enhanced explanations of failure detection strategies - Added troubleshooting section for common issues - Included practical implementation guidance - Better formatting with tips, warnings, and notes --- .../develop/app-failover-active-active.md | 402 +++++++++++++++--- 1 file changed, 347 insertions(+), 55 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 17fddbf6c2..08c7100f64 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -6,92 +6,384 @@ categories: - operate - rs - rc -description: How to failover your application to connect to a remote replica. +description: Learn how to implement application failover and failback with Active-Active databases to maintain high availability. linkTitle: App failover weight: 99 --- -Active-Active Redis deployments don't have a built-in failover or failback mechanism for application connections. -An application deployed with an Active-Active database connects to a replica of the database that is geographically nearby. -If that replica is not available, the application can failover to a remote replica, and failback again if necessary. -In this article we explain how this process works. -Active-Active connection failover can improve data availability, but can negatively impact data consistency. -Active-Active replication, like Redis replication, is asynchronous. 
-An application that fails over to another replica can miss write operations. -If the failed replica saved the write operations in persistent storage, -then the write operations are processed when the failed replica recovers. +Active-Active databases provide high availability by maintaining synchronized replicas across multiple geographic locations. However, when a local replica becomes unavailable, your application needs a strategy to failover to a remote replica and failback when the local replica recovers. -## Detecting Failure +This guide explains how to implement robust failover and failback mechanisms for applications using Active-Active Redis databases. -Your application can detect two types of failure: +## Overview -1. **Local failures** - The local replica is down or otherwise unavailable -1. **Replication failures** - The local replica is available but fails to replicate to or from remote replicas +Active-Active databases don't include built-in application failover mechanisms. Instead, your application must: -### Local Failures +1. **Monitor** local and remote replicas for availability +2. **Detect** failures quickly and accurately +3. **Failover** to a healthy remote replica when needed +4. **Failback** to the local replica when it recovers -Local failure is detected when the application is unable to connect to the database endpoint for any reason. Reasons for a local failure can include: multiple node failures, configuration errors, connection refused, connection timed out, unexpected protocol level errors. +{{< warning >}} +**Data consistency considerations**: Active-Active replication is asynchronous. Applications that failover to another replica may miss recent write operations, which can impact data consistency. +{{< /warning >}} -### Replication Failures +## Prerequisites -Replication failures are more difficult to detect reliably without causing false positives. 
Replication failures can include: network split, replication configuration issues, remote replica failures. +Before implementing failover logic, ensure you understand: -The most reliable method for health-checking replication is by using the Redis publish/subscribe (pub/sub) mechanism. +- [Active-Active database concepts]({{< relref "/operate/rs/databases/active-active" >}}) +- Your application's data consistency requirements +- Network topology between replicas +- Redis [pub/sub mechanism]({{< relref "/develop/interact/pubsub" >}}) + +## Failure detection strategies + +Your application should monitor for two types of failures: + +### Local replica failures + +**What it is**: The local replica is completely unavailable to your application. + +**Common causes**: +- Multiple node failures +- Network connectivity issues +- Configuration errors +- Database endpoint unavailable + +**Detection method**: Monitor connection attempts to the database endpoint. If connections consistently fail (timeout, refused, protocol errors), consider the local replica failed. + +### Replication failures + +**What it is**: The local replica is available but can't communicate with remote replicas. + +**Common causes**: +- Network partitions between data centers +- Replication configuration issues +- Remote replica failures +- Firewall or security group changes + +**Detection method**: Use Redis pub/sub to monitor replication health across all replicas. + +## Implementing pub/sub health monitoring + +The most reliable way to detect replication failures is using Redis pub/sub: + +### How it works + +1. **Subscribe** to a dedicated health-check channel on each replica +2. **Publish** periodic heartbeat messages with unique identifiers +3. **Monitor** that your own messages are received within a time window +4. **Detect failure** when messages aren't received from specific replicas + +### Implementation steps + +1. 
**Connect to all replicas**:
   ```python
   # Example implementation - adapt for your environment
   replicas = {
       'local': redis.Redis(host='local-replica.example.com'),
       'remote1': redis.Redis(host='remote1-replica.example.com'),
       'remote2': redis.Redis(host='remote2-replica.example.com')
   }
   ```

2. **Subscribe to health channels**:
   ```python
   # Example implementation - adapt for your environment
   # redis-py exposes subscriptions through a PubSub object rather
   # than on the client itself; keep the handles to read messages
   subscriptions = {}
   for name, client in replicas.items():
       subscriptions[name] = client.pubsub()
       subscriptions[name].subscribe(f'health-check-{name}')
   ```

3. **Publish heartbeat messages**:
   ```python
   # Example implementation - adapt for your environment
   import json
   import time
   import uuid

   def send_heartbeat():
       message = {
           'timestamp': time.time(),
           'id': str(uuid.uuid4()),
           'sender': 'app-instance-1'
       }

       for name, client in replicas.items():
           client.publish(f'health-check-{name}', json.dumps(message))
   ```

4. **Monitor message delivery**:
   ```python
   # Example implementation - adapt for your environment
   def check_replication_health():
       # Check if messages sent in last 30 seconds were received
       cutoff_time = time.time() - 30

       for replica_name in replicas:
           if not received_own_message_since(replica_name, cutoff_time):
               mark_replica_unhealthy(replica_name)
   ```

{{< tip >}}
**Why pub/sub works**: Pub/sub messages are delivered as replicated effects, making them a reliable indicator of active replication links. Unlike dataset changes, pub/sub doesn't make assumptions about your data structure.
{{< /tip >}}

## Handling sharded databases

If your Active-Active database uses sharding, you need to monitor each shard individually:

### Symmetric sharding (recommended)

With symmetric sharding, all replicas have the same number of shards and hash slots.

**Monitoring approach**:
1. Use the Cluster API to get the sharding configuration
2. Create one pub/sub channel per shard
3. Ensure each channel name maps to a different shard

```python
# Example implementation - adapt for your environment
def get_channels_per_shard(redis_client, num_shards):
    """Generate channel names that together cover every shard.

    A channel name does not automatically hash to a chosen shard, so
    candidate names are checked with CLUSTER KEYSLOT until every
    shard's hash slot range is covered. shard_for_slot() is a
    placeholder for a helper that maps a hash slot to a shard using
    the sharding configuration retrieved from the Cluster API.
    """
    channels = {}
    candidate = 0

    while len(channels) < num_shards:
        name = f"health-shard-{candidate}"
        slot = redis_client.execute_command('CLUSTER KEYSLOT', name)
        shard = shard_for_slot(slot)
        channels.setdefault(shard, name)
        candidate += 1

    return list(channels.values())
```

### Asymmetric sharding (not recommended)

Asymmetric configurations require monitoring every hash slot intersection, which is complex and error-prone.

## Implementing failover

When you detect a local replica failure:

### 1. Stop writing to the failed replica

Immediately redirect all database operations to a healthy remote replica.

```python
# Example implementation - adapt for your environment
def failover_to_replica(target_replica_name):
    """Switch application connections to target replica"""
    global active_redis_client

    # Update active connection
    active_redis_client = replicas[target_replica_name]

    # Log the failover event
    logger.warning(f"Failed over to replica: {target_replica_name}")

    # Update application configuration
    update_app_config('active_replica', target_replica_name)
```

### 2. Handle data consistency

**Important considerations**:
- The remote replica may not have your most recent writes
- Recent writes might be lost permanently or temporarily unavailable
- Avoid reading data you just wrote before the failover

**Best practices**:
- Design your application to handle eventual consistency
- Use timestamps or version numbers to detect stale data
- Implement retry logic for critical operations

### 3. Update monitoring

Continue monitoring all replicas, including the failed one, to detect when it recovers.
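A minimal sketch of this recovery watch follows. It is illustrative only: `watch_failed_replica` is a hypothetical helper, and it assumes a replica client like the ones created in the earlier examples. A successful `PING` only shows the endpoint is reachable again; the failback checks still have to confirm replication health before you send traffic back.

```python
# Example implementation - adapt for your environment
import time

def watch_failed_replica(client, interval=10, max_attempts=None):
    """Poll a failed replica until it answers PING again.

    client needs only a ping() method, so any of the replica
    connections created earlier can be passed in. Returns True once
    the endpoint responds, or False if max_attempts is exhausted.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        try:
            if client.ping():
                return True  # endpoint is reachable again
        except Exception:
            pass  # still down; keep polling
        attempts += 1
        time.sleep(interval)
    return False
```

Run this in a background thread per failed replica so the main application keeps serving requests from the healthy replica while waiting for recovery.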
+ +## Implementing failback + +Monitor the failed replica to determine when it's safe to failback: + +### Failback criteria + +A replica is ready for failback when it's: +1. **Available**: Accepting connections and responding to commands +2. **Synchronized**: Caught up with changes from other replicas +3. **Not stale**: Actively participating in replication + +### Failback process + +1. **Verify replica health**: + ```python + # Example implementation - adapt for your environment + def is_replica_ready_for_failback(replica_name): + client = replicas[replica_name] + + try: + # Test basic connectivity + client.ping() + + # Test pub/sub replication + if not test_pubsub_replication(client): + return False + + # Verify not in stale mode + if is_replica_stale(client): + return False + + return True + except Exception: + return False + ``` + +2. **Gradual failback** (recommended): + ```python + # Example implementation - adapt for your environment + def gradual_failback(primary_replica): + # Start with read operations + redirect_reads_to(primary_replica) + + # Monitor for issues + time.sleep(30) + + # If stable, redirect writes + if is_replica_stable(primary_replica): + redirect_writes_to(primary_replica) + ``` + +{{< warning >}} +**Avoid dataset-based monitoring**: Don't rely solely on reading/writing test keys to determine replica health. Replicas can appear healthy while still in stale mode or missing recent updates. 
+{{< /warning >}} + +## Configuration best practices + +### Application-side failover only + +- **Do**: Implement all failover logic in your application +- **Don't**: Modify Active-Active database configuration during failover + +### When to remove a replica + +Only remove a replica from the Active-Active configuration when: +- Memory consumption becomes critically high +- Garbage collection cannot keep up with the replication backlog +- The replica cannot be recovered + +{{< warning >}} +**Data loss risk**: Removing a replica from the configuration permanently loses any unconverged writes. The replica must rejoin as a new member, losing its data. +{{< /warning >}} + +## Example implementation + +Here's a simplified example of a failover-capable Redis client: {{< note >}} -Note that this document does not suggest that Redis pub/sub is reliable in the common sense. Messages can get lost in certain conditions, but that is acceptable in this case because typically the application determines that replication is down only after not being able to deliver a number of messages over a period of time. +**Example code**: The following is an illustrative example to demonstrate concepts. Adapt this code for your specific environment, error handling requirements, and production needs. 
{{< /note >}} -When you use the pub/sub data type to detect failures, the application: +```python +import redis +import json +import time +import threading +from typing import Dict, Optional + +class FailoverRedisClient: + def __init__(self, replica_configs: Dict[str, dict]): + self.replicas = {} + self.active_replica = None + self.replica_health = {} + + # Initialize connections + for name, config in replica_configs.items(): + self.replicas[name] = redis.Redis(**config) + self.replica_health[name] = True + + # Set initial active replica (prefer 'local') + self.active_replica = 'local' if 'local' in self.replicas else list(self.replicas.keys())[0] + + # Start health monitoring + self.start_health_monitoring() + + def execute_command(self, command: str, *args, **kwargs): + """Execute Redis command with automatic failover""" + max_retries = len(self.replicas) + + for attempt in range(max_retries): + try: + client = self.replicas[self.active_replica] + return getattr(client, command)(*args, **kwargs) + except Exception as e: + self.handle_connection_error(e) + if attempt < max_retries - 1: + self.failover_to_next_healthy_replica() + else: + raise + + def failover_to_next_healthy_replica(self): + """Switch to the next healthy replica""" + for name, is_healthy in self.replica_health.items(): + if name != self.active_replica and is_healthy: + self.active_replica = name + print(f"Failed over to replica: {name}") + return -1. Connects to all replicas and subscribes to a dedicated channel for each replica. -1. Connects to all replicas and periodically publishes a uniquely identifiable message. -1. Monitors received messages and ensures that it is able to receive its own messages within a predetermined window of time. 
+ raise Exception("No healthy replicas available") -You can also use known dataset changes to monitor the reliability of the replication stream, -but pub/sub is preferred method because: + def start_health_monitoring(self): + """Start background health monitoring""" + def monitor(): + while True: + self.check_replica_health() + time.sleep(10) -1. It does not involve dataset changes. -1. It does not make any assumptions about the dataset. -1. Pub/sub messages are delivered as replicated effects and are a more reliable indicator of a live replication link. In certain cases, dataset keys may appear to be modified even if the replication link fails. This happens because keys may receive updates through full-state replication (re-sync) or through online replication of effects. + thread = threading.Thread(target=monitor, daemon=True) + thread.start() -## Impact of sharding on failure detection + def check_replica_health(self): + """Check health of all replicas using pub/sub""" + # Implementation details for pub/sub health checking + # (See previous examples for complete implementation) + pass +``` -If your sharding configuration is symmetric, make sure to use at least one key (PUB/SUB channels or real dataset key) per shard. Shards are replicated individually and are vulnerable to failure. Symmetric sharding configurations have the same number of shards and hash slots for all replicas. -We do not recommend an asymmetric sharding configuration, which requires at least one key per hash slot that intersects with a pair of shards. +## Next steps -To make sure that there is at least one key per shard, the application should: +- [Configure Active-Active databases]({{< relref "/operate/rs/databases/active-active/create" >}}) +- [Monitor Active-Active replication]({{< relref "/operate/rs/databases/active-active/monitor" >}}) +- [Develop applications with Active-Active databases]({{< relref "/operate/rs/databases/active-active/develop" >}}) -1. 
Use the Cluster API to retrieve the database sharding configuration. -1. Compute a number of key names, such that there is one key per shard. -1. Use those key names as channel names for the pub/sub mechanism. +## Troubleshooting common issues -### Failing over +### False positive failure detection -When the application needs to failover to another replica, it should simply re-establish its connections with the endpoint on the remote replica. Because Active/Active and Redis replication are asynchronous, the remote endpoint may not have all of the locally performed and acknowledged writes. +**Problem**: Application detects failures when replicas are actually healthy. -It's best if your application doesn't read its own recent writes. Those writes can be either: +**Solutions**: +- Increase heartbeat timeout windows +- Use multiple consecutive failures before triggering failover +- Monitor network latency between replicas -1. Lost forever, if the local replica has an event such as a double failure or loss of persistent files. -1. Temporarily unavailable, but will be available at a later time if the local replica's failure is temporary. +### Split-brain scenarios - +**Problem**: Network partition causes multiple replicas to appear as "primary" to different application instances. -## Failback decision +**Solutions**: +- Implement consensus mechanisms in your application +- Use external coordination services (like Consul or etcd) +- Design for eventual consistency -Your application can use the same checks described above to continue monitoring the state of the failed replica after failover. +### Slow failback -To monitor the state of a replica during the failback process, you must make sure the replica is available, re-synced with the remote replicas, and not in stale mode. The PUB/SUB mechanism is an effective way to monitor this. +**Problem**: Replica appears healthy but failback causes performance issues. 
-Dataset-based mechanisms are potentially less reliable for several reasons: -1. In order to determine that a local replica is not stale, it is not enough to simply read keys from it. You must also attempt to write to it. -1. As stated above, remote writes for some keys appear in the local replica before the replication link is back up and while the replica is still in stale mode. -1. A replica that was never written to never becomes stale, so on startup it is immediately ready but serves stale data for a longer period of time. +**Solutions**: +- Implement gradual failback (reads first, then writes) +- Monitor replica performance metrics during failback +- Use canary deployments for failback testing -## Replica Configuration Changes +## Related topics -All failover and failback operations should be done strictly on the application side, and should not involve changes to the Active-Active configuration. -The only valid case for re-configuring the Active-Active deployment and removing a replica is when memory consumption becomes too high because garbage collection cannot be performed. -Once a replica is removed, it can only be re-joined as a new replica and it loses any writes that were not converged. 
+- [Redis pub/sub]({{< relref "/develop/interact/pubsub" >}}) +- [Redis Cluster API]({{< relref "/operate/rs/clusters/cluster-api" >}}) +- [High availability best practices]({{< relref "/operate/rs/databases/durability-ha" >}}) From aa52a6079e784af1b2df2f30aff8f11ccef3c2a7 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 09:56:34 -0700 Subject: [PATCH 02/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 08c7100f64..5fd3a6447d 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -25,7 +25,7 @@ Active-Active databases don't include built-in application failover mechanisms. 4. **Failback** to the local replica when it recovers {{< warning >}} -**Data consistency considerations**: Active-Active replication is asynchronous. Applications that failover to another replica may miss recent write operations, which can impact data consistency. +**Data consistency considerations**: Active-Active replication is asynchronous. Applications that failover to another replica may miss recent write operations, which can impact data consistency. If the failed replica saved the write operations in persistent storage, then the write operations are processed when the failed replica recovers. 
{{< /warning >}} ## Prerequisites From 6c34951feb045d29fb4b5bb0ece733b8868bf773 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 09:56:41 -0700 Subject: [PATCH 03/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 5fd3a6447d..4578cd136c 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -34,7 +34,7 @@ Before implementing failover logic, ensure you understand: - [Active-Active database concepts]({{< relref "/operate/rs/databases/active-active" >}}) - Your application's data consistency requirements -- Network topology between replicas +- [Network topology]({{}}) between replicas - Redis [pub/sub mechanism]({{< relref "/develop/interact/pubsub" >}}) ## Failure detection strategies From 12445163328cd477b22f594d10f6d4eb38f5c838 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 09:57:18 -0700 Subject: [PATCH 04/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 4578cd136c..b94f8ccf55 
100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -39,7 +39,7 @@ Before implementing failover logic, ensure you understand: ## Failure detection strategies -Your application should monitor for two types of failures: +Your application should monitor local replica failures and replication failures. ### Local replica failures From 34d67c8299886834865cc768b17519ea3cc96136 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 09:57:29 -0700 Subject: [PATCH 05/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index b94f8ccf55..ff0293ff07 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -19,10 +19,10 @@ This guide explains how to implement robust failover and failback mechanisms for Active-Active databases don't include built-in application failover mechanisms. Instead, your application must: -1. **Monitor** local and remote replicas for availability -2. **Detect** failures quickly and accurately -3. **Failover** to a healthy remote replica when needed -4. **Failback** to the local replica when it recovers +1. Monitor local and remote replicas for availability. +2. Detect failures quickly and accurately. +3. Failover to a healthy remote replica when needed. +4. Failback to the local replica when it recovers. 
{{< warning >}} **Data consistency considerations**: Active-Active replication is asynchronous. Applications that failover to another replica may miss recent write operations, which can impact data consistency. If the failed replica saved the write operations in persistent storage, then the write operations are processed when the failed replica recovers. From 3cfcc64a661a9262392124beb691bbd7938f5713 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 09:57:38 -0700 Subject: [PATCH 06/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index ff0293ff07..0844f677e8 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -50,8 +50,9 @@ Your application should monitor local replica failures and replication failures. - Network connectivity issues - Configuration errors - Database endpoint unavailable +- Unexpected protocol level errors -**Detection method**: Monitor connection attempts to the database endpoint. If connections consistently fail (timeout, refused, protocol errors), consider the local replica failed. +**Detection method**: Monitor connection attempts to the database endpoint. If connections consistently fail, consider the local replica failed. 
### Replication failures From 5caf1d761771fe0abd405dcb2772c5e33ebaf8ae Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 09:58:03 -0700 Subject: [PATCH 07/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 0844f677e8..d4771b9a77 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -386,5 +386,4 @@ class FailoverRedisClient: ## Related topics - [Redis pub/sub]({{< relref "/develop/interact/pubsub" >}}) -- [Redis Cluster API]({{< relref "/operate/rs/clusters/cluster-api" >}}) -- [High availability best practices]({{< relref "/operate/rs/databases/durability-ha" >}}) +- [OSS Cluster API]({{< relref "/operate/rs/clusters/optimize/oss-cluster-api/" >}}) From 1cdaa146f08ffef2220d225b4505a4ea283f9138 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 09:58:12 -0700 Subject: [PATCH 08/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index d4771b9a77..a5a19e4888 100644 --- 
a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -201,9 +201,9 @@ def failover_to_replica(target_replica_name): Continue monitoring all replicas, including the failed one, to detect when it recovers. -## Implementing failback +## Implement failback -Monitor the failed replica to determine when it's safe to failback: +Monitor the failed replica to determine when it's safe to failback. ### Failback criteria From 894f110622d8f2d3888b9a7025c37b3fd377da77 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:01:05 -0700 Subject: [PATCH 09/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../develop/app-failover-active-active.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index a5a19e4888..6b7fcaadfe 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -66,16 +66,16 @@ Your application should monitor local replica failures and replication failures. **Detection method**: Use Redis pub/sub to monitor replication health across all replicas. -## Implementing pub/sub health monitoring +## Set up pub/sub health monitoring -The most reliable way to detect replication failures is using Redis pub/sub: +The most reliable way to detect replication failures is using Redis pub/sub. ### How it works -1. **Subscribe** to a dedicated health-check channel on each replica -2. **Publish** periodic heartbeat messages with unique identifiers -3. 
**Monitor** that your own messages are received within a time window -4. **Detect failure** when messages aren't received from specific replicas +1. Subscribe to a dedicated health-check channel on each replica. +2. Publish periodic heartbeat messages with unique identifiers. +3. Monitor that your own messages are received within a time window. +4. Detect failure when messages aren't received from specific replicas. ### Implementation steps From b37824fa2f9c689623d927e3e6e8f509407fc6bd Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:01:17 -0700 Subject: [PATCH 10/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 6b7fcaadfe..da2d37e161 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -208,9 +208,9 @@ Monitor the failed replica to determine when it's safe to failback. ### Failback criteria A replica is ready for failback when it's: -1. **Available**: Accepting connections and responding to commands -2. **Synchronized**: Caught up with changes from other replicas -3. **Not stale**: Actively participating in replication +1. **Available**: Accepting connections and responding to commands. +2. **Synchronized**: Caught up with changes from other replicas. +3. **Not stale**: You can read and write to the replica. 
### Failback process From e9deddab418ce58a2e00251fd36d56c6f2377f73 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:01:35 -0700 Subject: [PATCH 11/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index da2d37e161..e0e75f6850 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -79,7 +79,7 @@ The most reliable way to detect replication failures is using Redis pub/sub. ### Implementation steps -1. **Connect to all replicas**: +1. 
Connect to all replicas: ```python # Example implementation - adapt for your environment replicas = { From 3a1318ac1472726a387ffbc5ecff1541bee4dc63 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:01:42 -0700 Subject: [PATCH 12/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index e0e75f6850..46654877fc 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -214,7 +214,7 @@ A replica is ready for failback when it's: ### Failback process -1. **Verify replica health**: +1. 
Verify replica health: ```python # Example implementation - adapt for your environment def is_replica_ready_for_failback(replica_name): From aff22b63239ce358a923c05bfa29e3ba9deb6175 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:01:50 -0700 Subject: [PATCH 13/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 46654877fc..73c9ef7df1 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -237,7 +237,7 @@ A replica is ready for failback when it's: return False ``` -2. **Gradual failback** (recommended): +2. 
Gradual failback: ```python # Example implementation - adapt for your environment def gradual_failback(primary_replica): From e49179a7622fd3a809f56e98b3aa569a508aba3d Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:02:11 -0700 Subject: [PATCH 14/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 73c9ef7df1..239bbef53e 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -89,7 +89,7 @@ The most reliable way to detect replication failures is using Redis pub/sub. } ``` -2. **Subscribe to health channels**: +2. 
Subscribe to health channels: ```python # Example implementation - adapt for your environment for name, client in replicas.items(): From 193ee6fbd08b08e6c563e6f3e066c3ffcd42fe49 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:02:27 -0700 Subject: [PATCH 15/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 239bbef53e..1a2d61ec6f 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -96,7 +96,7 @@ The most reliable way to detect replication failures is using Redis pub/sub. client.subscribe(f'health-check-{name}') ``` -3. **Publish heartbeat messages**: +3. 
Publish heartbeat messages: ```python # Example implementation - adapt for your environment import time From 76c3b9d7eab20427d08d3b2384167cbd12a91a97 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:02:48 -0700 Subject: [PATCH 16/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 1a2d61ec6f..322263d9db 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -113,7 +113,7 @@ The most reliable way to detect replication failures is using Redis pub/sub. client.publish(f'health-check-{name}', json.dumps(message)) ``` -4. **Monitor message delivery**: +4. 
Monitor message delivery: ```python # Example implementation - adapt for your environment def check_replication_health(): From 654707062cd2ab6daf1b18b8f0438e04e37fc73d Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:04:34 -0700 Subject: [PATCH 17/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 322263d9db..f1024b7e06 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -129,7 +129,7 @@ The most reliable way to detect replication failures is using Redis pub/sub. **Why pub/sub works**: Pub/sub messages are delivered as replicated effects, making them a reliable indicator of active replication links. Unlike dataset changes, pub/sub doesn't make assumptions about your data structure. 
{{< /tip >}} -## Handling sharded databases +## Handle sharded databases If your Active-Active database uses sharding, you need to monitor each shard individually: From 7c16984f909856f0af55600ab44ce3a06bc02812 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:05:04 -0700 Subject: [PATCH 18/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index f1024b7e06..655151f388 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -165,9 +165,7 @@ Asymmetric configurations require monitoring every hash slot intersection, which When you detect a local replica failure: -### 1. Stop writing to the failed replica - -Immediately redirect all database operations to a healthy remote replica. +1. Stop writing to the failed replica and immediately redirect all database operations to a healthy remote replica. 
```python # Example implementation - adapt for your environment From c9baab32b4c175d963f7afec9ebecbaab5d866dd Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:05:23 -0700 Subject: [PATCH 19/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 655151f388..bd66f754af 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -183,7 +183,7 @@ def failover_to_replica(target_replica_name): update_app_config('active_replica', target_replica_name) ``` -### 2. Handle data consistency +2. 
Handle data consistency **Important considerations**: - The remote replica may not have your most recent writes From 804add53152410f50d2d1c3301f8d20d44048796 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:05:44 -0700 Subject: [PATCH 20/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index bd66f754af..9becb8ed94 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -195,9 +195,7 @@ def failover_to_replica(target_replica_name): - Use timestamps or version numbers to detect stale data - Implement retry logic for critical operations -### 3. Update monitoring - -Continue monitoring all replicas, including the failed one, to detect when it recovers. +3. Continue monitoring all replicas, including the failed one, to detect when it recovers. 
## Implement failback From 138afcb20ae5725f1464427d4ff135180e0e9a72 Mon Sep 17 00:00:00 2001 From: mich-elle-luna <153109578+mich-elle-luna@users.noreply.github.com> Date: Wed, 28 May 2025 10:06:03 -0700 Subject: [PATCH 21/22] Update content/operate/rs/databases/active-active/develop/app-failover-active-active.md Co-authored-by: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> --- .../active-active/develop/app-failover-active-active.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 9becb8ed94..270c8cf8c5 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -161,7 +161,7 @@ def get_channels_per_shard(redis_client): Asymmetric configurations require monitoring every hash slot intersection, which is complex and error-prone. 
-## Implementing failover +## Implement failover When you detect a local replica failure: From 9879716930c7e01512a11e6f6075b9caccdc7b8f Mon Sep 17 00:00:00 2001 From: mich-elle-luna Date: Tue, 3 Jun 2025 16:25:14 -0700 Subject: [PATCH 22/22] Address PR feedback for Active-Active app failover documentation - Make sharding monitoring requirements less prescriptive, offer database-level and per-shard approaches - Convert asymmetric sharding section to a note for cleaner structure - Move dataset monitoring warning to Failback criteria section for better context - Fix Next steps section with appropriate links and remove broken monitoring link - Remove inappropriate generic troubleshooting content to keep focus on Redis Enterprise specifics --- .../develop/app-failover-active-active.md | 88 +++++++++---------- 1 file changed, 40 insertions(+), 48 deletions(-) diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 270c8cf8c5..f853ba1efb 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -70,6 +70,10 @@ Your application should monitor local replica failures and replication failures. The most reliable way to detect replication failures is using Redis pub/sub. +{{< tip >}} +**Why pub/sub works**: Pub/sub messages are delivered as replicated effects and are a more reliable indicator of a live replication link. In certain cases, dataset keys may appear to be modified even if the replication link fails. This happens because keys may receive updates through full-state replication (re-sync) or through online replication of effects. Unlike dataset changes, pub/sub doesn't make assumptions about your data structure. +{{< /tip >}} + ### How it works 1. Subscribe to a dedicated health-check channel on each replica. 
@@ -125,18 +129,38 @@ The most reliable way to detect replication failures is using Redis pub/sub. mark_replica_unhealthy(replica_name) ``` -{{< tip >}} -**Why pub/sub works**: Pub/sub messages are delivered as replicated effects, making them a reliable indicator of active replication links. Unlike dataset changes, pub/sub doesn't make assumptions about your data structure. -{{< /tip >}} - ## Handle sharded databases -If your Active-Active database uses sharding, you need to monitor each shard individually: +If your Active-Active database uses sharding, you have several monitoring approaches: + +### Database-level monitoring (simpler approach) + +For many use cases, you can monitor the entire database using a single pub/sub channel per replica. This approach: + +- **Works well when**: All shards typically fail together (node failures, network partitions) +- **Simpler to implement**: Uses the same monitoring logic as non-sharded databases +- **May miss**: Individual shard failures that don't affect the entire database + +```python +# Example implementation - adapt for your environment +# Use the same approach as non-sharded databases +for name, client in replicas.items(): + client.subscribe(f'health-check-{name}') +``` + +### Per-shard monitoring (comprehensive approach) + +Monitor each shard individually when you need to detect partial database failures: -### Symmetric sharding (recommended) +#### Symmetric sharding (recommended) With symmetric sharding, all replicas have the same number of shards and hash slots. +**When to use per-shard monitoring**: +- You need to detect individual shard failures +- Your application can handle partial database availability +- You want maximum visibility into database health + **Monitoring approach**: 1. Use the Cluster API to get the sharding configuration 2. 
Create one pub/sub channel per shard @@ -157,9 +181,9 @@ def get_channels_per_shard(redis_client): return channels ``` -### Asymmetric sharding (not recommended) - -Asymmetric configurations require monitoring every hash slot intersection, which is complex and error-prone. +{{< note >}} +**Asymmetric sharding**: Asymmetric configurations require monitoring every hash slot intersection, which is complex and error-prone. For asymmetric sharding, database-level monitoring is often more practical than per-shard monitoring. +{{< /note >}} ## Implement failover @@ -208,6 +232,10 @@ A replica is ready for failback when it's: 2. **Synchronized**: Caught up with changes from other replicas. 3. **Not stale**: You can read and write to the replica. +{{< warning >}} +**Avoid dataset-based monitoring**: Don't rely solely on reading/writing test keys to determine replica health. Replicas can appear healthy while still in stale mode or missing recent updates. +{{< /warning >}} + ### Failback process 1. Verify replica health: @@ -248,10 +276,6 @@ A replica is ready for failback when it's: redirect_writes_to(primary_replica) ``` -{{< warning >}} -**Avoid dataset-based monitoring**: Don't rely solely on reading/writing test keys to determine replica health. Replicas can appear healthy while still in stale mode or missing recent updates. -{{< /warning >}} - ## Configuration best practices ### Application-side failover only @@ -344,42 +368,10 @@ class FailoverRedisClient: pass ``` -## Next steps - -- [Configure Active-Active databases]({{< relref "/operate/rs/databases/active-active/create" >}}) -- [Monitor Active-Active replication]({{< relref "/operate/rs/databases/active-active/monitor" >}}) -- [Develop applications with Active-Active databases]({{< relref "/operate/rs/databases/active-active/develop" >}}) - -## Troubleshooting common issues - -### False positive failure detection - -**Problem**: Application detects failures when replicas are actually healthy. 
- -**Solutions**: -- Increase heartbeat timeout windows -- Use multiple consecutive failures before triggering failover -- Monitor network latency between replicas - -### Split-brain scenarios - -**Problem**: Network partition causes multiple replicas to appear as "primary" to different application instances. - -**Solutions**: -- Implement consensus mechanisms in your application -- Use external coordination services (like Consul or etcd) -- Design for eventual consistency - -### Slow failback - -**Problem**: Replica appears healthy but failback causes performance issues. - -**Solutions**: -- Implement gradual failback (reads first, then writes) -- Monitor replica performance metrics during failback -- Use canary deployments for failback testing - ## Related topics +- [Manage Active-Active databases]({{< relref "/operate/rs/databases/active-active/manage" >}}) +- [Active-Active database synchronization]({{< relref "/operate/rs/databases/active-active/syncer" >}}) +- [Monitor Redis Enterprise Software]({{< relref "/operate/rs/monitoring" >}}) - [Redis pub/sub]({{< relref "/develop/interact/pubsub" >}}) - [OSS Cluster API]({{< relref "/operate/rs/clusters/optimize/oss-cluster-api/" >}})
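The heartbeat-timeout bookkeeping that underlies the pub/sub health monitoring added in these patches can be sketched as a small, self-contained helper. This is an illustrative sketch only — `HealthTracker`, `HEARTBEAT_TIMEOUT`, and the replica names are hypothetical and not part of any Redis Enterprise API:

```python
# Sketch of the heartbeat-timeout logic behind pub/sub health monitoring.
# All names here (HealthTracker, HEARTBEAT_TIMEOUT) are illustrative.
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without an echoed heartbeat => suspect failure


class HealthTracker:
    """Tracks when each replica last echoed one of our heartbeat messages."""

    def __init__(self, replica_names, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        now = time.monotonic()
        # Assume all replicas healthy at startup.
        self.last_seen = {name: now for name in replica_names}

    def record_heartbeat(self, replica_name, when=None):
        # Called from the pub/sub listener when our own message comes back
        # on that replica's health-check channel.
        self.last_seen[replica_name] = time.monotonic() if when is None else when

    def unhealthy_replicas(self, now=None):
        # A replica is suspect once no heartbeat echoed within the window.
        now = time.monotonic() if now is None else now
        return [name for name, seen in self.last_seen.items()
                if now - seen > self.timeout]


tracker = HealthTracker(["us-east", "eu-west"], timeout=10.0)
tracker.record_heartbeat("us-east", when=100.0)
tracker.record_heartbeat("eu-west", when=95.0)
print(tracker.unhealthy_replicas(now=106.0))  # → ['eu-west']
```

In a real deployment, `record_heartbeat` would be driven by the pub/sub listener for each replica's health-check channel, and any replica returned by `unhealthy_replicas` would feed the failover path described in the patched guide; allowing several consecutive missed windows before acting helps avoid false positives.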