ENG-8212: Redis Oplock implementation #5932
Conversation
When an update is emitted for a token but the websocket for that token is connected to another instance of the app, post the update to the lost+found channel, where other instances listen for updates to forward to their clients.
This also lays the groundwork for broadcasting updates to all connected states.
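A minimal sketch of this cross-instance hand-off, assuming redis.asyncio; the channel name and payload format here are illustrative, not necessarily what the PR uses:

```python
import json

import redis.asyncio as aioredis

# Illustrative channel name; the PR refers to it as the "lost+found" channel.
LOST_AND_FOUND_CHANNEL = "reflex:lost+found"


async def post_update(redis: aioredis.Redis, token: str, delta: dict) -> None:
    """Publish an update for a token whose websocket lives on another instance."""
    await redis.publish(
        LOST_AND_FOUND_CHANNEL,
        json.dumps({"token": token, "delta": delta}),
    )


async def relay_updates(redis: aioredis.Redis, local_tokens: set[str]) -> None:
    """Each instance listens on the shared channel and forwards updates
    for tokens whose websocket is connected to this instance."""
    async with redis.pubsub() as pubsub:
        await pubsub.subscribe(LOST_AND_FOUND_CHANNEL)
        async for message in pubsub.listen():
            if message["type"] != "message":
                continue
            payload = json.loads(message["data"])
            if payload["token"] in local_tokens:
                ...  # send payload["delta"] to the locally connected websocket
```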
For more efficient and fair lock queueing, each StateManagerRedis uses a single task to monitor the keyspace for lock release/expire events and then wakes up the next caller waiting in the queue (there is no fairness between separate processes, though). Lockers now wait on an `asyncio.Event` that is set by the redis pubsub waiter. If any locker waits longer than the lock_expiration, it simply tries to take the lock anyway, so a missed pub/sub notification can never block a locker forever.
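Roughly, the waiter pattern looks like the following sketch; the class and attribute names are made up for illustration and are not the PR's actual code:

```python
import asyncio
from collections import defaultdict


class LockWaiters:
    """Illustrative per-process queue of lock waiters.

    A single monitoring task (not shown) watches keyspace notifications and
    calls wake_next() whenever a lock key is deleted or expires.
    """

    def __init__(self, lock_expiration: float) -> None:
        self.lock_expiration = lock_expiration
        self._queues: defaultdict[str, list[asyncio.Event]] = defaultdict(list)

    async def wait_for_lock(self, lock_key: str, try_acquire) -> None:
        """Retry try_acquire() until it succeeds, sleeping on an Event between
        attempts, but never longer than lock_expiration per attempt."""
        while not await try_acquire():
            event = asyncio.Event()
            self._queues[lock_key].append(event)
            try:
                # If the pub/sub wake-up is missed, fall through after
                # lock_expiration and retry the acquire anyway.
                await asyncio.wait_for(event.wait(), timeout=self.lock_expiration)
            except asyncio.TimeoutError:
                pass
            finally:
                if event in self._queues[lock_key]:
                    self._queues[lock_key].remove(event)

    def wake_next(self, lock_key: str) -> None:
        """Called by the pubsub monitoring task on lock release/expire."""
        if self._queues[lock_key]:
            self._queues[lock_key].pop(0).set()
```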
* When taking a lock from redis, hold it for 80% of the lock expiration timeout.
* While the lock is held, other events processed against the instance use the cached in-memory copy of the state.
* When the timeout expires or another process signals intent to access a locked state, flush the modified states to redis and release the lock.

Set REFLEX_OPLOCK_ENABLED=1 to use this feature (sketched below).
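A rough sketch of the lease lifecycle under these assumptions; the names are illustrative, the real implementation lives in reflex/istate/manager/redis.py:

```python
import asyncio


class OplockLease:
    """Illustrative lease lifecycle; names are guesses, not the PR's actual code."""

    def __init__(self, token: str, lock_expiration: float) -> None:
        self.token = token
        # Hold the lock for 80% of the expiration so it never silently expires
        # while a modified state is still only in memory.
        self.lease_duration = 0.8 * lock_expiration
        self._cached_states: dict[str, object] = {}
        self._lease_breaker: asyncio.Task | None = None

    def start(self, state: object) -> None:
        """Cache the state and schedule the flush for when the lease ends."""
        self._cached_states[self.token] = state
        self._lease_breaker = asyncio.create_task(self._break_lease_later())

    async def _break_lease_later(self) -> None:
        # Uncontended path: flush once 80% of the lock expiration has passed.
        await asyncio.sleep(self.lease_duration)
        # Shield the flush so a cancellation cannot drop modified state.
        await asyncio.shield(self._flush_and_release())

    def on_contention(self) -> None:
        # Another process signalled intent to access this state: flush early.
        if self._lease_breaker is not None:
            self._lease_breaker.cancel()
        asyncio.create_task(self._flush_and_release())

    async def _flush_and_release(self) -> None:
        state = self._cached_states.pop(self.token, None)
        if state is not None:
            ...  # write the modified state back to redis, then DEL the lock key
```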
CodSpeed Performance Report
Merging #5932 will not alter performance.
Comparing …senf/redis_oplock
Always check redis for contended leases before granting a lease. It's a bit slower, but much more reliable, and it avoids racy lock_expiration timeouts when contention occurs before the lease is created or when the pubsub hasn't caught up to reality. Always start `_lock_update_task` in `__post_init__` to avoid a race where the lease is granted and then contended before the pubsub task has started and can catch the contention.
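In other words, before handing out an in-memory lease the manager consults Redis for waiters rather than trusting the pubsub alone; a hedged sketch with a made-up waiters key name:

```python
import redis.asyncio as aioredis


async def grant_lease(redis: aioredis.Redis, token: str) -> bool:
    """Refuse an in-memory lease if another process already signalled
    contention; the waiters key name is illustrative."""
    waiters = await redis.scard(f"{token}_lock_waiters")
    if waiters:
        # The state is contended: flush and release instead of re-using
        # the cached copy.
        return False
    return True
```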
Greptile Overview
Greptile Summary
This PR implements an opportunistic locking (oplock) mechanism for the Redis state manager to improve performance when there's no lock contention. The key change is that when REFLEX_OPLOCK_ENABLED=1 is set, the state manager holds Redis locks for 80% of the expiration timeout and caches states in memory, allowing subsequent operations to avoid Redis round-trips entirely.
Key Changes:
- Holds locks for 80% of `lock_expiration` time to enable fast in-memory state access during uncontended periods
- Uses Redis pubsub keyspace notifications to detect lock contention and flush cached states when other processes need access
- Adds comprehensive test coverage including mock Redis implementation and integration tests
- Gracefully handles cancellation and ensures state consistency by shielding flush operations from cancellation
- Maintains backward compatibility - feature is opt-in via environment variable
Implementation Quality:
- Well-structured with extensive test coverage (8 new test cases covering basic ops, contention, cancellation, and edge cases)
- Proper error handling and cleanup logic with `asyncio.shield` to prevent data loss
- CI updated to test both with and without oplock enabled
- Good separation of concerns with `_try_modify_state` handling retry logic
Confidence Score: 4/5
- This PR is generally safe to merge with some considerations for production deployment
- Score of 4 reflects solid implementation with comprehensive testing, but this is a complex concurrency feature that introduces new failure modes. The oplock mechanism is well-tested and properly handles contention, cancellation, and edge cases. However, it's a significant architectural change to the state management system that could have unexpected interactions in production under high load or network issues. The feature is appropriately gated behind an environment variable and thoroughly tested in CI. One minor style suggestion was provided for code clarity.
- Pay close attention to `reflex/istate/manager/redis.py` - this is the core implementation with complex async logic and lock management
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| reflex/istate/manager/redis.py | 4/5 | Implements opportunistic locking (oplock) for Redis state manager. Holds locks for 80% of expiration time to enable fast in-memory access when uncontended. Adds pubsub-based lock contention detection and local state caching. |
| reflex/environment.py | 5/5 | Adds two environment variables: REFLEX_REDIS_STATE_MANAGER_DEBUG for debug logging and REFLEX_OPLOCK_ENABLED to enable opportunistic locking feature. |
| tests/units/mock_redis.py | 5/5 | New mock Redis implementation supporting pubsub, keyspace notifications, and set operations. Also provides real_redis helper for integration testing. |
| tests/units/istate/manager/test_redis.py | 5/5 | Comprehensive test coverage for Redis state manager oplock feature. Tests basic operations, lock contention, lease management, cancellation handling, and substate fetching. |
| .github/workflows/unit_tests.yml | 5/5 | Adds Redis service container and additional test run with REFLEX_OPLOCK_ENABLED=true to verify oplock functionality in CI. |
Sequence Diagram
```mermaid
sequenceDiagram
participant Client1 as Client 1 (Process A)
participant SM1 as StateManager 1
participant Redis
participant PubSub as Redis PubSub
participant SM2 as StateManager 2
participant Client2 as Client 2 (Process B)
Note over Client1,Client2: Scenario: Oplock enabled, uncontended access
Client1->>SM1: modify_state(token)
SM1->>Redis: SET token_lock (NX, PX=lock_expiration)
Redis-->>SM1: OK (lock acquired)
SM1->>Redis: GET state data
Redis-->>SM1: state
SM1->>SM1: Cache state in _cached_states
SM1->>SM1: Create lease_breaker task (sleep 80% of lock_expiration)
SM1-->>Client1: yield cached_state
Client1->>Client1: Modify state in memory
Client1->>SM1: Exit context (fast, no Redis write)
Note over SM1: Lock held, state cached for subsequent calls
Client1->>SM1: modify_state(token) [2nd call]
SM1->>SM1: Check _cached_states
SM1-->>Client1: yield cached_state (no Redis lock!)
Client1->>Client1: Modify state in memory
Client1->>SM1: Exit context (fast, no Redis write)
Note over Client2,SM2: Contention scenario begins
Client2->>SM2: modify_state(token)
SM2->>Redis: SADD token_lock_waiters (signal contention)
Redis->>PubSub: keyspace event: sadd
PubSub->>SM1: Lock contention detected
SM1->>SM1: Cancel lease_breaker task
SM1->>Redis: SET state data (flush cached state)
SM1->>Redis: DEL token_lock (release)
Redis->>PubSub: keyspace event: del
PubSub->>SM2: Lock released notification
SM2->>Redis: SET token_lock (acquire)
Redis-->>SM2: OK
SM2->>Redis: GET state data
Redis-->>SM2: state (with Client1's changes)
SM2->>SM2: Cache state, create new lease_breaker
SM2-->>Client2: yield state
Client2->>Client2: Modify state
Client2->>SM2: Exit context
Note over SM1,SM2: Lease timeout scenario (no contention)
SM1->>SM1: lease_breaker wakes after 80% timeout
SM1->>Redis: SET state data (flush)
SM1->>Redis: DEL token_lock (release)
SM1->>SM1: Clear _cached_states[token]
```
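The contention and release signals in the diagram rely on Redis keyspace notifications. A minimal sketch of watching them with redis.asyncio, assuming keyspace events are enabled on the server and using illustrative key names:

```python
import redis.asyncio as aioredis


async def watch_lock_events(redis: aioredis.Redis, token: str) -> None:
    """Watch the lock and waiters keys for keyspace events.

    Requires keyspace notifications on the server, e.g.
    CONFIG SET notify-keyspace-events Kgxs (keyspace, generic, expired, set).
    Key names and the database index (0) are illustrative.
    """
    async with redis.pubsub() as pubsub:
        await pubsub.psubscribe(f"__keyspace@0__:{token}_lock*")
        async for message in pubsub.listen():
            if message["type"] != "pmessage":
                continue
            event = message["data"]  # e.g. b"del", b"expired", b"sadd"
            if event in (b"del", b"expired"):
                ...  # lock released: wake the next local waiter
            elif event == b"sadd":
                ...  # another process wants the state: flush and release early
```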
10 files reviewed, 1 comment
No point in continually spamming "no running event loop" to the console.
Increases overall test coverage by ~1%, even without REFLEX_REDIS_URL set.