
Conversation


@daveyeu daveyeu commented Oct 17, 2014

The most important change here is that Redis failures are now expected, and the entire GC/destroy client process has been updated to handle them. Broadly, failures are tolerated by allowing the GC to be re-runnable for a failed client ID.

When GC starts up, we fan out and call destroyClient(clientId) for each candidate expired client ID. Each of these calls can fail individually, in which case processing simply stops for that client ID. Since removal of the client ID from the /clients sorted set is the last step, GC will pick up failed client IDs and re-run them in the future. As such, the lock around GC has also been removed: failures are expected, and multiple deletion runs for a single client ID are tolerable.
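The flow above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the Redis client is an in-memory stub, and `failOnce` is a test-only failure injector standing in for real Redis errors.

```javascript
// In-memory stand-in for the Redis sorted-set commands the GC uses.
function makeStubRedis() {
  const clients = new Map(); // member -> score, stands in for /clients
  return {
    zrangebyscore(key, min, max, cb) {
      const ids = [...clients].filter(([, s]) => s >= min && s <= max)
                              .map(([id]) => id);
      cb(null, ids);
    },
    zadd(key, score, member, cb) { clients.set(member, score); cb(null); },
    zrem(key, member, cb) { clients.delete(member); cb(null); },
    clients
  };
}

function destroyClient(redis, clientId, failOnce, cb) {
  // ...delete the client's message queue, subscriptions, etc. If any of
  // those Redis calls errors, processing stops: the ID stays in /clients
  // and a later GC pass simply re-runs the whole destroy for it.
  if (failOnce.has(clientId)) {
    failOnce.delete(clientId);
    return cb(new Error('simulated redis failure'));
  }
  // Removing the ID from the /clients sorted set is the final step.
  redis.zrem('/clients', clientId, cb);
}

function gc(redis, cutoff, failOnce) {
  redis.zrangebyscore('/clients', 0, cutoff, (err, ids) => {
    if (err) return;
    // Fan out, fire-and-forget: no lock, and a failure for one client ID
    // never affects the others.
    ids.forEach(id => destroyClient(redis, id, failOnce, () => {}));
  });
}
```

Because a failed ID is left in the sorted set, running `gc` a second time retries it, which is exactly why the GC lock is no longer needed.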

Other notable changes:

  • The score is now assigned to a client ID in the initial ZADD call in createClient. Previously, the score was set to zero and updated in the callback with ping(). That was susceptible to a race, however, where GC could run and pluck the client ID before its score was updated.
  • Publishing no longer destroys clients. Previously, publish would check the expiry of every client ID and call destroyClient if it detected an expired client. This could lead to races, so it's now the sole responsibility of GC to clean up old clients.
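The first bullet's race can be shown in a tiny sketch, again against an in-memory stand-in for Redis (the `zadd` helper and key layout here are illustrative, not the engine's exact client API).

```javascript
const clients = new Map(); // stub for the /clients sorted set

function zadd(score, member, cb) { clients.set(member, score); cb(null); }

// Before: zadd(0, clientId) followed by a ping() that re-scored it in the
// callback -- a window in which GC could see score 0 (long expired) and
// destroy the brand-new client.
// After: the real timestamp goes into the initial ZADD, so the client is
// never visible with a stale score.
function createClient(clientId, now, cb) {
  zadd(now, clientId, cb);
}
```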

Publishing was stepping on GC's toes by calling `destroyClient` whenever
it detected a client ID was past its prime. I *think* there may have
been some contention between the two (publisher and gc), so this moves
the cleanup responsibility wholly to the latter.

Well, not completely. The `disconnect` function inside of Faye itself
calls `destroyClient` as well, but let's leave that be.
A lot of changes here. Primarily, the destroyClient function now
tolerates Redis failures all the way downstream: if any error is
detected, the destroy process stops. Instead of trying to recover, we
rely on GC to pick up the client and re-run the calls in the future,
since the removal from the /clients sorted set is the final Redis call.

The GC lock has also been removed. After grabbing a list of expired
client IDs, we iterate over them with destroyClient, which returns
immediately. Since we're simply launching destroyClient into flight, and
don't care if it succeeds or not, we allow GC to re-run in the future
without any restrictions.

Because destroyClient doesn't block, we have an additional, expected
test failure -- "Redis engine destroyClient when the client has
subscriptions stops the client receiving messages:". This fails because
it's written to expect the destroy to occur immediately. If the test is
rewritten with the expectation in a callback, it succeeds.
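The suggested test rewrite might look something like this. Everything here is illustrative (a stub engine, not Faye's actual test harness): the point is only that the assertion moves into destroyClient's callback instead of running synchronously.

```javascript
// Stub engine whose destroyClient completes asynchronously, like the
// real non-blocking, multi-step Redis destroy.
function makeStubEngine() {
  const subscribers = new Set(['abc']);
  return {
    destroyClient(clientId, cb) {
      setImmediate(() => { subscribers.delete(clientId); if (cb) cb(); });
    },
    receives(clientId) { return subscribers.has(clientId); }
  };
}

function runTest(done) {
  const engine = makeStubEngine();
  engine.destroyClient('abc', function() {
    // Asserting here, inside the callback, is what makes the test pass:
    // a synchronous assertion would still see the client subscribed,
    // because the destroy hasn't completed yet.
    done(!engine.receives('abc'));
  });
}
```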
Previously, it would set the initial score to zero and then allow
ping() to update the score to now in the callback. That seemed to leave
a window of opportunity for GC to sneak in and kill the newly created
client, however, so now the score is simply set in the initial call.
@jpignata

lgtm

daveyeu added a commit that referenced this pull request Oct 17, 2014
Rewrite GC to handle Redis failures
@daveyeu daveyeu merged commit 2d43d61 into master Oct 17, 2014


3 participants