Fixed handling of interval tree clocks #85
These rules have to be followed to make ITCs work properly:

- Never store a remote clock in the local registry. Remote clocks either have no identity (if peeked) or carry the wrong identity of another node, so storing them locally is always wrong. There have been many places where remote clocks were stored in the local registry.
- Use Clock.join() to receive causal information from remote (peeked) clocks. Clock.join() was not used anywhere in the Tracker, which means that causal information from other nodes was never incorporated after a broadcast.
- Always call Clock.peek() before sending clocks to other replicas. This avoids a couple of bugs where the local node stores the wrong clock identity.
- Make sure that every node gets a properly forked clock when joining the cluster.

Removed the handle_replica calls for :add_meta and :remove_meta, because clocks are stored with the metadata itself and those events cannot determine the causal ordering of add and remove operations; :update_meta is now used for all replica events. Changed broadcast_event and added a get_registry_snapshot wrapper to make sure that only peeked clocks are sent over to other nodes. The TrackerState clock is now only used to hold the clock identity.
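The fork/peek/join discipline above can be sketched as follows. This is an illustrative Python sketch that uses a simple vector clock as a stand-in for a real interval tree clock: the function names (`fork`, `peek`, `join`, `event`) mirror the ITC API, but this is not Swarm's actual implementation, and every name in the sketch is hypothetical.

```python
def seed():
    # A fresh clock: identity "a", empty event history.
    return ("a", {})

def fork(clock):
    # Split one clock into two clocks with distinct identities that
    # share the same causal history. Every node joining the cluster
    # must receive its own forked half.
    ident, events = clock
    return ((ident + "0", dict(events)), (ident + "1", dict(events)))

def peek(clock):
    # Strip the identity before sending a clock to a remote node.
    # A peeked clock carries causal information only.
    _ident, events = clock
    return (None, dict(events))

def event(clock):
    # Record a local event: only a clock that owns an identity
    # may advance its own component.
    ident, events = clock
    assert ident is not None, "anonymous (peeked) clocks cannot record events"
    events = dict(events)
    events[ident] = events.get(ident, 0) + 1
    return (ident, events)

def join(local, remote):
    # Merge causal information from a remote (peeked) clock into
    # the local clock, keeping the LOCAL identity. Storing the
    # remote clock itself in the registry would be wrong.
    ident, events = local
    _remote_ident, remote_events = remote
    merged = dict(events)
    for k, v in remote_events.items():
        merged[k] = max(merged.get(k, 0), v)
    return (ident, merged)

# Two replicas fork from a seed, diverge, then exchange peeked clocks.
left, right = fork(seed())
left = event(left)                 # a local change on "left"
right = join(right, peek(left))    # broadcast: "right" learns left's history
assert right[0] == "a1"            # join preserves the local identity
assert right[1] == {"a0": 1}      # causal information was received
```

The key point the sketch demonstrates is the asymmetry: only `peek`ed clocks cross the wire, and the receiver always folds them in with `join` rather than replacing its own clock, so no node ever adopts another node's identity.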
This change should address the issues I outlined here:
- Add tests for handling replication events: track, untrack and update_meta
- Add tests for syncing registries

Added Tracker tests to validate handling of replication events and syncing registries.
We have been using this change in production for a couple of weeks now, without any problems. It really fixes the occasional issues we saw with the clock handling.
@bitwalker have you had time to look at this change?
@tschmittni I'm trying to get all Swarm tests passing, and some previously failing tests seem to be fixed on your ITC branch, which is awesome :-) However, your most recent commit, in which you added
Thank you! It looks like the tests are not working in "clustered" mode, but you can run them with: Let me know if you want me to change anything with the tests. Otherwise, we can also just drop them if you prefer other types of tests.
Thanks, that helps. The correct branch is
The latest branch should be the one in this pull request: https://github.com/tschmittni/swarm/tree/refactored-itc-usage
I'll see what I can do to get them running with
Fixed the tests; see my PR #94.
This is a follow-up to bitwalker#85. I forgot to change the handle_topology_change method, which is, as far as I can tell, also the root cause of bitwalker#106.