
kvs_watch() can miss values on stalled read #812

Closed
garlick opened this issue Sep 20, 2016 · 3 comments

Comments

garlick (Member) commented Sep 20, 2016

If a kvs watch request is stalled traversing the namespace to read the new value, the key could have changed between the stall and its eventual restart. The watch request will return the more recent value and the "watcher" (client calling kvs_watch) will have missed the original value.

garlick (Member, Author) commented Dec 30, 2016

This problem stems from the design of the get handler, which watch leverages.

When the get handler must make an RPC to load some data into its cache in order to continue a walk through the namespace, it simply stores the original get request on the wait queue of the cache entry, and when the load RPC returns, the wait queue is "run", restarting the queued request. The request is restarted from the beginning, walking from the current root; if the root hash has since changed, the walk starts from the new root, and ultimately the newest value is returned to the caller. This is fine for get, which is effectively a "get the latest value of this key" request.

watch is handled just like a get except the watch request is also stored in a list of active watchers that is "run" on each commit. As long as the watch request can complete its walk without stalling, this results in a watch response for every commit that changes the watched key. However, if the walk stalls, and another commit occurs in between, the watch restarts from the beginning at the new root hash and can return a newer value.

The fix could be either:

  1. restart the stalled watch from the hash reference that caused the stall
  2. restart from the beginning but, like getat, use a specific root hash reference as a starting point

Currently a message handler has no way of knowing whether it was invoked from the reactor or from the "waitlist" abstraction. If there were a way to "annotate" the request message with this stored walk context and retrieve it later without breaking the "constness" of the message argument, this would be easy to implement.

Another possibility is to rework the coproc support to work in conjunction with a generalized wait queue abstraction. Any reactor handler could call sleep_on (&waitqueue); run (&waitqueue) would mark those stalled contexts as restartable, and a reactor prepare/check watcher would sort out running them fairly. The caveat is that only "reentrant" reactor handlers could work this way, since a given handler might be stalled multiple times on different request messages. This is quite different from, and more flexible than, the current coproc support, which is off by default, only implements a "reschedule point" at flux_recv(), and doesn't allow handler reentry. (Although possibly I'm not remembering some constraints that led us there.)

grondo (Contributor) commented May 10, 2018

Is this still an issue? In any event, moving off to 0.10.0

grondo modified the milestones: release 0.9.0 → release 0.10.0 (May 10, 2018)
chu11 (Member) commented May 10, 2018

I think this is fixed. All the way back in #1066, I refactored walks/lookups to no longer start at the beginning. A restarted walk resumes wherever it last was in the walk.

So this was effectively fixed via the suggested:

restart the stalled watch from the hash reference that caused the stall

up above. I'm going to close this.

Edit: For documentation purposes, the key patch is 30d8a12

chu11 closed this as completed May 10, 2018