
improve kvs watch efficiency #810

Closed
garlick wants to merge 8 commits into flux-framework:master from garlick:kvs_watch
Conversation

@garlick (Member) commented Sep 15, 2016

This PR is a speculative attempt to redesign the kvs watch machinery to make it more efficient and scalable as discussed in #803. So far:

  • the kvs.setroot event sends out a list of keys modified by the commit
  • a new function kvs_getat(), which is like kvs_get() but with an additional treeobj argument (think of it as a snapshot-relative kvs_get(), as suggested in #64, "kvsdir_* API should provide atomicity in namespace traversal"); a usage sketch follows the list
  • a command line front end, flux kvs getat treeobj key, plus sharness tests
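A minimal sketch of how the new pieces might fit together, assuming a kvs_getat() signature modeled on kvs_get() with a leading treeobj argument. The kvs_get_treeobj() signature, the handle type, and the resource.hosts key are illustrative assumptions, not part of this PR:

    /* Sketch only: signatures are assumed from the description above,
     * and "resource.hosts" is a made-up key.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <flux/core.h>

    int lookup_at_snapshot (flux_t *h)
    {
        char *treeobj = NULL;   /* JSON dirent describing the snapshot root */
        char *json_str = NULL;  /* JSON-encoded value of the key */

        /* Capture the current root directory as a treeobj "snapshot". */
        if (kvs_get_treeobj (h, ".", &treeobj) < 0)
            return -1;
        /* Resolve the key relative to that snapshot, not the live root. */
        if (kvs_getat (h, treeobj, "resource.hosts", &json_str) < 0) {
            free (treeobj);
            return -1;
        }
        printf ("%s\n", json_str);
        free (json_str);
        free (treeobj);
        return 0;
    }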

What I was thinking of doing, roughly, is removing the watch machinery from the kvs module entirely, and instead having the client kvs_watch() implementation:

  • subscribe to kvs.setroot events
  • get current value of key (call user callback)
  • as each event comes in, if watched key is in the list, call kvs_getat() on the root treeobj from the event (call user callback)

I haven't quite figured out how to synchronize the first key value with the event-based updates. We might need to expose the root sequence number in the dirent so we can correlate a dirent obtained from kvs_get_treeobj (".") with one obtained from the setroot event.

There might be a bit more latency getting updates this way, since the setroot event only triggers the sequence to obtain the next item; there has to be an RTT with the local kvs module before the new value can be passed to the user callback. I think we can hide all of that so it remains asynchronous (reactor driven), e.g. the setroot event handler sends an rpc and registers a completion, and the completion calls the user callback.
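To make the proposed flow concrete, here is an illustrative sketch of the client-side handling, under the assumption that the setroot event carries the new root treeobj plus the list of modified keys. The watcher struct and issue_getat_rpc() are placeholders for whatever plumbing libkvs would actually use, not real API:

    #include <string.h>
    #include <stdbool.h>

    struct watcher {
        const char *key;    /* key registered with kvs_watch() */
        void (*cb) (const char *key, const char *json_str, void *arg);
        void *arg;
    };

    /* Placeholder: send kvs.get (with treeobj) to the local kvs module and
     * register a continuation that calls cb (key, value, arg) when the
     * response arrives, so the sequence stays reactor driven.
     */
    void issue_getat_rpc (const char *root_treeobj, const char *key,
                          void (*cb) (const char *, const char *, void *),
                          void *arg);

    /* Return true if 'key' appears in the keys list carried by a
     * kvs.setroot event.
     */
    static bool setroot_touches_key (const char **keys, int nkeys,
                                     const char *key)
    {
        for (int i = 0; i < nkeys; i++)
            if (!strcmp (keys[i], key))
                return true;
        return false;
    }

    /* Called from the kvs.setroot event handler with the decoded payload:
     * the new root treeobj and the list of keys modified by the commit.
     */
    static void on_setroot (struct watcher *w, const char *root_treeobj,
                            const char **keys, int nkeys)
    {
        if (setroot_touches_key (keys, nkeys, w->key))
            issue_getat_rpc (root_treeobj, w->key, w->cb, w->arg);
    }

The glob case mentioned below would simply swap the strcmp() for an fnmatch()-style match.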

Work in progress, but thought I'd post for early feedback.

@garlick added the review label Sep 15, 2016
@codecov-io commented Sep 15, 2016

Current coverage is 75.00% (diff: 88.78%)

Merging #810 into master will increase coverage by 0.06%

@@             master       #810   diff @@
==========================================
  Files           145        145          
  Lines         24911      24992    +81   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          18666      18744    +78   
- Misses         6245       6248     +3   
  Partials          0          0          
Diff Coverage by File
   82%  src/modules/kvs/kvs.c
   90%  src/cmd/flux-kvs.c
   92%  src/modules/kvs/libkvs.c
  100%  src/modules/kvs/proto.c
  100%  src/modules/kvs/json_dirent.c

Last update b532e12...a1620c3

@coveralls: Coverage increased (+0.1%) to 75.372% when pulling d1913e6 on garlick:kvs_watch into b532e12 on flux-framework:master.

@grondo (Contributor) commented Sep 15, 2016

This seems very nice!

My very uninformed first observation is that moving kvs_watch into the handle implementation will require every handle using kvs_watch to receive and parse every kvs.setroot event, searching for one key (or a glob) in the event's keys array (if I've understood correctly). This seems like a win for handles that watch many keys, but handles (modules, api users) that only watch a single key will see much more activity (a wakeup for every kvs.setroot event)?

The kvs_getat() interface is really cool! I can't even conceive of all the things this could be used for...

@garlick (Member, Author) commented Sep 15, 2016

Thanks @grondo - that's the sort of early feedback I was looking for. I think you're probably spot-on.

I wish there was a way to map watch notifications onto events, and then one would simply subscribe to the keys of interest.

I suppose we could try the easy thing and generate a globally distributed event per key instead of packing the list into the setroot event. An event with no payload only takes up a few bytes...
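For illustration only, the subscriber side of that per-key alternative might look something like the sketch below; the kvs.key.<name> topic convention is invented here and no such events exist:

    #include <stdio.h>
    #include <errno.h>
    #include <flux/core.h>

    /* Subscribe to a hypothetical payload-free event that would be
     * published for each key modified by a commit.  Topic naming is
     * invented for this sketch.
     */
    static int subscribe_key_event (flux_t *h, const char *key)
    {
        char topic[512];
        int n = snprintf (topic, sizeof (topic), "kvs.key.%s", key);
        if (n < 0 || n >= (int)sizeof (topic)) {
            errno = EINVAL;
            return -1;
        }
        return flux_event_subscribe (h, topic);
    }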

@grondo (Contributor) commented Sep 15, 2016

Would deleted keys be included in either the kvs.setroot keys list or the global per-key events? I'm thinking of a purge of 1000 job entries under a single commit. If a basedir is unlinked from the kvs, I'm assuming an event would only be generated for basedir, and not for every path under basedir, and therefore a watcher of the basedir.a.b.c key would have to infer that the key of interest is deleted and invoke callbacks? I assume this would require an annotation in the event that the key was deleted (so the watcher would know to match on the base directory of keys, not the literal key path itself). (I could be missing something obvious though.)

Hm, does kvs_watch callback fire in the current code if a key is deleted via unlink of a parent directory?

@garlick (Member, Author) commented Sep 15, 2016

An unlink would be listed once in the keys list (or have one event), and you are right, there would be no list entry/event for a key contained within an unlinked directory. I hadn't thought about that. I had only thought about the opposite case, where a change to a key implies that its parent directories also change.

I've been pondering the volume of events in this proposed change off and on this morning and haven't really come up with a good idea to limit them to an acceptable level while keeping the design simple. The way change notifications are distributed now is pretty focused.

To answer your final question: yes, a kvs_watch callback should fire now if the key's parent is unlinked. The way watched keys are monitored now is that the kvs module (on every rank) executes a full get for every watched key after every commit and compares the result with the previous one.
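A rough sketch of that per-commit scan; the watch list and the lookup callback here are simplified stand-ins for illustration, not the actual kvs module internals:

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    struct watch {
        char *key;
        char *prev;         /* last value delivered to the watcher */
        void (*notify) (const char *key, const char *val, void *arg);
        void *arg;
        struct watch *next;
    };

    /* After each commit: re-read every watched key against the new root
     * ('lookup' stands in for a full get) and notify only on change.
     */
    static void check_watchers (struct watch *head,
                                char *(*lookup) (const char *key, void *ctx),
                                void *ctx)
    {
        for (struct watch *w = head; w != NULL; w = w->next) {
            char *val = lookup (w->key, ctx);
            bool changed = (val == NULL) != (w->prev == NULL)
                        || (val && w->prev && strcmp (val, w->prev) != 0);
            if (changed) {
                w->notify (w->key, val, w->arg);
                free (w->prev);
                w->prev = val ? strdup (val) : NULL;
            }
            free (val);
        }
    }

As described above, this scan runs on every rank after every commit, whether or not the watched keys actually changed, which is the overhead this PR aims to reduce.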

Problem: a kvs.watch notification is not sent for every
update of a watched key if two or more commits operating
on the key are merged by the kvs for performance.  In
that case, a watcher is only notified once for the single,
merged commit.

This may be a problem for some use cases, for example
a finite state machine may use the kvs to communicate
state by committing each change to a watched key.  Watchers
cannot follow the complete state if intermediate steps are lost.

Solution: check commits before merging them to see if they
change the same keys to different values, and abort the merge
if so.  This check does not apply to directories modified
as a side effect of a commit; only the fully qualified keys
supplied in the commit messages are checked for duplicates.
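A condensed sketch of the check described in that commit message; the op representation is simplified for illustration (the real commit ops are JSON dirent operations):

    #include <stdbool.h>
    #include <string.h>

    struct op {
        const char *key;      /* fully qualified key named in the commit */
        const char *value;    /* encoded value, or NULL for an unlink */
    };

    static bool values_equal (const char *a, const char *b)
    {
        if (a == NULL || b == NULL)
            return a == b;
        return strcmp (a, b) == 0;
    }

    /* Return true if it is safe to merge the two pending commits, i.e.
     * they do not set the same key to different values.
     */
    static bool commits_mergeable (const struct op *a, int na,
                                   const struct op *b, int nb)
    {
        for (int i = 0; i < na; i++)
            for (int j = 0; j < nb; j++)
                if (!strcmp (a[i].key, b[j].key)
                    && !values_equal (a[i].value, b[j].value))
                    return false;   /* same key, different value: no merge */
        return true;
    }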
@coveralls: Coverage increased (+0.07%) to 75.316% when pulling a1620c3 on garlick:kvs_watch into b532e12 on flux-framework:master.

@garlick (Member, Author) commented Sep 16, 2016

Maybe the best approach for this PR is to carry through with the original plan, but "filter" the setroot events, forwarding them point to point to watchers. It will still shift the burden of looking up changed keys onto the client, but the number of client wakeups will be on the same order as before, or doubled I guess, because each changed key requires a kvs_getat() to pick up the new value. The overhead in the kvs module will be much reduced, especially for keys that don't change often.

The one-rpc, multiple-responses pattern still bugs me. I thought I might try an idea here like the stream idea from #271, where the client opens a "channel" to a service in order to receive messages on it. The kvs module could use this channel to forward filtered setroot events point to point. Possibly this idea could later be generalized to allow routing of any type of message point to point through the TBON (like a tunnel).

@grondo (Contributor) commented Sep 16, 2016

Who would filter setroot events? The kvs module on each rank? So the client wouldn't have to subscribe to all kvs.setroot events, but it would still get one message per kvs.setroot event and have to filter keys? Is this actually the same proposal as earlier, just with no global event?

@garlick (Member, Author) commented Sep 16, 2016

Exactly.

The kvs watch-mt test starts N threads watching the same
key, then modifies the key monotonically M times.  Now that
watch is guaranteed to get a callback on every key change, this
test can be tightened up to verify monotonicity of the watched key.

Add a run of the test on rank 1 to ensure a "slave" KVS behaves
the same as the "master".
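The tightened assertion might amount to something like the following inside each watch callback; names and types here are illustrative, not the actual test code:

    #include <assert.h>

    struct watch_state {
        int last;   /* value seen in the previous callback; -1 initially */
    };

    /* With a callback guaranteed for every change, each new value must be
     * strictly greater than the last one seen, or a change was lost.
     */
    static void check_monotonic (int value, struct watch_state *s)
    {
        assert (value > s->last);
        s->last = value;
    }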
Include an array of (de-duplicated) keys that were modified
by the commit in the kvs.setroot event.  This prepares the
way for more efficient kvs_watch() handling as described in flux-framework#803.

Update test for kvs.setroot codec.
Allow an optional root treeobj to accompany a kvs.get
request.  If present, key will be looked up relative to
this directory.  It is an EINVAL error if the treeobj
is invalid or not a directory reference.

Update test for kvs.get codec.
Add a new function kvs_getat() which is like kvs_get()
but has a new "treeobj" argument that specifies the root
(or other directory) "snapshot" in which the key will be
looked up.
Exercise flux-kvs getat via t1000-kvs-basic.t sharness test.
@coveralls: Coverage increased (+0.06%) to 75.308% when pulling 261ee47 on garlick:kvs_watch into b532e12 on flux-framework:master.

@garlick (Member, Author) commented Sep 21, 2016

Closing this PR and will resubmit a PR for kvs_getat().

I've opened a couple of issues #812 and #813 on some kvs_watch() correctness problems and will try to solve those in another PR.

The performance issue is perhaps a lower priority and could be revisited when we switch to the RFC 11 "Treeobj" metadata format.
