KVS refactor of lookup() / walk() functions #1066

chu11 · 2017-05-17T00:51:28Z

In this PR, we convert the KVS walk function from a recursive implementation into a non-recursive one and simplified the logic as well. The predominant logic changes are:

A) In preparation for maintaining "state" so that RPCs aren't fully replayed, the entire path of the kvs get request is pre-parsed and stored in a list.

B) The walk() function has been made non-recursive and "depth" is maintained in a stack. This is also in preparation for maintaining "state" for removal of RPC replay.

C) A variety of code logic changes to make the walk() function simpler.

The predominant logic changes that may look strange in this PR are:

A) The pre-parsing of the path makes things more inefficient for the time being b/c RPCs are still completely replayed (i.e. it is fully parsed on every RPC-then continuation).

B) There are some paths (such as through get_request_cb()) in which a load() will be called twice on the same root reference. This will be resolved when lookup() is refactored. This PR is only sticking to walk().

I still need to run through valgrind to make sure I didn't mem-leak anything. But wanted to throw the PR up here for a first skim.

coveralls · 2017-05-17T01:09:59Z

Coverage increased (+0.04%) to 78.076% when pulling 4a767bd on chu11:kvsrefactor-walk into 61bae10 on flux-framework:master.

codecov-io · 2017-05-17T01:10:19Z

Codecov Report

Merging #1066 into master will increase coverage by 0.09%.
The diff coverage is 91.91%.

@@            Coverage Diff             @@
##           master    #1066      +/-   ##
==========================================
+ Coverage   77.81%   77.91%   +0.09%     
==========================================
  Files         148      150       +2     
  Lines       25764    25965     +201     
==========================================
+ Hits        20049    20231     +182     
- Misses       5715     5734      +19

Impacted Files	Coverage Δ
src/modules/kvs/cache.c	`93.65% <100%> (+0.15%)`	⬆️
src/modules/kvs/kvs.c	`79.76% <90.27%> (-0.65%)`	⬇️
src/modules/kvs/json_util.c	`90.9% <90.9%> (ø)`
src/modules/kvs/lookup.c	`92.33% <92.33%> (ø)`
src/common/libutil/dirwalk.c	`92.59% <0%> (-0.75%)`	⬇️
src/common/libflux/rpc.c	`91.27% <0%> (-0.73%)`	⬇️
src/broker/content-cache.c	`72.54% <0%> (-0.44%)`	⬇️
src/modules/kvs/libkvs.c	`75.38% <0%> (-0.12%)`	⬇️
src/broker/broker.c	`72.58% <0%> (-0.1%)`	⬇️
src/common/libflux/message.c	`81.68% <0%> (ø)`	⬆️
... and 4 more

garlick · 2017-05-17T16:45:46Z

Great job @chu11 - this was complicated, messy, and critical code and now it is definitely heading in the right direction.

As we discussed privately, I think clarity (and my ability to review these changes) would improve significantly if walk() could be broken out to a separate C module with its public and private bits clearly separated and its purpose and function documented. Ideally for the history, the individual commits to move the function to a new file, and the changes to the function would be kept separate, although if that proves too difficult, I'd be willing to sacrifice the diff(s) between old and new walk functions in the history, since it's pretty hard to follow anyway.

Note the work on "futures" in #1053. Possibly walk or some future more reusable incarnation could return a flux_future_t that encapsulates its state and the ability to call it in both synchronous and asynchronous modes. That's just something to keep in the back of your mind, not concrete feedback for this PR.

chu11 · 2017-05-17T17:31:06Z

@garlick It was a little hard to splice out b/c of the dependencies (such as the load() function). In my mind I was hoping to do the file split later, but perhaps it might be better to do that now as a checkpoint.

Thinking about it, I think I see a way to slice things out now that might be pretty good. As you mentioned privately, load() can probably be migrated into the cache API. Finally de-coupling that from the main kvs file.

garlick · 2017-05-17T17:43:34Z

If it can be done in the near term it would probably be good to do it in this PR. If it's a really big project though, then maybe we should consider putting this in as is.

chu11 · 2017-05-17T20:52:20Z

My idea to splice things out didn't work out. A lot of it centers on the fact that the load() function performs cache operations, rpcs, and waits. So it's sort of has a dependency on everything.

I sort of thought this was a good "checkpoint" for a PR, but perhaps its not. My general idea was:

refactor walk() w/ state
refactor lookup() w/ state
refactor get_request_cb() to utilize state
refactor get_request_cb() to call load() instead of lookup() and walk(). The latter functions now call cache_lookup(), and on failure pass back a reference for get_request_cb() to call load with.
splice out lookup() and walk() into new file, as they only call cache API and no longer deal with RPCs.

The order was predominantly based on the fact that w/ "state", it would be easy to pass back information to get_request_cb(). However, 4 & 5 could be done before 1 & 2 & 3, just figured it would be uglier. Lemme see if that'll work out easier/harder than I thought.

chu11 · 2017-05-18T00:32:47Z

It ended up doing step 4 above wasn't as difficult as I thought it'd be. Perhaps the code isn't the prettiest, but it's not bad. It'll allow us to splice off the walk() function into a separate API in this PR.

I'll issue a new PR for the load() refactoring.

garlick · 2017-05-18T19:02:05Z

@chu11 sorry for not providing more review, wasn't sure if I was waiting for more or not.
Superficially it looks good to me but I don't feel confident that I'm understanding/remembering every corner case. I do think we have decent test coverage for this so the fact that tests are passing is a good indication.

Were you able to verify that memory isn't leaking?

Another thing to check just to be sure we're not causing problems for others would be the soak test. We could get a baseline that could serve all the KVS rework activity and then run it for successive PR's. (Some past results are posted to various closed issues - e.g. #614)

I'm running an errand and when I return around 2 I'll have one more spin through the code.

chu11 · 2017-05-18T19:20:43Z

@garlick Yeah, I ran a bunch of valgrind runs using the src/test/valgrind.sh. Also wrote an alternate cheapo workload consisting of a collection of simple kvs operations. The leak summary at the very end didn't change, which I think is the most important thing to verify didn't change.

I'm presently re-building this tree to be rebased under the patches from PR #1069. With PR #1069, I can splice lookup() and walk() into a new file. So hopefully this PR will be easier to understand, as all the changes will be more isolated.

garlick · 2017-05-18T19:49:16Z

That sounds perfect, thanks.

…

On May 18, 2017 12:20 PM, "Al Chu" ***@***.***> wrote: @garlick <https://github.com/garlick> Yeah, I ran a bunch of valgrind runs using the src/test/valgrind.sh. Also wrote an alternate cheapo workload consisting of a collection of simple kvs operations. The leak summary at the very end didn't change, which I think is the most important thing to verify didn't change. I'm presently re-building this tree to be rebased under the patches from PR #1069 <#1069>. With PR #1069 <#1069>, I can splice lookup() and walk() into a new file. So hopefully this PR will be easier to understand, as all the changes will be more isolated. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1066 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAKX23olmmgaPtCWmYdMeCLJ-dlSK3Gbks5r7JoLgaJpZM4NdP5d> .

chu11 · 2017-05-18T23:22:18Z

Pushed an update, based on the branch from #1069 . With that branch, I can now split out the lookup() and walk() functions into its own file. The nice thing is now lookup() is the only public facing function. Everything behind that is now private (walk() as well as my collection of helper functions and helper structs).

The bad thing is that with lookup() being the "public" function of this file, it looks sorta bad given I haven't completed the lookup-refactor. Which sort of makes this branch look sad and incomplete. Although I was hoping to "checkpoint" with refactoring just walk(), perhaps I'll have to do the rest in this PR. Which will ultimately end up being a really large patch series.

coveralls · 2017-05-18T23:37:14Z

Coverage increased (+0.02%) to 78.053% when pulling e30c8ac on chu11:kvsrefactor-walk into 583c8ec on flux-framework:master.

garlick · 2017-05-19T00:41:33Z

I think it'd be OK to continue on with the refactoring in this PR. After all it's a pretty focused topic.

Wow, it's really nice having that split off. I wonder if we can do anything about the appalling number of in/out parameters in this function? That might bear some pondering. Maybe we could turn it into more of a class - for example, maybe

struct lookup *lookup_create (struct cache *cache);
int lookup_set_epoch (struct lookup *l, int epoch); // call from heartbeat_cb
...

seems like root, rootdir, and root_dirent should be packaged up in a struct?

maybe the result (alp, load_ref, stall, ep) could also be packaged up?

(I'm probably not seeing all the ramifications of what I'm suggesting, but it would be nice if that parameter list was a little easier to look at :-)

Great work.

garlick · 2017-05-19T00:43:29Z

I meant to add: unit test? cache.c has one so maybe it would not be too far fetched for lookup.c since the cache is its main dependency? If you're going to be doing more refactoring you may find it is worth adding one sooner rather than later?

chu11 · 2017-05-19T00:52:11Z

@garlick Yeah, the appalling number of parameters is something I'm working on. In a prior play-branch I had a struct lookup that I passed around, but now that it's split off into it's own file, I was thinking it'll be an API in the vein of what you're suggesting.

Unit tests were something I was going to do after the refactor of lookup(). I don't even know what the final API is even going to look like, so I decided it would be better for later.

If we're going to go full steam ahead here w/ a mega PR here, should we just drop PR #1069? I think its an ok "mini-checkpoint", but it's only 4 commits. Maybe doesn't matter at this point.

garlick · 2017-05-19T01:08:50Z

I'm OK with the mega PR approach, especially if we end up taking a few wrong turns and can squash that down in the history. If this PR is a superset of that one, and you agree with the approach, then yeah I guess that PR could be abandoned.

chu11 · 2017-05-20T13:42:05Z

Finished up something that looks near complete yesterday evening. Still need to valgrind and soak it and add unit tests. The key completion is that now get_request_cb and watch_request_cb no longer replay rpcs fully, they maintain state and pickup things in the middle. lookup() is now API-ized.

I think I can splice out a few cleanup and "setup convenience functions" PRs out of this to shorten the number of commits.

coveralls · 2017-05-20T13:57:52Z

Coverage increased (+0.01%) to 78.053% when pulling a1f1db4 on chu11:kvsrefactor-walk into dc46d77 on flux-framework:master.

Add extra tests for invalid cache entries. Test that checks for invalid entries work and expire fails for non valid entries.

Add convenience function cache_lookup_and_get_json(), which returns a json value of a cache entry if it is found and the cache entry is deemed valid.

Create new json_util files and place the copydir() and compare_json() functions in there. Rename functions to json_object_copydir() and json_compare() respectively.

The load() function may load data from the KVS cache or it may issue a rpc to retrieve data from the broker content cache if the data is not yet locally cached. The load() function's dependency on both cache and rpc APIs makes it difficult to splice out functions that call load() into separate APIs. This patch removes the call of load() from the lookup() and walk() functions. In lookup() and walk(), data is instead looked up only in the local KVS cache. If it is not stored there locally, it is the responsibility of the original caller to then call load() to load the missing data. A in/out parameter will pass the missing reference all the way back to the original caller.

Splice out lookup() and walk() into new file lookup.[ch].

Refactor walk function logic in preparation for further refactoring. - Move parsing of terminal path component from outside of loop into loop, adjusting loop logic accordingly. - Increment depth counter on recursive step instead of begining of function.

coveralls · 2017-05-23T04:19:17Z

Coverage increased (+0.1%) to 78.171% when pulling 3d4b6c2 on chu11:kvsrefactor-walk into 32268ce on flux-framework:master.

Remove recursion from walk() function and implement code logic non-recursively. Use stack to manage "recursive" calls through links.

Refactor walk() logic to be simpler by passing in initial root dirent and not root directory. The adjusted logic is effectively changed from: load & set dir = root dir while (path components left) dirent = directory entry of path component if dirent is a link recursively resolve dirent if not last path component load & set dir = dir of dirent next path component to: set dirent = directory entry of root while (path components left) load & set dir = dir of dirent dirent = directory entry of component if dirent is a link recursively resolve dirent next path component This makes the logic simpler and removes some duplicate code.

Remove call of load() before calling lookup(), as it is no longer necessary with refactored walk() using root_dirent. Adjust parameters that need to be passed into lookup() based on this adjustment. Adjust logic of callers, as wait_create_msg_handler() can be called later.

Place lookup() parameters in a struct for passing variables in and out. This is beginning refactoring for a more thorough lookup() api.

With the lookup_handle now existing, the user no longer is required to pass root_dirent to the lookup() call. It can be calculated and maintained internally.

Cleanup code now that lookup handle exists: - Move levels & other data structures into lookup handle. - Only pass lookup handle as parameter when appropriate. - Cleanup error/stall paths.

garlick · 2017-05-23T15:34:07Z

In walk(), zlist_remove is called within a walk of the list. I was going to call this out as unsafe but it turns out that with commit zeromq/czmq@cf1f485 (in 2013, in v2.0.0!) czmq addressed this and it is now safe.

garlick · 2017-05-23T16:36:11Z

The refactoring and isolation of lookup() is great. The extra work avoided by not starting from the root at each stall/restart is great. The new tests are great. We've verified with valgrind and the soak test that these changes don't introduce any obvious regressions. Still a lot of KVS refactoring needed, but this step forward makes it all seem more tractable.

I think we should go ahead and get this merged. What do you think @chu11? Ready?

chu11 · 2017-05-23T17:21:08Z

@garlick Close, trying to eek out a few more performance tweaks (removed an uncommon branch check), used a switch statement on lookup "states" which I think is cleaner and might be faster too.

coveralls · 2017-05-23T17:42:02Z

Coverage increased (+0.1%) to 78.18% when pulling ea8f232 on chu11:kvsrefactor-walk into 32268ce on flux-framework:master.

Add a state variable to lookup handle so that prior actions completed by a lookup are not repeated.

Make lookup_t struct private and create set of API functions to get value results and add convenience get/set functions. Also add convenience function for testing.

With lookup now API-ized, do not exit on unexpected errors, instead return errors that are appropriate. Do not exit when walking a path and reaching a DIRVAL key type or an unexpected type. Instead return EPERM. When reading a linkval with READLINK, if the user sets READDIR, return ENOTDIR.

Do not recreate lookup data structure when rpcs are replayed due to stalls. State maintained in lookup data structure now allows rpcs to continue from the last point they ended instead of replaying from the beginning.

chu11 · 2017-05-23T22:54:19Z

@garlick From the point of your last comment, converted the if-state-branch logic in lookup() to be a switch, as I think it makes the code easier to understand. Also added more stall unit tests as I realized I didn't do the stall unit tests correctly (I was re-starting the lookups from the beginning again, which is not the right idea). A few minor cleanups. But other than that, I think the branch is good to go.

garlick · 2017-05-23T22:56:43Z

OK I think it's good to go once travis passes.

coveralls · 2017-05-23T23:07:50Z

Coverage increased (+0.1%) to 78.174% when pulling 186547d on chu11:kvsrefactor-walk into 32268ce on flux-framework:master.

coveralls · 2017-05-23T23:21:17Z

Coverage increased (+0.1%) to 78.163% when pulling 186547d on chu11:kvsrefactor-walk into 32268ce on flux-framework:master.

chu11 mentioned this pull request May 18, 2017

Refactor how data is loaded in KVS walk/lookup #1069

Closed

chu11 requested a review from garlick May 18, 2017 18:10

chu11 force-pushed the kvsrefactor-walk branch from 4a767bd to e30c8ac Compare May 18, 2017 23:17

chu11 changed the title ~~KVS refactor of walk() function~~ KVS refactor of lookup() / walk() functions May 19, 2017

chu11 force-pushed the kvsrefactor-walk branch from e30c8ac to a1f1db4 Compare May 20, 2017 13:38

chu11 mentioned this pull request May 20, 2017

KVS Cleanup #1073

Merged

chu11 added 6 commits May 20, 2017 16:52

modules/kvs/test: Add extra cache tests

ebaf792

Add extra tests for invalid cache entries. Test that checks for invalid entries work and expire fails for non valid entries.

modules/kvs: Add cache_lookup_and_get_json()

4454cc9

Add convenience function cache_lookup_and_get_json(), which returns a json value of a cache entry if it is found and the cache entry is deemed valid.

modules/kvs: Add json_util convenience files

2e1e858

Create new json_util files and place the copydir() and compare_json() functions in there. Rename functions to json_object_copydir() and json_compare() respectively.

modules/kvs: Splice out functions into new file

d655309

Splice out lookup() and walk() into new file lookup.[ch].

chu11 force-pushed the kvsrefactor-walk branch from 5114db5 to 3d4b6c2 Compare May 23, 2017 04:01

chu11 added 6 commits May 23, 2017 08:28

modules/kvs: Refactor walk() to be non-recursive

5515bd6

Remove recursion from walk() function and implement code logic non-recursively. Use stack to manage "recursive" calls through links.

modules/kvs: Refactor lookup() call

b5dcdf5

Place lookup() parameters in a struct for passing variables in and out. This is beginning refactoring for a more thorough lookup() api.

modules/kvs: Remove root_dirent in lookup() call

c93c42a

With the lookup_handle now existing, the user no longer is required to pass root_dirent to the lookup() call. It can be calculated and maintained internally.

modules/kvs: Cleanup lookup and walk code.

00628fc

Cleanup code now that lookup handle exists: - Move levels & other data structures into lookup handle. - Only pass lookup handle as parameter when appropriate. - Cleanup error/stall paths.

chu11 force-pushed the kvsrefactor-walk branch from 3d4b6c2 to ea8f232 Compare May 23, 2017 17:18

chu11 added 4 commits May 23, 2017 13:50

modules/kvs: Add state info to lookup handle

54ccf6d

Add a state variable to lookup handle so that prior actions completed by a lookup are not repeated.

modules/kvs: Make lookup_t private, add lookup API

d63c29f

Make lookup_t struct private and create set of API functions to get value results and add convenience get/set functions. Also add convenience function for testing.

modules/kvs: Do not fully replay get/watch rpcs

30d8a12

Do not recreate lookup data structure when rpcs are replayed due to stalls. State maintained in lookup data structure now allows rpcs to continue from the last point they ended instead of replaying from the beginning.

chu11 force-pushed the kvsrefactor-walk branch from ea8f232 to f319983 Compare May 23, 2017 22:47

modules/kvs/test: Add lookup unit tests

186547d

chu11 force-pushed the kvsrefactor-walk branch from f319983 to 186547d Compare May 23, 2017 22:51

garlick merged commit e6560ce into flux-framework:master May 24, 2017

garlick mentioned this pull request May 25, 2017

Valgrind fixes #1076

Merged

garlick mentioned this pull request Jul 3, 2017

Track Latest Change to flux-core flux-framework/flux-sched#251

Closed

grondo mentioned this pull request Aug 23, 2017

0.8.0 Release #1160

Closed

chu11 mentioned this pull request May 10, 2018

kvs_watch() can miss values on stalled read #812

Closed

chu11 deleted the kvsrefactor-walk branch June 5, 2021 17:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KVS refactor of lookup() / walk() functions #1066

KVS refactor of lookup() / walk() functions #1066

chu11 commented May 17, 2017

coveralls commented May 17, 2017

codecov-io commented May 17, 2017 •

edited

Loading

garlick commented May 17, 2017

chu11 commented May 17, 2017

garlick commented May 17, 2017

chu11 commented May 17, 2017

chu11 commented May 18, 2017

garlick commented May 18, 2017

chu11 commented May 18, 2017

garlick commented May 18, 2017 via email

chu11 commented May 18, 2017

coveralls commented May 18, 2017

garlick commented May 19, 2017

garlick commented May 19, 2017

chu11 commented May 19, 2017 •

edited

Loading

garlick commented May 19, 2017

chu11 commented May 20, 2017 •

edited

Loading

coveralls commented May 20, 2017

coveralls commented May 23, 2017

garlick commented May 23, 2017 •

edited

Loading

garlick commented May 23, 2017

chu11 commented May 23, 2017

coveralls commented May 23, 2017

chu11 commented May 23, 2017

garlick commented May 23, 2017

coveralls commented May 23, 2017

coveralls commented May 23, 2017

KVS refactor of lookup() / walk() functions #1066

KVS refactor of lookup() / walk() functions #1066

Conversation

chu11 commented May 17, 2017

coveralls commented May 17, 2017

codecov-io commented May 17, 2017 • edited Loading

Codecov Report

garlick commented May 17, 2017

chu11 commented May 17, 2017

garlick commented May 17, 2017

chu11 commented May 17, 2017

chu11 commented May 18, 2017

garlick commented May 18, 2017

chu11 commented May 18, 2017

garlick commented May 18, 2017 via email

chu11 commented May 18, 2017

coveralls commented May 18, 2017

garlick commented May 19, 2017

garlick commented May 19, 2017

chu11 commented May 19, 2017 • edited Loading

garlick commented May 19, 2017

chu11 commented May 20, 2017 • edited Loading

coveralls commented May 20, 2017

coveralls commented May 23, 2017

garlick commented May 23, 2017 • edited Loading

garlick commented May 23, 2017

chu11 commented May 23, 2017

coveralls commented May 23, 2017

chu11 commented May 23, 2017

garlick commented May 23, 2017

coveralls commented May 23, 2017

coveralls commented May 23, 2017

codecov-io commented May 17, 2017 •

edited

Loading

chu11 commented May 19, 2017 •

edited

Loading

chu11 commented May 20, 2017 •

edited

Loading

garlick commented May 23, 2017 •

edited

Loading