Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KVS refactor of lookup() / walk() functions #1066

Merged
merged 20 commits into from
May 24, 2017

Conversation

chu11
Copy link
Member

@chu11 chu11 commented May 17, 2017

In this PR, we convert the KVS walk function from a recursive implementation into a non-recursive one and simplified the logic as well. The predominant logic changes are:

A) In preparation for maintaining "state" so that RPCs aren't fully replayed, the entire path of the kvs get request is pre-parsed and stored in a list.

B) The walk() function has been made non-recursive and "depth" is maintained in a stack. This is also in preparation for maintaining "state" for removal of RPC replay.

C) A variety of code logic changes to make the walk() function simpler.

The predominant logic changes that may look strange in this PR are:

A) The pre-parsing of the path makes things more inefficient for the time being b/c RPCs are still completely replayed (i.e. it is fully parsed on every RPC-then continuation).

B) There are some paths (such as through get_request_cb()) in which a load() will be called twice on the same root reference. This will be resolved when lookup() is refactored. This PR is only sticking to walk().

I still need to run through valgrind to make sure I didn't mem-leak anything. But wanted to throw the PR up here for a first skim.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.04%) to 78.076% when pulling 4a767bd on chu11:kvsrefactor-walk into 61bae10 on flux-framework:master.

@codecov-io
Copy link

codecov-io commented May 17, 2017

Codecov Report

Merging #1066 into master will increase coverage by 0.09%.
The diff coverage is 91.91%.

@@            Coverage Diff             @@
##           master    #1066      +/-   ##
==========================================
+ Coverage   77.81%   77.91%   +0.09%     
==========================================
  Files         148      150       +2     
  Lines       25764    25965     +201     
==========================================
+ Hits        20049    20231     +182     
- Misses       5715     5734      +19
Impacted Files Coverage Δ
src/modules/kvs/cache.c 93.65% <100%> (+0.15%) ⬆️
src/modules/kvs/kvs.c 79.76% <90.27%> (-0.65%) ⬇️
src/modules/kvs/json_util.c 90.9% <90.9%> (ø)
src/modules/kvs/lookup.c 92.33% <92.33%> (ø)
src/common/libutil/dirwalk.c 92.59% <0%> (-0.75%) ⬇️
src/common/libflux/rpc.c 91.27% <0%> (-0.73%) ⬇️
src/broker/content-cache.c 72.54% <0%> (-0.44%) ⬇️
src/modules/kvs/libkvs.c 75.38% <0%> (-0.12%) ⬇️
src/broker/broker.c 72.58% <0%> (-0.1%) ⬇️
src/common/libflux/message.c 81.68% <0%> (ø) ⬆️
... and 4 more

@garlick
Copy link
Member

garlick commented May 17, 2017

Great job @chu11 - this was complicated, messy, and critical code and now it is definitely heading in the right direction.

As we discussed privately, I think clarity (and my ability to review these changes) would improve significantly if walk() could be broken out to a separate C module with its public and private bits clearly separated and its purpose and function documented. Ideally for the history, the individual commits to move the function to a new file, and the changes to the function would be kept separate, although if that proves too difficult, I'd be willing to sacrifice the diff(s) between old and new walk functions in the history, since it's pretty hard to follow anyway.

Note the work on "futures" in #1053. Possibly walk or some future more reusable incarnation could return a flux_future_t that encapsulates its state and the ability to call it in both synchronous and asynchronous modes. That's just something to keep in the back of your mind, not concrete feedback for this PR.

@chu11
Copy link
Member Author

chu11 commented May 17, 2017

@garlick It was a little hard to splice out b/c of the dependencies (such as the load() function). In my mind I was hoping to do the file split later, but perhaps it might be better to do that now as a checkpoint.

Thinking about it, I think I see a way to slice things out now that might be pretty good. As you mentioned privately, load() can probably be migrated into the cache API. Finally de-coupling that from the main kvs file.

@garlick
Copy link
Member

garlick commented May 17, 2017

If it can be done in the near term it would probably be good to do it in this PR. If it's a really big project though, then maybe we should consider putting this in as is.

@chu11
Copy link
Member Author

chu11 commented May 17, 2017

My idea to splice things out didn't work out. A lot of it centers on the fact that the load() function performs cache operations, rpcs, and waits. So it's sort of has a dependency on everything.

I sort of thought this was a good "checkpoint" for a PR, but perhaps its not. My general idea was:

  1. refactor walk() w/ state
  2. refactor lookup() w/ state
  3. refactor get_request_cb() to utilize state
  4. refactor get_request_cb() to call load() instead of lookup() and walk(). The latter functions now call cache_lookup(), and on failure pass back a reference for get_request_cb() to call load with.
  5. splice out lookup() and walk() into new file, as they only call cache API and no longer deal with RPCs.

The order was predominantly based on the fact that w/ "state", it would be easy to pass back information to get_request_cb(). However, 4 & 5 could be done before 1 & 2 & 3, just figured it would be uglier. Lemme see if that'll work out easier/harder than I thought.

@chu11
Copy link
Member Author

chu11 commented May 18, 2017

It ended up doing step 4 above wasn't as difficult as I thought it'd be. Perhaps the code isn't the prettiest, but it's not bad. It'll allow us to splice off the walk() function into a separate API in this PR.

I'll issue a new PR for the load() refactoring.

@garlick
Copy link
Member

garlick commented May 18, 2017

@chu11 sorry for not providing more review, wasn't sure if I was waiting for more or not.
Superficially it looks good to me but I don't feel confident that I'm understanding/remembering every corner case. I do think we have decent test coverage for this so the fact that tests are passing is a good indication.

Were you able to verify that memory isn't leaking?

Another thing to check just to be sure we're not causing problems for others would be the soak test. We could get a baseline that could serve all the KVS rework activity and then run it for successive PR's. (Some past results are posted to various closed issues - e.g. #614)

I'm running an errand and when I return around 2 I'll have one more spin through the code.

@chu11
Copy link
Member Author

chu11 commented May 18, 2017

@garlick Yeah, I ran a bunch of valgrind runs using the src/test/valgrind.sh. Also wrote an alternate cheapo workload consisting of a collection of simple kvs operations. The leak summary at the very end didn't change, which I think is the most important thing to verify didn't change.

I'm presently re-building this tree to be rebased under the patches from PR #1069. With PR #1069, I can splice lookup() and walk() into a new file. So hopefully this PR will be easier to understand, as all the changes will be more isolated.

@garlick
Copy link
Member

garlick commented May 18, 2017 via email

@chu11 chu11 force-pushed the kvsrefactor-walk branch from 4a767bd to e30c8ac Compare May 18, 2017 23:17
@chu11
Copy link
Member Author

chu11 commented May 18, 2017

Pushed an update, based on the branch from #1069 . With that branch, I can now split out the lookup() and walk() functions into its own file. The nice thing is now lookup() is the only public facing function. Everything behind that is now private (walk() as well as my collection of helper functions and helper structs).

The bad thing is that with lookup() being the "public" function of this file, it looks sorta bad given I haven't completed the lookup-refactor. Which sort of makes this branch look sad and incomplete. Although I was hoping to "checkpoint" with refactoring just walk(), perhaps I'll have to do the rest in this PR. Which will ultimately end up being a really large patch series.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 78.053% when pulling e30c8ac on chu11:kvsrefactor-walk into 583c8ec on flux-framework:master.

@garlick
Copy link
Member

garlick commented May 19, 2017

I think it'd be OK to continue on with the refactoring in this PR. After all it's a pretty focused topic.

Wow, it's really nice having that split off. I wonder if we can do anything about the appalling number of in/out parameters in this function? That might bear some pondering. Maybe we could turn it into more of a class - for example, maybe

struct lookup *lookup_create (struct cache *cache);
int lookup_set_epoch (struct lookup *l, int epoch); // call from heartbeat_cb
...

seems like root, rootdir, and root_dirent should be packaged up in a struct?

maybe the result (alp, load_ref, stall, ep) could also be packaged up?

(I'm probably not seeing all the ramifications of what I'm suggesting, but it would be nice if that parameter list was a little easier to look at :-)

Great work.

@garlick
Copy link
Member

garlick commented May 19, 2017

I meant to add: unit test? cache.c has one so maybe it would not be too far fetched for lookup.c since the cache is its main dependency? If you're going to be doing more refactoring you may find it is worth adding one sooner rather than later?

@chu11
Copy link
Member Author

chu11 commented May 19, 2017

@garlick Yeah, the appalling number of parameters is something I'm working on. In a prior play-branch I had a struct lookup that I passed around, but now that it's split off into it's own file, I was thinking it'll be an API in the vein of what you're suggesting.

Unit tests were something I was going to do after the refactor of lookup(). I don't even know what the final API is even going to look like, so I decided it would be better for later.

If we're going to go full steam ahead here w/ a mega PR here, should we just drop PR #1069? I think its an ok "mini-checkpoint", but it's only 4 commits. Maybe doesn't matter at this point.

@garlick
Copy link
Member

garlick commented May 19, 2017

I'm OK with the mega PR approach, especially if we end up taking a few wrong turns and can squash that down in the history. If this PR is a superset of that one, and you agree with the approach, then yeah I guess that PR could be abandoned.

@chu11 chu11 changed the title KVS refactor of walk() function KVS refactor of lookup() / walk() functions May 19, 2017
@chu11 chu11 force-pushed the kvsrefactor-walk branch from e30c8ac to a1f1db4 Compare May 20, 2017 13:38
@chu11
Copy link
Member Author

chu11 commented May 20, 2017

Finished up something that looks near complete yesterday evening. Still need to valgrind and soak it and add unit tests. The key completion is that now get_request_cb and watch_request_cb no longer replay rpcs fully, they maintain state and pickup things in the middle. lookup() is now API-ized.

I think I can splice out a few cleanup and "setup convenience functions" PRs out of this to shorten the number of commits.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 78.053% when pulling a1f1db4 on chu11:kvsrefactor-walk into dc46d77 on flux-framework:master.

@chu11 chu11 mentioned this pull request May 20, 2017
chu11 added 6 commits May 20, 2017 16:52
Add extra tests for invalid cache entries.  Test that checks
for invalid entries work and expire fails for non valid entries.
Add convenience function cache_lookup_and_get_json(), which
returns a json value of a cache entry if it is found and the
cache entry is deemed valid.
Create new json_util files and place the copydir() and
compare_json() functions in there.

Rename functions to json_object_copydir() and json_compare()
respectively.
The load() function may load data from the KVS cache or it may
issue a rpc to retrieve data from the broker content cache if the
data is not yet locally cached.

The load() function's dependency on both cache and rpc APIs makes
it difficult to splice out functions that call load() into separate
APIs.

This patch removes the call of load() from the lookup() and
walk() functions.  In lookup() and walk(), data is instead looked
up only in the local KVS cache.  If it is not stored there locally,
it is the responsibility of the original caller to then call load()
to load the missing data.  A in/out parameter will pass the missing
reference all the way back to the original caller.
Splice out lookup() and walk() into new file lookup.[ch].
Refactor walk function logic in preparation for further refactoring.

- Move parsing of terminal path component from outside of loop
  into loop, adjusting loop logic accordingly.

- Increment depth counter on recursive step instead of begining of
  function.
@chu11 chu11 force-pushed the kvsrefactor-walk branch from 5114db5 to 3d4b6c2 Compare May 23, 2017 04:01
@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 78.171% when pulling 3d4b6c2 on chu11:kvsrefactor-walk into 32268ce on flux-framework:master.

chu11 added 6 commits May 23, 2017 08:28
Remove recursion from walk() function and implement code logic
non-recursively.  Use stack to manage "recursive" calls through
links.
Refactor walk() logic to be simpler by passing in initial root
dirent and not root directory.  The adjusted logic is effectively
changed from:

load & set dir = root dir
while (path components left)
   dirent = directory entry of path component
   if dirent is a link
      recursively resolve dirent
   if not last path component
      load & set dir = dir of dirent
   next path component

to:

set dirent = directory entry of root
while (path components left)
   load & set dir = dir of dirent
   dirent = directory entry of component
   if dirent is a link
      recursively resolve dirent
   next path component

This makes the logic simpler and removes some duplicate code.
Remove call of load() before calling lookup(), as it is no
longer necessary with refactored walk() using root_dirent.

Adjust parameters that need to be passed into lookup() based on
this adjustment.

Adjust logic of callers, as wait_create_msg_handler() can be called
later.
Place lookup() parameters in a struct for passing variables in
and out.  This is beginning refactoring for a more thorough
lookup() api.
With the lookup_handle now existing, the user no longer is required
to pass root_dirent to the lookup() call.  It can be calculated
and maintained internally.
Cleanup code now that lookup handle exists:

- Move levels & other data structures into lookup handle.
- Only pass lookup handle as parameter when appropriate.
- Cleanup error/stall paths.
@garlick
Copy link
Member

garlick commented May 23, 2017

In walk(), zlist_remove is called within a walk of the list. I was going to call this out as unsafe but it turns out that with commit zeromq/czmq@cf1f485 (in 2013, in v2.0.0!) czmq addressed this and it is now safe.

@garlick
Copy link
Member

garlick commented May 23, 2017

The refactoring and isolation of lookup() is great. The extra work avoided by not starting from the root at each stall/restart is great. The new tests are great. We've verified with valgrind and the soak test that these changes don't introduce any obvious regressions. Still a lot of KVS refactoring needed, but this step forward makes it all seem more tractable.

I think we should go ahead and get this merged. What do you think @chu11? Ready?

@chu11 chu11 force-pushed the kvsrefactor-walk branch from 3d4b6c2 to ea8f232 Compare May 23, 2017 17:18
@chu11
Copy link
Member Author

chu11 commented May 23, 2017

@garlick Close, trying to eek out a few more performance tweaks (removed an uncommon branch check), used a switch statement on lookup "states" which I think is cleaner and might be faster too.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 78.18% when pulling ea8f232 on chu11:kvsrefactor-walk into 32268ce on flux-framework:master.

chu11 added 4 commits May 23, 2017 13:50
Add a state variable to lookup handle so that prior actions
completed by a lookup are not repeated.
Make lookup_t struct private and create set of API functions to
get value results and add convenience get/set functions.  Also
add convenience function for testing.
With lookup now API-ized, do not exit on unexpected errors, instead
return errors that are appropriate.

Do not exit when walking a path and reaching a DIRVAL key type or an
unexpected type.  Instead return EPERM.

When reading a linkval with READLINK, if the user sets READDIR, return
ENOTDIR.
Do not recreate lookup data structure when rpcs are replayed
due to stalls.  State maintained in lookup data structure now
allows rpcs to continue from the last point they ended instead
of replaying from the beginning.
@chu11 chu11 force-pushed the kvsrefactor-walk branch from ea8f232 to f319983 Compare May 23, 2017 22:47
@chu11 chu11 force-pushed the kvsrefactor-walk branch from f319983 to 186547d Compare May 23, 2017 22:51
@chu11
Copy link
Member Author

chu11 commented May 23, 2017

@garlick From the point of your last comment, converted the if-state-branch logic in lookup() to be a switch, as I think it makes the code easier to understand. Also added more stall unit tests as I realized I didn't do the stall unit tests correctly (I was re-starting the lookups from the beginning again, which is not the right idea). A few minor cleanups. But other than that, I think the branch is good to go.

@garlick
Copy link
Member

garlick commented May 23, 2017

OK I think it's good to go once travis passes.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 78.174% when pulling 186547d on chu11:kvsrefactor-walk into 32268ce on flux-framework:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 78.163% when pulling 186547d on chu11:kvsrefactor-walk into 32268ce on flux-framework:master.

@garlick garlick merged commit e6560ce into flux-framework:master May 24, 2017
@garlick garlick mentioned this pull request May 25, 2017
@grondo grondo mentioned this pull request Aug 23, 2017
@chu11 chu11 deleted the kvsrefactor-walk branch June 5, 2021 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants