clientv3: simplify watcher synchronization #6525

Merged: 2 commits merged on Oct 4, 2016

Conversation

heyitsanthony
Contributor

Was more complicated than it needed to be.

Also fixes clobbering id's on resume and losing watcher channels when watchers are disconnected and canceled.

/cc @hongchaodeng

@gyuho
Contributor

gyuho commented Sep 27, 2016

Also fixes clobbering id's on

Do we need to backport this then?

@heyitsanthony
Contributor Author

@gyuho probably, it fixes the new test case

@@ -284,6 +284,9 @@ func (w *watcher) Watch(ctx context.Context, key string, opts ...OpOption) Watch
if ok {
select {
case ret := <-retc:
if ret == nil {
Contributor

When does this happen? Can we add some comment? Thanks!

Contributor Author

this happens if retc is closed, I changed it to ret, ok := <-retc so it's clearer that's what it's checking
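
For readers less familiar with the two-value receive: a minimal sketch of why `ret, ok := <-retc` is clearer than checking `ret == nil`. The helper name `recvOrClosed` and the `*int` element type are made up for illustration and are not from the PR.

```go
package main

import "fmt"

// recvOrClosed is a hypothetical helper. It uses the two-value receive form:
// ok is false only when the channel is closed and has no buffered values left.
func recvOrClosed(c <-chan *int) (v *int, ok bool) {
	v, ok = <-c
	return v, ok
}

func main() {
	retc := make(chan *int, 1)
	x := 7
	retc <- &x
	close(retc)

	// A value buffered before close is still delivered, with ok == true.
	v, ok := recvOrClosed(retc)
	fmt.Println(*v, ok) // 7 true

	// Closed and drained: the zero value (nil) comes back with ok == false.
	v, ok = recvOrClosed(retc)
	fmt.Println(v == nil, ok) // true false
}
```

A nil pointer sent on an open channel would also compare equal to nil, so only `ok` distinguishes "closed" from "received nil".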

@@ -118,8 +118,8 @@ type watchGrpcStream struct {

// mu protects the streams map
mu sync.RWMutex
Contributor

it seems that this mu also protects closeErr?

@@ -118,8 +118,8 @@ type watchGrpcStream struct {

// mu protects the streams map
mu sync.RWMutex
-// streams holds all active watchers
-streams map[int64]*watcherStream
+// substreams holds all active watchers
Contributor

holds all active gRPC streams for watchers?

Contributor Author

That's in watcher.streams, this tracks the watch id's that are in a single grpc stream (hence the rename from streams to substreams, using "streams" twice is super confusing). Will clarify the comments a bit.

-w.mu.Lock()
-w.streams[ws.id] = ws
-w.mu.Unlock()
+w.substreams[ws.id] = ws
Contributor

why do we drop the lock here?

}
w.mu.Unlock()
return empty
delete(w.substreams, ws.id)
Contributor

why do we drop the lock here?

w.closeStream(ws)
}

w.owner.mu.Lock()
Contributor

probably add a delStream func on owner?

-w.mu.RLock()
-defer w.mu.RUnlock()
-ws, ok := w.streams[pbresp.WatchId]
+ws, ok := w.substreams[pbresp.WatchId]
Contributor

why do we drop the lock here?

@xiang90
Contributor

xiang90 commented Sep 27, 2016

@heyitsanthony OK. After reading the whole thing, I understand that we want to serialize the closing event in the single run routine. Can we update the description of the mutex then?

@heyitsanthony
Contributor Author

there's still some intrinsic raciness between the resume path and stream cancelation, going to rework this a little bit more so the substream goroutines don't have to reason about resume

@heyitsanthony
Contributor Author

Tore up the watcher code from the roots. The old code's resume path was totally broken and could livelock in some cases. Now, all watch registrations go through a resuming queue on watchGrpcStream. On reconnect, all watcherStream goroutines are stopped so the resuming revision is stable, then restarted and requeued on resuming. Overall the design is much more convincing than the old mess.

PTAL /cc @xiang90 @gyuho
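
A rough sketch of the requeue-on-reconnect idea described above, for anyone following along. The `conn` and `stream` types and the `reconnect`/`register` methods are simplified stand-ins invented for this example; the real watchGrpcStream also stops and restarts the watcherStream goroutines, which is left out here.

```go
package main

import "fmt"

// stream is a simplified stand-in for watcherStream.
type stream struct {
	id  int64 // watch id on the current grpc stream; -1 while unregistered
	key string
}

// conn is a simplified stand-in for watchGrpcStream.
type conn struct {
	substreams map[int64]*stream // registered on the current grpc stream
	resuming   []*stream         // waiting to be registered, in queue order
}

// reconnect models a dropped grpc stream: every registered substream loses
// its id and is put back on the resuming queue.
func (c *conn) reconnect() {
	for _, ws := range c.substreams {
		ws.id = -1
		c.resuming = append(c.resuming, ws)
	}
	c.substreams = make(map[int64]*stream)
}

// register models an acknowledgement for the head of the queue: the head is
// popped, given the new id, and tracked in substreams.
func (c *conn) register(id int64) *stream {
	if len(c.resuming) == 0 {
		return nil
	}
	ws := c.resuming[0]
	c.resuming = c.resuming[1:]
	ws.id = id
	c.substreams[id] = ws
	return ws
}

func main() {
	c := &conn{substreams: make(map[int64]*stream)}
	// New watches enter through the same queue as resumed ones.
	c.resuming = append(c.resuming, &stream{id: -1, key: "foo"}, &stream{id: -1, key: "bar"})
	c.register(0)
	c.register(1)

	c.reconnect()
	fmt.Println(len(c.substreams), len(c.resuming)) // 0 2

	ws := c.register(5)
	fmt.Println(ws.id, len(c.resuming)) // 5 1
}
```

Since new and resumed watches share one queue, there is a single registration path to reason about.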

@heyitsanthony heyitsanthony removed the WIP label Sep 30, 2016
Contributor

@gyuho gyuho left a comment

LGTM.

TestWatchWithRequireLeader fails?

Defer to @xiang90

Thanks!

donec chan struct{}
// closing is set to true when stream should be scheduled to shutdown.
closing bool
// id is the registered watch id for on the grpc stream
Contributor

s/for on/on/?

Contributor Author

right, will fix

@@ -314,12 +320,7 @@ func (w *watcher) Close() (err error) {
}

func (w *watchGrpcStream) Close() (err error) {
w.mu.Lock()
Contributor

Q. Why do we remove this now? Was this causing any problem?

Contributor Author

The sharing was making it unnecessarily difficult to reason about the watcherStream teardown path so now watcherStream will post itself to watchGrpcStream.closingc and watchGrpcStream.run handles the final removal of watcherStream resources.
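
To illustrate the teardown pattern described here, a minimal sketch in which a single run goroutine owns the map and substreams post themselves to a closing channel. The type names and the `newGrpcStream` constructor are hypothetical simplifications, not the PR's code.

```go
package main

import "fmt"

// substream is a simplified stand-in for watcherStream.
type substream struct{ id int64 }

// grpcStream is a simplified stand-in for watchGrpcStream. While run is
// active, only run touches substreams, so no mutex is needed.
type grpcStream struct {
	substreams map[int64]*substream
	closingc   chan *substream // substreams post themselves here for teardown
	donec      chan struct{}   // closed when run exits
}

// newGrpcStream is a hypothetical constructor that pre-registers the ids.
func newGrpcStream(ids ...int64) *grpcStream {
	g := &grpcStream{
		substreams: make(map[int64]*substream),
		closingc:   make(chan *substream),
		donec:      make(chan struct{}),
	}
	for _, id := range ids {
		g.substreams[id] = &substream{id: id}
	}
	return g
}

// run handles the final removal of every substream and exits once none remain.
func (g *grpcStream) run() {
	defer close(g.donec)
	for len(g.substreams) > 0 {
		ws := <-g.closingc
		delete(g.substreams, ws.id)
	}
}

func main() {
	g := newGrpcStream(1, 2)
	go g.run()
	g.closingc <- &substream{id: 1}
	g.closingc <- &substream{id: 2}
	<-g.donec // run has exited, so reading the map here is safe
	fmt.Println(len(g.substreams)) // 0
}
```

Because every removal is serialized through run, the teardown order is easy to follow and the map needs no lock.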

@heyitsanthony heyitsanthony force-pushed the watcher-disconn branch 3 times, most recently from 6a3b548 to a64d996 on September 30, 2016 23:50

w.resumec = make(chan struct{})
w.joinSubstreams()
for _, ws := range w.substreams {
ws.id = -1
Contributor

do we need to update the donec of wc as what we do for the resuming streams at line 654?

Contributor Author

yes, but ws is appended into resuming and the following loop over resuming picks it up, so it's unnecessary here

Contributor

ah. right.

// streams are marked as nil in the queue since the head must wait for its inflight registration.
func (w *watchGrpcStream) nextResume() *watcherStream {
for len(w.resuming) != 0 {
if w.resuming[0] != nil {
Contributor

is it possible to handle the nil case outside the next resume? it seems like we can always remove the abandoned resuming at the caller side, no?

Contributor Author

It's possible, but the caller side would have to reimplement this loop every time. I'd rather have it done in one place.

Contributor

OK
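
For context on the thread above, a sketch of how a nextResume-style helper can skip abandoned entries in one place. Only the two loop lines quoted in the diff come from the PR; the rest of the body and the `queue` and `stream` types are assumptions made for this example.

```go
package main

import "fmt"

// stream is a simplified stand-in for watcherStream.
type stream struct{ key string }

// queue is a simplified stand-in for the resuming queue on watchGrpcStream.
type queue struct {
	resuming []*stream // nil entries mark streams abandoned while queued
}

// nextResume drops abandoned (nil) entries from the front of the queue and
// returns the first live stream without popping it, or nil if none remain.
func (q *queue) nextResume() *stream {
	for len(q.resuming) != 0 {
		if q.resuming[0] != nil {
			return q.resuming[0]
		}
		q.resuming = q.resuming[1:]
	}
	return nil
}

func main() {
	a, b := &stream{key: "a"}, &stream{key: "b"}
	q := &queue{resuming: []*stream{nil, nil, a, nil, b}}
	ws := q.nextResume()
	fmt.Println(ws.key, len(q.resuming)) // a 3
}
```

The head stays in the queue after being returned, which fits the comment that the head must wait for its inflight registration. Callers never need to repeat the nil-skipping loop.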

@@ -283,7 +288,10 @@ func (w *watcher) Watch(ctx context.Context, key string, opts ...OpOption) Watch
// receive channel
if ok {
select {
-case ret := <-retc:
+case ret, ok := <-wr.retc:
Contributor

can we add some comments for this? it is not clear to me why we need to call watch recursively.

Contributor

i failed to figure out where we might close retc.

Contributor Author

it doesn't any more; removed the close case

@xiang90
Contributor

xiang90 commented Oct 3, 2016

LGTM. Fix CI?

Anthony Romano added 2 commits October 3, 2016 16:56
Was more complicated than it needed to be and didn't really work in the
first place. Restructured watcher registration to use a queue.
@heyitsanthony heyitsanthony deleted the watcher-disconn branch October 4, 2016 23:16
3 participants