
Why are we calling SyncChanges() for TTL updates? #4460

Closed
flyinprogrammer opened this issue Jul 27, 2018 · 4 comments

@flyinprogrammer

AgentCheckPass, AgentCheckWarn, and AgentCheckFail all make a blocking call to s.syncChanges():

func (s *HTTPServer) syncChanges() {
	if err := s.agent.State.SyncChanges(); err != nil {
		s.agent.logger.Printf("[ERR] agent: failed to sync changes: %v", err)
	}
}

SyncChanges itself begins by taking the state's write lock:

consul/agent/local/state.go

Lines 1062 to 1063 in 8cbeb29

	l.Lock()
	defer l.Unlock()

What I'm failing to understand is: why do we do this?

All three methods call s.agent.updateTTLCheck(checkID, status, note), passing api.HealthPassing, api.HealthWarning, or api.HealthCritical respectively. For example, in AgentCheckPass:

	if err := s.agent.updateTTLCheck(checkID, api.HealthPassing, note); err != nil {
		return nil, err
	}

consul/agent/agent.go

Lines 2498 to 2524 in 8cbeb29

// updateTTLCheck is used to update the status of a TTL check via the Agent API.
func (a *Agent) updateTTLCheck(checkID types.CheckID, status, output string) error {
	a.checkLock.Lock()
	defer a.checkLock.Unlock()

	// Grab the TTL check.
	check, ok := a.checkTTLs[checkID]
	if !ok {
		return fmt.Errorf("CheckID %q does not have associated TTL", checkID)
	}

	// Set the status through CheckTTL to reset the TTL.
	check.SetStatus(status, output)

	// We don't write any files in dev mode so bail here.
	if a.config.DataDir == "" {
		return nil
	}

	// Persist the state so the TTL check can come up in a good state after
	// an agent restart, especially with long TTL values.
	if err := a.persistCheckState(check, status, output); err != nil {
		return fmt.Errorf("failed persisting state for check %q: %s", checkID, err)
	}
	return nil
}
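
As an aside, the persistence step at the end is just a local file write under the agent's data directory. Here's a minimal sketch of the idea; the names and layout (persistedCheckState, the checks/ subdirectory) are hypothetical and illustrative only, not Consul's actual on-disk format:

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// persistedCheckState is a hypothetical stand-in for the struct that gets
// serialized; field names here are illustrative only.
type persistedCheckState struct {
	CheckID string    `json:"check_id"`
	Status  string    `json:"status"`
	Output  string    `json:"output"`
	Expires time.Time `json:"expires"` // when the TTL would lapse
}

// persistCheckState writes the check's last known state to disk so a
// restarted agent can restore it instead of waiting out a long TTL.
func persistCheckState(dataDir, checkID, status, output string, ttl time.Duration) error {
	state := persistedCheckState{
		CheckID: checkID,
		Status:  status,
		Output:  output,
		Expires: time.Now().Add(ttl),
	}
	buf, err := json.Marshal(state)
	if err != nil {
		return err
	}
	dir := filepath.Join(dataDir, "checks")
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return err
	}
	// Write to a temp file and rename so a crash mid-write can't leave a
	// truncated state file behind.
	tmp := filepath.Join(dir, checkID+".tmp")
	if err := os.WriteFile(tmp, buf, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, checkID))
}

func main() {
	if err := persistCheckState(os.TempDir(), "web-ttl", "passing", "OK", 30*time.Second); err != nil {
		fmt.Println("persist failed:", err)
	}
}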

Which calls check.SetStatus(status, output):

check.SetStatus(status, output)

Which calls c.Notify.UpdateCheck(c.CheckID, status, output):

c.Notify.UpdateCheck(c.CheckID, status, output)

Which goes to this code block:

consul/agent/local/state.go

Lines 469 to 544 in 7fa6bb0

func (l *State) UpdateCheck(id types.CheckID, status, output string) {
	l.Lock()
	defer l.Unlock()

	c := l.checks[id]
	if c == nil || c.Deleted {
		return
	}

	if l.discardCheckOutput.Load().(bool) {
		output = ""
	}

	// Update the critical time tracking (this doesn't cause a server updates
	// so we can always keep this up to date).
	if status == api.HealthCritical {
		if !c.Critical() {
			c.CriticalTime = time.Now()
		}
	} else {
		c.CriticalTime = time.Time{}
	}

	// Do nothing if update is idempotent
	if c.Check.Status == status && c.Check.Output == output {
		return
	}

	// Defer a sync if the output has changed. This is an optimization around
	// frequent updates of output. Instead, we update the output internally,
	// and periodically do a write-back to the servers. If there is a status
	// change we do the write immediately.
	if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
		c.Check.Output = output
		if c.DeferCheck == nil {
			d := l.config.CheckUpdateInterval
			intv := time.Duration(uint64(d)/2) + lib.RandomStagger(d)
			c.DeferCheck = time.AfterFunc(intv, func() {
				l.Lock()
				defer l.Unlock()

				c := l.checks[id]
				if c == nil {
					return
				}
				c.DeferCheck = nil
				if c.Deleted {
					return
				}
				c.InSync = false
				l.TriggerSyncChanges()
			})
		}
		return
	}

	// If this is a check for an aliased service, then notify the waiters.
	if aliases, ok := l.checkAliases[c.Check.ServiceID]; ok && len(aliases) > 0 {
		for _, notifyCh := range aliases {
			// Do not block. All notify channels should be buffered to at
			// least 1 in which case not-blocking does not result in loss
			// of data because a failed send means a notification is
			// already queued. This must be called with the lock held.
			select {
			case notifyCh <- struct{}{}:
			default:
			}
		}
	}

	// Update status and mark out of sync
	c.Check.Status = status
	c.Check.Output = output
	c.InSync = false
	l.TriggerSyncChanges()
}

Which either calls l.TriggerSyncChanges() immediately or defers it, depending on whether the update is a status change or just refreshed output:

l.TriggerSyncChanges()
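
As an aside, the deferral interval computed in the snippet lands at a uniformly random point in [d/2, 3d/2), which spreads the write-backs from many agents over time. A tiny self-contained illustration of that arithmetic (randomStagger is re-implemented here for the example; the real helper is lib.RandomStagger):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomStagger mirrors what lib.RandomStagger does: return a random
// duration in [0, intv).
func randomStagger(intv time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(intv)))
}

func main() {
	d := 5 * time.Minute // e.g. the configured check_update_interval
	// Matches the expression in UpdateCheck above: the deferred write-back
	// fires somewhere in [d/2, 3d/2).
	intv := d/2 + randomStagger(d)
	fmt.Println("deferred sync in", intv)
}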

Which is set up when we start the agent:

consul/agent/agent.go

Lines 346 to 351 in 0e02277

// link the state with the consul server/client and the state syncer
// via callbacks. After several attempts this was easier than using
// channels since the event notification needs to be non-blocking
// and that should be hidden in the state syncer implementation.
a.State.Delegate = a.delegate
a.State.TriggerSyncChanges = a.sync.SyncChanges.Trigger

Which is used to notify the syncer to do a partial sync:

case <-s.SyncChanges.Notif():

Which ends up calling the same SyncChanges() method that every API call is forced to invoke:

err := s.State.SyncChanges()
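
The Trigger/Notif pair behind this is a non-blocking, coalescing wake-up. Here's a minimal sketch of the pattern; this is the general shape, not Consul's actual ae package types:

package main

import (
	"fmt"
	"time"
)

// trigger is a one-slot, non-blocking notification: Trigger never blocks
// the caller, and multiple triggers coalesce into one pending wake-up,
// which is fine because the syncer re-scans all state anyway.
type trigger struct {
	ch chan struct{}
}

func newTrigger() *trigger { return &trigger{ch: make(chan struct{}, 1)} }

func (t *trigger) Trigger() {
	select {
	case t.ch <- struct{}{}:
	default: // a notification is already queued; dropping this one loses nothing
	}
}

func (t *trigger) Notif() <-chan struct{} { return t.ch }

func main() {
	t := newTrigger()
	go func() {
		// Many updates collapse into a single wake-up.
		for i := 0; i < 3; i++ {
			t.Trigger()
		}
	}()
	select {
	case <-t.Notif():
		fmt.Println("partial sync requested")
	case <-time.After(time.Second):
		fmt.Println("no trigger")
	}
}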

It seems to me that if the syncer is working, we wouldn't ever have to make that extra sync call on each API request. However, I'm still learning Go, so if someone has cycles to explain why my thinking is wrong, I would be extra appreciative!

@pearkes pearkes added the needs-investigation The issue described is detailed and complex. label Aug 1, 2018
@freddygv freddygv self-assigned this Jul 11, 2019
@freddygv
Contributor

freddygv commented Aug 9, 2019

Hi @flyinprogrammer, the reason for calling s.syncChanges is that it directly triggers a sync against the catalog.

TriggerSyncChanges notifies retrySyncFullEventFn, but for that to be handled, the StateSyncer FSM needs to be in the partialSyncState. Getting to this state can be delayed by up to 30 seconds due to transient errors in the full sync state.

Because of this, calling s.syncChanges in the HTTP handlers can allow us to push user updates faster than if we only let the StateSyncer handle them. However, if the StateSyncer does get to the request first, then syncChanges will only do a small amount of work to find out that nothing needs to be synced.

So you're right, we don't need to call syncChanges to update a given TTL status, but it can speed up the sync for little extra cost.
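
To illustrate that last point: a redundant sync against already-synced state is close to free, because entries marked in-sync are skipped. A simplified, hypothetical sketch of that short-circuit (not the real state-sync code):

package main

import (
	"fmt"
	"sync"
)

// checkState is a simplified stand-in for the agent's local check record.
type checkState struct {
	InSync bool
	Status string
}

type state struct {
	sync.Mutex
	checks map[string]*checkState
}

// SyncChanges sketches the short-circuit: if everything is InSync (because
// the StateSyncer already pushed the update), the loop does no RPCs at all.
func (s *state) SyncChanges() error {
	s.Lock()
	defer s.Unlock()
	for id, c := range s.checks {
		if c.InSync {
			continue // nothing to push for this check
		}
		// In the real code this is a catalog RPC; here we just mark it.
		fmt.Println("pushing", id, c.Status)
		c.InSync = true
	}
	return nil
}

func main() {
	s := &state{checks: map[string]*checkState{
		"web-ttl": {InSync: false, Status: "passing"},
	}}
	s.SyncChanges() // pushes web-ttl
	s.SyncChanges() // no-op: everything is already in sync
}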

@freddygv freddygv closed this as completed Aug 9, 2019
@flyinprogrammer
Author

thanks for being awesome @freddygv - this was a helpful follow-up!

@249043822

249043822 commented Sep 24, 2019

In the AgentCheckPass() endpoint, if err := s.agent.vetCheckUpdate(token, checkID); err != nil { return nil, err } and s.syncChanges() are called sequentially.

Sometimes, perhaps due to network I/O delay, s.syncChanges() can hang for a long time. Because it holds the state's write lock, the TTL updates that follow are blocked at if err := s.agent.vetCheckUpdate(token, checkID); err != nil, since they cannot acquire the read lock; the TTLs then time out and our service becomes critical. After a while, when the first s.syncChanges() finally returns, everything goes back to normal.

We have hit this problem many times in our production environment. Any suggestions? @freddygv

func (s *HTTPServer) AgentCheckPass(resp http.ResponseWriter, req *http.Request) (interface{}, error) {
	checkID := types.CheckID(strings.TrimPrefix(req.URL.Path, "/v1/agent/check/pass/"))
	note := req.URL.Query().Get("note")

	// Get the provided token, if any, and vet against any ACL policies.
	var token string
	s.parseToken(req, &token)
	if err := s.agent.vetCheckUpdate(token, checkID); err != nil {
		return nil, err
	}

	if err := s.agent.updateTTLCheck(checkID, api.HealthPassing, note); err != nil {
		return nil, err
	}
	s.syncChanges()
	return nil, nil
}
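
The failure mode described here can be reproduced in miniature: a writer that holds a lock across slow I/O stalls every reader queued behind it. A standalone sketch with hypothetical timings (not Consul code):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// Writer: stands in for syncChanges holding the state lock across a
	// slow catalog RPC (the anti-pattern at issue here).
	go func() {
		mu.Lock()
		time.Sleep(2 * time.Second) // simulated network stall inside the lock
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

	// Reader: stands in for vetCheckUpdate taking an RLock; it blocks for
	// the full duration of the writer's stall, so the TTL update is late.
	start := time.Now()
	mu.RLock()
	fmt.Println("reader waited", time.Since(start))
	mu.RUnlock()
}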

@freddygv
Contributor

freddygv commented Oct 10, 2019

Hi @249043822. Given current options, I can't think of a better solution than increasing the TTL interval for that check.

There is no configuration option to adjust the 10 second timeout for the RPC to sync the check status.

Additionally, syncChanges needs to acquire a lock because check states can be updated concurrently by multiple goroutines.

One way to mitigate this could be to refactor SyncChanges() so that we release the state lock after constructing a list of services and checks that were deleted or aren't in sync. That way the catalog update isn't done while holding the lock. This would require some more thought to ensure it's safe to make that change.

Edit: I created #6616 to track that potential change.
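
A rough sketch of the refactor idea, hypothetical and simplified: snapshot the out-of-sync work while holding the lock, release it, then do the slow catalog updates lock-free. A real implementation would need to handle records that are updated again while the sync is in flight:

package main

import (
	"fmt"
	"sync"
	"time"
)

type checkState struct {
	ID     string
	Status string
	InSync bool
}

type state struct {
	sync.Mutex
	checks map[string]*checkState
}

// SyncChanges builds a worklist under the lock, then performs the slow
// RPCs after releasing it, so readers aren't stalled by network I/O.
func (s *state) SyncChanges() {
	s.Lock()
	var work []checkState
	for _, c := range s.checks {
		if !c.InSync {
			work = append(work, *c) // copy a snapshot of what to push
		}
	}
	s.Unlock()

	for _, c := range work {
		time.Sleep(50 * time.Millisecond) // simulated catalog RPC, outside the lock
		fmt.Println("synced", c.ID, c.Status)

		// Re-acquire briefly to flip the flag; note the guard against a
		// status that changed while we were syncing.
		s.Lock()
		if cur, ok := s.checks[c.ID]; ok && cur.Status == c.Status {
			cur.InSync = true
		}
		s.Unlock()
	}
}

func main() {
	s := &state{checks: map[string]*checkState{
		"web-ttl": {ID: "web-ttl", Status: "passing"},
	}}
	s.SyncChanges()
}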
