
Why are we calling SyncChanges() for TTL updates? #4460

Closed
flyinprogrammer opened this issue Jul 27, 2018 · 4 comments

@flyinprogrammer

AgentCheckPass, AgentCheckWarn, and AgentCheckFail all make a blocking call to s.syncChanges():

func (s *HTTPServer) syncChanges() {
	if err := s.agent.State.SyncChanges(); err != nil {
		s.agent.logger.Printf("[ERR] agent: failed to sync changes: %v", err)
	}
}

SyncChanges itself begins by taking the state's write lock:

consul/agent/local/state.go

Lines 1062 to 1063 in 8cbeb29

	l.Lock()
	defer l.Unlock()

What I'm failing to understand is: why do we do this?

All three methods call s.agent.updateTTLCheck(checkID, status, note), passing api.HealthPassing, api.HealthWarning, or api.HealthCritical respectively. For example, in AgentCheckPass:

	if err := s.agent.updateTTLCheck(checkID, api.HealthPassing, note); err != nil {
		return nil, err
	}

consul/agent/agent.go

Lines 2498 to 2524 in 8cbeb29

// updateTTLCheck is used to update the status of a TTL check via the Agent API.
func (a *Agent) updateTTLCheck(checkID types.CheckID, status, output string) error {
	a.checkLock.Lock()
	defer a.checkLock.Unlock()

	// Grab the TTL check.
	check, ok := a.checkTTLs[checkID]
	if !ok {
		return fmt.Errorf("CheckID %q does not have associated TTL", checkID)
	}

	// Set the status through CheckTTL to reset the TTL.
	check.SetStatus(status, output)

	// We don't write any files in dev mode so bail here.
	if a.config.DataDir == "" {
		return nil
	}

	// Persist the state so the TTL check can come up in a good state after
	// an agent restart, especially with long TTL values.
	if err := a.persistCheckState(check, status, output); err != nil {
		return fmt.Errorf("failed persisting state for check %q: %s", checkID, err)
	}
	return nil
}
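
As an aside, the persistence step at the end is just a local file write under the agent's data directory. Here's a minimal sketch of the idea; the names and layout (persistedCheckState, the checks/ subdirectory) are hypothetical and illustrative only, not Consul's actual on-disk format:

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// persistedCheckState is a hypothetical stand-in for the struct that gets
// serialized; field names here are illustrative only.
type persistedCheckState struct {
	CheckID string    `json:"check_id"`
	Status  string    `json:"status"`
	Output  string    `json:"output"`
	Expires time.Time `json:"expires"` // when the TTL would lapse
}

// persistCheckState writes the check's last known state to disk so a
// restarted agent can restore it instead of waiting out a long TTL.
func persistCheckState(dataDir, checkID, status, output string, ttl time.Duration) error {
	state := persistedCheckState{
		CheckID: checkID,
		Status:  status,
		Output:  output,
		Expires: time.Now().Add(ttl),
	}
	buf, err := json.Marshal(state)
	if err != nil {
		return err
	}
	dir := filepath.Join(dataDir, "checks")
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return err
	}
	// Write to a temp file and rename so a crash mid-write can't leave a
	// truncated state file behind.
	tmp := filepath.Join(dir, checkID+".tmp")
	if err := os.WriteFile(tmp, buf, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, checkID))
}

func main() {
	if err := persistCheckState(os.TempDir(), "web-ttl", "passing", "OK", 30*time.Second); err != nil {
		fmt.Println("persist failed:", err)
	}
}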

Which calls check.SetStatus(status, output):

check.SetStatus(status, output)

Which calls c.Notify.UpdateCheck(c.CheckID, status, output):

c.Notify.UpdateCheck(c.CheckID, status, output)

Which goes to this code block:

consul/agent/local/state.go

Lines 469 to 544 in 7fa6bb0

func (l *State) UpdateCheck(id types.CheckID, status, output string) {
	l.Lock()
	defer l.Unlock()

	c := l.checks[id]
	if c == nil || c.Deleted {
		return
	}

	if l.discardCheckOutput.Load().(bool) {
		output = ""
	}

	// Update the critical time tracking (this doesn't cause a server updates
	// so we can always keep this up to date).
	if status == api.HealthCritical {
		if !c.Critical() {
			c.CriticalTime = time.Now()
		}
	} else {
		c.CriticalTime = time.Time{}
	}

	// Do nothing if update is idempotent
	if c.Check.Status == status && c.Check.Output == output {
		return
	}

	// Defer a sync if the output has changed. This is an optimization around
	// frequent updates of output. Instead, we update the output internally,
	// and periodically do a write-back to the servers. If there is a status
	// change we do the write immediately.
	if l.config.CheckUpdateInterval > 0 && c.Check.Status == status {
		c.Check.Output = output
		if c.DeferCheck == nil {
			d := l.config.CheckUpdateInterval
			intv := time.Duration(uint64(d)/2) + lib.RandomStagger(d)
			c.DeferCheck = time.AfterFunc(intv, func() {
				l.Lock()
				defer l.Unlock()

				c := l.checks[id]
				if c == nil {
					return
				}
				c.DeferCheck = nil
				if c.Deleted {
					return
				}
				c.InSync = false
				l.TriggerSyncChanges()
			})
		}
		return
	}

	// If this is a check for an aliased service, then notify the waiters.
	if aliases, ok := l.checkAliases[c.Check.ServiceID]; ok && len(aliases) > 0 {
		for _, notifyCh := range aliases {
			// Do not block. All notify channels should be buffered to at
			// least 1 in which case not-blocking does not result in loss
			// of data because a failed send means a notification is
			// already queued. This must be called with the lock held.
			select {
			case notifyCh <- struct{}{}:
			default:
			}
		}
	}

	// Update status and mark out of sync
	c.Check.Status = status
	c.Check.Output = output
	c.InSync = false
	l.TriggerSyncChanges()
}

Which either calls l.TriggerSyncChanges() immediately or defers it, depending on whether the update is a status change or just refreshed output:

l.TriggerSyncChanges()
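
As an aside, the deferral interval computed in the snippet lands at a uniformly random point in [d/2, 3d/2), which spreads the write-backs from many agents over time. A tiny self-contained illustration of that arithmetic (randomStagger is re-implemented here for the example; the real helper is lib.RandomStagger):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomStagger mirrors what lib.RandomStagger does: return a random
// duration in [0, intv).
func randomStagger(intv time.Duration) time.Duration {
	return time.Duration(rand.Int63n(int64(intv)))
}

func main() {
	d := 5 * time.Minute // e.g. the configured check_update_interval
	// Matches the expression in UpdateCheck above: the deferred write-back
	// fires somewhere in [d/2, 3d/2).
	intv := d/2 + randomStagger(d)
	fmt.Println("deferred sync in", intv)
}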

Which is set up when we start the agent:

consul/agent/agent.go

Lines 346 to 351 in 0e02277

// link the state with the consul server/client and the state syncer
// via callbacks. After several attempts this was easier than using
// channels since the event notification needs to be non-blocking
// and that should be hidden in the state syncer implementation.
a.State.Delegate = a.delegate
a.State.TriggerSyncChanges = a.sync.SyncChanges.Trigger

Which is used to notify the syncer to do a partial sync:

case <-s.SyncChanges.Notif():

Which ends up calling the same SyncChanges() method that every API call is forced to invoke:

err := s.State.SyncChanges()
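
The Trigger/Notif pair behind this is a non-blocking, coalescing wake-up. Here's a minimal sketch of the pattern; this is the general shape, not Consul's actual ae package types:

package main

import (
	"fmt"
	"time"
)

// trigger is a one-slot, non-blocking notification: Trigger never blocks
// the caller, and multiple triggers coalesce into one pending wake-up,
// which is fine because the syncer re-scans all state anyway.
type trigger struct {
	ch chan struct{}
}

func newTrigger() *trigger { return &trigger{ch: make(chan struct{}, 1)} }

func (t *trigger) Trigger() {
	select {
	case t.ch <- struct{}{}:
	default: // a notification is already queued; dropping this one loses nothing
	}
}

func (t *trigger) Notif() <-chan struct{} { return t.ch }

func main() {
	t := newTrigger()
	go func() {
		// Many updates collapse into a single wake-up.
		for i := 0; i < 3; i++ {
			t.Trigger()
		}
	}()
	select {
	case <-t.Notif():
		fmt.Println("partial sync requested")
	case <-time.After(time.Second):
		fmt.Println("no trigger")
	}
}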

It seems to me that if the syncer is working, we wouldn't ever have to make that extra sync call on each API request. However, I'm still learning Go, so if someone has cycles to explain why my thinking is wrong, I would be extra appreciative!

@pearkes pearkes added the needs-investigation The issue described is detailed and complex. label Aug 1, 2018
@freddygv freddygv self-assigned this Jul 11, 2019
@freddygv
Contributor

freddygv commented Aug 9, 2019

Hi @flyinprogrammer, the reason for calling s.syncChanges is that it directly triggers a sync against the catalog.

TriggerSyncChanges notifies retrySyncFullEventFn, but for that to be handled, the StateSyncer FSM needs to be in the partialSyncState. Getting to this state can be delayed by up to 30 seconds due to transient errors in the full sync state.

Because of this, calling s.syncChanges in the HTTP handlers can allow us to push user updates faster than if we only let the StateSyncer handle them. However, if the StateSyncer does get to the request first, then syncChanges will only do a small amount of work to find out that nothing needs to be synced.

So you're right, we don't need to call syncChanges to update a given TTL status, but it can speed up the sync for little extra cost.
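
To illustrate that last point: a redundant sync against already-synced state is close to free, because entries marked in-sync are skipped. A simplified, hypothetical sketch of that short-circuit (not the real state-sync code):

package main

import (
	"fmt"
	"sync"
)

// checkState is a simplified stand-in for the agent's local check record.
type checkState struct {
	InSync bool
	Status string
}

type state struct {
	sync.Mutex
	checks map[string]*checkState
}

// SyncChanges sketches the short-circuit: if everything is InSync (because
// the StateSyncer already pushed the update), the loop does no RPCs at all.
func (s *state) SyncChanges() error {
	s.Lock()
	defer s.Unlock()
	for id, c := range s.checks {
		if c.InSync {
			continue // nothing to push for this check
		}
		// In the real code this is a catalog RPC; here we just mark it.
		fmt.Println("pushing", id, c.Status)
		c.InSync = true
	}
	return nil
}

func main() {
	s := &state{checks: map[string]*checkState{
		"web-ttl": {InSync: false, Status: "passing"},
	}}
	s.SyncChanges() // pushes web-ttl
	s.SyncChanges() // no-op: everything is already in sync
}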

@freddygv freddygv closed this as completed Aug 9, 2019
@flyinprogrammer
Author

thanks for being awesome @freddygv - this was a helpful follow-up!

@249043822

249043822 commented Sep 24, 2019

In the AgentCheckPass() endpoint, if err := s.agent.vetCheckUpdate(token, checkID); err != nil { return nil, err } and s.syncChanges() are called sequentially.

Sometimes, perhaps due to network I/O delay, s.syncChanges() can hang for a long time. Because it holds the state's write lock, the TTL updates that follow are blocked at if err := s.agent.vetCheckUpdate(token, checkID); err != nil, since they cannot acquire the read lock; the TTLs then time out and our service becomes critical. After a while, when the first s.syncChanges() finally returns, everything goes back to normal.

We have hit this problem many times in our production environment. Any suggestions? @freddygv

func (s *HTTPServer) AgentCheckPass(resp http.ResponseWriter, req *http.Request) (interface{}, error) {
	checkID := types.CheckID(strings.TrimPrefix(req.URL.Path, "/v1/agent/check/pass/"))
	note := req.URL.Query().Get("note")

	// Get the provided token, if any, and vet against any ACL policies.
	var token string
	s.parseToken(req, &token)
	if err := s.agent.vetCheckUpdate(token, checkID); err != nil {
		return nil, err
	}

	if err := s.agent.updateTTLCheck(checkID, api.HealthPassing, note); err != nil {
		return nil, err
	}
	s.syncChanges()
	return nil, nil
}
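
The failure mode described here can be reproduced in miniature: a writer that holds a lock across slow I/O stalls every reader queued behind it. A standalone sketch with hypothetical timings (not Consul code):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// Writer: stands in for syncChanges holding the state lock across a
	// slow catalog RPC (the anti-pattern at issue here).
	go func() {
		mu.Lock()
		time.Sleep(2 * time.Second) // simulated network stall inside the lock
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

	// Reader: stands in for vetCheckUpdate taking an RLock; it blocks for
	// the full duration of the writer's stall, so the TTL update is late.
	start := time.Now()
	mu.RLock()
	fmt.Println("reader waited", time.Since(start))
	mu.RUnlock()
}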

@freddygv
Contributor

freddygv commented Oct 10, 2019

Hi @249043822. Given current options, I can't think of a better solution than increasing the TTL interval for that check.

There is no configuration option to adjust the 10 second timeout for the RPC to sync the check status.

Additionally, syncChanges needs to acquire a lock because check states can be updated concurrently by multiple goroutines.

One way to mitigate this could be to refactor SyncChanges() so that we release the state lock after constructing a list of services and checks that were deleted or aren't in sync. That way the catalog update isn't done while holding the lock. This would require some more thought to ensure it's safe to make that change.

Edit: I created #6616 to track that potential change.
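
A rough sketch of the refactor idea, hypothetical and simplified: snapshot the out-of-sync work while holding the lock, release it, then do the slow catalog updates lock-free. A real implementation would need to handle records that are updated again while the sync is in flight:

package main

import (
	"fmt"
	"sync"
	"time"
)

type checkState struct {
	ID     string
	Status string
	InSync bool
}

type state struct {
	sync.Mutex
	checks map[string]*checkState
}

// SyncChanges builds a worklist under the lock, then performs the slow
// RPCs after releasing it, so readers aren't stalled by network I/O.
func (s *state) SyncChanges() {
	s.Lock()
	var work []checkState
	for _, c := range s.checks {
		if !c.InSync {
			work = append(work, *c) // copy a snapshot of what to push
		}
	}
	s.Unlock()

	for _, c := range work {
		time.Sleep(50 * time.Millisecond) // simulated catalog RPC, outside the lock
		fmt.Println("synced", c.ID, c.Status)

		// Re-acquire briefly to flip the flag; note the guard against a
		// status that changed while we were syncing.
		s.Lock()
		if cur, ok := s.checks[c.ID]; ok && cur.Status == c.Status {
			cur.InSync = true
		}
		s.Unlock()
	}
}

func main() {
	s := &state{checks: map[string]*checkState{
		"web-ttl": {ID: "web-ttl", Status: "passing"},
	}}
	s.SyncChanges()
}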
