Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unstable test testSuite5.TestSelectGlobalVar #25272

Closed
Tracked by #25899
zhouqiang-cl opened this issue Jun 9, 2021 · 5 comments · Fixed by #26714
Closed
Tracked by #25899

unstable test testSuite5.TestSelectGlobalVar #25272

zhouqiang-cl opened this issue Jun 9, 2021 · 5 comments · Fixed by #26714
Assignees
Labels
component/test severity/major type/bug The issue is confirmed as a bug.

Comments

@zhouqiang-cl
Copy link
Contributor

Bug Report

----------------------------------------------------------------------
[2021-06-08T10:36:43.102Z] FAIL: set_test.go:1056: testSuite5.TestSelectGlobalVar
[2021-06-08T10:36:43.102Z] 
[2021-06-08T10:36:43.102Z] set_test.go:1065:
[2021-06-08T10:36:43.102Z]     tk.MustQuery("select @@max_connections;").Check(testkit.Rows("100"))
[2021-06-08T10:36:43.102Z] /home/jenkins/agent/workspace/tidb-unit-test-nightly/go/src/github.com/pingcap/tidb/util/testkit/testkit.go:63:
[2021-06-08T10:36:43.102Z]     res.c.Assert(resBuff.String(), check.Equals, needBuff.String(), res.comment)
[2021-06-08T10:36:43.102Z] ... obtained string = "[151]\n"
[2021-06-08T10:36:43.102Z] ... expected string = "[100]\n"
[2021-06-08T10:36:43.102Z] ... sql:select @@max_connections;, args:[]

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

in ci https://ci.pingcap.net/blue/rest/organizations/jenkins/pipelines/tidb-unit-test-nightly/runs/261/nodes/104/steps/344/log/?start=0

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

master 7538818

@morgo
Copy link
Contributor

morgo commented Jul 28, 2021

I believe I might have figured out the cause of this! It is in this function:

tidb/domain/sysvar_cache.go

Lines 112 to 146 in c70662b

// RebuildSysVarCache rebuilds the sysvar cache both globally and for session vars.
// It needs to be called when sysvars are added or removed.
func (svc *SysVarCache) RebuildSysVarCache(ctx sessionctx.Context) error {
newSessionCache := make(map[string]string)
newGlobalCache := make(map[string]string)
tableContents, err := svc.fetchTableValues(ctx)
if err != nil {
return err
}
for _, sv := range variable.GetSysVars() {
sVal := sv.Value
if _, ok := tableContents[sv.Name]; ok {
sVal = tableContents[sv.Name]
}
// session cache stores non-skippable variables, which essentially means session scope.
// for historical purposes there are some globals, but these should eventually be removed.
if !sv.SkipInit() {
newSessionCache[sv.Name] = sVal
}
if sv.HasGlobalScope() {
newGlobalCache[sv.Name] = sVal
}
// Propagate any changes to the server scoped variables
checkEnableServerGlobalVar(sv.Name, sVal)
}
logutil.BgLogger().Debug("rebuilding sysvar cache")
svc.Lock()
defer svc.Unlock()
svc.session = newSessionCache
svc.global = newGlobalCache
return nil
}

The Lock is only held when replacing the cache, not when reading the table values (svc.fetchTableValues(ctx)), so it's possible that the following can occur:

  • Updated Request 1 starts, reads table tables.
  • An update occurs, triggering an update to the table and update request 2 to start.
  • Update request 2 updates the cache
  • Update request 1 updates the cache

Thus the update request 2 is completely lost. We know that there are concurrent calls to RebuildSysVarCache (and there could be concurrent sysvar updates).

The fix is to hold the lock when reading the table values. This ensures updates are ordered and no requests are lost.

@morgo
Copy link
Contributor

morgo commented Jul 28, 2021

The only other possibility is that GetGlobalVar is failing, and the default is being substituted:

tidb/session/session.go

Lines 1118 to 1129 in c70662b

sysVar, err := domain.GetDomain(s).GetSysVarCache().GetGlobalVar(s, name)
if err != nil {
// The sysvar exists, but there is no cache entry yet.
// This might be because the sysvar was only recently registered.
// In which case it is safe to return the default, but we can also
// update the cache for the future.
logutil.BgLogger().Info("sysvar not in cache yet. sysvar cache may be stale", zap.String("name", name))
sysVar, err = s.getTableValue(context.TODO(), mysql.GlobalVariablesTable, name)
if err != nil {
return sv.Value, nil
}
}

But I don't think this is the case, since the line above (select @@global.max_connections;) is using exactly the same function call.

@tiancaiamao
Copy link
Contributor

How about let LoadSysVarCacheLoop the only place to call RebuildSysVarCache?

For example, you can

func (do *Domain) NotifyUpdateSysVarCache(ctx sessionctx.Context) {
	// ...
	// update locally
	do.sysVarCache.Ch <- struct{}
}

func (do *Domain) LoadSysVarCacheLoop(ctx sessionctx.Context) error {
        // ...
	go func() {
                // ...
		for {
			ok := true
			select {
			case <-do.exit:
				return
			case _, ok = <-watchCh:
                        case _, ok = <-sysVarCache.Ch:   // For notification from NotifyUpdateSysVarCache
			case <-time.After(duration):
			}

		       // ...
		}
	}()
	return nil
}

@morgo
Copy link
Contributor

morgo commented Jul 29, 2021

@tiancaiamao This is also used by

func (svc *SysVarCache) rebuildCacheIfNeeded(ctx sessionctx.Context) (err error) {
svc.RLock()
cacheNeedsRebuild := len(svc.session) == 0 || len(svc.global) == 0
svc.RUnlock()
if cacheNeedsRebuild {
logutil.BgLogger().Warn("sysvar cache is empty, triggering rebuild")
if err = svc.RebuildSysVarCache(ctx); err != nil {
logutil.BgLogger().Error("rebuilding sysvar cache failed", zap.Error(err))
}
}
return err
}
which is part of the SysVarCache and can't call domain.NotifyUpdateSysVarCache.

I could add the channel to the SysVarCache instead of a mutex, but my hope is we figure this out better for #24700 so that the Loops backoff their regular execution when they are manually triggered.

@ti-srebot
Copy link
Contributor

ti-srebot commented Aug 4, 2021

Please edit this comment or add a new comment to complete the following information

Bug

Concurrent execution of SET GLOBAL could lead to incorrect sysvar cache contents. It is unlikely to impact user environments much, because the cache regenerates every 30 seconds anyway.

1. Root Cause Analysis (RCA) (optional)

A mutex is required to prevent concurrent sysvar cache generation.

2. Symptom (optional)

3. All Trigger Conditions (optional)

4. Workaround (optional)

5. Affected versions

[v5.1.0,v5.3.0]

6. Fixed versions

master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/test severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants