bindinfo: sync concurrent ops on mysql.bind_info from multiple tidb instances #21629

Merged (2 commits) on Dec 17, 2020

Contributor

@eurekaka eurekaka commented Dec 10, 2020

What problem does this PR solve?

Issue Number: close #21516

Problem Summary:

The binding cache and mysql.bind_info can become inconsistent when CREATE BINDING / DROP BINDING statements run concurrently on multiple TiDB instances.

What is changed and how it works?

What's Changed:

  • Physically remove old bindings in CREATE BINDING regardless of whether the oldRecord from the binding cache is empty.
  • Combine removeBindRecord and appendBindRecord into a single setBindRecord.
  • Insert a builtin row into mysql.bind_info to simulate a table lock.
  • Lock mysql.bind_info before manipulating its records by updating that builtin row. Since TiDB has no gap locks, and LOCK TABLE mysql.bind_info WRITE is not generally available yet, we use an UPDATE on a specific row to simulate LOCK TABLE.
  • Get the real update_time after acquiring the "simulated table lock", instead of using the StartTS of the transaction.
  • Hold h.bindInfo.Lock() so that Update() cannot run in the interval between the transaction commit and the binding cache update.
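
The "simulated table lock" in the list above can be pictured as a fixed statement order: open a transaction, UPDATE the builtin row, do the real work, then commit. The following is only a minimal sketch of that ordering, not the code in this PR. The execer interface, the recorder fake, the lockRowKey value, the use of BEGIN PESSIMISTIC, and the SET clause are all assumptions made for illustration.

```go
package main

import "fmt"

// execer is a hypothetical stand-in for a SQL session; it is not a TiDB API.
type execer interface {
	Exec(query string) error
}

// recorder is a fake execer that only records statements, for illustration.
type recorder struct{ stmts []string }

func (r *recorder) Exec(query string) error {
	r.stmts = append(r.stmts, query)
	return nil
}

// lockRowKey is a placeholder for the original_sql value of the builtin row.
const lockRowKey = "placeholder_builtin_lock_row"

// withBindInfoLock updates the builtin row first, so that a concurrent
// transaction touching the same row has to wait, then runs fn and commits.
// BEGIN PESSIMISTIC and the SET clause are assumptions of this sketch.
func withBindInfoLock(e execer, fn func() error) (err error) {
	if err = e.Exec("BEGIN PESSIMISTIC"); err != nil {
		return err
	}
	defer func() {
		if err != nil {
			_ = e.Exec("ROLLBACK") // best effort; keep the original error
			return
		}
		err = e.Exec("COMMIT")
	}()
	lock := fmt.Sprintf(
		"UPDATE mysql.bind_info SET source = 'builtin' WHERE original_sql = '%s'",
		lockRowKey)
	if err = e.Exec(lock); err != nil {
		return err
	}
	return fn()
}
```

With the recorder fake, calling withBindInfoLock with a callback that issues one statement records four statements: the BEGIN, the lock UPDATE, the callback's statement, and the COMMIT. Because every writer touches the same row first, writers on different instances are serialized on that row, which is why the row must always exist.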

How it Works:

The core idea of this PR is to make the operations on mysql.bind_info atomic, i.e., the steps must not be interrupted by a transaction from another TiDB instance.
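
Within a single instance, the same idea applies to the gap between committing to storage and refreshing the in-memory cache. The sketch below is a simplified illustration of holding one mutex across both steps so a periodic reload cannot run in between. The handle type, its fields, and the setBinding and reload helpers are hypothetical and are not the types used in this PR.

```go
package main

import "sync"

// handle is a hypothetical, simplified stand-in for a binding handle.
type handle struct {
	mu    sync.Mutex
	cache map[string]string // original SQL -> bind SQL
}

// setBinding commits to storage and then updates the cache while holding
// the same mutex, so reload cannot observe the committed-but-uncached gap.
func (h *handle) setBinding(orig, bind string, commit func() error) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if err := commit(); err != nil {
		return err // cache is left untouched when the commit fails
	}
	h.cache[orig] = bind
	return nil
}

// reload replaces the cache with a fresh snapshot loaded from storage.
// It takes the same mutex, so it waits until setBinding has finished.
func (h *handle) reload(load func() map[string]string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.cache = load()
}
```

The design choice here is to accept that reloads are briefly blocked in exchange for the cache never being updated from a snapshot that predates a commit already in flight.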

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • Integration test: UTF contains a test for this PR.
  • Manual test: I started a TiDB cluster with 2 TiDB instances and ran the test program below for more than 2 hours; it did not fail.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strconv"
	"sync"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

var (
	dsns = []string{
		"root:@tcp(127.0.0.1:4000)/test?charset=utf8&parseTime=True",
		"root:@tcp(127.0.0.1:4001)/test?charset=utf8&parseTime=True",
	}
)

func main() {
	for {
		ctx, cancel := context.WithCancel(context.Background())
		wg := sync.WaitGroup{}
		for i := range dsns {
			wg.Add(1)
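			// Pass i as an argument so each goroutine gets its own copy
			// (before Go 1.22 the loop variable is shared across iterations).
			go func(name string, idx int) {
				defer wg.Done()
				run(ctx, name, idx)
			}(dsns[i], i)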
		}

		time.Sleep(3 * time.Second)
		cancel()
		wg.Wait()
		for _, item := range dsns {
			check(item)
		}
		fmt.Println("success")
	}
}

func check(dsn string) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	c, err := db.Conn(context.Background())
	if err != nil {
		panic(err)
	}
	defer c.Close()

	if _, err = c.ExecContext(context.Background(), "admin flush bindings"); err != nil {
		panic(err)
	}

	rows, err := c.QueryContext(context.Background(), "select count(*) from mysql.bind_info where status != 'deleted' and status != 'builtin'")
	if err != nil {
		panic(err)
	}
	count := 0
	// Scan only if a row is available; a failed Next is reported by rows.Err.
	if rows.Next() {
		if err := rows.Scan(&count); err != nil {
			panic(err)
		}
	}
	if err := rows.Err(); err != nil {
		panic(err)
	}
	rows.Close()

	rows, err = c.QueryContext(context.Background(), "show global bindings")
	if err != nil {
		panic(err)
	}
	n := 0
	for rows.Next() {
		n++
	}
	if err := rows.Err(); err != nil {
		panic(err)
	}

	if count != n {
		panic(fmt.Sprintf("unexpected count, %d, %d, %s\n", count, n, dsn))
	}
}

func run(ctx context.Context, dsn string, index int) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	wg := sync.WaitGroup{}
	for i := 0; i < 1; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()

			c, err := db.Conn(context.Background())
			if err != nil {
				panic(err)
			}
			defer c.Close()

			name := "t" + strconv.Itoa(index*1000+i)
			if _, err := c.ExecContext(context.Background(), fmt.Sprintf("create table if not exists %s(a int, b int, index idx(a))", name)); err != nil {
				panic(err)
			}

			for {

				ss := []string{
					fmt.Sprintf("create global binding for select * from %s using select * from %s;", name, name),
					fmt.Sprintf("drop global binding for select * from %s", name),
				}
				for _, s := range ss {
					select {
					case <-ctx.Done():
						return
					default:
						if _, err := c.ExecContext(context.Background(), s); err != nil {
							panic(err)
						}
					}
				}
			}

		}(i)
	}
	wg.Wait()
}

Side effects

  • DBAs must NOT delete the builtin row in mysql.bind_info; otherwise the locking mechanism breaks.

Release note

  • Synchronize concurrent operations on mysql.bind_info from multiple TiDB instances

@eurekaka eurekaka requested a review from a team as a code owner December 10, 2020 04:01
@eurekaka eurekaka requested review from hanfei1991 and removed request for a team and hanfei1991 December 10, 2020 04:01
@ichn-hu ichn-hu mentioned this pull request Dec 10, 2020
@github-actions github-actions bot added the sig/planner SIG: Planner label Dec 10, 2020
@zz-jason
Member

zz-jason commented Dec 10, 2020

Breaking backward compatibility: the mysql.bind_info is redesigned, the upgrade utility brutally drops the previous table and creates a new table, i.e., the previous bindings would be lost without a backup.

@eurekaka This side effect is unacceptable. It has a huge impact on the stability of TiDB applications that rely on SQL binding. Could you migrate the old bindings to the new table?

@eurekaka
Contributor Author

@zz-jason

An alternative is to back up this table internally, i.e., save the old records before dropping mysql.bind_info and insert them into the new table in this upgrade utility. However, the inserts may fail if a previous original_sql is longer than VARCHAR(512), so I have not implemented this logic, at least for now.

How to handle the possible insert error?

@eurekaka eurekaka force-pushed the sync_bind_info branch 3 times, most recently from 491192b to 28a2a07 Compare December 14, 2020 10:09
@eurekaka
Contributor Author

I re-implemented the locking mechanism by introducing a builtin row into mysql.bind_info (thanks to @qw4990 for the suggestion). PTAL.

Contributor

@qw4990 qw4990 left a comment

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Dec 15, 2020
Contributor

@Reminiscent Reminiscent left a comment

LGTM

@ti-srebot
Contributor

@Reminiscent, Thanks for your review. The bot only counts LGTMs from Reviewers and higher roles, but you're still welcome to leave your comments. See the corresponding SIG page for more information. Related SIG: planner(slack).

@Reminiscent
Contributor

@eurekaka Please help merge. Thanks!

Member

@winoros winoros left a comment

lgtm

@ti-srebot ti-srebot removed the status/LGT1 Indicates that a PR has LGTM 1. label Dec 17, 2020
@ti-srebot ti-srebot added the status/LGT2 Indicates that a PR has LGTM 2. label Dec 17, 2020
@winoros
Member

winoros commented Dec 17, 2020

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 17, 2020
@ti-srebot
Contributor

/run-all-tests

@ti-srebot
Contributor

cherry pick to release-4.0 in PR #21868

Labels
epic/sql-plan-management sig/planner SIG: Planner status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/bugfix This PR fixes a bug.
Development

Successfully merging this pull request may close these issues.

SPM: multiple tidb instances sync bindings failed