Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: infinite loop in tmState.ChangeTabletType #9838

Closed
mvh-stripe opened this issue Mar 7, 2022 · 2 comments
Closed

Bug Report: infinite loop in tmState.ChangeTabletType #9838

mvh-stripe opened this issue Mar 7, 2022 · 2 comments

Comments

@mvh-stripe
Copy link
Contributor

Overview of the Issue

		// Update the tablet record first.
		_, err := topotools.ChangeType(ctx, ts.tm.TopoServer, ts.tm.tabletAlias, tabletType, PrimaryTermStartTime)
		if err != nil {
			log.Errorf("error in changing type in tablet record in topo for tablet %s :- %v", topoproto.TabletAliasString(ts.tm.tabletAlias), err)
			log.Errorf("will keep trying to read from the topo server")
			// In case of a topo error, we aren't sure if the data has been written or not.
			// We must read the data again and verify whether the previous write succeeded or not.
			// The only way to guarantee safety is to keep retrying read until we succeed
			for {
				ti, errInReading := ts.tm.TopoServer.GetTablet(ctx, ts.tm.tabletAlias)
				if errInReading != nil {
					time.Sleep(100 * time.Millisecond)
					continue
				}
                                 ....
                         }

8b3f61f
^ A recent change introduced the possibility of an infinite loop when the context gets canceled. In the for loop, we never check if the context has an error, and so it keeps trying to call topoServer.GetTablet every 100ms and never exit. This is bad because this goroutine also holds the mutex for the entire duration. So retries of the ChangeTabletType rpc also get stuck on the mutex.

Reproduction Steps

N/A

Binary Version

N/A

Operating System and Environment details

N/A

Log Fragments

No response

@mvh-stripe mvh-stripe added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Mar 7, 2022
@deepthi deepthi added Component: Cluster management and removed Needs Triage This issue needs to be correctly labelled and triaged labels Mar 7, 2022
@deepthi
Copy link
Member

deepthi commented Mar 7, 2022

/assign @vitessio/cluster-management

@GuptaManan100
Copy link
Member

Fixed by #9842

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants