Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking CheckMySQL and Query Engine history #11885

Closed
GuptaManan100 opened this issue Dec 6, 2022 · 8 comments · Fixed by #11895
Closed

Tracking CheckMySQL and Query Engine history #11885

GuptaManan100 opened this issue Dec 6, 2022 · 8 comments · Fixed by #11895
Assignees

Comments

@GuptaManan100
Copy link
Member

There have been multiple failures in CheckMySQL code that cause it to get stuck. This issue documents the history of all the changes that have been made to this codepath to better understand the intent and the evolution of the code over time.

@GuptaManan100
Copy link
Member Author

GuptaManan100 commented Dec 6, 2022

The story is going to start at the beginning of release-6.0 (as it stands now).

In this release the state-manager didn't exist and the tabletserver used to handle the state changes directly. To do this it had a field called state. This was an enum which took 5 values. These are listed below with their purpose -

  1. StateNotConnected - state where tabletserver is not connected to an underlying mysql instance.
  2. StateNotServing - state where tabletserver is connected to an underlying mysql instance, but is not serving queries.
  3. StateServing - state where queries are allowed.
  4. StateTransitioning - transient state indicating that the tabletserver is tranisitioning to a new state. No requests are allowed in this state
  5. StateShuttingDown - indicates that the tabletserver is shutting down. In this state, we wait for outstanding requests and transactions to conclude.

SetServingType takes in a boolean and we set the state to StateTransitioning if we want to change the state to Serving and we set the state to StateShuttingDown when we want to go to Nonserving.
If the state is ShuttingDown or Transitioning we reject the SetServingType request.
Based on the current state and the desired serving state, we have 4 actions that are taken -

  1. actionNone - This does nothing.
  2. actionFullStart - Opens the query engine and schema engine, initializes the tx engine and heartbeat writer, and calls the action for actionServeNewType. In case of an error close everything.
  3. actionServeNewType - This does some operations which are specific to being a primary or replica like setting read-write, whether heartbeat reader or writer should be open etc. Also opens the transaction throttler for the primary. Changes the state to serving in the end. In case of an error closes everything.
  4. actionGracefulStop - This calls tsv.waitForShutdown which closes the transaction engine, kills the stream queries, and then waits for requests to be empty. (Does not close the query engine). Sets the state to NotServing in the end.

Requests handling.
Since the state is kept on the tabletserver, it is directly accessible in the StartRequest. This function doesn't allow any new requests to go through if the state is NotServing. (Some exceptions for Shutdown state and allowonShutdown which comes from the outside)
Does a little more validation for the shard, keyspace and the tablet type and then adds 1 to the requests wait-group.
Requests allowed while in Shutdown state -

  1. ExecuteBatch if it is in a transaction
  2. Commit, CommitPrepared, Rollback, RollbackPrepared
  3. ConcludeTransaciton, CreateTransaction, SetRollback, StartCommit (This looks like 2pc)
  4. Execute if it is in a transaction
  5. Prepare
  6. ReadTransaction (I don't know what this is)

It is worth noting here that the requests from Execute and ExecuteBatch that are permitted only go to the tx engine to be executed. That engine is closed gracefully in waitForShutdown giving the transactions the configured time before killing them. No new transactions are started since Begin is not permitted.

CheckMySQL
Starts by transitioning to the Shutdown state. It then calls waitForShutdown followed by closing the query engine, schema engine etc and finally transitions to StateNotConnected.

Key Takeaways

  • We transition to StateShuttingDown when we want to go to StateNotConnected (from CheckMySQL) or StateNotServing (from SetServingType, intern called from DemotePrimary etc). In this state we do not allow any new reads or writes to happen outside of a transaction. We don't even start a new transaction.
  • waitForShutdown just allows the existing transaction to continue running until the graceful time-period expires and then kills them. It waits on the requests, but it is safe since no new reads are allowed. Once the transactions are killed there are no new requests allowed, so this wait will not block indefinitely (hopefully). (the intention is clear here).
  • We don't allow any requests in the StateTransition phase. So going from Primary to Replica or vice versa momentarily blocks all the requests.
  • query engine only closes on errors in SetServingType (see ☝️ for more details) or when we go the state of NotConnected. So even when we go to a not serving state, we don't close the query engine. It is important to note that we stop the requests we don't want to allow much higher up the stack in startRequest.

UPDATE
From what I can tell, once we are in the StateNotConnected, we go back to StateServing only via a RPC call or the health check. This is in runHealthCheckLocked which calls SetServingType if we should be serving. Honestly I don't fully understand this code to say for sure it brings the tablet to a serving state after MySQL is up. Maybe it does 🤷‍♂️. I personally think it probably would, since the replication would be healthy at some point after MysQL is up and then this would run.

@harshit-gangal
Copy link
Member

This is great.

Regarding

ReadTransaction (I don't know what this is)

This is again for 2PC

@GuptaManan100
Copy link
Member Author

GuptaManan100 commented Dec 6, 2022

A major change that happened pertaining to this code was in #6396. Let's take a look at release-7.0 which includes these changes next.

This PR intends to accomplish 2 things -

  1. Introduce a state_manager component to abstract out the state management capabilities from the tabletserver directly.
  2. Remember the desired state and continuously retry if the first attempt fails instead of closing everything and waiting for user intervention.

The newly created state-manager changed the way it manages the state. Instead of keeping just the current state of the tablet, it stores two of them, the correct state state and the desired state wantState.

The states themselves are restricted. The states StateShuttingDown and StateTransitioning are removed. It is good to know however, that the state-space itself increases, since we have 2 variables, so we have 9 possibilities of the states.

tabletserver's SetServingType is now a passthrough to state manager's (from here on sm) SetServingType with a little code to change boolean to an enum for state.
SetServingType itself changes considerably. For the tablet types RESTORE and BACKUP we explicitly override the requested state to StateNotConnected. It then calls mustTransition which stores the requested state as the desired state and checks if a transition is required or not.
We then reach the execTransition which does the bulk of the work. In the end in case of an error, there is code to retryTransition. There is enough protection with the mutexes to ensure there is only one of mustTransition, execTransition and retryTransition running at any given time and there is atmost 1 running go-routine for recheckState.

Instead of the previous architecture of having different actions which themselves also checked for the tablet type inside, we now just have 4 separate functions, one for each combination for the serving state requested and the tablet type -

  1. servePrimary - Calls sm.connect, starts accepting read-write, and then sets the actual state in the sm.
  2. serveNonPrimary - Shuts down the messager, tracker and then calls sm.connect, starts accepting read-only and then sets the actual state.
  3. unservePrimary - Calls unserveCommon, sm.connect and then sets the actual state.
  4. unserveNonPrimary - very similar to unservePrimary. Different only in schema-engine, watcher and replication tracker. None of these differences are important for us.

There is also one case for StateNotConnected. In this case we call closeAll. Keep reading on for more details on this function.

Let's look at some of the common functions called in more detail -

  1. sm.connect checks if MySQL is reachable, then starts the schema engine, vstreamer, query engine and transaction throttler. This can't be stressed enough, we call this function in all the 4 funcions above, so we always open the query-engine when we transition to Serving or Not serving state. (If MySQL is reachable, otherwise we error out sooner).
  2. setState is a small function which does what the name suggests. It sets the current state of the state manager. The only interesting thing here that is noteworthy for our discussion is the part where it has code to transition to Primary type while still allowing some queries that were targeted at the old type to succeed. This is done by the alsoAllow field which is used in startRequest. Essentially the idea is that if a read was targeted at a replica, it shouldn't fail due to a transition to primary, but the other way around is not supported (You don't want to run a write on a replica!).
  3. unserveCommon - This closes the transaction engine, messager, and tracker. It stops query service by killing the streaming queries and then waits for the requests to be empty. The comment explicitly says, StopServing kills all streaming queries. Other queries are handled by the tsv.requests Waitgroup.. Even here, we don't close the query engine, and the intention is to block the queries in the StartRequest like before I believe.

Requests Handling
The StartRequest has also been considerably reworked. This is required since the states have changed and we no longer have a StateShutdown, which was used extensively in the previous architecture to allow the running transactions to complete.
The first check is to say if the current state is not serving or if the replication is not healthy, we don't allow any queries. This makes sense to me, since we have only 3 states. We don't allow queries when the current state is NotServing or NotConnected. Next we assume that we are shutting down if the wantState is not Serving. shuttingDown := sm.wantState != StateServing. This also makes sense. If we are trying to reach a different state than the ServingState, we are in the process of shutting down.
From here on, it remains the same as before. We use this shuttingDown variable in tandem with the allowOnShutdown. We also use the target and alsoAllow to restrict some more requests.
Apart from the ones listed in the previous message some more requests are allowed in shutdown -

  1. Release (There is code to error out if transactionID and reservedID are both nil)
  2. ReserveExecute, ReserveStreamExecute, and StreamExecute - Only in transactions

CheckMySQL
Checks if MySQL is reachable. If it's not, then calls closeAll. Finally it retries the transition. (That exits early if no retry required). In closeAll, we call unserveCommon, and then close the transaction throttler, query engine, watcher, vstreamer, replication tracker, schema engine and finally set the state to NotConnected.

Going back to serving from NotConnected
There were a lot of changes to #6155, #6311, and #6461. The last notably removed the health check code. I think this was taken over by the replication manager which would try to fix replication when the MySQL is back. When it does, it calls setPrimaryRepairReplication which eventually calls ChangeTabletType which eventually calls SetServingType.

🚨🚨🚨 BUG ALERT
I think there is a bug here! We don't change the wantState in CheckMySQL in the beginning! Currently that only happens in mustTransition, but we don't call that! So what happens is that CheckMySQL is started but the wantState is still StateServing. We execute unserveCommon and then wait for the requests to be empty. But normal read queries aren't blocked! They succeed since the wantState and the currentState are both StateServing!
This will also cause the retryTransition in the end of CheckMySQL to have a weird effect! Since the wantState is not changed, the retry is actually trying to start the query service again instead of closing it!

@GuptaManan100
Copy link
Member Author

GuptaManan100 commented Dec 6, 2022

#7011 makes a few more changes to how query killing works. Transitioning in transaction engine and fixes a race in the Begin code path. I am not going into too much detail for this since it's not super relevant to our discussion.

@GuptaManan100
Copy link
Member Author

GuptaManan100 commented Dec 6, 2022

cc - @harshit-gangal @deepthi All in a days work ☝️. Now we know what the intention was and what part is the actual bug. We want to keep the query engine open. But we need to block the queries in the StartRequest function. We don't need a new state, but the issue is that we aren't setting the correct want state in CheckMySQL.
The current state diagram for CheckMySQL looks like this -

flowchart LR
    state1(WantState=Serving\nState=Serving)-->state2(Kill Queries\nClose Transaction Engine)
    state2-->state3(Wait For Requests to be empty)
    state3-->state4(WantState=Serving\nState=NotConnected)
    state4-->state5(retryTransition)
Loading

This is obviously wrong because, during the wait for requests, we aren't preventing new requests from coming in.
The correct way would look something like this -

flowchart LR
    state1(WantState=Serving\nState=Serving)-->state1b(WantState=NotConnected\nState=Serving)
    state1b-->state2(Kill Queries\nClose Transaction Engine)
    state2-->state3(Wait For Requests to be empty)
    state3-->state4(WantState=NotConnected\nState=NotConnected)
    state4-->state4b{WantState=Serving\nState=NotConnected}
    state4b-->state5(retryTransition)
Loading

I am not sure about the state marked in a diamond. The intent for the retry in the original PR is still unclear to me. Do we want to retry getting into the NotConnected state or do we actually want to spawn a go-routine that tries to take us back to the serving state?
Since closeAll doesn't return any errors and it is guaranteed to end up calling sm.setState(topodatapb.TabletType_UNKNOWN, StateNotConnected) for sure before we retry, I kinda believe that we want the latter because otherwise, the retry does nothing ever.

@harshit-gangal
Copy link
Member

@GuptaManan100 This is really good.

If I have understood correctly, then the change is from checkMySQL we would call SetServingType with StateNotConnected. Once we ensured that the state is now StateNotConnection, we will call the SetServingType again with StateServing and exit the checkMySQL.

@harshit-gangal
Copy link
Member

Another thing I would like to point out is that in the state_manager closeAll method there is a code for timebomb

func (sm *stateManager) closeAll() {
	defer close(sm.setTimeBomb())
        ....
}        

That ensures that close should happen within this certain period of time otherwise it will crash vttablet which will eventually lead to the restart of vttablet pod / VTOrc kick-in

Currently, it is disabled as the default time is set to 0 which means unlimited wait.

@GuptaManan100
Copy link
Member Author

GuptaManan100 commented Dec 7, 2022

#11900, #11901 and #11902 are the back ports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants