Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

packets: discard connection on zero sequence ID #1471

Closed
wants to merge 1 commit into from

Conversation

owbone
Copy link
Contributor

@owbone owbone commented Aug 8, 2023

Description

MySQL 8 introduced some error packets that can be sent to otherwise idle connections, notably ER_CLIENT_INTERACTION_TIMEOUT and ER_DISCONNECTING_REMAINING_CLIENTS. In rare circumstances, or in extremely busy applications, these packets can be sent by the server at the same instant that the client sends a command. This results in the client receiving a packet with a zero sequence ID when trying to read the result of a command.

Currently, when the client receives a sequence ID that it doesn't expect then it returns a ErrPktSync error and leaves the connection open, since it normally indicates that the user is trying to run a command at the wrong time (e.g. before having finished reading the results of a previous query) rather than the connection having gone bad.

However, in the case of these errors, the connection has gone bad, since we expect that the server has closed the connection and there is unread data left in the buffer. Since a connection in this state is not closed then it gets returned to the pool, and when it gets picked up from the pool then many commands will fail with ErrBusyBuffer (because the buffer still contains the unread error packet), and it will return to the pool again. This can turn a single error packet into thousands and thousands of errors, as the connections are reused again and again until they eventually timeout.

To add some perspective, GitHub recently encountered 146 connections which failed with ErrPktSync due to a replica forcibly disconnecting clients (ER_DISCONNECTING_REMAINING_CLIENTS). These 146 connections resulted in around 1.3 million request failures with ErrBusyBuffer before they timed out within 30 seconds.

Ideally, the client should be able to read these error packets and handle them properly, but doing so is a larger project. This change instead makes a relatively small and uninvasive fix to add special handling for zero sequence IDs. With this, when we receive a packet with a zero sequence ID then we always assume that the server has sent a terminal error packet, close the connection, and return ErrInvalidConn.

Closes #1394.

MySQL 8 introduced some error packets that can be sent to otherwise idle
connections, notably `ER_CLIENT_INTERACTION_TIMEOUT` and
`ER_DISCONNECTING_REMAINING_CLIENTS`. In rare circumstances, or in
extremely busy applications, these packets can be sent by the server at
the same instant that the client sends a command. This results in the
client receiving a packet with a zero sequence ID when trying to read
the result of a command.

Currently, when the client receives a sequence ID that it doesn't expect
then it returns a `ErrPktSync` error and leaves the connection open,
since it normally indicates that the user is trying to run a command at
the wrong time (e.g. before having finished reading the results of a
previous query) rather than the connection having gone bad.

However, in the case of these errors, the connection has gone bad, since
we expect that the server has closed the connection and there is unread
data left in the buffer. Since a connection in this state is not closed
then it gets returned to the pool, and when it gets picked up from the
pool then many commands will fail with `ErrBusyBuffer` (because the
buffer still contains the unread error packet), and it will return to
the pool again. This can turn a single error packet into thousands and
thousands of errors, as the connections are reused again and again
until they eventually timeout.

To add some perspective, GitHub recently encountered 146 connections
which failed with `ErrPktSync` due to a replica forcibly disconnecting
clients (`ER_DISCONNECTING_REMAINING_CLIENTS`). These 146 connections
resulted in around 1.3 million request failures with `ErrBusyBuffer`
before they timed out within 30 seconds.

Ideally, the client should be able to read these error packets and
handle them properly, but doing so is a larger project. This change
instead makes a relatively small and uninvasive fix to add special
handling for zero sequence IDs. With this, when we receive a packet with
a zero sequence ID then we always assume that the server has sent a
terminal error packet, close the connection, and return
`ErrInvalidConn`.
@methane
Copy link
Member

methane commented Aug 8, 2023

With this, when we receive a packet with a zero sequence ID then we always assume that the server has sent a terminal error packet, close the connection, and return ErrInvalidConn.

I am not sure that this is safe assumption.

@owbone
Copy link
Contributor Author

owbone commented Aug 8, 2023

I am not sure that this is safe assumption.

You're right. The assumption breaks if the sequence ID wraps around.

I have two other options in mind:

  1. Should we just always close the connection on ErrPktSync or ErrPktSyncMul? It seems that the connection is always stuck in a broken state on those two errors since we always read the packet header without reading the payload. Unless I'm missing something, it doesn't seem possible to recover the connection from those errors right now. This PR was intended to be a less invasive version of that. (Implemented in 3f0ea34.)

  2. If not, we can just return false from IsValid() if the buffer is not empty, which will prevent the connection with "busy buffer" from being returned to the pool. (Implemented in afbf75a.)

@methane
Copy link
Member

methane commented Aug 9, 2023

The assumption breaks if the sequence ID wraps around.

Not only it. MySQL protocol is used by not only MySQL.
There are no guarantee that other servers doesn't have a bug that sends sequence id 0 for error packets.

  1. Should we just always close the connection on ErrPktSync or ErrPktSyncMul?

I think we should not reuse such connection even if we read the packet payload.

@owbone
Copy link
Contributor Author

owbone commented Aug 9, 2023

I think we should not reuse such connection even if we read the packet payload.

Ok, I'll open a separate PR for the first option then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Connections stuck with "busy buffer" after ErrPktSync
2 participants