
Server timeouts broken since #302 #657

Closed · vmg opened this issue Sep 6, 2017 · 24 comments

Comments

@vmg (Contributor) commented Sep 6, 2017

Hey everyone!

After more than 2 years of being blocked, PR #302 was merged just two weeks ago. It purportedly fixes issues where queries were being wrongly "retried" (i.e. run twice by the database/sql client even though the original query went through). To accomplish this, it reduces the number of places where the driver returns a driver.ErrBadConn error: it only returns this value when we're 100% sure that the database has not received the original query, because returning ErrBadConn forces the database/sql client to retry the query altogether on a different connection, hence leading to redundant/repeated queries.

In practice, however, this breaks the driver in any situation where the MySQL server can kill connections on the server side. Let's analyse why and try to find a workaround for this issue.

Reproduction

We're gonna make the driver crash with a pretty simple reproduction recipe.

First, we need a MySQL server configured to time out idle connections. We're gonna use MySQL 5.7 and simply run SET GLOBAL wait_timeout=3;. This will make mysqld kill any connection that hasn't been active (i.e. sending queries) for 3 seconds. This whole example also works with any maintained MySQL release, including 5.6, and all MariaDB releases. The 3 second timeout is a bit aggressive, but it's similar to what we run in our main MySQL cluster here at GitHub.
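For completeness, the same setting can be applied from test code instead of the mysql shell. The helper below is my own illustration (not part of the original report); it assumes the database/sql and fmt imports and a MySQL user allowed to run SET GLOBAL:

// setWaitTimeout shortens the server-side idle timeout so the reproduction
// below triggers quickly. Illustrative only; not part of the original report.
func setWaitTimeout(db *sql.DB, seconds int) error {
    _, err := db.Exec(fmt.Sprintf("SET GLOBAL wait_timeout = %d", seconds))
    return err
}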

With this MySQL server in place, we can reproduce the issue with a few lines of Go code, using the master branch of go-sql-driver/mysql:

// Uses database/sql, time, testing and github.com/stretchr/testify/assert,
// plus a blank import of github.com/go-sql-driver/mysql at master.
db, err := sql.Open("mysql", dsn)
assert.NoError(t, err)
defer db.Close()

assert.NoError(t, db.Ping())

// Wait for 5 seconds. This should be enough to time out the conn, since `wait_timeout` is 3s.
time.Sleep(5 * time.Second)

// Simply attempt to begin a transaction.
tx, err := db.Begin()
assert.NoError(t, err)
_ = tx // the Begin call itself is what fails; tx is never used

The result of pinging the DB, sleeping for 5 seconds, then attempting to begin a transaction, is a crash:

Error:          Received unexpected error: invalid connection

It is clear that this is not the right behaviour. The mysqld instance is still up throughout the test, so even if the initial connection got killed by the server, calling db.Begin() should create a new connection instead of returning an error. This was the behaviour before #302 was merged.

So, what's the underlying issue here, and why is database/sql not retrying the connection after the changes in #302?

Deep Dive

The changes in #302 are sensible and appear correct when reviewed. The "most important" modification happens in packets.go, in the code to handle writes:

mysql/packets.go

Lines 121 to 147 in 26471af

n, err := mc.netConn.Write(data[:4+size])
if err == nil && n == 4+size {
    mc.sequence++
    if size != maxPacketSize {
        return nil
    }
    pktLen -= size
    data = data[size:]
    continue
}

// Handle error
if err == nil { // n != len(data)
    mc.cleanup()
    errLog.Print(ErrMalformPkt)
} else {
    if cerr := mc.canceled.Value(); cerr != nil {
        return cerr
    }
    if n == 0 && pktLen == len(data)-4 {
        // only for the first loop iteration when nothing was written yet
        return errBadConnNoWrite
    }
    mc.cleanup()
    errLog.Print(err)
}
return ErrInvalidConn

The PR adds line 140, the `if n == 0 && pktLen == len(data)-4` check. The rationale is as follows:

  1. Any interaction (i.e. query, exec) from the driver to the MySQL server happens in two phases: First, we write the packet with the query/exec command to the socket connected to the server. If this write succeeds, the server must write back an "answer" packet, so we immediately read the response from the socket.

  2. If this write doesn't succeed, the connection has gone bad (maybe the server killed it?), but we know for sure that the server has not received our query, so we can return errBadConnNoWrite, which gets translated further down the driver into driver.ErrBadConn and forces database/sql to retry the query with another connection (sketched right after this list).

  3. However: If the write succeeds but the follow-up read from the socket fails (i.e. we never receive the "answer" packet from MySQL), we cannot return ErrBadConn, because the server has received our query and may have executed it even though we received no answer. We cannot let database/sql retry this query.
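For context, here is roughly the shape of the retry loop database/sql wraps around driver.ErrBadConn, as referenced in point 2 above. This is my own simplified sketch of the standard library's behaviour (Go 1.8/1.9 era), not its literal code; maxBadConnRetries is a real constant there (2 at the time), while retryOnBadConn and its attempt callback are stand-ins. Only the database/sql/driver import is needed:

// Simplified sketch of database/sql's retry strategy around driver.ErrBadConn.
// attempt(true) runs the operation on a cached/pooled connection;
// attempt(false) forces a freshly dialed one.
const maxBadConnRetries = 2

func retryOnBadConn(attempt func(reuseConn bool) error) error {
    var err error
    for i := 0; i < maxBadConnRetries; i++ {
        err = attempt(true)
        if err != driver.ErrBadConn {
            return err
        }
    }
    // Every pooled connection came back bad: make one last attempt on a
    // brand-new connection.
    return attempt(false)
}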

This is perfectly sensible, but the previous reproduction recipe shows it doesn't work in practice. In our code, the following sequence of actions happens:

  1. We connect to the MySQL server (sql.Open)
  2. We ping the MySQL server by sending a "ping" packet
  3. We receive a "pong" packet response from MySQL -- db.Ping returns successfully
  4. We sleep for 5 seconds
  5. After just 3 seconds, the MySQL server kills our driver's only open connection
  6. We wake up from sleep and attempt to begin a transaction, by execing a START TRANSACTION
  7. We write an exec packet to the server with START TRANSACTION. The write succeeds
  8. We then read the response packet from MySQL: this read fails, because MySQL killed our connection -- but now the driver cannot return ErrBadConn to retry this operation, because our initial write succeeded! The server may have executed START TRANSACTION! So the driver's only option is to fail hard.

When written like this, the problem becomes really clear. #302 made an assumption about the TCP protocol that doesn't hold in practice: that after the MySQL server has killed our connection, trying to write a packet from the driver to the server will return an error. That's not how TCP works: let us prove it by looking at some packet captures. 😄

These are steps 1 through 8 as seen in Wireshark:

[Screenshot: Wireshark capture of the reproduction, 2017-09-06 16:43:42]

The beginning is straightforward. We open the connection with a TCP handshake, we send a ping packet and we get a reply back (you can actually see two pings + two replies in that capture). We then go to sleep.

...And just 3 seconds later, in frame 14, the MySQL server enacts the timeout we've configured. Our TCP connection gets closed on the server side and we receive a [FIN, ACK] pair... To which our Golang client simply responds with an ACK! There's no FIN response from the driver because we haven't closed our side of the connection, as we're sleeping.

Seven seconds later, at T+10 (frame 18), we wake up from our sleep and attempt to exec a START TRANSACTION. And (here's what throws us off), the Linux kernel accepts the write just fine. It returns no error to Golang, because this is how TCP works. TCP is a bidirectional protocol, and our side of the connection is still open, so the write goes through.

Of course, the MySQL server immediately replies -- with a RST. "Dude, this connection is over". But it's too late for our driver: the write of the exec packet already succeeded, and with this new logic, we assume that the query may have been executed by the server (even though the MySQL server was not even listening on our socket anymore).

And meanwhile, the kernel has received the RST packet from the MySQL server. The next time we attempt to read from our connection, we will definitely crash. But with this new logic, a retry won't be attempted! This is bad news, because our original exec was not even close to making it to the server -- let alone being executed.

What are the potential fixes?

There is, in theory, a "right way" to handle this general case of the server closing a connection on us: performing a zero-size read from the socket right before we attempt to write our query packet should let us know about that FIN packet we've received: "MySQL has closed this connection, buddy. Your side is open, so you can still write, and the kernel will swallow it... But it's pointless because MySQL isn't reading anymore and -- even if it were -- the FIN means it wouldn't be able to reply".

Here's the issue: the "zero read" semantics from Golang are a bit fuzzy right now. This commit from bradfitz special-cases this situation so the read never makes it to the kernel; before that, the read would always return io.EOF.
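To make that concrete, here is an illustrative helper (my own sketch, not code from the driver) showing why the "zero read" idea falls flat on current Go; it assumes the net and io imports:

// peerClosed tries to detect a server-side close with a zero-length read.
// On Go 1.7+ this never works: Read(make([]byte, 0)) returns (0, nil) without
// ever reaching the kernel, so a FIN that has already arrived stays invisible.
func peerClosed(conn net.Conn) bool {
    n, err := conn.Read(make([]byte, 0))
    return n == 0 && err == io.EOF // never true on Go 1.7+
}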

This closed issue in golang/go has a very similar analysis to this one (at the TCP level -- not really related to MySQL), with several Golang core devs chiming in.

Ian Lance Taylor comments:

I don't see what Go can do here. Try using "go test -c" to build your test into an executable and then run "strace -f EXECUTABLE". You will see that the final write system call succeeds and the kernel reports it as succeeding. Go is just returning that success to the caller. The same thing would happen for a C program.

At the TCP level, you have closed one side of the TCP connection but not the other. The kernel will accept writes until both sides are closed. This is not as simple a matter as you seem to be expecting. I found a decent description of the issue at http://stackoverflow.com/questions/11436013/writing-to-a-closed-local-tcp-socket-not-failing .

The "final write system call" that Ian talks about here is the same as our "trying to write START TRANSACTION after our sleep, and indeed, it succeeds at the kernel level.

Brad Fitzpatrick wraps up the issue with a comment on zero-size reads from Golang:

FWIW, the 0-byte-Read-always-returns-EOF was changed in Go 1.7 with 5bcdd63. But as of Go 1.7 it now always returns nil, which is at least a bit more Go-like, but still doesn't tell you whether FIN was ever seen arriving. There aren't great APIs anywhere (especially not portably) to know that without reading data.

Another, still open, issue in golang/go explains why there is still no way to check whether the socket is good for reading (and hence whether it makes sense to write to it).

In Conclusion

I hope this in-depth analysis wasn't too boring, and I also hope it serves as definitive proof that the concerns that some people (@xaprb, @julienschmidt) had on the original PR are indeed very real: we tried rolling out the master branch of this driver in one of our Golang services and it's unusable with any of our MySQL clusters here at GitHub.

As for a conclusive fix: I honestly have no idea. I would start by reverting the PR; I think the risk of duplicated queries is worrisome, but the driver as it is right now is unusable for large MySQL clusters that run tight on connections and do pruning.

Doing a Ping before attempting new queries on connections that could be "stale" seems like a feasible workaround. But how do we mark such connections as stale? What's the timeframe? Should we let the users configure it? What's the overhead going to be like?

I would love to hear from the maintainers about approaches to fixing this. I have time and, obviously, an interest in using the master branch of this driver in production, and I'll gladly write & PR any fix for this issue that you deem acceptable.

Cheers!

@methane (Member) commented Sep 6, 2017

First, we need a MySQL server configured to timeout idle connections. We're gonna use MySQL 5.7 and simply set global wait_timeout=3;.

First of all, you should use DB.SetConnMaxLifetime() instead of wait_timeout.
Closing connections from the client is always better than closing them from the server, because
the client may send a query just as the server starts closing the connection. In that case,
the client can't know whether the sent query was received or not.

Here's the issue: the "zero read" semantics from Golang are a bit fuzzy right now. This commit from bradfitz special-cases this situation so the read never makes it to the kernel; before that, the read would always return io.EOF.

Yes, I reached the same conclusion. And that's why I strongly recommend always using SetConnMaxLifetime
with a short lifetime.

If your MySQL cluster's wait_timeout can't be longer than 3 seconds, I think you should use
a timeout of 1 second or less. In a real-world application processing thousands of requests per second,
1 second is long enough if the servers are on the same LAN.
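For reference, the suggested workaround looks roughly like this in client code. The values are illustrative only (not from this thread); the snippet assumes the database/sql, log and time imports plus a blank import of the driver:

db, err := sql.Open("mysql", dsn)
if err != nil {
    log.Fatal(err)
}
defer db.Close()

// Recycle pooled connections well before the server-side wait_timeout
// (3s in the reproduction above), so the client is always the side that
// closes idle connections.
db.SetConnMaxLifetime(1 * time.Second)
db.SetMaxOpenConns(10)
db.SetMaxIdleConns(10)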

@vmg (Contributor, Author) commented Sep 6, 2017

First of all, you should use DB.SetConnMaxLifetime() instead of wait_timeout.

Thank you for the suggestion! That seems like a good workaround. I'm afraid it's not up to me how our MySQL clusters handle idle connections. We have many different clusters with different capacities and many different languages/clients connecting to them, so connections are gonna get killed -- either "statically" through wait_timeout or dynamically by monitoring processes. I would definitely appreciate it if the Golang driver could handle killed connections gracefully by retrying them. 😅

I think using an aggressive timeout on SetConnMaxLifetime is gonna work around the issue for now, but I'd love to discuss a proper long term fix so the driver always retries connections. I think we can all agree that the current behaviour as of #302 is not "correct" even if we can work around it by having the driver timeout connections before the server does.

@methane (Member) commented Sep 6, 2017

Would you file an issue on golang/go?
I think the proper solution is adding a new error that means "bad connection, but not safe to retry".

Adding a new method on TCP (and maybe Unix socket) connections to check whether the connection has been killed by the peer
would also be nice. But all of that discussion should happen in Go's forum or issue tracker.

@arnehormann (Member) commented:

@vmg thank you for the report, it's amongst the best I've seen so far.
I still disagree that #302 is not correct, but it definitely changes the driver's behaviour.

I'm in a hurry right now, but from memory:

In the end, https://golang.org/src/database/sql/sql.go?s=34900:34969#L630 is called.
What if the connection pool contains maxBadConnRetries connections which were all invisibly closed by the server and your query is retried on all of those? The query will fail and you'd get the same error.

Robust client code has to include retry logic to prevent this failure mode. This is very, very annoying, but I still see an improvement in #302 - we reduce the number of spec violations and increase the visibility of a potential problem, the one I mentioned above.

That was from memory and I might be wrong about the sql - driver - mysql interactions. If I am, I am very eager to roll back the change. If not, we should discuss how to best change the documentation or (I really don't want to) even introduce a DSN parameter for the old behaviour.

@vmg (Contributor, Author) commented Sep 8, 2017

thank you for the report, it's amongst the best I've seen so far.

You're welcome! I'm glad it's of some use!

What if the connection pool contains maxBadConnRetries connections which were all invisibly closed by the server and your query is retried on all of those? The query will fail and you'd get the same error.

No, it will not. database/sql will force a new connection if, after maxBadConnRetries, all attempts return ErrBadConn. The logic is just 3 lines above the link you posted:

https://github.com/golang/go/blob/ab40107708042ded6bdc1fb841c7cf2c2ab002ab/src/database/sql/sql.go#L1286-L1288

This final attempt uses a freshly opened connection, so it will not be stale. But, of course, to reach that logic the driver must be returning ErrBadConn -- which is not the case as of #302.

I think you may indeed be wrong about the way the driver interacts with the standard library. To me, it seems like database/sql is designed to be resilient around the case of broken connections -- but for the retry logic to work, the driver needs to play along and return ErrBadConn.

Again, I'm aware that duplicate queries are a real issue, but the fix in #302 breaks the semantics in database/sql and causes queries that were previously succeeding (i.e. being retried) in all circumstances to be reliably broken in all circumstances.

It's a big regression from my point of view. Please reconsider! Cheers!

@arnehormann (Member) commented Sep 8, 2017 via email

@arnehormann (Member) commented Sep 9, 2017 via email

@methane (Member) commented Sep 10, 2017

I think you may indeed be wrong about the way the driver interacts with the standard library. To me, it seems like database/sql is designed to be resilient around the case of broken connections -- but for the retry logic to work, the driver needs to play along and return ErrBadConn.

No. We understand the database/sql design. Before #302 we violated that design, and #302 fixes it.
See #654 and golang/go#11978 (comment)

Again, I'm aware that duplicate queries are a real issue, but the fix in #302 breaks the semantics in database/sql and causes queries that were previously succeeding (i.e. being retried) in all circumstances to be reliably broken in all circumstances.

The retry was dangerous.

@methane (Member) commented Sep 10, 2017

Checking whether the connection was closed by the peer before sending a query may increase the "successful retry" rate.
But there are still timings where we can't retry safely: the server may close the connection while we are sending the query.

So the best approach is to shorten the connection lifetime enough: close the connection from the client side before
it gets closed by the server side.

@julienschmidt (Member) commented:

One can't have automatic 'magic' retries without potentially executing queries twice.

Imagine a situation where the server wants to notify us that an auto-committed query was successful.
We have a problem if the server considers the query done but the driver never receives the notification that it succeeded. The client does not send back any kind of ACK, so it can happen that the server successfully writes the notification to the socket but it eventually gets lost on the path or in an OS buffer.

Now one could argue that good database design can prevent this. Unfortunately, not every user is a database expert, and the database/sql doc is clear that drivers should be rather "safe than sorry".

We could probably do what we always do - add a DSN flag 😆

@arnehormann (Member) commented:

#302 fixes real bugs and improves the spec conformity. Still, as diagnosed above, it comes with its own set of problems.

If we look at the situations where the driver returns ErrBadConn, we traded false positives for false negatives and gained spec conformity (see the first comments in #302). Sadly, the false negatives were pretty rare (judging from report frequency); the false positives not so much.

I don't see a sensible way to fix both false positives and false negatives. Still, as methane mentioned, there is a workaround for the problems introduced by #302. That and the spec conformity makes me want to keep the current state - with the PR merged.

I do see that this causes problems for existing clients. I dislike the introduction of a new DSN parameter as it fragments the user base and complicates triaging bug reports in the future, but we have to deal with it anyway until everyone uses the current version (never, probably). That's why I'd reluctantly agree to it - in combination with additional documentation.

Also, I'm sorry for all the confusion - esp. concerning my unfounded response earlier.

@arnehormann (Member) commented Sep 12, 2017 via email

@vmg (Contributor, Author) commented Sep 12, 2017

That sounds like a very good idea! Allowing users to customise the default BadConn error seems like the perfect trade-off.

@julienschmidt (Member) commented Sep 12, 2017

I prefer a DSN flag. Global state should be avoided when possible. Imagine, for example, two third-party packages that depend on different values of the global state to work properly.

And I don't see any changes in #302 that require a global state. Simple if-else checking of the value of an attribute of mysqlConn should also work.
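As a purely hypothetical sketch of the per-connection if/else being suggested (no such DSN parameter exists in the driver; mc.cfg is the driver's parsed config, while UnsafeRetry is an invented field name), the read-failure path could branch like this:

// Hypothetical only: when a read fails after a successful write, the old
// (retry-friendly) behaviour would be opt-in via a DSN-derived field instead
// of any global state.
if mc.cfg.UnsafeRetry { // hypothetical field parsed from a DSN flag
    return driver.ErrBadConn // old behaviour: let database/sql retry
}
return ErrInvalidConn // current behaviour since #302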

@mrsinham (Contributor) commented:

Just to add a humble opinion on the subject: #302 broke the behaviour on our systems too.

After deploying this, the daemons that use one shared *sql.DB object began to fail completely, very quickly, with invalid connections.

We are currently rolling back. IMHO the sql.DB object is a pool of connections that handles the complexity of broken connections, and it should only return "invalid connection" after x retries to reconnect.

In addition, can I suggest tagging a commit before merging heavy changes like that? Thank you very much ;)

@danielstjules commented:

Just upgraded a service to take advantage of rejectReadOnly for easy Aurora failover, and was sad to see a breaking change requiring client code to handle reconnects. The logic behind the change makes sense, but in our case we only perform reads and writes that are idempotent, so the concerns about exactly-once vs at-least-once delivery don't affect us.

Any update on this? Would a PR on a new DSN flag be accepted? :)

@methane (Member) commented Nov 22, 2017

Why don't you shorten MaxConnLifetime?
Why don't you retry by yourself?

I don't think adding complexity to this lib only for special use cases
(e.g. all queries are idempotent) is a good idea.
Unless no other solution (or workaround) can solve this, I'm -1 on adding an option.

@danielstjules commented Nov 22, 2017

@methane

Why don't you shorten MaxConnLifetime?

Unless we're reconnecting per query, no value passed to SetConnMaxLifetime will guarantee a good connection.

Why don't you retry by yourself?

Because this was a major breaking change. database/sql previously handled this behavior with alwaysNewConn and maxBadConnRetries. Wrapping database/sql with retry logic for Begin, BeginTx, Exec, ExecContext, Prepare, PrepareContext, Query, QueryContext, in our code, and more importantly for all dependencies that used to assume retry behavior, is a lot of work.
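As a rough sketch (my own illustration, not code from this thread) of the per-call wrapping being described: every database/sql entry point the application or its dependencies use would need something like the following. It assumes the context, database/sql and github.com/go-sql-driver/mysql imports, and relies on database/sql passing the driver's exported ErrInvalidConn through unchanged:

// execWithRetry retries ExecContext when the driver reports an invalid
// connection. Only safe for idempotent statements; shown for illustration.
func execWithRetry(ctx context.Context, db *sql.DB, query string, args ...interface{}) (sql.Result, error) {
    const maxRetries = 2
    var (
        res sql.Result
        err error
    )
    for attempt := 0; attempt <= maxRetries; attempt++ {
        res, err = db.ExecContext(ctx, query, args...)
        if err != mysql.ErrInvalidConn {
            break
        }
    }
    return res, err
}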

I don't think adding complexity to this lib only for special use cases (e.g. all queries are idempotent) is a good idea.

I don't think having connections to a "read only" DB is a special use case. Primary-replica topologies are very common. All queries targeting replicas are read-only and idempotent.

@methane (Member) commented Nov 22, 2017

Unless we're reconnecting per query, no value passed to SetConnMaxLifetime will guarantee a good connection.

I don't know your specific case. But most people who see this error log have a wrong setting:
the server (or network) timeout is longer than the client timeout. That's a broken setting.

Because this was a major breaking change.

If that is the only reason, you can use the old version.

I don't think having connections to a "read only" DB is a special use case. Primary-replica topologies are very common. All queries targeting replicas are read-only and idempotent.

If it's not so special, it should be fixed in the database/sql design.

@danielstjules commented:

I don't know your specific case. But most people who see this error log have a wrong setting:
the server (or network) timeout is longer than the client timeout. That's a broken setting.

I'm trying to set up graceful automatic failover of the primary DB, and recovery from reader/replica failures, without seeing a bunch of connection.go:372: invalid connection errors when each connection in the pool subsequently fails after #302. It's not a common occurrence, but it's an important enough failure case.

If it's not so special, it should be fixed in the database/sql design.

That's fair. :) Appreciate the reply!

@methane (Member) commented Nov 29, 2017

but it's an important enough failure case.

Is seeing some logs really so important?
I don't think showing some error logs is worse than silently breaking data.

@methane changed the title from "Server timeouts broken since #302" to "Server timeouts fixed since #302, then people notice their bad configuration." on Oct 24, 2018
@methane (Member) commented Oct 24, 2018

I'm closing this issue because the discussion is hard to follow. I don't want to continue the discussion here.

My conclusion is:

  • A connection being closed by the server / router / middleware is always dangerous. Users must avoid it by using DB.SetConnMaxLifetime().
  • On the other hand, when it does happen, it can be handled more safely.

I'll create new issues for these.

@methane closed this as completed on Oct 24, 2018
@vmg changed the title from "Server timeouts fixed since #302, then people notice their bad configuration." back to "Server timeouts broken since #302" on Mar 6, 2019
vmg added a commit to vmg/mysql that referenced this issue Mar 8, 2019
This commit contains a potential fix to the issue reported in go-sql-driver#657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such an active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface, which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, skipping
the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.
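To make the description above concrete, here is a sketch of what such a check could look like inside the driver package. This is my own illustration of the approach the commit message describes, not the merged code; it is Unix-only (Linux/macOS here) and needs Go 1.9+ for syscall.Conn:

// +build linux darwin

package mysql

import (
    "errors"
    "io"
    "net"
    "syscall"
)

var errUnexpectedRead = errors.New("unexpected read from socket")

// connCheck performs a non-blocking 1-byte read through syscall.RawConn.
// EAGAIN/EWOULDBLOCK means the socket is idle and healthy; a 0-byte read with
// no error means the peer already sent a FIN; receiving actual payload here is
// unexpected, since MySQL should not speak before we send a command.
func connCheck(conn net.Conn) error {
    sysConn, ok := conn.(syscall.Conn)
    if !ok {
        return nil // not a raw socket we can peek at (e.g. a wrapped TLS conn)
    }
    rawConn, err := sysConn.SyscallConn()
    if err != nil {
        return err
    }

    var checkErr error
    err = rawConn.Read(func(fd uintptr) bool {
        var buf [1]byte
        n, rerr := syscall.Read(int(fd), buf[:])
        switch {
        case rerr == syscall.EAGAIN || rerr == syscall.EWOULDBLOCK:
            checkErr = nil // nothing to read: the connection looks alive
        case n == 0 && rerr == nil:
            checkErr = io.EOF // peer closed its side of the connection
        case n > 0:
            checkErr = errUnexpectedRead
        default:
            checkErr = rerr
        }
        return true // never block waiting for readability; one peek is enough
    })
    if err != nil {
        return err
    }
    return checkErr
}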
vmg added a commit to vmg/mysql that referenced this issue Mar 12, 2019
This commit contains a potential fix to the issue reported in go-sql-driver#657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, and
skipping the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.
vmg added a commit to vmg/mysql that referenced this issue Mar 12, 2019
This commit contains a potential fix to the issue reported in go-sql-driver#657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, and
skipping the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.
vmg added a commit to vmg/mysql that referenced this issue Mar 12, 2019
This commit contains a potential fix to the issue reported in go-sql-driver#657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, and
skipping the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.
vmg added a commit to vmg/mysql that referenced this issue Mar 12, 2019
This commit contains a potential fix to the issue reported in go-sql-driver#657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, and
skipping the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.
vmg added a commit to vmg/mysql that referenced this issue Mar 12, 2019
This commit contains a potential fix to the issue reported in go-sql-driver#657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, and
skipping the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.
vmg added a commit to vmg/mysql that referenced this issue Mar 12, 2019
This commit contains a potential fix to the issue reported in go-sql-driver#657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, and
skipping the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.
vmg added a commit to vmg/mysql that referenced this issue Mar 20, 2019
vmg added a commit to vmg/mysql that referenced this issue Mar 20, 2019
vmg added a commit to vmg/mysql that referenced this issue Mar 20, 2019
vmg added a commit to vmg/mysql that referenced this issue Mar 20, 2019
methane pushed a commit that referenced this issue Mar 29, 2019