[10.x] Detect MySQL read-only mode error as a lost connection #48517
Conversation
Force-pushed from a88ac20 to 326dc6f
So... does this fix the issue? Have you tested it in a real scenario?
I ran a test scenario, though not against an actual master failover: create a job that sets the database to read-only. With the PR change, the worker dies. Without it, the queue worker keeps running but is never able to advance.
There is some risk here that this will result in thrashing connections to people's databases if the writer is in read-only mode or we are connected to a reader, and reconnecting doesn't change that.
Any idea what might be a better solution? We could create a new trait. That said, from a quick scan of the errors currently checked within the trait, 'running with the --read-only option so it cannot execute this statement' seems comparable to existing entries such as 'Reason: Server is in script upgrade mode. Only administrator can connect at this time.'
Just off the top of my head, what about: on the first error encountered we reconnect, and if we get the same read-only error, we fall back to the current behavior of throwing the exception and failing?
@peterlupu I think that's a good approach, but I feel like it wouldn't make sense to apply it to just this one exception. Anything that is considered recoverable should be tried N times before finally bubbling the exception up to the client code. Would love to get some feedback on which cases those are.
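A minimal sketch of the reconnect-once idea from the two comments above, assuming hypothetical helpers (`runWithSingleReconnect`, `isReadOnlyError`); the actual change would live inside the framework's connection handling rather than user code:

```php
<?php

// Hypothetical sketch of "reconnect once, then fall back to throwing".
// $connection->reconnect() mirrors Laravel's Connection::reconnect(); the
// $callback and isReadOnlyError() helper are illustrative stand-ins only.
function runWithSingleReconnect($connection, Closure $callback)
{
    try {
        return $callback($connection);
    } catch (PDOException $e) {
        if (! isReadOnlyError($e)) {
            throw $e;
        }

        // First read-only failure: reconnect and retry exactly once.
        // If the retry hits the same error, it bubbles up as before.
        $connection->reconnect();

        return $callback($connection);
    }
}

// Hypothetical helper matching the MySQL read-only error message.
function isReadOnlyError(PDOException $e): bool
{
    return str_contains(
        $e->getMessage(),
        'running with the --read-only option so it cannot execute this statement'
    );
}
```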
Reading the linked issue, it sounds to me like this is a Worker problem and not a connection problem. A Queue Worker is an infinite-loop process, and as such it makes itself resilient with an eager try/catch that prevents it from dying. But some situations are a good reason to let the infinite loop die. For example, suppose your Queue Worker is writing logs to local storage: if your disk reaches 100%, it's pointless to keep trying to work anything, as everything will always be in a failing state. This particular issue is better fixed by making the Worker's infinite loop able to detect that the connection was read-only. The worker can then decide whether it wants to try reconnecting and/or simply die and let the outside orchestration (ECS, Kubernetes, Supervisord, etc.) spin up a fresh worker.
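As a rough illustration of that worker-level approach (not the actual Worker implementation; `processNextJob()` and `isReadOnlyError()` are hypothetical placeholders):

```php
<?php

// Hedged sketch only: the real Illuminate\Queue\Worker loop is more involved.
// processNextJob() and isReadOnlyError() are hypothetical placeholders.
while (true) {
    try {
        processNextJob(); // stand-in for popping and running the next job
    } catch (PDOException $e) {
        if (isReadOnlyError($e)) {
            // Exit non-zero so the outside orchestration (ECS, Kubernetes,
            // Supervisord, ...) can spin up a fresh worker with a writable
            // connection.
            exit(1);
        }

        report($e); // Laravel helper: log the exception and keep looping
    }
}
```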
@cosmastech are you still working on this?
@driesvints to be honest, from the feedback I got, I didn't know whether there was a particular change that needed to be made. I would rather have it opened back up for review by @taylorotwell. He seemed concerned in his comment that I had not tested it at all (which I had), but some other concerns were brought up as well.
This shows zero files changed?
To resolve #48486. The case described in the issue uses the database queue connection with AWS Aurora failing over during the queue worker run. If the database is in read-only mode, the queue worker cannot update tables as needed, and the worker should be killed.

I admit that I do not know what consequences there might be for including this in the DetectsLostConnections trait, as it is employed by several other classes... It seems to me that if the PDO is throwing an exception and that exception is reporting a read-only error, it's probably safe to call it a lost connection, but I'll admit I'm looking at this through a pretty myopic lens.
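The essence of the proposed change, sketched as a standalone function rather than the real `Illuminate\Database\DetectsLostConnections` trait (the message list here is an illustrative subset, not the full list the trait checks):

```php
<?php

// Standalone sketch of the string-matching approach: an exception counts as a
// "lost connection" when its message contains a known phrase. The last entry
// is the MySQL read-only error this PR proposes to treat the same way.
function causedByLostConnection(Throwable $e): bool
{
    $lostConnectionMessages = [
        'server has gone away',
        'Lost connection',
        'running with the --read-only option so it cannot execute this statement',
    ];

    foreach ($lostConnectionMessages as $needle) {
        if (str_contains($e->getMessage(), $needle)) {
            return true;
        }
    }

    return false;
}
```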