Pool connection doesn't reconnect well. #821
I'm not really sure what you are asking. I also don't know anything about how Amazon's services work, so all that information about Amazon-specific terms is over my head. Can you give me a more detailed description of what is not working, exactly? Also, are you getting any error from the pool?
Oh, the rise in the connection count graph at that point was not automatic. The connection count didn't increase on its own; it only increased after I restarted the node process. Amazon EC2 (just a virtual machine) connects to RDS, which is the DB server. Two RDS instances run: one is the primary (master) and the other is the slave, and they swap roles when a fail-over happens. This RDS fail-over works well with another program written in Python, which reconnects very smoothly when RDS reboots or some problem occurs. Sorry that I couldn't get any info from the pool; I didn't log much around node-mysql. I used cluster to run many node.js processes, and each process has 10 connections max. Maybe it's a DNS resolution problem, with the previous pool connections trying to connect to the same IP again and again. I'm not sure, just guessing, because RDS works differently and needs its DNS info refreshed. The curious thing is that it partly worked: I'm not sure whether it reconnected or some idle connection was still established, but some data was sent correctly. Any idea?
I don't really have any. The host -> DNS resolution just uses the internal Node.js core mechanisms, so I can't comment on how that is working. Also, if a connection in this pool gets lost, a new connection is only reestablished the next time you try to make a query, so if you are not making queries, the connections won't come back automatically. Also, what version of this module are you using? Can you try the newest 2.3.0 if you are not using it and see if anything changes (there are various improvements/fixes that may be impacting you)?
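Since the pool only replaces a dead connection the next time a query is made, one workaround (my own sketch, not from the thread) is a periodic lightweight query that forces that reconnection to happen promptly. The function name and interval are hypothetical; `pool` is assumed to expose `query(sql, cb)` like node-mysql's Pool.

```javascript
// Keep-alive sketch: the pool only swaps out a dead connection when a
// query is issued, so a cheap periodic query makes reconnection prompt.
function startKeepAlive(pool, intervalMs) {
  function ping() {
    pool.query('SELECT 1', function (err) {
      if (err) {
        // A failed ping means the pooled connection was dead; the pool
        // drops it, and the next query opens a fresh connection.
        console.error('keep-alive failed:', err.code);
      }
    });
  }
  ping(); // fire one immediately
  var timer = setInterval(ping, intervalMs);
  if (timer.unref) timer.unref(); // don't keep the process alive just to ping
  return timer;
}
```

With 10~50 inserts per second (as reported later in the thread) this is mostly useful during quiet periods, when no traffic would otherwise trigger the reconnect.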
Queries are issued frequently, like 10~50 per second. Maybe I had to wait longer to see whether things were going well, but I couldn't wait long because it is a real production service. I'm using 2.2.0.
OK, 2.2.0 is the newest version I really wanted you to try, and you are already using it.
I don't have access to any Amazon services to try to reproduce the issue and determine what is happening. It would be really helpful if you could figure out what the issue is so I would know how to make a patch for it; otherwise I'm just in the dark about what you are experiencing.
/cc @sidorares
What information do you need? I create the pool with pool = mysql.createPool(). Sorry, I'm new to node.js and not sure how to examine it.
It's not very clear to me either what problem you are solving. The number of connections in the AWS console is expected to drop after a restart - the pool opens new connections only if there is no idle connection available when you request one. Are you actually getting errors on the mysql client side because it's using the old IP after failover? If not, could you add some connection-identifying SQL and log the results (server id / thread id)?
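A sketch of that connection-identifying SQL (the wrapper and names are my own): `@@hostname` and `CONNECTION_ID()` reveal which server and thread a pooled connection is actually talking to, which shows whether queries moved to the new primary after failover. `conn` is assumed to expose `query(sql, cb)` like a node-mysql connection or pool.

```javascript
// Log which MySQL server and thread this connection is bound to.
function logServerIdentity(conn, cb) {
  conn.query('SELECT @@hostname AS host, CONNECTION_ID() AS thread_id',
    function (err, rows) {
      if (err) return cb(err);
      console.log('connected to', rows[0].host, 'thread', rows[0].thread_id);
      cb(null, rows[0]);
    });
}
```

Logging this before and after a failover makes it obvious whether the pool is still pinned to the old instance.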
I will test more and write it up soon. Thanks all.
I analyzed it in depth today. When Amazon does a Multi-AZ fail-over, the two DB instances swap positions by redirecting the endpoint URL. I couldn't get any error when performing the fail-over with node-mysql; it acts as if it was never disconnected at all. The pool connection works perfectly when I just restart the DB instance without Multi-AZ fail-over. One more thing: I connected HeidiSQL to the master DB from my client PC, and it disconnected immediately when I failed over the DB instance. So HeidiSQL detects the disconnection at the right moment and tries to reconnect to the endpoint URL. I think node-mysql doesn't detect the disconnection well, or Amazon doesn't disconnect the backup DB connections coming from inside their server farm. Since other Python-based mysql projects work well with Amazon MySQL Multi-AZ, it may be an issue specific to node-mysql. I'm using the latest node-mysql version - today's latest source.
Yea, it does sound like there must be some disconnection indicator if other libs are seeing the disconnect. Just as a curiosity, would it be possible to try the mysql2 module?
I tried mysql2 too, but no luck. When an error occurs, I try to reconnect with something like pool.query(querystring, ..., ..., - it worked as expected, but had a timing issue where createPool() was called constantly by other queries' failures. I'd like clean and robust code to reset the pool connection and reinitialize it. Any idea for a workaround or clean code?
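One hypothetical guard for that "createPool() was called constantly" timing problem (my own sketch, not from the thread): track the current pool and only rebuild it if the failing pool is still the current one, so a burst of query failures on the same dead pool triggers a single rebuild. `createPool` is a caller-supplied factory, e.g. `function () { return mysql.createPool(config); }`.

```javascript
// Pool holder: concurrent failures on one dead pool cause one rebuild.
function makePoolHolder(createPool) {
  var pool = createPool();
  return {
    get: function () { return pool; },
    reset: function (stalePool) {
      if (stalePool !== pool) return pool; // someone else already rebuilt it
      try { stalePool.end(function () {}); } catch (e) { /* already dead */ }
      pool = createPool();
      return pool;
    }
  };
}
```

Usage would be along the lines of: capture `var p = holder.get();`, run `p.query(...)`, and on a fatal error call `holder.reset(p)` - passing the pool you actually queried on, so stale failures can't rebuild a fresh pool.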
Can you see what the behavior is if you set your connectTimeout?
When I set connectTimeout: 10
I am worried that some potential problem lies in it - like the round robin is working as expected, but the timeout never fired for the previous connection and something got stuck. Well, DB records are inserted successfully, though. I hope it's just a round-robin quirk or a documentation error. Anyway, Amazon RDS is the core problem: it doesn't disconnect clients when failover occurs. I submitted this issue to the Amazon RDS forum.
Oops, I was thinking more like 10 s, not 10 ms :) How about connectTimeout: 10000? Setting it to 10 ms will cause weird behavior for sure, because that is way too quick.
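For clarity, a config sketch (hypothetical endpoint and credentials): connectTimeout is in milliseconds, so ten seconds is 10000, while a value of 10 gives each new connection only 10 ms to come up, which fails almost every time.

```javascript
// Pool configuration sketch; all values here are placeholders.
var poolConfig = {
  host: 'mydb.cluster-xyz.us-east-1.rds.amazonaws.com', // hypothetical endpoint
  user: 'app',
  password: 'secret',
  connectTimeout: 10000 // 10 s to establish each new connection, in ms
};
// Then: var pool = mysql.createPool(poolConfig);
```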
Unfortunately, if for example the AZ's datacentre goes completely offline for some reason, Amazon 'fixing' this problem that way will not help; you would merely be treating the symptoms, and failover would still not happen. It is perfectly feasible in many failure conditions that the connection just dies without ever sending you a TCP RST, and all users are susceptible to this problem; the behaviour of RDS has simply drawn attention to it. I haven't had time to look into this just yet, but I can confirm it happening on our RDS cluster. The fix must be applied here (or upstream, if that is where the problem actually is). I'll check back with more information once I've had a chance to look at it.
Yes, that is definitely an issue. Without a timeout, a connection that dies without ever sending a TCP RST will just hang.
Yeah, that should hopefully do it (assuming, obviously, you can dial in the timeout and not prematurely close heavily loaded servers ;)). I'm hoping to allocate some time next week to look at this issue properly and will let you know what tcpdump sees and what's going on inside the code as it happens. Happy to test anything you need, but my time is a bit limited at the moment, so I can't be sure when it will be done.
If this means "is the timeout configurable"... yes, yes, it will definitely be :)
Awesome!
Necessity is the mother of pull requests ;)
Hi again - I finally had a chance to briefly look at the problem today in more detail. It's very much to do with the failure to receive that final RST, so this is a very critical bug. When we simulate the AWS failover, Amazon abruptly kills the server instance and the connection dies. A simple setInterval query selecting CURRENT_TIMESTAMP then just hangs, with no error ever surfacing.
An easy way to reproduce on a Linux box is to block traffic from port 3306 (e.g. with an iptables DROP rule), so that the server's packets - including any final RST - never arrive.
@mseddon thanks for verifying that the missing RST is the source :) I plan to make a new release tonight with changes that should solve the issue.
Excellent, thanks! I'll have a play with that over the weekend and let you know how it works out.
@mseddon sorry, I did not end up having much time yesterday, but I got most of it done. Will be finishing it up today :)
@dougwilson Any updates on this issue?
I have been iterating through different takes in a server farm. They weren't quite working, so I'm just trying to get something that will actually work :)
Ok, no problem! Thanks :)
If you like, please try out the patch with
Printing the result of doing SELECT CURRENT_TIMESTAMP every 2 seconds and then triggering a failover still doesn't cause any reconnect after ten seconds, or indeed five minutes.
Same problem as before when the RST packet is not received.
@miningpoolhub does this patch work for you?
@mseddon can you paste me the code you are using to test? It is important to note that the patch only affects pool.getConnection.
Ah, that may be it - one moment while I try again using that. Cheers.
Hi - switching the test to use pool.getConnection seems to work better. I'll have a play, but this might fix it after all; sorry about that. Thanks :)
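For reference, a sketch of that probe rewritten around pool.getConnection (which the patch covers) instead of pool.query. The function names are my own; `pool` is assumed to behave like node-mysql's Pool, where `getConnection(cb)` yields a connection with `query(sql, cb)` and `release()`.

```javascript
// Run one SELECT CURRENT_TIMESTAMP through an explicitly acquired connection.
function probeOnce(pool, cb) {
  pool.getConnection(function (err, conn) {
    if (err) return cb(err);
    conn.query('SELECT CURRENT_TIMESTAMP AS now', function (err, rows) {
      conn.release(); // return the connection to the pool either way
      if (err) return cb(err);
      cb(null, rows[0].now);
    });
  });
}

// Repeat every 2 seconds, matching the failover test described above.
function startProbe(pool) {
  return setInterval(function () {
    probeOnce(pool, function (err, now) {
      if (err) console.error('probe failed:', err.code);
      else console.log('alive at', now);
    });
  }, 2000);
}
```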
@mseddon no problem :) I know there is still work to be done, but from here it goes deep into Node.js internals. When you send data over TCP, you get ACKs for each packet; we should be able to time out when the packets we send stop getting ACKs back, but that is not straightforward in Node.js since those details have been abstracted away from us.
So, one observation (though probably not surprising): if I set connectionLimit to 1 to simulate a full connection pool and fail over during a query, reconnects still don't happen. I imagine this is because we're hung waiting for a free connection, and your timeout check only occurs once we have a free connection?
Correct. The query is hung because of the severed connection. You can always run the query with a timeout option.
Hmm, does the timeout start when conn.query() is called, or when the request is finally submitted to MySQL? It's not pretty, but it could work for my case.
The timeout is actually based on sequence packet activity, so it starts when the query packet is sent to the MySQL server, rather than at the time you called conn.query().
Ok, thanks, that should be good enough!
Cool :) I'm still going to be working on adding more protections, though. It may be interesting to know that every command now accepts a timeout option, so you could ping with a timeout.
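A sketch of "ping with a timeout": node-mysql's per-command options object carries the timeout, and a timed-out command surfaces as a PROTOCOL_SEQUENCE_TIMEOUT error. The wrapper, the destroy-on-timeout policy, and the values here are my own assumptions, not the library's built-in behavior.

```javascript
// Ping with a deadline; treat a sequence timeout as a dead connection.
function pingWithTimeout(conn, ms, cb) {
  conn.ping({ timeout: ms }, function (err) {
    if (err && err.code === 'PROTOCOL_SEQUENCE_TIMEOUT') {
      conn.destroy(); // unresponsive: drop the socket outright
      return cb(new Error('connection unresponsive'));
    }
    cb(err || null);
  });
}
```

The same `{ timeout: ms }` shape works on queries, e.g. `conn.query({ sql: '...', timeout: 60000 }, cb)`, which matches the per-command timeout discussed above.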
closes mysqljs#821 closes mysqljs#854
I use an Amazon cloud server, with RDS Multi-AZ.
I'm using a pooled connection, but it doesn't reconnect as expected.
I restarted the RDS machine, and it failed over to the other Multi-AZ instance.
But at that point, the connection was lost and never recovered.
I'm just using pool.query(), which doesn't leak connection releases. It should work as expected, shouldn't it?