-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Network timeout leads to full driver crash when using connection pooling: SMux Provider Physical connection is not usable #762
Comments
Could be related to bug #678 |
Hi @jeff1985 thank you for providing all the necessary details to help us reproduce. We will investigate and get back to you on this. Concerning Issue #678, it is about connection resiliency, which is a feature only available in Windows as of today. |
Hi @yitam, any update on that? Were you able to reproduce the bug? |
@jeff1985 You are correct that this issue is related to #678. Basically, connection pooling on Linux is not recommended at this point. My understanding is the unixODBC driver manager does not have a facility to validate connections before handing them out from the pool. So when connections are broken due to network issues, they remain broken in the pool. (Not good.) Additionally, connection resiliency in the MS ODBC driver has not been implemented in the Linux version of the driver, which mostly alleviates these problems by keeping minor network issues from breaking connections in the first place. Connection resiliency on Linux is planned for the next release of the ODBC driver, which should be releasing around the July time frame. |
@David-Engel You stated that connection pooling is not recommended at this point. Is that documented anywhere? I've seen that the warning "Connection pooling on Linux doesn't work properly when using MSODBC17" has been removed with the 5.2.0 release which made me think that it is possible to use now. I'm also in contact with MS Azure support regarding this issue and they did clearly recommend to use connection pooling in our Linux based configuration! I see you assume that the issue will be fixed by the ODBC driver upgrade. Could you please take a look on it before the driver update as otherwise we will loose time? |
@jeff1985 I've been looking into the details this morning (internal ping from Azure support). I've been looking at your repro and this feels similar to previous issues encountered where the problem was with the network on the system. Following this post, I'm wondering if there are similar settings in Ubuntu to alleviate the issue: |
@jeff1985 Can you try turning off various *-offload features on your network adapter? Use |
@David-Engel Running ethtool in a docker container gives me an error:
So this is absolutely out of my control! Also the error message as such is not the main problem here. As I described in the bug the issue is that the driver does not correctly recover from this error (so connection is kept in the pool) as long we have continued requests to it. |
@jeff1985 I understand what you are saying. In looking under the error, though, the driver is not treating the problem it is encountering as one that needs to be corrected by declaring the connection dead. I initially thought the same thing - the driver should be treating it that way and flagging the connection as dead. But it seems like the driver thinks the underlying network is in a state where this problem would be present for any connection, new or existing, and it is waiting for the network stack to clear up. So I'm not sure if it is behaving incorrectly here. That's what I'm trying to figure out. |
So this might be the important point here. The networking layer wont clear up the connection while its still in use. And the driver is waiting for the connection to be cleared up, while still passing query requests to it. This probably causes new packets to be sent and prevents the connection from beeing cleared up. So the driver is waiting for network, network is waiting for driver. Deadlock! |
Is there any way for you to test the network connectivity of the server hosting your repro? For example, can you set up a shell script that runs curl every second to ping an external service that you can log to verify that network connectivity out of the server is still usable and active during the period the driver is also experiencing issues? Or even just logging a constant ping locally and reviewing that log after. |
@David-Engel |
@jeff1985 Do you have a DockerFile you can share to replicate your Ubuntu docker container? I want to see if using Docker makes a difference in being able to repro this. Are you using Docker on Windows or Docker on Linux? |
@David-Engel For docker container feel free to use https://github.com/Microsoft/mssql-docker/blob/master/oss-drivers/msphpsql/Dockerfile as base. I'm using ubuntu 16.04 as baseimage and there is nothing specific to the setup (only activated connection pooling and installed unixodbc 2.3.6 as specified in the bug description). I also have seen that a new connection driver has been released (libmsodbcsql-17.2.so.0.1). I was hoping that it might have solved the problem and have retested the bug with the new driver. The bug is still present, but some new error message appeared:
additionally I was still seeing the same error as with the older driver version:
During this errors the "normal" internet connectivity from the same docker container is not affected.
Can you please take a closer look on this issue as soon as possible? |
@jeff1985 Thanks. After I posted that, I built my own Docker image based on Ubuntu 16.04 + PHP 7.1 and unixODBC 2.3.6 and your table/data/php script. Running ab with varying packet loss rates (using tc qdisc netem), I ran it many times in Azure VMs in the West US 2 and West Europe regions. I did manage to get the "SMux Provider: Physical connection is not usable [xFFFFFFFF]." error intermittently (in about 1 in 10 of my runs) but it only ever appeared for a short time and connections recovered on their own with no intervention required. I also never got the error with an early 17.2 build (but that could have been intermittent coincidence). I passed my repro setup on to the dev teams for reference. Do you have the ODBC trace file available for your latest run with 17.2? The ODBC team would like to see what 17.2 is doing differently with the error now when it gets into the persistently errored state. Thanks, |
@jeff1985 The ODBC team is also requesting a network trace (wireshark or equivalent) if possible. |
@jeff1985 One other thing: Try disabling MARS and running your repro again. The ODBC team thinks there may be something in MARS that could be contributing to the error, even if you aren't using MARS. Regards, |
@David-Engel I would like to describe my test setup in more detail. I hope it helps you to reproduce the error. I'm running an ubuntu VM and a docker container inside it with apache, php and mssql drivers inside it. This is running locally on my laptop. To reproduce the problems at first I was trying to find a slow mobile connection, but then it turned out that there is a more simple way: if you turn off the internet connection for about 30 seconds while running ab, it breaks the driver in about 1 of 5 attempts for me. I tried running with MARS disabled by adding the following setting to the connection string:
This time I had a different error message, but still the same behavior: errors continue as long as ab is running, need a "pause" without requests to recover
|
@jeff1985 Yes, ab continued to run in my tests and the connection returned to normal during the ab test. The team spent more time today reproducing the issue by disconnecting the network and it appears they are seeing some similar errors that they may be able to debug. They are trying to reduce the issue down to the bare minimum to try to see through the noise. They will keep looking into it. Regards, |
Hi @David-Engel I've found two ways to reproduce the error on Azure infrastructure. VM scenarioThis repro is rather complicated and more out of scope here. Create:
When setup is ready, connect first client via ssh and run
open a linux terminal on any outside location (i.e. another Linux VM in Azure) to run load test
open second client via ssh and run
In my case the connections have been established to port 11017 and not 1433. I'm not sure if that will be always
Even after you stop the tcpkill command, the whole pool will remain in the broken state and will not recover until the traffic stops. Webapp scenarioCreate:
connect first client via webssh and run
open a linux terminal on any outside location (i.e. another Linux VM in Azure) to run load test
open azure portal and navigate to your database Switch back to the logfile. The change in DTU will create a second database in parallel and after 2-3 mins will switch the route to the new database with changed DTU count. In this moment some of TCP connections might fail:
So some of the connections failed and will not recover as long as you have traffic flowing to your webapp. In my test it was running for about 30 mins. The error percentage even increased over time. Then I stopped the load test. Note on configWhen testing this repro make sure, that you're actually running the driver with connection pooling turned on. As specified here https://docs.microsoft.com/de-de/sql/connect/php/connection-pooling-microsoft-drivers-for-php-for-sql-server?view=sql-server-2017
If you forget to specify CPTimeout, Pooling will not be active. |
@jeff1985 It looks like we found the problem in the unixODBC driver manager. We were able to initially reproduce your scenario of constant errors using the "disable the internet" method and then again against an Azure SQL DB instance after changing the DTU like you mentioned. We compiled a private build of unixODBC with fixes and verified that it resolves those two scenarios. Connections recovered better and never got into a persistently errored state after the changes. We submitted a pull request for the changes to unixODBC here: lurcher/unixODBC#16 You may not want to wait for the next unixODBC release for testing so there are a couple options:
Regards, |
@jeff1985 may I suggest then on the after the step you posted in your first post here:
You do the following to get the changes mentioned:
And then continue like you did before. |
You are my hero, @David-Engel ! I tested your patch. It solves the errors in my local repro, as well as the both reported scenarios in Azure infrastructure! Well done! Is this patch safe to use in production environment or should I wait until official release of unixODBC 2.3.7? |
@jeff1985 Great! Glad it solved the issue. I can't take all the credit. @yitam helped with the repro and testing the fix and @v-chojas debugged and coded the actual fix. As far as waiting for unixODBC 2.3.7, I don't have any insight into that release schedule so I don't know if you want to wait that long. Regards, |
Hi @jeff1985 Since you verified that the patch fixes your issue in testing and we will support it in production, I'm going to close this issue. If you find it is not resolved, feel free to re-open it. Regards, |
Hello @David-Engel I would like to give you feedback after going into production with the fix. We've been testing the fix under non-production for weeks and had no issues. Also we've been running it for 1 week in a production environment with no issues for about 7 days. However after executing a Azure SQL database failover procedure to a different region (by failover group) a similar connection issue occured with the fix in place. Again some of the requests were failing with the message "The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable." This message seems to be connected to connection resiliency (based on my google search). After more testing I see that this is however only happening during a database failover procedure. The connection issues seem to be successfuly mitigated by the patch under normal conditions. So I'm able to use it in production! My thanks to your team: @yitam and @v-chojas |
Driver version
ODBC v17 (libmsodbcsql-17.1.so.0.1)
php_sqlsrv 5.2.0
unixODBC 2.3.6
unixODBC 2.3.6
unixODBC is compiled with the following commands as recommended in bug #756
SQL Server version
Azure SQL Database
Client operating system
Ubuntu 16.04
PHP version
php 7.1.16
on Apache prefork with 10 processes
Table schema
see https://gist.github.com/jeff1985/ad9316fdb3313bf1a9e0b0926254df83#file-msodbcsqlbug-db-structure-sql
Problem description
We have identified a problem within php_sqlsrv, which causes downtime of our php application when using connection pooling with the latest stable drivers in case of network errors.
We have seen that problem multiple times occuring while using a php application hosted on Azure Linux Webapps with php running in a docker container.
After timeout/connection errors the database driver comes into a failure state
where all following database requests will keep throwing errors as long the traffic keeps flowing.
I suspect that the connection is not released from the pool despite it is in faulty state.
How to reproduce
I'm able to reproduce the problem in a development environment by using an unstable mobile internet connection.
In my repro script I've selected one row from the database and logged the result into a file.
Additionaly I created a moderate server traffic by running
ab -n 100000 -c 10 http://xxx.local/web-frontends/public/test_incomplete_data.php
.My mobile internet connection has an expected error rate of about 5%.
I can see that the first requests are served correctly, but then after some initial connection errors occur, the whole server becomes unusable. All following requests are throwing an exception with the error message:
SQLSTATE[08S02]: [Microsoft][ODBC Driver 17 for SQL Server]SMux Provider: Physical connection is not usable [xFFFFFFFF]
During this time I have no problems accessing websites or querying the same database using the same internet connection from another client system.
Even after I stop incoming requests the server remains in this faulty condition for approximately 2-5 minutes and does not respond correctly even to single requests.
Only after having a pause it then recovers and responds to requests again.
In production environments you will never be having such scenario because users will be trying
to reload the failed website. This means: the server will stay in this faulty condition for hours and the only way to recover will be a server reboot.
The resulting error rate in my measurements is 98%! Also 98% of the requests have a response time
of less than 100ms, which clearly indicates that no new connection attempt has been done.
Expected behavior and actual behavior
The expected behaviour in case of a sporadic network error/timeout is:
The actual behaviour is:
Repro code
All relevant logs and a repro script are available here:
https://gist.github.com/ad9316fdb3313bf1a9e0b0926254df83
Note: I removed db credentials from logs and code.
The text was updated successfully, but these errors were encountered: