Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network timeout leads to full driver crash when using connection pooling: SMux Provider Physical connection is not usable #762

Closed
jeff1985 opened this issue Apr 27, 2018 · 26 comments

Comments

@jeff1985
Copy link

jeff1985 commented Apr 27, 2018

Driver version

ODBC v17 (libmsodbcsql-17.1.so.0.1)
php_sqlsrv 5.2.0
unixODBC 2.3.6

unixODBC 2.3.6

unixODBC is compiled with the following commands as recommended in bug #756

# clean existing
rm -f /usr/lib/x86_64-linux-gnu/libodbcinst.so*
rm -f /usr/lib/x86_64-linux-gnu/libodbc.so*

# try to upgrade unixodbc-dev
export CPPFLAGS="-DSIZEOF_LONG_INT=8"
apt-get -y install gcc-5 g++-5
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 60 --slave /usr/bin/g++ g++ /usr/bin/g++-5
cd ~
echo "Configuring the unixODBC 2.3.6 Driver Manager"
wget ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.6.tar.gz
tar -xzf unixODBC-2.3.6.tar.gz
cd unixODBC-2.3.6/
./configure --prefix=/usr --libdir=/usr/local/lib --sysconfdir=/etc --disable-gui --disable-drivers --enable-iconv --with-iconv-char-enc=UTF8 --with-iconv-ucode-enc=UTF16LE 1> odbc_con.log 2> make_err.log

echo "Building and Installing the unixODBC 2.3.6 Driver Manager"
make
make install
echo "####### build ready ###### "

SQL Server version

Azure SQL Database

Client operating system

Ubuntu 16.04

PHP version

php 7.1.16
on Apache prefork with 10 processes

Table schema

see https://gist.github.com/jeff1985/ad9316fdb3313bf1a9e0b0926254df83#file-msodbcsqlbug-db-structure-sql

Problem description

We have identified a problem within php_sqlsrv, which causes downtime of our php application when using connection pooling with the latest stable drivers in case of network errors.

We have seen that problem multiple times occuring while using a php application hosted on Azure Linux Webapps with php running in a docker container.

After timeout/connection errors the database driver comes into a failure state
where all following database requests will keep throwing errors as long the traffic keeps flowing.
I suspect that the connection is not released from the pool despite it is in faulty state.

How to reproduce

I'm able to reproduce the problem in a development environment by using an unstable mobile internet connection.

In my repro script I've selected one row from the database and logged the result into a file.
Additionaly I created a moderate server traffic by running
ab -n 100000 -c 10 http://xxx.local/web-frontends/public/test_incomplete_data.php.
My mobile internet connection has an expected error rate of about 5%.

I can see that the first requests are served correctly, but then after some initial connection errors occur, the whole server becomes unusable. All following requests are throwing an exception with the error message:
SQLSTATE[08S02]: [Microsoft][ODBC Driver 17 for SQL Server]SMux Provider: Physical connection is not usable [xFFFFFFFF]

During this time I have no problems accessing websites or querying the same database using the same internet connection from another client system.

Even after I stop incoming requests the server remains in this faulty condition for approximately 2-5 minutes and does not respond correctly even to single requests.

Only after having a pause it then recovers and responds to requests again.
In production environments you will never be having such scenario because users will be trying
to reload the failed website. This means: the server will stay in this faulty condition for hours and the only way to recover will be a server reboot.

The resulting error rate in my measurements is 98%! Also 98% of the requests have a response time
of less than 100ms, which clearly indicates that no new connection attempt has been done.

Expected behavior and actual behavior

The expected behaviour in case of a sporadic network error/timeout is:

  1. server fails to connect to a database
  2. a PDOException is raised on php side on each connection timeout
  3. after each timeout, the connection is removed from the connection pool
  4. on following requests, a new connection is opened, to allow an immediate reconnect to the database

The actual behaviour is:

  1. server fails to connect to a database
  2. a PDOException is raised on php side on each connection timeout
  3. all following requests are failing even after the internet connectivity is fully recovered

Repro code

All relevant logs and a repro script are available here:
https://gist.github.com/ad9316fdb3313bf1a9e0b0926254df83

Note: I removed db credentials from logs and code.

@jeff1985
Copy link
Author

Could be related to bug #678

@yitam
Copy link
Contributor

yitam commented Apr 27, 2018

Hi @jeff1985 thank you for providing all the necessary details to help us reproduce. We will investigate and get back to you on this.

Concerning Issue #678, it is about connection resiliency, which is a feature only available in Windows as of today.

@jeff1985
Copy link
Author

jeff1985 commented May 7, 2018

Hi @yitam,

any update on that? Were you able to reproduce the bug?

@David-Engel
Copy link
Contributor

@jeff1985 You are correct that this issue is related to #678. Basically, connection pooling on Linux is not recommended at this point. My understanding is the unixODBC driver manager does not have a facility to validate connections before handing them out from the pool. So when connections are broken due to network issues, they remain broken in the pool. (Not good.) Additionally, connection resiliency in the MS ODBC driver has not been implemented in the Linux version of the driver, which mostly alleviates these problems by keeping minor network issues from breaking connections in the first place. Connection resiliency on Linux is planned for the next release of the ODBC driver, which should be releasing around the July time frame.
Regards,
David

@jeff1985
Copy link
Author

jeff1985 commented May 9, 2018

@David-Engel
thanks for looking into the issue!

You stated that connection pooling is not recommended at this point. Is that documented anywhere? I've seen that the warning "Connection pooling on Linux doesn't work properly when using MSODBC17" has been removed with the 5.2.0 release which made me think that it is possible to use now. I'm also in contact with MS Azure support regarding this issue and they did clearly recommend to use connection pooling in our Linux based configuration!

I see you assume that the issue will be fixed by the ODBC driver upgrade.
Based on available documentation idle connection resiliency can recover the connection only in certain states i.e. only tries a certain amount of reconnects (which is configurable). So while idle connection resiliency might help to mitigate the problem, you will still have the possibility that the connection remains in the error state and causes the bug which I have reported. This is why I don't think the issue will be fully mitigated by just enabling the connection resiliency.

Could you please take a look on it before the driver update as otherwise we will loose time?

@David-Engel
Copy link
Contributor

@jeff1985 I've been looking into the details this morning (internal ping from Azure support). I've been looking at your repro and this feels similar to previous issues encountered where the problem was with the network on the system. Following this post, I'm wondering if there are similar settings in Ubuntu to alleviate the issue:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/95fe26d2-2e6c-4c6b-b02d-10794e92dc7a/sqlstate-08s02-smux-provider-physical-connection-is-not-usable?forum=sqldataaccess
(still investigating)
David

@David-Engel
Copy link
Contributor

@jeff1985 Can you try turning off various *-offload features on your network adapter? Use ethtool -k eth0 to view their current status and ethtool -K eth0 [feature] off to turn ones off. See if that helps the networking recover from errors properly in your repro.

@jeff1985
Copy link
Author

jeff1985 commented May 9, 2018

@David-Engel
Our system is running on Azure Webapp for Linux in a docker container where I have limited permissions (by design).

Running ethtool in a docker container gives me an error:

root@9f9ef49d603d:/# ethtool --features eth0 tx off
Cannot set device feature settings: Operation not permitted

So this is absolutely out of my control!

Also the error message as such is not the main problem here. As I described in the bug the issue is that the driver does not correctly recover from this error (so connection is kept in the pool) as long we have continued requests to it.

@David-Engel
Copy link
Contributor

@jeff1985 I understand what you are saying. In looking under the error, though, the driver is not treating the problem it is encountering as one that needs to be corrected by declaring the connection dead. I initially thought the same thing - the driver should be treating it that way and flagging the connection as dead. But it seems like the driver thinks the underlying network is in a state where this problem would be present for any connection, new or existing, and it is waiting for the network stack to clear up. So I'm not sure if it is behaving incorrectly here. That's what I'm trying to figure out.

@jeff1985
Copy link
Author

jeff1985 commented May 9, 2018

But it seems like the driver thinks the underlying network is in a state where this problem would be present for any connection, new or existing, and it is waiting for the network stack to clear up.

So this might be the important point here. The networking layer wont clear up the connection while its still in use. And the driver is waiting for the connection to be cleared up, while still passing query requests to it. This probably causes new packets to be sent and prevents the connection from beeing cleared up. So the driver is waiting for network, network is waiting for driver. Deadlock!

@David-Engel
Copy link
Contributor

David-Engel commented May 9, 2018

@jeff1985

During this time I have no problems accessing websites or querying the same database using the same internet connection from another client system.

Is there any way for you to test the network connectivity of the server hosting your repro? For example, can you set up a shell script that runs curl every second to ping an external service that you can log to verify that network connectivity out of the server is still usable and active during the period the driver is also experiencing issues? Or even just logging a constant ping locally and reviewing that log after.

@jeff1985
Copy link
Author

jeff1985 commented May 9, 2018

@David-Engel
During my tests I did verify the network connectity from inside the docker container. It was working. Also we did experience the same problems in production where we also were able to run curl requests and also have captured tcppings. So the network is fine!

@David-Engel
Copy link
Contributor

@jeff1985 Do you have a DockerFile you can share to replicate your Ubuntu docker container? I want to see if using Docker makes a difference in being able to repro this. Are you using Docker on Windows or Docker on Linux?

@jeff1985
Copy link
Author

@David-Engel
Unfortunately I'm not able to share the full source of my docker setup in public. I'm running the docker container inside a linux vm. We also have seen this errors happening in production in a docker container running inside a Azure Webapp on Linux.

For docker container feel free to use https://github.com/Microsoft/mssql-docker/blob/master/oss-drivers/msphpsql/Dockerfile as base. I'm using ubuntu 16.04 as baseimage and there is nothing specific to the setup (only activated connection pooling and installed unixodbc 2.3.6 as specified in the bug description).

I also have seen that a new connection driver has been released (libmsodbcsql-17.2.so.0.1). I was hoping that it might have solved the problem and have retested the bug with the new driver. The bug is still present, but some new error message appeared:

2018-07-11 17:30:30.764131 Exception: SQLSTATE[IMC06]: [Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection.

additionally I was still seeing the same error as with the older driver version:

2018-07-11 18:11:51.200333 Exception: SQLSTATE[08S02]: [Microsoft][ODBC Driver 17 for SQL Server]SMux Provider: Physical connection is not usable [xFFFFFFFF].

During this errors the "normal" internet connectivity from the same docker container is not affected.

root@21aaf4bc2b8c:/# while true; do curl -s -o /dev/null -w "%{http_code}\n" http://www.google.de; done;
200
200
200
200
200
200
200
200
200
200
200
200

Can you please take a closer look on this issue as soon as possible?

@David-Engel
Copy link
Contributor

David-Engel commented Jul 12, 2018

@jeff1985 Thanks. After I posted that, I built my own Docker image based on Ubuntu 16.04 + PHP 7.1 and unixODBC 2.3.6 and your table/data/php script. Running ab with varying packet loss rates (using tc qdisc netem), I ran it many times in Azure VMs in the West US 2 and West Europe regions. I did manage to get the "SMux Provider: Physical connection is not usable [xFFFFFFFF]." error intermittently (in about 1 in 10 of my runs) but it only ever appeared for a short time and connections recovered on their own with no intervention required. I also never got the error with an early 17.2 build (but that could have been intermittent coincidence). I passed my repro setup on to the dev teams for reference.

Do you have the ODBC trace file available for your latest run with 17.2? The ODBC team would like to see what 17.2 is doing differently with the error now when it gets into the persistently errored state.

Thanks,
David

@David-Engel
Copy link
Contributor

@jeff1985 The ODBC team is also requesting a network trace (wireshark or equivalent) if possible.

@David-Engel
Copy link
Contributor

@jeff1985 One other thing: Try disabling MARS and running your repro again. The ODBC team thinks there may be something in MARS that could be contributing to the error, even if you aren't using MARS.
How to disable MARS using PHP: https://docs.microsoft.com/en-us/sql/connect/php/how-to-disable-multiple-active-resultsets-mars?view=sql-server-2017

Regards,
David

@jeff1985
Copy link
Author

jeff1985 commented Jul 13, 2018

@David-Engel
Did you continue running ab after the "SMux" error appeared?

I would like to describe my test setup in more detail. I hope it helps you to reproduce the error. I'm running an ubuntu VM and a docker container inside it with apache, php and mssql drivers inside it. This is running locally on my laptop. To reproduce the problems at first I was trying to find a slow mobile connection, but then it turned out that there is a more simple way: if you turn off the internet connection for about 30 seconds while running ab, it breaks the driver in about 1 of 5 attempts for me.

I tried running with MARS disabled by adding the following setting to the connection string:

MultipleActiveResultSets=false;

This time I had a different error message, but still the same behavior: errors continue as long as ab is running, need a "pause" without requests to recover

2018-07-12 00:07:01.495155 Exception: SQLSTATE[08S01]: [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure
2018-07-12 00:07:01.498572 Exception: SQLSTATE[08S01]: [Microsoft][ODBC Driver 17 for SQL Server]Communication link failure

@David-Engel
Copy link
Contributor

David-Engel commented Jul 13, 2018

@jeff1985 Yes, ab continued to run in my tests and the connection returned to normal during the ab test. The team spent more time today reproducing the issue by disconnecting the network and it appears they are seeing some similar errors that they may be able to debug. They are trying to reduce the issue down to the bare minimum to try to see through the noise. They will keep looking into it.

Regards,
David

@jeff1985
Copy link
Author

Hi @David-Engel

I've found two ways to reproduce the error on Azure infrastructure.

VM scenario

This repro is rather complicated and more out of scope here.
I will describe only main points. Feel free to adopt on your side ;)

Create:

  • MSSQL database in West europe region
  • Azure VM with ubuntu 16.04
  • deploy docker container to the vm by installing docker

When setup is ready,

connect first client via ssh and run

#verify db driver version and pooling activated/deactivated
less /etc/odbcinst.ini

#follow logs
cd /var/log/apache2 && tail sqldata.log -f

open a linux terminal on any outside location (i.e. another Linux VM in Azure) to run load test

ab -n 100000 -c 10 https://xxx/mssql-bug-test/1.0/test_smux.php

open second client via ssh and run

# switch into docker container
docker ps
docker exec -it CONTAINERID /bin/bash
apt update && apt install dsniff --yes
tcpkill -i eth0 -9 port 11017
# hit Control+C after 30seconds
    
# ab should continue running

In my case the connections have been established to port 11017 and not 1433. I'm not sure if that will be always
the case, so please check connections on your host with netstat or kill connections by ip.
Then look back into the logfile. You'll see all of requests failing with message:

2018-07-17 12:35:57.467360 Exception: SQLSTATE[IMC06]: [Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection.

Even after you stop the tcpkill command, the whole pool will remain in the broken state and will not recover until the traffic stops.

Webapp scenario

Create:

  • mssql database in West europe region
  • webapp on linux in West europe region with a single instance (to be able to connect a single container via webssh)
  • deploy docker container to webapp

connect first client via webssh and run

#verify db driver version and pooling activated/deactivated
less /etc/odbcinst.ini

#follow logs
cd /var/log/apache2 && tail sqldata.log -f

open a linux terminal on any outside location (i.e. another Linux VM in Azure) to run load test

ab -n 100000 -c 10 https://xxx.azurewebsites.net/mssql-bug-test/1.0/test_smux.php 

open azure portal and navigate to your database
* click on "Configure"
* change "DTU" to any different value
* click on "apply"

Switch back to the logfile. The change in DTU will create a second database in parallel and after 2-3 mins will switch the route to the new database with changed DTU count. In this moment some of TCP connections might fail:

2018-07-17 13:14:41.397912 got result(1): id='306' totalLines='8273111' state='finished'
2018-07-17 13:14:41.401046 got result(1): id='306' totalLines='8273111' state='finished'
2018-07-17 13:14:41.583549 Exception: SQLSTATE[IMC06]: [Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver a
s unrecoverable. No attempt was made to restore the connection.
2018-07-17 13:14:41.761349 Exception: SQLSTATE[IMC06]: [Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver a
s unrecoverable. No attempt was made to restore the connection.
2018-07-17 13:14:41.753175 got result(1): id='306' totalLines='8273111' state='finished'

So some of the connections failed and will not recover as long as you have traffic flowing to your webapp. In my test it was running for about 30 mins. The error percentage even increased over time. Then I stopped the load test.

Note on config

When testing this repro make sure, that you're actually running the driver with connection pooling turned on. As specified here https://docs.microsoft.com/de-de/sql/connect/php/connection-pooling-microsoft-drivers-for-php-for-sql-server?view=sql-server-2017
this requires the following config.

[ODBC]
Pooling=Yes

[ODBC Driver XXX for SQL Server]
CPTimeout=<int value>    

If you forget to specify CPTimeout, Pooling will not be active.

@David-Engel
Copy link
Contributor

@jeff1985 It looks like we found the problem in the unixODBC driver manager. We were able to initially reproduce your scenario of constant errors using the "disable the internet" method and then again against an Azure SQL DB instance after changing the DTU like you mentioned. We compiled a private build of unixODBC with fixes and verified that it resolves those two scenarios. Connections recovered better and never got into a persistently errored state after the changes. We submitted a pull request for the changes to unixODBC here: lurcher/unixODBC#16

You may not want to wait for the next unixODBC release for testing so there are a couple options:

  1. You can get the source from the PR above and build it yourself. That's what we ran and would be equivalent to 2.3.7 prerelease + our changes.
  2. We can give you a tar.gz copy of the unixODBC-2.3.6 source that just has the changes we submitted in our PR.

Regards,
David

@jubalh
Copy link

jubalh commented Jul 18, 2018

@jeff1985 may I suggest then on the after the step you posted in your first post here:

cd unixODBC-2.3.6/

You do the following to get the changes mentioned:

wget https://patch-diff.githubusercontent.com/raw/lurcher/unixODBC/pull/16.patch
patch < 16.patch

And then continue like you did before.

@jeff1985
Copy link
Author

You are my hero, @David-Engel !

I tested your patch. It solves the errors in my local repro, as well as the both reported scenarios in Azure infrastructure! Well done!

Is this patch safe to use in production environment or should I wait until official release of unixODBC 2.3.7?

@David-Engel
Copy link
Contributor

@jeff1985 Great! Glad it solved the issue. I can't take all the credit. @yitam helped with the repro and testing the fix and @v-chojas debugged and coded the actual fix.

As far as waiting for unixODBC 2.3.7, I don't have any insight into that release schedule so I don't know if you want to wait that long.
We currently run 2.3.6 in our test suites for the ODBC driver. However the changes in our patch should be safe for production given that the previous code is basically broken for connection pooling with a Unicode driver and the scope of our changes is very narrow (you can review the changes yourself in the PR). We have no problem supporting your configuration with the patch we submitted.

Regards,
David

@David-Engel
Copy link
Contributor

Hi @jeff1985

Since you verified that the patch fixes your issue in testing and we will support it in production, I'm going to close this issue. If you find it is not resolved, feel free to re-open it.

Regards,
David

@jeff1985
Copy link
Author

Hello @David-Engel

I would like to give you feedback after going into production with the fix.

We've been testing the fix under non-production for weeks and had no issues. Also we've been running it for 1 week in a production environment with no issues for about 7 days. However after executing a Azure SQL database failover procedure to a different region (by failover group) a similar connection issue occured with the fix in place. Again some of the requests were failing with the message "The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable." This message seems to be connected to connection resiliency (based on my google search).
I also had similar behavior in test environment with a load test executed with a low load profile and executing INSERT querys. I suggest that you add a master slave failover scenario to your test suite in one of the upcoming releases!

After more testing I see that this is however only happening during a database failover procedure. The connection issues seem to be successfuly mitigated by the patch under normal conditions. So I'm able to use it in production! My thanks to your team: @yitam and @v-chojas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants