Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random slow Kerberos negotiation #2204

Closed
aloene opened this issue Nov 2, 2023 · 9 comments
Closed

Random slow Kerberos negotiation #2204

aloene opened this issue Nov 2, 2023 · 9 comments

Comments

@aloene
Copy link

aloene commented Nov 2, 2023

My ASP.NET 6.0 API uses Microsoft.Data.SqlClient 5.1.1 library to access an SQL Server 2019 DB (default instance) on a remote server (AlwaysOn).
This API is hosted on OpenShift (ubi8/dotnet-60-runtime image, rootless).
The API authenticates using Kerberos (with a keytab + kerberos config file) against SQL Server and... it works well.
Here is the connection string: Data Source=listenerhost.my.domain.priv;Initial Catalog=MY.DB.NAME;integrated security=True;Connection Timeout=60;MultipleActiveResultSets=True;TrustServerCertificate=True;Command Timeout=900

BUT, sometimes (quite regularly), I face a latency (about 400-500ms) on SQL Server queries. I have profiled the API and the culprit is the authentication step here (400-500ms) :
Microsoft.Data.SqlClient!System.Net.Security.NegotiateStreamPal.GssInitSecurityContext() so it seems it relates to this call : https://github.com/dotnet/SqlClient/blob/6456d71414347238d4bd519dbfdf1d936ca4cea6/src/Microsoft.Data.SqlClient/netcore/src/Common/src/System/Net/Security/NegotiateStreamPal.Unix.cs#L50C63-L50C63
I initially thought it may be linked to the Kerberos ticket renewal but it happens (at least when I checked) without any renewal involved (checked with klist).
I have also checked SPNs and they seems fine, SQL service account registered all SPNs : each node FQDN, Listener FQDN, with and without default port 1433.

I am not sure about the root cause... Could it be any network/DNS issues maybe ?
Or could it be linked to a bug or Kerberos bad configuration ? (See my config below)
Would you have an idea about it and the way I could troubleshoot and find the real cause ?

/etc/krb5.conf

includedir /etc/krb5.conf.d/

[logging]
    default = FILE:/var/log/krb5libs.log
    kdc = FILE:/var/log/krb5kdc.log
    admin_server = FILE:/var/log/kadmind.log

[libdefaults]
    dns_lookup_realm = false
    ticket_lifetime = 24h
    renew_lifetime = 7d
    forwardable = true
    rdns = false
    pkinit_anchors = FILE:/etc/pki/tls/certs/ca-bundle.crt
    spake_preauth_groups = edwards25519
    default_ccache_name = FILE:/tmp/krb5.ccache
    default_realm = MY.DOMAIN.PRIV
    udp_preference_limit = 0

[realms]
MY.DOMAIN.PRIV = {
 }

[domain_realm]
 my.domain.priv = MY.DOMAIN.PRIV
 .my.domain.priv = MY.DOMAIN.PRIV

/etc/krb5.conf.d/xxx content

[libdefaults]
permitted_enctypes = aes256-cts-hmac-sha1-96 aes256-cts-hmac-sha384-192 camellia256-cts-cmac aes128-cts-hmac-sha1-96 aes128-cts-hmac-sha256-128 camellia128-cts-cmac

Thank you.

@aloene
Copy link
Author

aloene commented Nov 2, 2023

It seems this latency occurs when creating new connection. (instead of reusing one from the pool)
I have added Kerberos trace and I cannot see anything when it happens... The initial service ticket was gathered within 50ms, too.

@aloene
Copy link
Author

aloene commented Nov 2, 2023

Also, it may be unrelated, but I also (sometimes) experience slowness when creating transactions (before sending actual queries).
For example, Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.SqlConnection.BeginDbTransaction() can sometimes take neatly 1 second.
Could it be related to network issues (or resource under-allocation maybe, although I cannot see any CPU usage...) ?

> System.Private.CoreLib!System.Threading.ManualResetEventSlim.Wait(int32,value class System.Threading.CancellationToken)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.SNI.SNIMarsHandle.Receive(class Microsoft.Data.SqlClient.SNI.SNIPacket&,int32)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.SNI.TdsParserStateObjectManaged.ReadSyncOverAsync(int32,unsigned int32&)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(unsigned int8&)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.TdsParser.TryRun(value class Microsoft.Data.SqlClient.RunBehavior,class Microsoft.Data.SqlClient.SqlCommand,class Microsoft.Data.SqlClient.SqlDataReader,class Microsoft.Data.SqlClient.BulkCopySimpleResultSet,class Microsoft.Data.SqlClient.TdsParserStateObject,bool&)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.TdsParser.Run(value class Microsoft.Data.SqlClient.RunBehavior,class Microsoft.Data.SqlClient.SqlCommand,class Microsoft.Data.SqlClient.SqlDataReader,class Microsoft.Data.SqlClient.BulkCopySimpleResultSet,class Microsoft.Data.SqlClient.TdsParserStateObject)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(unsigned int8[],value class TransactionManagerRequestType,class System.String,value class TransactionManagerIsolationLevel,int32,class Microsoft.Data.SqlClient.SqlInternalTransaction,class Microsoft.Data.SqlClient.TdsParserStateObject,bool)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransaction2005(value class TransactionRequest,class System.String,value class System.Data.IsolationLevel,class Microsoft.Data.SqlClient.SqlInternalTransaction,bool)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(value class System.Data.IsolationLevel,class System.String,bool)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(value class System.Data.IsolationLevel,class System.String)
> Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.SqlConnection.BeginDbTransaction(value class System.Data.IsolationLevel)

@Havunen
Copy link

Havunen commented Nov 3, 2023

I was experiencing similar issue. The slowness was caused by huge memory IO traffic.
Have you tested 5.2 preview nuget DLL from the CI servers? Possibly related #2120

Unfortunately there is no official nuget yet published where the fix is included

Edit: after reading through the issue again, it is probably a different issue but the symptoms sound very similar. You could try profiling memory allocations to check if that is the issue here.

Also in the stacktrace you posted there is method "sync over async" that does not sound optimal.

@DavoudEshtehari DavoudEshtehari added the 🆕 Triage Needed For new issues, not triaged yet. label Nov 3, 2023
@DavoudEshtehari
Copy link
Contributor

We will look into this.

@aloene
Copy link
Author

aloene commented Nov 3, 2023

Also, it may be unrelated, but I also (sometimes) experience slowness when creating transactions (before sending actual queries). For example, Microsoft.Data.SqlClient!Microsoft.Data.SqlClient.SqlConnection.BeginDbTransaction() can sometimes take neatly 1 second.

I could fix/workaround this different issue by disabling MARS. Luckily, it appears this feature is not required for my application. #422 is related to this one.

@aloene
Copy link
Author

aloene commented Nov 3, 2023

I was experiencing similar issue. The slowness was caused by huge memory IO traffic. Have you tested 5.2 preview nuget DLL from the CI servers? Possibly related #2120

Unfortunately there is no official nuget yet published where the fix is included

Edit: after reading through the issue again, it is probably a different issue but the symptoms sound very similar. You could try profiling memory allocations to check if that is the issue here.

Thanks for these ideas. I will try. Also, I was thinking I could try to migrate to .NET 7 as I have noticed Kerberos authentication implementation was different in latest M.D.SqlClient library starting from .NET 7 (Built-in class)... But this could be huge work... without guaranteed success.
I have upgraded to 5.1.2 with same results.

Also in the stacktrace you posted there is method "sync over async" that does not sound optimal.

Yes, this API contains much . NET 4.x migrated (synchronous) source code... That's why async calls are not always here... However, it should work better.

@JRahnama JRahnama removed the 🆕 Triage Needed For new issues, not triaged yet. label Nov 7, 2023
@JRahnama
Copy link
Contributor

JRahnama commented Nov 7, 2023

@aloene is it possible for you to gather network traces to find out where the delay is happening?

@aloene
Copy link
Author

aloene commented Nov 7, 2023

I have no network traces but I have profiled and then traced SQL connections creations yesterday and I found the issue mainly came from cumulated DNS queries durations (about 30-50ms each)
I have edited my krb5.conf file to avoid some of these DNS queries and it seems better. I had not posted yet because I would like to be sure the issue has really gone.
Also, I have improved the individual DNS query duration by disabling host search through all domains search list for my FQDNs. There was an OpenShift default DNS settings (ndots:5) that led to trying FQDNs lP address resolutions across the DNS domains search list for nothing.
Today, I compared timings between Kerberos and SQL authentication with these new settings : Kerberos is still a bit slower but I believe it makes sense.
I will try to reproduce the issue and will confirm here if it is OK.

@aloene
Copy link
Author

aloene commented Nov 23, 2023

I am closing the issue as it was very likely caused by DNS latencies as explained above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

4 participants