Only 100 streaming calls get responses under very specific conditions #2404
Comments
Ping for visibility - are there any maintainers here with thoughts on this?
Has the client been configured to create multiple connections if required? 100 streams is a common limit per-connection. It is what Kestrel uses.
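For reference, that per-connection stream limit is server-side configuration; the sketch below shows where Kestrel's default of 100 lives on a server you control (the value 300 is purely illustrative, and Firestore's server-side limit obviously isn't ours to change):

```csharp
// Sketch: raising Kestrel's default of 100 concurrent streams per HTTP/2
// connection, on a server you control (the value 300 is illustrative only).
var builder = WebApplication.CreateBuilder(args);

builder.WebHost.ConfigureKestrel(kestrel =>
{
    kestrel.Limits.Http2.MaxStreamsPerConnection = 300;
});

var app = builder.Build();
app.Run();
```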
Yes, as per my list of experiments, I've experimented with EnableMultipleHttp2Connections and MaxConnectionsPerServer, with no visible impact. If there's another setting you'd like me to try, I'm all ears. (I also don't know why this would be affected by making an initial unary call.)
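To make those experiments concrete, the settings in question are typically applied on the handler when the channel is built - a sketch with illustrative values, not the actual repro code:

```csharp
using Grpc.Net.Client;

var handler = new SocketsHttpHandler
{
    // Open additional HTTP/2 connections once a connection's concurrent stream limit is hit.
    EnableMultipleHttp2Connections = true,
    // Raise the overall cap on connections to a single server (illustrative value).
    MaxConnectionsPerServer = 16
};

var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com", new GrpcChannelOptions
{
    HttpHandler = handler
});
```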
Ok. I don't know Firestore and don't have an account for it. If you have an account I can use in a simple way, such as just running a console app with the given parameters, then I could test it out on my machine. Email me details if that's an option. Alternatively, you can gather more information on your machine.

For getting more information about the client, I suggest you set up client logging. The most likely place for the hang to happen is inside SocketsHttpHandler. The handler uses EventSource instead of Microsoft.Extensions.Logging for its logging, so also create a listener for it at the start of the program. This is what I use inside gRPC tests for seeing handler logging: https://github.com/grpc/grpc-dotnet/blob/master/test/Shared/HttpEventSourceListener.cs

The detailed handler logging will let you know whether request frames are successfully being sent to the server, or if they're queuing somewhere in the client.
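For anyone following along, a minimal version of that setup might look like the sketch below: gRPC client logging is wired up through GrpcChannelOptions.LoggerFactory, while SocketsHttpHandler telemetry comes through EventSource, so a small EventListener is needed. The name filter and log levels here are assumptions modelled loosely on the linked HttpEventSourceListener.

```csharp
using System.Diagnostics.Tracing;
using Grpc.Net.Client;
using Microsoft.Extensions.Logging;   // plus the Microsoft.Extensions.Logging.Console package

// Create the listener before any HTTP activity so no handler events are missed.
using var httpEvents = new HandlerEventListener();

// gRPC client logging goes through Microsoft.Extensions.Logging via the channel options.
using var loggerFactory = LoggerFactory.Create(builder =>
    builder.AddConsole().SetMinimumLevel(LogLevel.Debug));

var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com", new GrpcChannelOptions
{
    LoggerFactory = loggerFactory
});
// ... create the FirestoreClient from the channel and run the repro as usual.

// Minimal EventListener for SocketsHttpHandler telemetry. The name filter mirrors
// (approximately) what the linked HttpEventSourceListener does.
internal sealed class HandlerEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name.Contains("System.Net"))
        {
            EnableEvents(eventSource, EventLevel.Verbose);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload is null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.EventSource.Name}/{eventData.EventName}: {payload}");
    }
}
```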
Back to testing now. Just in case it was relevant, I tried
Thanks for the logging details - I've now got a 41,000-line log file to look through :)
Ooh, looking at the logs again, I was wrong about this - or at least, somewhat wrong. Only ~100 manage to send their request to start with... the other 200 only get to send their requests when the listen operations are being cancelled. So the log for .NET 8 looks like this:
The log for .NET 6 looks like this:
So I'm pretty sure it is the send that's blocking.
Okay, I've tweaked the test app a little:
This makes it easier to isolate the logs for the one failed listener. Here's an example of the logging I get - the send call never returns
Whereas if I isolate a single successful listener, I believe the log for sending the request looks like this:
So the first part of the "good" log that's not in the "bad" log is RequestLeftQueue. I assume that means it's still in the queue, but I don't know enough about the queue to know what the next step should be...
I see what it is. The client and server are using HTTP/3. The "very specific conditions" are exactly the conditions under which .NET supports HTTP/3. I'm sure the problem is still the connection's stream limit, but I don't think there is a way to allow the .NET client to create more connections when the limit is reached for HTTP/3. A couple of issues to improve:
I think an API for this will be added in .NET 9. You can work around this issue by creating a delegating handler that sets HttpRequestMessage.VersionPolicy to HttpVersionPolicy.RequestVersionExact.
Thanks for filing the additional issues. I'll try the version policy to validate your diagnosis - it sounds right to me though. (Would love to know why the initial unary RPC makes a difference, but...)

We (Google Cloud client libraries) have a recurring problem: when there are aspects of the HTTP message handler that need tweaking, we have no clean way of doing it - basically either we ask the customer to create the handler themselves, or we'd have to replicate all the logic in Grpc.Net to work out which handler to create. This is another example of where it would be nice to have more flexibility. I'll ponder that aspect more and create a new issue with a proposal. I'm expecting a bunch of back and forth on it though :)
HTTP/3 is negotiated via a response header (Alt-Svc). The initial unary request goes over HTTP/2, then the follow-up listener calls are made on an upgraded HTTP/3 connection. I'm guessing that without the initial unary request, the 300 listeners are all created on HTTP/2 before any response is received.
Hooray - I can confirm that with the following code, all is well:

```csharp
FirestoreClient CreateClient()
{
    string[] scopes =
    {
        "https://www.googleapis.com/auth/cloud-platform",
        "https://www.googleapis.com/auth/datastore",
    };
    var cred = GoogleCredential.GetApplicationDefault().CreateScoped(scopes);
    if (cred.UnderlyingCredential is ServiceAccountCredential serviceCredential)
    {
        cred = GoogleCredential.FromServiceAccountCredential(serviceCredential.WithUseJwtAccessWithScopes(true));
    }
    var channelOptions = new GrpcChannelOptions
    {
        Credentials = cred.ToChannelCredentials(),
        HttpHandler = new ExactVersionHandler(new SocketsHttpHandler { EnableMultipleHttp2Connections = true })
    };
    var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com", channelOptions);
    return new FirestoreClient(channel);
}

internal class ExactVersionHandler : DelegatingHandler
{
    internal ExactVersionHandler(HttpMessageHandler handler) : base(handler)
    {
    }

    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.VersionPolicy = HttpVersionPolicy.RequestVersionExact;
        return base.SendAsync(request, cancellationToken);
    }

    protected override HttpResponseMessage Send(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.VersionPolicy = HttpVersionPolicy.RequestVersionExact;
        return base.Send(request, cancellationToken);
    }
}
```

(That's clearly not production code, of course... but it proves your hypothesis.) Thanks so much! I'll leave this open for you to handle as you wish - now I need to work out what to do in client libraries :)
Overloading Send isn't necessary - the gRPC client only uses the async path. I think it might be worth adding the HTTP version and version policy settings to GrpcChannelOptions.
Right. I only did so here as it was quicker to do that than check whether it would be used ;)
That would be wonderful. I think it might be worth doing a general audit of "what's worth including in channel options?" - high on my priority list would be keep-alive settings, if you'd be open to that.
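For context, this is roughly what configuring keep-alive looks like today when the handler is created directly - a hedged sketch with illustrative values, showing the SocketsHttpHandler properties that a channel-level setting would presumably map onto:

```csharp
using Grpc.Net.Client;

// Illustrative keep-alive settings on the underlying handler (not part of GrpcChannelOptions today).
var handler = new SocketsHttpHandler
{
    // Send an HTTP/2 PING if no frames have been received on the connection for this long...
    KeepAlivePingDelay = TimeSpan.FromSeconds(60),
    // ...and treat the connection as broken if the PING isn't acknowledged within this window.
    KeepAlivePingTimeout = TimeSpan.FromSeconds(30),
    // Ping even when there are no active streams (the default only pings while requests are active).
    KeepAlivePingPolicy = HttpKeepAlivePingPolicy.Always,
    EnableMultipleHttp2Connections = true
};

var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com", new GrpcChannelOptions
{
    HttpHandler = handler
});
```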
Any more thoughts on updating GrpcChannelOptions? We've seen a few more issues like this where we've suggested the workaround above. I'll put a bit of a hack in Google.Api.Gax.Grpc as an option if necessary, but GrpcChannelOptions would be much better...
I think it is a good idea. And defaulting to HTTP/2 only seems like a good idea too. Making HTTP/3 an explicit choice would reduce confusion. Just need to find the time to do it.
PR: #2514
Original GitHub issue: googleapis/google-cloud-dotnet#12318
Diagnostic test solution: https://github.com/googleapis/google-cloud-dotnet/tree/main/issues/Issue12318
The most useful directory in there (in terms of looking at the code) is probably RawGrpcStubs.
Currently, I can reproduce this when talking to Firestore. I haven't managed to reproduce it with a local gRPC server - you can see attempts in the repro directory above, but they didn't come to anything.
The repro goes:
I would expect every listener to receive a few responses - so we'd report at the end "We have 300 documents, and 300 listeners received a response."
What we actually see is:
I can give more version details if that would be useful - let me know exactly what's required.
Some experiments:
It's not clear to me what the next step is in terms of diagnosing this - any suggestions would be welcome.
Looking at logging, it looks like all 300 listeners manage to send their request, but then 200 of the 300 don't see any responses. In an earlier test I thought I'd seen only 100 listeners make a request, the rest being blocked - but I suspect that was a diagnostics failure.
I don't remember seeing a difference between Windows 10 and Windows 11 before when it comes to running .NET (as opposed to .NET Framework) code - and it's rare to spot differences from .NET 6 to later versions like this.