-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Huge "latency" when using AddGoogleDiagnosticsForAspNetCore on CloudRun #9868
Comments
I'll try to take a look later today. |
Thanks @amanda-tarafa In the meantime, I've tried to use BufferOptions.TimedBuffer(TimeSpan.FromSeconds(2)) Now, the requests are staying in the less than 100Ms range (way less, I see 65ms from my browser but logs are saying 3ms ❤️ ) However, I do not have any traces anymore ... So, let's say that I'm waiting your investigations with a great deal of impatience 😄 |
Since you are pointing at Google.Cloud.Diagnostics.AspNetCore documentation. Can I ask you whether you are using Google.Cloud.Diagnostics.AspNetCore or Google.Cloud.Diagnostics.AspNetCore3 and which version of .NET you are targetting? |
TargetFramework net6.0 |
I'll try to get back to you tomorrow. Sorry about the wait. |
Hello, did you find something @amanda-tarafa ? If I can help, do not hesitate to ask ! |
@christopheblin: Sorry, Amanda is on vacation this week; I'm sure she'll get back to you next week. |
Sorry again for the wait. Connection was awful where I was at.
|
Hi @amanda-tarafa, Thanks for the inputs. I made some tests (took me a while) with what you said in mind. using SizedBuffer, with various buffer sizes, I can see logs with a trace id, but the traces are NOT available I can confirm this by going to the trace dashboard and filtering on the service : no trace found Also, when we hit the buffer length, the penalty is around 70ms (which I find a bit high considering that this is normally a HTTP call from a GCP service to another GCP sevice ...) if I revert to NoBuffer, I see traces (but requests are slow, obviously) I'll try to start fresh from an empty and small project next week, I will report back here => in the meantime, if you have an idea (both for missing traces and for the time it should take), do not hesitate 😄 Side note : switching to "cpu always on" mode is not really an option for us If we take a 1req/sec, 24h a day, 365d a year, it is at least 4x more costly Considering our case where we have around 30 cloud runs (3 envs dev/staging/prod, 10 services), and that even production does not have this kind of workload, it would be 1200€ (40€x30services) against about 100€ for now (which is why we choose cloud run in the first place ...) |
Can you show your Google.Cloud.Diagnostics configuration code? Log and trace buffers are independent and you need to configure them separately. For double checking that everything is working, can you configure logging to use SizedBuffer but Trace to use no buffer? If that shows the traces then start by configuring SizedBuffer for tracing with a very small size, and increase that until you are comfortable with the balance between latency and freshnes of the info in the GCP console. The 70ms penalty, on the Google.Cloud.Diagnostics library, at this point, we basically just make a gRPC call to the service so I think there's not much there we can do.
Yes, I totally understand. It's just, using the timed buffer requires this, and there's no way around that am afraid. Would you consider using https://www.nuget.org/packages/Google.Cloud.Logging.Console if we added capailities for log entries to be correlated (it doesn't happen by default on CloudRun)? You wouldn't be able to augment traces or create new ones (unless you use Google.Cloud.Diagnostics for tracing or your own custom code), but if you are only interested in log entries being correlated to the request trace, then that might be a better option for you. Let me know and we can look into it. |
@amanda-tarafa thanks for the feedback next week, I'll try to setup a separate GCP project with only these cloud runs (without using my CI/CD and all other stuff)
I'll share the complete source code I'm not sure to understand your last sentence the trace views are super comfortable to use, but at least the following "grouping" works "good enough" right now so if it is possible to achieve this without using tracing (and to use the "default" stdout log collector that does not come with any grpc call penalty), then please show me 👍 note : if I understand correctly, using this lower level, we will lose the ability to track "distributed" logs (like service A emits pubsub message -> pubsub distribute message -> service B adds logs to same trace in new span). however, because it is not working right now with traces anyway, it will be ok ! |
It's not yet possible to achieve it, we'd have to make some changes to Google.Cloud.Logging.Console, but we would consider doing that. And maybe some support for the PubSub case (can't commit to that, but I could look into it). You would get the waterfall view though, if it ends up working. On PubSub not working currently, did you see my reply in #9856? If that's not working, just let me know (on that issue) and we can follow up there. |
I've created #9948 to track the work of possibly adding a correlation mechanism to Google.Cloud.Logging.Console. It's not the highest priority but I'll try to get to it next week. |
Yes, for propagating a tracing context when you publish a message, you'll have to wait for #9476. But for continuing a tracing context from a Pub/Sub message that you receive, you should be able to do the same as described in #9856, just extracting the context from the Pub/Sub message instead of from the ActiveMq message. |
Hello @amanda-tarafa Here is a minimal web api project demonstrating the issue Basically, there is a single /system route and we can control buffers for logs and traces with 2 env vars (and we test with different options). I deployed with various options for log and trace buffers
Full details with screenshots I know that using only 2 requests to benchmark is not enough, but it would require more time to have a test program (for example, 3 req/sec and verifying traces and logs thanks to response header x-cloud-trace-context) So, as we said earlier in the conversation, I will be happy to test a solution based only on "console" logs if I can still have the correlation between logs of the same request and the latency is back to <5ms However, for me, it looks like there is a weird problem somewhere with traces (maybe not in this lib ...) because, when we use TimedBuffer 10ms for traces, it appears that SOME traces are "forgotten in the middle" (whereas I could understand that all traces are lost or the last traces are lost, but not "in the middle") I hope that I provided enough details for more investigations (othewise, maybe we could organize a video call for ex, because writing everything down takes a lot of time ...) |
This is great information and repro, thanks a lot for that. I'll play around with it next week. Your observations that are unexpected for me:
This is exactly the behaviour that we expect from using a TimedBuffer in CloudRun, the timer threads are throtled when CPU is not on, and which threads are throtled is not in our control. It's posible that the first ones get through because they are closer to app initialization, and the last ones to app shutdown, etc. But we really don't expect TimedBuffer to work in CloudRun and similar other environments (unless always on), whit the symptons beeing lost traces and eventually thread starvation. I'll look at the other unexpected aspects next week and report back with my findings. Again, thanks for all your info an repro, it will definetely makes things easier. |
Great 👍 I forgot to mention : to deploy, I use the deploy.ps1 with Also, what is more strange is that, after this investigation, I look back to our settings in our "dev" cloud runs (where minInstance=0 and not always on). The latency (for a similar request) is around 10ms But we always have logs (at least I looked at 30+ requests and they all have the correlatation working in the logs explorer and it seems we have no logs missing) So it seems we have throtled traces but loags are not, this is strange to me ... Anyway, it seems consistent with the repro stuff (and we have more than 2 requests here 😄 ) Hope you will be able to find something next week, thanks for the awesome support ! |
OK, I've been playing around with your repro, and I can't reproduce, but I think there's a bug in the Logs View UI that you can access directly from a CloudRun Service page.
Can you please try again the SizedBuffer combinations and check LogsExplorer and Trace List View directly? This query in logs explorer should show the entries for the repro app, and from there you can jump to the Trace List View.
|
Thanks for the investigation First, I was doing all my screenshots ONLY from the logs explorer (I never use the logs in the cloud run which does not show traces nor correlated logs, so it is basically useless to me 😄 ) In my documetn, you can see that some logs have a trace (the icon is shown) but the view trace details is grayed out (or when I click on it it displays the error) I still have this behavior (some traces are msissing) even today (so more than 5 minutes after the shutdown 😄 ) And as you can see, this is for the case of sized buffers ! (i.e see the ResponseBody) And I also have the bug for NoBuffer (for traceMode) What can I do on my side ? Anyway, if you make some progress on #9948, I think we would be more confident with the approach (i.e the fact that we would have the same behavior locally and in GCP with stdout and without any penalty with grpc calls) |
A quick note as I'm out sick. Can you follow https://cloud.google.com/dotnet/docs/reference/Google.Cloud.Diagnostics.Common/latest#troubleshooting-tracing (and you can do the same for logging). Let's see if there's some exception beeing thrown with traces that we don't know about and that's causing your traces to not show up. |
@christopheblin I'll close this one tentatively given that I cannot reproduce and you prefer to wait for #9948 . We can always reopen if we need to. |
First, let me tell that what I am going to show here is probably a misconfiguration or a misunderstanding on my side
I have a very simple dotnet6 web api that just returns an in memory data. Locally, I have about 60ms (let's say less than 100ms)
I deploy this API in 2 cloud runs
In the first one, I use AddGoogleDiagnosticsForAspNetCore => latency goes up to 880ms (lets's say more than 500ms)
In the second one, I replaced with builder.Services.AddLogging(configure => configure.AddConsole()); => latency goes down to 86ms (agin let's say less than 100ms)
Here is how I configured :
I think the problem lies inside the
BufferOptions.NoBuffer
which I have used because of the doc referenced in the comments https://cloud.google.com/dotnet/docs/reference/Google.Cloud.Diagnostics.AspNetCore/latestHere is how it looks like in a trace (obviously, this is only available when we use AddGoogleDiagnosticsForAspNetCore
What value is correct to use for bufferOptions under Cloud Run ?
Is there somewhere else we can find information about that (i.e about the fact that this is going to use some CPU "after" the HTTP request) ?
Is there a way to manually control the buffer ? (i.e something like adding a middleware that will flush only once for each request)
Thanks for any pointers
The text was updated successfully, but these errors were encountered: