
Huge "latency" when using AddGoogleDiagnosticsForAspNetCore on CloudRun #9868

Closed
christopheblin opened this issue Feb 10, 2023 · 22 comments
Assignees
Labels
api: cloudtrace Issues related to the Cloud Trace API. type: question Request for information or clarification. Not an issue.

Comments

@christopheblin

First, let me say that what I am going to show here is probably a misconfiguration or a misunderstanding on my side.

I have a very simple dotnet6 web API that just returns in-memory data. Locally, I get about 60ms (let's say less than 100ms).

I deploy this API in 2 Cloud Run services.

In the first one, I use AddGoogleDiagnosticsForAspNetCore => latency goes up to 880ms (let's say more than 500ms).

In the second one, I replaced it with builder.Services.AddLogging(configure => configure.AddConsole()); => latency goes down to 86ms (again, less than 100ms).

Here is how I configured it:

    public static void AddGcpLoggingAndTracing(this WebApplicationBuilder builder)
    {
        //clear the default providers so that only Google logging is used
        builder.Host.ConfigureLogging(logging =>
        {
            logging.ClearProviders();
        });

        //!\ needed in Cloud run env, https://cloud.google.com/dotnet/docs/reference/Google.Cloud.Diagnostics.AspNetCore/latest
        var bufferOptions = BufferOptions.NoBuffer();
        //!\ GoogleDiagnostics does not respect appsettings.json, https://stackoverflow.com/a/70834072/1545567
        var logLevel = LogLevel.Debug;
        //!\ trace all requests
        var traceFallbackPredicate = TraceDecisionPredicate.Create((req) => true);
        var qpsSampleRate = 100;

        builder.Services.AddGoogleDiagnosticsForAspNetCore(
            new AspNetCoreTraceOptions()
            {
                TraceFallbackPredicate = traceFallbackPredicate,
                ServiceOptions = new TraceServiceOptions
                {
                    Options = TraceOptions.Create(
                        qpsSampleRate: qpsSampleRate,
                        bufferOptions: bufferOptions
                    )
                }
            },
            new LoggingServiceOptions()
            {
                Options = LoggingOptions.Create(
                    logLevel: logLevel,
                    bufferOptions: bufferOptions
                ),
            }
        );
    }

I think the problem lies inside the BufferOptions.NoBuffer which I have used because of the doc referenced in the comments https://cloud.google.com/dotnet/docs/reference/Google.Cloud.Diagnostics.AspNetCore/latest

Here is how it looks in a trace (obviously, this is only available when we use AddGoogleDiagnosticsForAspNetCore):


What value is correct to use for bufferOptions under Cloud Run?

Is there somewhere else we can find information about this (i.e. about the fact that it is going to use some CPU "after" the HTTP request)?

Is there a way to manually control the buffer? (i.e. something like adding a middleware that flushes only once per request)

Thanks for any pointers

@amanda-tarafa amanda-tarafa added type: question Request for information or clarification. Not an issue. api: cloudtrace Issues related to the Cloud Trace API. labels Feb 10, 2023
@amanda-tarafa
Contributor

I'll try to take a look later today.

@christopheblin
Author

Thanks @amanda-tarafa

In the meantime, I've tried to use BufferOptions.TimedBuffer(TimeSpan.FromSeconds(2))

Now the requests are staying in the less-than-100ms range (way less: I see 65ms from my browser, but the logs say 3ms ❤️ )

However, I do not have any traces anymore ...


So, let's say that I'm awaiting your findings with a great deal of impatience 😄

@amanda-tarafa amanda-tarafa self-assigned this Feb 10, 2023
@amanda-tarafa
Contributor

Since you are pointing at the Google.Cloud.Diagnostics.AspNetCore documentation: can I ask whether you are using Google.Cloud.Diagnostics.AspNetCore or Google.Cloud.Diagnostics.AspNetCore3, and which version of .NET you are targeting?

@christopheblin
Author

TargetFramework net6.0
PackageReference Include="Google.Cloud.Diagnostics.AspNetCore3" Version="5.0.0"

@amanda-tarafa
Contributor

I'll try to get back to you tomorrow. Sorry about the wait.

@christopheblin
Author

Hello,

did you find something @amanda-tarafa ?

If I can help, do not hesitate to ask !

@jskeet
Collaborator

jskeet commented Feb 17, 2023

@christopheblin: Sorry, Amanda is on vacation this week; I'm sure she'll get back to you next week.

@amanda-tarafa
Contributor

Sorry again for the wait. The connection was awful where I was.

  • Indeed, you cannot use the timed buffer on CloudRun unless you configure the CPU to be always allocated. You can find details in the Background Activity section of the CloudRun documentation.

    • The timed buffer requires always-allocated CPU because it pushes entries on a thread spawned by a timer, separate from any request threads. The advantage is that pushing diagnostic entries (logs, traces, etc.) does not impact request latency. But it does require CPU allocation independent of service load; otherwise you not only risk losing diagnostic entries, but the system degrades over time as the spawned threads remain alive but throttled. This is what you are seeing in this comment.
  • If you use no buffer, then every single diagnostic entry will be pushed to the server as soon as it is recorded. I'm not surprised you see increased latency here compared with almost anything else.

  • Using the sized buffer seems to be the right approach in your case (unless you configure the CPU to be always allocated): diagnostic entries will be stored in a buffer (individual buffers for log and trace entries) and, once a buffer reaches the configured size, its entries are pushed within a request. So there'll be one request every now and then with somewhat increased latency from pushing (once) the buffer. You can use BufferOptions.SizedBuffer() in code like the one shown in Buffers and buffer configuration, where you can also read more about buffers.
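Following this advice, a sized-buffer setup could look like the sketch below. It mirrors the configuration code earlier in this thread; the byte sizes are illustrative assumptions, not recommendations, and should be tuned for your own latency/freshness balance.

```csharp
// Sketch only: Google.Cloud.Diagnostics with independent sized buffers
// for logging and tracing. Buffer sizes are illustrative assumptions.
var logBuffer = BufferOptions.SizedBuffer(65536);  // 64 KB log buffer (assumed size)
var traceBuffer = BufferOptions.SizedBuffer(4096); // 4 KB trace buffer (assumed size)

builder.Services.AddGoogleDiagnosticsForAspNetCore(
    new AspNetCoreTraceOptions
    {
        ServiceOptions = new TraceServiceOptions
        {
            // Trace entries are pushed from within a request once the
            // trace buffer fills up.
            Options = TraceOptions.Create(bufferOptions: traceBuffer)
        }
    },
    new LoggingServiceOptions
    {
        // Log entries buffer independently from trace entries.
        Options = LoggingOptions.Create(bufferOptions: logBuffer)
    });
```

The key point is that the two buffers are configured separately: a small trace buffer flushes often (fresher traces, more frequent latency hits), while a larger log buffer flushes rarely.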

@christopheblin
Author

Hi @amanda-tarafa,

Thanks for the inputs.

I made some tests (took me a while) with what you said in mind.

using SizedBuffer, with various buffer sizes, I can see logs with a trace id, but the traces are NOT available


I can confirm this by going to the trace dashboard and filtering on the service: no trace found.

Also, when we hit the buffer length, the penalty is around 70ms (which I find a bit high considering that this is normally an HTTP call from one GCP service to another GCP service ...)

if I revert to NoBuffer, I see traces (but requests are slow, obviously)

I'll try to start fresh from an empty and small project next week, I will report back here

=> in the meantime, if you have an idea (both for missing traces and for the time it should take), do not hesitate 😄

Side note: switching to "CPU always on" mode is not really an option for us.

If we take 1 req/sec, 24h a day, 365 days a year, it is at least 4x more costly

Considering our case, where we have around 30 Cloud Run services (3 envs dev/staging/prod, 10 services), and that even production does not have this kind of workload, it would be 1200€ (40€ x 30 services) against about 100€ for now (which is why we chose Cloud Run in the first place ...)

@amanda-tarafa
Contributor

using SizedBuffer, with various buffer sizes, I can see logs with a trace id, but the traces are NOT available
I can confirm this by going to the trace dashboard and filtering on the service : no trace found

Can you show your Google.Cloud.Diagnostics configuration code? Log and trace buffers are independent and you need to configure them separately. To double-check that everything is working, can you configure logging to use SizedBuffer but trace to use no buffer? If that shows the traces, then start by configuring SizedBuffer for tracing with a very small size, and increase it until you are comfortable with the balance between latency and freshness of the info in the GCP console.

Regarding the 70ms penalty: on the Google.Cloud.Diagnostics side, at this point we basically just make a gRPC call to the service, so I think there's not much we can do there.
I would also play around with the buffer size; it's possible that smaller payloads yield better times. And also with your CloudRun configuration (I am not very familiar with that, but maybe you can play around with bandwidth). One other thing to consider is whether the GCP project where your CloudRun runs is the same as the one you are sending the logs to. If it's not, you might want to consider using projects in the same region, etc.

Side note : switching to "cpu always on" mode is not really an option for us

Yes, I totally understand. It's just that using the timed buffer requires this, and there's no way around that, I'm afraid.

Would you consider using https://www.nuget.org/packages/Google.Cloud.Logging.Console if we added capabilities for log entries to be correlated (it doesn't happen by default on CloudRun)? You wouldn't be able to augment traces or create new ones (unless you use Google.Cloud.Diagnostics for tracing or your own custom code), but if you are only interested in log entries being correlated to the request trace, then that might be a better option for you. Let me know and we can look into it.

@christopheblin
Author

@amanda-tarafa thanks for the feedback

Next week, I'll try to set up a separate GCP project with only these Cloud Run services (without using my CI/CD and all the other stuff):

  • no-buffer = logging : no buffer, tracing : no buffer => should display traces
  • buffer-logs = logging : sized buffer, tracing : no buffer => should display traces
  • buffer = logging : sized buffer, tracing : sized buffer => should NOT display traces and incurs only a small cost on some requests (that's my current setup)
  • optimized-buffers = logging : "big" sized buffer, tracing : small sized buffer => should display traces (and we will see the latency)

I'll share the complete source code

I'm not sure I understand your last sentence.

the trace views are super comfortable to use, but at least the following "grouping" works "good enough" right now


so if it is possible to achieve this without using tracing (and to use the "default" stdout log collector that does not come with any grpc call penalty), then please show me 👍

Note: if I understand correctly, using this lower level, we will lose the ability to track "distributed" logs (like service A emits a pubsub message -> pubsub distributes the message -> service B adds logs to the same trace in a new span). However, because it is not working right now with traces anyway, it will be OK!

@amanda-tarafa
Contributor

It's not yet possible to achieve this; we'd have to make some changes to Google.Cloud.Logging.Console, but we would consider doing that. And maybe some support for the Pub/Sub case (I can't commit to that, but I could look into it). You would get the waterfall view, though, if it ends up working.

On PubSub not working currently, did you see my reply in #9856? If that's not working, just let me know (on that issue) and we can follow up there.

@amanda-tarafa
Contributor

I've created #9948 to track the work of possibly adding a correlation mechanism to Google.Cloud.Logging.Console. It's not the highest priority but I'll try to get to it next week.

@christopheblin
Author

For #9856: yes, I saw your answer, sorry for not following up (this is now done). For the pub/sub scenario: we are using cloud run => pub/sub push => http context (so I was more waiting for #9476 with regards to PubSub).

@amanda-tarafa
Contributor

For pub/sub scenario : we are using cloud run => pub/sub push => http context (so I was more waiting for #9476 with regards to PubSub)

Yes, for propagating a tracing context when you publish a message, you'll have to wait for #9476.

But for continuing a tracing context from a Pub/Sub message that you receive, you should be able to do the same as described in #9856, just extracting the context from the Pub/Sub message instead of from the ActiveMq message.

@christopheblin
Author

Hello @amanda-tarafa

Here is a minimal web api project demonstrating the issue
gcp-log-trace.zip

Basically, there is a single /system route, and we can control the buffers for logs and traces with 2 env vars (so we can test different options).
Everything is mostly at the top of Program.cs (except the controller).
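The env-var switch described above could be sketched roughly as follows; the variable names and buffer parameters here are my assumptions standing in for the ones in the attached project:

```csharp
// Sketch of env-var-driven buffer selection for the repro project.
// LOG_BUFFER_MODE / TRACE_BUFFER_MODE are hypothetical names; the real
// ones are in the attached zip.
static BufferOptions BufferFromEnv(string variableName) =>
    Environment.GetEnvironmentVariable(variableName) switch
    {
        "timed" => BufferOptions.TimedBuffer(TimeSpan.FromMilliseconds(10)),
        "sized" => BufferOptions.SizedBuffer(4096),
        _ => BufferOptions.NoBuffer() // default: push entries immediately
    };

// Each deployment variant below then only differs in its env vars.
var logBuffer = BufferFromEnv("LOG_BUFFER_MODE");
var traceBuffer = BufferFromEnv("TRACE_BUFFER_MODE");
```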

I deployed with various options for log and trace buffers

  • with NoBuffer for both logs & traces => 500ms latency, logs and traces OK
  • with Timed(10ms) for both logs & traces => <20ms latency, logs OK and traces KO (not all traces KO, very weird here, see details)
  • with Timed(10ms) for logs and NoBuffer for traces => 300ms latency, logs OK and traces KO (the latency clearly shows that NoBuffer is used, but traces are still not working)
  • with Sized(64KB) for logs and Sized(4KB) for traces => <5ms latency, logs KO and traces KO

Full details with screenshots
GCP Logs Traces.pdf

I know that using only 2 requests to benchmark is not enough, but it would require more time to build a test program (for example, 3 req/sec, verifying traces and logs thanks to the response header x-cloud-trace-context)
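Such a test program could be as small as the sketch below: fire a few requests and record each response's x-cloud-trace-context header, so every trace/log entry can be looked up in the console afterwards. The URL is a placeholder, not the real service address.

```csharp
// Sketch of a minimal check: send ~3 req/sec and print the
// x-cloud-trace-context response header for later lookup in the
// Logs Explorer / Trace view. The URL below is a placeholder.
using var client = new HttpClient();
for (var i = 0; i < 10; i++)
{
    var response = await client.GetAsync("https://YOUR-SERVICE.a.run.app/system");
    if (response.Headers.TryGetValues("x-cloud-trace-context", out var values))
    {
        Console.WriteLine($"request {i}: {string.Join(",", values)}");
    }
    await Task.Delay(333); // roughly 3 requests per second
}
```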

So, as we said earlier in the conversation, I will be happy to test a solution based only on "console" logs if I can still have the correlation between logs of the same request and the latency is back to <5ms

However, for me it looks like there is a weird problem somewhere with traces (maybe not in this lib ...) because, when we use TimedBuffer 10ms for traces, it appears that SOME traces are "forgotten in the middle" (I could understand all traces being lost, or the last traces being lost, but not ones "in the middle")

I hope that I provided enough details for more investigation (otherwise, maybe we could organize a video call, for example, because writing everything down takes a lot of time ...)

@amanda-tarafa
Contributor

This is great information and a great repro; thanks a lot for that. I'll play around with it next week. Your observations that are unexpected for me:

  • with Timed(10ms) for logs and NoBuffer for traces => 300ms latency, logs OK and traces KO (the latency clearly shows that NoBuffer is used, but traces are still not working) (I would have expected to see traces here)
  • with Sized(64KB) for logs and Sized(4KB) for traces => <5ms latency, logs KO and traces KO (I would have expected to see both logs and traces here)

because, when we use TimedBuffer 10ms for traces, it appears that SOME traces are "forgotten in the middle" (whereas I could understand that all traces are lost or the last traces are lost, but not "in the middle")

This is exactly the behaviour we expect from using a TimedBuffer in CloudRun: the timer threads are throttled when the CPU is not on, and which threads are throttled is not in our control. It's possible that the first ones get through because they are closer to app initialization, and the last ones to app shutdown, etc. But we really don't expect TimedBuffer to work in CloudRun and similar environments (unless always on), with the symptoms being lost traces and eventually thread starvation.

I'll look at the other unexpected aspects next week and report back with my findings. Again, thanks for all your info and the repro; it will definitely make things easier.

@christopheblin
Author

Great 👍

I forgot to mention : to deploy, I use the deploy.ps1 with gcloud builds submit . --config=cloudbuild.yaml + gcloud run services replace .\cloudrun.yaml
If you make changes to the code, change vXX in cloudbuild.yaml & cloudrun.yaml and then redeploy (otherwise, CloudRun would stay on previous version 😄)

Also, what is stranger is that, after this investigation, I looked back at our settings in our "dev" Cloud Run services (where minInstance=0 and not always on).
We are using var bufferOptions = BufferOptions.TimedBuffer(TimeSpan.FromSeconds(2)); for both logs and traces.

The latency (for a similar request) is around 10ms:
sometimes there are traces (view trace details is NOT grayed out and we can see the trace)
sometimes we have traces that are not working (view trace details is NOT grayed out but there is an error when clicking it) [1]
sometimes we have no trace (view trace details is grayed out)

But we always have logs (at least I looked at 30+ requests; they all have the correlation working in the logs explorer, and it seems no logs are missing).

So it seems traces are throttled but logs are not, which is strange to me ...

Anyway, it seems consistent with the repro stuff (and we have more than 2 requests here 😄 )

Hope you will be able to find something next week, thanks for the awesome support !

[1] (screenshot: the error shown when clicking "view trace details")

@amanda-tarafa
Contributor

OK, I've been playing around with your repro and I can't reproduce, but I think there's a bug in the Logs View UI that you can access directly from a CloudRun service page.
I've attached this document with screenshots and with what I've tried; here is the short version:

  • TimedBuffer does not work in CloudRun (unless CPU always on). We understand very well why, and although some of the symptoms may be surprising, they are not unexpected in a scenario where threads are throttled and the thread pool is affected. Basically, anything can fail at some point.

  • Sized buffers work as expected. I've tried combinations of Sized1 and Sized2 for both logging and trace, and every time it worked well. Since the Sized2 buffer is on the bigger side, whatever you have configured to use Sized2 won't be available immediately. Make several requests instead of 2 (5 or 6 seems to be fine, 10 to be on the safe side), or wait for application shutdown, and give the UI a few minutes to refresh; the information is not shown immediately anywhere. It even happened to me that I saw newer entries before older ones, but eventually, roughly 5 mins, the UI catches up.

    • There's a bug in the Logs View UI that you can access directly from a CloudRun service page: it never refreshed to show the information that was buffered and wasn't received "immediately". So, once the buffer has been flushed, you only get to see the flushed information in the general-purpose Logs Explorer or the general-purpose Trace List View. I'll try to find the right team to look into that.

Can you please try again the SizedBuffer combinations and check LogsExplorer and Trace List View directly? This query in logs explorer should show the entries for the repro app, and from there you can jump to the Trace List View.

resource.type = "cloud_run_revision"
resource.labels.service_name = "test-api-dev"
resource.labels.location = "europe-west1"
severity >= DEFAULT

@christopheblin
Author

Hi @amanda-tarafa

Thanks for the investigation

First, I was doing all my screenshots ONLY from the Logs Explorer (I never use the logs in the Cloud Run page, which shows neither traces nor correlated logs, so it is basically useless to me 😄 )

In my document, you can see that some logs have a trace (the icon is shown) but "view trace details" is grayed out (or, when I click on it, it displays the error).

I still have this behavior (some traces are missing) even today (so more than 5 minutes after the shutdown 😄 )


And as you can see, this is for the case of sized buffers! (i.e. see the ResponseBody)


And I also have the bug for NoBuffer (for traceMode)


What can I do on my side?
Is there any way to activate some logging of the gRPC calls made by the library in stdout, in order to be sure the bug is on the trace component side and not in this lib?
Should I file a support case (but I think support teams are not coders, so it will be difficult for me to convey such a tricky bug ...)? Do you have access to someone on the trace component team who may help?

Anyway, if you make some progress on #9948, I think we would be more confident with that approach (i.e. the fact that we would have the same behavior locally and in GCP with stdout, and without any gRPC call penalty).
We would lose the traces, but they are not working properly for us today anyway ...

@amanda-tarafa
Contributor

A quick note as I'm out sick: can you follow https://cloud.google.com/dotnet/docs/reference/Google.Cloud.Diagnostics.Common/latest#troubleshooting-tracing (and you can do the same for logging)? Let's see if there's some exception being thrown with traces that we don't know about and that's causing your traces to not show up.

@amanda-tarafa
Contributor

@christopheblin I'll close this one tentatively, given that I cannot reproduce and you prefer to wait for #9948.

We can always reopen if we need to.
