
Documentation / Default for Logging flushlevel can be considered harmful #2906

Closed
wleese opened this issue May 21, 2024 · 7 comments · Fixed by #2922
Labels: priority: p2, type: bug (Something isn't working)

Comments


wleese commented May 21, 2024

The documentation found at https://googlecloudplatform.github.io/spring-cloud-gcp/5.3.0/reference/html/index.html#log-via-api says:

flushLevel (default: WARN): If a log entry with this level is encountered, trigger a flush of locally buffered log entries to Cloud Logging. This can also be set via the STACKDRIVER_LOG_FLUSH_LEVEL environment variable.

I believe the default that Spring Cloud GCP uses is actually ERROR, when the user has not explicitly set a flushLevel.

However, looking at googleapis/google-cloud-java#4254, the default the library itself uses is null (OFF), because of the severe performance issues that can occur when a log statement at the flushing level is made: it effectively blocks the thread that is performing the logging statement.

Unless I'm misunderstanding the situation, I believe it would be correct to adhere to the library's default of not flushing based on log severity, to avoid performance issues, and of course to correct the documentation, whatever is decided.
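
For reference, here is a rough sketch of the same knob at the java-logging client level. This is only an illustration: the method names are taken from memory of the client API and the class name is made up, so treat it as an assumption rather than a verified example.

    import com.google.cloud.logging.Logging;
    import com.google.cloud.logging.LoggingOptions;
    import com.google.cloud.logging.Severity;

    // Sketch only: after googleapis/google-cloud-java#4254 the client performs no
    // automatic flush by default; an application that still wants flush-on-severity
    // has to opt in explicitly. Method names assumed from the java-logging client.
    public class ClientFlushSeverityExample {
      public static void main(String[] args) throws Exception {
        try (Logging logging = LoggingOptions.getDefaultInstance().getService()) {
          logging.setFlushSeverity(Severity.ERROR); // opt back in to blocking flushes on ERROR
          // ... write entries via logging.write(...) as usual ...
        }
      }
    }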

breun (Contributor) commented May 21, 2024

We have run into performance and availability issues in production due to this default flushing setting, so we are now indeed setting this to OFF to make sure that no flushing (which is a blocking operation) happens.
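
For anyone wanting to do the same, here is a rough programmatic sketch of what we mean. We actually configure this in our logback XML; the appender name and log name below are placeholders, and setFlushLevel is assumed to be the appender's setter for this option.

    import ch.qos.logback.classic.Level;
    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import com.google.cloud.logging.logback.LoggingAppender;
    import org.slf4j.LoggerFactory;

    // Rough sketch: attach the Cloud Logging logback appender with severity-based
    // flushing disabled. "CLOUD" and "application.log" are placeholder names; a
    // Spring Boot app would normally set the equivalent in logback-spring.xml.
    public class DisableFlushOnSeverity {
      public static void main(String[] args) {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        LoggingAppender appender = new LoggingAppender();
        appender.setContext(context);
        appender.setName("CLOUD");
        appender.setLog("application.log");
        appender.setFlushLevel(Level.OFF); // no severity triggers a blocking flush
        appender.start();

        Logger root = context.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);
        root.addAppender(appender);

        root.warn("hi!"); // buffered and written asynchronously, no flush on WARN
      }
    }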

lqiu96 (Contributor) commented May 23, 2024

Thanks for reporting this. I'm not too familiar with logging and just trying to understand the issue more.

It effectively blocks the thread that is performing the logging statement.

Would you be able to elaborate on this a bit more? Is this due to a build-up of buffered logs until an ERROR log is encountered (potentially blocking the thread for X seconds as it flushes all the logs)?

Also, if possible, would you have a simple repro with spring that I can try?

lqiu96 added the type: bug (Something isn't working) and priority: p2 labels on May 23, 2024
wleese (Author) commented May 24, 2024

I believe the issue I linked in the original post explains the issue better than I can, but I'll try to do so again :)
Also note that the mere fact that the library Spring Cloud GCP uses has a different default value for flushLevel than the one Spring Cloud GCP applies is a red flag in itself (even without knowing why Spring Cloud GCP deviates).

With the current setup, if an application uses Spring Cloud GCP and calls e.g. logger.warn("hi!"); (although I believe the effective flushLevel is ERROR, which conflicts with the Spring Cloud GCP documentation), a stack trace of that thread would look something like this:

"DefaultDispatcher-worker-8" - Thread t@165
   java.lang.Thread.State: TIMED_WAITING
        at java.base@21.0.3/jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for <246711db> (a com.google.common.util.concurrent.CollectionFuture$ListFuture)
        at java.base@21.0.3/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:269)
        at app//com.google.common.util.concurrent.OverflowAvoidingLockSupport.parkNanos(OverflowAvoidingLockSupport.java:38)
        at app//com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:458)
        at app//com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:118)
        at app//com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:75)
        at app//com.google.cloud.logging.LoggingImpl.flush(LoggingImpl.java:910)
        at app//com.google.cloud.logging.LoggingImpl.write(LoggingImpl.java:893)
        at app//com.google.cloud.logging.logback.LoggingAppender.append(LoggingAppender.java:403)
        at app//com.google.cloud.logging.logback.LoggingAppender.append(LoggingAppender.java:111)
        at app//ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:85)
        at app//ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
        at app//ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:272)
        at app//ch.qos.logback.classic.Logger.callAppenders(Logger.java:259)
        at app//ch.qos.logback.classic.Logger.log(Logger.java:817)
        at app//org.slf4j.spi.DefaultLoggingEventBuilder.log(DefaultLoggingEventBuilder.java:163)
        at app//org.slf4j.spi.DefaultLoggingEventBuilder.log(DefaultLoggingEventBuilder.java:137)
        at app//com.netflix.graphql.dgs.exceptions.DefaultDataFetcherExceptionHandler.logException(DefaultDataFetcherExceptionHandler.kt:68)
        at app//com.netflix.graphql.dgs.exceptions.DefaultDataFetcherExceptionHandler.doHandleException(DefaultDataFetcherExceptionHandler.kt:58)
        at app//com.netflix.graphql.dgs.exceptions.DefaultDataFetcherExceptionHandler.handleException(DefaultDataFetcherExceptionHandler.kt:38)
        at app//a.b.c(Handler.kt:26)
        at app//graphql.execution.ExecutionStrategy.asyncHandleException(ExecutionStrategy.java:406)
        at app//graphql.execution.ExecutionStrategy.handleFetchingException(ExecutionStrategy.java:394)
        at app//graphql.execution.ExecutionStrategy.lambda$fetchField$5(ExecutionStrategy.java:312)
        at app//graphql.execution.ExecutionStrategy$$Lambda/0x00007ab8fd10edc0.apply(Unknown Source)
        at java.base@21.0.3/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
        at java.base@21.0.3/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
        at java.base@21.0.3/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base@21.0.3/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194)
        at app//org.dataloader.DataLoaderHelper.lambda$dispatchQueueBatch$3(DataLoaderHelper.java:282)
        at app//org.dataloader.DataLoaderHelper$$Lambda/0x00007ab8fd12e240.apply(Unknown Source)
        at java.base@21.0.3/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:990)
        at java.base@21.0.3/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:974)
        at java.base@21.0.3/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base@21.0.3/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194)
        at app//kotlinx.coroutines.future.CompletableFutureCoroutine.onCancelled(Future.kt:66)
        at app//kotlinx.coroutines.AbstractCoroutine.onCompletionInternal(AbstractCoroutine.kt:91)
        at app//kotlinx.coroutines.JobSupport.finalizeFinishingState(JobSupport.kt:237)
        at app//kotlinx.coroutines.JobSupport.tryMakeCompletingSlowPath(JobSupport.kt:910)
        at app//kotlinx.coroutines.JobSupport.tryMakeCompleting(JobSupport.kt:867)
        at app//kotlinx.coroutines.JobSupport.makeCompletingOnce$kotlinx_coroutines_core(JobSupport.kt:832)
        at app//kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:100)
        at app//kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
        at app//kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at app//kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
        at app//kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
        at app//kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
        at app//kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)

The interesting parts of this stack trace are (1) the fact that the thread is in TIMED_WAITING and (2) how it got there:

        at app//com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:75)
        at app//com.google.cloud.logging.LoggingImpl.flush(LoggingImpl.java:910)
        at app//com.google.cloud.logging.LoggingImpl.write(LoggingImpl.java:893)
        at app//com.google.cloud.logging.logback.LoggingAppender.append(LoggingAppender.java:403)
        at app//com.google.cloud.logging.logback.LoggingAppender.append(LoggingAppender.java:111)

The logger.warn("hi!") goes to com.google.cloud.logging.LoggingImpl.write for an asynchronous write, but there it determines flushLevel == warn so it performs a flush via com.google.cloud.logging.LoggingImpl.flush. That flush, leads to com.google.common.util.concurrent.ForwardingFuture.get which will block our thread.

What's wrong with blocking the thread, you may wonder? That really depends. But when threads are limited (very common) and new threads are required for the workload (e.g. handling REST calls), you will run out of threads or heap if the pressure is high enough. And even if there is little pressure on the application, if the API (logging.googleapis.com) has an issue and we're unfortunate enough to perform a logger.warn, the thread might be blocked for 50 seconds (the gRPC timeout), so even an application with little load can go down.

Asynchronous logging is a requirement for high-performance applications. The fact that there is a commonly accessed code path that blocks the thread doing the logging (from the user's code) is a trap that can easily bring down an application (in our experience). Ideally the flush would happen asynchronously (given that the appender has already been set up for asynchronous logging), but the library developers said this is not possible at the moment, so they decided to turn off the flushing by default instead. I believe that decision was the right one, and that Spring Cloud GCP should not deviate from it by having a different flushLevel, effectively re-enabling a dangerous feature that was deliberately turned off.

breun (Contributor) commented May 24, 2024

@wleese Nice explanation!

The idea of the flushLevel option seems to be that you can indicate that log messages that are ‘important enough’ (by default warn and higher apparently, although the docs say error) should be delivered without any delay by flushing them immediately. The problem is that flushing is a blocking operation, and if a lot of these ‘important enough to flush immediately’ logs happen in a short amount of time, the whole application may come to a grinding halt because of this behavior.

The safest default for flushLevel would indeed be OFF, and maybe the documentation for this setting should warn that setting it to any other value, to flush immediately for any log message at or above the configured level, can cause issues, because flushing is a blocking operation and thus blocks your application code from executing until the flush is done.

lqiu96 (Contributor) commented May 24, 2024

Thanks for the added context and explanations. I do not know the history of the deviation between the logging/flushing levels. Having the flush operation block for X seconds does seem quite problematic. I believe the older example I saw had someone waiting for 2 hours, and your comment above mentions ~50 seconds.

CC: @meltsufin @burkedavison for additional opinions on this. Given that java-logging has changed its default flush level, perhaps this is something we should consider for spring-cloud-gcp?

burkedavison (Member) commented:

OP's linked logging client disabled-by-default PR happened after this module was developed, which at least explains how we got here. There doesn't appear to be any intent to have different defaults in Spring Cloud GCP or the java-logging-logback appender.

ch.qos.logback.classic.Level.OFF exists (#nice), and could be used to tell the logback appender to disable flushing.

I'd like to hear thoughts about the backward compatibility impact for customers in making the change, but generally agree the default should be 'disabled'.

meltsufin (Member) commented:

I think it makes sense to match the change to the default in googleapis/google-cloud-java#4254.
