
Logging unstable under loss of connection #745

Closed
gjferriercoats opened this issue Nov 26, 2021 · 9 comments

Comments

@gjferriercoats

gjferriercoats commented Nov 26, 2021

Using the latest sample from spring-cloud-gcp-samples/spring-cloud-gcp-logging-sample, I am able to make the implementation unstable (runaway memory and thread allocation) by disabling the network connection so that attempts to log to GCP fail.

I was able to recreate the problem by adding the following:

  • Add @EnableScheduling to the Spring Boot application
  • Add a new class which has a fixed-rate schedule, see below

	package com.example;

	import org.apache.commons.logging.Log;
	import org.apache.commons.logging.LogFactory;
	import org.springframework.scheduling.annotation.Scheduled;
	import org.springframework.stereotype.Component;

	@Component
	public class OnASchedule {

		private static final Log LOGGER = LogFactory.getLog(OnASchedule.class);

		private int i = 0;

		@Scheduled(fixedRateString = "5000")
		public void publishHeartbeat() {
			++i;
			LOGGER.info("Message #" + i);
		}
	}

Start the application and watch logs being received both on the local console and in GCP. Terminate the network connection for 5 minutes (locally this was done by simply turning off the laptop's wifi; in a server setting by using iptables to drop packets for logging.googleapis.com). When the network connection is restored I expect the logs to flow to the console and GCP again, but this is not the case. Instead, exceptions are thrown repeatedly and memory and thread allocation grow out of control. In my latest test I saw ~7000 threads and ~75M objects of the types com.google.common.util.concurrent.AbstractFuture$Listener, com.google.common.util.concurrent.AggregateFuture$Listener and com.google.api.core.ApiFutureToListenableFuture (all measured with VisualVM).

If necessary I can provide a PR that can be used to demonstrate.
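
For context, the sample wires both a local console appender and the direct Cloud Logging appender. A minimal logback-spring.xml along these lines reproduces that setup (a sketch only -- the include paths assume the Spring Boot defaults and the 2.x spring-cloud-gcp-logging starter):

	<configuration>
		<include resource="org/springframework/boot/logging/logback/defaults.xml" />
		<include resource="org/springframework/boot/logging/logback/console-appender.xml" />
		<include resource="com/google/cloud/spring/logging/logback-appender.xml" />

		<root level="INFO">
			<appender-ref ref="CONSOLE" />
			<appender-ref ref="STACKDRIVER" />
		</root>
	</configuration>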

@elefeint
Contributor

Thank you for the report. I believe this is related to #614 -- something is holding on to the CANCELED grpc netty futures, preventing garbage collection. I am looking into which layer of our stack is holding on to the objects.

@gjferriercoats
Author

@elefeint do you have any idea of the timeline for a fix? This is causing production outages for us. If a fix is going to take some time, we'll need to look at removing the offending code.

@elefeint
Contributor

elefeint commented Dec 2, 2021

Did production issues start after a particular time / upgrade?
Or are the network issues new?

@gjferriercoats
Author

Not after a particular upgrade; the problem appears to be caused by an unstable network (which we're looking into separately).

@elefeint
Contributor

elefeint commented Dec 3, 2021

One thing I can recommend while we are looking at the issue is to switch from the direct API appender ("STACKDRIVER") to structured console logging ("CONSOLE_JSON"). Because the latter does not send logs directly to Cloud Logging, it should not run into the same issue when the network connection breaks.
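
For reference, a minimal logback-spring.xml using the structured console appender could look like the following (a sketch; the include path assumes the 2.x spring-cloud-gcp-logging starter):

	<configuration>
		<include resource="com/google/cloud/spring/logging/logback-json-appender.xml" />

		<root level="INFO">
			<appender-ref ref="CONSOLE_JSON" />
		</root>
	</configuration>

With this setup the logger itself makes no network call; on GCP the logging agent picks up the JSON-structured console output and forwards it to Cloud Logging.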

@gjferriercoats
Author

gjferriercoats commented Dec 6, 2021

If I've understood that correctly, it will just log to the console, and log events will not be forwarded to GCP for non-GCP (i.e. on-premise) deployments?

@elefeint
Contributor

@gjferriercoats Sorry, I missed this. That's correct for on-prem / local -- the logs are only processed when running in a GCP environment.

@elefeint
Contributor

elefeint commented Dec 29, 2021

@gjferriercoats There are two things that in combination will help:

  1. Upgrade to Spring Cloud GCP 2.0.7. We've picked up a client library fix for restoring batching, which will relieve heap pressure.
  2. Switch from synchronous to asynchronous logging, which will reduce the number of threads spawned and prevent logging issues from affecting the rest of your application's thread pools (such as the timed scheduler).
	<appender name="ASYNC_STACKDRIVER" class="ch.qos.logback.classic.AsyncAppender">
		<appender-ref ref="STACKDRIVER" />
	</appender>

	<root level="INFO">
		<appender-ref ref="ASYNC_STACKDRIVER" />
	</root>

These won't fix logging (I suspect due to an issue similar to googleapis/java-logging#645), but they will keep the rest of the application from becoming unstable. UPDATE: logging does recover after ~40 minutes.
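
A possible refinement, assuming only the standard ch.qos.logback.classic.AsyncAppender properties: the async queue can be sized and set to drop events rather than block callers when the STACKDRIVER appender cannot keep up, for example:

	<appender name="ASYNC_STACKDRIVER" class="ch.qos.logback.classic.AsyncAppender">
		<!-- buffer up to 512 events in memory while the connection is down -->
		<queueSize>512</queueSize>
		<!-- drop events instead of blocking application threads when the queue is full -->
		<neverBlock>true</neverBlock>
		<appender-ref ref="STACKDRIVER" />
	</appender>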

@meltsufin
Member

Please re-open if it's still an issue.
