
Logging unstable under loss of connection #745

Closed
gjferriercoats opened this issue Nov 26, 2021 · 9 comments

Comments

@gjferriercoats

gjferriercoats commented Nov 26, 2021

Using the latest sample from spring-cloud-gcp-samples/spring-cloud-gcp-logging-sample, I am able to make the implementation unstable (runaway memory and thread allocation) by disabling the network connection so that attempts to log to GCP fail.

I was able to recreate the problem by adding the following:

  • Add @EnableScheduling to the Spring Boot application
  • Add a new class which has a fixed-rate schedule, see below

	package com.example;

	import org.apache.commons.logging.Log;
	import org.apache.commons.logging.LogFactory;
	import org.springframework.scheduling.annotation.Scheduled;
	import org.springframework.stereotype.Component;

	@Component
	public class OnASchedule {

		private static final Log LOGGER = LogFactory.getLog(OnASchedule.class);

		private int i = 0;

		@Scheduled(fixedRateString = "5000")
		public void publishHeartbeat() {
			++i;
			LOGGER.info("Message #" + i);
		}
	}

Start the application and watch logs being received both on the local console and in GCP. Terminate the network connection for 5 minutes (locally this was done by simply turning off the laptop's wifi; in a server setting by using iptables to drop packets for logging.googleapis.com). When the network connection is restored I expect the logs to flow to the console and GCP again, but this is not the case. Instead, exceptions are thrown repeatedly and memory and thread allocation grow out of control. In my latest test I saw ~7000 threads and ~75M objects of the types com.google.common.util.concurrent.AbstractFuture$Listener, com.google.common.util.concurrent.AggregateFuture$Listener and com.google.api.core.ApiFutureToListenableFuture (all measured with VisualVM).

If necessary I can provide a PR that can be used to demonstrate.
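
For context, the sample wires both a local console appender and the direct Cloud Logging appender. A minimal logback-spring.xml along these lines reproduces that setup (a sketch only -- the include paths assume the Spring Boot defaults and the 2.x spring-cloud-gcp-logging starter):

	<configuration>
		<include resource="org/springframework/boot/logging/logback/defaults.xml" />
		<include resource="org/springframework/boot/logging/logback/console-appender.xml" />
		<include resource="com/google/cloud/spring/logging/logback-appender.xml" />

		<root level="INFO">
			<appender-ref ref="CONSOLE" />
			<appender-ref ref="STACKDRIVER" />
		</root>
	</configuration>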

@elefeint
Contributor

Thank you for the report. I believe this is related to #614 -- something is holding on to the CANCELED grpc netty futures, preventing garbage collection. I am looking into which layer of our stack is holding on to the objects.

@gjferriercoats
Author

@elefeint do you have any idea of the timeline for a fix? This is causing production outages for us. If a fix is going to take some time, we'll need to look at removing the offending code.

@elefeint
Contributor

elefeint commented Dec 2, 2021

Did production issues start after a particular time / upgrade?
Or are the network issues new?

@gjferriercoats
Author

Not after a particular upgrade; the problem appears to be caused by an unstable network (which we're looking into separately).

@elefeint
Contributor

elefeint commented Dec 3, 2021

One thing I can recommend while we are looking at the issue is to switch from the direct API appender ("STACKDRIVER") to structured console logging ("CONSOLE_JSON"). Because the latter does not send logs directly to Cloud Logging, it should not run into the same issue when the network connection breaks.
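
For reference, a minimal logback-spring.xml using the structured console appender could look like the following (a sketch; the include path assumes the 2.x spring-cloud-gcp-logging starter):

	<configuration>
		<include resource="com/google/cloud/spring/logging/logback-json-appender.xml" />

		<root level="INFO">
			<appender-ref ref="CONSOLE_JSON" />
		</root>
	</configuration>

With this setup the logger itself makes no network call; on GCP the logging agent picks up the JSON-structured console output and forwards it to Cloud Logging.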

@gjferriercoats
Author

gjferriercoats commented Dec 6, 2021

If I've understood that correctly, it will just log to the console, and log events will not be forwarded to GCP for non-GCP (i.e. on-premise) deployments?

@elefeint
Contributor

@gjferriercoats Sorry, I missed this. That's correct for on-prem / local -- the logs are only processed when running in a GCP environment.

@elefeint
Contributor

elefeint commented Dec 29, 2021

@gjferriercoats There are two things that in combination will help:

  1. Upgrade to Spring Cloud GCP 2.0.7. We've picked up a client library fix for restoring batching, which will relieve heap pressure.
  2. Switch from synchronous to asynchronous logging, which will reduce the number of threads spawned and prevent logging issues from affecting the rest of your application's thread pools (such as the timed scheduler).
	<appender name="ASYNC_STACKDRIVER" class="ch.qos.logback.classic.AsyncAppender">
		<appender-ref ref="STACKDRIVER" />
	</appender>

	<root level="INFO">
		<appender-ref ref="ASYNC_STACKDRIVER" />
	</root>

These won't fix logging (I suspect due to an issue similar to googleapis/java-logging#645), but they will keep the rest of the application from becoming unstable. UPDATE: logging does recover after ~40 minutes.
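
A possible refinement, assuming only the standard ch.qos.logback.classic.AsyncAppender properties: the async queue can be sized and set to drop events rather than block callers when the STACKDRIVER appender cannot keep up, for example:

	<appender name="ASYNC_STACKDRIVER" class="ch.qos.logback.classic.AsyncAppender">
		<!-- buffer up to 512 events in memory while the connection is down -->
		<queueSize>512</queueSize>
		<!-- drop events instead of blocking application threads when the queue is full -->
		<neverBlock>true</neverBlock>
		<appender-ref ref="STACKDRIVER" />
	</appender>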

@meltsufin
Member

Please re-open if it's still an issue.
