logging stops after a few hours with "retry total timeout exceeded" #596

gae123 · 2019-10-05T05:57:45Z

(This is in continuation of #586)

Because of #586 we yanked out the google winston log library and used directly "@google-cloud/logging": "5.3.1",

24 hours went by with no memory leak BUT I just realized that logging did stop about 15 hours after it started. Every attempt to write an entry since then produced the following (thousands of such messages in the log, the one below is the first).

These messages start appearing about 7 hours after the deployment but between the 7th and the 15ht hour I do see some of our log messages as well as the "ERROR: " messages.

The message is written using a console.log() in the error path of the promise. Here is the approximate code that produces this message:

import * as googleLogging from '@google-cloud/logging';
...
const metadata =
        {
            resource: {
                type: "gke_container",
            }
        }
const logger = new googleLogging.Logging().log(
                "name",
            {
                removeCircular: true
            }
        );
const entry = this.logger.entry(metadata, data);
let promise = this.logger.debug(entry);  // this worked fine for several hours
promise.catch(err => {
          // This is what we see after
           console.log(`ERROR: ${util.inspect(err)}`) ;
});

The text was updated successfully, but these errors were encountered:

gae123 · 2019-10-05T06:53:15Z

Some more input. Here is the memory footprint of this GKE cluster:

The cluster starts at around 10PM, and the first ERROR: message in the logs appears at 5:05AM, exactly when you see that very fist increase in the memory usage!!

gae123 · 2019-10-05T23:22:27Z

We restarted the deployment and it happened again after a few hours.

The cluster has three pods, what I see happening is that the issue starts in one pod, that pod keeps sending "ERROR:" while the other pods continue working fine, eventually all three of them start sending only "ERROR:"

This is really serious, it prevents other issues from being diagnosed.

Wondering, why do the console.log() messages continue working?

gae123 · 2019-10-07T20:03:08Z

As another data point, we removed completely google-logging from our code base and we now just use console.log() (for info, debug) console.warning for warning, and console.error() (for error)

The problem is not happening any more after 24 hours of observation.

Although this is not the desirable solution for a number of reasons:

no differentiation for debug messag es etc
warnings show as errors in the UI
the whole payload is printed as a string in message
etc

it is a workaround for now...

environment:

GKE: 1.14.3-gke.11
nodejs: FROM node:10.14-alpine

bcoe · 2019-10-07T20:47:35Z

@alexander-fenster, in our meeting today you mentioned that this potentially relates to some ongoing issues with grpc-js.

gae123 · 2019-10-09T22:13:20Z

Any update on this one? Thanks!!

bcoe · 2019-10-10T00:27:24Z

@gae123 could you please try:

https://www.npmjs.com/package/@google-cloud/logging/v/5.4.1

and let us know if this version is continuing to have problems?

gae123 · 2019-10-11T16:44:29Z

With 5.4.1, the system has now run for about 36 hours with no issue, we had never gone past 24 hours before so this is definitely helping!!

bcoe · 2019-10-11T23:00:32Z

@gae123 that's great news, I'm going to keep this open for the time being, because I expect we'll have one or two more patches for grpc-js in the near future. But, it's good to know that some of the more critical issues are being addressed.

bcoe · 2019-11-18T21:59:49Z

@gae123 please feel free to reopen this if you bump into this issue again, I think the issues with @grpc/grpc-js we're currently working on are unrelated to this bug 👍

gae123 mentioned this issue Oct 5, 2019

Attempt to write logs with no transports #586

Closed

gae123 changed the title ~~logging stops after about 15 hours with "retry total timeout exceeded"~~ logging stops after about 7 hours with "retry total timeout exceeded" Oct 5, 2019

yoshi-automation added the triage me I really want to be triaged. label Oct 5, 2019

gae123 changed the title ~~logging stops after about 7 hours with "retry total timeout exceeded"~~ logging stops after a few hours with "retry total timeout exceeded" Oct 5, 2019

bcoe added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed triage me I really want to be triaged. labels Oct 7, 2019

gae123 mentioned this issue Oct 8, 2019

DEADLINE_EXCEEDED makes application not receiving messages at all googleapis/nodejs-pubsub#770

Closed

yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Oct 16, 2019

yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Oct 27, 2019

yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Nov 6, 2019

yoshi-automation added the 🚨 This issue needs some love. label Nov 16, 2019

bcoe closed this as completed Nov 18, 2019

google-cloud-label-sync bot added the api: logging Issues related to the googleapis/nodejs-logging API. label Jan 31, 2020

JustinBeckwith assigned bcoe Feb 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

logging stops after a few hours with "retry total timeout exceeded" #596

logging stops after a few hours with "retry total timeout exceeded" #596

gae123 commented Oct 5, 2019 •

edited

Loading

gae123 commented Oct 5, 2019

gae123 commented Oct 5, 2019 •

edited

Loading

gae123 commented Oct 7, 2019 •

edited

Loading

bcoe commented Oct 7, 2019

gae123 commented Oct 9, 2019

bcoe commented Oct 10, 2019

gae123 commented Oct 11, 2019

bcoe commented Oct 11, 2019

bcoe commented Nov 18, 2019

logging stops after a few hours with "retry total timeout exceeded" #596

logging stops after a few hours with "retry total timeout exceeded" #596

Comments

gae123 commented Oct 5, 2019 • edited Loading

gae123 commented Oct 5, 2019

gae123 commented Oct 5, 2019 • edited Loading

gae123 commented Oct 7, 2019 • edited Loading

bcoe commented Oct 7, 2019

gae123 commented Oct 9, 2019

bcoe commented Oct 10, 2019

gae123 commented Oct 11, 2019

bcoe commented Oct 11, 2019

bcoe commented Nov 18, 2019

gae123 commented Oct 5, 2019 •

edited

Loading

gae123 commented Oct 5, 2019 •

edited

Loading

gae123 commented Oct 7, 2019 •

edited

Loading