Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logging stops after a few hours with "retry total timeout exceeded" #596

Closed
gae123 opened this issue Oct 5, 2019 · 9 comments
Closed

logging stops after a few hours with "retry total timeout exceeded" #596

gae123 opened this issue Oct 5, 2019 · 9 comments
Assignees
Labels
api: logging Issues related to the googleapis/nodejs-logging API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@gae123
Copy link

gae123 commented Oct 5, 2019

(This is in continuation of #586)

Because of #586 we yanked out the google winston log library and used directly "@google-cloud/logging": "5.3.1",

24 hours went by with no memory leak BUT I just realized that logging did stop about 15 hours after it started. Every attempt to write an entry since then produced the following (thousands of such messages in the log, the one below is the first).

These messages start appearing about 7 hours after the deployment but between the 7th and the 15ht hour I do see some of our log messages as well as the "ERROR: " messages.

image

The message is written using a console.log() in the error path of the promise. Here is the approximate code that produces this message:

import * as googleLogging from '@google-cloud/logging';
...
const metadata =
        {
            resource: {
                type: "gke_container",
            }
        }
const logger = new googleLogging.Logging().log(
                "name",
            {
                removeCircular: true
            }
        );
const entry = this.logger.entry(metadata, data);
let promise = this.logger.debug(entry);  // this worked fine for several hours
promise.catch(err => {
          // This is what we see after
           console.log(`ERROR: ${util.inspect(err)}`) ;
});
@gae123 gae123 changed the title logging stops after about 15 hours with "retry total timeout exceeded" logging stops after about 7 hours with "retry total timeout exceeded" Oct 5, 2019
@gae123
Copy link
Author

gae123 commented Oct 5, 2019

Some more input. Here is the memory footprint of this GKE cluster:

image

The cluster starts at around 10PM, and the first ERROR: message in the logs appears at 5:05AM, exactly when you see that very fist increase in the memory usage!!

@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Oct 5, 2019
@gae123
Copy link
Author

gae123 commented Oct 5, 2019

We restarted the deployment and it happened again after a few hours.

The cluster has three pods, what I see happening is that the issue starts in one pod, that pod keeps sending "ERROR:" while the other pods continue working fine, eventually all three of them start sending only "ERROR:"

This is really serious, it prevents other issues from being diagnosed.

Wondering, why do the console.log() messages continue working?

@gae123 gae123 changed the title logging stops after about 7 hours with "retry total timeout exceeded" logging stops after a few hours with "retry total timeout exceeded" Oct 5, 2019
@gae123
Copy link
Author

gae123 commented Oct 7, 2019

As another data point, we removed completely google-logging from our code base and we now just use console.log() (for info, debug) console.warning for warning, and console.error() (for error)

The problem is not happening any more after 24 hours of observation.

Although this is not the desirable solution for a number of reasons:

  • no differentiation for debug messag es etc
  • warnings show as errors in the UI
  • the whole payload is printed as a string in message
  • etc

it is a workaround for now...

environment:

  • GKE: 1.14.3-gke.11
  • nodejs: FROM node:10.14-alpine

@bcoe bcoe added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed triage me I really want to be triaged. labels Oct 7, 2019
@bcoe
Copy link
Contributor

bcoe commented Oct 7, 2019

@alexander-fenster, in our meeting today you mentioned that this potentially relates to some ongoing issues with grpc-js.

@gae123
Copy link
Author

gae123 commented Oct 9, 2019

Any update on this one? Thanks!!

@bcoe
Copy link
Contributor

bcoe commented Oct 10, 2019

@gae123 could you please try:

https://www.npmjs.com/package/@google-cloud/logging/v/5.4.1

and let us know if this version is continuing to have problems?

@gae123
Copy link
Author

gae123 commented Oct 11, 2019

With 5.4.1, the system has now run for about 36 hours with no issue, we had never gone past 24 hours before so this is definitely helping!!

@bcoe
Copy link
Contributor

bcoe commented Oct 11, 2019

@gae123 that's great news, I'm going to keep this open for the time being, because I expect we'll have one or two more patches for grpc-js in the near future. But, it's good to know that some of the more critical issues are being addressed.

@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Oct 16, 2019
@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Oct 27, 2019
@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Nov 6, 2019
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Nov 16, 2019
@bcoe
Copy link
Contributor

bcoe commented Nov 18, 2019

@gae123 please feel free to reopen this if you bump into this issue again, I think the issues with @grpc/grpc-js we're currently working on are unrelated to this bug 👍

@bcoe bcoe closed this as completed Nov 18, 2019
@google-cloud-label-sync google-cloud-label-sync bot added the api: logging Issues related to the googleapis/nodejs-logging API. label Jan 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: logging Issues related to the googleapis/nodejs-logging API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

3 participants