Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cron job stops after random hours/days #805

Open
github-anis-snoussi opened this issue Nov 28, 2023 · 23 comments
Open

Cron job stops after random hours/days #805

github-anis-snoussi opened this issue Nov 28, 2023 · 23 comments
Labels
released on @beta type:bug Bug reports and bug fixes

Comments

@github-anis-snoussi
Copy link

Description

I have a node app running on a machine with 512mb of memory and 256vCPU on AWS, and I have a cron job (actually using Nestjs which in turn uses this package) that runs every second using node-cron.
After few days (doesn't always happen), the cron just stops working with no error or log.
In fact there was an identical issue reported before (#232) and was closed despite people still reporting the same behaviour.

Expected Behavior

cron job should not stop working, or at the very least throw an error ?

Actual Behavior

  • job stops randomly after few days, without any error / log
  • restarting the container fixes the issue

Possible Fix

No response

Steps to Reproduce

  • hard to reproduce, create a cron job that runs every second and leave it running for couple of days ?

Context

Your Environment

  • cron version: 2.4.1
  • NodeJS version: 18.18.2
  • Operating System and version: node:18.18.2-alpine3.17 docker image
  • TypeScript version (if applicable): 5.1.6
  • Link to your project (if applicable): -
@intcreator
Copy link
Collaborator

is the Node process still running when the cron job stops? is it only the cron part that fails?

@github-anis-snoussi
Copy link
Author

yes the Node process keeps running.
In fact I have other crons defined in the same process, and they keep on working.

@sheerlox
Copy link
Collaborator

sheerlox commented Nov 28, 2023

I have a NestJS codebase running in production for more than 8 months (restarted once in a while for deployments) that runs a lot of cronjobs for handling financial transactions, one of them running every second, and I've never encountered this issue... However, I do have more memory on my server, 2GB I believe, so maybe that's the issue. Have you ever encountered a situation when your container exceeds its maximum memory usage?

Could you try updating cron to 2.4.4 to see if the issue still happens? You can force NestJS to use it in this way:

"overrides": {
  "@nestjs/schedule": {
    "cron": "2.4.4"
  }
},

This kind of issue is typically difficult to pinpoint without detailed information from those who have experienced it. Unfortunately, reports of this nature are quite rare and often lack critical details. If you can provide any detail or hint that would help us reproduce the issue, we would appreciate it and do our best to find a solution.

@github-anis-snoussi
Copy link
Author

github-anis-snoussi commented Nov 29, 2023

I understand how difficult it is to reproduce these kind of bugs, I can't even reproduce it myself and it happened twice so far.
I just noticed that the previous issue (#232) was closed while some people are still reporting it, so I opened this issue just to keep track.

For details, the container with the cron jobs is running usually at 70% memory and <5% cpu, but there was no peak in usage when the bug occurred.

Thank you @sheerlox for the suggestion, but I think that blindly upgrading cron without identifying the root cause will give a false feeling that bug was fixed (exactly what happened in the previously closed Github issue).

Again, I'm not even sure how to debug this bug, so I'm available to give more details if possible.

@sheerlox
Copy link
Collaborator

I gave a good look to #232, and it actually contains a lot of information! From what I understand it looks like the issue is with line 191 from job.js. I've created a patch you can deploy easily to confirm that's indeed the issue and provide more information.

Please add this file to your project, then follow the set-up instructions from patch-package:

./patches/cron+2.4.1.patch

diff --git a/node_modules/cron/lib/job.js b/node_modules/cron/lib/job.js
index a3b12b5..9d89479 100644
--- a/node_modules/cron/lib/job.js
+++ b/node_modules/cron/lib/job.js
@@ -135,6 +135,8 @@ function CronJob(CronTime, spawn) {
 		// The callback wrapper checks if it needs to sleep another period or not
 		// and does the real callback logic when it's time.
 		function callbackWrapper() {
+			console.debug(`[CRON DEBUG] BEG CALLBACK: ${performance.now()}`);
+
 			const diff = startTime + timeout - Date.now();
 
 			if (diff > 0) {
@@ -188,7 +190,13 @@ function CronJob(CronTime, spawn) {
 
 			_setTimeout(timeout);
 		} else {
+			console.debug(`[CRON DEBUG] END CALLBACK: ${performance.now()}`);
 			this.stop();
+			console.debug(
+				`[CRON DEBUG] JOB ABORTED: ${
+					self.cronTime.source
+				} - ${luxon.DateTime.local().toISO()}`
+			);
 		}
 	};

Once the issue manifests itself again, you should be able to extract the interesting output with the following command:
cat {YOUR_LOGFILE} | grep "[CRON DEBUG]" | grep -B 15 "JOB ABORTED"

15 lines above should give us a pretty good picture of the previous runs of the callbackWrapper function.

If you can get us this debug information I'll make sure to free up some time to dig deeper into the issue and find a fix.

Thank you for re-opening an issue about this to let us know that still happens.

@github-anis-snoussi
Copy link
Author

Thanks for the update, will make sure to add this and get back to you if the issue manifests itself again 🙏.

@github-anis-snoussi
Copy link
Author

A little update :

  • The cron job is running on an ephemeral docker container (logs are not saved on disk, instead they are sent to a log aggregator) => adding the console.logs you mentioned will pollute extensively the current logs (+drive up the cost of the log aggregator 😅).
  • So instead of the patch you proposed, here's what I ended up changing :
diff --git a/node_modules/cron/lib/job.js b/node_modules/cron/lib/job.js
index a3b12b5..b34927c 100644
--- a/node_modules/cron/lib/job.js
+++ b/node_modules/cron/lib/job.js
@@ -188,7 +188,13 @@ function CronJob(CronTime, spawn) {
 
 			_setTimeout(timeout);
 		} else {
+			console.debug(`[CRON DEBUG] END CALLBACK: ${performance.now()}`);
 			this.stop();
+			console.debug(
+				`[CRON DEBUG] JOB ABORTED: ${
+				self.cronTime.source
+				} - ${luxon.DateTime.local().toISO()}`
+				);
 		}
 	};

will that be enough to debug the issue in case it happens ?

@sheerlox
Copy link
Collaborator

That makes sense, I didn't consider that when writing the patch!

Knowing the exact time the callback was started would be really useful, so instead of my first version can I suggest saving performance.now()'s result in a variable and logging it when the job gets killed in the else block?

@github-anis-snoussi
Copy link
Author

github-anis-snoussi commented Nov 30, 2023

Sounds good to me, will add it 👌.

Update patch :

diff --git a/node_modules/cron/lib/job.js b/node_modules/cron/lib/job.js
index a3b12b5..833640e 100644
--- a/node_modules/cron/lib/job.js
+++ b/node_modules/cron/lib/job.js
@@ -135,6 +135,8 @@ function CronJob(CronTime, spawn) {
 		// The callback wrapper checks if it needs to sleep another period or not
 		// and does the real callback logic when it's time.
 		function callbackWrapper() {
+			let callbackStartedAt = performance.now()
+
 			const diff = startTime + timeout - Date.now();
 
 			if (diff > 0) {
@@ -188,8 +190,14 @@ function CronJob(CronTime, spawn) {
 
 			_setTimeout(timeout);
 		} else {
+			console.debug(`[CRON DEBUG] END CALLBACK: ${performance.now()} WHICH STARTED: ${callbackStartedAt}`);
 			this.stop();
-		}
+			console.debug(
+				`[CRON DEBUG] JOB ABORTED: ${
+				self.cronTime.source
+				} - ${luxon.DateTime.local().toISO()}`
+				);
+						 }
 	};
 
 	CJ.prototype.start = start;

@jrichardsz

This comment was marked as off-topic.

@sheerlox

This comment was marked as off-topic.

@MiikeVi
Copy link

MiikeVi commented May 17, 2024

Just stopping by to say that it’s still happening in version 3.1.7, using node 14.19.3 (node:14-alpine). I had to switch to node-cron

@sheerlox
Copy link
Collaborator

sheerlox commented May 17, 2024

Just stopping by to say that it’s still happening in version 3.1.7, using node 14.19.3 (node:14-alpine). I had to switch to node-cron

As I previously said, that issue is hard to reproduce and we're still looking for help from someone encountering it.

I have critical production code running this library for more than a year and I've never been able to observe that issue, so we're welcoming anyone who would like to help us debug the issue with more information. Please find the patch to apply for the necessary debug information here: #805 (comment).

Regarding the transition to node-cron, it's known for having lots of bugs and being unmaintained (look at the issues list on their GitHub). If you really have to switch libraries, I'd recommend using croner.

P.S. moreover that issue also seems to exist in node-cron.

@ivan-kleshnin
Copy link

ivan-kleshnin commented Jul 9, 2024

There's an interesting blog post: https://lucumr.pocoo.org/2024/6/5/node-timeout/ that might be relevant to the issue.

There Armin describes how clearTimeout does not actually free memory because of timer reference enclosing over the scope and preserving cyclic links. He recommends to cast setTimeout return value to integer (which is an accepted type) to prevent the issue. At the time he wrote this, he was also describing another bug, with the workaround, but it is already fixed in Node v22.

I see that here:
https://github.com/kelektiv/node-cron/blob/main/src/job.ts#L223
you rely on non-privimitive setTimeout return

And here:
https://github.com/kelektiv/node-cron/blob/main/src/job.ts#L297
you just clearTimeout without pruning the value (an alternative to using a primitive type).

So the described memory leek may very well apply to the library.

@droderuan
Copy link

droderuan commented Aug 27, 2024

Hello guys. I would like to add my case that happened to me a few days ago. I have a payment gateway application implemented with NestJs, in production, that has a cron job (from nestjs) running to trigger webhooks. The cronjob runs every 10s, using the CronExpression from the library which solves in */10 * * * * *.

The application runs on GCP using Cloud Run. In my implementation, for every execution, the tries increase by one until it is processed. The API was running for at least 20 days without an interruption or reset.
image
The problem started between 2024-08-24 00:28:16 and 2024-08-24 00:31:28.
Look at the orange line, it was the last webhook processed by the cronjob, and the next one at the red line was executed the next day (when I saw the problem) on my computer. The cronjob was working precisely, processing every 10 seconds. When I started on my computer, the cronjob worked pretty well.

So move on to what I have in GCP...

I tried to see if the application was reaching the memory limit, but no, and the cloud run has a configuration to keep at least one instance. In this documentation about instances in cloud run, it has an idle status but is not deactivated.
image

Cloud run configuration has 512mb and 1cpu allocated. The CPU is not set to be always allocated, but this can not be the problem as the other webhooks were processed in time.

Your Environment
cron version: 3.1.7
NodeJS version: 20.12
Operating System and version: node:20.12 docker image
TypeScript version (if applicable): 5.1.3
Link to your project (if applicable): -

@lmcuong25
Copy link

lmcuong25 commented Oct 23, 2024

i have the same issue: Cron job stops after random time

Description:

I encountered an issue where the cron job stops running after a random amount of time. After monitoring the behavior for a while, I noticed that the job unexpectedly stops without any specific error message or pattern.

Steps to Reproduce:

Set up a cron job using "cron": "^3.1.7" .
Schedule the job to run at regular intervals.
Observe the cron job over time.

Expected Behavior:

The cron job should continue running consistently according to the schedule.

Actual Behavior:

The cron job stops after a random period of time. There is no specific pattern or error log that explains the stoppage.

What I Did to Resolve the Issue:

I forked the project and made modifications to the code. After testing the changes over the course of a few weeks, I no longer experienced the issue.

Changes Made:

        getTimeout2() {
		let sendAtToMillis = this.sendAt().toMillis();
		let localTimeToMillis = DateTime.local().toMillis();
		return {
			sendAtToMillis,
			localTimeToMillis,
			minus: sendAtToMillis - localTimeToMillis,
			timeout: Math.max(-1, sendAtToMillis - localTimeToMillis)
		};
	}
  • then use that function in file src/job.ts to get timeout
let resultTimeout:any = this.cronTime.getTimeout2();
let timeout = resultTimeout?.timeout;

Conclusion:

Since the changes I made have resolved the issue for me, I wanted to report this in case others are facing the same problem. I'm happy to provide further details or submit a pull request if needed.

@sheerlox
Copy link
Collaborator

Hi @lmcuong25, thanks a lot for the effort. However, I'm afraid I might be missing something because except for returning more data for logging purposes, I don't see any changes to the getTimeout function. Could you point out to what you think solves the issue in your patch?

@lmcuong25
Copy link

Hi @sheerlox ,
I'm not really sure why. The getTimeout2 function just returns additional data for debugging purposes (it logs more information to the console to monitor the situation when the cron job stops). I'm still monitoring it.

@droderuan
Copy link

droderuan commented Oct 26, 2024

Hi guys, I spent some time trying to understand what could cause the cron job to stop running. I used this tool https://github.com/wolfcw/libfaketime to try to replicate any situation where the cronjob could stop. I don't know if this tool is the best way to replicate any scenario, but I could stop the cron-job.

If the multiplier to speed up the process is too big, the timeout calculated by the this.cronTime.getTimeout() at job.ts returns -1, as provided by the code itself from time.ts.

	/**
	 * Get the number of milliseconds in the future at which to fire our callbacks.
	 */
	getTimeout() {
		return Math.max(-1, this.sendAt().toMillis() - DateTime.local().toMillis());
	}

Another curiosity is that, for example, I set the ratio in libfaketime to speed up 1000x (in seconds), sometimes could not start (timeout get -1 in the first loop), other times the code started and kept running, and in other cases, it ran by N loops, then got -1 as timeout and started the stop flow.

I know it is inconclusive, but I think the problem could be with the way the getTimeout() is working or with the environment, which is not related to the library and could explain why some people do not experience this bug.

In my case, maybe GCP misconfigured the time/timezone while orchestrating the containers(?)

I did not comprehend why a timeout of -1 should stop the cronjob from running. Maybe to not compromise the cronjob execution interval and lead to unexpected behavior to the user? But in my opinion, stop cronjob because this already leads to unexpected behavior. My suggestion would be to throw a specialized error instead of calling this.stop() inside job.ts, and leave the user to decide what to do if an error is thrown. Of course, I`m assuming that the problem is related to a timeout of -1.

As mentioned in other issues, like #232, the -1 in timeout could be the problem. In this comment #232 (comment), the person points out that the node is running very slowly, which could break the logic inside the library. If this is correct, the library should have minimal specs to run perfectly, and throwing an error would be a better choice in these situations.

This is the repository to the fork and the changes I made: https://github.com/droderuan/node-cron

@sheerlox
Copy link
Collaborator

Thank for so much to all of you for the work you've put into this. Thanks to your input, I think I might have a solution to the issue. I'll be working on it ASAP and keep you updated!

@lmcuong25
Copy link

I'm using Docker with the image: node:20.16.0-alpine

Here’s my log output:

CRON STOP DEBUG:  {"sendAtToMillis":1729964793000,"localTimeToMillis":1729964793084,"minus":-84,"timeout":-1,"cronTimeSource":"*/3 * * * * *"}
  • sendAtToMillis: 2024-10-26T17:46:33.000Z
  • localTimeToMillis: 2024-10-26T17:46:33.084Z

At the time of logging, CPU and RAM usage weren’t particularly high. However, it’s possible that a process was blocking the thread at that moment (since Node.js is single-threaded).
Screenshot 2024-10-29 at 09 59 13
Screenshot 2024-10-29 at 09 59 29

To address the issue, I modified the code to call localTimeToMillis first, as shown in this comparison: main...lublue:node-cron-track:main

Updated code:

getTimeout2() {
    let localTimeToMillis = DateTime.local().toMillis();
    let sendAtToMillis = this.sendAt().toMillis();
    return {
        sendAtToMillis,
        localTimeToMillis,
        minus: sendAtToMillis - localTimeToMillis,
        timeout: Math.max(-1, sendAtToMillis - localTimeToMillis)
    };
}

droderuan added a commit to droderuan/node-cron that referenced this issue Nov 2, 2024
@intcreator intcreator added the type:bug Bug reports and bug fixes label Dec 11, 2024
sheerlox added a commit to sheerlox/node-cron that referenced this issue Dec 27, 2024
@ncb000gt
Copy link
Member

🎉 This issue has been resolved in version 3.3.2-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@sheerlox
Copy link
Collaborator

Hey everyone, we just released a beta version with an idea of a fix so we don't let that issue stale.

Basically, it will execute the job even if with some delay. There's still a maximum delay of 1000ms (1 sec) to avoid potentially disruptive behaviors, in which case it will behave as before (stop the job).

In any case, a few logs have been added that should give us more info on the best way to go about this issue.

We would highly appreciate if some of you could use 3.3.2-beta.1 on your servers and report back with the logs!

All relevant logs start with [CRON DEBUG], so a simple grep on your log files should do the job.

This version should be production-safe, all tests are still passing, but feel free to take a look at the code before you commit.

Happy Holiday Season! 🎄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
released on @beta type:bug Bug reports and bug fixes
Projects
None yet
Development

No branches or pull requests

9 participants