[Bug]: Worker stopped processing jobs, and mostly delayed Jobs #2466
Comments
Do you know if the worker is still connected to Redis when this happens? For instance, does the worker appear in the list of workers that are online for the queue? |
That I need to check; that is a good idea. Is there a method which shows that, or something I can look at directly in Redis? |
You can use https://api.docs.bullmq.io/classes/v5.Queue.html#getWorkers or Taskforce.sh, I don't know if BullBoard can also show this information. |
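For illustration, a minimal sketch of checking the online workers with the getWorkers API mentioned above; the queue name and local connection are assumptions, not details from this thread:

```js
const { Queue } = require("bullmq");

async function listOnlineWorkers() {
  // Assumed queue name and a local Redis instance; adjust for Redis Cloud.
  const queue = new Queue("my-queue", {
    connection: { host: "localhost", port: 6379 },
  });

  // getWorkers() returns the worker clients Redis currently reports for this queue.
  const workers = await queue.getWorkers();
  console.log(`Workers online for "my-queue": ${workers.length}`);

  await queue.close();
}

listOnlineWorkers().catch(console.error);
```

If the count is zero while a worker process is still running, the worker has most likely lost its Redis connection.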
Also important to read this if you haven't already: https://docs.bullmq.io/guide/going-to-production#maxretriesperrequest |
I have set that one to null. |
I don't think Bullboard shows it; I haven't seen it. For Taskforce.sh, is there a trial version? For how long? We use Redis Cloud (we don't self-host our Redis instances). |
So, regarding checking whether the worker is connected. |
I was also reading the link you gave. I need to change this |
Yes, there is a trial. You can find the pricing on the webpage: https://taskforce.sh/ |
If you use the suggested settings, the workers should automatically reconnect as soon as they can, so you should not get this issue anymore. |
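As a rough sketch of those suggested settings (the URL and queue name below are assumptions), the key option is maxRetriesPerRequest: null, which keeps commands pending during a disconnection instead of failing them, so the worker can pick up where it left off once Redis is reachable again:

```js
const { Worker } = require("bullmq");
const IORedis = require("ioredis");

// Assumed URL; a Redis Cloud instance with TLS would use a rediss:// endpoint.
const connection = new IORedis("redis://localhost:6379", {
  // Never give up on pending commands; BullMQ handles reconnection itself.
  maxRetriesPerRequest: null,
});

const worker = new Worker(
  "my-queue", // assumed queue name
  async (job) => {
    // ... process the job ...
  },
  { connection },
);

worker.on("error", (err) => console.error(err));
```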
With taskforce.sh, I am a bit confused about the difference between the plans, especially between 1, 5, and 15 connections. |
Also, we use Redis Cloud: three instances, one for development, one for QA testing, and currently one for production. So does that mean we need 3 connections? Also, can we use a direct connection or the taskforce.sh connector? I will also need to make sure we enable TLS on the Redis Cloud instances. |
You need a connection for every host that you want to debug/monitor, so in your example it would be 3, as you suspected. A direct connection can be used if your Redis host is accessible from the internet. If not, you can use the connector; for example, the connector is always used for local connections. |
@manast I am noticing the same issue with a bunch of delayed jobs not getting picked up by the worker. Here you can see that a job was created 5 minutes ago with a delay of 1-2 seconds, and the job is just sitting there. Here you can see that the worker is connected. |
What are your worker settings? |
Another thing, do you know why these jobs have such high "started" numbers? |
Also, how are those delayed jobs produced? |
We just had a similar issue this morning. Here is how we create the delayed jobs.
|
my worker options are:
with redisClient being
|
When I add a job, in its processor function I try to get a Redis lock on a custom key; if it fails to get the lock, I move the job back to delayed (a sketch of this pattern follows below).
|
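A rough sketch of the pattern being described, not the commenter's actual code: the processor tries a custom lock and, if it cannot get it, moves the job back to the delayed set. The tryAcquireLock helper and the queue name are hypothetical:

```js
const { Worker, DelayedError } = require("bullmq");

const worker = new Worker(
  "my-queue", // assumed queue name
  async (job, token) => {
    // Hypothetical helper that returns true if the custom Redis lock was acquired.
    const locked = await tryAcquireLock(job.data.lockKey);
    if (!locked) {
      // Put the job back into the delayed set and tell BullMQ that it was
      // neither completed nor failed.
      await job.moveToDelayed(Date.now() + 5000, token);
      throw new DelayedError();
    }
    // ... normal processing while holding the lock ...
  },
  { connection: { host: "localhost", port: 6379, maxRetriesPerRequest: null } },
);
```

Throwing DelayedError is what prevents the worker from marking the job as completed after it has been moved.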
they have a high |
Hi @wernermorgenstern, I can see that you are using groups, but you are pointing to bullmq version 5.4.2, not the actual bullmq-pro version. v7.3.0, which was released today, uses bullmq 5.4.2. Just out of curiosity, are you using bullmq alongside bullmq-pro? |
I am actually using pro. So do I need to remove bullmq from the package.json? I use the pro versions of the functions and constructors. |
Hey, if you mean that you have both bullmq-pro and bullmq in your package.json, then yes, you should only have the pro version, as we use fixed versions of bullmq in the pro version. |
Another question: which version of Redis are you using? |
I will remove it and try that. |
Hi, we are using Redis Cloud Enterprise, which is on 7.2 |
Hi @wernermorgenstern, could you please try upgrading to pro version 7.3.0? That is our latest release (https://github.com/taskforcesh/bullmq/blob/master/docs/gitbook/bullmq-pro/changelog.md#730-2024-03-16) and it contains a fix that affects delayed jobs. |
I will do that and deploy it to our production environment next week. What we saw yesterday is that we had one job stuck in the active state. In Taskforce.sh (trial version - I love it so far) I saw it had workers, and the idle time was between 1s and 10s. So the worker was still processing other jobs. |
Hi @matthewgonzalez, are you passing an ioredis instance? If yes, are you also setting … |
Yes, we are passing in an ioredis instance. Here is how that instance is configured:
export const connection = new IORedis(REDIS_URL, {
// a warning is thrown on redis startup if these aren't added
enableReadyCheck: false,
maxRetriesPerRequest: null,
}).setMaxListeners(4 * queuesMQ.length + 10) |
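For context, a hedged sketch of why a shared connection like this usually needs a raised listener limit: each Queue and Worker that reuses the instance attaches its own event listeners to it, so with many queues the default EventEmitter limit of 10 triggers a warning. The queue names below are assumptions; the scaling factor mirrors the snippet above:

```js
const { Queue, Worker } = require("bullmq");
const IORedis = require("ioredis");

// Shared connection mirroring the configuration above (URL is assumed).
const connection = new IORedis("redis://localhost:6379", {
  maxRetriesPerRequest: null,
});

const queueNames = ["emails", "reports", "cleanup"]; // assumed queue names

// Each Queue/Worker pair registers listeners on the shared connection, so the
// limit is raised in proportion to the number of queues, as in the snippet above.
connection.setMaxListeners(4 * queueNames.length + 10);

const queues = queueNames.map((name) => new Queue(name, { connection }));

const workers = queueNames.map(
  (name) =>
    new Worker(
      name,
      async (job) => {
        // ... process the job ...
      },
      { connection },
    ),
);
```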
We rolled back to |
I am reopening this, as it seems it is still not working; it will stay open until we can reproduce it. |
For the record, this is the test code I am using, with which I am not able to reproduce the issue. I wonder if there is not something more to it, like disconnections or something like that.

const { Queue, Worker } = require("bullmq");

const queueName = "test";

async function start() {
  const queue = new Queue(queueName, {
    connection: {
      host: "localhost",
      port: 6379,
      // a warning is thrown on redis startup if these aren't added
      enableReadyCheck: false,
      maxRetriesPerRequest: null,
    },
  });

  const job = await queue.add(
    "__default__",
    { foo: "bar" },
    {
      jobId: queueName + "-cron-worker-job",
      repeat: {
        every: 15000, // every 15 seconds
      },
    },
  );

  const processFn = async (job) => {
    console.log(`Processing job ${job.id} with data ${JSON.stringify(job.data)}`);
    console.log(`-> ${job.id}`);
    await new Promise((res) => setTimeout(res, 1000));
    console.log(`\t<- ${job.id}`);
  };

  const worker = new Worker(queueName, processFn, {
    connection: {
      host: "localhost",
      port: 6379,
      // a warning is thrown on redis startup if these aren't added
      enableReadyCheck: false,
      maxRetriesPerRequest: null,
    },
  });

  worker.on("error", (err) => {
    console.error(err);
  });

  worker.on("completed", (job) => {
    console.log(`Job ${job.id} completed`);
  });

  worker.on("failed", (job, err) => {
    console.error(`Job ${job.id} failed with ${err.message}`);
  });
}

start(); |
I think we encountered this issue today; we are using bullmq … . We first thought that it might be a Redis issue, … . This is the first time we encountered this issue. Aside from … |
Is it really necessary to explicitly set … ?
I connected to our Q Redis instance and checked the workers & worker count of the deployed queue locally |
Apparently the managed redis instances were restarted prior to the issue. |
Hi @manast, if I force close the Redis connections with e.g. …, … . If however I kill/shut down the entire Redis server (simulating a server restart and/or crash), … . Is that the expected behaviour? The docker-compose file I used for the Redis test server:
version: '3.8'
services:
  redis:
    container_name: 'redis_test_server'
    image: redis:6.2.14-alpine
    restart: always
    ports:
      - '6379:6379'
    volumes:
      - /tmp/redis-test-server:/data
(Setting …)

I modified your code slightly:

const { Queue, Worker } = require('bullmq');

const queueName = 'test';

async function start() {
  const queue = new Queue(queueName, {
    connection: {
      host: 'localhost',
      port: 6379,
      // a warning is thrown on redis startup if these aren't added
      enableReadyCheck: false,
      maxRetriesPerRequest: null,
      enableOfflineQueue: false,
    },
  });

  setInterval(() => {
    queue.getWorkersCount().then((numberOfWorkers) => {
      console.warn(`Number of workers: ${numberOfWorkers}`);
    });
    queue.getJobCounts().then((numberOfJobs) => {
      console.warn(`Number of jobs: ${JSON.stringify(numberOfJobs)}`);
    });
  }, 10_000);

  const job = await queue.add(
    '__default__',
    { foo: 'bar' },
    {
      jobId: queueName + '-cron-worker-job',
      repeat: {
        every: 3000, // every 3 seconds
      },
    },
  );

  const processFn = async (job) => {
    console.log(`Processing job ${job.id} with data ${JSON.stringify(job.data)}`);
    console.log(`-> ${job.id}`);
    await new Promise((res) => setTimeout(res, 1000));
    console.log(`\t<- ${job.id}`);
  };

  const worker = new Worker(queueName, processFn, {
    connection: {
      host: 'localhost',
      port: 6379,
      // a warning is thrown on redis startup if these aren't added
      enableReadyCheck: false,
      maxRetriesPerRequest: null,
      enableOfflineQueue: true,
    },
  });

  worker.on('error', (err) => {
    console.error(err);
  });

  worker.on('closed', () => {
    console.warn('Worker closed');
  });

  worker.on('ready', () => {
    console.warn('Worker is ready!');
  });

  worker.on('completed', (job) => {
    console.log(`Job ${job.id} completed`);
  });

  worker.on('failed', (job, err) => {
    console.error(`Job ${job.id} failed with ${err.message}`);
  });
}

start(); |
I can reproduce the above with BullMQ starting with version … |
@lukas-becker0 seems like I am able to reproduce it following your instructions. I will keep you updated... |
I hope this small fix finally resolves this issue for everybody. |
Hi @manast, I'm sorry, but I'm still able to reproduce it with the fix and bullmq … . Assuming it is enough to add the line from the fix PR to …: sometimes the worker can connect again, but when I then restart Redis for a second or third time, it eventually results in the same issue as before. I also tried with a custom … |
It is not enough to add the line with the disconnect; you must remove the other two lines as well. |
Sorry, you are right, I somehow missed that 🙈 (due to the Dark Reader Firefox extension). I just ran the tests again, and it does indeed work now as expected. |
Really glad this got fixed, literally just ran into this issue today during POC testing and upgrading to 5.7.14 did the trick. Thanks @manast |
Is this already part of the latest pro version too or is that still coming?
|
@wernermorgenstern it's coming very soon. |
Hi @wernermorgenstern, it's available in the pro version since v7.8.2. |
We will attempt the update next week and report back. |
Experiencing this in v5.8.1 |
Since it seems that the original authors cannot reproduce the issue anymore, I will close it now, so that other users do not get lured into this one if they are experiencing a similar issue, but not this exact one, as that will just confuse everybody. @tavindev you are welcome to open a new issue with the particular details for your use case. |
@tavindev |
Version
v5.4.2
Platform
NodeJS
What happened?
We have a service where a worker runs and processes jobs. After the processing is done, it creates another job, which is delayed (around 64 minutes).
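As a rough sketch of that flow (plain BullMQ with assumed names, whereas the actual service uses BullMQ Pro with groups), the processor does its work and then enqueues the follow-up as a delayed job:

```js
const { Queue, Worker } = require("bullmq");

const connection = { host: "localhost", port: 6379, maxRetriesPerRequest: null };
const queue = new Queue("recurring-task", { connection }); // assumed queue name

const worker = new Worker(
  "recurring-task",
  async (job) => {
    // ... do the actual processing ...

    // Schedule the follow-up roughly 64 minutes later; it waits in the delayed
    // set until the delay elapses and a worker promotes and processes it.
    await queue.add("follow-up", { previousJobId: job.id }, { delay: 64 * 60 * 1000 });
  },
  { connection },
);
```

The symptom reported below is that jobs like this stay in the delayed set long after the delay has elapsed, until the service is restarted.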
Today, I noticed that the service and worker stopped processing jobs. There were no error messages in the logs. When I used BullBoard (I use it as a UI to see jobs), I saw the jobs were still in the delayed state, and like 24 hours overdue.
When I restarted the service, and the worker started, it immediately started processing those delayed jobs.
This is not the first time it has happened. Today, though, I first checked the delayed jobs.
In today's incident, the service has been running for 4 days.
We run in EKS on AWS (a NodeJS service, using TypeScript). I use BullMQ Pro, and we are using groups, with each group's concurrency set to 1.
How to reproduce.
I don't have any test code for this
Relevant log output