[Bug]: Worker stopped processing jobs, and mostly delayed Jobs #2466

wernermorgenstern · 2024-03-08T15:28:33Z

Version

v5.4.2

Platform

NodeJS

What happened?

We have a service, where a worker runs, and processes jobs. After the processing is done, it will create another job, which is delayed (around 64 minutes).
Today, I noticed that the service and worker stopped processing jobs. There were no error messages in the logs. When I used BullBoard (I use it as a UI to see jobs), I saw the jobs were still in the delayed state, and like 24 hours overdue.

When I restarted the service, and the worker started, it immediately started processing those delayed jobs.
This is not the first it happened. Today I though first checked the delayed jobs.

In today's incident, the service has been running for 4 days.

We run in EKS on AWS (NodeJS service, using Typescript). I use BullMQ Pro. And we are using Groups and each Group has a concurrency set to 1.

How to reproduce.

I don't have any test code for this

Relevant log output

No Logs or error logs were produced

Code of Conduct

I agree to follow this project's Code of Conduct

manast · 2024-03-08T23:32:25Z

Do you know if the worker is still connected to Redis when this happens? for instance, does the workers appear list of workers that are online for the queue?

wernermorgenstern · 2024-03-08T23:35:56Z

That I need to check. That is a good idea. Is there a method which shows that? Or something I can look directly in Redis?

manast · 2024-03-08T23:37:44Z

You can use https://api.docs.bullmq.io/classes/v5.Queue.html#getWorkers or Taskforce.sh, I don't know if BullBoard can also show this information.

manast · 2024-03-08T23:38:55Z

Also important to read this if you haven't already: https://docs.bullmq.io/guide/going-to-production#maxretriesperrequest

wernermorgenstern · 2024-03-08T23:47:44Z

Also important to read this if you haven't already: https://docs.bullmq.io/guide/going-to-production#maxretriesperrequest

I have set that one as null

    const redis = new Redis(config.redisBullUrl as string, {
      keepAlive: 1,
      autoResubscribe: true,
      autoResendUnfulfilledCommands: true,
      connectionName: `${config.serviceName}-${config.agentId}-${redisConnection}-bullmq`,
      connectTimeout: helpers.period('10s'),
      enableOfflineQueue: true,
      enableReadyCheck: true,
      enableAutoPipelining: true,
      maxRetriesPerRequest: null,
      retryStrategy: function (times) {
        return Math.max(Math.min(Math.exp(times), 20000), 1000)
      },
      offlineQueue: true
    })

wernermorgenstern · 2024-03-08T23:51:21Z

getWorkers

I don't think Bullboard shows it. I haven't see it.

For Taskforce.sh, is there a trial version? For how long?
And what is the cost of it? I want to see if it might be better than Bullboard.

We use Redis Cloud (we don't self-host our Redis Instances)

wernermorgenstern · 2024-03-08T23:58:28Z

So regarding checking if worker is connected.
I actually have that already in the Code, for me to make API Calls to check that. So that is good.
So next time we have a similar issue, and I notice the worker is not connected, what would the next steps be in troubleshooting, and resolving this issue?

wernermorgenstern · 2024-03-08T23:59:39Z

I also was reading the one link you gave. I need to change this enableOfflineQueue to false, for the Redis Instance for Queue, and add a job. But leave it for true for Workers. So that is one good suggestion

manast · 2024-03-10T23:00:53Z

For Taskforce.sh, is there a trial version? For how long?
And what is the cost of it? I want to see if it might be better than Bullboard.

Yes there is trial. You can find the pricing on the webpage: https://taskforce.sh/

manast · 2024-03-10T23:02:18Z

So next time we have a similar issue, and I notice the worker is not connected, what would the next steps be in troubleshooting, and resolving this issue?

If you use the suggested settings, the workers should automatically reconnect as soon they can, so you should not get this issue anymore.

wernermorgenstern · 2024-03-11T02:09:47Z

With taskforce.sh, I am a bit confused what the difference is between the plans, especially the connections between 1, 5 and 15

wernermorgenstern · 2024-03-11T02:19:27Z

Also, we use redis cloud. Three instances, one is development, then qa test and then currently one production. So does that mean we need 3 connections?

Also, can we used direct connection or taskforce.sh connector?
what about the pro connector? Do we need that one?

I will need to make also sure we enable tls on the redis cloud instances

manast · 2024-03-11T10:44:35Z

You need a connection for every host that you want to debug/monitor, so in your example it would be 3 as you suspected. The direct connector can be used if you have access to your Redis host from the internet. If not, you can use the connector. For example the connector is always used for local connections.

mistersingh179 · 2024-03-14T23:01:29Z

@manast I am noticing the same issue with a bunch of delayed jobs not getting picked up by the worker.
I do notice that they stay in the delayed job forever till a new job comes in to active state and then the delayed jobs all get get picked up.

Here you can see that job was created 5 mins ago with delay of 1-2 seconds and the job is just sitting there.

Here you can see that the worker is connected.

manast · 2024-03-15T07:35:52Z

What are your worker settings?

manast · 2024-03-15T07:40:20Z

Another thing, do you know why these jobs have such high "started" numbers?

manast · 2024-03-15T08:03:16Z

Also, how are those delayed jobs produced?

wernermorgenstern · 2024-03-15T10:07:16Z

We just had a similar issue this morning.

Here is how we can create the delayed Jobs.

const createNextJob = async (
  jobRotateQueue: QueuePro,
  tid: string,
  devCreated: any,
  devConfig: any,
  delay: string
) => {
  const jobData = {
    tid,
    devCreated,
    devConfig
  }

  const jobOptions: JobsProOptions = {
    removeOnComplete: {
      count: 1000
    },
    removeOnFail: {
      count: 10
    },
    attempts: 6,
    delay: 0,
    group: {
      id: tid
    },
    delay: helpers.period(delay) // This converts e.g. '2m' to milliseconds
  }

  try {
    return await jobRotateQueue.add('jobName', jobData, jobOptions)
  } catch (err: any) {
    winstonLogger.error('Unable to add Job', {
        groupId: tid,
        error: { message: err.message, stack: err.stack, code: err.code }
      })
    }
  }
}

mistersingh179 · 2024-03-15T12:36:28Z

What are your worker settings?

my worker options are:

{
    connection: redisClient,
    concurrency: Number(10),
    limiter: {
      max: 100,
      duration: 1000,
    },
    autorun: false,
    metrics: {
      maxDataPoints: MetricsTime.TWO_WEEKS,
    },
  },

with redisClient being

const redisClient = new Redis(REDIS_URL, {
  maxRetriesPerRequest: null,
  enableReadyCheck: false,
  retryStrategy: function (times: number) {
    return Math.max(Math.min(Math.exp(times), 20000), 1000);
  },
});

// this prevents node warning which comes from many workers adding listeners to the same redis connection
redisClient.setMaxListeners(100);

mistersingh179 · 2024-03-15T12:38:32Z

Also, how are those delayed jobs produced?

When i add a job, in its processor function i try to get a redis lock on a custom key, if it fails to get a lock, i move the job to delayed like this

await job.moveToDelayed(Date.now() + randomInt(1000, 2000), token);
throw new DelayedError();

mistersingh179 · 2024-03-15T12:39:59Z

Another thing, do you know why these jobs have such high "started" numbers?

they have a high started number, because they have been tried many times and every time they were unable to get a lock they were moved to delayed.

roggervalf · 2024-03-16T02:03:05Z

hi @wernermorgenstern, I can see that your are using groups but you are pointing to 5.4.2 bullmq version, not the actual bullmq-pro version. v7.3.0 that was released today is using bullmq 5.4.2. Just for curiosity, are you using bullmq along side with bullmq-pro?

wernermorgenstern · 2024-03-16T02:28:31Z

I am actually using pro. So do I need to remove the bullmq from the package for? I use the pro versions of the functions and constructors

roggervalf · 2024-03-16T03:33:42Z

hey, if you are meaning that you have bullmq-pro and bullmq in your packages.json, yes you should only have pro version, as we used fixed versions of bullmq in pro version

manast · 2024-03-16T08:44:15Z

another question, which versión of Redis are you using?

wernermorgenstern · 2024-03-16T14:06:30Z

hey, if you are meaning that you have bullmq-pro and bullmq in your packages.json, yes you should only have pro version, as we used fixed versions of bullmq in pro version

I will remove and try that

wernermorgenstern · 2024-03-16T14:06:55Z

another question, which versión of Redis are you using?

Hi, we are using Redis Cloud Enterprise, which is on 7.2

roggervalf · 2024-03-16T15:49:51Z

hi @wernermorgenstern could you pls try upgrading to pro version 7.3.0, that's our last release https://github.com/taskforcesh/bullmq/blob/master/docs/gitbook/bullmq-pro/changelog.md#730-2024-03-16 that contains a fix that affects delayed jobs

wernermorgenstern · 2024-03-16T16:42:25Z

hi @wernermorgenstern could you pls try upgrading to pro version 7.3.0, that's our last reléase https://github.com/taskforcesh/bullmq/blob/master/docs/gitbook/bullmq-pro/changelog.md#730-2024-03-16 that contains a fix that afects delayed jobs

I will do that. And deploy it to our production environment next week. What we saw yesterday, that we had one job stuck in active state. In Taskforce.sh (Trial version) - I love it so far - I saw it had workers, and idle time was between 1s and 10s. So the worker was still processing other jobs.
But the active job was stuck in active. I could not even delete it with Taskforce.sh
Only way I could get it unstuck is to restart the service, which created a new worker, and then create a job with the same Group ID, and then the stuck job went out of active state.

roggervalf · 2024-05-04T04:48:45Z

hi @matthewgonzalez, are you passing a ioredis instance? if yes, are you also setting maxRetriesPerRequest: null ?

matthewgonzalez · 2024-05-06T16:00:18Z

Yes we are passing in an ioredis instance. Here is how that instance is configured:

export const connection = new IORedis(REDIS_URL, {
  // a warning is thrown on redis startup if these aren't added
  enableReadyCheck: false,
  maxRetriesPerRequest: null,
}).setMaxListeners(4 * queuesMQ.length + 10)

matthewgonzalez · 2024-05-17T04:50:58Z

We rolled back to 5.4.0 two weeks ago and have not experienced this issue since. Still curious as to the resolution here so we know when it may be safe to attempt to keep BullMQ up-to-date again. Thanks for all your support thus far.

manast · 2024-05-17T08:11:51Z

I am reopening as it seems this is still not working, until we can reproduce it.

manast · 2024-05-24T13:04:30Z

For the record, this is the test code I am using and not able to reproduce the issue. I wonder if there is not something more to it, like disconnections or something like that.

const { Queue, Worker } = require("bullmq");

const queueName = "test";

async function start() {
  const queue = new Queue(queueName, {
    connection: { host: "localhost", port: 6379 },
    // a warning is thrown on redis startup if these aren't added
    enableReadyCheck: false,
    maxRetriesPerRequest: null,
  });

  const job = await queue.add("__default__", null, {
    jobId: queueName + "-cron-worker-job",
    repeat: {
      every: 15000, // every 15 seconds
    },
    data: {
      foo: "bar",
    },
  });

  const processFn = async (job) => {
    console.log(`Processing job ${job.id} with data ${job.data}`)
    console.log(`-> ${job.id}`);
    await new Promise((res) => setTimeout(res, 1000));
    console.log(`\t<- ${job.id}`);
  };

  const worker = new Worker(queueName, processFn, {
    connection: {
      host: "localhost",
      port: 6379, // a warning is thrown on redis startup if these aren't added
      enableReadyCheck: false,
      maxRetriesPerRequest: null,
    },
  });

  worker.on("error", (err) => {
    console.error(err);
  });

  worker.on("completed", (job) => {
    console.log(`Job ${job.id} completed`);
  });

  worker.on("failed", (job, err) => {
    console.error(`Job ${job.id} failed with ${err.message}`);
  });
}

start();

lukas-becker0 · 2024-05-26T11:28:53Z

I think we encountered this issue today,

we are using bullmq 5.7.8 and ioredis 5.4.1
and today we noticed that for some reason no job was being processed.

We first thought that it might be a redis issue,
we did not do a thoroughly investigation yet but
after restarting the service we noticed that the jobs were processed as expected
which means that adding the jobs worked but the worker did not process them.

This is the first time we encountered this issue.

Aside from maxRetriesPerRequest: null we are using lazyConnect: true and we are currently not settingenableReadyCheck at all (should we set it to false ?).

lukas-becker0 · 2024-05-27T11:41:43Z

So next time we have a similar issue, and I notice the worker is not connected, what would the next steps be in troubleshooting, and resolving this issue?

If you use the suggested settings, the workers should automatically reconnect as soon they can, so you should not get this issue anymore.

Is is really necessary to explicitly set enableOfflineQueue: true for the workers in order for them to reconnect ?
So far this was not necessary.

You can use https://api.docs.bullmq.io/classes/v5.Queue.html#getWorkers or Taskforce.sh, I don't know if BullBoard can also show this information.

I connected to our Q Redis instance and checked the workers & worker count of the deployed queue locally
(Since the queues of the Q stage also stopped processing jobs)
and indeed the workers of our Q deployment are apparently no longer connected,
the worker count for the queue is 0.

lukas-becker0 · 2024-05-27T14:25:08Z

I connected to our Q Redis instance and checked the workers & worker count of the deployed queue locally (Since the queues of the Q stage also stopped processing jobs) and indeed the workers of our Q deployment are apparently no longer connected, the worker count for the queue is 0.

Apparently the managed redis instances were restarted prior to the issue.
About two hours later the queue did accept new jobs again, the worker however stopped processing jobs
(We do not know yet why the worker stopped, we use the same options (.e.g. infinite retries) for both the queue and the worker.

lukas-becker0 · 2024-05-27T16:55:18Z

Hi, @manast
I did some tests with the code you provided.

If I force close the redis connections with e.g.
redis-cli CLIENT KILL TYPE normal && redis-cli CLIENT KILL TYPE slave && redis-cli CLIENT KILL TYPE pubsub
the worker reconnects and continues to process jobs as expected.

If however I kill/shutdown the entire redis server docker-compose down, then wait a few seconds (~5 seconds) and then start redis again (docker-compose up -d),
the worker reconnects but does not continue to process jobs (although isRunning returns true) from the queue.
The queue does then no longer see any connected worker.

(Simulating a server restart and/or crash)

Is that the expected behaviour ?

docker-compose.yaml

version: '3.8'

services:
    redis:
        container_name: 'redis_test_server'
        image: redis:6.2.14-alpine
        restart: always
        ports:
            - '6379:6379'
        volumes:
            -   /tmp/redis-test-server:/data

(Setting redis-cli config set maxmemory-policy noeviction after the first start ofcourse)

I modified your code slightly:

const { Queue, Worker } = require('bullmq');

const queueName = 'test';

async function start() {
	const queue = new Queue(queueName, {
		connection: { host: 'localhost', port: 6379 },
		// a warning is thrown on redis startup if these aren't added
		enableReadyCheck: false,
		maxRetriesPerRequest: null,
		enableOfflineQueue: false,
	});

	setInterval(() => {
		queue.getWorkersCount().then((numberOfWorkers) => {
			console.warn(`Number of workers: ${numberOfWorkers}`);
		});

		queue.getJobCounts().then((numberOfJobs) => {
			console.warn(`Number of jobs: ${JSON.stringify(numberOfJobs)}`);
		});
	}, 10_000);

	const job = await queue.add('__default__', null, {
		jobId: queueName + '-cron-worker-job',
		repeat: {
			every: 3000, // every 3 seconds
		},
		data: {
			foo: 'bar',
		},
	});

	const processFn = async (job) => {
		console.log(`Processing job ${job.id} with data ${job.data}`);
		console.log(`-> ${job.id}`);
		await new Promise((res) => setTimeout(res, 1000));
		console.log(`\t<- ${job.id}`);
	};

	const worker = new Worker(queueName, processFn, {
		connection: {
			host: 'localhost',
			port: 6379, // a warning is thrown on redis startup if these aren't added
			enableReadyCheck: false,
			maxRetriesPerRequest: null,
			enableOfflineQueue: true,
		},
	});

	worker.on('error', (err) => {
		console.error(err);
	});

	worker.on('closed', () => {
		console.warn('Worker closed');
	});

	worker.on('ready', () => {
		console.warn('Worker is ready!');
	});

	worker.on('completed', (job) => {
		console.log(`Job ${job.id} completed`);
	});

	worker.on('failed', (job, err) => {
		console.error(`Job ${job.id} failed with ${err.message}`);
	});
}

start();

lukas-becker0 · 2024-05-27T18:02:55Z

I can reproduce the above with bullMQ starting with version 5.0.0,
bullMQ 4.17.0 works as expected.

manast · 2024-05-28T13:12:57Z

@lukas-becker0 seems like I am able to reproduce it following your instructions. I will keep you updated...

manast · 2024-05-29T08:55:37Z

I hope this small fix finally resolves this issue for everybody.

lukas-becker0 · 2024-05-29T10:21:59Z

Hi @manast,

I'm sorry but I'm still able to reproduce it with the fix and bullmq 5.7.13

Assuming it is enough to add the line from the fix PR to worker.js in bullmq/dist/cjs/classes/worker.js ?

Sometimes the worker can connect again but when I then restart redis for a second or third time it eventually results in the same issue as before.

I also tried with a custom redis.conf with the recommended AOF option being set from the start but it made no difference.

lukas-becker0 · 2024-05-29T11:03:15Z

FYI, @manast
I created a repo with my current test setup here, just in case it might be helpful.

manast · 2024-05-29T12:36:35Z

It is not enough adding the line with the disconnect, you must remove the other 2 as well.

lukas-becker0 · 2024-05-29T12:39:11Z

@manast

It is not enough adding the line with the disconnect, you must remove the other 2 as well.

Sorry you are right, I somehow missed that 🙈 ... (due to the darkreader firefox extension ....)

I just did run the tests again and it does indeed work now as expected.
Thank you very much and sorry for the confusion.
🎆 😃

croconut · 2024-05-29T23:41:39Z

Really glad this got fixed, literally just ran into this issue today during POC testing and upgrading to 5.7.14 did the trick. Thanks @manast

wernermorgenstern · 2024-05-29T23:55:30Z

Is this already part of the latest pro version too or is that still coming? Get Outlook for Android<https://aka.ms/AAb9ysg>

…

________________________________ From: Matthew Stevens ***@***.***> Sent: Wednesday, May 29, 2024 7:42:01 PM To: taskforcesh/bullmq ***@***.***> Cc: Werner Morgenstern ***@***.***>; Mention ***@***.***> Subject: Re: [taskforcesh/bullmq] [Bug]: Worker stopped processing jobs, and mostly delayed Jobs (Issue #2466) Really glad this got fixed, literally just ran into this issue today during POC testing. Thanks @manast<https://github.com/manast> — Reply to this email directly, view it on GitHub<#2466 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AJPN7C4PKET7YZH7ICFCKXTZEZRUTAVCNFSM6AAAAABEM752DCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZYGQZDGMBRHA>. You are receiving this because you were mentioned.Message ID: ***@***.***>

manast · 2024-05-30T09:33:52Z

@wernermorgenstern its coming very soon.

roggervalf · 2024-05-31T04:29:28Z

hi @wernermorgenstern, it's available since v7.8.2 in pro version

matthewgonzalez · 2024-05-31T23:34:56Z

We will attempt the update next week and report back.

tavindev · 2024-06-27T17:38:56Z

Experiencing this in v5.8.1

manast · 2024-06-27T18:00:19Z

Since it seems that the original authors cannot reproduce the issue anymore, I will close it now, so that other users do not get lured into this one if they are experiencing a similar issue, but not this exact one, as that will just confuse everybody.

@tavindev you are welcome to open a new issue with the particular details for your use case.

lukas-becker0 · 2024-06-27T18:53:42Z

@tavindev
I used my test setup from last time and tested with v5.8.2 and I can not reproduce it.

wernermorgenstern added the bug Something isn't working label Mar 8, 2024

manast reopened this May 17, 2024

manast mentioned this issue May 29, 2024

fix(worker): properly cancel blocking command during disconnections #2585

Merged

Linksku mentioned this issue Jun 23, 2024

[Bug]: Is it normal to get lots of "Command timed out" from bzpopmin? #2619

Open

1 task

manast closed this as completed Jun 27, 2024

lindesvard mentioned this issue Jul 14, 2024

[Bug]: Delayed jobs are not triggered on time #2648

Closed

1 task

nullndr mentioned this issue Sep 4, 2024

[Bug]: Workers suddenly stopped processing jobs on AWS memorydb #2757

Open

1 task

[Bug]: Worker stopped processing jobs, and mostly delayed Jobs #2466

[Bug]: Worker stopped processing jobs, and mostly delayed Jobs #2466

Comments

wernermorgenstern commented Mar 8, 2024

Version

Platform

What happened?

How to reproduce.

Relevant log output

Code of Conduct

manast commented Mar 8, 2024

wernermorgenstern commented Mar 8, 2024

manast commented Mar 8, 2024

manast commented Mar 8, 2024

wernermorgenstern commented Mar 8, 2024

wernermorgenstern commented Mar 8, 2024

wernermorgenstern commented Mar 8, 2024

wernermorgenstern commented Mar 8, 2024

manast commented Mar 10, 2024

manast commented Mar 10, 2024 • edited Loading

wernermorgenstern commented Mar 11, 2024

wernermorgenstern commented Mar 11, 2024

manast commented Mar 11, 2024

mistersingh179 commented Mar 14, 2024

manast commented Mar 15, 2024

manast commented Mar 15, 2024

manast commented Mar 15, 2024

wernermorgenstern commented Mar 15, 2024

mistersingh179 commented Mar 15, 2024

mistersingh179 commented Mar 15, 2024

mistersingh179 commented Mar 15, 2024

roggervalf commented Mar 16, 2024

wernermorgenstern commented Mar 16, 2024

roggervalf commented Mar 16, 2024

manast commented Mar 16, 2024

wernermorgenstern commented Mar 16, 2024

wernermorgenstern commented Mar 16, 2024

roggervalf commented Mar 16, 2024 • edited Loading

wernermorgenstern commented Mar 16, 2024

roggervalf commented May 4, 2024

matthewgonzalez commented May 6, 2024 • edited Loading

matthewgonzalez commented May 17, 2024

manast commented May 17, 2024

manast commented May 24, 2024

lukas-becker0 commented May 26, 2024

lukas-becker0 commented May 27, 2024 • edited Loading

lukas-becker0 commented May 27, 2024

lukas-becker0 commented May 27, 2024 • edited Loading

lukas-becker0 commented May 27, 2024 • edited Loading

manast commented May 28, 2024

manast commented May 29, 2024

lukas-becker0 commented May 29, 2024

lukas-becker0 commented May 29, 2024

manast commented May 29, 2024 • edited Loading

lukas-becker0 commented May 29, 2024 • edited Loading

croconut commented May 29, 2024 • edited Loading

wernermorgenstern commented May 29, 2024 via email

manast commented May 30, 2024

roggervalf commented May 31, 2024

matthewgonzalez commented May 31, 2024 • edited Loading

tavindev commented Jun 27, 2024

manast commented Jun 27, 2024

lukas-becker0 commented Jun 27, 2024

manast commented Mar 10, 2024 •

edited

Loading

roggervalf commented Mar 16, 2024 •

edited

Loading

matthewgonzalez commented May 6, 2024 •

edited

Loading

lukas-becker0 commented May 27, 2024 •

edited

Loading

lukas-becker0 commented May 27, 2024 •

edited

Loading

lukas-becker0 commented May 27, 2024 •

edited

Loading

manast commented May 29, 2024 •

edited

Loading

lukas-becker0 commented May 29, 2024 •

edited

Loading

croconut commented May 29, 2024 •

edited

Loading

matthewgonzalez commented May 31, 2024 •

edited

Loading