[Bug]: Uneven job distribution between workers #2842

nemphys · 2024-10-19T06:54:16Z

Version

v5.4.3

Platform

NodeJS

What happened?

I am not sure if this is a bug, but I couldn't find a better category 😄

We are running a 5-node Redis sentinel setup, where each node is running both Redis and the application using BullMQ, therefore the node that is currently the master Redis instance has naturally the fastest connection to Redis.

What we observe is that this master node ends up executing the majority of jobs across queues, with the rest of the nodes usually picking up jobs only when this node has reached its worker concurrency limit.

I suppose this might be considered normal (I am not aware the mechanism that implements job picking, but it would make sense for it to be a on first-come-first-serve basis, with the node running the Redis master instance locally being the node that inherently arrives first), but I would like to know if there is some way to overcome this and get a more even job distribution between workers.

How to reproduce.

No response

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

mynameisvasco · 2024-10-25T11:38:59Z

I'm also facing this issue on my project.
I have about 32 servers connected to a redis cluster. Usually there are some servers with 60-70% CPU usage while others are 5-15%.
Did you find any solution to this problem?

nemphys · 2024-10-25T11:41:47Z

@mynameisvasco no, I haven't.

As mentioned, I suppose it is caused by the latency difference between the workers and Redis but it would be nice to have a confirmation (and possible workarounds) by a maintainer!

hhanesand · 2024-11-12T22:10:39Z

Would also be interested in some insight here, we are seeing the same thing with our setup.

manast · 2024-11-12T23:25:30Z

Workers picks up jobs as soon as they are done with the previous job. If you have a high concurrency factor then I guess it could happen that the first worker that starts picks up a lot of jobs, and then if there are not enough jobs for the rest of the workers they will be idling more than the first one. But overtime I do expect all workers doing more or less the same amount of processing.

manast · 2024-11-12T23:26:31Z

Btw, it would be useful to know more about the setup, are they just standard jobs, delayed, prioritised?

hhanesand · 2024-11-13T09:16:52Z

Hey! Thanks 😄 I wouldn't describe this as essential for us, just what we're seeing surprised us a bit.

At the moment we're running bullmq-pro@7.17.2 & bullmq@5.19.1 and are about to update to 7.20.1 & 5.25.4
We are using a redis cluster.
We have a group id and job id on all our jobs, together with removeOnComplete and removeOnFail. That's it for options.
We have ~100 queues running across 10 servers - so about 1000 workers.
Worker concurrencies are adjusted frequently, depending on the number of jobs waiting.

These are the metrics we have. Average is typically below 10%. We can definitely scale down the amount of workers we're running, but are a tad concerned about the higher level of max utilization. Spikes may be caused by the adjustments we're doing, not sure.

manast · 2024-11-13T09:34:22Z

@hhanesand yes, it is strange, maybe it has to do with the shape of the incoming jobs, like if they come in bursts, then it could be that one eager worker picks a lot of jobs leaving much less to the rest, something like that. I will think more about it.

nemphys · 2024-11-13T11:08:21Z

In our setup, most jobs are standard jobs (not delayed/prioritized/etc, nothing special there).

Is my hypothesis correct regarding the latency between workers and redis? It could be the most simple explanation, but it would be nice to have some kind of solution for better distribution.

manast · 2024-11-13T14:00:48Z

I am playing with the idea of introducing an automatic concurrency mechanism, based on how utilised is the NodeJS event loop. So for example if it is below a certain value it can increase the number of concurrent jobs up until some defined maximum. Well, I am assuming that the peaks in those graphs are produced by some workers processing more jobs in parallel than their peers.

hhanesand · 2024-11-13T14:15:38Z

We do definitely have bursts of jobs. When we're enqueuing jobs we queue everything for one queue in an addBulk operation. Essentially we're doing

load jobs from outside datastore
split jobs by queue

for (jobs of jobs split by queue) {
    queue.addBulk(jobs)
}

I've tried using a flow producer for this so we can avoid O(# of queues) and instead have O(# of jobs / # bulk batch size), but ran into this issue.

Maybe adopting delayed jobs could help with this issue, but we're unable to predict all of the load - we have an API endpoint which adds jobs to queue unpredictably.

manast · 2024-11-13T15:16:27Z

@hhanesand I think addBulk may be the issue here, we have a PR to resolve this problem: #2841

manast · 2024-11-13T15:17:14Z

oh sorry, this would only affect if you are adding prioritised jobs not standard ones.

manast · 2024-11-13T15:17:47Z

ah no, even standard ones actually sorry for the confusion.

nemphys · 2024-11-13T15:22:00Z

As I mentioned, we are not using addBulk in our setup, will this still have a positive effect?

manast · 2024-11-13T15:25:31Z

As I mentioned, we are not using addBulk in our setup, will this still have a positive effect?

Not for you, but it seems like this could be the case for @hhanesand

manast · 2024-11-13T15:27:14Z

@nemphys it could be that what is happening in addBulk also happens with standard add in some scenarios, we will investigate it as well.

nemphys · 2024-11-13T15:29:12Z

Nice! Any hint to what is actually happening so that we can have a general idea?

roggervalf · 2024-11-19T03:10:36Z

hi @nemphys sorry for the delay. For awakening workers, we have a mechanism based on adding markers. For standard and prioritized jobs we add 1 marker with the same score. These markers only awakes a worker that consumes it and then it can start consuming other jobs. If these markers are consumed but there are still other jobs in waiting or prioritized state and there are other workers that couldn't consume this marker, they will be waiting for the next marker for at least 5 seconds (after this period workers, will check if there are pending jobs anyway).

# [5.27.0](v5.26.2...v5.27.0) (2024-11-19) ### Features * **queue:** add rateLimit method ([#2896](#2896)) ([db84ad5](db84ad5)) * **queue:** add removeRateLimitKey method ([#2806](#2806)) ([ff70613](ff70613)) ### Performance Improvements * **marker:** add base markers while consuming jobs to get workers busy ([#2904](#2904)) fixes [#2842](#2842) ([1759c8b](1759c8b))

manast · 2024-11-19T08:22:49Z

@nemphys the PR above should improve the situation in your case, please let me know how it goes.

nemphys · 2024-11-19T09:14:58Z

@manast thank you, I will try to test it in the first available slot, although I am stuck in an pretty old version due to issues upgrading repeatable jobs to the new format.

manast · 2024-11-19T10:47:49Z

@nemphys if you need help upgrading please let us know.

This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [bullmq](https://bullmq.io/) ([source](https://github.com/taskforcesh/bullmq)) | dependencies | minor | [`5.25.6` -> `5.29.1`](https://renovatebot.com/diffs/npm/bullmq/5.25.6/5.29.1) | --- ### Release Notes <details> <summary>taskforcesh/bullmq (bullmq)</summary> ### [`v5.29.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.29.1) [Compare Source](taskforcesh/bullmq@v5.29.0...v5.29.1) ##### Bug Fixes - **scheduler:** remove deprecation warning on immediately option ([#2923](taskforcesh/bullmq#2923)) ([14ca7f4](taskforcesh/bullmq@14ca7f4)) ### [`v5.29.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.29.0) [Compare Source](taskforcesh/bullmq@v5.28.2...v5.29.0) ##### Features - **queue:** refactor a protected addJob method allowing telemetry extensions ([09f2571](taskforcesh/bullmq@09f2571)) ### [`v5.28.2`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.2) [Compare Source](taskforcesh/bullmq@v5.28.1...v5.28.2) ##### Bug Fixes - **queue:** change \_jobScheduler from private to protected for extension ([#2920](taskforcesh/bullmq#2920)) ([34c2348](taskforcesh/bullmq@34c2348)) ### [`v5.28.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.1) [Compare Source](taskforcesh/bullmq@v5.28.0...v5.28.1) ##### Bug Fixes - **scheduler:** use Job class from getter for extension ([#2917](taskforcesh/bullmq#2917)) ([5fbb075](taskforcesh/bullmq@5fbb075)) ### [`v5.28.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.0) [Compare Source](taskforcesh/bullmq@v5.27.0...v5.28.0) ##### Features - **job-scheduler:** add telemetry support to the job scheduler ([72ea950](taskforcesh/bullmq@72ea950)) ### [`v5.27.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.27.0) [Compare Source](taskforcesh/bullmq@v5.26.2...v5.27.0) ##### Features - **queue:** add rateLimit method ([#2896](taskforcesh/bullmq#2896)) ([db84ad5](taskforcesh/bullmq@db84ad5)) - **queue:** add removeRateLimitKey method ([#2806](taskforcesh/bullmq#2806)) ([ff70613](taskforcesh/bullmq@ff70613)) ##### Performance Improvements - **marker:** add base markers while consuming jobs to get workers busy ([#2904](taskforcesh/bullmq#2904)) fixes [#2842](taskforcesh/bullmq#2842) ([1759c8b](taskforcesh/bullmq@1759c8b)) ### [`v5.26.2`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.2) [Compare Source](taskforcesh/bullmq@v5.26.1...v5.26.2) ##### Bug Fixes - **telemetry:** do not set span on parent context if undefined ([c417a23](taskforcesh/bullmq@c417a23)) ### [`v5.26.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.1) [Compare Source](taskforcesh/bullmq@v5.26.0...v5.26.1) ##### Bug Fixes - **queue:** fix generics to be able to properly be extended ([f2495e5](taskforcesh/bullmq@f2495e5)) ### [`v5.26.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.0) [Compare Source](taskforcesh/bullmq@v5.25.6...v5.26.0) ##### Features - improve queue getters to use generic job type ([#2905](taskforcesh/bullmq#2905)) ([c9531ec](taskforcesh/bullmq@c9531ec)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).  Reviewed-on: https://git.tristess.app/alexandresoro/ouca/pulls/340 Reviewed-by: Alexandre Soro <code@soro.dev> Co-authored-by: renovate <renovate@git.tristess.app> Co-committed-by: renovate <renovate@git.tristess.app>

nemphys added the bug Something isn't working label Oct 19, 2024

roggervalf mentioned this issue Nov 19, 2024

perf(marker): add base markers while consuming jobs to get workers busy #2904

Merged

roggervalf closed this as completed in #2904 Nov 19, 2024

roggervalf closed this as completed in 1759c8b Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Uneven job distribution between workers #2842

[Bug]: Uneven job distribution between workers #2842

nemphys commented Oct 19, 2024

mynameisvasco commented Oct 25, 2024

nemphys commented Oct 25, 2024

hhanesand commented Nov 12, 2024

manast commented Nov 12, 2024

manast commented Nov 12, 2024

hhanesand commented Nov 13, 2024 •

edited

Loading

manast commented Nov 13, 2024

nemphys commented Nov 13, 2024

manast commented Nov 13, 2024 •

edited

Loading

hhanesand commented Nov 13, 2024 •

edited

Loading

manast commented Nov 13, 2024

manast commented Nov 13, 2024

manast commented Nov 13, 2024

nemphys commented Nov 13, 2024

manast commented Nov 13, 2024

manast commented Nov 13, 2024

nemphys commented Nov 13, 2024

roggervalf commented Nov 19, 2024 •

edited by manast

Loading

manast commented Nov 19, 2024

nemphys commented Nov 19, 2024

manast commented Nov 19, 2024

[Bug]: Uneven job distribution between workers #2842

[Bug]: Uneven job distribution between workers #2842

Comments

nemphys commented Oct 19, 2024

Version

Platform

What happened?

How to reproduce.

Relevant log output

Code of Conduct

mynameisvasco commented Oct 25, 2024

nemphys commented Oct 25, 2024

hhanesand commented Nov 12, 2024

manast commented Nov 12, 2024

manast commented Nov 12, 2024

hhanesand commented Nov 13, 2024 • edited Loading

manast commented Nov 13, 2024

nemphys commented Nov 13, 2024

manast commented Nov 13, 2024 • edited Loading

hhanesand commented Nov 13, 2024 • edited Loading

manast commented Nov 13, 2024

manast commented Nov 13, 2024

manast commented Nov 13, 2024

nemphys commented Nov 13, 2024

manast commented Nov 13, 2024

manast commented Nov 13, 2024

nemphys commented Nov 13, 2024

roggervalf commented Nov 19, 2024 • edited by manast Loading

manast commented Nov 19, 2024

nemphys commented Nov 19, 2024

manast commented Nov 19, 2024

hhanesand commented Nov 13, 2024 •

edited

Loading

manast commented Nov 13, 2024 •

edited

Loading

hhanesand commented Nov 13, 2024 •

edited

Loading

roggervalf commented Nov 19, 2024 •

edited by manast

Loading