-
Notifications
You must be signed in to change notification settings - Fork 410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Uneven job distribution between workers #2842
Comments
I'm also facing this issue on my project. |
@mynameisvasco no, I haven't. As mentioned, I suppose it is caused by the latency difference between the workers and Redis but it would be nice to have a confirmation (and possible workarounds) by a maintainer! |
Would also be interested in some insight here, we are seeing the same thing with our setup. |
Workers picks up jobs as soon as they are done with the previous job. If you have a high concurrency factor then I guess it could happen that the first worker that starts picks up a lot of jobs, and then if there are not enough jobs for the rest of the workers they will be idling more than the first one. But overtime I do expect all workers doing more or less the same amount of processing. |
Btw, it would be useful to know more about the setup, are they just standard jobs, delayed, prioritised? |
Hey! Thanks 😄 I wouldn't describe this as essential for us, just what we're seeing surprised us a bit.
These are the metrics we have. Average is typically below 10%. We can definitely scale down the amount of workers we're running, but are a tad concerned about the higher level of max utilization. Spikes may be caused by the adjustments we're doing, not sure. |
@hhanesand yes, it is strange, maybe it has to do with the shape of the incoming jobs, like if they come in bursts, then it could be that one eager worker picks a lot of jobs leaving much less to the rest, something like that. I will think more about it. |
In our setup, most jobs are standard jobs (not delayed/prioritized/etc, nothing special there). Is my hypothesis correct regarding the latency between workers and redis? It could be the most simple explanation, but it would be nice to have some kind of solution for better distribution. |
I am playing with the idea of introducing an automatic concurrency mechanism, based on how utilised is the NodeJS event loop. So for example if it is below a certain value it can increase the number of concurrent jobs up until some defined maximum. Well, I am assuming that the peaks in those graphs are produced by some workers processing more jobs in parallel than their peers. |
We do definitely have bursts of jobs. When we're enqueuing jobs we queue everything for one queue in an addBulk operation. Essentially we're doing
I've tried using a flow producer for this so we can avoid O(# of queues) and instead have O(# of jobs / # bulk batch size), but ran into this issue. Maybe adopting delayed jobs could help with this issue, but we're unable to predict all of the load - we have an API endpoint which adds jobs to queue unpredictably. |
@hhanesand I think addBulk may be the issue here, we have a PR to resolve this problem: #2841 |
oh sorry, this would only affect if you are adding prioritised jobs not standard ones. |
ah no, even standard ones actually sorry for the confusion. |
As I mentioned, we are not using addBulk in our setup, will this still have a positive effect? |
Not for you, but it seems like this could be the case for @hhanesand |
@nemphys it could be that what is happening in addBulk also happens with standard add in some scenarios, we will investigate it as well. |
Nice! Any hint to what is actually happening so that we can have a general idea? |
hi @nemphys sorry for the delay. For awakening workers, we have a mechanism based on adding markers. For standard and prioritized jobs we add 1 marker with the same score. These markers only awakes a worker that consumes it and then it can start consuming other jobs. If these markers are consumed but there are still other jobs in waiting or prioritized state and there are other workers that couldn't consume this marker, they will be waiting for the next marker for at least 5 seconds (after this period workers, will check if there are pending jobs anyway). |
# [5.27.0](v5.26.2...v5.27.0) (2024-11-19) ### Features * **queue:** add rateLimit method ([#2896](#2896)) ([db84ad5](db84ad5)) * **queue:** add removeRateLimitKey method ([#2806](#2806)) ([ff70613](ff70613)) ### Performance Improvements * **marker:** add base markers while consuming jobs to get workers busy ([#2904](#2904)) fixes [#2842](#2842) ([1759c8b](1759c8b))
@nemphys the PR above should improve the situation in your case, please let me know how it goes. |
@manast thank you, I will try to test it in the first available slot, although I am stuck in an pretty old version due to issues upgrading repeatable jobs to the new format. |
@nemphys if you need help upgrading please let us know. |
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [bullmq](https://bullmq.io/) ([source](https://github.com/taskforcesh/bullmq)) | dependencies | minor | [`5.25.6` -> `5.29.1`](https://renovatebot.com/diffs/npm/bullmq/5.25.6/5.29.1) | --- ### Release Notes <details> <summary>taskforcesh/bullmq (bullmq)</summary> ### [`v5.29.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.29.1) [Compare Source](taskforcesh/bullmq@v5.29.0...v5.29.1) ##### Bug Fixes - **scheduler:** remove deprecation warning on immediately option ([#​2923](taskforcesh/bullmq#2923)) ([14ca7f4](taskforcesh/bullmq@14ca7f4)) ### [`v5.29.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.29.0) [Compare Source](taskforcesh/bullmq@v5.28.2...v5.29.0) ##### Features - **queue:** refactor a protected addJob method allowing telemetry extensions ([09f2571](taskforcesh/bullmq@09f2571)) ### [`v5.28.2`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.2) [Compare Source](taskforcesh/bullmq@v5.28.1...v5.28.2) ##### Bug Fixes - **queue:** change \_jobScheduler from private to protected for extension ([#​2920](taskforcesh/bullmq#2920)) ([34c2348](taskforcesh/bullmq@34c2348)) ### [`v5.28.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.1) [Compare Source](taskforcesh/bullmq@v5.28.0...v5.28.1) ##### Bug Fixes - **scheduler:** use Job class from getter for extension ([#​2917](taskforcesh/bullmq#2917)) ([5fbb075](taskforcesh/bullmq@5fbb075)) ### [`v5.28.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.28.0) [Compare Source](taskforcesh/bullmq@v5.27.0...v5.28.0) ##### Features - **job-scheduler:** add telemetry support to the job scheduler ([72ea950](taskforcesh/bullmq@72ea950)) ### [`v5.27.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.27.0) [Compare Source](taskforcesh/bullmq@v5.26.2...v5.27.0) ##### Features - **queue:** add rateLimit method ([#​2896](taskforcesh/bullmq#2896)) ([db84ad5](taskforcesh/bullmq@db84ad5)) - **queue:** add removeRateLimitKey method ([#​2806](taskforcesh/bullmq#2806)) ([ff70613](taskforcesh/bullmq@ff70613)) ##### Performance Improvements - **marker:** add base markers while consuming jobs to get workers busy ([#​2904](taskforcesh/bullmq#2904)) fixes [#​2842](taskforcesh/bullmq#2842) ([1759c8b](taskforcesh/bullmq@1759c8b)) ### [`v5.26.2`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.2) [Compare Source](taskforcesh/bullmq@v5.26.1...v5.26.2) ##### Bug Fixes - **telemetry:** do not set span on parent context if undefined ([c417a23](taskforcesh/bullmq@c417a23)) ### [`v5.26.1`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.1) [Compare Source](taskforcesh/bullmq@v5.26.0...v5.26.1) ##### Bug Fixes - **queue:** fix generics to be able to properly be extended ([f2495e5](taskforcesh/bullmq@f2495e5)) ### [`v5.26.0`](https://github.com/taskforcesh/bullmq/releases/tag/v5.26.0) [Compare Source](taskforcesh/bullmq@v5.25.6...v5.26.0) ##### Features - improve queue getters to use generic job type ([#​2905](taskforcesh/bullmq#2905)) ([c9531ec](taskforcesh/bullmq@c9531ec)) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOC4xNDIuNyIsInVwZGF0ZWRJblZlciI6IjM4LjE0Mi43IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiXX0=--> Reviewed-on: https://git.tristess.app/alexandresoro/ouca/pulls/340 Reviewed-by: Alexandre Soro <code@soro.dev> Co-authored-by: renovate <renovate@git.tristess.app> Co-committed-by: renovate <renovate@git.tristess.app>
Version
v5.4.3
Platform
NodeJS
What happened?
I am not sure if this is a bug, but I couldn't find a better category 😄
We are running a 5-node Redis sentinel setup, where each node is running both Redis and the application using BullMQ, therefore the node that is currently the master Redis instance has naturally the fastest connection to Redis.
What we observe is that this master node ends up executing the majority of jobs across queues, with the rest of the nodes usually picking up jobs only when this node has reached its worker concurrency limit.
I suppose this might be considered normal (I am not aware the mechanism that implements job picking, but it would make sense for it to be a on first-come-first-serve basis, with the node running the Redis master instance locally being the node that inherently arrives first), but I would like to know if there is some way to overcome this and get a more even job distribution between workers.
How to reproduce.
No response
Relevant log output
No response
Code of Conduct
The text was updated successfully, but these errors were encountered: