
Queued tasks (OrmQ) are not always acknowledged #545

Closed

kennyhei opened this issue Apr 29, 2021 · 8 comments

@kennyhei
Contributor

Here's our current config; we are using Django 2.2.16:

VERSION: 1.3.4
ACK_FAILURES: True
BULK: 1
CACHE: default
CACHED: False
CATCH_UP: True
COMPRESSED: False
CPU_AFFINITY: 0
DAEMONIZE_WORKERS: True
DISQUE_FASTACK: False
GUARD_CYCLE: 0.5
LABEL: Django Q
LOG_LEVEL: INFO
MAX_ATTEMPTS: 0
ORM: default
POLL: 0.2
PREFIX: DjangORM
QSIZE: True
QUEUE_LIMIT: 50
Q_STAT: django_q:DjangORM:cluster
RECYCLE: 500
REDIS: {}
RETRY: 2147483647
SAVE_LIMIT: 10000
SCHEDULER: True
SYNC: False
TESTING: False
TIMEOUT: 3300
WORKERS: 4
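
For reference, the Q_CLUSTER dict behind that dump should look roughly like the sketch below. The key names are the lowercase option names from the django-q configuration docs; mapping the DjangORM prefix to "name" is my assumption, and only a subset of the options is shown.

    # settings.py -- rough reconstruction of the config dump above (subset only)
    Q_CLUSTER = {
        "name": "DjangORM",        # shows up as the PREFIX and in Q_STAT
        "orm": "default",          # use the Django ORM as the broker
        "workers": 4,
        "recycle": 500,
        "timeout": 3300,
        "retry": 2147483647,
        "queue_limit": 50,
        "bulk": 1,
        "save_limit": 10000,
        "ack_failures": True,
        "max_attempts": 0,
        "catch_up": True,
        "poll": 0.2,
    }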

We have 3 clusters and 12 workers in total. QInfo:
[screenshot: qinfo output, 2021-04-29]

QMonitor:
[screenshot: qmonitor output, 2021-04-28]

Sometimes queued tasks are not acknowledged and the corresponding Task instance (from the OrmQ payload) does not exist. This seems to happen mostly with long-running tasks (200-400 seconds). TIMEOUT should be large enough, and we get no errors from the workers (we use Sentry for error reporting). Any ideas? Also, even though SAVE_LIMIT is set to 10000, the limit doesn't always hold, i.e. this part seems to be ignored sometimes:

        if task["success"] and 0 < Conf.SAVE_LIMIT <= Success.objects.count():
            Success.objects.last().delete()

As you can see from qinfo, at the moment there are 10485 successful tasks in the database.

@kennyhei
Contributor Author

kennyhei commented Apr 29, 2021

We have fairly frequent deploys; we'll first check how the numbers look if we don't deploy for a day. We suspect that might be the culprit.

@Koed00
Owner

Koed00 commented Apr 29, 2021 via email

@kennyhei
Contributor Author

kennyhei commented May 4, 2021

Thanks, we'll look into that. I came up with a fix for the SAVE_LIMIT problem in a multi-cluster environment by locking the row with select_for_update when the oldest task is deleted. Without the lock, two clusters can pass the count check at the same time and target the same row, so only one old success gets deleted while two new ones are saved, and the table creeps past the limit:

# replaces the snippet above; `db` is django_q's module-level `from django import db`
with db.transaction.atomic():
    # lock the Success row being trimmed so concurrent clusters serialize the delete
    last = Success.objects.select_for_update().last()
    if task["success"] and 0 < Conf.SAVE_LIMIT <= Success.objects.count():
        last.delete()

Related to #225. Comments? Should I make a PR? @Koed00
Next I'll try to see what's up with the queued tasks getting stuck; maybe it has something to do with multiple clusters as well.

@kennyhei
Contributor Author

kennyhei commented May 7, 2021

Update to the issue with tasks getting stuck:

I created a qmemory command for monitoring memory usage.

[screenshot: qmemory output, 2021-05-07]

The lowest available memory for one of the clusters was around 2%, which is pretty low. One of the clusters also has only 15 minutes of uptime while the others have about 6 hours (since the last deployment). We also have one specific task that gets stuck in the queue more often than the others; during execution it reads a big file, so it uses a lot more memory than the other tasks.

This seems very promising. I'll test locally whether I can recreate this issue by running Django Q out of memory 🙂 If that's the case, I'll probably have to lower the RECYCLE value or set MAX_RSS.
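
In case it helps anyone else, the idea behind qmemory is simply "report how much memory the host and the qcluster processes are using". A minimal sketch of that idea is below; it uses psutil directly, which is my own choice for the sketch and not necessarily how the actual command is implemented, and the command/file name is made up.

    # yourapp/management/commands/qmemory_sketch.py -- illustrative only
    import psutil
    from django.core.management.base import BaseCommand


    class Command(BaseCommand):
        help = "Print host memory headroom and memory usage of qcluster processes"

        def handle(self, *args, **options):
            mem = psutil.virtual_memory()
            self.stdout.write(f"available memory: {mem.available / mem.total:.1%}")
            # walk all processes and report the ones started via `manage.py qcluster`
            for proc in psutil.process_iter(["pid", "memory_percent", "cmdline"]):
                cmdline = " ".join(proc.info["cmdline"] or [])
                if "qcluster" in cmdline:
                    self.stdout.write(
                        f"pid={proc.info['pid']} uses {proc.info['memory_percent']:.1f}% of RAM"
                    )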

@Koed00
Owner

Koed00 commented May 7, 2021

@kennyhei great to see you made so much progress.

If you want to make PRs for this, that would be cool.

@kennyhei
Contributor Author

kennyhei commented May 7, 2021

@Koed00 Created two PRs, one for qmemory and one for the SAVE_LIMIT bug fix.

@kennyhei
Contributor Author

kennyhei commented May 8, 2021

We haven't encountered any problems after lowering the recycle setting so that memory is released more frequently. Going to close this issue.
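
For reference, the change amounts to something like this in Q_CLUSTER (the exact numbers below are illustrative, not a recommendation):

    # settings.py -- illustrative values only
    Q_CLUSTER = {
        # ... the existing options from the config dump above ...
        "recycle": 100,     # was 500: a worker is now replaced after 100 tasks, releasing its memory sooner
        "max_rss": 100000,  # optional alternative: recycle a worker once its resident memory exceeds this (see the django-q docs for the unit)
    }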

@kennyhei kennyhei closed this as completed May 8, 2021
@Koed00
Owner

Koed00 commented May 8, 2021

@kennyhei already merged the limit fix, but I need a bit more time to review the qmemory PR. Thanks for the work, I'm sure you helped out a bunch of other people.
