
Queued tasks (OrmQ) are not always acknowledged #545

Closed

kennyhei opened this issue Apr 29, 2021 · 8 comments

@kennyhei
Contributor

Here's our current config; we are using Django 2.2.16:

VERSION: 1.3.4
ACK_FAILURES: True
BULK: 1
CACHE: default
CACHED: False
CATCH_UP: True
COMPRESSED: False
CPU_AFFINITY: 0
DAEMONIZE_WORKERS: True
DISQUE_FASTACK: False
GUARD_CYCLE: 0.5
LABEL: Django Q
LOG_LEVEL: INFO
MAX_ATTEMPTS: 0
ORM: default
POLL: 0.2
PREFIX: DjangORM
QSIZE: True
QUEUE_LIMIT: 50
Q_STAT: django_q:DjangORM:cluster
RECYCLE: 500
REDIS: {}
RETRY: 2147483647
SAVE_LIMIT: 10000
SCHEDULER: True
SYNC: False
TESTING: False
TIMEOUT: 3300
WORKERS: 4
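
For reference, the Q_CLUSTER dict behind that dump should look roughly like the sketch below. The key names are the lowercase option names from the django-q configuration docs; mapping the DjangORM prefix to "name" is my assumption, and only a subset of the options is shown.

    # settings.py -- rough reconstruction of the config dump above (subset only)
    Q_CLUSTER = {
        "name": "DjangORM",        # shows up as the PREFIX and in Q_STAT
        "orm": "default",          # use the Django ORM as the broker
        "workers": 4,
        "recycle": 500,
        "timeout": 3300,
        "retry": 2147483647,
        "queue_limit": 50,
        "bulk": 1,
        "save_limit": 10000,
        "ack_failures": True,
        "max_attempts": 0,
        "catch_up": True,
        "poll": 0.2,
    }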

We have 3 clusters and 12 workers in total. QInfo:
[screenshot: qinfo output, 2021-04-29]

QMonitor:
[screenshot: qmonitor output, 2021-04-28]

Sometimes queued tasks are not acknowledged and the corresponding Task instance (from the OrmQ payload) does not exist. This seems to happen mostly with long-running tasks (200-400 seconds). TIMEOUT should be large enough, and we get no errors from the workers (we use Sentry for error reporting). Any ideas? Also, even though SAVE_LIMIT is set to 10000, the limit doesn't always hold, i.e. this part seems to be ignored sometimes:

        if task["success"] and 0 < Conf.SAVE_LIMIT <= Success.objects.count():
            Success.objects.last().delete()

As you can see from qinfo, at the moment there are 10485 successful tasks in the database.

@kennyhei
Contributor Author

kennyhei commented Apr 29, 2021

We have fairly frequent deploys; we'll first check how the numbers look if we don't deploy for a day. We suspect that might be the culprit.

@Koed00
Owner

Koed00 commented Apr 29, 2021 via email

@kennyhei
Contributor Author

kennyhei commented May 4, 2021

Thanks, we'll look into that. I came up with a fix for the SAVE_LIMIT problem in a multi-cluster environment by locking the row with select_for_update when the oldest task is deleted. Without the lock, two clusters can pass the count check at the same time and target the same row, so only one old success gets deleted while two new ones are saved, and the table creeps past the limit:

# replaces the snippet above; `db` is django_q's module-level `from django import db`
with db.transaction.atomic():
    # lock the Success row being trimmed so concurrent clusters serialize the delete
    last = Success.objects.select_for_update().last()
    if task["success"] and 0 < Conf.SAVE_LIMIT <= Success.objects.count():
        last.delete()

Related to #225. Comments? Should I make a PR? @Koed00
Next I'll try to see what's up with the queued tasks getting stuck; maybe it has something to do with multiple clusters as well.

@kennyhei
Contributor Author

kennyhei commented May 7, 2021

Update to the issue with tasks getting stuck:

I created a qmemory command for monitoring memory usage.

[screenshot: qmemory output, 2021-05-07]

The lowest available memory for one of the clusters was around 2%, which is pretty low. One of the clusters also has only 15 minutes of uptime while the others have about 6 hours (since the last deployment). We also have one specific task that gets stuck in the queue more often than the others; during execution it reads a big file, so it uses a lot more memory than the other tasks.

This seems very promising. I'll test locally whether I can recreate this issue by running Django Q out of memory 🙂 If that's the case, I'll probably have to lower the RECYCLE value or set MAX_RSS.
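
In case it helps anyone else, the idea behind qmemory is simply "report how much memory the host and the qcluster processes are using". A minimal sketch of that idea is below; it uses psutil directly, which is my own choice for the sketch and not necessarily how the actual command is implemented, and the command/file name is made up.

    # yourapp/management/commands/qmemory_sketch.py -- illustrative only
    import psutil
    from django.core.management.base import BaseCommand


    class Command(BaseCommand):
        help = "Print host memory headroom and memory usage of qcluster processes"

        def handle(self, *args, **options):
            mem = psutil.virtual_memory()
            self.stdout.write(f"available memory: {mem.available / mem.total:.1%}")
            # walk all processes and report the ones started via `manage.py qcluster`
            for proc in psutil.process_iter(["pid", "memory_percent", "cmdline"]):
                cmdline = " ".join(proc.info["cmdline"] or [])
                if "qcluster" in cmdline:
                    self.stdout.write(
                        f"pid={proc.info['pid']} uses {proc.info['memory_percent']:.1f}% of RAM"
                    )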

@Koed00
Owner

Koed00 commented May 7, 2021

@kennyhei great to see you made so much progress.

If you want to make PRs for this, that would be cool.

@kennyhei
Contributor Author

kennyhei commented May 7, 2021

@Koed00 Created two PRs, one for qmemory and one for the SAVE_LIMIT bug fix.

@kennyhei
Contributor Author

kennyhei commented May 8, 2021

We haven't encountered any problems after lowering the recycle setting so that memory is released more frequently. Going to close this issue.
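
For reference, the change amounts to something like this in Q_CLUSTER (the exact numbers below are illustrative, not a recommendation):

    # settings.py -- illustrative values only
    Q_CLUSTER = {
        # ... the existing options from the config dump above ...
        "recycle": 100,     # was 500: a worker is now replaced after 100 tasks, releasing its memory sooner
        "max_rss": 100000,  # optional alternative: recycle a worker once its resident memory exceeds this (see the django-q docs for the unit)
    }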

@kennyhei kennyhei closed this as completed May 8, 2021
@Koed00
Owner

Koed00 commented May 8, 2021

@kennyhei already merged the limit fix, but I need a bit more time to review the qmemory PR. Thanks for the work, I'm sure you helped out a bunch of other people.
