NotAcquired with longer celery jobs with short expire and auto_renewal=True #85
Comments
There is a case I just saw where the sequence

```python
if lock.locked():
    lock.release()
```

still led to the NotAcquired exception, with the context manager looking like this:

```python
@contextmanager
def lock_or_fail(key, expire=60):
    import redis_lock
    lock = redis_lock.Lock(get_redis(), name=key, expire=expire, auto_renewal=True)
    if lock.acquire(blocking=False):
        try:
            yield
        finally:
            if lock.locked():
                lock.release()
    else:
        raise Ignore(f'Could not acquire lock for {key}')
```

It seems there are cases where auto_renewal does not work at all. Could system load play a role here?
Regarding system load: the server time of the error is Wed, 17 Mar 2021 04:37:42 +0100. A vmstat log from around that time shows around 4 to 6 processes waiting for IO while a redis persist took unusually long.

vmstat:

redis.log:

While rabbitmq has no entries around that time, the celery master worker reports:

So it's safe to say that things took unusually long. I don't know if redis is still responsive while saving; if not, this would definitely explain why the lock could not be renewed in time. (And it could also explain a very long time between the call to `lock.locked()` and the call to `lock.release()`.)

So my guess would be that this is most likely not a bug in python-redis-lock, as things just take too long... (And that wouldn't be a problem in production for me either.) Feel free to close if you agree with that last assessment. I'm going to set higher expiry times, which should also work in production. I don't want to work around too much of this in production code, so I'm going to live with an occasional occurrence of this on my development VMs.
What sort of work do you do in the tasks? The lock auto-renewal runs in a thread, so if the main thread holds the GIL with something, it could fail to renew, yes. But that seems so unlikely... Maybe there's some connection management issue (e.g. multiple things going on on the same connection instance, hence the occasional failure)? Does `get_redis()` use a connection pool?
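For readers unfamiliar with the mechanism: a rough conceptual sketch of what lock auto-renewal in a background thread amounts to. This is not python-redis-lock's actual implementation, only an illustration of why a starved or GIL-blocked worker can let the key expire; `start_auto_renewal` and its behaviour are assumptions made up for this sketch.

```python
import threading
import redis

def start_auto_renewal(client: redis.StrictRedis, key: str, expire: int) -> threading.Event:
    """Push the lock key's TTL forward every ~2/3 of the expiry interval."""
    stop = threading.Event()

    def renew():
        # Event.wait() returns False on timeout, True once stop is set.
        # If this thread does not get scheduled for longer than `expire`
        # (heavy IO wait, swapping, a C call holding the GIL), the key
        # expires even though the task is still running.
        while not stop.wait(expire * 2 / 3):
            client.pexpire(key, int(expire * 1000))

    threading.Thread(target=renew, daemon=True).start()
    return stop  # caller sets this event to stop renewing
```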
The tasks in question do lots of database queries (mostly selects, but some updates as well) in a transaction (django) and dump the contents as gzipped JSON. The database layer is psycopg2. I think that this doesn't hold the GIL, but I don't know for sure. In any case, increasing the expire timeout has helped for those tasks. (I didn't even have to get close to the running time; it was enough to raise it to something short of 5 minutes.) Also, I'm very sure that multiple processes were blocking because of IO (swap because of tight memory conditions, or network). Since I've moved the VM to a different host with more memory I haven't seen the exceptions any more.

In the expected case it does use the connection pool from celery/kombu (in the fallback case it won't). If I interpret https://docs.celeryproject.org/projects/kombu/en/stable/userguide/pools.html correctly, there is a pre-set limit on the number of connections in the pool:

```python
In [1]: from kombu import pools

In [2]: pools.get_limit()
Out[2]: 200
```

However, we never have as many as 200 workers running and there are no other clients on the redis server, so this should be more than enough. I hope that answers your second question?
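For context, a hypothetical `get_redis()` along the lines described: reuse the Celery result backend's redis client when available and fall back to a standalone client otherwise. The attribute access and the fallback are assumptions for illustration, not the author's actual code.

```python
from celery import current_app
import redis

def get_redis():
    # Expected case: the redis result backend exposes its client,
    # a StrictRedis instance backed by the backend's connection pool.
    client = getattr(current_app.backend, 'client', None)
    if client is not None:
        return client
    # Fallback case (assumed): a standalone client with its own connection handling.
    return redis.StrictRedis(host='localhost', port=6379, db=0)
```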
psycopg2 may in fact have a problem with its default waiter implementation, see: https://www.psycopg.org/docs/extensions.html#psycopg2.extensions.set_wait_callback
I've experienced problems with the default waiter in one project (those time limit signals weren't handled). Pretty sure psycopg2 holds the GIL with the default waiter :-)
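A minimal sketch of the suggested workaround, using psycopg2's documented `set_wait_callback` and `wait_select`; whether it actually changes the GIL behaviour in this particular setup is exactly what is being speculated about above.

```python
import psycopg2.extensions
import psycopg2.extras

# Replace the default C-level blocking wait with a select()-based wait
# implemented in Python. Per the psycopg2 docs this makes long-running
# queries interruptible (signals get handled); the suggestion above is
# that it may also keep the main thread from sitting in a C call that
# starves other Python threads, such as the lock-renewal thread.
psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)
```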
Wow, thanks for the hint! I'll see if those errors resurface and consider that workaround if they do.

Edit: In fact the documentation mentions Eventlet, and I was considering switching to that (currently we use the multiprocessing backend), and it definitely looks like that might help if I do. (Currently there's no demand for that though, and the loads look like Eventlet might not help much.)
Hi!
I've (sort of) successfully implemented an alternative locking mechanism for celery tasks (see https://docs.celeryproject.org/en/3.1/tutorials/task-cookbook.html#cookbook-task-serial ) with python-redis-lock and it works like a charm (again: sort of), except that I was getting some NotAcquired exceptions from tasks that run longer (a couple of minutes and in some cases hours).
Note that I'm not sure if I merely need assistance or if this is an actual bug.
CentOS 7 (core)
linux kernel 3.10.0
python-redis-lock: 3.7.0
redis: 3.2.12
python 3.6.8
django 2.2.14
celery 3.1.18
I've implemented a context manager that acquires the locks:
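(The code block is missing here; judging by the version quoted in the comment further up, it presumably looked roughly like the sketch below, without the `lock.locked()` check that is only mentioned later as a workaround. `get_redis()` is the author's own helper.)

```python
from contextlib import contextmanager

from celery.exceptions import Ignore
import redis_lock

@contextmanager
def lock_or_fail(key, expire=60):
    # get_redis() is assumed to return the redis client of the celery result backend.
    lock = redis_lock.Lock(get_redis(), name=key, expire=expire, auto_renewal=True)
    if lock.acquire(blocking=False):
        try:
            yield
        finally:
            lock.release()
    else:
        raise Ignore(f'Could not acquire lock for {key}')
```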
(On my setup, `get_redis()` always returns the celery result backend, which happens to be an instance of `StrictRedis`, if that might be an issue.)

A celery task uses this as follows:
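(This code block is also missing; a hypothetical task using the helper could look like the following. `foo_task` and `do_export` are made-up names, and the lock key format is modelled on the one in the exception message below.)

```python
from celery import shared_task

@shared_task(bind=True)
def foo_task(self, arg_hash_1, arg_hash_2):
    # Only one task instance per key runs at a time; a concurrent instance
    # raises Ignore inside lock_or_fail and is dropped by celery.
    with lock_or_fail(f'foo_task_{arg_hash_1}_{arg_hash_2}', expire=60):
        do_export()  # the actual long-running work
```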
This mostly works; however, I have some tasks that can run for a couple of minutes and (very few) tasks that can run for hours.

For those I got a couple of NotAcquired exceptions: "Lock foo_task_<md5_1>_<md5_2> is not acquired or it already expired." which are raised from the `release()` call above.

I didn't figure out whether the exceptions were raised while the machine was under high load or tight memory conditions. It's running on a VM which recently has been quite laggy. (I didn't figure out why yet, either.) Redis runs on the same host, however.
I think using the Lock as a context manager (as in https://python-redis-lock.readthedocs.io/en/latest/readme.html#troubleshooting ) probably won't help here, since it's basically the same code as in my context manager.
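For reference, the pattern from the troubleshooting section being referred to is the Lock object used directly in a `with` statement, roughly like this (placeholder names as in the sketches above):

```python
# Acquires on entering the block (blocking) and releases on exit, which is
# essentially what the lock_or_fail wrapper already does.
with redis_lock.Lock(get_redis(), name=key, expire=60, auto_renewal=True):
    do_export()
```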
As a workaround I'll probably check the lock with `lock.locked()`, but I'm wondering if that's the right thing here.

Should I raise the expiry time so it's closer to the order of the expected runtime of the task? Would `signal_expire` help here?

Did I make some other mistake you can spot? Any information I left out or that I need to check?
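In parameter terms, the two knobs being asked about would look like this; the values are made up, and per the docs `signal_expire` is an advanced option for the expiration (in milliseconds) of the signal list used to notify blocked waiters on release, so it governs something different from how long the lock itself lives.

```python
lock = redis_lock.Lock(
    get_redis(),
    name=key,
    expire=300,           # closer to the expected runtime, more slack for renewal
    auto_renewal=True,
    signal_expire=5000,   # ms; expiration of the release-signal list (per the docs)
)
```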
Best regards