-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handling redis lock timeouts #77
Comments
Michael Bayer (zzzeek) wrote: well to get into the release() you'd wrap the RedisLock with some kind of wrapper. Any chance this should really be an upstream option in the redis package anyway? |
jvanasco (jvanasco) wrote: I'm asking them to consider a config option for suppressing LockRelease errors, if they're open then it should be easier to pass in that value via dogpile.cache. My thinking is that while it makes more sense to handle this within dogpile.core or other caching libraries... looking at the 'plugin architecture' of most python caching libraries, it will be a pain to implement a wrapper across the board. |
jvanasco (jvanasco) wrote: Coming back to this... I had filed a ticket upstream in The more I think on this, the more I think I can generate a patch where the redis backend uses a wrapped class to handle this. Would it be a good idea to document a Another approach (which I'm not in favor of, but just throwing out) would be extending the region to support a list of acceptable lock release error... and then patching |
jvanasco (jvanasco) wrote: Potential PR: https://bitbucket.org/zzzeek/dogpile.cache/pull-requests/43/first-attempt-at-suppressing-distributed/diff |
Michael Bayer (zzzeek) wrote: why wouldn't it be an error if my cache generation function took longer than my lock was set up to last for? |
jvanasco (jvanasco) wrote: Before passing judgement on it, let's call the a "lock release error" in a distributed environment a "situation" for a moment... This "situation" can happen from 2 events:
The first event is unavoidable, and leads do massive amounts of lock release errors. The second event is awkward and hard to avoid. If you set arbitrarily long timeouts or none at all, you can jam workers indefinitely. A workable solution takes finding the right values and regions for timeouts, and balancing out bottlenecks. In both of these causing events, the computed data has already been generated and sent to the cloud store -- so the tool that requested the data could service it's request. Keep in mind, dogpile doesn't check if there is a lock before "creating" the data (doing so would be inefficient). So a new question: what sort of "situation" is this:
If we raise an exception, we'd have to do quite a bit of work to handle this gracefully. Accessing the data would require a second call to the cache, which means the existing caching @decorators can't be used reliably; dealing with direct region get/get_create/multis is not easily handled either. The chances of encountering these release errors on a distributed store is pretty high. In practice, one just wouldn't care aside from seeing it in a log. The current design ensures we raise an exception (which is "informational" at best) in this situation, and makes it very awkward to catch that exception. While the creation function did generate usable data (and save it), it basically disappears. So the "situation" is either a common error that someone who hasn't encountered it yet should lean about, or an annoying inherency of the datastore that people eventually despise for forcing fatal exceptions that could have resulted in successful responses. |
Michael Bayer (zzzeek) wrote: it sounds like you're just looking to have any number of processes just jump on the creation of the object if either of these events happens, e.g. the dogpile lock would just fail because it vanishes. Which seems a little weird, if a creator is taking too long, you want new processes to keep starting up new creations every N seconds, now it's even slower ? |
jvanasco (jvanasco) wrote: Yes and No. I'm not concerned with subsequent processes handling additional creations, because that will always happen regardless of the Exception bubbling up. I am concerned with the first process not raising an Exception after it has successfully generated (and set) a new value. The most important thing to me -- at least in the context of the Assume we're in a production environment and a distributed cache becomes invalidated. Every client that was midway through a Assuming we have a lock timeout, again there is no value-add to the situation over simply logging this error. Someone might want to raise exceptions in development to discover these situations, but for production and staging use there is no tangible benefit to raising an error. Dogpile has the target data, and we lose it with the raised Exception; properly handling this would require catching the error and requesting the data again from dogpile. In the worst-case scenario of constant load over a particular key, the lack of a lock means subsequent processes are instantiated every N seconds -- but only until the first value is created. This would make things potentially slower, but in a production environment this would happen whether or not the lock error was suppressed. If anything, this potentially avoids another error because someone would not have to request the same resource again -- potentially triggering another lock+creation cycle. In an ideal situation we would re-acquire a new lock if we're still "creating" and the lock disappeared -- but that would be complex outside of something like twisted or gevent. In a more ideal situation, we wouldn't be losing locks or have backends that are prone to losing locks -- but life isn't perfect. |
Michael Bayer (zzzeek) wrote:
IMO as it should, the cache is down totally, if the lock release exception is silently squashed then the "put" would fail too. if the cache is back, if i were engineering this to really handle this (which obviously right now, we don't), I'd want a retry system in place that will take that generated value and try the whole thing again - keeping in mind that another generator may have put a value in there already. because if the lock is gone, you have no idea what happened. This aspect may very well benefit from new logic inside the region code, but I haven't looked into this.
well you'd like it to gracefully handle that the locking system went down totally. If you want to squash the exception, you can use your custom lock thing on your end with a hook on ours, so the discussion here is only for informational purposes. But I'd look into more robust ways of handling that if I had to deal with that problem.
right, you're willing to tolerate a dogpile situation temporarily. The thing is, getting this lock to work was so fucking hard, I've spent years on this process from back when I did it in myghty and beaker and then trying to generalize it on dogpile, also running beaker in a a few production systems for a couple of years and watching all the weird things that happened over there, when presented with the idea of, "what if we just make it that if the lock itself just vanishes, who cares?" , I have no idea what effects that will cause. When i dealt with this on a very loaded production system, basically restarting the cache was a recipe for doom. I hope to have the task of building up a redis cache environment someday so I can come up with a solution to this problem. |
jvanasco (jvanasco) wrote:
Bubbling up an exception on the PUT makes sense - the action failed and the system is currently down. IIRC, the logic is currently:
On the
I fundamentally agree with this, but I think it can be messy to implement. Assuming a release error, there are 3 possible situations:
If there is a newer lock, do we just let it be -- and return the value? If there is an existing value, do we reset it? (even if the inner payload is unique, the cached metadata with expiretime could need a reset) The value could have been created via a a direct
Well, this is a intermediary solution to a complex problem that will need a lot of thought and discussion.
"Restarting the cache" took down Facebook and handful of other big sites a few times. When I was at IAC, I discovered a long-present repeated downtime was from a caching layer failing and the site getting DDOS'd under certain conditions. So I fully appreciate and need the utility of the lock. My use of "who cares?" is not a comment on the utility of the lock, but identifying the stakeholders. If the locking system goes down (which is highly likely on a distributed backend like redis), what should we do? Depending on the environment and project, the stakeholders and solutions will be different. Redis is a great tool, with a lot of peculiarities. I think you'd understand a lot of my logged issues and ideas after a deep-dive into it on production. It's very stable -- but since it's entirely memory based, you become very concerned with size and performance. You have to set redis-based timeouts for keys and locks, otherwise you easily run out of memory or jam your applications. |
jvanasco (jvanasco) wrote:
|
erowan1 () wrote: Hello, I am running into this issue using a nested dogpile cache_on_arguments decorator (one cache function calls another). The program is multi-threaded. Would using TLS by exposing the thread_local arg in the redis client lock in the redis backend config perhaps fix it? The thread_local arg defaults to False.
Cheers, Rowan |
jvanasco (jvanasco) wrote: It's hard to tell. I've gotten the same error from lock timeouts on the server:
It looks like the issue in that PR is talking about a multi-threaded process that shares locks (which may or may not be the case). You can try extending the length of the lock timeouts to something ridiculously long as a test. I ended up writing an alternative backend: https://github.com/jvanasco/dogpile_backend_redis_advanced It has a two features that might help you:
|
erowan1 () wrote: I am going to split out the caching functions so they are not nested to see if that helps but using the redis TLS lock option sounds like a good addition to the dogpile redis backend. I'll take a look at your code if I am still in trouble. Thanks. |
jvanasco (jvanasco) wrote: We've moved to a custom Redis backend, as mentioned above. |
Changes by jvanasco (jvanasco):
|
Changes by jvanasco (jvanasco):
|
Migrated issue, originally created by jvanasco (jvanasco)
I've run into a situation with the
distributed_lock
in the redis backend (ie, locks that redis manages).I realized through a few Keys which are backed by
create
functions that take too long or fail, that alock_timeout
basically must be set – otherwise you can jam your application by setting an eternal lock for a key that is never set.So i set the lock_timeout, and that solves one problem, but creates another – if the process ends up completing after the lock_timeout, a LockError is raised in the redis package's lock.py
Lock.do_release
method.For context, this is happening in a webapp and there can be a variable time on the lock_timeout as it can be caused due to a slow-connection, dropped connection, process killed by a long-process monitor or something else. Since this is only being used for caching, I opted to expire the redis lock using the shortest possible time.
A hopeful fix would have been to configure the dogpile redis backend to ignore this particular lock error. However, looking at the traceback and source, this looks to be really tricky.
In order to keep the value AND suppress the timeout, the config data in dogpile.cache would need to be used in dogpile.core. that is a lot of updating.
another approach might be catching this particular lock error in region, and then immediately getting the value again (which should not have an error). that too looks tricky though.
does anyone have suggestions for an elegant fix?
The text was updated successfully, but these errors were encountered: