BaseException at I/O corrupts Connection #2499
Digging through history:
Update: After considering, I've realized this is fundamentally unsafe to continue using the socket, since we don't know how much data was sent by
I need to better understand the motivation of #2104 since it's directly at odds with this issue. It appears to focus on PubSub. Since PubSub makes the connection receive ad-hoc responses (in addition to responses to commands), a disconnection probably causes a brief data loss. Since PubSub responses are designed to be unambiguous — the response is a list whose first element indicates the message type, e.g.
Generally you want to avoid logic on BaseException; typically what happens is that the exception is simply allowed to propagate. I think that the motivation behind the change in the synchronous API was to better reflect that coding practice: leave BaseExceptions in place, because they usually should be left alone. Handlers then do the right thing at the place where the error is caught. Normally, I'd suggest you catch the
This is simply not true and does not reflect a sensible coding practice. A
You seem to be missing the point of the exception handler in read_response. If you had actually taken the time to look at the history of the code you were modifying in #2104 – as @ikonst did – you would already know all of this.
You cannot "deal with [this] at a higher level", because you now lack the necessary information to understand what happened. There is no way to know from the outside that the connection is garbage. To fix this, instead of closing the connection, the exception handler could set a flag marking the connection as broken. In #2103 you wrote:
which is wishing for the impossible. You interrupted an operation while it was executing – unless you have a transactional system like in a database, you can't go back to the "state that it was". I understand your use-case in #2104 – since the connection is mostly idle during pubsub, you don't think you'll interrupt an ongoing transmission, but assume that the timeout will happen while the connection is idle. But:
Your whole approach is broken: you can't just use an external package to force timeout exceptions into network protocols and assume that connections will remain in a usable state. redis-py needs to handle the timeout itself. I just had a look at the API here, and you will find that it does exactly that: the get_message method of PubSub takes a timeout parameter. That is what you should have been using all along instead of breaking the low-level implementation of the Connection class.
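For reference, that timeout-aware polling looks roughly like this (a minimal sketch assuming a local Redis server and a channel named "updates"):

```python
import redis

r = redis.Redis()           # assumes Redis on localhost:6379
p = r.pubsub()
p.subscribe("updates")

while True:
    # Let redis-py bound the wait itself instead of injecting an external
    # timeout exception into its socket I/O.
    message = p.get_message(timeout=1.0)
    if message is None:
        continue            # nothing arrived within one second
    if message["type"] == "message":
        print(message["data"])
```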
I agree that exception handling is sensitive for BaseException. Also, I'm not sure if we can rely on docs for best practices, but asyncio.CancelledError says "In almost all situations the exception must be re-raised." (not "never catch this").
This is unfortunately not possible, since the application code has a
If the application doesn't have the connection, someone else has. This code exists on the
Did you actually read what I wrote? The connection is the object that gets irreparably broken, which only the object itself can know. Why should the object not "decide" that? This is not about timeouts. This is about low-level networking operations failing – in your case, because you forced a timeout exception into them.
A forced timeout is not a failing network exception. It is resumable, even under gevent. The connection is still in a perfectly valid state. Recall, a
In only the first case must it be closed and flushed if a full command-response is not performed. This is not something which the Connection itself should be deciding; it is for whichever of the layers above is using the connection. Having said that, there are actually two different timeouts in operation in Redis. One is the "socket timeout", which is applied to each operation and usually considered in the same way as other socket errors, i.e. "something happened on the network and we should just consider it as fatal". Then there is the user "timeout", which is a single-operation timeout, applied to each operation from the API, but for the user to decide how to respond to. This does complicate things. In the async code I have cleaned up the handling of these two so that they don't interfere with each other and are processed in a cleaner way.
This is not correct. We are also not really talking about "timeouts" – the code you changed had nothing to do with timeouts. What you are claiming (with your words and your code change) is: "Any
This is not correct. I do not see any connection pools in the code example I provided (3 years ago) in #1128, do you?
Just to be very clear here: You forcing exceptions into the middle of IO operations is not a timeout. It is simply an unexpected exception. But this discussion doesn't really seem to go anywhere. Can you maybe instead just answer two simple questions:
There is a very aggressive tone in this thread which makes me disinclined to respond. However, please allow me to answer your comments in turn in a calm voice.
This regression should, of course, have been caught at the time of my PR, and probably would have been if unittest coverage had been adequate. There aren't many regression tests present. The changes were made in good faith. If you would kindly provide me with a traceback which demonstrates your error, I'll be happy to create a PR, with regression tests, as appropriate. The call stack would help guide me to make the correct design. Please put a breakpoint at the place where previously there would have been a "disconnect()" call. We should, of course, ensure that the software works as intended, but it is important to do so in an architecturally sound way. The logic of a
Cheers!
@kristjanvalur, agreed. I also noticed the aggressive tone (e.g. "Your whole approach is broken", "this discussion doesn't really seem to go anywhere"). It understandably comes out of frustration over this issue being reintroduced twice (in 3.0 and in 4.2), but let's assume best intentions, good faith, and keep looking forward. Overall not a lot of people actually contribute to open source.
I'm adding a regression test in #2505. Let's get it merged before we address the problem.
FWIW in event-loop async frameworks like
If what I say above ☝️ is right and timeouts are timeouts, then fundamentally some operations are safe to interrupt and some aren't. In effect, this reinforces @kristjanvalur's point that the knowledge of whether something is interruptible belongs to a higher layer.
When you changed the exception handler, did it not seem strange to you that someone would handle a BaseException there? But @ikonst is of course correct – me airing my grievances over this by participating in this discussion in an overly aggressive tone is not a productive approach. I apologize for that. In case @ikonst's regression test in #2500 doesn't clarify the situation:
The
That's the traceback:
The disconnect call used to be here. A good argument could be made that it is actually not enough, as an exception injected into
First, let's agree that, all things being equal, we'd rather execute the least code possible to handle a BaseException. The RESP protocol, while elegantly simple, doesn't seem to address cancellations. It's like the "touch-move" rule in chess. When a
(This is all made complicated by the fact we use a connection pool.) If we chose to add an "expected responses" counter, it'd have to be per-connection rather than a single counter on the client.

As for PubSub, the situation there is a bit better:
P.S. I went through this with the assumption of interruption points being I/O & yields (in gevent) or awaits (in asyncio). I'm sure a
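A rough sketch of that per-connection "expected responses" bookkeeping (a hypothetical wrapper, not redis-py's actual internals; it only assumes the existing send_command/read_response/disconnect methods on a connection):

```python
class TrackedConnection:
    """Hypothetical wrapper counting replies the server still owes us."""

    def __init__(self, conn):
        self._conn = conn
        self._pending = 0          # commands sent whose replies weren't read

    def send_command(self, *args):
        # Increment first: an interrupted send also leaves the wire in an
        # unknown state, so the connection must not be silently reused.
        self._pending += 1
        self._conn.send_command(*args)

    def read_response(self):
        resp = self._conn.read_response()
        self._pending -= 1
        return resp

    def release(self):
        # Only a connection with no outstanding replies is safe to hand back
        # to the pool; anything else would serve stale replies to the next user.
        if self._pending:
            self._conn.disconnect()
            self._pending = 0
```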
What I don't understand about all of this is why there should be special handling to make For the use-case of not blocking indefinitely on
IMO
It's true that you can pass a timeout to get_message, but in our case the calling code looks more like:

```python
with Timeout(worker_timeout_sec):
    do_some_work()
    ...
    message = pubsub.get_message()
    ...
```

Saying that one must use
Are you sure about that? Both For gevent I had to look deeper. Having never used it, I assume gevent will only raise the
This is read in a loop, with multiple IO operations, which can thus be interrupted in the middle. So it's actually worse than I thought: it will not only drop the message, but actually also corrupt the connection. The asyncio implementation is the same.
You say "corrupt the connection" a lot. This is only true in the particular use case seen in the "send command, read response" pattern implemented by the Connection. Much of the work I recently did with async IO was in ensuring that all of the code was interruptible, so that timeouts could be safely handled at the highest possible level, and without littering lower-level code with various explicit timeout handlers. This is the way things are usually done with async, and incidentally, also how gevent and Stackless Python do things. I am one of the core developers of Stackless, and very familiar with interruptible IO. Now I fully accept that in the process I may have caused things to break. Such is the way of software development. And I'm happy to try to fix them. But I'm sure you understand if I prefer to fix them in "a better" way if possible, one which is friendly to all use cases and doesn't result in unnecessary coupling between layers.
Corrupt is not the right word – the correct word is "desynchronized", and once the connection (or, more precisely, the protocol) is in that state it is not usable by anyone anymore. There are protocols that allow resynchronization – a popular example is UTF-8 – but the redis protocol does not have that feature. Let's take your usage example with pubsub: imagine someone else sends a message while a reply is being read and the read is interrupted partway through. The stream now contains the unread remainder of that reply, followed by everything that arrives later. Do you not agree that this is a fatal problem and should never be allowed to happen?
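To make the desynchronization concrete, here is an illustration with hypothetical buffer contents (not redis-py code) of what the stream looks like after a PubSub push is only partially consumed:

```python
# A PubSub push for message "hi" on channel "ch" arrives as one RESP array:
stream = b"*3\r\n$7\r\nmessage\r\n$2\r\nch\r\n$2\r\nhi\r\n"

# Suppose the reader was interrupted after consuming the first two frames:
consumed = b"*3\r\n$7\r\nmessage\r\n"
leftover = stream[len(consumed):]

print(leftover)  # b'$2\r\nch\r\n$2\r\nhi\r\n'
# The next read starts here, in the middle of the old push: it will happily
# return b"ch" as the reply to whatever command is issued next, and every
# reply after that stays shifted -- the protocol never resynchronizes.
```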
I thought that at this stage we're all in agreement about the nature of the problems:
... and yes, about (2) I stand corrected. Looking at the code, while there is a "read buffer", there are multiple opportunities to drop data. 😔 A safer implementation would perhaps take a "cursor" into the
P.S. re redis-py/redis/asyncio/connection.py, lines 794 to 798 in 67214cc:
So whatever safety
Note that the lines you quoted were recently added by @kristjanvalur, but this was already broken before – just somewhere else. But the synchronous implementation (and thus gevent) does not suffer from this and is implemented correctly. It first waits up to the timeout for data to become available and only then reads the whole message. The same could probably be done in the async implementation. This would also get rid of the whole async-timeout dependency.
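A minimal sketch of that two-phase synchronous behavior (illustrative only; parse_reply stands in for the parser and is not redis-py's API):

```python
import select

def get_message_with_timeout(sock, parse_reply, timeout):
    # Phase 1: bound only the wait for data to become available.
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None            # nothing arrived within the timeout
    # Phase 2: a reply has started arriving; read it to completion.
    return parse_reply(sock)
```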
Hmm, it waits up to timeout for some available data, then proceeds to read the entire message, possibly blocking, but probably not for long if at all, since redis is fast? That is, if you schedule a gevent timeout, it could still trip one of the
Yeah, but in general, when calling that API I would not expect it to return after
Exactly.
In general, yes, I agree. A timeout sensibly applies to the
Re: your discussion about timeouts in the synchronous code. Yes, this is the reason why I removed the "can_read()" implementation from the async code. There is no
Think about this: if a large message was sent a minute ago and you call
@Chronial Fair enough, it does make sense. I didn't use Redis for pub-sub, but I used SQS and its WaitTimeSeconds works more like how you describe IIUC. In my case, I'm scheduling the timeout far above in the call-stack, in some generalized "framework" code where I don't know about Redis or anything else. It is to implement deadline propagation. The timeout is "wrapping" an entire HTTP request handler, and imposes the deadline reported by the caller.
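For context, such framework-level deadline propagation looks roughly like this (a sketch with made-up names; only gevent.Timeout is real):

```python
from gevent import Timeout

def run_with_deadline(handler, deadline_sec):
    # The handler knows nothing about Redis; the caller's deadline is imposed
    # from above and may fire deep inside any library call, including
    # redis-py socket I/O.
    try:
        with Timeout(deadline_sec):
            return handler()
    except Timeout:
        return "504: deadline exceeded"
```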
That's similar to what we do, too (as part of redis-tasks), and is IMO a sensible reason to force timeout exceptions deep into callstacks. I think redis-py should support that and be well-behaved in such a context. But I'm not convinced that it makes sense to invest effort and code complexity to try and keep a forcefully interrupted connection alive in a usable state. The last years seem to show that disconnecting such connections is good enough.
At least in the stateless (not-pub-sub) connection case, clearly so (and anyhow there's no safe way to do it given that you cannot
If get_message is broken, it needs fixing. And without the "can_read" hack. With async, the message reader needs to maintain state, such as it does with Hiredis.
So, looking at the code, I concur that the Python parser is non-restartable. If interrupted while parsing a message, it will not be able to retry.
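For illustration, a restartable reader in the spirit of the "cursor" idea above might buffer bytes and only discard them once a complete frame has been parsed (a sketch, not redis-py's parser):

```python
class RestartableLineReader:
    """Sketch of a reader that survives being interrupted between reads."""

    def __init__(self, sock):
        self._sock = sock
        self._buf = b""

    def read_line(self):
        # Nothing is consumed until a full CRLF-terminated line is buffered,
        # so an exception raised at the recv() interruption point loses no
        # data: the next call resumes with whatever is already buffered.
        while b"\r\n" not in self._buf:
            chunk = self._sock.recv(4096)
            if not chunk:
                raise ConnectionError("server closed the connection")
            self._buf += chunk
        line, _, self._buf = self._buf.partition(b"\r\n")
        return line
```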
What makes you sure that the Hiredis parser is not affected by this? You seem to be assuming that interrupted async socket reads have a guarantee against data loss – do you have any source for this? My expectation would be that the cancellation exception can be raised at any
@Chronial Implemented correctly, an exception should only be raised at an await point. That's how the underlying
Interrupted async socket reads are guaranteed against data loss. I'll gladly admit to being slightly overzealous when simplifying the
The correct API for get_message is still that the timeout should only affect the waiting period, not the receiving, as I explained above. If you are distracted by the fact that the parameter is called timeout – I do not see how you could possibly implement that API with the
When you say "the correct API", could you back that up somehow? What convention is this, that a timeout applied to an operation should affect only some internal part of it and not the whole operation? The business with "can_read()" is IMHO completely broken, because using it makes no guarantee that a message can be retrieved within the timeout period. It is a necessary evil for synchronous code, to provide at least some expectation of sanity without heavy bookkeeping of the timeout "remaining" for each individual call. The difference between "waiting" and "reading" is essentially none; the whole message appears as a single entity. There is no utility in trying to distinguish between a "read" time and a "wait" time. If you perform an operation within a timeout, you expect it to finish within that time.
I feel like I always need to refer back to my comments multiple times for you to actually read them. That's not nice.
In the "happy case" the difference between "wait_for" and "timeout" should be negligible, so we should pick what's easier to implement.

In the pathological case, where you reach the "can read" point within 10ms but then the "read entire message" phase stalls for 10 minutes... I'd personally think the caller would be happier with the "timeout" behavior rather than the "wait_for" behavior, e.g. if the motivation was "I need to ping another server once in a while, so don't hold me up too long".

On a side note, I've mentioned AWS SQS as a precedent, where you provide a
This suggests to me that internally they limit the execution time of waiting and message retrieval. Of course it shouldn't matter that much to us what AWS chose to do in an unrelated tool...
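The two behaviors under discussion, sketched with hypothetical wait_for_data()/read_full_message() helpers (asyncio.timeout needs Python 3.11+):

```python
import asyncio

async def bound_wait_only(reader, budget):
    # Budget covers only the wait for the first data; once a reply starts
    # arriving it is read to completion, however long that takes.
    async with asyncio.timeout(budget):
        await reader.wait_for_data()
    return await reader.read_full_message()

async def bound_whole_operation(reader, budget):
    # Budget covers waiting *and* reading: the call returns (or times out)
    # within the budget, period.
    async with asyncio.timeout(budget):
        await reader.wait_for_data()
        return await reader.read_full_message()
```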
Okay, I believe I have found a good solution – please see #2506. I have also provided #2510 and #2512, two alternative ways to make
Sorry for not responding to all the comments. I find it stressful to keep up with a conversation which has so much negative energy in it. But I'm doing my best.

To answer your question: No, I emphatically would not. I would expect the call to wait 10ms and then return. An API which accepts a timeout and then chooses to ignore the timeout is not very useful. For all we know, the whole message may never arrive, because the connection was cut short in the middle of transmission. The fact that the current

Please note that I am talking about an operation timeout – you providing a timeout to an operation explicitly. This should be considered differently from the socket timeout, which is sometimes applied as a default to all socket operations and is a kind of emergency brake. When the "socket timeout" triggers, one should consider the connection defunct and discard it. But an "operation timeout" only means that an operation didn't succeed in time, and should be retried.

I guess that we could improve the "socket timeout" to just apply to individual read operations, and be fatal, rather than affect the entire message parsing. That way it would not trip for each read operation (in your long-transfer example) even if it would trip for the entire transfer. Would you like me to add a PR where we move the fatal "socket timeout" down to individual operations and thus deal with a trickling operation? I can do that, but would prefer to do it once #2510 or #2512 are merged or rejected. Or, I could piggyback this on them.
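For reference, the two kinds of timeout map onto redis-py's public API roughly like this (a sketch; a local server is assumed):

```python
import redis

# "Socket timeout": an emergency brake on every socket operation; when it
# trips, the connection is treated as defunct and discarded.
r = redis.Redis(socket_timeout=5.0, socket_connect_timeout=2.0)

# "Operation timeout": an explicit per-call budget chosen by the user; when
# it expires the call just returns nothing and can simply be retried.
p = r.pubsub()
p.subscribe("updates")
message = p.get_message(timeout=0.1)   # None if nothing arrived in 100 ms
```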
For future readers who encounter an error of
I don't understand, can_read was removed from both PythonParser and HiredisParser. It was an internal method, not really part of the API, and so one really shouldn't encounter it as missing, unless one was doing something strange.
Yes, they're both removed; however, PythonParser raises this exception. Back when we were maintaining aioredis, we encountered several issues with the PythonParser and frequently abandoned it. Feel free to take a look here for the impl that led to this error (to reproduce, simply uninstall hiredis when running pytest): Andrew-Chen-Wang/django-async-redis#5
Could you furnish me with a call stack or simple repro steps for the problem? I'm not a Django guy. There is no mention of
The stack is too large. Instead, here are repro steps, assuming Redis is running on localhost:
Ran the tests and didn't see the problem. 4.4.1 was released yesterday, so maybe that is the reason. 4.4.0 shouldn't have had this issue either.
```python
import asyncio

import pytest


@pytest.fixture(scope="session")
def event_loop():
    policy = asyncio.get_event_loop_policy()
    loop = policy.new_event_loop()
    yield loop
    loop.close()
```

Cheers!
Thanks for taking a look at the library @kristjanvalur, and yes, those were definitely the problems in general! Appreciate your time and effort :) I'm also unable to reproduce the steps, before your suggestions, that led me to
This affected ChatGPT? #2665 seems to be the async version of this bug report.
@Chronial Definitely an attention-grabbing post mortem these days :D When it happened back in our servers at Lyft's Bikes & Scooters, it was very intermittent — it took maybe 1 day for a single python node to get into a broken state (for a single connection in a pool, so even the same node didn't always exhibit a problem), and often someone would term that node to "fix" it, and move on. Then one day I took a deep dive, pinned py-redis < 4.4 and added a local regression test to prevent a kind soul from upgrading blindly, but not even a post mortem... |
tl;dr repeat of #360
My codebase uses gevent (w/monkey-patching) for concurrent workers, and schedules a gevent.Timeout to force a deadline on the workers. Since a timeout causes an exception to be raised from an arbitrary "yield point", there's risk of corrupting shared state if code is not written to be exception-safe.
The code talks to redis over a client that's shared between the workers. Socket I/O is a "yield point". Sending a command successfully, but then failing to read the entire response off the socket, gets the Connection into an undefined state: a subsequent command would try to parse a response intended for a previous command, and in many cases would fail. Here's what a typical error would look like:

As you can see, we're trying to parse an integer response to a previous command (e.g. ":12345\r\n") as a string response to a SET command.

The except Exception block in redis.connection.Connection.read_response is intended to handle all sorts of parsing errors by disconnecting the connection (redis-py/redis/connection.py, lines 819 to 821 in 6219574), but perhaps it could be changed to a bare except: since e.g. gevent.Timeout is intentionally a BaseException to suggest that it should not be suppressed (and we are indeed not suppressing it).
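A minimal sketch of that failure pattern (assuming gevent and a local Redis server; the tiny timeout is only there to make the race easier to hit):

```python
from gevent import monkey
monkey.patch_all()

from gevent import Timeout
import redis

r = redis.Redis()   # shared client; connections come from its pool

def worker():
    try:
        # Deadline imposed from above: it may fire after the command bytes
        # were written but before the reply was fully read off the socket.
        with Timeout(0.001):
            r.set("key", "value")
    except Timeout:
        pass
    # On affected versions the connection may still hold the unread reply to
    # SET, so a later command on that connection parses a stale response.
    print(r.get("key"))
```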