-
-
Notifications
You must be signed in to change notification settings - Fork 31.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-81322: support multiple separators in Stream.readuntil #16429
Conversation
A few questions about this:
|
Lib/asyncio/streams.py
Outdated
break | ||
|
||
# see upper comment for explanation. | ||
offset = buflen + 1 - seplen | ||
offset = buflen + 1 - max_seplen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see this can make offset negative if max_seplen > min_seplen. I'll add a test to catch it and add a fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to draft this PR @bmerry, and welcome to CPython! 😎
I'm not super familiar with this module, but I found a couple of things that could be tidied up. Other than that, the code looks good! This will definitely need updated docs, though.
Lib/asyncio/streams.py
Outdated
# Makes sure shortest matches wins, and supports arbitrary iterables | ||
separator = sorted(separator, key=lambda sep: len(sep)) | ||
if not separator: | ||
raise ValueError('Separator list should contain at least one element') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
raise ValueError('Separator list should contain at least one element') | |
raise ValueError('Separator should contain at least one element') |
Lib/asyncio/streams.py
Outdated
@@ -1672,26 +1709,35 @@ def _feed_data(self, data): | |||
# messages :) | |||
|
|||
# `offset` is the number of bytes from the beginning of the buffer | |||
# where there is no occurrence of `separator`. | |||
# where there is no occurrence of any separator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
# where there is no occurrence of any separator. | |
# where there is no occurrence of any `separator`. |
Lib/asyncio/streams.py
Outdated
offset = 0 | ||
|
||
# Loop until we find `separator` in the buffer, exceed the buffer size, | ||
# Loop until we find a separator in the buffer, exceed the buffer size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
# Loop until we find a separator in the buffer, exceed the buffer size, | |
# Loop until we find a `separator` in the buffer, exceed the buffer size, |
Lib/asyncio/streams.py
Outdated
separator = [separator] | ||
else: | ||
# Makes sure shortest matches wins, and supports arbitrary iterables | ||
separator = sorted(separator, key=lambda sep: len(sep)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missed one. Here too!
separator = sorted(separator, key=lambda sep: len(sep)) | |
separator = sorted(separator, key=len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately it was decided to revert the current streams implementation from 3.8. See https://bugs.python.org/issue38242 for more details. I'm really sorry, but we'll need to rebase this work on the new API we later add to 3.9 :(
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be poked with soft cushions! |
@1st1 not a problem - definitely worth getting things right before they go into stdlib. Do you think the revert of the new stream API will conflict much with my changes to the old StreamReader class? If not, perhaps the way forward is for me to remove my changes to |
@bmerry I'm not yet sure. I'll be working on a revert later today or tomorrow and will probably know more. I'd keep everything as is as it's not going to be an easy revert anyways. |
@1st1 it looks like you've done the Streams reversion now, and I've merged that into my branch. So hopefully this is ready for review now. @brandtbucher thanks for the suggestions. I've applied all the ones that Github was still showing after the merge from master - let me know if anything got lost along the way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more tiny clarification from me, otherwise this looks good. Still needs updated docs though!
Misc/NEWS.d/next/Library/2019-09-26-17-52-52.bpo-37141.onYY2-.rst
Outdated
Show resolved
Hide resolved
@brandtbucher which docs need updating? |
Sorry about the delay @bmerry. The docs I'm referring to are located in .. coroutinemethod:: readuntil(separator=b'\\n')
Read data from the stream until *separator* is found.
On success, the data and separator will be removed from the
internal buffer (consumed). Returned data will include the
separator at the end.
If the amount of data read exceeds the configured stream limit, a
:exc:`LimitOverrunError` exception is raised, and the data
is left in the internal buffer and can be read again.
If EOF is reached before the complete separator is found,
an :exc:`IncompleteReadError` exception is raised, and the internal
buffer is reset. The :attr:`IncompleteReadError.partial` attribute
may contain a portion of the separator.
.. versionadded:: 3.5.2 The signature and description should be updated. Also, something like this should be included below the .. versionchanged:: 3.9
The *separator* parameter may now be an :term:`iterable` of separators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One tiny change to the NEWS entry due to the API revert:
Misc/NEWS.d/next/Library/2019-09-26-17-52-52.bpo-37141.onYY2-.rst
Outdated
Show resolved
Hide resolved
Thanks. I'd assumed the library docs were generated from the docstring, which is why I missed this. I'll update it. |
Looks good to me! @1st1? |
We will soon start a discussion about the new streaming API for 3.9. I'll update this issue with a link when that happens; please give us some time before making a decision on this one -- Python 3.9 is still relatively far away. |
Thanks, I will review in the coming week! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment nit, I'll merge that myself and then land it. Thanks for your contribution, and on behalf of the asyncio maintainers I apologize for the delay!
len(separator) would be the number of separators, which makes no sense.
PR python#16429 introduced support for an iterable of separators in Stream.readuntil. Since bytes-like types are themselves iterable, this can introduce ambiguities in deciding whether the argument is an iterator of separators or a singleton separator. In python#16429, only 'bytes' was considered a singleton, but this will break code that passes other buffer object types. The Python library docs don't indicate what separator types were permitted in Python <=3.12, but comments in typeshed indicate that it would work with types that implement the buffer protocol and provide a len(). To keep those cases working the way they did before, I've changed the detection logic to consider any instance of collections.abc.Buffer as a singleton separator. There may still be corner cases where this doesn't do what the user wants e.g. a numpy array of byte strings will implement the buffer protocol and hence be treated as a singleton; but at least those corner cases should behave the same in 3.13 as they did in 3.12. Relates to python#81322.
PR python#16429 introduced support for an iterable of separators in Stream.readuntil. Since bytes-like types are themselves iterable, this can introduce ambiguities in deciding whether the argument is an iterator of separators or a singleton separator. In python#16429, only 'bytes' was considered a singleton, but this will break code that passes other buffer object types. Fix it by only supporting tuples rather than arbitrary iterables. Closes python#117722.
gh-16429 introduced support for an iterable of separators in Stream.readuntil. Since bytes-like types are themselves iterable, this can introduce ambiguities in deciding whether the argument is an iterator of separators or a singleton separator. In gh-16429, only 'bytes' was considered a singleton, but this will break code that passes other buffer object types. Fix it by only supporting tuples rather than arbitrary iterables. Closes gh-117722.
…ython#117723) pythongh-16429 introduced support for an iterable of separators in Stream.readuntil. Since bytes-like types are themselves iterable, this can introduce ambiguities in deciding whether the argument is an iterator of separators or a singleton separator. In pythongh-16429, only 'bytes' was considered a singleton, but this will break code that passes other buffer object types. Fix it by only supporting tuples rather than arbitrary iterables. Closes pythongh-117722.
Allow Stream.readuntil to take an iterable of separators and match any
of them. The earliest match endpoint wins (which ensures that results
are dependent on the chunking) and on ties shortest separator wins
(which only matters if the user has supplied a redundant set like
[b'\r\n', b'\n'] and the limit is reached).
It's also implemented for the deprecated StreamReader, just because the
code for the two implementations was the same except for one line and it
seemed easier to keep them in sync than leaving two different versions
to maintain.
https://bugs.python.org/issue37141