Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect optimization in itertools.tee() #123884

Closed
rhettinger opened this issue Sep 9, 2024 · 9 comments
Closed

Incorrect optimization in itertools.tee() #123884

rhettinger opened this issue Sep 9, 2024 · 9 comments
Assignees
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes type-bug An unexpected behavior, bug, or error

Comments

@rhettinger
Copy link
Contributor

rhettinger commented Sep 9, 2024

Bug description:

To save a memory allocation, the code path for a tee-in-a-tee incorrectly reuses the outer tee object as the first tee object in the result tuple. This is incorrect. All tee objects in the result tuple should have the same behavior. They are supposed to be "n independent iterators". However, the first one is not independent and it has different behaviors from the others. This is an unfortunate side-effect of an early incorrect optimization. I've now seen this affect real code. It surprising, unhelpful, undocumented, and hard to debug.

Demonstration:

from itertools import tee

def demo(i):
    it = iter('abcdefghi')
    [outer_tee] = tee(it, 1)
    inner_tee = tee(outer_tee, 10)[i]
    return next(inner_tee), next(outer_tee)

print('These should all give the same result:')
for i in range(10):
    print(i, demo(i))

This outputs:

These should all give the same result:
0 ('a', 'b')
1 ('a', 'a')
2 ('a', 'a')
3 ('a', 'a')
4 ('a', 'a')
5 ('a', 'a')
6 ('a', 'a')
7 ('a', 'a')
8 ('a', 'a')
9 ('a', 'a')

There is a test for the optimization -- it wasn't an accident. However, the optimization itself is a bug against the published specification in the docs and against general expectations.

        a, b = tee('abc')
        c, d = tee(a)
        self.assertTrue(a is c)

Linked PRs

@rhettinger rhettinger added type-bug An unexpected behavior, bug, or error 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Sep 9, 2024
@rhettinger rhettinger self-assigned this Sep 10, 2024
@serhiy-storchaka
Copy link
Member

Agree, this is a bug. But fixing it can break a user code. For example:

while True:
    it, peek = tee(it)
    if next(peek).isdigit():
        it, num = parsenum(it)
        continue
    it, peek = tee(it)
    if next(peek).isalpha():
        it, word = parseword(it)
        continue
    it, peek = tee(it)
    if next(peek) in '+-*/':
        op = next(it)
        continue

Without optimization, you will get a growing chain of the tee objects with linearly growing computation time and stack consumption of next(). This may be the original purpose of this optimization, not only performance or memory use.

@rhettinger
Copy link
Contributor Author

Nested tee() calls self-flatten just like nested partial() calls. So no, there would not be a "growing chain of the tee objects".

@y5c4l3
Copy link
Contributor

y5c4l3 commented Sep 11, 2024

Without optimization, you will get a growing chain of the tee objects with linearly growing computation time and stack consumption of next(). This may be the original purpose of this optimization, not only performance or memory use.

teeobject is copyable, the problem is that the tuple construction loop in tee_impl only starts from 1. The 0-th element is a copyable iterator acquired by GetIter, but teeobject is self iterator AND copyable, which causes the bug here.

@y5c4l3
Copy link
Contributor

y5c4l3 commented Sep 11, 2024

@rhettinger It seems that the re-use of outer_tee isn't valid, since the doc said:

Once a tee() has been created, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed. Once a tee() has been created, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.

@serhiy-storchaka
Copy link
Member

This behavior was introduced in ad983e7, and it was explicitly documented from the beginning, so it was considered as a feature or a fair price.

Nested tee() calls self-flatten just like nested partial() calls. So no, there would not be a "growing chain of the tee objects".

It depends on your expectations. Do you expect tee(x for x in it) and tee(it) produce the same result? There are two shorcuts in the tee constructor -- omitting the tee_fromiterable() call if the iterator is copyable and omitting copying the first item in a tuple. The first one allows to avoid the growing chain of the tee objects, but it goes against the published specification and the general expectations. It all does not matter if you do not use the original iterator.

BTW, your use of "outer" and "inner" for iterators is confusing. I would call them in the opposite way -- the nested one is inner.

emilyemorehouse added a commit to lysnikolaou/cpython that referenced this issue Sep 26, 2024
* main: (69 commits)
  Add "annotate" SET_FUNCTION_ATTRIBUTE bit to dis. (python#124566)
  pythongh-124412: Add helpers for converting annotations to source format (python#124551)
  pythongh-119180: Disallow instantiation of ConstEvaluator objects (python#124561)
  For-else deserves its own section in the tutorial (python#123946)
  Add 3.13 as a version option to the crash issue template (python#124560)
  pythongh-123242: Note that type.__annotations__ may not exist (python#124557)
  pythongh-119180: Make FORWARDREF format look at __annotations__ first (python#124479)
  pythonGH-58058: Add quick reference for `ArgumentParser` to argparse docs (pythongh-124227)
  pythongh-41431: Add `datetime.time.strptime()` and `datetime.date.strptime()` (python#120752)
  pythongh-102450: Add ISO-8601 alternative for midnight to `fromisoformat()` calls. (python#105856)
  pythongh-124370: Add "howto" for free-threaded Python (python#124371)
  pythongh-121277: Allow `.. versionadded:: next` in docs (pythonGH-121278)
  pythongh-119400:  make_ssl_certs: update reference test data automatically, pass in expiration dates as parameters python#119400  (pythonGH-119401)
  pythongh-119180: Avoid going through AST and eval() when possible in annotationlib (python#124337)
  pythongh-124448: Update Windows builds to use Tcl/Tk 8.6.15 (pythonGH-124449)
  pythongh-123884 Tee of tee was not producing n independent iterators (pythongh-124490)
  pythongh-124378: Update test_ttk for Tcl/Tk 8.6.15 (pythonGH-124542)
  pythongh-124513: Check args in framelocalsproxy_new() (python#124515)
  pythongh-101100: Add a table of class attributes to the "Custom classes" section of the data model docs (python#124480)
  Doc: Use ``major.minor`` for documentation distribution archive filenames (python#124489)
  ...
@picnixz
Copy link
Member

picnixz commented Sep 26, 2024

Automatic backports failed so someone should take care of them manually.

@bedevere-app
Copy link

bedevere-app bot commented Oct 8, 2024

GH-125081 is a backport of this pull request to the 3.13 branch.

@bedevere-app
Copy link

bedevere-app bot commented Oct 8, 2024

GH-125153 is a backport of this pull request to the 3.12 branch.

@serhiy-storchaka
Copy link
Member

There was something wrong with the backports -- @bedevere-app usually reports on the original PR page, not on the issue page.

Indeed, the topic of the backport PR should look like "[3.13] gh-{issue-number}: {text} (GH-{pr-number})". These PRs have the issue number in wrong place and do not have the original PR number.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants