GH-98363: Presize the list for batched() #98419
Conversation
🤖 New build scheduled with the buildbot fleet by @rhettinger for commit fd5b508 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.
```diff
     if (result == NULL) {
         return NULL;
     }
-    for (i=0 ; i < bo->batch_size ; i++) {
+    for (i=0 ; i < n ; i++) {
         item = PyIter_Next(it);
         if (item == NULL) {
             break;
```
Do we care how this handles strange iterators that keep going after having raised `StopIteration`?
```python
>>> from itertools import batched
>>> class A:
...     def __init__(self):
...         self.i = 0
...     def __iter__(self):
...         return self
...     def __next__(self):
...         i = self.i
...         self.i += 1
...         if i == 7:
...             raise StopIteration()
...         return i
...
>>> x = batched(A(), 5)
>>> next(x)
[0, 1, 2, 3, 4]
>>> next(x)
[5, 6]
>>> next(x)
[8, 9, 10, 11, 12]
```
Maybe the `Py_CLEAR(bo->it)` could be moved to before this `break;`.
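To illustrate the suggestion, here is a hypothetical pure-Python model (not CPython's actual implementation) of what dropping the iterator reference as soon as an internal `next()` fails would mean: once a short batch is produced, the misbehaving iterator is never consulted again.

```python
class batched_clearing:
    """Hypothetical model of batched() that drops its iterator
    reference on the first failed next() (the Py_CLEAR idea)."""

    def __init__(self, iterable, n):
        self.it = iter(iterable)
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.it is None:
            raise StopIteration
        batch = []
        for _ in range(self.n):
            try:
                batch.append(next(self.it))
            except StopIteration:
                self.it = None          # analogous to Py_CLEAR(bo->it)
                break
        if not batch:
            raise StopIteration
        return batch
```

With the strange iterator `A` above, this model yields `[0, 1, 2, 3, 4]`, then `[5, 6]`, and then stops for good instead of resuming at `8`.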
I'm thinking of leaving this as is. The current PR matches the behavior of the pure Python version, and the strange iterator works the same way with other tools:
```python
>>> it = zip(range(10), A())
>>> list(it)
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
>>> list(it)
[(8, 8), (9, 9)]
>>> it = enumerate(A())
>>> list(it)
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
>>> next(it)
(7, 8)
>>> next(it)
(8, 9)
```
That said, I'm not really sure about this one and could easily be persuaded to add a `Py_CLEAR(bo->it)` to the code path for the short list.
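For reference, the pure Python version being matched can be sketched roughly as follows (a simplified sketch, not the exact documented recipe; the C version at the time of this PR also returned lists). Because it calls `islice()` afresh for each batch, an iterator that keeps producing after `StopIteration` gets consulted again on the next batch:

```python
from itertools import islice

def batched_py(iterable, n):
    # Rough pure-Python sketch of batched().  Each pass pulls up to
    # n items; a short (but non-empty) batch does not discard the
    # underlying iterator, so a misbehaving iterator can resume.
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch
```

Run against the strange iterator `A`, this produces the same `[0..4]`, `[5, 6]`, `[8..12]` sequence shown in the REPL session above.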
This does seem like an edge case that could go either way.

I'm noticing that some iterators do `Py_CLEAR(obj->it)` as soon as the `next()` call fails, while other objects have internal iterators that are never NULL. My concern is that `batched()` clears the iterator sometimes (when the first `next()` of a batch fails), but not other times (when a subsequent `next()` fails).

So I would think to do one or the other, but not both:

- Clear the iterator immediately after any internal `next()` call fails, similar to `pairwise`, `islice`, `chain`, and `cycle` (sort of).
- Never clear the iterator and remove the `if (it == NULL)` check altogether, similar to `dropwhile`, `takewhile`, `accumulate`, `compress`, `zip` and `enumerate`. Unfortunately, it would mean allocating and throwing away `PyList_New(n)`, but that's in the rare case that someone is abusing the iterator.

Maybe removing the `Py_CLEAR` and the `NULL` check is most natural?
Regardless of the decision about `Py_CLEAR`, I think this is a good improvement.

Out of curiosity, before this PR, I measured `consume(batched(repeat(None, 1_000_000), 100))` to have a bimodal distribution of timings, with about half of runs at 41ms and half at 8ms (cache effects?). After this PR, I measured about 3ms consistently. This was on Windows, without PGO.
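A sketch of how such a measurement can be taken (a guess at the setup, not the commenter's exact script): `consume` is the itertools-recipes helper that exhausts an iterator at C speed, and `itertools.batched` needs Python 3.12+, so a pure-Python fallback is included for older interpreters.

```python
from collections import deque
from itertools import repeat
import timeit

try:
    from itertools import batched        # Python 3.12+
except ImportError:
    from itertools import islice
    def batched(iterable, n):            # pure-Python fallback
        it = iter(iterable)
        while batch := list(islice(it, n)):
            yield batch

def consume(iterator):
    # itertools-recipes helper: a zero-length deque drains the
    # iterator without keeping any of its items
    deque(iterator, maxlen=0)

t = timeit.timeit(
    "consume(batched(repeat(None, 1_000_000), 100))",
    globals=globals(), number=5)
print(f"{t / 5 * 1000:.1f} ms per run")
```

Repeating the whole measurement several times (rather than relying on one `timeit` call) is what would reveal a bimodal distribution like the one described.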
Thanks for the interesting line of thought. I like having this match the pure Python equivalent. Also, I don't think that it would ever matter.

Code like

The likely reason that you're getting a bimodal timing distribution is that
* main: (40 commits)
  - pythongh-98461: Fix source location in comprehensions bytecode (pythonGH-98464)
  - pythongh-98421: Clean Up PyObject_Print (pythonGH-98422)
  - pythongh-98360: multiprocessing now spawns children on Windows with correct argv[0] in virtual environments (pythonGH-98462)
  - CODEOWNERS: Become a typing code owner (python#98480)
  - [doc] Improve logging cookbook example. (pythonGH-98481)
  - Add more tkinter.Canvas tests (pythonGH-98475)
  - pythongh-95023: Added os.setns and os.unshare functions (python#95046)
  - pythonGH-98363: Presize the list for batched() (pythonGH-98419)
  - pythongh-98374: Suppress ImportError for invalid query for help() command. (pythongh-98450)
  - typing tests: `_overload_dummy` raises `NotImplementedError`, not `RuntimeError` (python#98351)
  - pythongh-98354: Add unicode check for 'name' attribute in _imp_create_builtin (pythonGH-98412)
  - pythongh-98257: Make _PyEval_SetTrace() reentrant (python#98258)
  - pythongh-98414: py.exe launcher does not use defaults for -V:company/ option (pythonGH-98460)
  - pythongh-98417: Store int_max_str_digits on the Interpreter State (pythonGH-98418)
  - Doc: Remove title text from internal links (python#98409)
  - [doc] Refresh the venv introduction documentation, and correct the statement about VIRTUAL_ENV (pythonGH-98350)
  - Docs: Bump sphinx-lint and fix unbalanced inline literal markup (python#98441)
  - pythongh-92886: Replace assertion statements in `handlers.BaseHandler` to support running with optimizations (`-O`) (pythonGH-93231)
  - pythongh-92886: Fix tests that fail when running with optimizations (`-O`) in `_test_multiprocessing.py` (pythonGH-93233)
  - pythongh-92886: Fix tests that fail when running with optimizations (`-O`) in `test_py_compile.py` (pythonGH-93235)
  - ...
Presizing the result lists saves space from list over-allocation and is faster than the `PyList_Append()` approach.

This is a short PR, but it took a while to figure out that `PyList_GetSlice()` would make short work of the problem. Ideally, we need something like `_PyTuple_Resize()` to resize the list in place using a `realloc()` and updating the `allocated` field.
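The over-allocation cost being avoided can be seen from pure Python with `sys.getsizeof` (a hypothetical illustration with CPython-specific sizes, not the PR's code): a list grown by repeated appends carries spare capacity from the growth pattern, while a list created at its final size does not.

```python
import sys

n = 1000

# Grown with append(): CPython over-allocates on each resize,
# so the final list usually holds spare capacity.
appended = []
for i in range(n):
    appended.append(i)

# Created at its final size via a length hint: no spare capacity.
presized = list(range(n))

print(sys.getsizeof(appended), sys.getsizeof(presized))
```

The two lists compare equal, but in CPython the appended one typically reports a larger size, which is the slack the presizing in this PR avoids.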