GH-98363: Presize the list for batched() #98419
Conversation
🤖 New build scheduled with the buildbot fleet by @rhettinger for commit fd5b508 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.
```diff
     if (result == NULL) {
         return NULL;
     }
-    for (i=0 ; i < bo->batch_size ; i++) {
+    for (i=0 ; i < n ; i++) {
         item = PyIter_Next(it);
         if (item == NULL) {
             break;
```
Do we care how this handles strange iterators that keep going after having raised `StopIteration`?
```python
>>> from itertools import batched
>>> class A:
...     def __init__(self):
...         self.i = 0
...     def __iter__(self):
...         return self
...     def __next__(self):
...         i = self.i
...         self.i += 1
...         if i == 7:
...             raise StopIteration()
...         return i
...
>>> x = batched(A(), 5)
>>> next(x)
[0, 1, 2, 3, 4]
>>> next(x)
[5, 6]
>>> next(x)
[8, 9, 10, 11, 12]
```
Maybe the `Py_CLEAR(bo->it)` could be moved to before this `break;`.
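To illustrate the suggestion, here is a hypothetical pure-Python model (not CPython's actual implementation) of what dropping the iterator reference as soon as an internal `next()` fails would mean: once a short batch is produced, the misbehaving iterator is never consulted again.

```python
class batched_clearing:
    """Hypothetical model of batched() that drops its iterator
    reference on the first failed next() (the Py_CLEAR idea)."""

    def __init__(self, iterable, n):
        self.it = iter(iterable)
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.it is None:
            raise StopIteration
        batch = []
        for _ in range(self.n):
            try:
                batch.append(next(self.it))
            except StopIteration:
                self.it = None          # analogous to Py_CLEAR(bo->it)
                break
        if not batch:
            raise StopIteration
        return batch
```

With the strange iterator `A` above, this model yields `[0, 1, 2, 3, 4]`, then `[5, 6]`, and then stops for good instead of resuming at `8`.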
I'm thinking of leaving this as is. The current PR matches the behavior of the pure Python version, and the strange iterator works the same way with other tools:
```python
>>> it = zip(range(10), A())
>>> list(it)
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
>>> list(it)
[(8, 8), (9, 9)]
>>> it = enumerate(A())
>>> list(it)
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
>>> next(it)
(7, 8)
>>> next(it)
(8, 9)
```
That said, I'm not really sure about this one and could easily be persuaded to add a `Py_CLEAR(bo->it)` to the code path for the short list.
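For reference, the pure Python version being matched can be sketched roughly as follows (a simplified sketch, not the exact documented recipe; the C version at the time of this PR also returned lists). Because it calls `islice()` afresh for each batch, an iterator that keeps producing after `StopIteration` gets consulted again on the next batch:

```python
from itertools import islice

def batched_py(iterable, n):
    # Rough pure-Python sketch of batched().  Each pass pulls up to
    # n items; a short (but non-empty) batch does not discard the
    # underlying iterator, so a misbehaving iterator can resume.
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch
```

Run against the strange iterator `A`, this produces the same `[0..4]`, `[5, 6]`, `[8..12]` sequence shown in the REPL session above.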
This does seem like an edge case that could go either way.

I'm noticing that some iterators do `Py_CLEAR(obj->it)` as soon as the `next()` call fails, while other objects have internal iterators that are never NULL. My concern is that `batched()` clears the iterator sometimes (when the first `next()` of a batch fails), but not other times (when a subsequent `next()` fails).

So I would think to do one or the other, but not both:

- Clear the iterator immediately after any internal `next()` call fails, similar to `pairwise`, `islice`, `chain`, and `cycle` (sort of).
- Never clear the iterator and remove the `if (it == NULL)` check altogether, similar to `dropwhile`, `takewhile`, `accumulate`, `compress`, `zip` and `enumerate`. Unfortunately, it would mean allocating and throwing away `PyList_New(n)`, but that's in the rare case that someone is abusing the iterator.

Maybe removing the `Py_CLEAR` and the `NULL` check is most natural?
Regardless of the decision about `Py_CLEAR`, I think this is a good improvement.

Out of curiosity, before this PR, I measured `consume(batched(repeat(None, 1_000_000), 100))` to have a bimodal distribution of timings, with about half of runs at 41ms and half at 8ms (cache effects?). After this PR, I measured about 3ms consistently. This was on Windows, without PGO.
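A sketch of how such a measurement can be taken (a guess at the setup, not the commenter's exact script): `consume` is the itertools-recipes helper that exhausts an iterator at C speed, and `itertools.batched` needs Python 3.12+, so a pure-Python fallback is included for older interpreters.

```python
from collections import deque
from itertools import repeat
import timeit

try:
    from itertools import batched        # Python 3.12+
except ImportError:
    from itertools import islice
    def batched(iterable, n):            # pure-Python fallback
        it = iter(iterable)
        while batch := list(islice(it, n)):
            yield batch

def consume(iterator):
    # itertools-recipes helper: a zero-length deque drains the
    # iterator without keeping any of its items
    deque(iterator, maxlen=0)

t = timeit.timeit(
    "consume(batched(repeat(None, 1_000_000), 100))",
    globals=globals(), number=5)
print(f"{t / 5 * 1000:.1f} ms per run")
```

Repeating the whole measurement several times (rather than relying on one `timeit` call) is what would reveal a bimodal distribution like the one described.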
Thanks for the interesting line of thought. I like having this match the pure Python equivalent. Also, I don't think that it would ever matter.

Code like

The likely reason that you're getting a bimodal timing distribution is that
* main: (40 commits)
  - pythongh-98461: Fix source location in comprehensions bytecode (pythonGH-98464)
  - pythongh-98421: Clean Up PyObject_Print (pythonGH-98422)
  - pythongh-98360: multiprocessing now spawns children on Windows with correct argv[0] in virtual environments (pythonGH-98462)
  - CODEOWNERS: Become a typing code owner (python#98480)
  - [doc] Improve logging cookbook example. (pythonGH-98481)
  - Add more tkinter.Canvas tests (pythonGH-98475)
  - pythongh-95023: Added os.setns and os.unshare functions (python#95046)
  - pythonGH-98363: Presize the list for batched() (pythonGH-98419)
  - pythongh-98374: Suppress ImportError for invalid query for help() command. (pythongh-98450)
  - typing tests: `_overload_dummy` raises `NotImplementedError`, not `RuntimeError` (python#98351)
  - pythongh-98354: Add unicode check for 'name' attribute in _imp_create_builtin (pythonGH-98412)
  - pythongh-98257: Make _PyEval_SetTrace() reentrant (python#98258)
  - pythongh-98414: py.exe launcher does not use defaults for -V:company/ option (pythonGH-98460)
  - pythongh-98417: Store int_max_str_digits on the Interpreter State (pythonGH-98418)
  - Doc: Remove title text from internal links (python#98409)
  - [doc] Refresh the venv introduction documentation, and correct the statement about VIRTUAL_ENV (pythonGH-98350)
  - Docs: Bump sphinx-lint and fix unbalanced inline literal markup (python#98441)
  - pythongh-92886: Replace assertion statements in `handlers.BaseHandler` to support running with optimizations (`-O`) (pythonGH-93231)
  - pythongh-92886: Fix tests that fail when running with optimizations (`-O`) in `_test_multiprocessing.py` (pythonGH-93233)
  - pythongh-92886: Fix tests that fail when running with optimizations (`-O`) in `test_py_compile.py` (pythonGH-93235)
  - ...
Presizing the result lists saves space from list over-allocation and is faster than the `PyList_Append()` approach.

This is a short PR, but it took a while to figure out that `PyList_GetSlice()` would make short work of the problem. Ideally, we need something like `_PyTuple_Resize()` to resize the list in place using a `realloc()` and updating the `allocated` field.
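The over-allocation cost being avoided can be seen from pure Python with `sys.getsizeof` (a hypothetical illustration with CPython-specific sizes, not the PR's code): a list grown by repeated appends carries spare capacity from the growth pattern, while a list created at its final size does not.

```python
import sys

n = 1000

# Grown with append(): CPython over-allocates on each resize,
# so the final list usually holds spare capacity.
appended = []
for i in range(n):
    appended.append(i)

# Created at its final size via a length hint: no spare capacity.
presized = list(range(n))

print(sys.getsizeof(appended), sys.getsizeof(presized))
```

The two lists compare equal, but in CPython the appended one typically reports a larger size, which is the slack the presizing in this PR avoids.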