Skip to content

gh-129559: Add bytearray.resize() #129560

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Feb 5, 2025
Merged

gh-129559: Add bytearray.resize() #129560

merged 24 commits into from
Feb 5, 2025

Conversation

cmaloney
Copy link
Contributor

@cmaloney cmaloney commented Feb 1, 2025

Add bytearray.resize() which wraps PyByteArray_Resize.

Make negative size passed to resize exception/error rather than crash in optimized builds.


📚 Documentation preview 📚: https://cpython-previews--129560.org.readthedocs.build/

Add `bytearray.resize()` which wraps `PyByteArray_Resize`
@cmaloney
Copy link
Contributor Author

cmaloney commented Feb 2, 2025

@vstinner I think this is ready for review if bytearray.resize seems like a reasonable feature add.

@vstinner
Copy link
Member

vstinner commented Feb 2, 2025

Please initialize new bytes to zero. We don't accept undefined behavior in Python.

cmaloney and others added 2 commits February 2, 2025 09:55
Co-authored-by: Victorien <65306057+Viicos@users.noreply.github.com>
@vstinner
Copy link
Member

vstinner commented Feb 3, 2025

I'm not sure that adding this API is needed since there is already a way to truncate a bytearray and to extend a bytearray using the existing API. You should run a benchmark to show that it's worth it, especially to extend a bytearray.

@cmaloney
Copy link
Contributor Author

cmaloney commented Feb 3, 2025

What is the way to extend the bytearray without copying byte by byte or having the length of the extension extra memory allocated?

@vstinner
Copy link
Member

vstinner commented Feb 4, 2025

What is the way to extend the bytearray without copying byte by byte or having the length of the extension extra memory allocated?

What you say. In short, ba[len(ba):] = b'\0' * extend.

Please run a benchmark to measure this compared to ba.extend().

Copy link
Member

@ZeroIntensity ZeroIntensity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation mostly looks fine, but my primary concern is that we are indeed exposing an implementation detail on our end. Alternate Python implementations that either don't use an internal buffer, or can't control the size of it, will have a hard time effectively providing resize.

For example, Brython implements bytearray using JavaScript arrays, which don't have an explicit way to resize. I don't think they'll like this method.

I'd go for one of two options:

  • Mark resize as a CPython implementation detail.
  • Make resize a hint, allowing the runtime to ignore the request for resizing if it wants to.

I'd personally choose the latter, but I'd also like to hear Victor's thoughts.

@vstinner
Copy link
Member

vstinner commented Feb 4, 2025

@ZeroIntensity: It's already possible to implement extend() with the current bytearray API, it's just a convenient helper for: ba[len(ba):] = b'\0' * extend. So I don't see why other Python implementation couldn't implement it. They should already be able to resize a bytearray.

There are many ways to "extend" a bytearray. Another example:

>>> ba=bytearray(b'abc')
>>> ba += b'def'
>>> ba, len(ba)
(bytearray(b'abcdef'), 6)

@ZeroIntensity
Copy link
Member

Yeah, but I think that's a bit different than an explicit "resize." You can emulate it with that kind of thing, but it feels different than an actual method saying it will reallocate.

@vstinner
Copy link
Member

vstinner commented Feb 4, 2025

The method documentation should not promise any performance guarantee in general, it can be implemented in different ways (the implementation doesn't matter).

@ZeroIntensity
Copy link
Member

Can we clarify that with a CPython implementation detail note?

@cmaloney
Copy link
Contributor Author

cmaloney commented Feb 4, 2025

@ZeroIntensity can definitely add a note of "This is equivalent to bytearray += b'\0' * size". For more efficiency in JavaScript in particular I think this could map to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/fill -- nevermind, just sets elements to 0, can't append.

(Planning to do next iteration + perf testing follow up today, yesterday got sucked into a BytesIO.readfrom(file, *, limit: int | None = None, expected: int | None = None) experiment)

cmaloney and others added 2 commits February 4, 2025 10:23
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
Co-authored-by: Victor Stinner <vstinner@python.org>
@vstinner
Copy link
Member

vstinner commented Feb 4, 2025

test_bytes and test_capi are failing, you need to update the exception.

@cmaloney
Copy link
Contributor Author

cmaloney commented Feb 4, 2025

Working on them; Unfortunate part of "add suggestions", it adds "this person contributed" but doesn't let you do that + local changes before making push visible / alerting reviewers... Should have soon

.. method:: resize(size)

Resize the :class:`bytearray` to contain *size* bytes.
If :class:`bytearray` needs to grow, all new bytes will be set to null bytes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also describe the "obvious-to-us" behavior of what happens when it shrinks: the data at the end is truncated, the remaining data should be equivalent to a [:size] slice.

... and consider if the Python version should also support negative size to mean the same thing it would in a slice notation.
if (size < 0): size = max(0, len(self) + size) ?

Copy link
Contributor Author

@cmaloney cmaloney Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the truncation.

I don't like negative, because the function operates in absolute buffer size, and requesting a negative buffer size sounds like a bug I'd write and I'd prefer an exception / exit rather than debug the symptom "my buffer shrank".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my original equivalent to is wrong, which make it look like a delta, current doc one I'm validating:

if len(self) > size:
    del self[size:]
else:
    self += b'\0' * (size - len(self))

@cmaloney
Copy link
Contributor Author

cmaloney commented Feb 5, 2025

I think I've addressed all review comments.

Tested performance on an optimized build (--with-lto --enable-optimizations)

$ python bm_bytearray_resize.py       
.....................
extend_bytes: Mean +- std dev: 142 ms +- 1 ms
.....................
extend_range: Mean +- std dev: 1.89 sec +- 0.07 sec
.....................
iadd: Mean +- std dev: 143 ms +- 1 ms
.....................
resize: Mean +- std dev: 48.7 ms +- 0.5 ms
import itertools
import pyperf

runner = pyperf.Runner()


# Aiming for what a file readall does in terms of resizing as get more data.
# Non-linear resizing up to a reaonably large read size.
blocksize = 1024
sizes = [
    blocksize,
    blocksize * 4,
    blocksize * 16,
    1_000_000,
    100_000_000,
    1_000_000_000,
    ]

def extend_bytes():
    """Extend by appending a temporary byte string"""
    ba = bytearray()
    for size in sizes:
        ba.extend(b'\0' * size)


def extend_range():
    """Resize by an iterator with length_hint"""
    ba = bytearray()
    for size in sizes:
        ba.extend(itertools.repeat(0, size))


def iadd():
    """Resize by using += """
    ba = bytearray()
    for size in sizes:
        ba += b'\0' * size


def resize():
    """Use resize"""
    ba = bytearray()
    for size in sizes:
        ba.resize(size)



runner.bench_func("extend_bytes", extend_bytes)
runner.bench_func("extend_range", extend_range)
runner.bench_func("iadd", iadd)
runner.bench_func("resize", resize)

Comment on lines 2863 to 2870
>>> shrink = bytearray(5)
>>> resize(shrink, 0)
>>> shrink
bytearray(b'')
>>> grow = bytearray(2)
>>> resize(grow, 7)
>>> grow
bytearray(b'\x00\x00\x00\x00\x00\x00\x00')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to use resize in examples:

Suggested change
>>> shrink = bytearray(5)
>>> resize(shrink, 0)
>>> shrink
bytearray(b'')
>>> grow = bytearray(2)
>>> resize(grow, 7)
>>> grow
bytearray(b'\x00\x00\x00\x00\x00\x00\x00')
Examples:
>>> shrink = bytearray(b'abc')
>>> shrink.resize(1)
>>> (shrink, len(shrink))
(bytearray(b'a'), 1)
>>> grow = bytearray(b'abc')
>>> grow.resize(5)
>>> (grow, len(grow))
(bytearray(b'abc\x00\x00'), 5)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Making locally so I can get the doctest passing)

@vstinner
Copy link
Member

vstinner commented Feb 5, 2025

The change mostly LGTM. I just made a new review for some details.

extend_bytes: Mean +- std dev: 142 ms +- 1 ms
extend_range: Mean +- std dev: 1.89 sec +- 0.07 sec
iadd: Mean +- std dev: 143 ms +- 1 ms
resize: Mean +- std dev: 48.7 ms +- 0.5 ms

Thanks for the benchmark. So resize is 2.9x faster than the existing fastest method to grow a bytearray (extend_bytes). So it's worth it to add a dedicated method.

@gpshead
Copy link
Member

gpshead commented Feb 5, 2025

Nice, I'm not surprised it is faster. Though what I care more about is that code using this is more readable than the other hacks. :)

@gpshead gpshead merged commit 5fb019f into python:main Feb 5, 2025
46 checks passed
@cmaloney cmaloney deleted the resize_pr branch February 5, 2025 20:03
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Feb 7, 2025
Add bytearray.resize() which wraps PyByteArray_Resize.

Make negative size passed to resize exception/error rather than crash in optimized builds.
cmaloney added a commit to cmaloney/cpython that referenced this pull request Feb 8, 2025
Add bytearray.resize() which wraps PyByteArray_Resize.

Make negative size passed to resize exception/error rather than crash in optimized builds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants