Skip to content

gh-129005: Move bytearray to use bytes as a buffer #130563

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

cmaloney
Copy link
Contributor

@cmaloney cmaloney commented Feb 26, 2025

  1. Add tp_new to guarantee ob_bytes_head is always set, often to the empty bytes singleton.
  2. ob_alloc information is now redundant, added assertion to validate that, would it make sense to deprecate?
  3. There's a lot of bytearray code very similar to bytes code, more could likely be just proxied to the bytes now (Ex. construction from an object in __new__). Here just focusing on the swap as that enables optimizations.
  4. If bytearray is passed a single-reference bytes it could potentially take "ownership" of it without copying the bytes, for now not implemented.

This enables adding bytearray._detach() which I plan to do in a separate PR.

TODO

  • Single byte strings passed to PyBytes_FromStringAndSize returns immortal objects that shouldn't be mutated. Not sure how to best work around that. _io.FileIO.readall() can't run into that case as it never passes in a string to start with, just a size.

Performance

# _io.FileIO.readall of a large file
./python -m test -M8g -uall test_largefile -m test.test_largefile.CLargeFileTest.test_large_read

# _pyio.FileIO.readall of a large file
./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read

On my machine (AMD 64 bit Linux, Optimized build):
_io takes: ~0.791s and uses ~2GB of RAM
_pyio current: ~1.073s and uses ~4GB of RAM
_pyio with bytearray._detach: ~0.887s and uses ~2GB of RAM

Perf checking no major swings in an optimized build: ./python -E -bb -Wd -m test -uall -M32G test_bytes test_capi.test_bytearray -vvv

before: ~1.4s
after: ~1.4s

Previous discussion: https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164

cc: @serhiy-storchaka @vstinner

1. add `tp_new` to guarantee `ob_bytes_head` is always set, often to the empty
   bytes singleton.
2. `ob_alloc` information is now redundant, added assertion to validate that,
   would it make sense to deprecate?
3. There's a lot of `bytearray` code very similar to `bytes` code, more could
   likely be just proxied to the `bytes` now. Here just focusing on the swap as
   that enables optimizations.
4. If `bytearray` is passed a single-reference bytes it could potentially take
   "ownership" of it without copying the bytes, for now not implemented.

This enables adding `bytearray._detach()` which I plan to do in a separate PR.

```bash
 # `_io` large file read test
./python -m test -M8g -uall test_largefile -m test.test_largefile.CLargeFileTest.test_large_read

 # `_pyio` large file read test
./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read
```

On my machine (AMD 64 bit Linux, Optimized build):
`_io` takes: ~0.791s and uses ~2GB of RAM
`_pyio` current: ~1.073s and uses ~4GB of RAM
`_pyio` w/ bytearray._detach: ~0.887s and uses ~2GB of RAM

Perf checking no major swings in an optimized build:
`./python -E -bb -Wd -m test -uall -M32G test_bytes test_capi.test_bytearray -vvv`

before: ~1.4s
after: ~1.5s

Previous discussion: https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164
}

/* Too small, grow allocation */
if (_PyBytes_Resize(&obj->ob_bytes_head, alloc - 1) < 0) {
Copy link
Contributor Author

@cmaloney cmaloney Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

behavior change: _PyBytes_Resize behaves differently than PyMem_Realloc here. _PyBytes_Resize can change the location of the buffer. That means the address returned from PyByteArray_AS_STRING may change on any resize. Previously it changed only when went from empty <-> non-empty bytearray (empty points to a constant storage).

https://docs.python.org/3/c-api/memory.html#c.PyMem_Realloc

@cmaloney
Copy link
Contributor Author

cmaloney commented Mar 12, 2025

Closing while discussion is ongoing, if decide to ship, will make a new github issue for the new feature this implements.

@cmaloney cmaloney closed this Mar 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant