Reliance on C bit fields in C API is undefined behavior #89188
Comments
At least the PyASCIIObject struct in Include/cpython/unicodeobject.h uses bit fields. Various preprocessor macros like PyUnicode_IS_ASCII() and PyUnicode_KIND() access this struct's bit fields.

This is problematic because, per the C specification, the storage layout of bit fields is implementation-defined and may vary from compiler to compiler or from architecture to architecture. A build of libpython produced with compiler A may not lay out a bit field the same way as a separate binary built with compiler B. If these two binaries are linked or loaded together, the result can be a crash or incorrect behavior at run time.

To guarantee consistent bit field behavior, the same compiler must be used for all code that touches the bit field. Since this is effectively impossible for a program like Python, where multiple compilers are routinely in play (a third-party C extension is unlikely to be built on the same machine that built libpython), bit fields must not be exposed in the C API. If a bit field must exist, it should not be declared in a public .h header, and any APIs for accessing it must be implemented as compiled functions, so that a single compiler defines the storage layout. To avoid undefined behavior, Python's C API should avoid all use of bit fields.

This issue is in response to PyO3/pyo3#1824.
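As a minimal sketch of the pattern the issue asks for (hypothetical names, not actual CPython code): the bit field lives only in a private header, and the installed header exposes an opaque type plus a compiled accessor, so a single compiler decides the layout.

```c
/* internal.h -- private header, never shipped to extensions.
   Hypothetical example; not actual CPython code. */
struct ExampleState {
    unsigned int interned:2;
    unsigned int kind:3;    /* layout chosen by the library's compiler only */
    unsigned int compact:1;
};

/* public.h -- installed header: opaque type plus a compiled accessor. */
typedef struct ExampleState ExampleState;
unsigned int Example_GetKind(const ExampleState *st);

/* example.c -- compiled once, inside the library. */
unsigned int Example_GetKind(const ExampleState *st)
{
    return st->kind;   /* the only translation unit that reads the bit field */
}
```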
What is your use case? Which functions do you need? You should not access the PyASCIIObject structure directly. Python provides many functions to access the content of a Unicode string object.
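For example, these documented accessors read string content without touching PyASCIIObject (a sketch; `s` is assumed to be a valid str object and error handling is elided):

```c
#include <Python.h>

/* Reading string content through the abstract API only. */
static void inspect(PyObject *s)
{
    Py_ssize_t len = PyUnicode_GetLength(s);   /* length in code points */
    Py_UCS4 first = PyUnicode_ReadChar(s, 0);  /* bounds-checked read */
    const char *utf8 = PyUnicode_AsUTF8(s);    /* UTF-8 buffer owned by s */
    (void)len; (void)first; (void)utf8;
}
```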
The macro PyUnicode_KIND is part of the documented public C API. It accesses the bit field "state.kind" directly.
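Paraphrased from Include/cpython/unicodeobject.h (the exact form varies across CPython versions, and real versions add assertions), the accessor boils down to:

```c
/* Paraphrase, not a verbatim copy of any one CPython version. */
#define PyUnicode_KIND(op) \
    (((PyASCIIObject *)(op))->state.kind)
```

Because this expansion happens in whichever compiler builds the extension, two binaries can disagree about where `kind` sits inside `state`.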
IMO it was a mistake to expose it as part of the public C API. This is an implementation detail which should not be exposed. The C API should not expose *directly* how characters are stored in memory, but provide an abstract way to read and write Unicode characters.

The PEP 393 implementation broke the old C API in many ways because it exposed too many implementation details. Sadly, the new C API is... not better :-( If tomorrow CPython is modified to use UTF-8 internally (as PyPy does), the C API will likely be broken *again* in many (new, funny) ways.

Eleven years after PEP 393 (Python 3.3), we are only starting to fix the old C API :-( The work will be completed over 2 or 3 Python releases (Python 3.12 or 3.13).

The C API for Unicode strings causes a lot of issues in PyPy, which uses UTF-8 internally. C extensions can fail to build on PyPy if they use functions (macros) like PyUnicode_KIND().
See also PEP 620. IMO, more generally, the C API should not expose structures, but provide ways to access them through getter and setter functions. See bpo-40120 "Undefined C behavior going beyond end of struct via a [1] arrays", which is a similar issue.
PyUnicode_KIND does *not* expose the implementation details to the programmer. If the internal representation of strings is switched to use masks and shifts instead of bit fields, PyUnicode_KIND (and others) can be adapted to the new details without breaking API compatibility.
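A sketch of that alternative (hypothetical constants, not CPython's actual layout): masks and shifts pin the layout down explicitly, so every compiler computes it identically, while an accessor keeps its signature.

```c
/* Hypothetical mask-and-shift layout; unlike bit-field placement,
   this is computed identically by every compiler. */
#define STATE_INTERNED_MASK  0x3u      /* bits 0-1 */
#define STATE_KIND_SHIFT     2
#define STATE_KIND_MASK      0x7u      /* bits 2-4 */

static inline unsigned int state_get_kind(unsigned int state)
{
    return (state >> STATE_KIND_SHIFT) & STATE_KIND_MASK;
}
```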
PyUnicode_KIND() is very specific to the exact PEP 393 implementation. Documentation of this field, excerpted from Include/cpython/unicodeobject.h:

```c
/* ...
   * character type = Py_UCS4 (32 bits, unsigned)
   * all characters are in the range U+0000-U+10FFFF
   * at least one character is in the range U+10000-U+10FFFF
 */
unsigned int kind:3;
```

I don't think that PyUnicode_KIND() makes sense if CPython uses UTF-8 tomorrow.
PyUnicode_KIND() was exposed in the *public* C API because unicodeobject.h provides functions as macros for best performance, and these macros use PyUnicode_KIND() internally. Macros like PyUnicode_READ(kind, data, index) are also designed for best performance with the exact PEP 393 implementation. The public C API should only contain PyUnicode_READ_CHAR(unicode, index): this macro doesn't use "kind" or "data", which are (again) specific to PEP 393.

In the CPython implementation, we should use the most efficient code; it's fine to use macros that access structures directly. But for the public C API, I would recommend to only provide abstractions, even if they are a little bit slower.
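To make the contrast concrete (both are real, documented macros; the sketch assumes `s` is a str of length at least 2, with error handling elided):

```c
#include <Python.h>
#include <assert.h>

/* PyUnicode_READ is tied to the PEP 393 layout: the caller must first
   extract the kind and the raw data pointer. PyUnicode_READ_CHAR needs
   only the object and an index. */
static Py_UCS4 second_char(PyObject *s)
{
    unsigned int kind = PyUnicode_KIND(s);   /* PEP 393 detail */
    void *data = PyUnicode_DATA(s);          /* PEP 393 detail */
    Py_UCS4 via_read = PyUnicode_READ(kind, data, 1);

    Py_UCS4 via_abstract = PyUnicode_READ_CHAR(s, 1);  /* abstract form */
    assert(via_read == via_abstract);
    return via_abstract;
}
```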
My use case for these low-level APIs is to write tests for low-level string/encoding handling in my custom use of the PyPreConfig and PyConfig structs. I wanted to verify that exact byte sequences were turned into specific representations inside Python strings. This includes ensuring that certain byte sequences retain their appropriate "character" width in internal storage.

I know there are alternative ways of performing this testing. But testing against the actual data structure used internally by CPython seemed the most precise, since it isolates problems to the "store in Python" side and not "what does Python do once the data is stored."
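One way to make a similar assertion without reading state.kind is to check the maximum code point instead (a sketch using only documented API; the byte sequence and expected width are illustrative):

```c
#include <Python.h>
#include <assert.h>

/* Verify that a specific byte sequence lands in a wide internal
   representation. U+1F40D needs more than 16 bits, so a correct build
   must store it with 4-byte characters. */
static void check_wide_storage(void)
{
    PyObject *s = PyUnicode_FromString("\xF0\x9F\x90\x8D");  /* U+1F40D */
    assert(s != NULL);
    assert(PyUnicode_GetLength(s) == 1);
    assert(PyUnicode_MAX_CHAR_VALUE(s) > 0xFFFF);  /* implies UCS-4 storage */
    Py_DECREF(s);
}
```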
CPython contains many checks to ensure that a string always uses the most efficient storage, especially in debug mode. The C API should not allow creating a string with inefficient storage, unless you "abuse" the C API :-D I'm not sure what you are testing.
(Also see the discussion between @davidhewitt and @encukou at https://www.youtube.com/watch?v=E_hczlgDqus.)
See also capi-workgroup/api-evolution#10
Linked PRs
- PyUnicode_Data and PyUnicode_GetKind (#115211)