gh-89188: add PyUnicode_Data and PyUnicode_GetKind #115211

Open · wants to merge 2 commits into main
Conversation

davidhewitt (Contributor) commented Feb 9, 2024

This is inspired by #102589 (comment)

The idea is to add PyUnicode_Data and PyUnicode_GetKind to the version-specific API as exported symbols, so that we can use these downstream in PyO3 instead of having a dependency on the C bitfield.
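
Roughly, the idea is something like this (a sketch, not the exact patch; I'm assuming the new functions simply forward to the existing PyUnicode_DATA / PyUnicode_KIND macros, so extensions no longer need the struct layout at compile time):

#include <Python.h>
#include <assert.h>

/* Hypothetical bodies for the two proposed exported functions; the real
   patch may differ, but the intent is just to wrap the existing macros. */
void *
PyUnicode_Data(PyObject *unicode)
{
    assert(PyUnicode_Check(unicode));
    return PyUnicode_DATA(unicode);
}

int
PyUnicode_GetKind(PyObject *unicode)
{
    assert(PyUnicode_Check(unicode));
    return PyUnicode_KIND(unicode);
}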

We built this on stream today with the help of my chat; @mejrs suggested that PyUnicode_GetData would be a better name, for consistency.

I understood that @encukou didn't want these added to the stable ABI due to concerns about exposing the PyUnicode internal encoding and the risk of dangling pointers. That's fine by me; the main aim here is just to get PyO3 off implementation details.

I guess we'll need tests and documentation too; I can add those if it looks like this API is acceptable to add.

encukou (Member) commented Feb 9, 2024

@encukou didn't want these added to the stable ABI

Sorry if I wasn't clear, and sorry to spoil the fun, but I wasn't talking about only the stable ABI, but about the API in general.

the main aim here is just to get PyO3 off implementation details.

But what you're exposing here are implementation details. We don't want to expose any more of them than we already historically have.

A good API for this will need more design. Before that, PyO3 can keep doing what it's doing now -- it's fragile, but that's because we want to change the details in the future, not set them in stone now.

Let's talk privately about how to best do this. If you want to put in more of your time, I'll be happy to mentor or pair-program.


In case other core devs don't agree with my opinion:

  • Public API needs docs and tests.
  • Additions to the C API now go through the C API WG (which I'm part of; as above, I doubt it'll approve this API)

davidhewitt (Contributor, Author)
Sorry if I wasn't clear, and sorry to spoil the fun, but I wasn't talking about only the stable ABI, but about the API in general.

No fun spoiled at all! Regardless of the acceptance of this patch, it was nice to do this on stream so that others could see some of the steps to contributing a patch to CPython (with a bit of PyO3-based testing thrown in), and also so that folks could correct me where my C knowledge fell short. 😄

I had understood from you that these are definitely not candidates for inclusion in the stable ABI, and that there are good reasons not to want these APIs as they are at all.

Before that, PyO3 can keep doing what it's doing now -- it's fragile, but that's because we want to change the details in the future, not set them in stone now.

My main worry is that it's fragile to the point of bordering on unsound, for reasons related to the bitfield discussed in #89188 (see the sketch below). The nuclear option is to deprecate this API completely on the PyO3 side and refuse to expose it to users, though it would be nice to find a gentler middle ground, especially as deprecation doesn't actually stop users from poking at these internals themselves, which might be worse.
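
For context, the fragility comes from having to mirror the state bitfield of PyASCIIObject outside CPython; abridged (roughly the 3.12 layout; the exact widths and trailing bits have changed between versions, which is exactly the problem):

/* Abridged excerpt in the spirit of Include/cpython/unicodeobject.h
   (roughly the 3.12 layout); the remaining bits and exact field widths
   vary by CPython version, which is what makes an out-of-tree mirror of
   this bitfield fragile. */
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* number of code points */
    Py_hash_t hash;             /* cached hash, or -1 if not computed */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;    /* 1-, 2- or 4-byte representation */
        unsigned int compact:1;
        unsigned int ascii:1;
        /* ... further bits differ across versions ... */
    } state;
} PyASCIIObject;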

For what it's worth, I did a little digging to see what happens downstream of PyO3 in other Rust code that touches these unicode objects. Here are a couple of use cases:

  • reading from the data in PyOxidizer's tests - @indygreg, I guess it would be helpful to understand why it was useful to check the encoded result rather than trying to read the data back out as UTF-8?
  • writing to data in orjson - this doesn't go via PyUnicode_DATA but instead casts the object, looks past the end of the structure, and writes the encoded data there directly. TBH this looks pretty extreme; I assume it was done for performance?
    • I would have hoped that PyUnicode_FromUtf8AndSize (which PyO3 uses for its string bindings) would be good enough (see the sketch after this list).
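
For reference, a sketch of the round trip that stays entirely on non-internal API; I'm using PyUnicode_FromStringAndSize (which interprets its buffer as UTF-8) and PyUnicode_AsUTF8AndSize here, which may not be the exact calls involved downstream:

#include <Python.h>

/* Build a str from a UTF-8 buffer and read it back out as UTF-8, without
   touching kind/data or the bitfield. The pointer returned by
   PyUnicode_AsUTF8AndSize is owned by the str and valid only while it lives. */
static int
utf8_round_trip(const char *buf, Py_ssize_t len)
{
    PyObject *s = PyUnicode_FromStringAndSize(buf, len);  /* buf is UTF-8 */
    if (s == NULL) {
        return -1;  /* e.g. invalid UTF-8; exception already set */
    }
    Py_ssize_t out_len;
    const char *out = PyUnicode_AsUTF8AndSize(s, &out_len);
    if (out == NULL) {
        Py_DECREF(s);
        return -1;
    }
    /* ... use out/out_len ... */
    Py_DECREF(s);
    return 0;
}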

Let's talk privately about how to best do this.

👍 let's aim to catch up soon and circle back here.

encukou (Member) commented Feb 12, 2024

My main issue is that this API encourages users to rely on the detail that they can “borrow” data in one of the formats in the current PyUnicode_KIND enum. But future PyUnicode representations might not have any of them available for borrowing.
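
Concretely, the borrowing pattern I mean looks something like this (a sketch using the existing public macros; it bakes in the assumption that a UCS1/UCS2/UCS4 buffer exists to borrow):

#include <Python.h>
#include <assert.h>

/* Read the first element by borrowing the current in-memory representation.
   A future representation without a borrowable UCS1/UCS2/UCS4 buffer would
   break this pattern. */
static Py_UCS4
first_code_point(PyObject *str)
{
    assert(PyUnicode_Check(str));
    assert(PyUnicode_GET_LENGTH(str) > 0);
    int kind = PyUnicode_KIND(str);          /* 1, 2 or 4 bytes per element */
    const void *data = PyUnicode_DATA(str);  /* borrowed; no copy is made */
    return PyUnicode_READ(kind, data, 0);
}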

check the encoded result

The elements of Python strings can be any numbers in range(0x110000). Not all of these can appear in valid Unicode text. In particular, surrogates (0xD800-0xDFFF) are allowed in str but do not form valid Unicode strings. AFAIK (do check this!), CPython's internal “UTF-8” buffer actually uses a (straightforward) extension of UTF-8. The user-facing Python methods do check:

>>> chr(0xD800)
'\ud800'
>>> chr(0xD800).encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

(And yes, the docs aren't clear on what's supposed to be valid Unicode/UTF-8 and what's not. Generally, though, we don't check.)
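
The same strict behaviour shows up at the C level; a small sketch (relying on PyUnicode_AsUTF8AndSize's documented strict error handling):

#include <Python.h>

/* A lone surrogate is representable as a str, but asking for its UTF-8 form
   through the public API fails, mirroring the REPL session above. */
static void
surrogate_demo(void)
{
    PyObject *s = PyUnicode_FromOrdinal(0xD800);  /* the str '\ud800' */
    if (s == NULL) {
        return;
    }
    Py_ssize_t size;
    const char *utf8 = PyUnicode_AsUTF8AndSize(s, &size);
    if (utf8 == NULL) {
        /* UnicodeEncodeError: surrogates not allowed */
        PyErr_Clear();
    }
    Py_DECREF(s);
}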

orjson

I consider writing the data an entirely separate use case from reading it. It should have its own API.
