gh-89188: add PyUnicode_Data and PyUnicode_GetKind #115211

Open · wants to merge 2 commits into main
Conversation

davidhewitt (Contributor) commented Feb 9, 2024

This is inspired by #102589 (comment)

The idea is to add PyUnicode_Data and PyUnicode_GetKind to the version-specific API as exported symbols, so that we can use these downstream in PyO3 instead of having a dependency on the C bitfield.
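
Roughly, the idea is something like this (a sketch, not the exact patch; I'm assuming the new functions simply forward to the existing PyUnicode_DATA / PyUnicode_KIND macros, so extensions no longer need the struct layout at compile time):

#include <Python.h>
#include <assert.h>

/* Hypothetical bodies for the two proposed exported functions; the real
   patch may differ, but the intent is just to wrap the existing macros. */
void *
PyUnicode_Data(PyObject *unicode)
{
    assert(PyUnicode_Check(unicode));
    return PyUnicode_DATA(unicode);
}

int
PyUnicode_GetKind(PyObject *unicode)
{
    assert(PyUnicode_Check(unicode));
    return PyUnicode_KIND(unicode);
}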

We built this on stream today with the help of my chat; @mejrs suggested that PyUnicode_GetData would be a better name, for consistency.

I understood that @encukou didn't want these added to the stable ABI due to concerns about exposing the PyUnicode internal encoding and the risk of dangling pointers. That's fine by me; the main aim here is just to get PyO3 off implementation details.

I guess we'll need tests and documentation too; I can add those if it looks like this API is acceptable to add.

encukou (Member) commented Feb 9, 2024

@encukou didn't want these added to the stable ABI

Sorry if I wasn't clear, and sorry to spoil the fun, but I wasn't talking about only the stable ABI, but about the API in general.

the main aim here is just to get PyO3 off implementation details.

But what you're exposing here are implementation details. We don't want to expose any more of them than we already historically have.

A good API for this will need more design. Before that, PyO3 can keep doing what it's doing now -- it's fragile, but that's because we want to change the details in the future, not set them in stone now.

Let's talk privately about how to best do this. If you want to put in more of your time, I'll be happy to mentor or pair-program.


In case other core devs don't agree with my opinion:

  • Public API needs docs and tests.
  • Additions to the C API now go through the C API WG (which I'm part of; as above, I doubt it'll approve this API)

davidhewitt (Contributor, Author)
Sorry if I wasn't clear, and sorry to spoil the fun, but I wasn't talking about only the stable ABI, but about the API in general.

No fun spoiled at all! Regardless of the acceptance of this patch, it was nice to do this on stream so that others could see some of the steps to contributing a patch to CPython (with a bit of PyO3-based testing thrown in), and also so that folks could correct me where my C knowledge fell short. 😄

I had understood from you that these are definitely not candidates for inclusion in the stable ABI, and that there are good reasons not to want these APIs as they are at all.

Before that, PyO3 can keep doing what it's doing now -- it's fragile, but that's because we want to change the details in the future, not set them in stone now.

My main worry is that it's fragile to the point of bordering on unsound, for reasons related to the bitfield discussed in #89188 (see the sketch below). The nuclear option is to deprecate this API completely on the PyO3 side and refuse to expose it to users, though it would be nice to find a gentler middle ground, especially as deprecation doesn't actually stop users from poking at these internals themselves, which might be worse.
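
For context, the fragility comes from having to mirror the state bitfield of PyASCIIObject outside CPython; abridged (roughly the 3.12 layout; the exact widths and trailing bits have changed between versions, which is exactly the problem):

/* Abridged excerpt in the spirit of Include/cpython/unicodeobject.h
   (roughly the 3.12 layout); the remaining bits and exact field widths
   vary by CPython version, which is what makes an out-of-tree mirror of
   this bitfield fragile. */
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* number of code points */
    Py_hash_t hash;             /* cached hash, or -1 if not computed */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;    /* 1-, 2- or 4-byte representation */
        unsigned int compact:1;
        unsigned int ascii:1;
        /* ... further bits differ across versions ... */
    } state;
} PyASCIIObject;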

For what it's worth, I did a little digging to see what happens downstream of PyO3 in other Rust code that touches these unicode objects. Here are a couple of use cases:

  • reading from the data in PyOxidizer's tests - @indygreg, I guess it would be helpful to understand why it was useful to check the encoded result rather than trying to read the data back out as UTF-8?
  • writing to data in orjson - this doesn't go via PyUnicode_DATA but instead casts the object, looks past the end of the structure, and writes the encoded data there directly. TBH this looks pretty extreme; I assume it was done for performance?
    • I would have hoped that PyUnicode_FromUtf8AndSize (which PyO3 uses for its string bindings) would be good enough (see the sketch after this list).
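
For reference, a sketch of the round trip that stays entirely on non-internal API; I'm using PyUnicode_FromStringAndSize (which interprets its buffer as UTF-8) and PyUnicode_AsUTF8AndSize here, which may not be the exact calls involved downstream:

#include <Python.h>

/* Build a str from a UTF-8 buffer and read it back out as UTF-8, without
   touching kind/data or the bitfield. The pointer returned by
   PyUnicode_AsUTF8AndSize is owned by the str and valid only while it lives. */
static int
utf8_round_trip(const char *buf, Py_ssize_t len)
{
    PyObject *s = PyUnicode_FromStringAndSize(buf, len);  /* buf is UTF-8 */
    if (s == NULL) {
        return -1;  /* e.g. invalid UTF-8; exception already set */
    }
    Py_ssize_t out_len;
    const char *out = PyUnicode_AsUTF8AndSize(s, &out_len);
    if (out == NULL) {
        Py_DECREF(s);
        return -1;
    }
    /* ... use out/out_len ... */
    Py_DECREF(s);
    return 0;
}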

Let's talk privately about how to best do this.

👍 let's aim to catch up soon and circle back here.

encukou (Member) commented Feb 12, 2024

My main issue is that this API encourages users to rely on the detail that they can “borrow” data in one of the formats in the current PyUnicode_KIND enum. But future PyUnicode representations might not have any of them available for borrowing.
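
Concretely, the borrowing pattern I mean looks something like this (a sketch using the existing public macros; it bakes in the assumption that a UCS1/UCS2/UCS4 buffer exists to borrow):

#include <Python.h>
#include <assert.h>

/* Read the first element by borrowing the current in-memory representation.
   A future representation without a borrowable UCS1/UCS2/UCS4 buffer would
   break this pattern. */
static Py_UCS4
first_code_point(PyObject *str)
{
    assert(PyUnicode_Check(str));
    assert(PyUnicode_GET_LENGTH(str) > 0);
    int kind = PyUnicode_KIND(str);          /* 1, 2 or 4 bytes per element */
    const void *data = PyUnicode_DATA(str);  /* borrowed; no copy is made */
    return PyUnicode_READ(kind, data, 0);
}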

check the encoded result

The elements of Python strings can be any numbers in range(0x110000). Not all of these can appear in valid Unicode text. In particular, surrogates (0xD800-0xDFFF) are allowed in str but do not form valid Unicode strings. AFAIK (do check this!), CPython's internal “UTF-8” buffer actually uses a (straightforward) extension of UTF-8. The user-facing Python methods do check:

>>> chr(0xD800)
'\ud800'
>>> chr(0xD800).encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

(And yes, the docs aren't clear on what's supposed to be valid Unicode/UTF-8 and what's not. Generally, though, we don't check.)
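
The same strict behaviour shows up at the C level; a small sketch (relying on PyUnicode_AsUTF8AndSize's documented strict error handling):

#include <Python.h>

/* A lone surrogate is representable as a str, but asking for its UTF-8 form
   through the public API fails, mirroring the REPL session above. */
static void
surrogate_demo(void)
{
    PyObject *s = PyUnicode_FromOrdinal(0xD800);  /* the str '\ud800' */
    if (s == NULL) {
        return;
    }
    Py_ssize_t size;
    const char *utf8 = PyUnicode_AsUTF8AndSize(s, &size);
    if (utf8 == NULL) {
        /* UnicodeEncodeError: surrogates not allowed */
        PyErr_Clear();
    }
    Py_DECREF(s);
}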

orjson

I consider writing the data an entirely separate use case from reading it. It should have its own API.
