[C API] PEP 756: Add PyUnicode_Export() and PyUnicode_Import() functions #119609
Comments
Add PyUnicode_AsNativeFormat() and PyUnicode_FromNativeFormat() functions to the C API.
cc @davidism |
Add PyUnicode_AsNativeFormat() and PyUnicode_FromNativeFormat() functions to the C API.
I believe we can do better. I'll write a longer reply this week. |
My gut feeling as well. I don't think we want to encourage authors to specialise their own code when using the limited API. An efficient, abstraction-agnostic, multiple find-replace should be our job (if it's an important scenario, which I'm inclined to think it probably is). |
C extensions already specialize their code for the Unicode native format (ex: MarkupSafe). The problem is that they cannot update their code to the limited API because of missing features to export to/import from native formats.
That sounds like a very specific API solving only one use case. |
Which means we forced them into it, not that they wanted to. Just because we did something not great for them in the past doesn't mean we have to double down on it in the future.
The same could be said about your proposed API. Do you have more use cases? I'm not going to argue heavily about this case until Petr provides his writeup. I haven't given it a huge amount of thought, just felt it was worth adding a +1 to "this doesn't feel great". |
What do you mean? I don't force users to use this API. It's an API which can be used to specialize code for UCS1, UCS2 and UCS4: it's the only way to write the most efficient code for the current implementation of Python.
Python itself has a wide library to specialize code for UCS1 / UCS2 / UCS4 in
There are different operations which are already optimized. The problem is that the macro PyUnicode_WRITE_CHAR() is implemented with 2 tests, which makes it inefficient. Well, in most cases, PyUnicode_WRITE_CHAR() is good enough. But not when you want the optimal code, as in MarkupSafe, which has a single function optimized in C. There are 27 projects using
There are already other, less efficient, ways to export a string, depending on its maximum character. But some of these functions are excluded from the limited C API.
Proposed
|
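For context, this is roughly the kind of per-kind specialization such projects write today with the non-limited C API. The function and the escape set below are illustrative only, not MarkupSafe's actual code:

```c
#include <Python.h>

/* Illustrative only (not MarkupSafe's code): count '<', '>' and '&'
   characters, with the loop specialized per internal kind so that no
   per-character kind test is needed. Requires the non-limited C API. */
static Py_ssize_t
count_special(PyObject *str)
{
    Py_ssize_t n = 0, len = PyUnicode_GET_LENGTH(str);
    const void *data = PyUnicode_DATA(str);

    switch (PyUnicode_KIND(str)) {
    case PyUnicode_1BYTE_KIND: {
        const Py_UCS1 *p = (const Py_UCS1 *)data;
        for (Py_ssize_t i = 0; i < len; i++)
            n += (p[i] == '<' || p[i] == '>' || p[i] == '&');
        break;
    }
    case PyUnicode_2BYTE_KIND: {
        const Py_UCS2 *p = (const Py_UCS2 *)data;
        for (Py_ssize_t i = 0; i < len; i++)
            n += (p[i] == '<' || p[i] == '>' || p[i] == '&');
        break;
    }
    case PyUnicode_4BYTE_KIND: {
        const Py_UCS4 *p = (const Py_UCS4 *)data;
        for (Py_ssize_t i = 0; i < len; i++)
            n += (p[i] == '<' || p[i] == '>' || p[i] == '&');
        break;
    }
    }
    return n;
}
```

None of the macros used above (PyUnicode_KIND, PyUnicode_DATA, the kind constants) are available in the limited C API, which is the gap this issue is about.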
"Most efficient code" is not the job of the limited API. At a certain level of exposure to internals, you need to stop using the limited API, or else accept that your performance will be reduced. If the But this proposal is literally about leaking implementation details. That is against the intent of the limited API, and so I am against the proposal. Thanks for making me think about it, you've upgraded me from "unsure" to "sure" ;) Still looking forward to Petr's thoughts. |
@davidhewitt: Would Rust benefit from such a "native format" API? Or does Rust just prefer UTF-8 and then decode UTF-8 in Python?

@da-woods: Would Cython benefit from such a "native format" API? I see that Cython uses the following code in

```c
#if CYTHON_COMPILING_IN_LIMITED_API
    // PyUnicode_Substring() does not support negative indexing but is otherwise fine to use.
    return PyUnicode_Substring(text, start, stop);
#else
    return PyUnicode_FromKindAndData(PyUnicode_KIND(text),
        PyUnicode_1BYTE_DATA(text) + start*PyUnicode_KIND(text), stop-start);
#endif
```
|
At the very least, please rename it to "InternalFormat" instead of "NativeFormat". At first I thought this was going to be a great API for following the conventions of the native machine, since the name led me astray. It took me a few reads to figure out that it was telling me the format, not that I was getting to choose it. |
I looked at how some of these projects use
They usually use I do not see how |
The Cython usage discussed above is an attempt at micro-optimizing unicode slicing. I think the need to handle UTF-8 as a possibility would mean we couldn't usefully use this as a replacement here (because of the variable character length). In this case I think
There are maybe a few places we could use
|
PyUnicode_NATIVE_UTF8 is not used internally by Python, so PyUnicode_AsNativeFormat() doesn't use it. I added it for PyPy, if PyPy would like to adopt this C API. Maybe I can remove it to avoid confusion. |
Is it OK for an implementation to have a native format that's not in the fixed list? And if so, should it set an exception? That suggests that, to be truly compatible, extension authors should be prepared to have a fallback non-optimized implementation for valid unicode objects without a native format? |
The result of
I don't think the limited API should avoid optimizations. We just need to
Anyway, the issue I have with the proposed API is: What should the fallback be? Generally, if we can't do a zero-copy export (as
I'll skip exploring the former here. (I did think about that part of that design space.)

For the latter, we have a rather nice mechanism:

```c
PyAPI_FUNC(int) PyUnicode_GetBuffer(
    PyObject *unicode,
    Py_buffer *buf,
    int buffer_flags,        // as for PyObject_GetBuffer
    int32_t supported_formats,
    int32_t *result_format);
```

with supported_formats being flags indicating what the caller can take:

```c
#define PyUnicode_BUFFER_FORMAT_ASCII 0x01
#define PyUnicode_BUFFER_FORMAT_UCS1  0x02
#define PyUnicode_BUFFER_FORMAT_UCS2  0x04
#define PyUnicode_BUFFER_FORMAT_UCS4  0x08
#define PyUnicode_BUFFER_FORMAT_UTF8  0x10
```

On success,
The new idea here (AFAIK, at least for stdlib) is that
The beauty of this design is that it can also do the zero-copy export. In other words, if you guess the internals correctly (or are prepared to handle

If either of those is a performance issue, then you probably shouldn't

If we do add the

```c
PyAPI_FUNC(int) PyUnicode_FromBuffer(
    // (edit: removed mistaken argument)
    void *data,
    Py_ssize_t nbytes,
    int32_t data_format);
```

(It's not necessary to require the full

We've been exploring this kind of API with @davidhewitt. |
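To make the shape concrete, here is a minimal, hypothetical caller of the PyUnicode_GetBuffer() sketched above. None of the PyUnicode_BUFFER_* names or PyUnicode_GetBuffer() exist in CPython today; only PyBUF_SIMPLE and PyBuffer_Release() are real API:

```c
/* Hypothetical usage of the proposed PyUnicode_GetBuffer(): the caller
   declares which formats it can handle and branches on whichever one
   the implementation actually hands back. */
static int
consume(PyObject *unicode)
{
    Py_buffer view;
    int32_t fmt;
    int32_t accepted = PyUnicode_BUFFER_FORMAT_UCS1
                       | PyUnicode_BUFFER_FORMAT_UCS2
                       | PyUnicode_BUFFER_FORMAT_UCS4;

    if (PyUnicode_GetBuffer(unicode, &view, PyBUF_SIMPLE,
                            accepted, &fmt) < 0) {
        return -1;              /* exception set; or take a slower fallback */
    }
    switch (fmt) {
    case PyUnicode_BUFFER_FORMAT_UCS1: /* view.buf: 1 byte per code point */ break;
    case PyUnicode_BUFFER_FORMAT_UCS2: /* view.buf: 2 bytes per code point */ break;
    case PyUnicode_BUFFER_FORMAT_UCS4: /* view.buf: 4 bytes per code point */ break;
    }
    PyBuffer_Release(&view);    /* releases either the zero-copy view or a converted copy */
    return 0;
}
```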
Consider me a big fan (probably unsurprising given the similarity to my all-API forward compatibility proposal). An API that always works, is as fast as you let it be, and makes clear that you will need a fallback is about as ideal as I think things can get. |
I like the
I abandoned my draft PEP PyResource for 2 reasons:
I dislike reusing
I would prefer a more specialized ("simpler") API:
|
It does, but it also allows the callers to specify which features they can handle. (Hmm, I think I've seen that before!) I like reusing API we already have, but I won't die on that hill. |
@zooba @serhiy-storchaka: Do you prefer PyBuffer or pointer+size for these specific APIs? |
I'm fine with either. Maybe +0 towards PyBuffer just so that the API doesn't necessarily have to be allocation-free (that is, the object being freed is different from the original one - otherwise, we'd have to track all additional allocations against the original object). But it does feel a bit heavy for a case where the point is to look inside our internal data structures. I'd prefer that kind of operation to feel as rough as possible, to discourage people from doing it 😉 |
Sorry to be slow to reply here; I started writing a reply describing the investigation I did with @encukou, and then they managed to finish writing first 😁 I am happy with either the buffer or specialized form of
You are welcome to take my branch commit which @encukou and I built, if it's useful. It currently lacks tests IIRC 🫣 |
Add PyUnicode_AsNativeFormat() and PyUnicode_FromNativeFormat() functions to the C API.
I updated the PR to a new API:

```c
#define PyUnicode_FORMAT_ASCII 0x01
#define PyUnicode_FORMAT_UCS1 0x02
#define PyUnicode_FORMAT_UCS2 0x04
#define PyUnicode_FORMAT_UCS4 0x08
#define PyUnicode_FORMAT_UTF8 0x10

// Get the content of a string in the requested format:
// - Return the content, set '*size' and '*format' on success.
// - Set an exception and return NULL on error.
//
// The export must be released by PyUnicode_ReleaseExport().
PyAPI_FUNC(const void*) PyUnicode_Export(
    PyObject *unicode,
    uint32_t supported_formats,
    Py_ssize_t *size,
    uint32_t *format);

// Release an export created by PyUnicode_Export().
PyAPI_FUNC(void) PyUnicode_ReleaseExport(
    PyObject *unicode,
    const void* data,
    uint32_t format);

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_Import(
    const void *data,
    Py_ssize_t size,
    uint32_t format);
```
It's possible to export any string to UCS4 and UTF-8. In the UTF-8 case, the encoded UTF-8 string is cached in the Unicode object, similar to PyUnicode_AsUTF8(). Currently, I didn't implement exporting UCS1 to UCS2; only export to UCS4 or UTF-8 is supported. Well, obviously, if the requested formats include ASCII and UCS1, the native format is used (no copy or conversion needed). EDIT: I changed the format type to |
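A minimal usage sketch of the API proposed in this comment; the PyUnicode_FORMAT_* constants and the functions below are the proposal from the PR, not released CPython API:

```c
/* Sketch only: caller asks for the formats it can handle and branches
   on the one the export actually produced. */
static int
consume_export(PyObject *unicode)
{
    Py_ssize_t size;
    uint32_t format;
    const void *data = PyUnicode_Export(
        unicode,
        PyUnicode_FORMAT_ASCII | PyUnicode_FORMAT_UCS1 | PyUnicode_FORMAT_UTF8,
        &size, &format);
    if (data == NULL) {
        return -1;                      /* exception set */
    }
    if (format == PyUnicode_FORMAT_UTF8) {
        /* data points to the UTF-8 encoding (cached in the str object) */
    }
    else {
        /* PyUnicode_FORMAT_ASCII or PyUnicode_FORMAT_UCS1: 1 byte per code point */
    }
    PyUnicode_ReleaseExport(unicode, data, format);
    return 0;
}
```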
Today, sure, but it protects us for the future, and allows other implementations to handle more cases. I find "release export" a slightly uncomfortable name - did we have any other ideas? The API reminds me of
But I do prefer this naming to "native format", so consider me in favour, but happy to see a name change if someone comes up with a brilliant idea. |
I started with |
Oh sure, I just described my current implementation. |
@encukou @serhiy-storchaka @da-woods: What do you think of the updated API #119609 (comment)? |
I won't have time for a full review this week.
Please use |
I won't have time to review it this week.
Please use |
Ok, done. |
Ping. Do you have time for a review this week? :-) |
Hi, |
There is an issue with your PR: it has something like 150 commits. |
Add PyUnicode_Export() and PyUnicode_Import() functions to the C API.
Yes, I merged in the main branch to fix the conflicts. If you fix conflicts in your PR, I can rebase and just leave the last 6.
And of course I'd still prefer exporting |
@encukou: I integrated most of your suggestions in my PR. Please see the updated PR. |
I created Add PyUnicode_Export() and PyUnicode_Import() to the limited C API issue in the C API WG Decisions project. |
Add PyUnicode_Export() and PyUnicode_Import() functions to the C API.
Add PyUnicode_Export(), PyUnicode_GetBufferFormat() and PyUnicode_Import() functions to the limited C API.
PEP 756 is withdrawn. |
Feature or enhancement
PEP 393 – Flexible String Representation changed the Unicode implementation in Python 3.3 to use 3 string "kinds":
- PyUnicode_KIND_1BYTE (UCS-1): ASCII and Latin1, [U+0000; U+00ff] range.
- PyUnicode_KIND_2BYTE (UCS-2): BMP, [U+0000; U+ffff] range.
- PyUnicode_KIND_4BYTE (UCS-4): Full Unicode Character Set, [U+0000; U+10ffff] range.

Strings must always use the optimal storage: for example, an ASCII string cannot be stored as PyUnicode_KIND_2BYTE.
Strings have a flag indicating if the string only contains ASCII characters: [U+0000; U+007f] range. It's used by multiple internal optimizations.
This implementation is not exposed in the limited C API. For example, the PyUnicode_FromKindAndData() function is excluded from the stable ABI. Said differently, it's not possible to write efficient code for PEP 393 using the limited C API.

I propose adding two functions:
- PyUnicode_AsNativeFormat(): export to the native format
- PyUnicode_FromNativeFormat(): import from the native format

These functions are added to the limited C API version 3.14.
Native formats (new constants):
- PyUnicode_NATIVE_ASCII: ASCII string.
- PyUnicode_NATIVE_UCS1: UCS-1 string.
- PyUnicode_NATIVE_UCS2: UCS-2 string.
- PyUnicode_NATIVE_UCS4: UCS-4 string.
- PyUnicode_NATIVE_UTF8: UTF-8 string (CPython implementation detail: only supported for import, not used by export).

Differences with PyUnicode_FromKindAndData():
- The PyUnicode_NATIVE_ASCII format allows further optimizations.
- PyUnicode_NATIVE_UTF8 can be used by PyPy and other Python implementations using UTF-8 as the internal storage.
API:
See the attached pull request for more details.
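For comparison, this is the kind of existing non-limited call that the proposed import function mirrors; PyUnicode_FromKindAndData() is available today but excluded from the limited C API, and the buffer below is just an example:

```c
#include <Python.h>

/* Existing non-limited API: build a str from a UCS-2 buffer. This is the
   kind of call that PyUnicode_FromNativeFormat() would make possible
   under the limited C API / stable ABI. */
static PyObject *
make_ucs2_example(void)
{
    static const Py_UCS2 data[] = {0x0041, 0x00e9, 0x20ac};   /* "A", "é", "€" */
    return PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND,
                                     data, Py_ARRAY_LENGTH(data));
}
```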
This feature was requested so that the MarkupSafe C extension can be ported to the limited C API. Currently, each release requires producing around 60 wheel files, which take 20 minutes to build: https://pypi.org/project/MarkupSafe/#files
Using the stable ABI would reduce the number of wheel packages and so ease their release process.
See src/markupsafe/_speedups.c: string functions specialized for the 3 string kinds (UCS-1, UCS-2, UCS-4).
Linked PRs