Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEP 756: Add PyUnicode_Export() and PyUnicode_Import() to the limited C API #33

Closed
vstinner opened this issue Jun 24, 2024 · 60 comments
Closed
Labels

Comments

@vstinner
Copy link

vstinner commented Jun 24, 2024

See PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions.

@encukou
Copy link
Collaborator

encukou commented Jun 25, 2024

I would prefer to use Py_buffer rather than 3 separate pieces of info that need to be kept around and passed to PyUnicode_ReleaseExport, but I'm OK with this too.

If a memory copy is needed, PyUnicode_ReleaseExport() is responsible to release it.

I'll note that the user must always call PyUnicode_ReleaseExport; there's no public indication about whether you have a copy or not.
The intent is that if string internal layout changes in the future, this API will still work as before -- it'll just be slower and use more memory.

I don't know if we should add a convenient constant for "all native formats"

I think we should not, unless there's a use case I don't see.

I'm OK with a library using knowledge of CPython internals to “guess” that UCSn will be preferred to UTF-8 if all of UCS1-UCS4 are given. The price of guessing wrong is “only” a performance decrease. And to support PyPy's native format, you should add UTF-8 to the formats you support rather than look at PyUnicode_FORMAT_NATIVE.

@vstinner
Copy link
Author

I'll note that the user must always call PyUnicode_ReleaseExport

Right, that's what the documentation says. It must always be called. I was just describing the implementation.

I think we should not, unless there's a use case I don't see.

The idea of PyUnicode_NATIVE_FORMATS would be to be a mask of formats which are supported natively by CPython or PyPy. CPython would define it as:

#define PyUnicode_NATIVE_FORMATS \
    (PyUnicode_FORMAT_ASCII \
     | PyUnicode_FORMAT_UCS1 \
     | PyUnicode_FORMAT_UCS2 \
     | PyUnicode_FORMAT_UCS4)

PyPy would define it as:

#define PyUnicode_NATIVE_FORMATS PyUnicode_FORMAT_UTF8

But well, we can add it later if needed.

@malemburg
Copy link

malemburg commented Jun 26, 2024

I think it would be better to use a slightly different signature which separates error reporting from returning data. The buffer API (and many other Python C APIs) use the same approach.

// Get the content of a string in one of the requested formats:
// - Set `**data`, '*nbytes' and '*format' on success, with the data available in `*data`
// - Set an exception and return -1 on error, 0 on success.
// - The function tries to avoid copying, so passing in multiple formats is advised.
//
// The export must be released using PyUnicode_ReleaseExport() in all cases.
//
PyAPI_FUNC(int) PyUnicode_Export(
    PyObject *unicode,
    uint32_t requested_formats,
    const void **data,
    Py_ssize_t *nbytes,
    uint32_t *format);

The release function may also need an int return value to report errors, e.g. if it finds that the data buffer does not match the Unicode object one, even though the format is the correct one.

I'm also not sure how PyUnicode_ReleaseExport() will deal with exports which do require copying data, e.g. if the Unicode object has a different kind than requested. In the PR, the function basically is a no-op in most cases, which looks wrong.

Update: Fixed a bug in the sig and clarified the comment wording a bit more.

@vstinner
Copy link
Author

I'm also not sure how PyUnicode_ReleaseExport() will deal with exports which do require copying data, e.g. if the Unicode object has a different kind than requested. In the PR, the function basically is a no-op in most cases, which looks wrong.

If format doesn't match the Python str kind, a copy is needed. In this case, PyUnicode_ReleaseExport() release the memory. Example:

  • input UCS1, export UCS2
  • input UCS2, export UCS4
  • input <any kind>, export UTF-8

@mdboom
Copy link

mdboom commented Jun 27, 2024

I would prefer to use Py_buffer rather than 3 separate pieces of info that need to be kept around and passed to PyUnicode_ReleaseExport, but I'm OK with this too.

+1 to this from me, but if it were a memoryview, it could just be DECREF'd. That would also allow in the non-copy case to keep the string alive as long as all views on it are alive.

What are the intended semantics when PyUnicode_Export is followed by freeing the string, and then using the returned buffer? It seems like that would work if the types didn't match, and it copied, but not if it was non-copying. I think using a memoryview would sidestep that problem by giving us the ability to properly reference count the buffers.

@vstinner
Copy link
Author

How do you express for 5 different formats in Py_buffer format? There are no ASCII, UCS1, UCS2 and UTF-8 formats in https://docs.python.org/dev/library/array.html table, only "w" for UCS4.

@vstinner
Copy link
Author

@malemburg:

I think it would be better to use a slightly different signature which separates error reporting from returning data.

Ok, I made this change. Previously, I hesitated to do it, the API looked "unnatural" and difficult to use. So I like better returning -1 on error or 0 on success.

@mdboom
Copy link

mdboom commented Jun 28, 2024

How do you express for 5 different formats in Py_buffer format? There are no ASCII, UCS1, UCS2 and UTF-8 formats in https://docs.python.org/dev/library/array.html table, only "w" for UCS4.

Use b, b, h, respectively? The fact that it's text is erased from the void * pointer returned in the original proposal here anyway, so I don't think using ints of the right type here is any worse.

@mdboom
Copy link

mdboom commented Jun 28, 2024

To clarify, I think my main concern is that, as proposed this doesn't return a PyObject * of some sort, so the memory management has unexpected pitfalls and doesn't really match much of the rest of the C API. Particularly, think of code that always requests UCS4 -- it might always work as long as the string contains only ASCII, and thus is a copy, but then the user puts an emoji in the string and suddenly it's not a copy and the assumptions in the user's code breaks. I think semantic changes like this based on the values of objects can be particular difficult to test and confirm correctness of (for the C API's users). By returning a PyObject * and using reference counting semantics, at least it will match much of the rest of the C API.

I feel less strongly about whether the PyObject * returned is a memoryview or some new PyObject * subclass just for this purpose. Reusing memoryview reduces the amount of new API, but as @vstinner pointed out, maybe (at least for self-descriptive reasons) it would need to grow new data types to be more specific that it contains string data.

@encukou
Copy link
Collaborator

encukou commented Jul 1, 2024

IMO, the PyObject abstraction primarily allows multiple owners for the same data. For simpler types that don't need that, like array export, it's OK to add bespoke “release” API.

articularly, think of code that always requests UCS4 -- it might always work as long as the string contains only ASCII, and thus is a copy, but then the user puts an emoji in the string and suddenly it's not a copy and the assumptions in the user's code breaks.

What failure mode did you have in mind?
IMO, the main way this particular case can fail is when user writes into the array. We can't prevent that with PyObject*.

@vstinner
Copy link
Author

vstinner commented Jul 1, 2024

Ok, here is something else. What about adding a PyUnicodeExport structure which holds 4 members and a strong reference to the Python str object?

#define PyUnicode_FORMAT_UCS1 0x01
#define PyUnicode_FORMAT_UCS2 0x02
#define PyUnicode_FORMAT_UCS4 0x04
#define PyUnicode_FORMAT_UTF8 0x08

typedef struct PyUnicodeExport {
    PyObject *obj;
    uint32_t format;
    const void *data;
    Py_ssize_t nbytes;
} PyUnicodeExport;

// Get the content of a string in the requested format:
// - Set '*unicode_export' and return 0 on success.
// - Set an exception and return -1 on error.
//
// The export must be released by PyUnicode_ReleaseExport().
PyAPI_FUNC(int) PyUnicode_Export(
    PyObject *unicode,
    uint32_t requested_formats,
    PyUnicodeExport *unicode_export);

// Release an export created by PyUnicode_Export().
PyAPI_FUNC(void) PyUnicode_ReleaseExport(
    PyUnicodeExport *unicode_export);

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_Import(
    const void *data,
    Py_ssize_t nbytes,
    uint32_t format);

UPDATE: I removed PyUnicode_FORMAT_ASCII, I don't think that it brings much value over PyUnicode_FORMAT_UCS1. And PyUnicode_IS_ASCII() can be used to check if a Python str is ASCII-only.

UPDATE: I renamed export to unicode_export since export name causes compilation errors on Windows (MSC compiler).

Example of usage:

    PyUnicodeExport unicode_export;
    if (PyUnicode_Export(obj, requested_formats, &unicode_export) < 0) {
        // handle error
    }
    // ... use unicode_export ...
    PyUnicode_ReleaseExport(&unicode_export);

@malemburg
Copy link

Looks good. I assume obj will be set to unicode with incremented refcount, right ? And the release function will then take care of the DECREF ?

I still think that allowing the release function to return an int error code is important to have. It may be needed in the future when releasing does more than a DECREF and free().

@vstinner
Copy link
Author

vstinner commented Jul 2, 2024

Looks good. I assume obj will be set to unicode with incremented refcount, right ? And the release function will then take care of the DECREF ?

Yes, export.obj holds a strong reference to the Python str object.

I still think that allowing the release function to return an int error code is important to have. It may be needed in the future when releasing does more than a DECREF and free().

I don't see what the caller would do with an error on PyUnicode_ReleaseExport(). I prefer to log it directly as a warning or an unraisable exception in PyUnicode_ReleaseExport(). Usually, Python C API doesn't let a "free" function to report errors. Py_DECREF() is a good example: Py_DECREF() can fail and return "void".

@ncoghlan
Copy link

ncoghlan commented Jul 4, 2024

Along the same lines as @mdboom 's concern, when Py_buffer was added, its non-PyObject-semantics genuinely caused problems until 3.3 revamped the memoryview implementation and added PyMemoryView_FromMemory to the C API (after the fixes, you could pretty much just use memoryview everywhere if you didn't want to deal with managing the lifecycle of Py_buffer references yourself).

I suspect a similar split-level API where a Py_text_export struct is combined with a wrapping PyTextExport object may end up making sense here, too - folks that don't want to worry about the memory management details can use a PyTextExport instance, folks that want to avoid the object overhead can use Py_text_export directly.

(The non-PyObject C struct should also have a snake case name along the same lines as Py_buffer, rather than a PyCamelCase name that looks like a Python type name)

As far as convenience formats go, I do think there's one that would be worth defining:

#define PyUnicode_FORMAT_UCS \
     (PyUnicode_FORMAT_UCS1 \
     | PyUnicode_FORMAT_UCS2 \
     | PyUnicode_FORMAT_UCS4)

That way API clients can easily request an export that uses the smallest fixed width format that can handle every code point in the string to be exported.

Edit: while I don't think alternative string types are common enough to be worth defining a magic method protocol up front, I do wonder if it might be worth choosing a C API naming scheme that leaves open the possibility of adding one later. Specifically, using the PyText prefix and more explicitly aligning @vstinner's proposed API with the binary buffer protocol (but still keeping the "text export" terminology clearly distinct from the "binary buffer" terminology):

#define PyText_FORMAT_UCS1 0x01
#define PyText_FORMAT_UCS2 0x02
#define PyText_FORMAT_UCS4 0x04
#define PyText_FORMAT_UTF8 0x08

#define PyText_FORMAT_UCS (PyText_FORMAT_UCS1 | PyText_FORMAT_UCS2 | PyText_FORMAT_UCS4)

typedef struct Py_text_export {
    PyObject *obj;
    uint32_t format;
    const void *data;
    Py_ssize_t nbytes;
} Py_text_export;

// Get the content of a string in the requested format:
// - Set '*text' and return 0 on success.
// - Set an exception and return -1 on error.
//
// API signature intentionally similar to PyObject_GetBuffer().
// The export must be released by PyText_ReleaseExport().
// "Export" verb used to indicate data copying may be required, but will be skipped if possible
PyAPI_FUNC(int) PyText_Export(PyObject *exporter, Py_text_export *text, uint32_t requested_formats);

// Release an export created by PyText_Export().
// API signature intentionally similar to PyBuffer_Release().
PyAPI_FUNC(void) PyText_ReleaseExport(Py_text_export *text);

// And then in direct analogy to PyMemoryView, a helper API to manage text export lifecycles
// - text export objects would be pure data holders with no behaviour
// - exports have to be imported as Unicode objects to perform string operations
PyAPI_FUNC(PyObject *) PyTextExport_FromObject(PyObject *exporter, uint32_t requested_formats);
PyAPI_FUNC(PyObject *) PyTextExport_FromMemory(const void *data, Py_ssize_t nbytes, uint32_t format);
PyAPI_FUNC(PyObject *) PyTextExport_FromText(Py_text_export *text);
PyAPI_FUNC(Py_text_export *) PyTextExport_GetText(PyObject *export);

// PyObject_Str() would do the right thing with PyTextExport instances, so new
// APIs would only be needed for creating strings from the text export layouts

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
// API signature intentionally similar to PyMemoryView_FromMemory().
// "Import" verb used to indicate that data is copied rather than referenced
PyAPI_FUNC(PyObject*) PyUnicode_Import(const void *data, Py_ssize_t nbytes, uint32_t format);

// API signature intentionally similar to PyMemoryView_FromBuffer().
// "Import" verb used to indicate that data is copied rather than referenced
PyAPI_FUNC(void) PyUnicode_ImportText(Py_text_export *text);

@serhiy-storchaka

This comment was marked as off-topic.

@serhiy-storchaka

This comment was marked as off-topic.

@encukou
Copy link
Collaborator

encukou commented Jul 22, 2024

Along the same lines as @mdboom 's concern, when Py_buffer was added, its non-PyObject-semantics genuinely caused problems until 3.3 revamped the memoryview implementation and added PyMemoryView_FromMemory to the C API (after the fixes, you could pretty much just use memoryview everywhere if you didn't want to deal with managing the lifecycle of Py_buffer references yourself).

I hate to be the broken record here, but... we already have an export struct that can be wrapped in a memoryview.
The piece that's missing in Py_buffer is a public field for the format. Perhaps something like this would work for it:

int PyUnicode_GetBufferFormat(const Py_buffer *buf, uint32_t *result);
// If `buf` is a buffer exported from a str (PyUnicode) object,
//   set `*result` to the corresponding `PyUnicode_FORMAT_*` value,
//   and return 0.
// Otherwise, or on another error, set `*result` to 0 and
//   return -1 with an exception set.

Internally, the format would be stored in buf->internal.

But in many cases this wouldn't be needed: if you ask for UTF-8 only you already know the format, and if you ask for any combination of PyUnicode_FORMAT_UCS*, you can use buf->itemsize.

@vstinner
Copy link
Author

vstinner commented Sep 5, 2024

Ok, here is something completely different :-) I rewrote the API based on Py_buffer! (Hopefully, Py_buffer is part of the limited C API since Python 3.11 in Include/pybuffer.h).

I wrote a new PR for that: python/cpython#123738

I also updated the API in the first comment of this issue.

@mdboom @encukou @zooba: Is it closer to what you expected? It's not a DECREF() but PyBuffer_Release() which must be called once the caller is done with the PyImport_Export() "view" (Py_buffer).

The Py_buffer length (view.len) is a number of characters and view.itemsize is the number of bytes of a single character, so view.len * view.itemsize is the string size in bytes.

I used formats "B" (1), "H" (2) and "I" (4, or "L" if int is 16-bit) for view.format, whereas view.internal stores the "Unicode export format": the public PyUnicode_GetBufferFormat() function must be used to get this Unicode export format.

PyUnicode_Import() still requires a number of bytes for its nbytes parameter.

@malemburg
Copy link

Hmm, unless I'm missing something, this removes the possibility to import/export UTF-8 data (which is a variable length encoding, so there is no fixed itemsize).

Given that there has been a lot of talk about eventually moving to UTF-8 a new single native format for CPython, I'd like to see some way to support zero-copy for this future implementation approach. You had this in your original proposal.

This could be done by defining a new "U" format character specifically for these APIs. itemsize would then be 1, since it's not meaningful and len point to the length of the string in bytes (not in code points).

@vstinner
Copy link
Author

vstinner commented Sep 5, 2024

Hmm, unless I'm missing something, this removes the possibility to import/export UTF-8 data (which is a variable length encoding, so there is no fixed itemsize).

Ah, for PyUnicode_FORMAT_UTF8, view.len is the total number of bytes: view.itemsize is fixed to 1 (byte).

@vstinner
Copy link
Author

vstinner commented Sep 5, 2024

A side effect of this change is to add the __release_buffer__() method to the built-in str type. The method is automatically generated by Objects/typeobject.c.

The buffer protocol is quite safe, it has multiple checks before calling unicode_releasebuffer():

$ ./python
>>> "abc".__release_buffer__(1)
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    "abc".__release_buffer__(1)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^
TypeError: expected a memoryview object

>>> m=memoryview(b'abc')

>>> "abc".__release_buffer__(m)
Traceback (most recent call last):
  File "<python-input-2>", line 1, in <module>
    "abc".__release_buffer__(m)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^
ValueError: memoryview's buffer is not this object

(See wrap_releasebuffer() in Objects/typeobject.c.)

In short, you're only allowed to call str.__release_buffer__() on a memoryview (Py_buffer) created by PyUnicode_Export().

Note: I also had to implement collections.UserString.__release_buffer__() to fix test_collections (the UserString simply raises NotImplementedError).


If we don't want to add str.__release_buffer__(), the Py_buffer approach should be abandoned and we should come back to the initial idea of a specialized PyUnicode_ReleaseExport() function without the Py_buffer API.

IMO it's ok to add str.__release_buffer__(). Moreover, it's a deliberate choice to expose it: PEP 688 – Making the buffer protocol accessible in Python.

@zooba
Copy link

zooba commented Sep 5, 2024

Does implementing the buffer protocol like this mean that memoryview(str) will start to work? That seems undesirable, at least without more oversight.

Otherwise, I like the idea. But getting access to low-level details really ought to require a C API, and since we seem to be adding more and more of these it'd be nice to maximise consistency (either across the new APIs, or with whatever exists, like Py_buffer).

@vstinner
Copy link
Author

vstinner commented Sep 5, 2024

Does implementing the buffer protocol like this mean that memoryview(str) will start to work?

No. I didn't implement the "get buffer" slot (PyUnicodeType.tp_as_buffer.bf_getbuffer), only the "release buffer" slot (PyUnicodeType.tp_as_buffer.bf_releasebuffer), so memoryview(str) still fails as expected.

For now, only the C API, PyUnicode_Export(), is able to fill a view (Py_buffer) for a Python str object.

@vstinner
Copy link
Author

vstinner commented Sep 8, 2024

@zooba
Copy link

zooba commented Sep 9, 2024

Did we discuss whether PyUnicode_Export should return the actual format as an output variable? It could be optional since the additional function appears (or I assume the function fails if only one format is requested and it's not available), but otherwise it seems like it would always be needed?

@vstinner
Copy link
Author

vstinner commented Sep 9, 2024

Did we discuss whether PyUnicode_Export should return the actual format as an output variable? It could be optional since the additional function appears (or I assume the function fails if only one format is requested and it's not available), but otherwise it seems like it would always be needed?

I added back the format parameter to PyUnicode_Export() and I removed PyUnicode_GetBufferFormat(). If PyUnicode_Export() gives the format, I don't see the usage of the PyUnicode_GetBufferFormat() function. Or maybe someone sees an usage for this function?

@vstinner
Copy link
Author

vstinner commented Sep 9, 2024

I assume the function fails if only one format is requested and it's not available

Right. For example, if you request UCS1 and the string uses UCS2, you get an error: the string cannot be exported as UCS1.

@zooba
Copy link

zooba commented Sep 9, 2024

I don't see the usage of the PyUnicode_GetBufferFormat() function. Or maybe someone sees an usage for this function?

Yeah, I don't really see a use, unless someone was to pass you the buffer and you know it's a PyUnicode but you don't know what it is... there's no need to encourage that. Nobody is using this API yet, so if they start, they can also pass the format.

I voted for the addition with your changes.

@serhiy-storchaka
Copy link

If tomorrow, Python switchs to UTF16, we should add a new PyUnicode_FORMAT_UTF16 format (constant)

I only want to check that the API is extensible in this direction. What could be a mnemonic value for such constant if 16 is already taken for ASCII?

What would be the usage of this information?

Just an additional bit of information. I have no currently use for it. It can be obtained by checking view.obj.

Do you suggest to remove PyUnicode_Type.tp_as_buffer.bf_releasebuffer?

It is not needed.

@vstinner
Copy link
Author

I only want to check that the API is extensible in this direction. What could be a mnemonic value for such constant if 16 is already taken for ASCII?

The next available constant value is 0x20. So it would be 0x20.

It is not needed.

I don't know Py_buffer. How does it find how to be released if the type has no bf_releasebuffer? Would you mind to elaborate?

@vstinner
Copy link
Author

For example, if you request UCS2 and the string is UCS1, some memory is allocated, and PyBuffer_Release() must release this memory.

@serhiy-storchaka
Copy link

The next available constant value is 0x20. So it would be 0x20.

But this breaks a rule proposed in #33 (comment). It would be nicer to use constant 16 for UTF16. This is why I suggest to solve this problem now.

How does it find how to be released if the type has no bf_releasebuffer?

PyBuffer_Release() simply decrefs view->obj.

I looked at your implementation, and I see that it can be made simpler. Create a bytes object, set it as view->obj, set view->buf and view->len to its data pointer and size.

BTW, several years ago @methane proposed a similar function for UTF-8 only. It was equivalent to PyUnicode_Export(unicode, PyUnicode_FORMAT_UTF8, &view).

Yet one suggestion: why not make the function simply returning the output format instead of using the output parameter?

PyAPI_FUNC(int) PyUnicode_Export(
    PyObject *unicode,
    int requested_formats,
    Py_buffer *view);

-1 means error, positive value -- the output format. This is how most of the C API is designed now. The only issue with existing C API is when -1 is the valid success value (like in PyLong_AsLong()), only in these cases we should consider alternative design.

@methane
Copy link

methane commented Sep 12, 2024

I don't like it. I think we should provide only PyUnicode_AsUTF8AndSize().

There are many good string libs that can handle UTF-8. PEP 393 based API forces people to write string lib that supports latin1/ucs-2/ucs-4. It is really bad.

For example, I had port markupsafe to PEP 393. It is really ugly. I regret I used PEP 393 instead of UTF-8. Now I believe benefit of PEP 393 based API (zero overhead) is not larger than benefit of UTF-8.
https://github.com/pallets/markupsafe/blob/main/src/markupsafe/_speedups.c

I hope future CPython uses UTF-8 as internal encoding too. Please don't design APIs that will become technical debt when changing internal representation of string.
faster-cpython/ideas#684

@encukou
Copy link
Collaborator

encukou commented Sep 12, 2024

But this breaks a rule proposed in #33 (comment).

That shouldn't be a hard rule, it's just a nice thing to have. (And it's mainly for reading debuggers, where you'd commonly see 0x10 rather than 16 anyway.)

I think we should provide only PyUnicode_AsUTF8AndSize().

That means that current CPython would need to copy data if the UTF-8 buffer isn't yet filled in. The purpose of this API is that projects can get zero-copy export -- and fall back gracefully if we change the internal representation.

Please don't design APIs that will become technical debt when changing internal representation of string.

IMO, changing the representation is exactly the time when we want flexible API that can handle both the old and the new representation.

Today, we have strings in PEP 393, with UTF-8 filled in on demand. Next, we might have strings in either PEP 393 or UTF-8, with the other on demand. Later, we might get to UTF-8 always with PEP 393 on demand.
But I don't think we can ever get rid of conversions to the PEP 393 representation.

@vstinner
Copy link
Author

@serhiy-storchaka:

I looked at your implementation, and I see that it can be made simpler. Create a bytes object, set it as view->obj, set view->buf and view->len to its data pointer and size.

Ok, I done that.

@serhiy-storchaka:

Yet one suggestion: why not make the function simply returning the output format instead of using the output parameter?

I modified my implementation to do that. I replaced unsigned uint32_t with signed int32_t. Is it what you requested?

@vstinner
Copy link
Author

@methane:

I don't like it. I think we should provide only PyUnicode_AsUTF8AndSize().

PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize() require to cache the string encoded to UTF-8 in the Unicode object. There were attempts to deprecate these functions, but it didn't happen since there was no good replacement and these functions are commonly used.

PyUnicode_Export() requires to call PyBuffer_Release() which allows to remove the cache later if we want that: the PyUnicode_Export(UTF8) function may or may not allocate memory, it becomes an implementation detail.

There are many good string libs that can handle UTF-8. PEP 393 based API forces people to write string lib that supports latin1/ucs-2/ucs-4. It is really bad.

PyUnicode_Export() doesn't force you to use UCS formats, it supports PyUnicode_FORMAT_UTF8. It's up to the caller to decide which formats are used by the code.

For example, I had port markupsafe to PEP 393. It is really ugly. I regret I used PEP 393 instead of UTF-8. Now I believe benefit of PEP 393 based API (zero overhead) is not larger than benefit of UTF-8.

markupsafe is the initial motivation to add this API :-) Someone should run benchmarks to compare UCS formats and UTF-8 on Python 3.14.

I suppose that UCS formats were chosen to have best performance on CPython.

I hope future CPython uses UTF-8 as internal encoding too. Please don't design APIs that will become technical debt when changing internal representation of string. faster-cpython/ideas#684

I have no opinion on such change, but if it happens, we would need a migration path: it should be possible to write code working on old and new Python versions.

For example, in my current implementation, UTF-8 has the lowest priority. If Python changes its internal storage to UTF-8, PyUnicode_Export() can be modified to make UTF-8 the first priority when the caller asks for UCS formats or UTF-8. For me, it's a nice way to expose the format "preferred" by Python without leaking too many implementation details.

@serhiy-storchaka
Copy link

Thank you @vstinner. I am fine with not adding support of UTF16 right now if it can be implemented in principle. You addressed also other my comments. But now the API is different from the one for which others voted, so this needs re-voting.

Yet one question we should consider -- how to handle strings containing surrogate characters? Currently UCS2 and UCS4 representations can contain them, but UTF8 fails. Naive PyPy implementation could return UTF8 containing surrogate characters, but fail to represent them as UCS2 or UCS4. In some cases it is nice that the resulting UTF8 is always valid, but there is some inconsistency.

Maybe add a flag to guarantee that the result does not contain lone surrogates? If it is set, UCS2 and UCS4 are checked for lone surrogates. If it is not set, encoding to UTF8 is performed with the "surrogatepass" error handler (if PyUnicode_AsUTF8AndSize() fails).

@vstinner
Copy link
Author

Maybe add a flag to guarantee that the result does not contain lone surrogates? If it is set, UCS2 and UCS4 are checked for lone surrogates. If it is not set, encoding to UTF8 is performed with the "surrogatepass" error handler (if PyUnicode_AsUTF8AndSize() fails).

Oh... That's a difficult question :-( By default, PyUnicode_Export() should not fail, it should export surrogates "somehow" (as if surrogatepass error handler is used). But for the consumer of the API, surrogates can be surprising and difficult to handle. So maybe having a flag would be nice.

@vstinner
Copy link
Author

@encukou would like the exported buffer to always end with a NULL character. But I'm trying to have implementation issues to respect this constraint. Since a Py_buffer has a size, do we really need to make sure that the buffer ends with NULL character? On the other side, embedded NULL characters are allowed which makes the trailing NULL character less useful/interesting.

@encukou
Copy link
Collaborator

encukou commented Sep 12, 2024

As long as the API is used from C, exported strings should be NUL-terminated for safety.
I'd prefer if people didn't rely on that and always used the length -- and we should definitely do that in CPython.
But, our internal buffers do have the terminating NUL, and in most cases we expose those, people will expect the NUL even if we'd explicitly document that it's not guaranteed.

This means that any Python implementation that doesn't store the NUL internally will need to allocate & copy for each export. Such an implementation can add a function like XPyUnicode_Export_Raw; if it becomes popular CPython can adopt it as an alias of PyUnicode_Export.

@vstinner
Copy link
Author

As long as the API is used from C, exported strings should be NUL-terminated for safety.

Which kind of safety? If you only rely on NULL to find the end of a string, you will truncate strings embedding null characters.

This means that any Python implementation that doesn't store the NUL internally will need to allocate & copy for each export.

In my current implementation, I have to copy characters (to add a a NULL character), because I'm using APIs which don't return a string with a trailing NULL characters. So the problem also exists in CPython.

@serhiy-storchaka
Copy link

We can add yet one flag to ensure that the result does not contain embedded null code units and is terminated by a null code unit.

UTF16 and UTF32 codecs can be modified to produce a bytes object that contains 2 or 4 zero bytes past the end.

@zooba
Copy link

zooba commented Sep 12, 2024

With the amount of proposed options and processing going on, I'm inclined to think the final API isn't going to meet the intended purpose.

If this turns into an O(N) function in any way (that is, the length of the string impacts runtime), then I'm opposed to it. I see the value in having a way to expose the internal representation directly, but any copying/validation/transcoding that is unique to this function (and not inherent to all strings all the time) makes this an overly complicated converter.

We've just had the same argument about integers. The pattern is exactly the same here as it was there. How about we just develop one pattern for native code to get access to native representation and use it consistently?

@vstinner
Copy link
Author

We can add yet one flag to ensure that the result does not contain embedded null code units and is terminated by a null code unit.

I would feel safer if we guarantee that there is always exactly one NULL character: at the end. It would be nice if it could not be the default, since scanning for embedded NULL characters has a complexity of O(n).

I can add a flag parameter with two flags:

  • PyUnicode_EXPORT_NUL_CHAR: reject embedded NULL character, write a NULL character at the end.
  • PyUnicode_EXPORT_NO_SURROGATE: reject surrogate characters.

@methane
Copy link

methane commented Sep 13, 2024

I have no opinion on such change, but if it happens, we would need a migration path: it should be possible to write code working on old and new Python versions.

For example, in my current implementation, UTF-8 has the lowest priority. If Python changes its internal storage to UTF-8, PyUnicode_Export() can be modified to make UTF-8 the first priority when the caller asks for UCS formats or UTF-8. For me, it's a nice way to expose the format "preferred" by Python without leaking too many implementation details.

Only CPython uses PEP 393. This API would force PyPy to implement latin1/UCS2/UCS4 exporter.
And if the change happens, this API is also technical debt on CPython.
Technical debts will be not only in CPython. Some libraries may adopt all of utf-8/latin1/UCS2/UCS4 only to win the benchmark game. After CPython change its implementation, all codes supports PEP 393 will become technical debt.

Supporting UTF-8/latin1/UCS2/UCS4 is not only 4x code, test cases, attack surfaces.
It is hard to test code under big endian.

That's why I don't like adding PEP 393 based API in limited API nor stable ABI.

@vstinner
Copy link
Author

Only CPython uses PEP 393. This API would force PyPy to implement latin1/UCS2/UCS4 exporter.

PyPy doesn't have to implement UCS formats in PyUnicode_Export(). It can return an error and only implement UTF8.

PyUnicode_Export() is more interesting when it's cheap, like a complexity of O(1). Otherwise, just call PyUnicode_AsUTF8() if you don't care of O(n) complexity (on CPython).

That's why I don't like adding PEP 393 based API in limited API nor stable ABI.

This API is a way for C extensions to adopt the limited C API. Currently, most C extensions use the regular C API which gives a full access to all PyUnicodeObject internals, way more than PyUnicode_Export().

Some libraries may adopt all of utf-8/latin1/UCS2/UCS4 only to win the benchmark game.

I'm not convinced that adding PyUnicode_Export() would be a trigger for that. Only few C extensions use the limited C API. So all extensions can already rely on PEP 393 internals, especially specialize code for the the 3 UCS formats.

For me, it's more the opposite: PyUnicode_Export() makes UTF-8 a first-class citizen and it should help C extensions to treat PyPy (UTF-8) the same way than CPython (UCS formats).


I ran a code search on PyPI top 7,500 projects (March 2024).

25 projects call PyUnicode_FromKindAndData():

  • Cython (3.0.9)
  • Levenshtein (0.25.0)
  • PyICU (2.12)
  • PyICU-binary (2.7.4)
  • PyQt5 (5.15.10)
  • PyQt6 (6.6.1)
  • aiocsv (1.3.1)
  • asyncpg (0.29.0)
  • biopython (1.83)
  • catboost (1.2.3)
  • cffi (1.16.0)
  • mojimoji (0.0.13)
  • mwparserfromhell (0.6.6)
  • numba (0.59.0)
  • numpy (1.26.4)
  • orjson (3.9.15)
  • pemja (0.4.1)
  • pyahocorasick (2.0.0)
  • pyjson5 (1.6.6)
  • rapidfuzz (3.6.2)
  • regex (2023.12.25)
  • srsly (2.4.8)
  • tokenizers (0.15.2)
  • ujson (5.9.0)
  • unicodedata2 (15.1.0)

21 projects call PyUnicode_2BYTE_DATA() and/or PyUnicode_4BYTE_DATA():

  • Cython (3.0.9)
  • MarkupSafe (2.1.5)
  • Nuitka (2.1.2)
  • PyICU (2.12)
  • PyICU-binary (2.7.4)
  • PyQt5_sip (12.13.0)
  • PyQt6_sip (13.6.0)
  • biopython (1.83)
  • catboost (1.2.3)
  • cement (3.0.10)
  • cffi (1.16.0)
  • duckdb (0.10.0)
  • mypy (1.9.0)
  • numpy (1.26.4)
  • orjson (3.9.15)
  • pemja (0.4.1)
  • pyahocorasick (2.0.0)
  • pyjson5 (1.6.6)
  • pyobjc-core (10.2)
  • sip (6.8.3)
  • wxPython (4.2.1)

So there are already many projects which rely on UCS formats.

@encukou
Copy link
Collaborator

encukou commented Sep 13, 2024

Which kind of safety? If you only rely on NULL to find the end of a string, you will truncate strings embedding null characters.

Users should not rely on the NUL. If they rely on it, they are doing things wrong. But, while wrong, in very many cases truncating on the first NUL is not a serious bug, and protecting against things like terminal escapes should be more important.

This wouldn't be much of an issue if the API returned non-NUL-terminated strings most of the time -- then normal tests would catch mistakes. But here, most calls will expose CPython's internal buffer, which is NUL-terminated, so some people will assume it's always the case.
The added NUL means this won't turn into a security issue.

Some libraries may adopt all of utf-8/latin1/UCS2/UCS4 only to win the benchmark game.

Yes, they will do that. Without this API, they need to reach into the internals for it.

@vstinner
Copy link
Author

This issues has now 54 comments which makes it difficult to read/navigate. I wrote a draft PEP 756 to summarize the discussion: python/peps#3960. Obviously, it's an opinionated PEP :-) For this first C API Working Group, I used PEP-Delegate: C API Working Group.

@vstinner
Copy link
Author

@encukou:

Users should not rely on the NUL. If they rely on it, they are doing things wrong.

In this case, I would prefer to recommend strongly to only rely on Py_buffer.len to get the string length.

The fact that the implementation provides a trailing NUL character or not should remain an implementation detail, rather than being part of the API.

What do you think?

Note: my current implementation provides a trailing NUL character in all PyUnicode_Export() cases.

@vstinner
Copy link
Author

vstinner commented Sep 14, 2024

I wrote PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions, it's now online! I announced the PEP on discuss.python.org as well.

There are open questions:

  • Should we guarantee that the exported buffer always ends with a NUL character? Is it possible to implement it in O(1) complexity in all Python implementations?
  • Is it ok to allow surrogate characters?
  • Should we add a flag to disallow embedded NUL characters? It would have an O(n) complexity.
  • Should we add a flag to disallow surrogate characters? It would have an O(n) complexity.

Well, the PEP already express my opinion: surrogates characters and embedded NUL characters are allowed in import and export. And I don't think that we should add a flag which cause an O(n) complexity.

@malemburg
Copy link

My take on these...

There are open questions:

  • Should we guarantee that the exported buffer always ends with a NUL character? Is it possible to implement it in O(1) complexity in all Python implementations?

Yes. Better safe than sorry.

  • Is it ok to allow surrogate characters?

Yes, since surrogates are first class Unicode code points.

  • Should we add a flag to disallow embedded NUL characters? It would have an O(n) complexity.

+0, this may be useful for some use cases. It should not be the default, though.

  • Should we add a flag to disallow surrogate characters? It would have an O(n) complexity.

+0, this may be useful for some use cases. It should not be the default, though.

@da-woods
Copy link

21 projects call PyUnicode_2BYTE_DATA() and/or PyUnicode_4BYTE_DATA():

Cython (3.0.9)

Cython provides declarations for users to manually call just about any C API functions they want. So be a bit careful here because it should come up with a match for almost every public function in the C API.

In this case I don't think we actively use these two functions ourselves.

@methane
Copy link

methane commented Sep 15, 2024

For me, it's more the opposite: PyUnicode_Export() makes UTF-8 a first-class citizen and it should help C extensions to treat PyPy (UTF-8) the same way than CPython (UCS formats).

Why do you think so?
PyUnicode_AsUTF8AndSize() makes UTF-8 first class already.

I think Python should always have UTF-8 and cache the PEP 393 representation for backward compatibility several years.

@vstinner vstinner changed the title Add PyUnicode_Export() and PyUnicode_Import() to the limited C API PEP 756: Add PyUnicode_Export() and PyUnicode_Import() to the limited C API Sep 23, 2024
@vstinner
Copy link
Author

vstinner commented Oct 1, 2024

This issue has a long history, the API changed multiple times, it became confused to follow it. So I created a new clean issue for PEP 756: #44. I close this issue.

@vstinner vstinner closed this as completed Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants