PEP 756: Add PyUnicode_Export() and PyUnicode_Import() to the limited C API #33
I would prefer to use
I'll note that the user must always call
I think we should not, unless there's a use case I don't see. I'm OK with a library using knowledge of CPython internals to “guess” that UCSn will be preferred to UTF-8 if all of UCS1-UCS4 are given. The price of guessing wrong is “only” a performance decrease. And to support PyPy's native format, you should add UTF-8 to the formats you support rather than look at |
Right, that's what the documentation says. It must always be called. I was just describing the implementation.
The idea of
PyPy would define it as:
But well, we can add it later if needed. |
I think it would be better to use a slightly different signature which separates error reporting from returning data. The buffer API (and many other Python C APIs) use the same approach.
The release function may also need an

I'm also not sure how PyUnicode_ReleaseExport() will deal with exports which do require copying data, e.g. if the Unicode object has a different kind than requested. In the PR, the function is basically a no-op in most cases, which looks wrong.

Update: Fixed a bug in the sig and clarified the comment wording a bit more. |
If the format doesn't match the Python str kind, a copy is needed. In this case, PyUnicode_ReleaseExport() releases the memory. Example:
|
+1 to this from me, but if it were a

What are the intended semantics when |
How do you express 5 different formats with a Py_buffer format string? There are no ASCII, UCS1, UCS2 or UTF-8 formats in the https://docs.python.org/dev/library/array.html table, only "w" for UCS4. |
Ok, I made this change. Previously, I hesitated to do it because the API looked "unnatural" and difficult to use. But now I prefer returning -1 on error or 0 on success. |
Use |
To clarify, I think my main concern is that, as proposed, this doesn't return a

I feel less strongly about whether the |
IMO, the
What failure mode did you have in mind? |
Ok, here is something else. What about adding a

```c
#define PyUnicode_FORMAT_UCS1 0x01
#define PyUnicode_FORMAT_UCS2 0x02
#define PyUnicode_FORMAT_UCS4 0x04
#define PyUnicode_FORMAT_UTF8 0x08

typedef struct PyUnicodeExport {
    PyObject *obj;
    uint32_t format;
    const void *data;
    Py_ssize_t nbytes;
} PyUnicodeExport;

// Get the content of a string in the requested format:
// - Set '*unicode_export' and return 0 on success.
// - Set an exception and return -1 on error.
//
// The export must be released by PyUnicode_ReleaseExport().
PyAPI_FUNC(int) PyUnicode_Export(
    PyObject *unicode,
    uint32_t requested_formats,
    PyUnicodeExport *unicode_export);

// Release an export created by PyUnicode_Export().
PyAPI_FUNC(void) PyUnicode_ReleaseExport(
    PyUnicodeExport *unicode_export);

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_Import(
    const void *data,
    Py_ssize_t nbytes,
    uint32_t format);
```

UPDATE: I removed

UPDATE: I renamed export to unicode_export, since the name export causes compilation errors on Windows (MSC compiler).

Example of usage:

```c
PyUnicodeExport unicode_export;
if (PyUnicode_Export(obj, requested_formats, &unicode_export) < 0) {
    // handle error
}
// ... use unicode_export ...
PyUnicode_ReleaseExport(&unicode_export);
```
|
Looks good. I assume obj will be set to unicode with an incremented refcount, right? And the release function will then take care of the DECREF? I still think that allowing the release function to return an int error code is important. It may be needed in the future, when releasing does more than a DECREF and free(). |
Yes,
I don't see what the caller would do with an error from PyUnicode_ReleaseExport(). I prefer to log it directly as a warning or an unraisable exception inside PyUnicode_ReleaseExport(). Usually, the Python C API doesn't let a "free" function report errors. Py_DECREF() is a good example: it can fail, and yet it returns void. |
Along the same lines as @mdboom's concern, when

I suspect a similar split-level API where a

(The non-PyObject C struct should also have a snake case name along the same lines as

As far as convenience formats go, I do think there's one that would be worth defining:

```c
#define PyUnicode_FORMAT_UCS \
    (PyUnicode_FORMAT_UCS1 \
     | PyUnicode_FORMAT_UCS2 \
     | PyUnicode_FORMAT_UCS4)
```

That way API clients can easily request an export that uses the smallest fixed-width format that can handle every code point in the string to be exported.

Edit: while I don't think alternative string types are common enough to be worth defining a magic method protocol up front, I do wonder if it might be worth choosing a C API naming scheme that leaves open the possibility of adding one later. Specifically, using the

```c
#define PyText_FORMAT_UCS1 0x01
#define PyText_FORMAT_UCS2 0x02
#define PyText_FORMAT_UCS4 0x04
#define PyText_FORMAT_UTF8 0x08
#define PyText_FORMAT_UCS (PyText_FORMAT_UCS1 | PyText_FORMAT_UCS2 | PyText_FORMAT_UCS4)

typedef struct Py_text_export {
    PyObject *obj;
    uint32_t format;
    const void *data;
    Py_ssize_t nbytes;
} Py_text_export;

// Get the content of a string in the requested format:
// - Set '*text' and return 0 on success.
// - Set an exception and return -1 on error.
//
// API signature intentionally similar to PyObject_GetBuffer().
// The export must be released by PyText_ReleaseExport().
// "Export" verb used to indicate data copying may be required, but will be skipped if possible.
PyAPI_FUNC(int) PyText_Export(PyObject *exporter, Py_text_export *text, uint32_t requested_formats);

// Release an export created by PyText_Export().
// API signature intentionally similar to PyBuffer_Release().
PyAPI_FUNC(void) PyText_ReleaseExport(Py_text_export *text);

// And then, in direct analogy to PyMemoryView, a helper API to manage text export lifecycles:
// - text export objects would be pure data holders with no behaviour
// - exports have to be imported as Unicode objects to perform string operations
PyAPI_FUNC(PyObject *) PyTextExport_FromObject(PyObject *exporter, uint32_t requested_formats);
PyAPI_FUNC(PyObject *) PyTextExport_FromMemory(const void *data, Py_ssize_t nbytes, uint32_t format);
PyAPI_FUNC(PyObject *) PyTextExport_FromText(Py_text_export *text);
PyAPI_FUNC(Py_text_export *) PyTextExport_GetText(PyObject *export);

// PyObject_Str() would do the right thing with PyTextExport instances, so new
// APIs would only be needed for creating strings from the text export layouts.

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
// API signature intentionally similar to PyMemoryView_FromMemory().
// "Import" verb used to indicate that data is copied rather than referenced.
PyAPI_FUNC(PyObject*) PyUnicode_Import(const void *data, Py_ssize_t nbytes, uint32_t format);

// API signature intentionally similar to PyMemoryView_FromBuffer().
// "Import" verb used to indicate that data is copied rather than referenced.
PyAPI_FUNC(PyObject*) PyUnicode_ImportText(Py_text_export *text);
```
|
I hate to be the broken record here, but... we already have an export struct that can be wrapped in a memoryview.

```c
int PyUnicode_GetBufferFormat(const Py_buffer *buf, uint32_t *result);
// If `buf` is a buffer exported from a str (PyUnicode) object,
// set `*result` to the corresponding `PyUnicode_FORMAT_*` value,
// and return 0.
// Otherwise, or on another error, set `*result` to 0 and
// return -1 with an exception set.
```

Internally, the format would be stored in

But in many cases this wouldn't be needed: if you ask for UTF-8 only, you already know the format, and if you ask for any combination of |
Ok, here is something completely different :-) I rewrote the API based on

I wrote a new PR for that: python/cpython#123738

I also updated the API in the first comment of this issue.

@mdboom @encukou @zooba: Is it closer to what you expected?

It's not a

The

I used formats
|
Hmm, unless I'm missing something, this removes the possibility to import/export UTF-8 data (which is a variable-length encoding, so there is no fixed

Given that there has been a lot of talk about eventually moving to UTF-8 as a new single native format for CPython, I'd like to see some way to support zero-copy for this future implementation approach. You had this in your original proposal.

This could be done by defining a new "U" format character specifically for these APIs. |
Ah, for |
A side effect of this change is to add the

The buffer protocol is quite safe: it has multiple checks before calling

(See

In short, you're only allowed to call

Note: I also had to implement

If we don't want to add

IMO it's ok to add |
Does implementing the buffer protocol like this mean that

Otherwise, I like the idea. But getting access to low-level details really ought to require a C API, and since we seem to be adding more and more of these, it'd be nice to maximise consistency (either across the new APIs, or with whatever exists, like |
No. I didn't implement the "get buffer" slot (

For now, only the C API, |
Vote to add this API: |
Did we discuss whether |
I added back the format parameter to |
Right. For example, if you request UCS1 and the string uses UCS2, you get an error: the string cannot be exported as UCS1. |
Yeah, I don't really see a use, unless someone was to pass you the buffer and you know it's a PyUnicode but you don't know what it is... there's no need to encourage that. Nobody is using this API yet, so if they start, they can also pass the format. I voted for the addition with your changes. |
I only want to check that the API is extensible in this direction. What could be a mnemonic value for such a constant if 16 is already taken for ASCII?
Just an additional bit of information. I currently have no use for it. It can be obtained by checking
It is not needed. |
The next available constant value is
I don't know |
For example, if you request UCS2 and the string is UCS1, some memory is allocated, and |
But this breaks a rule proposed in #33 (comment). It would be nicer to use the constant 16 for UTF16. This is why I suggest solving this problem now.
I looked at your implementation, and I see that it can be made simpler. Create a bytes object, set it as

BTW, several years ago @methane proposed a similar function for UTF-8 only. It was equivalent to

One more suggestion: why not make the function simply return the output format instead of using an output parameter?

```c
PyAPI_FUNC(int) PyUnicode_Export(
    PyObject *unicode,
    int requested_formats,
    Py_buffer *view);
```

-1 means error, a positive value means the output format. This is how most of the C API is designed now. The only issue with the existing C API is when -1 is a valid success value (like in |
I don't like it. I think we should provide only

There are many good string libs that can handle UTF-8. A PEP 393 based API forces people to write string libs that support latin1/UCS-2/UCS-4. That is really bad. For example, I had to port markupsafe to PEP 393. It is really ugly. I regret that I used PEP 393 instead of UTF-8.

Now I believe the benefit of a PEP 393 based API (zero overhead) is not larger than the benefit of UTF-8. I hope future CPython uses UTF-8 as its internal encoding too. Please don't design APIs that will become technical debt when the internal representation of strings changes. |
That shouldn't be a hard rule, it's just a nice thing to have. (And it's mainly for reading debuggers, where you'd commonly see 0x10 rather than 16 anyway.)
That means that current CPython would need to copy data if the UTF-8 buffer isn't yet filled in. The purpose of this API is that projects can get zero-copy export -- and fall back gracefully if we change the internal representation.
IMO, changing the representation is exactly the time when we want flexible API that can handle both the old and the new representation. Today, we have strings in PEP 393, with UTF-8 filled in on demand. Next, we might have strings in either PEP 393 or UTF-8, with the other on demand. Later, we might get to UTF-8 always with PEP 393 on demand. |
Ok, I've done that.
I modified my implementation to do that. I replaced unsigned |
markupsafe is the initial motivation to add this API :-) Someone should run benchmarks to compare UCS formats and UTF-8 on Python 3.14. I suppose that the UCS formats were chosen to get the best performance on CPython.
I have no opinion on such change, but if it happens, we would need a migration path: it should be possible to write code working on old and new Python versions. For example, in my current implementation, UTF-8 has the lowest priority. If Python changes its internal storage to UTF-8, PyUnicode_Export() can be modified to make UTF-8 the first priority when the caller asks for UCS formats or UTF-8. For me, it's a nice way to expose the format "preferred" by Python without leaking too many implementation details. |
Thank you @vstinner. I am fine with not adding support for UTF16 right now, as long as it can be implemented in principle. You also addressed my other comments. But now the API is different from the one for which others voted, so this needs re-voting.

One more question we should consider: how to handle strings containing surrogate characters? Currently, the UCS2 and UCS4 representations can contain them, but UTF8 fails. A naive PyPy implementation could return UTF8 containing surrogate characters, but fail to represent them as UCS2 or UCS4. In some cases it is nice that the resulting UTF8 is always valid, but there is some inconsistency. Maybe add a flag to guarantee that the result does not contain lone surrogates? If it is set, UCS2 and UCS4 are checked for lone surrogates. If it is not set, encoding to UTF8 is performed with the "surrogatepass" error handler (if
Oh... That's a difficult question :-( By default, PyUnicode_Export() should not fail, it should export surrogates "somehow" (as if surrogatepass error handler is used). But for the consumer of the API, surrogates can be surprising and difficult to handle. So maybe having a flag would be nice. |
@encukou would like the exported buffer to always end with a NUL character. But I'm running into implementation issues trying to respect this constraint. Since a
As long as the API is used from C, exported strings should be NUL-terminated for safety. This means that any Python implementation that doesn't store the NUL internally will need to allocate & copy for each export. Such an implementation can add a function like |
Which kind of safety? If you only rely on the NUL to find the end of a string, you will truncate strings embedding NUL characters.
In my current implementation, I have to copy characters (to add a NUL character), because I'm using APIs which don't return a string with a trailing NUL character. So the problem also exists in CPython.
We can add yet another flag to ensure that the result does not contain embedded null code units and is terminated by a null code unit. The UTF16 and UTF32 codecs can be modified to produce a bytes object that contains 2 or 4 zero bytes past the end. |
With the amount of proposed options and processing going on, I'm inclined to think the final API isn't going to meet the intended purpose. If this turns into an O(N) function in any way (that is, the length of the string impacts runtime), then I'm opposed to it. I see the value in having a way to expose the internal representation directly, but any copying/validation/transcoding that is unique to this function (and not inherent to all strings all the time) makes this an overly complicated converter. We've just had the same argument about integers. The pattern is exactly the same here as it was there. How about we just develop one pattern for native code to get access to native representation and use it consistently? |
I would feel safer if we guaranteed that there is always exactly one NUL character: at the end. It would be nice, but it should not be the default, since scanning for embedded NUL characters has a complexity of O(n). I can add a flag parameter with two flags:
|
Only CPython uses PEP 393. This API would force PyPy to implement a latin1/UCS2/UCS4 exporter. Supporting UTF-8/latin1/UCS2/UCS4 is not only 4x the code; it is also 4x the test cases and attack surface. That's why I don't like adding a PEP 393 based API to the limited API or the stable ABI. |
PyPy doesn't have to implement the UCS formats in PyUnicode_Export(). It can return an error and only implement UTF8. PyUnicode_Export() is more interesting when it's cheap, with a complexity of O(1). Otherwise, just call PyUnicode_AsUTF8() if you don't care about O(n) complexity (on CPython).
This API is a way for C extensions to adopt the limited C API. Currently, most C extensions use the regular C API which gives a full access to all PyUnicodeObject internals, way more than PyUnicode_Export().
I'm not convinced that adding PyUnicode_Export() would be a trigger for that. Only a few C extensions use the limited C API, so all extensions can already rely on PEP 393 internals, and especially specialize code for the 3 UCS formats. For me, it's more the opposite: PyUnicode_Export() makes UTF-8 a first-class citizen, and it should help C extensions to treat PyPy (UTF-8) the same way as CPython (UCS formats). I ran a code search on the PyPI top 7,500 projects (March 2024). 25 projects call
21 projects call
So there are already many projects which rely on UCS formats. |
Users should not rely on the NUL. If they rely on it, they are doing things wrong. But, while wrong, in very many cases truncating on the first NUL is not a serious bug, and protecting against things like terminal escapes should be more important. This wouldn't be much of an issue if the API returned non-NUL-terminated strings most of the time -- then normal tests would catch mistakes. But here, most calls will expose CPython's internal buffer, which is NUL-terminated, so some people will assume it's always the case.
Yes, they will do that. Without this API, they need to reach into the internals for it. |
This issue now has 54 comments, which makes it difficult to read and navigate. I wrote a draft PEP 756 to summarize the discussion: python/peps#3960. Obviously, it's an opinionated PEP :-) For this first C API Working Group, I used
In this case, I would prefer to strongly recommend relying only on

Whether or not the implementation provides a trailing NUL character should remain an implementation detail, rather than being part of the API. What do you think?

Note: my current implementation provides a trailing NUL character in all |
I wrote PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions, it's now online! I announced the PEP on discuss.python.org as well. There are open questions:
Well, the PEP already expresses my opinion: surrogate characters and embedded NUL characters are allowed in import and export. And I don't think that we should add a flag which causes O(n) complexity. |
My take on these...
Yes. Better safe than sorry.
Yes, since surrogates are first class Unicode code points.
+0, this may be useful for some use cases. It should not be the default, though.
+0, this may be useful for some use cases. It should not be the default, though. |
Cython provides declarations for users to manually call just about any C API functions they want. So be a bit careful here because it should come up with a match for almost every public function in the C API. In this case I don't think we actively use these two functions ourselves. |
Why do you think so? I think Python should always have UTF-8, and cache the PEP 393 representation for backward compatibility for several years. |
This issue has a long history and the API changed multiple times, so it became confusing to follow. So I created a new, clean issue for PEP 756: #44. I am closing this issue. |
See PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions.