-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vote on PEP 756: Add PyUnicode_Export() and PyUnicode_Import() to the limited C API #44
Comments
I'm not voting yet, but wanted to add that I believe the design of this API is the best it can be without going too far into YAGNI territory. My hesitation is around the need for this to be in the limited API. From the PEP (emphasis mine, and it's referring to the current state of things before this PEP):
The high-level question is should the limited API be expected to provide best performance? MarkupSafe could easily use a slower method (converting via UTF-8) and use the limited API, or they could remain with the regular API and use faster methods (along with tracking version changes). Historically, we've considered that a fair trade-off. Even though we've got an API design that I believe will suit our stable ABI needs, I'm not convinced that we necessarily want to turn the limited API into the "for all purposes API". I'm already of the opinion that we change (mostly increase) it too often, and thereby devalue it as a compatibility layer. This API takes us further in that direction, and quite explicitly. FTR, I'd accept this API in an instant for the regular API, and am glad that it can be stable over multiple releases even if we innovate. We ought to design all our APIs for long-term stability, even if they aren't candidates for the limited API. (Also FTR, I'll say basically the same thing about the integer import/export API.) |
(And this is a different objection from the initialization APIs that didn't go into the limited API. Those I am genuinely worried about change, particularly to their semantics; these ones I'm not worried about change. I'm worried about us not having a clear idea what the point of the limited API vs. regular API vs. unstable API is.) |
And it might go without saying, but just to be sure: @vstinner I greatly appreciate the effort you've put in on this proposal, and how thoroughly you've worked on trying out the alternatives and handling the discussion. Thank you for that. |
For a bit of PEP process: This sentence looks like an endorsement. Generally, if you put something like this in a PEP, you should get the project maintainer's confirmation that the new API will help the project. Taking into account what @zooba said, I wonder: how fast will CPython switch to UTF-8 (or, WTF-8) for the “main” internal storage?
In my personal opinion: if it reduces a O(n) operation to O(1), I would put it in the limited API. If it just adds a constant factor, I wouldn't. |
As I understand this, exposing in the limited C API is the only reason to add these functions. If you do not use the limited C API, you can use existing C API like There is no such caveat about adding |
It has a C extension to make it faster than the pure Python fallback. I'm not sure that they would like to have a slower C extension just to use the limited C API.
If there is a way to provide a fast implementation in the limited C API, I don't see why we should not provide it.
The origin of PEP 756 is a request by @davidism during the Pycon US sprint. He was concerned about the big number of wheel packages that he has to produce for a single release, and would like to use the stable ABI to reduce that number. I exchanged with David Lord on the first version of my PR: python/cpython#119610 (comment) Later, I was also also motivated by https://peps.python.org/pep-0756/#usage-of-pep-393-c-apis : provide a better API for these users.
IMO faster-cpython/ideas#684 underestimates the cost of breaking the C API to switch to UTF-8. Too many APIs rely on the PEP 393 (compact strings) design and these APIs are widely used. I would not be easy to provide a backward compatibility. We should provide better APIs hiding implementation details before we can consider to even think about moving to UTF-8. PEP 756 is designed with that backward compatibility in mind. In the worst case, an export can be implemented with a copy. It might be better than failing. But such decision can be taken once we would seriously consider switching to UTF-8.
It doesn't stop working. It will export as UTF-8 rather than UCS formats, still in O(1) complexity. Currently, there is no simple way to ask Python "give me what you have in the most efficient format". IMO such export API is needed to have a smoother migration path from UCS formats to UTF-8: to better hide implementation details. When I started to work on the C API, I contacted PyPy developers. One of their main requests was to remove PEP 393 C APIs such as PyUnicode_KIND() and PyUnicode_1BYTE_DATA(). PyPy cannot implement these APIs in an efficient way since PyPy uses UTF-8 internally. PEP 756 is a little step towards this goal. |
That was the original intent, but the current PEP suggests that the export should fail if it wouldn't be O(1). Which means users should expect failure and fall back to
There should probably be a FORMAT for UTF8-with-surrogatepass, or use that instead of UTF-8, so that in this possible future the API can still export all strings. |
It's a pretty weak endorsement :) This one doesn't bother me at all, we have to say something about any example we give, and this sentence says enough to explain why the API would matter. Notably, it doesn't say what you might use the library for, which makes it a pretty terrible endorsement.
I'm not sure that we would like to restrict our ability to innovate just so they can use the limited C API. There are tradeoffs both ways here, and their fallback is to build per-version extension modules (as they currently do). Our fallback doesn't exist - we can only preserve backwards compatibility or break it for millions of users. So if we have to inconvenience our extension developers in order to protect all of our users, I'm prepared to make that call. (Part of why I opposed our frequent releases is because I knew it'd cause extension authors to have to build so many more extensions. But the argument for frequent releases was to enable frequent changes, so I want to do the best I can to allow the frequent changes to continue, rather than forcing us into the slower update cadence I originally asked for.)
I don't think there's any harm in defining the UTF-8 format loosely like that. In fact I'm sure we discussed it at one point (basically, it's UTF-8 encoding, so you account for multibyte characters that way, but otherwise say nothing about the string content). If "-with-surrogatepass" is the easiest way to capture that, then sure. |
Hi, I'm the one who asked for something like this for MarkupSafe. I am struggling with the number of wheels required to build, which is a slow process and different than every other package I maintain. I do not understand how to write C very well. I'd have to rely on someone else to make a PR to take advantage of this new API, as I don't think I would know what to do. (This is probably true of any form it takes, not a judgement on this specific proposal.) The extension was written before I became a maintainer, then modified by @methane at some point to be compatible with 2 and 3. I recently refactored it to reduce the amount of code, but it still relies on non-ABI-3 APIs. Speed is important for this extension, because it can be called many many times during a single template render by Jinja. But maintainability is also important, as it's basically not changing at this point so I only visit it once or twice a year to prepare a new release for a new Python. Being able to use abi3 wheels would remove that need, but I'd prefer not to have to sacrifice speed for it. All that said, I've also been exploring a rewrite in Rust using PyO3. A contributor created an implementation that is somewhat close to the C speed. I'm not sure if it's abi3 compatibly, but I think it is. Regardless, I'd still want to take advantage of abi3 wheels, presumably Py03 would expose this new API to allow that. |
For what it's worth, my understanding is that Rust strings are required to be valid UTF8, so if the export format here didn't guarantee valid UTF8 code points then I think any PyO3-based rust wrapper might have to consider copying and running replacement on the surrogate codepoints. That's potentially a reason to keep UTF8-with-surrogatepass a separate explicit format. |
Is there a practical difference between a If the "Export" format disallowed surrogates, for a string containing them it would be as if the format did not exist. In that scenario, you need to use a fallback function (there's no way to tell the difference between "UTF-8 isn't available already but (say) UCS-2 is" and "UTF-8 was available but it wasn't valid"), and the likely fallback is also going to disallow surrogates. To get them I believe you'll need to go via |
Ah, that's a good point. I had overlooked the case that this API doesn't really allow for good separation of "format unsupported" versus "contents are not representable in this format" If the Rust caller knew that the contents did not contain invalid code points, then this API would allow cheap zero-copy consumption from Rust as a string slice (
@davidism I took a quick peek at the PR on markupsafe and it looks like it should be fine to just enable PyO3's abi3 opt-in and it'll work for you. But you couldn't use this API and build a single abi3 wheel (you'd be supporting 3.14+ only); more realistic would be to build two sets of abi3 wheels, one for your minimum version and one for 3.14+. |
I studied the code recently; see What do you think of deprecating C APIs to modify immutable strings? discussion. There are a lot of code in the stdlib extensions and way more in 3rd party which rely on PEP 393 internals. If we decide to switch to UTF-8, IMO the C API will be a big blocker issue and I don't expect that it can be done in a single Python release (in 3.15). |
Vote on PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions.
Since I wrote PEP 756, I abstain from voting on my own PEP :-)
I closed the previous issue since it had a long history and the API changed multiple times.
See also @methane's messages, @methane is strongly against PEP 756:
The text was updated successfully, but these errors were encountered: