Vote on PEP 756: Add PyUnicode_Export() and PyUnicode_Import() to the limited C API #44

vstinner opened this issue Oct 1, 2024 · 13 comments

vstinner commented Oct 1, 2024

Vote on PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions.

Since I wrote PEP 756, I abstain from voting on my own PEP :-)

I closed the previous issue since it had a long history and the API changed multiple times.

See also @methane's messages; @methane is strongly against PEP 756.

zooba commented Oct 1, 2024

I'm not voting yet, but wanted to add that I believe the design of this API is the best it can be without going too far into YAGNI territory.

My hesitation is around the need for this to be in the limited API. From the PEP (emphasis mine, and it's referring to the current state of things before this PEP):

For example, the MarkupSafe project has a C extension specialized for UCS formats for best performance, and so cannot use the limited C API.

The high-level question is: should the limited API be expected to provide best performance?

MarkupSafe could easily use a slower method (converting via UTF-8) and use the limited API, or they could remain with the regular API and use faster methods (along with tracking version changes). Historically, we've considered that a fair trade-off.
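
To make the trade-off concrete, here is a minimal hypothetical sketch (the helper names are made up; this is not MarkupSafe's actual code) of the two approaches: direct UCS access via PyUnicode_KIND()/PyUnicode_DATA(), which is unavailable when Py_LIMITED_API is defined, versus going through PyUnicode_AsUTF8AndSize(), which the limited API offers but which may have to encode and cache a UTF-8 copy first.

```c
#include <Python.h>

/* Fast path: read the string's native UCS representation directly.
   These macros are not available under Py_LIMITED_API. */
static Py_ssize_t
count_lt_fast(PyObject *str)
{
    int kind = PyUnicode_KIND(str);              /* 1-, 2- or 4-byte units */
    const void *data = PyUnicode_DATA(str);
    Py_ssize_t len = PyUnicode_GET_LENGTH(str);
    Py_ssize_t n = 0;
    for (Py_ssize_t i = 0; i < len; i++) {
        if (PyUnicode_READ(kind, data, i) == '<') {
            n++;
        }
    }
    return n;
}

/* Limited-API alternative: go through UTF-8, which may require the
   interpreter to encode and cache a copy of the string. */
static Py_ssize_t
count_lt_limited(PyObject *str)
{
    Py_ssize_t size;
    const char *utf8 = PyUnicode_AsUTF8AndSize(str, &size);
    if (utf8 == NULL) {
        return -1;                               /* error, exception set */
    }
    Py_ssize_t n = 0;
    for (Py_ssize_t i = 0; i < size; i++) {
        if (utf8[i] == '<') {
            n++;
        }
    }
    return n;
}
```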

Even though we've got an API design that I believe will suit our stable ABI needs, I'm not convinced that we necessarily want to turn the limited API into the "for all purposes API". I'm already of the opinion that we change (mostly increase) it too often, and thereby devalue it as a compatibility layer. This API takes us further in that direction, and quite explicitly.

FTR, I'd accept this API in an instant for the regular API, and am glad that it can be stable over multiple releases even if we innovate. We ought to design all our APIs for long-term stability, even if they aren't candidates for the limited API.

(Also FTR, I'll say basically the same thing about the integer import/export API.)

zooba commented Oct 1, 2024

(And this is a different objection from the initialization APIs that didn't go into the limited API. Those I am genuinely worried about change, particularly to their semantics; these ones I'm not worried about change. I'm worried about us not having a clear idea what the point of the limited API vs. regular API vs. unstable API is.)

zooba commented Oct 1, 2024

And it might go without saying, but just to be sure: @vstinner I greatly appreciate the effort you've put in on this proposal, and how thoroughly you've worked on trying out the alternatives and handling the discussion. Thank you for that.

encukou commented Oct 2, 2024

For example, the MarkupSafe project has a C extension specialized for UCS formats for best performance, and so cannot use the limited C API.

For a bit of PEP process: This sentence looks like an endorsement. Generally, if you put something like this in a PEP, you should get the project maintainer's confirmation that the new API will help the project.
In this case, since Inada-san doesn't like the API at all, you should ask another MarkupSafe-related person, use a different example, or explicitly mention that MarkupSafe won't necessarily switch to the new API.


Taking into account what @zooba said, I wonder: how fast will CPython switch to UTF-8 (or, WTF-8) for the “main” internal storage?
If we do that in 3.15, we'd be adding stable API that would stop working after a single release. That does sound wasteful.


should the limited API be expected to provide best performance?

In my personal opinion: if it reduces an O(n) operation to O(1), I would put it in the limited API. If it just adds a constant factor, I wouldn't.
The caveat being that the limited API's performance isn't guaranteed (old limited API can get slower over time; users do need to benchmark, and potentially change code, for every release to get the best performance).

@serhiy-storchaka

As I understand it, exposure in the limited C API is the only reason to add these functions. If you do not use the limited C API, you can use existing C APIs like PyUnicode_KIND(), PyUnicode_DATA(), etc. There is no way to add PyUnicode_KIND() and PyUnicode_DATA() to the limited C API, so new functions are needed for this.

There is no such obstacle to adding PyUnicode_FromKindAndData() to the limited C API; that would eliminate the need for the import function. But it would not be symmetric with respect to the proposed export function.
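
For reference, a minimal sketch of the existing (non-limited) constructor mentioned here; the wrapper name is made up for illustration.

```c
#include <Python.h>

/* Hypothetical wrapper: build a str from a caller-owned UCS2 buffer
   using the existing non-limited PyUnicode_FromKindAndData() API. */
static PyObject *
str_from_ucs2(const Py_UCS2 *buf, Py_ssize_t len)
{
    return PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND, buf, len);
}
```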

vstinner commented Oct 2, 2024

@zooba:

MarkupSafe could easily use a slower method

It has a C extension to make it faster than the pure Python fallback. I'm not sure that they would like to have a slower C extension just to use the limited C API.

@zooba:

Historically, we've considered that a fair trade-off.

If there is a way to provide a fast implementation in the limited C API, I don't see why we should not provide it.

@encukou:

For a bit of PEP process: This sentence looks like an endorsement. Generally, if you put something like this in a PEP, you should get the project maintainer's confirmation that the new API will help the project.

The origin of PEP 756 is a request by @davidism during the PyCon US sprint. He was concerned about the large number of wheel packages that he has to produce for a single release, and would like to use the stable ABI to reduce that number.

I discussed the first version of my PR with David Lord: python/cpython#119610 (comment)

Later, I was also motivated by https://peps.python.org/pep-0756/#usage-of-pep-393-c-apis : providing a better API for these users.

@encukou:

Taking into account what @zooba said, I wonder: how fast will CPython switch to UTF-8 (or, WTF-8) for the “main” internal storage?

IMO faster-cpython/ideas#684 underestimates the cost of breaking the C API to switch to UTF-8. Too many APIs rely on the PEP 393 (compact strings) design, and these APIs are widely used. It would not be easy to provide backward compatibility. We should provide better APIs that hide implementation details before we can even think about moving to UTF-8.

PEP 756 is designed with that backward compatibility in mind. In the worst case, an export can be implemented with a copy. That might be better than failing. But such a decision can be made once we seriously consider switching to UTF-8.

@encukou:

If we do that in 3.15, we'd be adding stable API that would stop working after a single release. That does sound wasteful.

It doesn't stop working. It will export as UTF-8 rather than UCS formats, still in O(1) complexity. Currently, there is no simple way to ask Python "give me what you have in the most efficient format".

IMO such an export API is needed to have a smoother migration path from UCS formats to UTF-8: to better hide implementation details.
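
To illustrate "give me what you have", here is a rough sketch of how a caller might use the proposed export function. The format constants (PyUnicode_FORMAT_UCS1/UCS2/UCS4) and the return/error conventions follow the PEP 756 draft as discussed here and may differ from whatever is finally accepted.

```c
#include <Python.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch only: request whichever UCS format the interpreter already
   holds. Names and conventions are taken from the PEP 756 draft. */
static int
dump_native_format(PyObject *str)
{
    Py_buffer view;
    int32_t fmt = PyUnicode_Export(str,
                                   PyUnicode_FORMAT_UCS1
                                   | PyUnicode_FORMAT_UCS2
                                   | PyUnicode_FORMAT_UCS4,
                                   &view);
    if (fmt < 0) {
        return -1;               /* error or no requested format available */
    }
    printf("exported %zd bytes in format %d\n", view.len, (int)fmt);
    PyBuffer_Release(&view);     /* the export is exposed as a Py_buffer */
    return 0;
}
```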


When I started to work on the C API, I contacted PyPy developers. One of their main requests was to remove PEP 393 C APIs such as PyUnicode_KIND() and PyUnicode_1BYTE_DATA(). PyPy cannot implement these APIs in an efficient way since PyPy uses UTF-8 internally. PEP 756 is a little step towards this goal.

encukou commented Oct 2, 2024

PEP 756 is designed with that backward compatibility in mind. In the worst case, an export can be implemented with a copy.

That was the original intent, but the current PEP suggests that the export should fail if it wouldn't be O(1). Which means users should expect failure and fall back to PyUnicode_AsUTF8AndSize, and if they do that, there's no point in the export API doing a copy.
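
A short sketch of that fallback pattern, again assuming the draft's PyUnicode_Export() signature and PyUnicode_FORMAT_UTF8 constant; PyUnicode_AsUTF8AndSize() is the existing limited-API call.

```c
#include <Python.h>
#include <stdint.h>

/* Sketch: try the O(1) UTF-8 export first; otherwise fall back to the
   existing call, which may encode and cache a UTF-8 copy. If *view was
   filled, the caller must release it with PyBuffer_Release(). */
static const char *
utf8_view_or_copy(PyObject *str, Py_buffer *view, Py_ssize_t *size)
{
    int32_t fmt = PyUnicode_Export(str, PyUnicode_FORMAT_UTF8, view);
    if (fmt >= 0) {
        *size = view->len;
        return (const char *)view->buf;
    }
    PyErr_Clear();               /* UTF-8 export not available for this string */
    view->obj = NULL;            /* nothing to release on this path */
    return PyUnicode_AsUTF8AndSize(str, size);
}
```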

It doesn't stop working. It will export as UTF-8 rather than UCS formats, still in O(1) complexity.

There should probably be a FORMAT for UTF8-with-surrogatepass, or use that instead of UTF-8, so that in this possible future the API can still export all strings.
How does PyPy handle these, anyway?

zooba commented Oct 2, 2024

For a bit of PEP process: This sentence looks like an endorsement.

It's a pretty weak endorsement :) This one doesn't bother me at all; we have to say something about any example we give, and this sentence says enough to explain why the API would matter. Notably, it doesn't say what you might use the library for, which makes it a pretty terrible endorsement.

I'm not sure that they would like to have a slower C extension just to use the limited C API.

I'm not sure that we would like to restrict our ability to innovate just so they can use the limited C API. There are tradeoffs both ways here, and their fallback is to build per-version extension modules (as they currently do). Our fallback doesn't exist - we can only preserve backwards compatibility or break it for millions of users.

So if we have to inconvenience our extension developers in order to protect all of our users, I'm prepared to make that call. (Part of why I opposed our frequent releases is because I knew it'd cause extension authors to have to build so many more extensions. But the argument for frequent releases was to enable frequent changes, so I want to do the best I can to allow the frequent changes to continue, rather than forcing us into the slower update cadence I originally asked for.)

There should probably be a FORMAT for UTF8-with-surrogatepass, or use that instead of UTF-8

I don't think there's any harm in defining the UTF-8 format loosely like that. In fact I'm sure we discussed it at one point (basically, it's UTF-8 encoding, so you account for multibyte characters that way, but otherwise say nothing about the string content). If "-with-surrogatepass" is the easiest way to capture that, then sure.

davidism commented Oct 2, 2024

Hi, I'm the one who asked for something like this for MarkupSafe. I am struggling with the number of wheels required to build, which is a slow process and different from every other package I maintain.

I do not understand how to write C very well. I'd have to rely on someone else to make a PR to take advantage of this new API, as I don't think I would know what to do. (This is probably true of any form it takes, not a judgement on this specific proposal.) The extension was written before I became a maintainer, then modified by @methane at some point to be compatible with 2 and 3. I recently refactored it to reduce the amount of code, but it still relies on non-ABI-3 APIs.

Speed is important for this extension, because it can be called many many times during a single template render by Jinja. But maintainability is also important, as it's basically not changing at this point so I only visit it once or twice a year to prepare a new release for a new Python. Being able to use abi3 wheels would remove that need, but I'd prefer not to have to sacrifice speed for it.

All that said, I've also been exploring a rewrite in Rust using PyO3. A contributor created an implementation that is somewhat close to the C speed. I'm not sure if it's abi3 compatible, but I think it is. Regardless, I'd still want to take advantage of abi3 wheels; presumably PyO3 would expose this new API to allow that.

@davidhewitt

I don't think there's any harm in defining the UTF-8 format loosely like that.

For what it's worth, my understanding is that Rust strings are required to be valid UTF8, so if the export format here didn't guarantee valid UTF8 code points then I think any PyO3-based rust wrapper might have to consider copying and running replacement on the surrogate codepoints. That's potentially a reason to keep UTF8-with-surrogatepass a separate explicit format.

zooba commented Oct 2, 2024

my understanding is that Rust strings are required to be valid UTF8, so if the export format here didn't guarantee valid UTF8 code points then I think any PyO3-based rust wrapper might have to consider copying and running replacement on the surrogate codepoints

Is there a practical difference between a PyUnicode_AsUTF8 function raising an error (disallowing surrogates) and the Rust conversion from raw UTF-8 failing (due to surrogates)?

If the "Export" format disallowed surrogates, for a string containing them it would be as if the format did not exist. In that scenario, you need to use a fallback function (there's no way to tell the difference between "UTF-8 isn't available already but (say) UCS-2 is" and "UTF-8 was available but it wasn't valid"), and the likely fallback is also going to disallow surrogates. To get them I believe you'll need to go via PyUnicode_AsEncodedString or PyCodec_Encode with specific arguments.

davidhewitt commented Oct 2, 2024

Ah, that's a good point. I had overlooked that this API doesn't really allow a clean separation of "format unsupported" versus "contents are not representable in this format".

If the Rust caller knew that the contents did not contain invalid code points, then this API would allow cheap zero-copy consumption from Rust as a string slice (&str). But we already have PyUnicode_AsUTF8AndSize to meet the same need anyway, and that is already in the stable ABI, so I expect we would be unlikely to switch to this API for UTF8 exporting regardless.

All that said, I've also been exploring a rewrite in Rust using PyO3. A contributor created an implementation that is somewhat close to the C speed. I'm not sure if it's abi3 compatible, but I think it is. Regardless, I'd still want to take advantage of abi3 wheels; presumably PyO3 would expose this new API to allow that.

@davidism I took a quick peek at the PR on markupsafe and it looks like it should be fine to just enable PyO3's abi3 opt-in and it'll work for you. But you couldn't use this API and build a single abi3 wheel (you'd be supporting 3.14+ only); more realistic would be to build two sets of abi3 wheels, one for your minimum version and one for 3.14+.

@vstinner

@encukou:

Taking into account what @zooba said, I wonder: how fast will CPython switch to UTF-8 (or, WTF-8) for the “main” internal storage?
If we do that in 3.15, we'd be adding stable API that would stop working after a single release. That does sound wasteful.

I studied the code recently; see the "What do you think of deprecating C APIs to modify immutable strings?" discussion.

There is a lot of code in the stdlib extensions, and far more in third-party projects, that relies on PEP 393 internals. If we decide to switch to UTF-8, IMO the C API will be a major blocker, and I don't expect that it can be done in a single Python release (in 3.15).
