Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string: implement API to access raw string data #1794

Merged
merged 1 commit into from
Aug 21, 2021

Conversation

indygreg
Copy link
Contributor

With the recent implementation of non-limited unicode APIs, we're
able to query Python's low-level state to access the raw bytes that
Python is using to store string objects.

This commit implements a safe Rust API for obtaining a view into
Python's internals and representing the raw bytes Python is using
to store strings.

Not only do we allow accessing what Python has stored internally,
but we also support coercing this data to a Cow<str>.

Closes #1776.

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks useful to me! I wonder whether eventually we'd use this internally in PyO3 too to potentially avoid causing strings to generate (and cache) their UTF-8 representation.

I have a bunch of small nitty comments below.

I'd also very much like to see some more tests of PyStringData::to_string() - it would be good to cover both the success case (which looks like it's covered) and the error cases (which don't appear to be) for each of the three character widths.

Ucs4(&'a [u32]),
}

impl<'a> PyStringData<'a> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it doesn't look like any of these methods need to refer to 'a, I think you can just do:

Suggested change
impl<'a> PyStringData<'a> {
impl PyStringData<'_> {

src/types/string.rs Show resolved Hide resolved
src/types/string.rs Show resolved Hide resolved
src/types/string.rs Outdated Show resolved Hide resolved
src/types/string.rs Outdated Show resolved Hide resolved
src/types/string.rs Outdated Show resolved Hide resolved
src/types/string.rs Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@davidhewitt
Copy link
Member

Also, types::string is currently not a public module, which is why CI is unhappy. I don't personally see the harm in making it public? I think I'd rather users wrote use pyo3::types::string::PyStringData over use pyo3::types:::PyStringData - the latter seems misleading.

@indygreg
Copy link
Contributor Author

OK. I think the most recent push addresses most (all?) of your feedback.

I did disagree with you on exposing PyStringData under pyo3::types::string. If you insist on that, suggest a commit and I'll apply it.

I added tests for verifying .to_string() errors. This required using PyUnicode_FromKindAndData() to construct instances with invalid byte sequences, which Python appears to accept without validation.

As part of writing the tests, I noticed there may be a c_int / c_uint mismatch in the ffi definitions. When I implemented the FFI bindings, I attempted to copy signedness from the C headers. However, some of these things were macros and lacked type annotations. I may have used the wrong types for symbols like PyUnicode_1BYTE_KIND :/

@indygreg indygreg force-pushed the pystringdata branch 2 times, most recently from 14ed542 to 83bf70b Compare August 19, 2021 03:51
src/types/string.rs Outdated Show resolved Hide resolved
@indygreg indygreg force-pushed the pystringdata branch 3 times, most recently from e39c167 to 364d61f Compare August 19, 2021 05:17
src/types/string.rs Show resolved Hide resolved
src/types/string.rs Outdated Show resolved Hide resolved
src/types/string.rs Outdated Show resolved Hide resolved
src/types/string.rs Outdated Show resolved Hide resolved
With the recent implementation of non-limited unicode APIs, we're
able to query Python's low-level state to access the raw bytes that
Python is using to store string objects.

This commit implements a safe Rust API for obtaining a view into
Python's internals and representing the raw bytes Python is using
to store strings.

Not only do we allow accessing what Python has stored internally,
but we also support coercing this data to a `Cow<str>`.

Closes PyO3#1776.
Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, this looks good to me!

@davidhewitt davidhewitt merged commit 591f44e into PyO3:main Aug 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: raw data API for PyString
3 participants