Feature request: raw data API for PyString #1776
pyo3 doesn't currently define various Unicode bindings that allow the retrieval of raw data from Python strings. Said bindings are a prerequisite to possibly exposing this data in the Rust API (PyO3#1776). Even if those high-level APIs never materialize, the FFI bindings are necessary to enable consumers of the raw C API to utilize them. This commit partially defines the FFI bindings as defined in CPython's Include/cpython/unicodeobject.h file. I used the latest CPython 3.9 Git commit for defining the order of the symbols and the implementation of various inline preprocessor macros. I tried to be as faithful as possible to the original implementation, preserving intermediate `#define`s as inline functions. The structs are a bit wonky and probably warrant the most review scrutiny. I haven't tested this code thoroughly. Missing symbols have been annotated with `skipped` and symbols currently defined in `src/ffi/unicodeobject.rs` have been annotated with `move`.
👍 I'm ok with adding this, however I think it's going to be quite hard. See my comments on #1777. FWIW, given that PEPs 393 & 623 are removing the "legacy" kinds (and if an object is in the legacy state, we can convert it with [...]

    pub enum PyStringData<'a> {
        Ucs1(&'a [u8]),
        Ucs2(&'a [u16]),
        Ucs4(&'a [u32]),
    }

There's perhaps room for an [...]
pyo3 doesn't currently define various Unicode bindings that allow the retrieval of raw data from Python strings. Said bindings are a prerequisite to possibly exposing this data in the Rust API (PyO3#1776). Even if those high-level APIs never materialize, the FFI bindings are necessary to enable consumers of the raw C API to utilize them. This commit partially defines the FFI bindings as defined in CPython's Include/cpython/unicodeobject.h file. I used the latest CPython 3.9 Git commit for defining the order of the symbols and the implementation of various inline preprocessor macros. I tried to be as faithful as possible to the original implementation, preserving intermediate `#define`s as inline functions. Missing symbols have been annotated with `skipped` and symbols currently defined in `src/ffi/unicodeobject.rs` have been annotated with `move`. The `state` field of `PyASCIIObject` is a bitfield, which Rust doesn't support. So we've provided accessor functions for retrieving these fields' values. No setter functions are present because you shouldn't be modifying these values from Rust code. Tests of the bitfield APIs and macro implementations have been added.
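To illustrate what read-only accessors over a C bitfield can look like, here is a minimal sketch. It is not the code from the commit: the `State` wrapper and its method names are hypothetical, and it assumes the CPython 3.9 field order (`interned:2`, `kind:3`, `compact:1`, `ascii:1`, `ready:1`) packed from the least significant bit, which is common on little-endian targets but is not guaranteed by the C standard.

```rust
/// Hypothetical stand-in for the `state` bitfield of `PyASCIIObject`.
/// Assumption: fields are packed starting at the least significant bit
/// in the order interned:2, kind:3, compact:1, ascii:1, ready:1.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct State(pub u32);

impl State {
    /// Bits 0-1: interning status.
    pub fn interned(self) -> u32 {
        self.0 & 0b11
    }

    /// Bits 2-4: storage kind (bytes per code point, 0 for legacy).
    pub fn kind(self) -> u32 {
        (self.0 >> 2) & 0b111
    }

    /// Bit 5: whether the object uses the compact layout.
    pub fn compact(self) -> u32 {
        (self.0 >> 5) & 1
    }

    /// Bit 6: whether the string holds only ASCII.
    pub fn ascii(self) -> u32 {
        (self.0 >> 6) & 1
    }

    /// Bit 7: whether the object is in the "ready" state.
    pub fn ready(self) -> u32 {
        (self.0 >> 7) & 1
    }
}
```

Only getters are defined, which matches the stated intent that Rust code reads these values but does not modify them.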
With the recent implementation of non-limited unicode APIs, we're able to query Python's low-level state to access the raw bytes that Python is using to store string objects. This commit implements a safe Rust API for obtaining a view into Python's internals and representing the raw bytes Python is using to store strings. Not only do we allow accessing what Python has stored internally, but we also support coercing this data to a `Cow<str>`. Closes PyO3#1776.
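As a rough sketch of how coercing the raw data to a `Cow<str>` could work, the following is written in plain Rust without any PyO3 dependency. The enum mirrors the shape suggested earlier in this thread; the method name `to_str` and the choice to return `Option` are assumptions made for illustration, not the API from the commit. It also assumes UCS1 data is Latin-1 and treats UCS2 data as UTF-16, which would merge an adjacent surrogate pair into one character.

```rust
use std::borrow::Cow;

/// Borrowed view of a string's raw storage, one variant per width.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PyStringData<'a> {
    Ucs1(&'a [u8]),
    Ucs2(&'a [u16]),
    Ucs4(&'a [u32]),
}

impl<'a> PyStringData<'a> {
    /// Hypothetical conversion: borrows when the bytes are already valid
    /// UTF-8 (pure ASCII), allocates otherwise. Returns `None` when the
    /// data holds something a Rust `str` cannot represent, such as an
    /// unpaired surrogate.
    pub fn to_str(self) -> Option<Cow<'a, str>> {
        match self {
            // ASCII is a subset of UTF-8, so no copy is needed.
            PyStringData::Ucs1(bytes) if bytes.is_ascii() => {
                std::str::from_utf8(bytes).ok().map(Cow::Borrowed)
            }
            // Latin-1 bytes map one-to-one onto U+0000..=U+00FF.
            PyStringData::Ucs1(bytes) => {
                Some(Cow::Owned(bytes.iter().map(|&b| char::from(b)).collect()))
            }
            PyStringData::Ucs2(units) => {
                String::from_utf16(units).ok().map(Cow::Owned)
            }
            PyStringData::Ucs4(points) => points
                .iter()
                .map(|&p| char::from_u32(p))
                .collect::<Option<String>>()
                .map(Cow::Owned),
        }
    }
}
```

Returning `Cow` lets the common ASCII case avoid an allocation while the wider representations pay for a conversion only when asked.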
PyUnicode internally stores its data in one of several representations. See https://docs.python.org/3/c-api/unicode.html.
PyO3's `PyString` currently only allows you to get at UTF-8 / Rust `str` compatible variations of the data.

rust-cpython, by contrast, exposes a `PyString.data()` returning a `PyStringData` enum. This API enables Rust to have access to the raw bytes backing a Python string, not the UTF-8 normalization of it (if different).

PyOxidizer was relying on this API for testing. (There are some low-level tests around encoding handling that need to verify exact byte sequences and Python string representations are being handled properly.)

While I'm certainly capable of using `unsafe` Python C APIs to get at the raw string data to close this feature gap, I was curious if PyO3 would be interested in a PR to expose a `PyStringData` enumeration for `PyString` instances. Here is my proposal:

- `PyString` gains a `pub fn data(&self) -> PyStringData<'_>`.
- `PyStringData` is an enum with a variant for each internal Python string variation.
- `PyString.data()` calls out to `PyUnicode_READY()` + `PyUnicode_{KIND, DATA, GET_LENGTH}` and constructs a `PyStringData` with a slice.

I'd be willing to contribute a PR for this feature if there is interest.
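
The last step of the proposal, building a `PyStringData` from a kind, a data pointer, and a length, could be sketched roughly as below. This is plain Rust with no Python calls: the `kind`, `data`, and `len` arguments stand in for what `PyUnicode_KIND`, `PyUnicode_DATA`, and `PyUnicode_GET_LENGTH` would return, the function name `data_from_raw` is hypothetical, and the variant names are only one possible choice. The kind values 1, 2, and 4 are assumed to match CPython's 1-, 2-, and 4-byte kinds.

```rust
use std::os::raw::c_void;
use std::slice;

/// Borrowed view of a string's raw storage, one variant per width.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PyStringData<'a> {
    Ucs1(&'a [u8]),
    Ucs2(&'a [u16]),
    Ucs4(&'a [u32]),
}

/// Hypothetical helper that wraps a raw buffer in the matching variant.
///
/// # Safety
/// For kinds 1, 2, and 4, `data` must point to `len` initialized,
/// properly aligned elements of that byte width, and the buffer must
/// stay alive and unmodified for the lifetime `'a`.
pub unsafe fn data_from_raw<'a>(
    kind: u32,
    data: *const c_void,
    len: usize,
) -> Option<PyStringData<'a>> {
    match kind {
        1 => Some(PyStringData::Ucs1(unsafe {
            slice::from_raw_parts(data as *const u8, len)
        })),
        2 => Some(PyStringData::Ucs2(unsafe {
            slice::from_raw_parts(data as *const u16, len)
        })),
        4 => Some(PyStringData::Ucs4(unsafe {
            slice::from_raw_parts(data as *const u32, len)
        })),
        // Any other kind (for example a legacy, not-yet-ready string)
        // is reported as unsupported instead of being dereferenced.
        _ => None,
    }
}
```

In a real binding, the lifetime `'a` would be tied to the borrow of the `PyString`, so the slice cannot outlive the Python object that owns the buffer.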