-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OsStr and Path conversions #1379
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for this PR! This is a really great start. I've given it a thorough read and got various feedback and suggestions from my time spent in battle with the PyO3 codebase 💂.
Wow, thank you!! That is one detailed code review. I will go over it tomorrow and implement all the suggestions |
I was thinking this morning that we should check to understand carefully the differences in behavior on unix vs windows. If String -> OsString -> PyString gives different results on the two platforms this might trip up users. Also |
I think this can be taken care of using unit or integration tests. We could feed many different strings through the String -> OsString -> PyString conversion and back. This test will be run on all OSes by CI, so we know if something breaks somewhere. Also, is there a reason you reacted confused on my previous comment? Now I am confused 😄 |
By the way, why do all Rust->Python conversions have to be duplicated? There's two traits to implement: This is a bit awkward in the test because I essentially have to test everything twice to account for all the trait implementations |
It's a common consume/ not consume pattern in Rust, though it does not work nicely sometimes.
Since they share the implementation, I don't think we need to test all things twice. |
test_roundtrip::<&Path>(py, path); | ||
test_roundtrip::<Cow<'_, Path>>(py, Cow::Borrowed(path)); | ||
test_roundtrip::<Cow<'_, Path>>(py, Cow::Owned(path.to_path_buf())); | ||
test_roundtrip::<PathBuf>(py, path.to_path_buf()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If .as_ref()
is called at last, do we need to test all these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.as_ref() is only called to verify the end result. The conversion is done on the types as is.
It seems like the codecov check fail is a false positive. The places it marked as "not covered by tests" are actually covered by the tests. Perhaps codecov is getting confused by the generic trickery in the tests. |
src/conversions/osstr.rs
Outdated
|
||
// TODO: move to other module to prevent accidentally circumventing the new function? | ||
#[cfg(windows)] | ||
struct DropGuard<T>(*mut T); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's good to abstract such a specific operation.
In this struct, *mut T
implicitly must be a pointer in the Python heap without any reference to Python objects, so it's unsafe to construct this object.
So I recommend making this a more limited one without new
and placing this in the function scope if you like this solution more than Vec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this type is super unsafe
so I'd rather it was inside the function too. That way only the function can misuse it.
Hehe nope that's just a fat-finger error when checking on the conversation on my phone!
Yes; I would love to rework this pair to also potentially model fallible conversions, but to avoid ecosystem churn until we're really sure what a better design is I think best to leave as-is!
I think Also, I was thinking we might want to consider whether |
My original plan was to have |
Docs seem to be slightly misleading in this case - the implementation of From that implementation it looks like that function will never raise an exception (as long as the first argument is a valid unicode object), and also that doc describes that passing |
I'm currently working on something else but I hope to continue work on this in the next few days |
I looked at the source code of Python ( In other words, I'm not able to verify if |
Is there anything left to work on in this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to adjust the set of traits provided just slightly, see other comments.
R.E. PyString::to_string_lossy
- I'm going to open a separate PR for that case. I ran a quick test script:
use pyo3::prelude::*;
use pyo3::types::{PyDict, PyString};
use std::borrow::Cow;
use std::error::Error;
use std::ffi::OsString;
fn main() -> PyResult<()> {
Python::with_gil(|py| -> PyResult<()> {
let locals = PyDict::new(py);
py.run(
r#"x = '\udcfa\udcfb\udcfc\udcfd\udcfe\udcff'"#,
None,
Some(locals),
)?;
let py_str = locals.get_item("x").unwrap().downcast::<PyString>()?;
dbg!(py_str, py_str.len()?);
let os_string: OsString = py_str.extract()?;
dbg!(&os_string, &os_string.len());
let string: Cow<str> = py_str.to_string_lossy();
dbg!(&string, &string.chars().count());
Ok(())
})
}
The output was this:
[src/main.rs:16] py_str = '\udcfa\udcfb\udcfc\udcfd\udcfe\udcff'
[src/main.rs:16] py_str.len()? = 6
[src/main.rs:18] &os_string = "\xFA\xFB\xFC\xFD\xFE\xFF"
[src/main.rs:18] &os_string.len() = 6
[/home/david/dev/pyo3/src/types/string.rs:86] bytes = b'\xed\xb3\xba\xed\xb3\xbb\xed\xb3\xbc\xed\xb3\x
bd\xed\xb3\xbe\xed\xb3\xbf'
[src/main.rs:20] &string = "������������������"
[src/main.rs:20] &string.chars().count() = 18
It looks like the "surrogatepass" escape handler literally hands the three-byte surrogate sequences to Rust, which then results in three replacement characters being produced by std::String::from_utf8_lossy
. I think this is wrong - Python thinks these surrogates are a single codepoint, but they're currently treated as three by PyString::to_string_lossy
.
I think that using PyUnicode_EncodeFSDefault
is probably better and will then match PyString -> OsString -> String
.
src/conversions/osstr.rs
Outdated
} | ||
} | ||
|
||
impl<'a> IntoPy<PyObject> for &'a OsString { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an implementation that's not strictly necessary. What's the motivation for having it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
types/string.rs has an equivalent implementation. I oriented myself at those trait implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh ok.
I'm not entirely sure why that impl exists... it might be because #[pyo3(get)]
on String
fields needs it? Not sure.
If we can't figure out why we need this impl, I'd rather skip it for now. We can always add it later!
src/conversions/path.rs
Outdated
} | ||
} | ||
|
||
impl<'a> IntoPy<PyObject> for &'a PathBuf { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as for &'_ OsString
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing; I oriented myself at the trait implementations for String
} | ||
} | ||
|
||
impl ToPyObject for Cow<'_, OsStr> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also needs IntoPy<PyObject>
for Cow<'_, OsStr>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha same thing, seems like types/string.rs
is missing this as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - if you're willing, can you add it there also? 🙏
I think that adding IntoPy
for Cow
makes sense (as it's the trait needed to be able to return this type from #[pyfunction]
).
src/conversions/osstr.rs
Outdated
let fs_encoded_bytes: &crate::types::PyBytes = unsafe { | ||
ob.py() | ||
.from_borrowed_ptr(ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr())) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyUnicode_EncodeFSDefault
returns a new reference, so we need to own this pointer or we leak memory. I suggest using Py<PyBytes>
.
let fs_encoded_bytes: &crate::types::PyBytes = unsafe { | |
ob.py() | |
.from_borrowed_ptr(ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr())) | |
}; | |
let fs_encoded_bytes: Py<crate::types::PyBytes> = unsafe { | |
Py::from_owned_ptr(ob.py(), ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr())) | |
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will apply the suggestion. If you have time, I would be interested to hear why you couldn't just replace from_borrowed_ptr
with from_owned_ptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing. from_owned_ptr
returns &PyAny
(or another native type reference) where the owned pointer has to be stored by PyO3 in a thread-local vector. In comparison, Py
(and PyObject
, aka Py<PyAny>
) directly hold the owned pointer inside them. This is marginally more efficient, and also means that the temporary bytes
will be cleaned up immediately, instead of when PyO3 has a chance to cleanup its internal vector safely.
This has been a long thorn in PyO3's API imo - there's discussion at #1056 and #1308 where I hope to eventually remove this difference and make everything as efficient as possible!
I changed my mind on this - it actually would change Also the inconsistency does not seem to matter; I found someone else who stumbled across similar conversions at rust-lang/rust#56786 - in the end they decided it was better left as-is. At least the current implementation of |
src/conversions/osstr.rs
Outdated
unsafe { | ||
// This will not panic because the data from encode_wide is well-formed Windows | ||
// string data | ||
py.from_owned_ptr::<PyString>(ffi::PyUnicode_FromWideChar( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed that this can use PyObject::from_owned_ptr
, which will remove the need for .into()
also.
py.from_owned_ptr::<PyString>(ffi::PyUnicode_FromWideChar( | |
PyObject::from_owned_ptr(py, ffi::PyUnicode_FromWideChar( |
src/conversions/osstr.rs
Outdated
|
||
// Decode from Python's lossless bytes string representation back into raw bytes | ||
let fs_encoded_bytes: Py<crate::types::PyBytes> = unsafe { | ||
Py::from_owned_ptr(ob.py(), ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using crate::Py
here may allow you to avoid the painful OS-specific import at the top.
(sorry about the -D warnings
in CI - I think on balance it's useful to help keep the PyO3 code health up, even if it's a little frustrating at times 😬)
src/conversions/osstr.rs
Outdated
fn extract(ob: &PyAny) -> PyResult<Self> { | ||
#[cfg(not(windows))] | ||
{ | ||
let pystring = <PyString as PyTryFrom>::try_from(ob)?; // Cast PyAny to PyString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is shared between the two implementations so could be pulled out the top before them both.
} | ||
} | ||
|
||
impl ToPyObject for Cow<'_, OsStr> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - if you're willing, can you add it there also? 🙏
I think that adding IntoPy
for Cow
makes sense (as it's the trait needed to be able to return this type from #[pyfunction]
).
src/conversions/osstr.rs
Outdated
} | ||
} | ||
|
||
impl<'a> IntoPy<PyObject> for &'a OsString { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh ok.
I'm not entirely sure why that impl exists... it might be because #[pyo3(get)]
on String
fields needs it? Not sure.
If we can't figure out why we need this impl, I'd rather skip it for now. We can always add it later!
src/conversions/osstr.rs
Outdated
let fs_encoded_bytes: &crate::types::PyBytes = unsafe { | ||
ob.py() | ||
.from_borrowed_ptr(ffi::PyUnicode_EncodeFSDefault(pystring.as_ptr())) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing. from_owned_ptr
returns &PyAny
(or another native type reference) where the owned pointer has to be stored by PyO3 in a thread-local vector. In comparison, Py
(and PyObject
, aka Py<PyAny>
) directly hold the owned pointer inside them. This is marginally more efficient, and also means that the temporary bytes
will be cleaned up immediately, instead of when PyO3 has a chance to cleanup its internal vector safely.
This has been a long thorn in PyO3's API imo - there's discussion at #1056 and #1308 where I hope to eventually remove this difference and make everything as efficient as possible!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I think there's some final decisions to be made on the traits to provide, and then this is good to merge. Thanks again!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the many rounds of iteration - this PR is looking great to me now 🚀
I'm going to rebase and merge this. Thanks again! |
Implements conversions between OsStr/OsString/Path/PathBuf and Python strings: #1377
This is my first PR on this project and my first interaction with the codebase
This PR is a draft because in several places of the code I'm not sure if I did the right thing. Those places are marked with a
// HELP
comment. I'd greatly appreciate if an experienced PyO3 maintainer could check those places.