Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] standardize PyBytes <-> Vec<u8> or &[u8] or Cow<[u8]> #4182

Closed
wants to merge 1 commit into from

Conversation

diliop
Copy link

@diliop diliop commented May 13, 2024

When generating PyBytes from Vec<u8> or &[u8] or Cow<[u8]> somewhat different behaviors are observed:

Python::with_gil(|py| {
    let bytes_py = PyBytes::new_bound(py, b"foobar");
    dbg!(&bytes_py);  // &bytes_py = b'foobar'

    let bytes_vec = bytes_py.extract::<Vec<u8>>().unwrap();
    let py_obj = bytes_vec.to_object(py);
    let maybe_py_bytes = py_obj.downcast_bound::<PyBytes>(py);
    dbg!(maybe_py_bytes);  // Err(DowncastError { from: [102, 111, 111, 98, 97, 114], to: "PyBytes" })

    let bytes_slice = bytes_py.extract::<&[u8]>().unwrap();
    let py_obj = bytes_slice.into_py(py);  // to_object(py) does not work!
    let maybe_py_bytes = py_obj.downcast_bound::<PyBytes>(py);
    dbg!(maybe_py_bytes);  // Ok(b'foobar')

    let bytes_cow = bytes_py.extract::<Cow<'_, [u8]>>().unwrap();
    let py_obj = bytes_cow.to_object(py);
    let maybe_py_bytes = py_obj.downcast_bound::<PyBytes>(py);
    dbg!(maybe_py_bytes);  // Ok(b'foobar')

    Ok(())
})

Ideally, I would like to be able to handle all 3 convesions the same way, with either into_py and/or to_object. I'm not 100% what the best approach would be but I took a stab at ToPyObject to showcase how that might work. Would love some feedback on the approach as well as pointers on how to best proceed to address IntoPy and FromPyObject, assuming all this doable.

Copy link

codspeed-hq bot commented May 13, 2024

CodSpeed Performance Report

Merging #4182 will not alter performance

Comparing diliop:main (59119db) with main (10152a7)

🎉 Hooray! pytest-codspeed just leveled up to 2.2.1!

A heads-up, this is a breaking change and it might affect your current performance baseline a bit. But here's the exciting part - it's packed with new, cool features and promises improved result stability 🥳!
Curious about what's new? Visit our releases page to delve into all the awesome details about this new version.

Summary

✅ 68 untouched benchmarks

@davidhewitt
Copy link
Member

davidhewitt commented May 24, 2024

Thanks for the PR, and sorry it's taken a little while to get a review!

This is definitely a topic with a situation in PyO3 which is currently unsatisfying, see also #2888 with some past discussion in this area. (cc @adamreichold)

I think as much as it feels impure, it would be desirable for Vec<u8> to special-case with Python bytes as this is such a common performance footgun. I think also if we can find a generic solution then #1200 can also support slices to-python better.

Regarding the conversion traits we currently also have #4060 for the to-python direction which would reduce the number of to-python traits down to just one. Maybe while we're making changes this is a good moment to explore a solution for this. (cc @Icxolu)

I definitely want this case to be better, and I still can't see any solution to implement the desired behaviour other than hanging the implementation off FromPyObject and IntoPyObject by an additional method.

I wonder, can we maybe solve this by adding additional methods to those traits which are effectively private? Basically we make one or more of the types in their signature PyO3-private and then we give a default implementation so that users can't touch it.

e.g.

trait FromPyObject<'py> {

    /* existing stuff ... */

    fn extract_bytes(&self, override: pyo3::impl_::Override) -> ... {
        // default implementation returns None or fails or similar
    }
}

where the pyo3::impl_::Override type cannot be instantiated by PyO3 users, so they cannot call this method. I wonder if it's possible to go further and make the type unnameable, so they cannot implement it either. I'm sure I saw ideas related to this somewhere.

@Dr-Emann
Copy link

Sounds like Sealed trait methods, just put the Override type in a private module with no public reexports.

@Icxolu
Copy link
Contributor

Icxolu commented May 24, 2024

Yeah, this is kind of problem for every kind of generic (de)serialization currently. The real solution would be specialization, but I'm not expecting that to land any time soon, so...

Regarding the conversion traits we currently also have #4060 for the to-python direction which would reduce the number of to-python traits down to just one. Maybe while we're making changes this is a good moment to explore a solution for this. (cc @Icxolu)

This would definitely be an opportunity to improve the type conversion pit falls together with a new simplified and more powerful conversion mechanism. I think I'm going to rebase #4060 and think a bit how these would interact.

@Icxolu
Copy link
Contributor

Icxolu commented May 24, 2024

After rebasing #4060 and giving this a first look, I have some observations about combining these two ideas:

  • I think the general idea of using a sealed trait-method for this should work with it as well.
  • The default implementation would need some additional trait bounds around Error, but since it would be sealed and the relevant collection already have the same bound, this should not be a problem.
  • The Target type for Vec<T>, etc would have to be PyAny (instead of PyList), since it could now be PyBytes (or even something else in the future) as well. A bit sad, but not the end of the world.
  • The current implementation uses self (by value) for the conversion. This makes working with borrowed iterators extremely inconvenient, since it requires a lot of HRTBs all over the place (and have not checked yet whether it would work out in the end). So currently all collections are implemented by value only, so that they can use the owning iterators... This would mean that this might only work for Vec<T> and not &[T] or Cow<T>. I'm still a bit thorn whether that's the right choice to make, or whether we should take &self instead. It seems like we're making trade-offs either way, but that's more of a discussion for Add IntoPyObject conversion trait #4060.

@davidhewitt
Copy link
Member

  • The default implementation would need some additional trait bounds around Error, but since it would be sealed and the relevant collection already have the same bound, this should not be a problem.

I don't entirely follow why this is necessary, though I guess it would become clearer to me in time.

  • The Target type for Vec<T>, etc would have to be PyAny (instead of PyList), since it could now be PyBytes (or even something else in the future) as well. A bit sad, but not the end of the world.

Ah, yes. That might actually be a good thing to help users realise that there's something a bit funny going on. I also sort of wonder if there's a way to exploit the type system to achieve this statically, though that'll be an experiment all on its own.

  • The current implementation uses self (by value) for the conversion. This makes working with borrowed iterators extremely inconvenient, since it requires a lot of HRTBs all over the place (and have not checked yet whether it would work out in the end). So currently all collections are implemented by value only, so that they can use the owning iterators... This would mean that this might only work for Vec<T> and not &[T] or Cow<T>. I'm still a bit thorn whether that's the right choice to make, or whether we should take &self instead. It seems like we're making trade-offs either way, but that's more of a discussion for Experiment: Fallible conversion trait #4060.

👍 I'm looking forward to carving out some time to play around with #4060 in detail as soon as we've shipped 0.22 (which I think is mostly waiting on me to rebase stuff...)

@diliop
Copy link
Author

diliop commented May 27, 2024

Thanks for all the issue links and overview of the problems with PyBytes <-> Vec<u8>!

Regarding the conversion traits we currently also have #4060 for the to-python direction which would reduce the number of to-python traits down to just one.

From a user's perspective this would be a great improvement since it makes the code more readable and whenever there is "just one way of doing X", makes using the said code more ergonomic.

I think the general idea of using a sealed trait-method for this should work with it as well.

This would also be amazing for the Vec<u8> -> PyBytes specialization! I now also understand a bit more the design choices behind from_py_object_bound_sealed::Sealed. Happy to close this PR and continue the conversation on #4060.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants