Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically wrap str in a vec![] for Vec<&str> and Vec<String> #2500

Merged
merged 3 commits into from
Jul 13, 2022

Conversation

jeertmans
Copy link
Contributor

@jeertmans jeertmans commented Jul 7, 2022

EDIT

After few discussions (see below), it was proposed to raise an error in the case when str can be split into a Vec<T>.
Thus, only a dynamic type checking is done with PyAny.isinstance_of::<PyString>().

Old proposition

As discussed in #2342, this PR proposes a way to wrap str arguments inside a vector to avoid iterating through chars when it is not desired.

Currently, only the Vec<String> is working, as I cannot manage to compile for Vec<&str> (code was commented out).

This "solution" leverages the use of the specialization features, which is currently unstable. As such, it must be compile with nightly feature activated as well as the nightly channel for the rust compiler.

Another solution, as discussed in #2342, would be to throw an error instead of wrapping in a vector.

Demo

[package]
name = "fromstr"
# ...

[dependencies]
pyo3 = { path = "...", features = ["extension-module", "nightly"] }
use pyo3::prelude::*;   
                                                            
#[pyfunction]   
fn print_strings(strings: Vec<String>) {            
    for s in strings {                                                                          
        println!("{}", s);                       
    }           
}                              
                                        
                          
#[pyfunction]                                                                                   
fn print_str(strings: Vec<&str>) {
    for s in strings {                      
        println!("{}", s);     
    }                                      
}                
         
#[pymodule]              
fn fromstr(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(print_strings, m)?)?;
    m.add_function(wrap_pyfunction!(print_str, m)?)?;
    Ok(())                                                           
}   
from fromstr import print_strings, print_str
                     
if __name__ == "__main__":
             
    print(1)                            
    print_strings("echo")
    print(2)              
    print_str("echo")
    print(3)
    print_strings(["echo"])
    print(4)
    print_str(["echo"])

# Outputs:
# 1
# echo
# 2
# e
# c
# h
# o
# 3
# echo
# 4
# echo

@birkenfeld
Copy link
Member

IMO it's not ok to silently change behavior when nightly is activated.

@jeertmans
Copy link
Contributor Author

@birkenfeld I agree that the user should be somehow notified of this, either through a warning or some error?

@davidhewitt
Copy link
Member

Gotta agree that I'm not a fan of having behavior change by a feature.

I know I copied a quote in #2342 which suggested this wrapping, but how about a simpler solution: just add a check when extracting Vec<T> that the input is not a string (and return an error if it is).

Without specialization we'd have to pay that cost even for non-text types like Vec<i32>. Though if we measure it, I'd hope it would be almost unnoticeable.

That would at least hopefully work on stable...?

@jeertmans
Copy link
Contributor Author

@davidhewitt The only way I see it could be implemented on the stable channel would be to use std::any::TypeId::of::<T>(), but it requires T to have a static lifetime, which is not allowed by pyo3 due to GIL restrictions I guess.

I have tried multiple tricks, but none seemed to work.

The implementation could look something like this:

impl<'a, T> FromPyObject<'a> for Vec<T>
where
    T: FromPyObject<'a>,
{
    fn extract(obj: &'a PyAny) -> PyResult<Self> {
        let ti = TypeId::of::<T>(); // Does not work
        if (ti == TypeId::of::<String>() || ti == TypeId::of::<&'static str>())
            && obj.is_instance_of::<PyString>()?
        {
            // Raise some error;
        }
        extract_sequence(obj)
    }
}

@davidhewitt
Copy link
Member

Do we need the TypeId checks? I was wondering about the simpler:

if let Ok(true) = obj.is_instance_of::<PyString>()
{
    return Err(PyValueError::new_err("Can't extract `str` to `Vec`"))
}
extract_sequence(obj)

Just depends how much that affects performance I think?

@jeertmans
Copy link
Contributor Author

The idea behind that kind of type checking (either with ˋTypeIdor specialization) was to still allow astr` to be transformed into, e.g., a ˋVecˋ.
But maybe this is not desirable as well and using isinstance is sufficient in this case.

@jeertmans
Copy link
Contributor Author

@davidhewitt Here is a small benchmark I ran and its results:

(Don't mind about the print... function names, it should be count...)

pyo3/src/types/sequence.rs

impl<'a, T> FromPyObject<'a> for Vec<T>
where
    T: FromPyObject<'a>, 
{
    fn extract(obj: &'a PyAny) -> PyResult<Self> {
        if let Ok(true) = obj.is_instance_of::<PyString>() {
            return Err(PyValueError::new_err("Can't extract `str` to `Vec`"));
        }
        extract_sequence(obj)
    }
}

fromstr/src/lib.rs

use pyo3::prelude::*;

#[pyfunction]
fn print_strings(strings: Vec<String>) -> usize {            
    strings.len()
}                              

  
#[pyfunction]  
fn print_str(strings: Vec<&str>) -> usize {
     strings.len()  
} 
                                         

#[pyfunction]  
fn print_int(strings: Vec<isize>) -> usize {
    strings.len() 
}                                                             
                                                                                
#[pyfunction]
fn print_char(strings: Vec<char>) -> usize {
    strings.len()
}  
  
#[pymodule]   
fn fromstr(_py: Python, m: &PyModule) -> PyResult<()> {                                                             
    m.add_function(wrap_pyfunction!(print_strings, m)?)?;                                                
    m.add_function(wrap_pyfunction!(print_str, m)?)?; 
    m.add_function(wrap_pyfunction!(print_int, m)?)?;        
    m.add_function(wrap_pyfunction!(print_char, m)?)?;         
    Ok(())                                             
}

main.py

  from fromstr import *
  from timeit import timeit
             
  if __name__ == "__main__":                                 
                 
      number = 100000          
                                        
      i = list(range(10)) 
      c = list("abcdefg")                                                                       
      s = "some random sentence".split(" ")
                                                                     
      print(timeit(lambda: print_strings(s), number=number) / number)
      print(timeit(lambda: print_str(s), number=number) / number)
      print(timeit(lambda: print_char(c), number=number) / number)
      print(timeit(lambda: print_int(i), number=number) / number)

Without the isinstance call

(venv) ➜  fromstr python main.py
4.702104139996663e-06                               
3.5061912999935886e-06                              
5.810010350005541e-06                               
5.9074885100017125e-06                              
(venv) ➜  fromstr python main.py
4.845567360007408e-06                               
3.9516925000043556e-06                              
5.810035859994969e-06                               
5.726638340001955e-06

With the isinstance call

(venv) ➜  fromstr python main.py 
4.092408500000601e-06
2.954179230000591e-06
4.588532469997517e-06
4.370560449997356e-06
(venv) ➜  fromstr python main.py
3.9268650299982255e-06
2.893349019996094e-06
4.480811690000337e-06
4.1616317100033485e-06

@davidhewitt
Copy link
Member

Hmm, those timeit numbers are really hard to interpret for me.

I did some benchmarking myself, my rough conclusion is that on my machine extracting an empty Vec takes about 100ns, and this check adds an additional 10ns (so 10%).

Extracting 10 &str elements takes another 100ns, or another way at looking at this is that we pay a cost similar to if the Vec was just one element larger.

Seems ok to me for correctness?

@jeertmans
Copy link
Contributor Author

I have to agree that timeit may not be the best to quantify performance differences, especially on that scale :'-)
10% seems fair enough, but hopefully if the specialization feature becomes stable one day, I think it would be nice to use it to gain some performances back.

@jeertmans jeertmans marked this pull request as ready for review July 12, 2022 15:36
@davidhewitt
Copy link
Member

Agreed. Correctness has to be the top priority, and as new compiler functionality emerges (and new language patterns) we can refine implementations to be more efficient.

A few things that this needs before ready for merge:

  • An entry in the Changed section of the CHANGELOG
  • A test for this new error case (extracting str as Vec should indeed fail).
  • To fix CI you'll need to rebase on clippy: fix some warnings from beta toolchain #2504 after merging (that PR includes a fix for the MSRV failure).

@jeertmans jeertmans force-pushed the str-specialization branch from 6b01925 to 433bdd8 Compare July 13, 2022 07:45
@jeertmans
Copy link
Contributor Author

@davidhewitt that should all be done now :-)
Please tell me if I am missing something

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, thanks for iterating on this!

@davidhewitt davidhewitt merged commit 308ffa2 into PyO3:main Jul 13, 2022
@jeertmans jeertmans deleted the str-specialization branch July 13, 2022 21:05
@jeertmans
Copy link
Contributor Author

It was a pleasure, thank you for your time!

@alex
Copy link
Contributor

alex commented Oct 13, 2022

Hi all, I realize I'm leaving this comment many months after this was landed, but...

Is there a reason ValueError was chosen here? TypeError seems more appropriate to me, and I wanted to ask if there was a reason it was rejected.

@davidhewitt
Copy link
Member

@alex that's a reasonable suggestion; I didn't have a strong opinion when reviewing, however TypeError probably is better. PR welcome!

If this behaviour is useful to you, would you be willing to also join the discussion in #2632? The proposal there would remove this behaviour, and I'm undecided as I see both sides of the debate, so I would value having more user input.

@alex
Copy link
Contributor

alex commented Oct 13, 2022

Sure, happy to do a PR to switch to TypeError :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't accept str in FromPyObject for Vec<&str> and Vec<String>
4 participants