
tch-based Python module? #174

Closed
cdfox opened this issue Apr 18, 2020 · 24 comments

@cdfox commented Apr 18, 2020

Do you know of any examples of Python modules written in Rust using tch? I'm interested in implementing a custom RNN cell in Rust using tch and exposing it to be used in a PyTorch program.

@vegapit commented Apr 18, 2020

I am intrigued. Why use a Python wrapper for tch rather than PyTorch directly?

@cdfox (Author) commented Apr 18, 2020

An RNN cell is going to involve a for loop over the timesteps. When you implement it in Python, a bit too much work ends up being done in the Python interpreter, and inference speed for longer sequences on CPU can benefit quite a bit from moving the cell implementation to, say, C++. I'm thinking a similar speedup could be achieved by implementing the cell in Rust.

Edit: I think a good starting place might be the examples here: https://github.com/PyO3/pyo3#examples.
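
To make the shape of the problem concrete, here is a minimal sketch of the kind of per-timestep loop I mean, written with tch (the function name, weight layout, and Elman-style update are illustrative assumptions, not an existing API):

use tch::{Kind, Tensor};

// Elman-style cell applied over a sequence: h_t = tanh(x_t*w_ih + h_{t-1}*w_hh + b).
// Shapes: xs [seq_len, batch, input], w_ih [input, hidden], w_hh [hidden, hidden], b [hidden].
fn rnn_forward(xs: &Tensor, w_ih: &Tensor, w_hh: &Tensor, b: &Tensor) -> Tensor {
    let (seq_len, batch) = (xs.size()[0], xs.size()[1]);
    let hidden = w_hh.size()[0];
    let mut h = Tensor::zeros(&[batch, hidden], (Kind::Float, xs.device()));
    for t in 0..seq_len {
        // This per-timestep loop is exactly the part that gets slow in the Python interpreter.
        h = (xs.get(t).matmul(w_ih) + h.matmul(w_hh) + b).tanh();
    }
    h
}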

@vegapit commented Apr 18, 2020

I understand.

The ideal setup would be to train the model in a high-performance language like Rust and export it for use from a scripting language like Python. Unfortunately, the tch library did not have model export functionality last time I checked.

The Python wrapper idea could be a good compromise if it works. I will give it a try and share if I get somewhere. Thanks.

@LaurentMazare (Owner)

I don't know of any such example, and that indeed seems like a good use case. The C++ API has a tutorial about this (Custom C++ and CUDA Extensions); it would be very nice to write a Rust version of it. As you noted, PyO3 is likely to be a good starting point (tch-rs already uses the cpython crate to interface with a Python runtime in the reinforcement learning examples).

> The ideal setup would be to train the model in a high-performance language like Rust and export it for use from a scripting language like Python. Unfortunately, the tch library did not have model export functionality last time I checked.

Funnily enough, I'm actually using the opposite setup: I experiment with models and train them in Python, as it's more flexible to play with and the heavy lifting takes place on the GPU anyway. When it comes to productionizing/deploying models, I use Rust, as I find it much better for building large, robust systems.

@vegapit commented Apr 19, 2020

> Funnily enough, I'm actually using the opposite setup: I experiment with models and train them in Python, as it's more flexible to play with and the heavy lifting takes place on the GPU anyway. When it comes to productionizing/deploying models, I use Rust, as I find it much better for building large, robust systems.

I think we had this debate when I was enquiring about how to export a JIT model with tch =;]

Using a tch model with cpython seems to work. Here is an example of an exported function:

use cpython::{PyResult, Python};
use tch::nn::{self, Module, OptimizerConfig};
use tch::{Kind, Tensor};

fn tch_train(_py: Python, xs: Vec<Vec<f64>>, ys: Vec<f64>) -> PyResult<f64> {
    let mut loss = 1f64;
    let vs = nn::VarStore::new(tch::Device::Cpu);
    let model = nn::seq().add(nn::linear(&vs.root(), 5, 1, Default::default()));
    let mut optim = nn::RmsProp::default().build(&vs, 0.001).unwrap();
    // Keep sweeping the dataset until the squared error drops below the threshold.
    while loss > 1e-4 {
        for (x, y) in xs.iter().zip(ys.iter()) {
            let t_x = Tensor::of_slice(x).to_kind(Kind::Float).unsqueeze(0);
            let t_y = Tensor::of_slice(&[*y]).to_kind(Kind::Float).unsqueeze(0);
            let t_out = model.forward(&t_x).squeeze();
            let t_loss = (t_y - t_out).pow(2f64).sum(Kind::Float);
            optim.backward_step(&t_loss);
            loss = f64::from(&t_loss);
        }
    }
    Ok(loss)
}

and the Python code I ran for testing:

import mymodule
import numpy as np

def my_func(x):
    return np.sum( x * np.array([5.0,-4.0,3.0,-2.0,1.0]))

xs = np.random.random(100).reshape((20,5)).tolist()
ys = np.apply_along_axis(my_func,1,xs).tolist()
print( mymodule.tch_train(xs,ys) )

The easy way to proceed is to have one wrapped function that trains the tch model and saves it to disk, and another that loads the model from disk and runs the estimation. That is obviously not ideal if estimations are requested at high frequency. Creating a PyClass that encapsulates the model could be a better solution, but I would not bet on it working.
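
For instance, the second function could look roughly like this (a minimal sketch assuming the same linear model as above; tch_predict and its path argument are illustrative names):

fn tch_predict(_py: Python, x: Vec<f64>, path: String) -> PyResult<f64> {
    let mut vs = nn::VarStore::new(tch::Device::Cpu);
    let model = nn::seq().add(nn::linear(&vs.root(), 5, 1, Default::default()));
    // Restore the weights that the training function saved with vs.save(&path).
    vs.load(&path).unwrap();
    let t_x = Tensor::of_slice(&x).to_kind(Kind::Float).unsqueeze(0);
    Ok(f64::from(&model.forward(&t_x).squeeze()))
}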

@LaurentMazare (Owner)

That's a pretty cool example.
Compared to the tutorial on developing C++ extensions (which is very close to what @cdfox would like to achieve: writing a custom RNN cell in an optimized language and using it from Python), the main missing bit is probably proper integration with tensors rather than going through NumPy arrays. This is especially likely to be an issue on GPU, as the data would have to move back and forth between main memory and GPU memory. Ideally, you would want the exposed Rust function to take PyTorch tensors as input and return PyTorch tensors as output.

@vegapit commented Apr 19, 2020

In this example, I only used NumPy arrays for brevity: in a couple of lines, I could generate random numbers, arrange them in the right shape, and compute the output vector. Ultimately, what is passed to the wrapper function are Python lists, as they are seamlessly converted to Rust Vecs by cpython.

I do not quite understand the dynamics of the GPU setting, as I have never really used CUDA.

@cdfox (Author) commented Apr 19, 2020

Just to clarify my use case: my first goal would be to get a speedup via Rust for inference on CPU. A speedup on training would be a bonus. It sounds like there is a pathway for passing a NumPy array from Python into a function written in Rust, where it's available as a PyArrayDyn, for example:
Python: https://github.com/PyO3/rust-numpy/blob/master/examples/simple-extension/README.md
Rust: https://github.com/PyO3/rust-numpy/blob/master/examples/simple-extension/src/lib.rs
But maybe there's no way to pass a PyTorch tensor into a Rust function (even just in main memory). I believe conversion from NumPy arrays to PyTorch tensors is pretty low overhead (https://discuss.pytorch.org/t/what-is-the-overhead-of-transforming-numpy-to-torch-and-vice-versa/7395), in Python at least. I'm not sure about passing a NumPy array into a Rust function and then converting it to a Tensor in Rust.
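
For that last step, a minimal sketch of what it could look like with pyo3 plus the rust-numpy crate (the function name and the copy into a tch Tensor are illustrative assumptions):

use numpy::PyReadonlyArrayDyn;
use pyo3::prelude::*;
use tch::{Kind, Tensor};

#[pyfunction]
fn numpy_sum(x: PyReadonlyArrayDyn<f32>) -> PyResult<f64> {
    // Copy the contiguous NumPy buffer into a tch Tensor: one memcpy, no Python-level loop.
    let shape: Vec<i64> = x.shape().iter().map(|&d| d as i64).collect();
    let t = Tensor::of_slice(x.as_slice()?).reshape(&shape);
    Ok(t.sum(Kind::Float).double_value(&[]))
}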

@vegapit commented Apr 19, 2020

Right, so in summary, the aim of this exercise is not only to access tch models from Python, but also to limit type conversions during the data transfer. In that case, PyTorch Tensor from/to tch Tensor conversions would indeed be the most appropriate. Intuitively, if the two have the same memory layout, there should be an "unsafe" way of getting it done.

@guillaume-be (Contributor) commented Apr 24, 2020

I understand the motivation for speeding up the loop using Rust rather than a pure Python implementation, but I am unsure this is the most effective way to achieve greater speed. This discussion indicates that a pure translation from Python to a high-performance language would result in speed gains of only about 10%. It points to a useful article whose techniques are likely to offer more significant benefits, especially operator fusion.

The cost of a loop over a sequence of, say, 100 elements is relatively low compared to the RNN operations within each iteration (especially for complex LSTM or GRU-like units).

@epwalsh commented Jun 12, 2020

I actually have another use case where I want to efficiently pass PyTorch tensors between Python and Rust.

With AllenNLP, one of our main performance drags is data loading, especially when the dataset is too big to fit in memory and has to be lazily loaded on the fly. If you can't load data fast enough, you can't keep the GPUs occupied.

So we've been playing around with the idea of writing data loaders in Rust that would hand off tensors to Python.

I would love to hear if anyone has gotten any further with this.

@awaited-hare

Support for writing PyTorch extensions in Rust would be extremely useful, especially for more complicated operations (much like the motivation for the official PyTorch C++/CUDA extension tutorial). However, pyo3 currently does not recognize the tch tensor type; maybe we can start by adding tensor support to pyo3?

@awaited-hare

@LaurentMazare Any thoughts on this? I'd like to help if needed.

@LaurentMazare (Owner)

Sorry for being slow to come back to this. A decent first step would indeed be to get the underlying pointer from the Python tensor object and pass it through pyo3 so that we can build the same tensor on the Rust side. I guess most of the work would be in understanding how the Python API wraps things. I may look at it when I find some time, but it's unlikely to happen in the next couple of weeks.

@jogardi commented Jun 29, 2020

I've started trying to do this for my own project and ran into an exit code 139. I used this code to create a function with PyO3:

use pyo3::prelude::*;
use pyo3::{wrap_pyfunction, AsPyPointer, PyNativeType};
extern crate tch;
use tch::Tensor;
use torch_sys::*;

#[pyfunction]
fn loss_for_neighbors(x: &PyAny) -> PyResult<()> {
    // return Ok((2 * x) as PyAny);
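    // NOTE (as diagnosed in the reply below): x.as_ptr() is a pointer to the
    // Python object (a PyObject*), not to the underlying at::Tensor, so the
    // cast on the next line is invalid and the later use of `t` segfaults.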
    let ct: *mut C_tensor = x.as_ptr() as *mut C_tensor;
    println!("got past ct");
    // let t = Tensor::from_py(x.as_ptr().into_py());
    let t = Tensor::from_ptr(ct);
    println!("got t");
    println!("{}", t.dim());
    Ok(())
}

#[pymodule]
fn geodb(py: Python, m: &PyModule) -> PyResult<()> {
    m.add_wrapped(wrap_pyfunction!(loss_for_neighbors))?;
    Ok(())
}

I added a from_ptr function to the Tensor struct in the tensor.rs file of tch-rs:

impl Tensor {
    /// Creates a new tensor.
    pub fn new() -> Tensor {
        let c_tensor = unsafe_torch!(at_new_tensor());
        Tensor { c_tensor }
    }

    pub fn from_ptr(c_tensor: *mut C_tensor) -> Tensor {
        Tensor { c_tensor }
    }
...

But I hit a segfault when running the following Python code:

import geodb
import torch
t = torch.zeros((2,))
geodb.loss_for_neighbors(t)

The output was:

got past ct
got t
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

@LaurentMazare (Owner)

I haven't investigated this very deeply, so I may be way off, but I think you're passing a pointer to the Python tensor object and trying to interpret it as a C tensor pointer, which results in a segfault.
Instead, I think you should probably pass a pointer to the tensor data, e.g. on the Python side pass t.data_ptr() instead of t. This int should be interpreted as a pointer, and you would want to create the tensor from this storage; you could then try using Tensor::of_data_size on it.
Passing the tensor dimensions and element type would help, although when prototyping, hardcoding these would be fine.
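
A minimal sketch of that suggestion, assuming a contiguous float32 CPU tensor (the function name and the hardcoded element size are illustrative):

use pyo3::prelude::*;
use tch::{Kind, Tensor};

#[pyfunction]
fn tensor_sum(data_ptr: usize, shape: Vec<i64>) -> PyResult<f64> {
    let numel: i64 = shape.iter().product();
    // SAFETY: the caller must pass t.data_ptr() of a contiguous float32 CPU tensor
    // that stays alive for the duration of this call.
    let bytes =
        unsafe { std::slice::from_raw_parts(data_ptr as *const u8, numel as usize * 4) };
    // of_data_size copies the raw storage into a fresh tch Tensor (no autograd link).
    let t = Tensor::of_data_size(bytes, &shape, Kind::Float);
    Ok(t.sum(Kind::Float).double_value(&[]))
}

which would be called from Python as tensor_sum(t.data_ptr(), list(t.shape)).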

@jogardi commented Jun 30, 2020

> I haven't investigated this very deeply, so I may be way off, but I think you're passing a pointer to the Python tensor object and trying to interpret it as a C tensor pointer, which results in a segfault.
> Instead, I think you should probably pass a pointer to the tensor data, e.g. on the Python side pass t.data_ptr() instead of t. This int should be interpreted as a pointer, and you would want to create the tensor from this storage; you could then try using Tensor::of_data_size on it.
> Passing the tensor dimensions and element type would help, although when prototyping, hardcoding these would be fine.

Wouldn't that copy the array but not include the gradient information and lose track of the computation graph? I would like this to work with autograd, but that might be difficult if we can't directly wrap the same C++ object being used by Python.

@LaurentMazare (Owner)

I'm not sure about the gradient, as I haven't thought about it deeply, but I feel this would be a good first proof of concept. If it ends up not segfaulting, it will be possible to build up from there.

@SunDoge (Contributor) commented Aug 8, 2020

Hi @jogardi @l1an0,
I made a proof-of-concept project that transfers a tch::Tensor to Python (and vice versa) as DLPack capsules, using pyo3. I hope it can be helpful.

https://github.com/SunDoge/tch-to-pytorch-poc/blob/master/src/lib.rs
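
On the Python side, using such a bridge would presumably look like this (consume and produce are hypothetical names standing in for the POC's exported functions; the torch.utils.dlpack helpers are standard PyTorch):

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack
import tch_poc  # hypothetical name for the compiled Rust extension

# PyTorch tensor -> DLPack capsule -> tch::Tensor on the Rust side (no copy)
tch_poc.consume(to_dlpack(torch.randn(2, 3)))
# tch::Tensor built in Rust -> DLPack capsule -> PyTorch tensor
out = from_dlpack(tch_poc.produce())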

@awaited-hare

@LaurentMazare Maybe we can create a user guide on how to build a Python module with tch, based on what @SunDoge has done?

@Ejhfast commented Jul 19, 2021

I'd also be very excited to see this happen. @SunDoge do gradients associated with the tensors pass between Python and Rust in your prototype?

@SunDoge (Contributor) commented Jul 20, 2021

@Ejhfast Nope, but it's possible. Tensor.grad is also a tensor and can be passed in the same way.

@egordm commented Feb 19, 2022

Hi, I had the same need for a Python/tch interface.
For that, I have set up a small proof of concept which uses the same functions as the torch operators.

It requires linking the pytorch_python library, as the functions are located in python_variable.h.

As I understand it, THPVariable_Wrap is used to wrap the tensor and transfer ownership to Python, while THPVariable_Unpack gives you a pointer to the tensor object itself.

The proof of concept is a bit of a hack right now:
Library changes: egordm@5e5fd75
Demo: egordm@9c3b32e

I would like to build a more polished version and open a PR if there is interest in such a thing.
Thanks

@LaurentMazare (Owner)

We've added a new pyo3-tch crate to make it easier to write such tch-based Python modules. You can find an example using it in the tch-ext repo (this is a work in progress, so the API is likely to change in the future).
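
The core of the tch-ext example boils down to something like the following sketch (check the repo for the exact current API, since it may change):

use pyo3::prelude::*;
use pyo3_tch::{wrap_tch_err, PyTensor};

// PyTensor wraps tch::Tensor and handles the torch.Tensor conversion at the boundary.
#[pyfunction]
fn add_one(tensor: PyTensor) -> PyResult<PyTensor> {
    let tensor = tensor.f_add_scalar(1.0).map_err(wrap_tch_err)?;
    Ok(PyTensor(tensor))
}

#[pymodule]
fn tch_ext(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(add_one, m)?)?;
    Ok(())
}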
