Python bindings #220
Regarding the first point, unless there is a compelling reason not to do so, it would be better to follow |
I'm not sure whether the current linfa-rs is easy to adapt for Python bindings. I think there are two approaches. (1) Ship a numpy-like array module of our own:

```python
import my-numpy as np
x = np.xxx
# Now we can pass x to the linfa API, which avoids unnecessary copies under the hood.
```

(2) numpy is implemented in a native language (C), so it's possible to pass a numpy object to linfa and then use the native handles inside linfa:

```python
arr = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.float32)
arr_ptr = arr.ctypes.data
# We could use arr_ptr in native code.
```

Either way, linfa should be able to operate on a numpy object natively.
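To make approach (2) a bit more concrete, here is a minimal sketch using only numpy: it extracts the raw pointer along with the shape and stride metadata the native side would need. The `linfa_native.fit_kmeans` call in the comment is hypothetical — it only illustrates what an FFI boundary might look like.

```python
import numpy as np

# Hand the native side a pointer into numpy's own buffer (no copy).
# The array must be C-contiguous and of a known dtype, or the native
# side would read garbage; np.ascontiguousarray guards against that.
arr = np.array([[1, 2, 3], [1, 2, 3]], dtype=np.float32)
arr = np.ascontiguousarray(arr)

arr_ptr = arr.ctypes.data   # raw address of the first element
shape = arr.shape           # (2, 3) — must be passed alongside the pointer
strides = arr.strides       # byte strides, needed to interpret the layout

# A hypothetical FFI call might then look like:
# linfa_native.fit_kmeans(arr_ptr, shape, n_clusters=2)
print(hex(arr_ptr), shape, strides)
```

The pointer is only valid while `arr` is alive, so a real binding would also need to keep the Python object rooted for the duration of the native call.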
Ideally I want to break the |
I see, but I'm not sure whether the functionality we need from ndarray is totally equivalent to numpy's.
Just as a counterpoint to @relf: I've started on simple efforts at porting Linfa to Python a couple of times (PyO3 can be oddly frustrating sometimes), and I'm not convinced that matching However, the APIs, while similar, are not exactly the same, and for good reason: idiomatic Rust and idiomatic Python (which, granted, is somewhat looser) are not identical, and porting APIs between the languages reflects this. Both |
I also didn't see this linked yet, so I just wanted to add it to the thread: the PyO3 project already has a NumPy interop project called rust-numpy, which could help solve this problem without us doing it ourselves. Since Luca's |
I think it would be good to have support for Arrow as the in-memory data storage for linfa. Then any DataFrame based on Arrow, such as Polars or DataFusion, would be supported by default. That is how we can support multiple DataFrame APIs.
Feel free to open a separate issue for that |
Hi all,
I agree with most of @quietlychris's remark:
I do not know about pandas, but I think scikit-learn made (close to, if not) the best design decisions back then. The situation has changed now, and extending scikit-learn to make it compatible with projects written in other languages comes with unforeseen constraints and challenges (e.g. different idiomatic constructions, harder interface adaptations, adherence to dependencies' design choices and concepts, vendoring or depending on shared libraries for OpenMP and BLAS implementations, packaging, etc.), making it harder, though not impossible, to extend.

I think scikit-learn's initial design decisions put the emphasis on UX, documentation, and compatibility and composability with other projects in the Scientific Python ecosystem, rather than on performance, portability to other contexts (e.g. embedded systems), and interfaces to other languages. To me, this explains scikit-learn's adoption. I think other projects like mlpack or SHOGUN took different design decisions (based on different use-cases), and it might be valuable to learn about those projects' experience and challenges.

I think one of the most notable areas of improvement scikit-learn's maintainers have identified for the library is (native) performance. When it comes to native performance, we are putting effort into optimizing costly patterns of computation using Cython (see scikit-learn/scikit-learn#22587). While Cython is convenient because it manages a lot of complexity for us, we are facing the limits of its constructs and concepts (mainly the cost of polymorphism and dynamic method dispatch, and the lack of alternatives in Cython) for some of the lowest-level implementations. More generally, being tied to CPython, we also face the intrinsic performance limitations of the interpreter as a whole: even if dependencies like NumPy and SciPy have efficient low-level implementations, the full execution of users' pipelines remains costly and is generally single-threaded.
One of the alternative pathways we are currently working on to improve performance is a plugin system allowing third-party package developers to extend scikit-learn with their own custom (GPU) implementations. scikit-learn/scikit-learn#22438 drives the discussion and design, and https://github.com/soda-inria/sklearn-numba-dpex is, for instance, a package actively being developed to back some of the algorithm implementations with GPU kernels. I think the plugin system discussed in scikit-learn/scikit-learn#22438 might also be suited for giving linfa Python bindings.
Commenting according to your This data communication will be generic over all algorithms and can make porting to Python easy. But at the moment, probably not enough people are involved in the project to implement this critical feature.
Isn't it possible to have minimal overhead by reusing data allocated by CPython or Rust (via PyO3), similar to what is possible with CPython and C/C++ (via Cython, PyBind11 or nanobind)?
Depends on whether that allocated data is compatible with the |
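To make the buffer-reuse idea above concrete, here is a minimal sketch using only numpy and ctypes. The ctypes array stands in for memory allocated on the native (e.g. Rust) side — an assumption for illustration only; a real binding would get the buffer from PyO3.

```python
import ctypes
import numpy as np

# Wrap a buffer allocated "elsewhere" (here, a ctypes array standing in
# for Rust-allocated memory) as a numpy array without copying.
buf = (ctypes.c_double * 6)(*range(6))
arr = np.frombuffer(buf, dtype=np.float64).reshape(2, 3)

# Both views share the same memory: a write through the numpy view
# is visible through the original buffer, proving no copy was made.
arr[0, 0] = 42.0
print(buf[0])  # 42.0
print(np.shares_memory(arr, np.frombuffer(buf, dtype=np.float64)))  # True
```

The same zero-copy property is what rust-numpy's readonly/readwrite array views provide at the PyO3 boundary, so the answer mostly comes down to whether the allocation's dtype and layout match what the Rust side expects.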
I am not a low-level geek. It would be good to see how the Python API of Polars or DataFusion communicates with the Rust backend. Polars supports building a DataFrame from a NumPy array, and they have a very high-performance implementation so far. The same mechanism could be implemented here if someone is interested.
Sorry in advance for offering a somewhat pessimistic point of view. I am working on scientific computing in Rust, so I really want to write down some of my reflections on these efforts.
Best of luck. Side note: for linear-algebra-related tasks, Faer-rs seems to be a great alternative.
Thank you for this comprehensive comment, @abstractqqq. As a maintainer, I welcome and value this critique. Would you like to give details about your experience with scikit-learn and the pain points you have faced on scikit-learn's issue tracker?
Thank you for following up. I don't feel like going to the issue tracker because it's too crowded, and because it's really hard to pin the criticism down to a single "issue". As @quietlychris mentioned, a lot of it is backwards compatibility, because you need to support NumPy, or some other sparse data structure.

I can give you more examples. SimpleImputer on pandas DataFrames has terrible performance, likely because it uses NumPy to do the imputing, which isn't the right approach given the mixed data types in DataFrames. And what's the deal with f_classif and f_regression being the same function with different names? Why not turn on multithreading for the kd-trees in mutual information score? To make the f-statistics and mutual information score useful in a data science pipeline, the methods must handle nulls (because of real-world data quality issues), but right now they just fail, and no mention of the null issue can be found in the docs. I ended up rewriting the two algorithms in Polars + KDTree (from SciPy) and got an insane speed boost.

Then again, I really think that except for the models, the other functionality in scikit-learn is largely forgotten. That's why the docs are minimal, and issues like null handling are not brought up or paid enough attention to...

And finally, what's the deal with transformers? Say I am doing feature engineering and I want to do a simple log transform. In Polars I just write `pl.col("a").log()`, and this can be serialized and used in pipelines. In scikit-learn, I need all that boilerplate for a transformer, and FunctionTransformer doesn't serialize...
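For readers unfamiliar with the boilerplate being criticized, here is a minimal sketch of the fit/transform idiom, written without importing scikit-learn (the `LogTransformer` class is illustrative, not a real scikit-learn class): a whole class where Polars needs one expression.

```python
import numpy as np

class LogTransformer:
    """Stateless log transform, written in the scikit-learn fit/transform idiom."""

    def fit(self, X, y=None):
        # Nothing to learn for a log transform, but the idiom requires fit().
        return self

    def transform(self, X):
        return np.log(X)

X = np.array([[1.0], [np.e], [np.e ** 2]])
out = LogTransformer().fit(X).transform(X)
print(out.ravel())  # approximately [0., 1., 2.]
```

The equivalent Polars expression would be a single `pl.col("a").log()`, which is the asymmetry the comment above is pointing at.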
I believe that this discussion has digressed, and I'd like to get things back on track. As a project, Linfa doesn't really follow grand plans. If anyone feels strongly enough about the way that a Python API "should" look, they are welcome to open a draft PR with a prototype for one or two algorithms demonstrating the layout, and what they feel are the pros/cons of their approach. If that happens, I'm sure that other stakeholders (read: maintainers and developers who have contributed to Linfa and the associated ecosystem) will be happy to engage in a good-faith discussion around that initial implementation and other approaches (which I also imagine would include lots of code). There's no single right way to do software design, so until someone puts a concrete proposal forward, I'd like to avoid muddying the waters of this issue with what-ifs and sweeping generalizations about the scientific computing ecosystem.
You are right, and I apologize. Feel free to remove my comments if you see fit. I, too, am struggling with designing a good API and have been experimenting with things. It would be great if some of us could put out a plan or set up a meeting if there is enough momentum.
We should add Python bindings to the public API of `linfa` crates. This will allow us to fairly benchmark `linfa` against `scikit-learn`, which also uses a Python API, as well as making `linfa` easier to use, allowing for wider adoption. This process can be done piece by piece. I suggest we start with `linfa-clustering`, since that's the most-used `linfa` crate and also has prior art behind it.

**Questions**

- How closely do we match `scikit-learn`? Do we want exact parity?
- How do we accept `numpy` in our API without lots of data copying?
- `linfa` makes heavy use of generics, but for Python bindings we need to pick one monomorphization to build. For type params like `F` we can just pick `f64`, but for others it's less clear-cut. We may also need to choose between different params at runtime instead of compile time. Do we use an enum? A trait object?
- What do we build `linfa` with?

We'll likely put all the bindings into one Python package, so that we don't build multiple copies of `linfa` across multiple packages.

**Prior Art**

When @LukeMathWalker first released `linfa`, he also released Python bindings here, for benchmarking against `scikit-learn`. AFAIK these bindings only support KMeans, and they are also 3 years old, but they should provide a good starting point.
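The monomorphization question above can also be sketched from the Python side. Assuming (hypothetically) that the native layer is built only for `f64`, the binding layer has to decide at runtime whether an incoming array can pass through zero-copy or needs one explicit conversion; the `to_native_f64` helper below is an illustrative name, not an existing linfa API.

```python
import numpy as np

def to_native_f64(x):
    """Normalize input for a native layer monomorphized over f64 only."""
    x = np.asarray(x)
    if x.dtype == np.float64:
        return x                    # zero-copy passthrough
    return x.astype(np.float64)     # one explicit, documented copy at the boundary

a32 = np.ones((2, 2), dtype=np.float32)
a64 = to_native_f64(a32)
print(a64.dtype)                    # float64
print(to_native_f64(a64) is a64)    # True: no copy when already f64
```

Supporting more dtypes natively would turn this single branch into a runtime dispatch over several monomorphizations, which is exactly the enum-vs-trait-object trade-off raised in the question list.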