Support dataframe protocol. #10452

Closed
trivialfis opened this issue Jun 19, 2024 · 12 comments

Comments

@trivialfis
Member

https://data-apis.org/dataframe-protocol/latest/index.html

@MarcoGorelli

Hi @trivialfis

Quick note to say that I'd discourage using the interchange protocol - I've collected some reasons why here: pandas-dev/pandas#56732 (comment)

If I may, I'd like to suggest Narwhals and/or the Arrow PyCapsule Interface. This is what several packages (e.g. Altair, Plotly, Vegafusion, Marimo, scikit-lego, Rio, and more) are using, with several others (Bokeh, Prophet, formulaic) considering doing the same.

Happy to give this a go if you'd be open to it

@trivialfis
Member Author

Thank you for sharing, will look into these.

@trivialfis
Member Author

trivialfis commented Nov 23, 2024

Perhaps I'm missing something, but how come none of these interfaces can return a read-only C pointer for each column?

@MarcoGorelli

Going to cc @kylebarron into the conversation

@kylebarron

The Arrow C Data interface returns a pointer that describes a struct-type column, which recursively contains the pointers for all nested columns. That's defined on this page: https://arrow.apache.org/docs/format/CDataInterface.html
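
As a rough illustration of the layout described above, the sketch below mirrors the `ArrowArray` struct from the Arrow C Data Interface specification with `ctypes` and builds a struct-type parent whose children carry the per-column buffers. The helper names (`primitive_column`, `struct_column`, `column_values`) are hypothetical, only float64 columns without nulls are handled, and the release callback is left unset, so treat it as a reading aid rather than how any Arrow library actually produces the data.

```python
import ctypes
from array import array


class ArrowArray(ctypes.Structure):
    """Mirror of `struct ArrowArray` from the Arrow C Data Interface spec."""


ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))),
    ("private_data", ctypes.c_void_p),
]


def primitive_column(values: array) -> ArrowArray:
    """Hypothetical helper: wrap a float64 buffer as a primitive ArrowArray."""
    address, length = values.buffer_info()
    col = ArrowArray(length=length, null_count=0, offset=0,
                     n_buffers=2, n_children=0)
    # Primitive layout: buffers[0] is the validity bitmap (NULL here),
    # buffers[1] is the data buffer.
    col.buffers = (ctypes.c_void_p * 2)(None, address)
    return col


def struct_column(children: list) -> ArrowArray:
    """Hypothetical helper: a struct-type parent pointing at its child columns."""
    parent = ArrowArray(length=children[0].length, null_count=0, offset=0,
                        n_buffers=1, n_children=len(children))
    # A struct array carries only a validity buffer of its own.
    parent.buffers = (ctypes.c_void_p * 1)(None)
    pointers = [ctypes.pointer(child) for child in children]
    parent.children = (ctypes.POINTER(ArrowArray) * len(children))(*pointers)
    return parent


def column_values(parent: ArrowArray, index: int) -> list:
    """Follow parent -> child -> data buffer and read the float64 values."""
    child = parent.children[index].contents
    data_ptr = child.buffers[1]
    return list((ctypes.c_double * child.length).from_address(data_ptr))


# The Python arrays own the memory and must outlive the structs.
a = array("d", [1.0, 2.0, 3.0])
b = array("d", [0.5, 0.25, 0.125])
table = struct_column([primitive_column(a), primitive_column(b)])
first = column_values(table, 0)
second = column_values(table, 1)
```

A real consumer would also need the matching `ArrowSchema` to learn each column's type, and would have to call the producer's release callback when done.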

@trivialfis
Member Author

trivialfis commented Nov 23, 2024

Thank you for the references! I will look into that.

We are hoping to avoid any c dependency (including copied definition, and no cpython either) and rely on the Python stack for passing data. In addition, we simply concatenate all the chunks in a column and use a np array as the final data view.

def pandas_pa_type(ser: Any) -> np.ndarray:

As you can see, it's fragile and delicate in terms of performance and correctness. My wish is for something simpler, like the numpy __array_interface__, for each column (plus an additional one for the mask, if needed). It has no C dependency, and we can simply serialize it as a JSON document and pass it around across languages.
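
To make the wish above concrete, here is a minimal sketch of a per-column description in the style of NumPy's `__array_interface__` (the `data`, `shape`, `typestr`, and `version` keys), serialized as JSON and read back through `ctypes`. It uses only the standard library; `column_interface` and `read_column` are hypothetical names, only float64 and int64 buffers are handled, and this is not XGBoost's actual code.

```python
import ctypes
import json
import sys
from array import array


def column_interface(values: array) -> dict:
    """Hypothetical helper: describe a buffer like __array_interface__ does."""
    address, length = values.buffer_info()
    kind = {"d": "f", "q": "i"}[values.typecode]  # float64 or int64 only
    order = "<" if sys.byteorder == "little" else ">"
    return {
        "data": [address, True],  # (pointer as an integer, read-only flag)
        "shape": [length],
        "typestr": f"{order}{kind}{values.itemsize}",
        "version": 3,
    }


def read_column(document: str) -> list:
    """Hypothetical consumer: rebuild a view from the JSON description."""
    iface = json.loads(document)
    address, _read_only = iface["data"]
    (length,) = iface["shape"]
    ctype = {"f8": ctypes.c_double, "i8": ctypes.c_int64}[iface["typestr"][1:]]
    # No copy is made here; the producer must keep the buffer alive.
    view = (ctype * length).from_address(address)
    return list(view)


column = array("d", [1.0, 2.5, 4.0])
document = json.dumps(column_interface(column))
values = read_column(document)
```

The JSON document carries only a raw address, so it is meaningful only within the same process and only while `column` is alive.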

@baggiponte

baggiponte commented Nov 25, 2024

> and no cpython either

I think you mean Cython?

> We are hoping to avoid any c dependency

I am not sure what you mean by "any C dependency" - do you mean it as in "numpy depends on C" or "does it contain Cython code"? Anyway, Narwhals is just Python. No Cython. It's a unified layer to write dataframe-agnostic code.


Perhaps I'm stating something obvious here, but just to be sure. Narwhals' syntax is a subset of the Polars API (just the syntax: it's not using Polars under the hood) and is used to provide maintainers with a way to write dataframe-agnostic code. In other words: if xgboost implemented its data transformation logic with Narwhals, it would work out of the box with Polars, pandas, cuDF, modin, dask... without the maintainers handling the complexity, or having to support requests to use Polars or whatever new dataframe library will be popular in the future.

This enables any-dataframe-in -> same-dataframe-out transformation: if the user passes a pandas dataframe, pandas will be used to do the transformation. If the user goes with Polars, the Polars engine will do the transformation. Then, of course, at the end of your data transformation pipeline you can always cast everything into a (collection of) numpy array(s) for the good ol' model.fit().

Hope this provided a bit more context! 😊

@trivialfis
Member Author

> I think you mean Cython?

I meant CPython: we use the ctypes Python module for foreign function calls. Using PyCapsule (from Arrow) requires the Python C API.

> I am not sure what you mean by "any C dependency"

Apologies for the ambiguity. I meant the ArrowSchema C struct. Using it implies we will need to either copy the definition of this struct and its helper functions into XGBoost (this project) and hope that the ABI is indeed stable, or include the Arrow C package as a dependency.
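
For readers unfamiliar with what "copying the definition" would look like from the Python side, the sketch below mirrors the `ArrowSchema` struct from the Arrow C Data Interface specification using `ctypes`. It is an illustration only, not code from XGBoost; the sample field values are made up, and `metadata` is declared as a raw pointer because the spec encodes it as binary data rather than a NUL-terminated string.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Mirror of `struct ArrowSchema` from the Arrow C Data Interface spec."""


ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),       # type format string, e.g. b"l" for int64
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_void_p),     # binary-encoded key/value pairs
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),
]

ARROW_FLAG_NULLABLE = 2  # flag value defined by the spec

# Made-up example: a nullable int64 column named "feature_0".
schema = ArrowSchema(
    format=b"l",
    name=b"feature_0",
    flags=ARROW_FLAG_NULLABLE,
    n_children=0,
)
is_nullable = bool(schema.flags & ARROW_FLAG_NULLABLE)
```

Any such mirror has to match the C layout field for field, which is why the stability of the ABI matters for this approach.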

> Hope this provided a bit more context! 😊

Thank you for the context! Yes, it's helpful. When looking into the Arrow interface, I was under the impression that XGBoost should directly consume the Arrow stream or the Arrow C arrays. (I was hoping that something could help streamline the existing code and improve the performance.) But now it's clear that I should continue to use numpy as the middle layer. It's still quite helpful; I don't mean to complain.

@kylebarron

Narwhals is just Python but at some level you need a "C dependency" to pass ABI-stable C data, right?

If you want to use Arrow data for interop, then you need to trust the Arrow project's guarantee that it is actually ABI stable.

I would tend to argue that you should use a helper library to receive the Arrow data rather than trying to manage that yourself. Consider pyarrow, nanoarrow, or arro3.

Arrow and NumPy do not map 1:1 to each other. Arrow includes many structured types, as well as a nullability bitmask that NumPy cannot use directly.
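
To illustrate the bitmask point: an Arrow validity bitmap packs one bit per value, least-significant bit first, with a set bit meaning "valid", whereas a NumPy boolean mask spends one byte per value. The pure-Python sketch below unpacks such a bitmap into a list of booleans; `unpack_validity` is a hypothetical helper and the sample bitmap is made up.

```python
def unpack_validity(bitmap: bytes, length: int, offset: int = 0) -> list:
    """Hypothetical helper: expand an Arrow validity bitmap to one bool per value."""
    mask = []
    for i in range(offset, offset + length):
        byte = bitmap[i // 8]
        # Arrow numbers bits from the least-significant end of each byte.
        mask.append(bool((byte >> (i % 8)) & 1))
    return mask


# Made-up column [1, null, 3, 4, null]: bits 0, 2 and 3 are set.
bitmap = bytes([0b00001101])
mask = unpack_validity(bitmap, 5)
```

A consumer that wants a NumPy-style mask has to perform a conversion like this, or pass the packed bitmap through and handle it natively.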

@trivialfis
Member Author

Closing, as we are moving toward the Arrow format as suggested by @MarcoGorelli, @baggiponte, and @kylebarron. Thank you for sharing the latest status!

@kylebarron

Do you have a related issue/PR we can follow?

@trivialfis
Member Author

No tracking issue yet, but the initial support for Polars uses Arrow: #11116
