Add a `.to_polars_df()` method (very similar to `.to_dataframe()`, which implicitly uses pandas) #10135
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
What would the method do internally which would be faster than going through pandas? No objection per se, but there would need to be some benefit from adding the method...
I'm not too certain exactly what the current mechanisms look like, but I do know there is an opportunity for improvement. Other cross-compatible libraries (e.g., DuckDB) have separate methods for to-pandas vs. to-polars, suggesting that there are benefits (i.e., performance benefits). I believe there is also a sort of dataframe-library-agnostic dataframe specification (a dataframe interchange protocol).
OK, feel free to post more details when you / someone else has them. We can leave this open for a while; eventually I would suggest closing until we have some legible benefit.
@DeflateAwning Here is the dataframe interchange protocol spec: https://data-apis.org/dataframe-protocol/latest/index.html I'm also interested in polars dataframe support in xarray.
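For a concrete sense of what that protocol looks like in use, here is a minimal round-trip sketch using pandas' implementation of it (the example frame is arbitrary; any producer implementing `__dataframe__` would behave the same way on the consumer side):

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# Producer side: a library exposes its data via __dataframe__(),
# which returns an interchange object describing columns and buffers.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10, 20, 30]})
interchange_obj = df.__dataframe__()
print(interchange_obj.num_columns())  # → 2

# Consumer side: a dataframe library rebuilds its own type from any
# object that supports the protocol.
roundtripped = from_dataframe(df)
print(roundtripped.equals(df))  # → True
```

The key point for xarray would be implementing only the producer side; each consumer library (pandas, polars, ...) already ships its own `from_dataframe`.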
Oh, sorry that wasn't clear. The obvious benefit is performance. The secondary benefit is avoiding Pandas; it is rightfully deemed legacy tech in all organizations I work with. Thanks @DocOtak - that's exactly what I was talking about.
OK, I'm not saying this isn't valid, but I am asking how it would be meaningfully more performant. An example showing the improvement would be great... The spec looks interesting, thanks for posting. I don't see it covering creating a dataframe though...
I have to agree with @max-sixty - the question is not whether polars is faster / better than pandas in general (I believe you), but whether a direct xarray-to-polars method would be meaningfully faster than converting via pandas.
please feel free to reopen with some empirics on performance improvements (on this specific method; we def believe polars is generally faster than pandas...)
Sorry, you want me to implement this and then do a performance test? Then, you'll decide if it's worth implementing?
anything that gives us some empirical data that this is worth a new method. That could be a full implementation, or it could be something as simple as a comparison of creating a dataframe from a numpy array. Is that reasonable?
For context, the reason @max-sixty is asking is because adding the method to xarray incurs a longer-term maintenance cost (borne by us, the maintainers), not just the one-time cost of implementation. Sorry if that seems annoying, but we have to be judicious about adding more API surface, otherwise eventually the result is a sprawling, unmaintainable mess.
Here's a preliminary benchmark I did; you'll see that it is actually slower. Then I realized that's because a numpy array, as created, is a row-based store and not a columnar store: https://gist.github.com/DeflateAwning/2751ffa05bc74bad8e19d4a76c6ef8c5 [tl;dr: this benchmark is not representative; don't look at it]. So I redid the benchmarks and created this columnar benchmark version: https://gist.github.com/DeflateAwning/dd19fd9089e7529b6d26322c4aed042d As you can see, the columnar benchmark version (which I assume more closely mimics how xarray stores the data, roughly) has significantly better performance (380 ms vs. 0.7 ms, in the most extreme case).
nice! thanks for doing that. one question:

```python
import numpy as np
import pandas as pd
import polars as pl
import timeit

# Array shapes to test
shapes = [
    (10_000, 10),
    (10_000, 200),
    (100_000, 10),
    (100_000, 200),
    (1_000_000, 10),
    (1_000_000, 200),
    (10_000_000, 10),
]

REPEATS = 5


def time_numpy_to_polars(arr):
    def fn():
        df_pl = pl.from_numpy(arr, schema=[f"col{i}" for i in range(arr.shape[1])])
        assert df_pl.height > 1000
        assert len(df_pl.columns) in (10, 200)
        return df_pl

    return timeit.timeit(fn, number=REPEATS) / REPEATS


def time_numpy_to_pandas_to_polars(arr):
    def fn():
        df = pd.DataFrame(arr, columns=[f"col{i}" for i in range(arr.shape[1])])
        df_pl = pl.from_pandas(df, rechunk=False)
        assert df_pl.height > 1000
        assert len(df_pl.columns) in (10, 200)
        del df
        return df_pl

    return timeit.timeit(fn, number=REPEATS) / REPEATS


def benchmark():
    print(f"{'Shape':>15} | {'NumPy → Polars':>18} | {'NumPy → Pandas → Polars':>26}")
    print("-" * 65)

    for shape in shapes:
        arr1 = np.random.rand(*shape)
        t_np_pd_polars = time_numpy_to_pandas_to_polars(arr1)
        del arr1

        arr2 = np.random.rand(*shape)
        t_np_polars = time_numpy_to_polars(arr2)
        del arr2

        print(f"{str(shape):>15} | {t_np_polars:>18.6f} s | {t_np_pd_polars:>26.6f} s")


for _ in range(5):
    benchmark()
```

Output (5 runs):

```
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000210 s |                 0.002136 s
   (10000, 200) |           0.004562 s |                 0.017994 s
   (100000, 10) |           0.002282 s |                 0.004272 s
  (100000, 200) |           0.056629 s |                 0.043034 s
  (1000000, 10) |           0.019926 s |                 0.012140 s
 (1000000, 200) |           1.040867 s |                 0.352806 s
 (10000000, 10) |           0.203197 s |                 0.127347 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000214 s |                 0.002998 s
   (10000, 200) |           0.005002 s |                 0.020305 s
   (100000, 10) |           0.002292 s |                 0.004627 s
  (100000, 200) |           0.053998 s |                 0.042060 s
  (1000000, 10) |           0.022816 s |                 0.011790 s
 (1000000, 200) |           1.042097 s |                 0.346280 s
 (10000000, 10) |           0.208075 s |                 0.127311 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000466 s |                 0.004395 s
   (10000, 200) |           0.004703 s |                 0.020835 s
   (100000, 10) |           0.001563 s |                 0.004102 s
  (100000, 200) |           0.052693 s |                 0.046256 s
  (1000000, 10) |           0.021292 s |                 0.013224 s
 (1000000, 200) |           1.052150 s |                 0.345346 s
 (10000000, 10) |           0.204095 s |                 0.129334 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000628 s |                 0.003400 s
   (10000, 200) |           0.003552 s |                 0.020649 s
   (100000, 10) |           0.001446 s |                 0.003422 s
  (100000, 200) |           0.056508 s |                 0.045235 s
  (1000000, 10) |           0.023728 s |                 0.012902 s
 (1000000, 200) |           1.062767 s |                 0.341092 s
 (10000000, 10) |           0.219106 s |                 0.149330 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000442 s |                 0.003436 s
   (10000, 200) |           0.004077 s |                 0.017220 s
   (100000, 10) |           0.001618 s |                 0.003177 s
  (100000, 200) |           0.059910 s |                 0.045847 s
  (1000000, 10) |           0.020943 s |                 0.015437 s
 (1000000, 200) |           1.066620 s |                 0.361707 s
 (10000000, 10) |           0.208172 s |                 0.131291 s
```

are we confident that the resulting polars arrays are identically laid out between each comparison? can we add a check for that?

for context: I really don't want to seem overly skeptical; I'm a big fan of polars, and xarray would be keen to add modest features to allow better support. but I don't have a good theory for how pandas is adding overhead for array construction, assuming the data is laid out the same between pandas and polars (which might not be the case, hence the difference, in particular if polars supports row-major storage?)
I believe the key difference here (at least theoretically) is the index creation? That MultiIndex can be quite large, and slow to build IME. So for the API, perhaps we can accept going through pandas.
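The index-creation cost is easy to see in isolation. A rough sketch (the grid size is arbitrary; the second case imitates the product MultiIndex that `to_dataframe()` builds over the dims):

```python
import time

import numpy as np
import pandas as pd

n_x, n_y = 2_000, 2_000  # arbitrary grid size
values = np.random.rand(n_x * n_y)

# Plain RangeIndex: cheap, no coordinate labels materialized.
t0 = time.perf_counter()
df_flat = pd.DataFrame({"value": values})
t_flat = time.perf_counter() - t0

# Product MultiIndex over the dims, similar to what to_dataframe() produces:
# all n_x * n_y label pairs get materialized.
t0 = time.perf_counter()
idx = pd.MultiIndex.from_product([range(n_x), range(n_y)], names=["x", "y"])
df_indexed = pd.DataFrame({"value": values}, index=idx)
t_indexed = time.perf_counter() - t0

print(f"flat: {t_flat:.4f}s  multiindex: {t_indexed:.4f}s")
```

On a typical machine the MultiIndex version is noticeably slower, which supports the theory that the index, not the column data, dominates the conversion cost.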
I think the intent is that the producing libraries (e.g. xarray) don't make the dataframe themselves, but provide an interface for dataframe libraries like pandas and polars to consume. In pandas the consumer side would be this: https://pandas.pydata.org/docs/reference/api/pandas.api.interchange.from_dataframe.html
ok interesting, is the suggestion that xarray should implement the dataframe interface?
I'd say yes - the likely-best way to implement this is with the Dataframe Interchange Protocol. Then, when the next hot dataframe library that uses quantum computing instead of lame 2025-era multithreading comes around, it'll be able to efficiently consume xarray data too. Adding the `.to_polars_df()` method could then be a thin wrapper over that.
does that mean that a dataset / dataarray would be advertising itself as a dataframe, though?
What if instead we added a method that returned an intermediate object defining only the dataframe protocol? With that, the conversion would stay explicit on the consumer side: you'd call the consuming library's `from_dataframe` on that intermediate object.
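A minimal sketch of that idea follows. The class name is illustrative, and for brevity it delegates to pandas' interchange implementation internally; a real xarray version would build the interchange columns from its own variables and coords:

```python
import pandas as pd
from pandas.api.interchange import from_dataframe


class DataFrameView:
    """Hypothetical intermediate object exposing only the interchange protocol."""

    def __init__(self, inner: pd.DataFrame):
        # Stand-in for xarray's variables/coords in this sketch.
        self._inner = inner

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # Delegate to pandas' interchange object; a real implementation
        # would construct this from xarray's own buffers.
        return self._inner.__dataframe__(nan_as_null=nan_as_null, allow_copy=allow_copy)


view = DataFrameView(pd.DataFrame({"temp": [280.1, 281.4], "x": [0, 1]}))

# The dataset/dataarray never advertises itself as a dataframe;
# only the intermediate object does, and the consumer converts explicitly:
df = from_dataframe(view)
print(df.shape)  # → (2, 2)
```

This keeps the conversion opt-in: nothing treats a Dataset as a dataframe unless the user asks for the view first.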
that sounds ideal, @keewis!
Definitely interested in going from xarray to Polars without needing pandas as a dependency, but I'd suggest not using the dataframe interchange protocol. pandas core dev Will Ayd wrote about his experiences with it here:

> Furthermore, based on my own experience trying to fix up the interchange protocol implementation in pandas, my suggestion is to never use it for anything. Instead, you may want to look at the PyCapsule Interface.

Continuing on from Will's blog post:
@kylebarron is one of the leading advocates for the PyCapsule Interface (apache/arrow#39195) and an expert in geospatial data science, so it might be good to loop him in here. Reckon xarray is a good candidate to export a PyCapsule object which dataframe libraries could consume?
I agree that I would dissuade you from trying to implement the dataframe interchange protocol and would encourage adoption of the Arrow PyCapsule Interface.
This is also not clear to me. I don't know xarray internals that well; I thought xarray used pandas as a required dependency, and so I figured that most xarray data is stored in pandas structures under the hood. Pandas has implemented PyCapsule Interface export for a little while: pandas-dev/pandas#56587, pandas-dev/pandas#59518
xarray currently has a required pandas dependency for its indexing; the standard backend is a numpy array.
Seems like your options are either: convert via pandas and rely on its existing Arrow PyCapsule export, or implement the Arrow PyCapsule Interface directly on xarray objects.
if someone wants to take this on, we could have a method for it, but pandas itself seems like a satisfactory interchange format! whether the initially encouraging results are driven by index creation/alignment vs. real conversion overhead determines whether there's a perf improvement
Pandas uses NaN to represent nulls in string columns. It is a prime example of hacking things together, and an awful interchange format. Why not make Polars the interchange format?
Arrow makes more sense than Polars to be an interchange format. It's explicitly designed as such, and is already used under the hood in Polars. |
Is your feature request related to a problem?

Pandas is much less performant, and is decreasingly used in new projects. It would be awesome to be able to move data out of xarray and into Polars directly, without jumping through Pandas.

Describe the solution you'd like

Add a `.to_polars_df()` method (very similar to `.to_dataframe()`, which implicitly uses pandas).

Describe alternatives you've considered

You currently have to convert via pandas, e.g. `pl.from_pandas(ds.to_dataframe())`. This is slower than it could be if there were a direct-to-polars method.

Additional context

I'd even consider renaming the `.to_dataframe()` method to `.to_pandas_df()`. Suggesting that the main/default dataframe is Pandas seems a little strange in the 2025 data analysis ecosystem.