Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas) #10135

Open
DeflateAwning opened this issue Mar 16, 2025 · 28 comments

@DeflateAwning

Is your feature request related to a problem?

Pandas is much less performant than Polars, and it is used less and less in new projects. It would be awesome to be able to move data out of xarray and into Polars directly, without jumping through Pandas.

Describe the solution you'd like

Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas)

Describe alternatives you've considered

You currently have to do:

import polars as pl

pl.from_pandas(da.to_dataframe())

This is slower than it could be if there were a direct-to-Polars method.

Additional context

I'd even consider renaming the .to_dataframe() method to .to_pandas_df(). Suggesting that the main/default dataframe is Pandas seems a little strange in the 2025 data analysis ecosystem.


welcome bot commented Mar 16, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@max-sixty
Collaborator

What would the method do internally which would be faster than going through pandas? No objection per se, but there would need to be some benefit from adding the method...

@DeflateAwning
Author

DeflateAwning commented Mar 17, 2025

I'm not too certain exactly what the current mechanisms look like, but I do know there is an opportunity for improvement, as converting to Pandas is not a zero-cost operation.

Other cross-compatible libraries (e.g., DuckDB) have separate methods for to-pandas vs. to-polars, suggesting that there are benefits (i.e., performance benefits).

I believe there is a dataframe-library-agnostic dataframe specification (something about __dataframe__, maybe) which could work well for this.

@max-sixty
Collaborator

OK, feel free to post more details when you / someone else has them. We can leave this open for a while; eventually would suggest closing until we have some legible benefit

@DocOtak
Contributor

DocOtak commented Mar 17, 2025

@DeflateAwning Here is the dataframe interchange protocol spec: https://data-apis.org/dataframe-protocol/latest/index.html

I'm also interested in polars dataframe support in xarray.

@DeflateAwning
Author

OK, feel free to post more details when you / someone else has them. We can leave this open for a while; eventually would suggest closing until we have some legible benefit

Oh, sorry that wasn't clear. The obvious benefit is performance. The secondary benefit is avoiding Pandas; it is rightfully deemed legacy tech in all organizations I work with.

Thanks @DocOtak - that's exactly what I was talking about.

@max-sixty
Collaborator

The obvious benefit is performance.

OK, I'm not saying this isn't valid, but I am asking how it would be meaningfully more performant. An example showing the improvement would be great...

The spec looks interesting, thanks for posting. I don't see it covering creating a dataframe though...

@max-sixty added the "plan to close" label (May be closeable, needs more eyeballs) on Mar 18, 2025
@TomNicholas
Member

I have to agree with @max-sixty - the question is not whether polars is faster / better than pandas in general (I believe you), but whether an xarray.DataArray().to_polars_df() method can ever be faster than pl.from_pandas(da.to_dataframe()), and if it actually is faster today.

@max-sixty
Collaborator

please feel free to reopen with some empirics on performance improvements (on this specific method; we def believe polars is generally faster than pandas...)

@max-sixty closed this as not planned (Won't fix, can't repro, duplicate, stale) on Mar 25, 2025
@DeflateAwning
Author

Sorry, you want me to implement this and then do a performance test? Then, you'll decide if it's worth implementing?

@max-sixty
Collaborator

anything that gives us some empirical data that this is worth a new method. that could be a full implementation, it could be something as simple as a comparison of creating a dataframe from a numpy array

is that reasonable?

@TomNicholas
Member

you want me to implement this [...] Then, you'll decide if it's worth implementing?

For context, the reason @max-sixty is asking is because adding the method to xarray incurs a longer-term maintenance cost (borne by us, the maintainers), not just the one-time cost of implementation. Sorry if that seems annoying, but we have to be judicious about adding more API surface; otherwise the result is eventually a sprawling, unmaintainable mess.

@DeflateAwning
Author

DeflateAwning commented Mar 25, 2025

Here's a preliminary benchmark I did. You'll see that it is actually slower. Then I realized it's because a numpy array, as created, is a row-based store and not a columnar store: https://gist.github.com/DeflateAwning/2751ffa05bc74bad8e19d4a76c6ef8c5 [tl;dr: this benchmark is not representative; don't look at it]

So I redid the benchmarks and created this columnar benchmark version: https://gist.github.com/DeflateAwning/dd19fd9089e7529b6d26322c4aed042d

As you can see, the columnar benchmark version (which I assume more closely mimics how xarray stores the data, roughly) has significantly better performance (380 ms vs. 0.7 ms, in the most extreme case).
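
For context on what "columnar" means here, a minimal sketch of that kind of comparison (the exact shapes, columns, and timing harness in the gists may differ; the names below are illustrative):

import numpy as np
import pandas as pd
import polars as pl

# A columnar layout: one independent 1-D array per column, which is roughly
# how xarray stores each variable (a simplification for illustration).
n_rows, n_cols = 1_000_000, 10
columns = {f"col{i}": np.random.rand(n_rows) for i in range(n_cols)}

df_direct = pl.DataFrame(columns)                  # columns -> polars directly
df_via_pd = pl.from_pandas(pd.DataFrame(columns))  # columns -> pandas -> polars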

@max-sixty
Collaborator

nice! thanks for doing that.

one question: if we set rechunk=False, then the effect seems to go away:

[ins] In [5]: import numpy as np
         ...: import pandas as pd
         ...: import polars as pl
         ...: import timeit
         ...:
         ...: # Array shapes to test
         ...: shapes = [
         ...:     (10_000, 10),
         ...:     (10_000, 200),
         ...:     (100_000, 10),
         ...:     (100_000, 200),
         ...:     (1_000_000, 10),
         ...:     (1_000_000, 200),
         ...:     (10_000_000, 10),
         ...: ]
         ...:
         ...: REPEATS = 5
         ...:
         ...:
         ...: def time_numpy_to_polars(arr):
         ...:     def fn():
         ...:         df_pl = pl.from_numpy(arr, schema=[f"col{i}" for i in range(arr.shape[1])])
         ...:         assert df_pl.height > 1000
         ...:         assert len(df_pl.columns) in (10, 200)
         ...:         return df_pl
         ...:
         ...:     return timeit.timeit(fn, number=REPEATS) / REPEATS
         ...:
         ...:
         ...: def time_numpy_to_pandas_to_polars(arr):
         ...:     def fn():
         ...:         df = pd.DataFrame(arr, columns=[f"col{i}" for i in range(arr.shape[1])])
         ...:         df_pl = pl.from_pandas(df, rechunk=False)
         ...:         assert df_pl.height > 1000
         ...:         assert len(df_pl.columns) in (10, 200)
         ...:         del df
         ...:         return df_pl
         ...:
         ...:     return timeit.timeit(fn, number=REPEATS) / REPEATS
         ...:
         ...:
         ...: def benchmark():
         ...:     print(f"{'Shape':>15} | {'NumPy → Polars':>18} | {'NumPy → Pandas → Polars':>26}")
         ...:     print("-" * 65)
         ...:
         ...:     for shape in shapes:
         ...:         arr1 = np.random.rand(*shape)
         ...:         t_np_pd_polars = time_numpy_to_pandas_to_polars(arr1)
         ...:         del arr1
         ...:
         ...:         arr2 = np.random.rand(*shape)
         ...:         t_np_polars = time_numpy_to_polars(arr2)
         ...:         del arr2
         ...:
         ...:         print(f"{str(shape):>15} | {t_np_polars:>18.6f} s | {t_np_pd_polars:>26.6f} s")
         ...:
         ...:
         ...: for _ in range(5):
         ...:     benchmark()
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000210 s |                   0.002136 s
   (10000, 200) |           0.004562 s |                   0.017994 s
   (100000, 10) |           0.002282 s |                   0.004272 s
  (100000, 200) |           0.056629 s |                   0.043034 s
  (1000000, 10) |           0.019926 s |                   0.012140 s
 (1000000, 200) |           1.040867 s |                   0.352806 s
 (10000000, 10) |           0.203197 s |                   0.127347 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000214 s |                   0.002998 s
   (10000, 200) |           0.005002 s |                   0.020305 s
   (100000, 10) |           0.002292 s |                   0.004627 s
  (100000, 200) |           0.053998 s |                   0.042060 s
  (1000000, 10) |           0.022816 s |                   0.011790 s
 (1000000, 200) |           1.042097 s |                   0.346280 s
 (10000000, 10) |           0.208075 s |                   0.127311 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000466 s |                   0.004395 s
   (10000, 200) |           0.004703 s |                   0.020835 s
   (100000, 10) |           0.001563 s |                   0.004102 s
  (100000, 200) |           0.052693 s |                   0.046256 s
  (1000000, 10) |           0.021292 s |                   0.013224 s
 (1000000, 200) |           1.052150 s |                   0.345346 s
 (10000000, 10) |           0.204095 s |                   0.129334 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000628 s |                   0.003400 s
   (10000, 200) |           0.003552 s |                   0.020649 s
   (100000, 10) |           0.001446 s |                   0.003422 s
  (100000, 200) |           0.056508 s |                   0.045235 s
  (1000000, 10) |           0.023728 s |                   0.012902 s
 (1000000, 200) |           1.062767 s |                   0.341092 s
 (10000000, 10) |           0.219106 s |                   0.149330 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000442 s |                   0.003436 s
   (10000, 200) |           0.004077 s |                   0.017220 s
   (100000, 10) |           0.001618 s |                   0.003177 s
  (100000, 200) |           0.059910 s |                   0.045847 s
  (1000000, 10) |           0.020943 s |                   0.015437 s
 (1000000, 200) |           1.066620 s |                   0.361707 s
 (10000000, 10) |           0.208172 s |                   0.131291 s

are we confident that the resulting polars arrays are identically laid out between each comparison? can we add a check for that?

for context: I really don't want to seem overly skeptical — I'm a big fan of polars, and xarray would be keen to add modest features to allow better support. but I don't have a good theory for how pandas is adding overhead for array construction assuming it's laid out the same between pandas and polars (which might not be the case, hence the difference, in particular if polars supports row-major storage?)
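
One way such a check might look (a sketch; it assumes comparing polars' schema and per-column chunk counts is enough to characterise the layout, which may be an oversimplification):

import numpy as np
import pandas as pd
import polars as pl

arr = np.random.rand(100_000, 10)
cols = [f"col{i}" for i in range(arr.shape[1])]

df_direct = pl.from_numpy(arr, schema=cols)
df_via_pd = pl.from_pandas(pd.DataFrame(arr, columns=cols), rechunk=False)

# Same column names/dtypes, and same chunking per column, on both paths?
assert df_direct.schema == df_via_pd.schema
print(df_direct.n_chunks("all"), df_via_pd.n_chunks("all"))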

@dcherian
Contributor

dcherian commented Mar 25, 2025

I believe the key difference here (at least theoretically) is the index creation? That MultiIndex can be quite large, and slow to build IME.

So for the API, perhaps we can accept create_index: bool, dataframe_constructor: Callable? Assuming the constructors are compatible, it looks like we just pass in a dict to pd.DataFrame
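
A rough sketch of that shape of API (hypothetical names; it assumes the variables are already available as a dict of 1-D numpy arrays once index creation is skipped):

from typing import Callable

import numpy as np
import pandas as pd
import polars as pl

def to_dataframe(
    columns: dict[str, np.ndarray],
    dataframe_constructor: Callable = pd.DataFrame,
):
    """Hypothetical sketch of the constructor-agnostic path: both pd.DataFrame
    and pl.DataFrame accept a plain dict of 1-D arrays, so the expensive
    MultiIndex construction is simply not performed here."""
    return dataframe_constructor(columns)

cols = {"a": np.arange(5), "b": np.linspace(0.0, 1.0, 5)}
pdf = to_dataframe(cols)                 # pandas
pldf = to_dataframe(cols, pl.DataFrame)  # polars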

@DocOtak
Contributor

DocOtak commented Mar 26, 2025

@max-sixty

The spec looks interesting, thanks for posting. I don't see it covering creating a dataframe though...

I think the intent is that the producing libraries (e.g. xarray) don't make the dataframe themselves, but provide an interface for dataframe libraries like pandas and polars to consume.

In pandas it would be this: https://pandas.pydata.org/docs/reference/api/pandas.api.interchange.from_dataframe.html
In polars it is this: https://docs.pola.rs/api/python/stable/reference/api/polars.from_dataframe.html
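
For illustration, both of those entry points accept any object implementing __dataframe__ (shown here with frames that already exist; an xarray producer would expose the same protocol):

import pandas as pd
import polars as pl

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

pl_df = pl.from_dataframe(pdf)                    # pandas -> polars via __dataframe__
pd_df = pd.api.interchange.from_dataframe(pl_df)  # polars -> pandas via __dataframe__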

@max-sixty
Collaborator

ok interesting, is the suggestion that xarray should implement the dataframe interface?

@DeflateAwning
Author

I'd say yes - the best way to implement this is likely with the Dataframe Interchange Protocol.

Then, when the next hot dataframe library that uses quantum computing instead of lame 2025-era multithreading comes around, it'll be able to efficiently consume that.

Adding the .to_polars() method is a 1-liner once the Dataframe Interchange Protocol is implemented.

@max-sixty
Collaborator

does that mean that a dataset / dataarray would be advertising itself as a dataframe, though?

@keewis
Collaborator

keewis commented Mar 26, 2025

What if instead we added a method that returned an intermediate object defining only the dataframe protocol? With that, the explicit conversion would be something like pd.DataFrame(ds.to_df()) or pl.DataFrame(ds.to_df()) (not sure if that's actually how you'd feed dataframe-like objects to pandas / polars)
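
A sketch of that idea (hypothetical names; the wrapper delegates to pandas purely to have a working __dataframe__ implementation, whereas a real version would build the interchange columns from xarray's own arrays). Consumers would then use their interchange entry points rather than the plain constructors:

import pandas as pd
import polars as pl

class InterchangeView:
    """Hypothetical intermediate object exposing only the dataframe
    interchange protocol (roughly what ds.to_df() might return)."""

    def __init__(self, pdf: pd.DataFrame):
        self._pdf = pdf

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        # Delegating to pandas here; a real implementation would wrap
        # xarray's numpy arrays directly instead of materialising a frame.
        return self._pdf.__dataframe__(nan_as_null=nan_as_null, allow_copy=allow_copy)

view = InterchangeView(pd.DataFrame({"a": [1, 2, 3]}))
print(pd.api.interchange.from_dataframe(view))
print(pl.from_dataframe(view))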

@TomNicholas removed the "plan to close" label (May be closeable, needs more eyeballs) on Mar 27, 2025
@TomNicholas reopened this on Mar 27, 2025
@max-sixty
Collaborator

that sounds ideal @keewis !

@MarcoGorelli

MarcoGorelli commented Mar 27, 2025

Definitely interested in going from XArray to Polars without needing pandas as a dependency, but I'd suggest not using the dataframe interchange protocol. pandas core dev Will Ayd wrote about his experiences with it here:

While initially promising, this soon became problematic. [...] After many unexpected segfaults, I started to grow weary of this solution.
it only talks about how to consume data, but offers no guidance on how to produce it. If starting from your extension, you have no tools or library to manually build buffers. Much like the status quo, this meant reading from a Hyper database to a pandas DataFrame would likely be going through Python objects.

Furthermore, based on my own experience trying to fix up the interchange protocol implementation in pandas, my suggestion is to never use it for anything.

Instead, you may want to look at the PyCapsule Interface. Continuing on from Will's blog post:

After stumbling around the DataFrame Protocol Interface for a few weeks, Joris Van den Bossche [another pandas core dev] asked me why I didn’t look at the Arrow C Data Interface. [...]. Almost immediately my issues went away. I felt more confident in the implementation and had to deal with less memory corruption / crashes than before. And, perhaps most importantly, I saved a lot of time.

@kylebarron is one of the leading advocates for the PyCapsule Interface (apache/arrow#39195) and an expert in geospatial data science, and so might be good to loop in here. Reckon XArray is a good candidate to export a PyCapsule object which dataframe libraries could consume?
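
A sketch of what the producer side might look like (hypothetical wrapper; it leans on pyarrow to do the actual C stream export, though a lighter-weight Arrow implementation could play the same role):

import numpy as np
import pyarrow as pa

class DatasetArrowStream:
    """Hypothetical wrapper exposing a dict of 1-D columns through the
    Arrow PyCapsule Interface (__arrow_c_stream__)."""

    def __init__(self, columns: dict[str, np.ndarray]):
        self._table = pa.table(columns)

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate to pyarrow's C stream export; PyCapsule-aware consumers
        # (polars, duckdb, pyarrow, ...) can ingest this without pandas.
        return self._table.__arrow_c_stream__(requested_schema)

# Usage (assumes a Polars version recent enough to import via PyCapsule):
# import polars as pl
# df = pl.DataFrame(DatasetArrowStream({"a": np.arange(3)}))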

@kylebarron

kylebarron commented Mar 27, 2025

I agree that I would dissuade you from trying to implement the dataframe interchange protocol and would encourage adoption of the Arrow PyCapsule Interface.

What would the method do internally which would be faster than going through pandas? No objection per se, but there would need to be some benefit from adding the method...

This is also not clear to me. I don't know xarray internals that well; I thought xarray uses pandas as a required dependency, and so I figure that most xarray data is stored in a pandas DataFrame or Series? Then I figure the fastest (and simplest to implement) way to convert xarray to polars would be to reuse pandas' implementation of the PyCapsule Interface.

Pandas has implemented PyCapsule Interface export for a little while. pandas-dev/pandas#56587, pandas-dev/pandas#59518
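
Concretely, that reuse could look something like this (assumes pandas >= 2.2 with pyarrow installed for the capsule export):

import pandas as pd
import polars as pl

pdf = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

# pandas >= 2.2 exports its data as an Arrow C stream via a PyCapsule:
capsule = pdf.__arrow_c_stream__()

# Polars accepts the pandas frame directly; recent versions can also consume
# any object exposing __arrow_c_stream__, which is the route described above:
pl_df = pl.DataFrame(pdf)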

@max-sixty
Collaborator

I thought xarray uses pandas as a required dependency, and so I figure that most xarray data is stored in a pandas DataFrame or Series?

xarray currently has a required pandas dependency for its indexing. the standard backend is a numpy array

@kylebarron

Seems like your options are either:

  • implement a specific numpy -> polars implementation for to_polars_df
  • implement a generic DataFrame Interchange Protocol backend on top of how you store numpy data.
  • implement a generic Arrow PyCapsule Interface integration. This would require some Arrow backend, such as pandas' default pyarrow backend. However pyarrow is a massive dependency, which has some downsides. I wrote https://github.com/kylebarron/arro3 as a much smaller Arrow implementation, which you might be able to use to convert numpy data to Arrow (a rough sketch of this option follows below).
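
A sketch of what the third option could look like, using pyarrow as the Arrow implementation for simplicity (arro3 or another lighter library could play the same role):

import numpy as np
import polars as pl
import pyarrow as pa

arr = np.random.rand(1_000_000, 10)

# Build an Arrow table column by column; pyarrow copies each strided column
# slice unless the source array is already column-major (Fortran-ordered).
table = pa.table({f"col{i}": arr[:, i] for i in range(arr.shape[1])})

df_pl = pl.from_arrow(table)  # typically zero-copy from Arrow into Polars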

@max-sixty
Collaborator

implement a generic Arrow PyCapsule Interface integration. This would require some Arrow backend, such as pandas' default pyarrow backend. However pyarrow is a massive dependency...

if someone wants to take this on, we could have pyarrow as an optional dependency, that could work. it's optional for polars & pandas

but pandas itself seems like a satisfactory interchange format! whether the initially encouraging results are driven by layout/alignment differences vs. real overhead determines whether there's a perf improvement

@DeflateAwning
Author

Pandas uses NaN to represent nulls in string columns. It is a prime example of hacking things together. It is an awful interchange format.

Why not make Polars the interchange format?

@kylebarron

Arrow makes more sense than Polars to be an interchange format. It's explicitly designed as such, and is already used under the hood in Polars.
