Add a `.to_polars_df()` method (very similar to `.to_dataframe()`, which implicitly uses pandas) #10135

DeflateAwning · 2025-03-16T21:50:44Z

Is your feature request related to a problem?

Pandas is much less performant, and is decreasingly used in new projects. It would be awesome to be able to move data out of xarray and into Polars directly, without jumping through Pandas.

Describe the solution you'd like

Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas)

Describe alternatives you've considered

You currently have to do:

import polars as pl

pl.from_pandas(da.to_dataframe())`

This is slower than it could be if there was a directly-to-polars method.

Additional context

I'd even consider renaming the .to_dataframe() method to .to_pandas_df(). Suggesting that the main/default dataframe is Pandas seems a little strange in the 2025 data analysis ecosystem.

The text was updated successfully, but these errors were encountered:

welcome · 2025-03-16T21:50:48Z

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

max-sixty · 2025-03-16T22:57:09Z

What would the method do internally which would be faster than going through pandas? No objection per se, but there would need to be some benefit from adding the method...

DeflateAwning · 2025-03-17T16:19:39Z

I'm not too certain exactly what the current mechanisms look like, but I do know there is an opportunity for improvement as converting to ~~Python~~ Pandas is not a zero-cost operation.

Other cross-compatible libraries (e.g., DuckDB) have separate methods for to-pandas vs to-polars, suggesting that there are benefits (_i.e., performance benefits).

I believe there is a sort of dataframe-library-agnostic dataframe specification (something about __dataframe__ maybe) which may be effective for this.

max-sixty · 2025-03-17T19:22:02Z

OK, feel free to post more details when you / someone else has them. We can leave this open for a while; eventually would suggest closing until we have some legible benefit

DocOtak · 2025-03-17T23:21:18Z

@DeflateAwning Here is the dataframe interchange protocol spec: https://data-apis.org/dataframe-protocol/latest/index.html

I'm also interested in polars dataframe support in xarray.

DeflateAwning · 2025-03-17T23:38:01Z

OK, feel free to post more details when you / someone else has them. We can leave this open for a while; eventually would suggest closing until we have some legible benefit

Oh, sorry that wasn't clear. The obvious benefit is performance. The secondary benefit is avoiding Pandas; it is rightfully deamed legacy tech in all organizations I work with.

Thanks @DocOtak - that's exactly what I was talking about.

max-sixty · 2025-03-18T03:13:46Z

The obvious benefit is performance.

OK, I'm not saying this isn't valid, but I am asking how it would be meaningfully more performant. An example showing the improvement would be great...

The spec looks interesting, thanks for posting. I don't see it covering creating a dataframe though...

TomNicholas · 2025-03-18T19:07:25Z

I have to agree with @max-sixty - the question is not whether polars is faster / better than pandas in general (I believe you), but whether an xarray.DataArray().to_polars_df() method can ever be faster than pl.from_pandas(da.to_dataframe()), and if it actually is faster today.

max-sixty · 2025-03-25T02:08:31Z

please feel free to reopen with some empirics on performance improvements (on this specific method; we def believe polars is generally faster than pandas...)

DeflateAwning · 2025-03-25T19:47:53Z

Sorry, you want me to implement this and then do a performance test? Then, you'll decide if it's worth implementing?

max-sixty · 2025-03-25T19:55:34Z

anything that gives us some empirical data that this is worth a new method. that could be a full implementation, it could be something as simple as a comparison of creating a dataframe from an numpy array

is that reasonable?

TomNicholas · 2025-03-25T20:00:29Z

you want me to implement this [...] Then, you'll decide if it's worth implementing?

For context, the reason @max-sixty is asking is because adding the method to xarray incurs a longer-term maintenance cost (borne by us, the maintainers), not just the one-time cost of implementation. Sorry if that seems annoying, but we have to be judicious about adding more API surface otherwise eventually the result is a sprawling unmaintainable mess.

DeflateAwning · 2025-03-25T21:08:07Z

Here's a preliminary benchmark I did. You'll see that it is actually slower. Then I realized it's because a numpy array, as created, is a row-based store and not a columnar store: https://gist.github.com/DeflateAwning/2751ffa05bc74bad8e19d4a76c6ef8c5 [tl;dr: this benchmark is not representative; don't look at it]

So I redid the benchmarks and created this columnar benchmark version: https://gist.github.com/DeflateAwning/dd19fd9089e7529b6d26322c4aed042d

As you can see, the columnar benchmark version (which I assume more closely mimics how xarray stores the data, roughly) has significantly better performance (380 ms vs. 0.7ms, in the most extreme case).

max-sixty · 2025-03-25T23:14:06Z

nice! thanks for doing that.

one question: if we turn rechunk=False, then the effect seems to go away:

[ins] In [5]: import numpy as np
         ...: import pandas as pd
         ...: import polars as pl
         ...: import timeit
         ...:
         ...: # Array shapes to test
         ...: shapes = [
         ...:     (10_000, 10),
         ...:     (10_000, 200),
         ...:     (100_000, 10),
         ...:     (100_000, 200),
         ...:     (1_000_000, 10),
         ...:     (1_000_000, 200),
         ...:     (10_000_000, 10),
         ...: ]
         ...:
         ...: REPEATS = 5
         ...:
         ...:
         ...: def time_numpy_to_polars(arr):
         ...:     def fn():
         ...:         df_pl = pl.from_numpy(arr, schema=[f"col{i}" for i in range(arr.shape[1])])
         ...:         assert df_pl.height > 1000
         ...:         assert len(df_pl.columns) in (10, 200)
         ...:         return df_pl
         ...:
         ...:     return timeit.timeit(fn, number=REPEATS) / REPEATS
         ...:
         ...:
         ...: def time_numpy_to_pandas_to_polars(arr):
         ...:     def fn():
         ...:         df = pd.DataFrame(arr, columns=[f"col{i}" for i in range(arr.shape[1])])
         ...:         df_pl = pl.from_pandas(df, rechunk=False)
         ...:         assert df_pl.height > 1000
         ...:         assert len(df_pl.columns) in (10, 200)
         ...:         del df
         ...:         return df_pl
         ...:
         ...:     return timeit.timeit(fn, number=REPEATS) / REPEATS
         ...:
         ...:
         ...: def benchmark():
         ...:     print(f"{'Shape':>15} | {'NumPy → Polars':>18} | {'NumPy → Pandas → Polars':>26}")
         ...:     print("-" * 65)
         ...:
         ...:     for shape in shapes:
         ...:         arr1 = np.random.rand(*shape)
         ...:         t_np_pd_polars = time_numpy_to_pandas_to_polars(arr1)
         ...:         del arr1
         ...:
         ...:         arr2 = np.random.rand(*shape)
         ...:         t_np_polars = time_numpy_to_polars(arr2)
         ...:         del arr2
         ...:
         ...:         print(f"{str(shape):>15} | {t_np_polars:>18.6f} s | {t_np_pd_polars:>26.6f} s")
         ...:
         ...:
         ...: for _ in range(5):
         ...:     benchmark()
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000210 s |                   0.002136 s
   (10000, 200) |           0.004562 s |                   0.017994 s
   (100000, 10) |           0.002282 s |                   0.004272 s
  (100000, 200) |           0.056629 s |                   0.043034 s
  (1000000, 10) |           0.019926 s |                   0.012140 s
 (1000000, 200) |           1.040867 s |                   0.352806 s
 (10000000, 10) |           0.203197 s |                   0.127347 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000214 s |                   0.002998 s
   (10000, 200) |           0.005002 s |                   0.020305 s
   (100000, 10) |           0.002292 s |                   0.004627 s
  (100000, 200) |           0.053998 s |                   0.042060 s
  (1000000, 10) |           0.022816 s |                   0.011790 s
 (1000000, 200) |           1.042097 s |                   0.346280 s
 (10000000, 10) |           0.208075 s |                   0.127311 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000466 s |                   0.004395 s
   (10000, 200) |           0.004703 s |                   0.020835 s
   (100000, 10) |           0.001563 s |                   0.004102 s
  (100000, 200) |           0.052693 s |                   0.046256 s
  (1000000, 10) |           0.021292 s |                   0.013224 s
 (1000000, 200) |           1.052150 s |                   0.345346 s
 (10000000, 10) |           0.204095 s |                   0.129334 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000628 s |                   0.003400 s
   (10000, 200) |           0.003552 s |                   0.020649 s
   (100000, 10) |           0.001446 s |                   0.003422 s
  (100000, 200) |           0.056508 s |                   0.045235 s
  (1000000, 10) |           0.023728 s |                   0.012902 s
 (1000000, 200) |           1.062767 s |                   0.341092 s
 (10000000, 10) |           0.219106 s |                   0.149330 s
          Shape |     NumPy → Polars |    NumPy → Pandas → Polars
-----------------------------------------------------------------
    (10000, 10) |           0.000442 s |                   0.003436 s
   (10000, 200) |           0.004077 s |                   0.017220 s
   (100000, 10) |           0.001618 s |                   0.003177 s
  (100000, 200) |           0.059910 s |                   0.045847 s
  (1000000, 10) |           0.020943 s |                   0.015437 s
 (1000000, 200) |           1.066620 s |                   0.361707 s
 (10000000, 10) |           0.208172 s |                   0.131291 s

are we confident that the resulting polars arrays are identically laid out between each comparison? can we add a check for that?

for context: I really don't want to seem overly skeptical — I'm a big fan of polars, and xarray would be keen to add modest features to allow better support. but I don't have a good theory for how pandas is adding overhead for array construction assuming it's laid out the same between pandas and polars (which might not be the case, hence the difference, in particular if polars supports row-major storage?)

dcherian · 2025-03-25T23:24:42Z

I believe the key difference here (at least theoretically) is the index creation? That MultiIndex can be quite large, and slow to build IME.

So for API perhaps we can accept create_index: bool, dataframe_constructor: Callable? Assuming the constructors are compatible, looks like we just pass in a dict to pd.DataFrame

DocOtak · 2025-03-26T12:36:32Z

@max-sixty

The spec looks interesting, thanks for posting. I don't see it covering creating a dataframe though...

I think the intent is the the producing libraries (e.g. xarray) don't make the dataframe themselves, but provide an interface for dataframe libraries like pandas and polars to consume.

In pandas it would be this: https://pandas.pydata.org/docs/reference/api/pandas.api.interchange.from_dataframe.html
In polars it is this: https://docs.pola.rs/api/python/stable/reference/api/polars.from_dataframe.html

max-sixty · 2025-03-26T14:44:02Z

ok interesting, is the suggestion that xarray should implement the dataframe interface?

DeflateAwning · 2025-03-26T18:57:07Z

I'd say yes - the likely-best way to implement this is with the Dataframe Interchange Protocol.

Then, when the next hot dataframe library that uses quantum computing instead of lame 2025-era multithreading comes around, it'll be able to efficiently consume that.

Adding the .to_polars() method is a 1-liner once the Dataframe Interchange Protocol is implemented.

max-sixty · 2025-03-26T21:50:32Z

does that mean that a dataset / dataarray would be advertising itself as a dataframe, though?

keewis · 2025-03-26T22:39:17Z

What if instead we added a method that returned an intermediate object defining only the dataframe protocol? With that, the explicit conversion would be something like pd.DataFrame(ds.to_df()) or pl.DataFrame(ds.to_df()) (not sure if that's actually how you'd feed dataframe-like objects to pandas / polars)

max-sixty · 2025-03-27T02:36:17Z

that sounds ideal @keewis !

MarcoGorelli · 2025-03-27T11:41:56Z

Definitely interested in going from XArray to Polars without needing pandas as a dependency, but I'd suggest not using the dataframe interchange protocol. pandas core dev Will Ayd wrote about his experiences with it here

While initially promising, this soon became problematic. [...] After many unexpected segfaults, I started to grow weary of this solution.
it only talks about how to consume data, but offers no guidance on how to produce it. If starting from your extension, you have no tools or library to manually build buffers. Much like the status quo, this meant reading from a Hyper database to a pandas DataFrame would likely be going through Python objects.

Furthermore, based on my own experience trying to fixup the interchange protocol implementation in pandas, my suggestion is to never use it for anything.

Instead, you may want to look at the PyCapsule Interface. Continuing on from Will's blog post:

After stumbling around the DataFrame Protocol Interface for a few weeks, Joris Van den Bossche [another pandas core dev] asked me why I didn’t look at the Arrow C Data Interface. [...]. Almost immediately my issues went away. I felt more confident in the implementation and had to deal with less memory corruption / crashes than before. And, perhaps most importantly, I saved a lot of time.

@kylebarron is one of the leading advocates for PyCapsule Interface (apache/arrow#39195) and expert in geospatial data science, and so might be good loop in here. Reckon XArray is a good candidate to export a PyCapsule object which dataframe libraries could consume?

kylebarron · 2025-03-27T17:08:41Z

I agree that I would dissuade you from trying to implement the dataframe interchange protocol and would encourage adoption of the Arrow PyCapsule Interface.

What would the method do internally which would be faster than going through pandas? No objection per se, but there would need to be some benefit from adding the method...

This is also not clear to me. I don't know xarray internals that well; I thought xarray uses pandas as a required dependency, and so I figure that most xarray data is stored in a pandas DataFrame or Series? Then I figure the fastest (and simplest to implement) way to convert xarray to polars would be to reuse pandas' implementation of the PyCapsule Interface.

Pandas has implemented PyCapsule Interface export for a little while. pandas-dev/pandas#56587, pandas-dev/pandas#59518

max-sixty · 2025-03-27T17:30:55Z

I thought xarray uses pandas as a required dependency, and so I figure that most xarray data is stored in a pandas DataFrame or Series?

xarray currently has a required pandas dependency for its indexing. the standard backend is a numpy array

kylebarron · 2025-03-27T19:35:51Z

Seems like your options are either:

implement a specific numpy -> polars implementation for to_polars_df
implement a generic DataFrame Interchange Protocol backend on top of how you store numpy data.
implement a generic Arrow PyCapsule Interface integration. This would require some Arrow backend, such as pandas' default pyarrow backend. However pyarrow is a massive dependency, which has some downsides. I wrote https://github.com/kylebarron/arro3 as a much smaller Arrow implementation, which you might be able to use to convert numpy data to Arrow.

max-sixty · 2025-03-27T22:45:22Z

implement a generic Arrow PyCapsule Interface integration. This would require some Arrow backend, such as pandas' default pyarrow backend. However pyarrow is a massive dependency...

if someone wants to take this on, we could have pyarrow as an optional dependency, that could work. it's optional for polars & pandas

but pandas itself seem like a satisfactory interchange format! whether or not the initially encouraging results are driven by alignment vs. real determines whether there's a perf improvement

DeflateAwning · 2025-03-28T03:13:13Z

Pandas uses "nan" to represent nulls in string columns. It is a prime example of hacking things together. It is an awful interchange format.

Why not make Polars the interchange format.

kylebarron · 2025-03-28T17:07:25Z

Arrow makes more sense than Polars to be an interchange format. It's explicitly designed as such, and is already used under the hood in Polars.

DeflateAwning added the enhancement label Mar 16, 2025

TomNicholas added the topic-pandas-like label Mar 17, 2025

max-sixty added the plan to close May be closeable, needs more eyeballs label Mar 18, 2025

max-sixty closed this as not planned Won't fix, can't repro, duplicate, stale Mar 25, 2025

TomNicholas removed the plan to close May be closeable, needs more eyeballs label Mar 27, 2025

TomNicholas reopened this Mar 27, 2025

Uh oh!

Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas) #10135

Add a .to_polars_df() method (very similar to .to_dataframe(), which implicitly uses pandas) #10135

Comments

DeflateAwning commented Mar 16, 2025

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

welcome bot commented Mar 16, 2025

Uh oh!

max-sixty commented Mar 16, 2025

Uh oh!

DeflateAwning commented Mar 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

max-sixty commented Mar 17, 2025

Uh oh!

DocOtak commented Mar 17, 2025

Uh oh!

DeflateAwning commented Mar 17, 2025

Uh oh!

max-sixty commented Mar 18, 2025

Uh oh!

TomNicholas commented Mar 18, 2025

Uh oh!

max-sixty commented Mar 25, 2025

Uh oh!

DeflateAwning commented Mar 25, 2025

Uh oh!

max-sixty commented Mar 25, 2025

Uh oh!

TomNicholas commented Mar 25, 2025

Uh oh!

DeflateAwning commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

max-sixty commented Mar 25, 2025

Uh oh!

dcherian commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DocOtak commented Mar 26, 2025

Uh oh!

max-sixty commented Mar 26, 2025

Uh oh!

DeflateAwning commented Mar 26, 2025

Uh oh!

max-sixty commented Mar 26, 2025

Uh oh!

keewis commented Mar 26, 2025

Uh oh!

max-sixty commented Mar 27, 2025

Uh oh!

MarcoGorelli commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kylebarron commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

max-sixty commented Mar 27, 2025

Uh oh!

kylebarron commented Mar 27, 2025

Uh oh!

max-sixty commented Mar 27, 2025

Uh oh!

DeflateAwning commented Mar 28, 2025

Uh oh!

kylebarron commented Mar 28, 2025

Uh oh!

Add a `.to_polars_df()` method (very similar to `.to_dataframe()`, which implicitly uses pandas) #10135

Add a `.to_polars_df()` method (very similar to `.to_dataframe()`, which implicitly uses pandas) #10135

DeflateAwning commented Mar 17, 2025 •

edited

Loading

DeflateAwning commented Mar 25, 2025 •

edited

Loading

dcherian commented Mar 25, 2025 •

edited

Loading

MarcoGorelli commented Mar 27, 2025 •

edited

Loading

kylebarron commented Mar 27, 2025 •

edited

Loading