Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove PyArrow dependency for Polars support #3445

Closed
MarcoGorelli opened this issue Jun 26, 2024 · 15 comments · Fixed by #3452
Closed

Remove PyArrow dependency for Polars support #3445

MarcoGorelli opened this issue Jun 26, 2024 · 15 comments · Fixed by #3452

Comments

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Jun 26, 2024

What is your suggestion?

Currently, PyArrow is required by Altair for Polars support. I think it shouldn't be too hard to remove it, given that Polars implements the dataframe interchange protocol natively (without depending on PyArrow)

If #3384 can make it in, then Altair would actually support plotting Polars dataframe natively without any extra heavy dependencies. That'd be...pretty amazing? I'd suggest using Altair for polars.DataFrame.plot if that was the case

I think what would need doing is:

  • don't require pyarrow to be installed for the dfi = data.__dataframe__ part
  • instead of using sanitize_arrow_table, for Polars, just select date/datetime columns and call .dt.to_string()
  • instead of using to_pylist from PyArrow, just use DataFrame.rows(named=True) for Polars
  • for categoricals, find a non-pyarrow workaround for Polars in infer_vegalite_type_for_dfi_column. I haven't tried this yet, but it looks straightforward-ish

Would you open to considering this? Happy to work on a PR if so

Have you considered any alternative solutions?

Just keep the status-quo :) But, I think Altair is the only plotting library that gets close to native Polars support without extra large dependencies, and it doesn't look like a large stretch to go all the way there, so I'm hoping we can do it 💪


Demo from having tried this locally:

image

@dangotbanned
Copy link
Member

dangotbanned commented Jun 26, 2024

Not a maintainer here, but #3426, #3384 and this all landing in the next release sounds great to me - if all goes well.

Found some additional pyarrow references in data.py, but most lead here:

altair/altair/utils/data.py

Lines 406 to 408 in 76a9ce1

def arrow_table_from_dfi_dataframe(dfi_df: DataFrameLike) -> "pyarrow.lib.Table":
"""Convert a DataFrame Interchange Protocol compatible object to an Arrow Table"""
import pyarrow as pa

Also, if this ends up being a PR and only changes a few files - it may be worth checking https://github.com/vega/altair/pull/3431/files since there are a lot of minor changes that may have simplified/standardized what is on main.

@mattijn
Copy link
Contributor

mattijn commented Jun 26, 2024

Thanks for opening this issue! I'm surely in favor of improving native polars support without relying on a pyarrow dependency.
As you have noticed, we still need to distill existing pandas serialization logic to become more dataframe-agnostic.

If I understand right, the suggested change here is to introduce a direct link to polars for a polars DataFrame if that object is compatible with the dataframe interchange protocol?

Eg, some changes here:

altair/altair/utils/data.py

Lines 406 to 408 in 76a9ce1

def arrow_table_from_dfi_dataframe(dfi_df: DataFrameLike) -> "pyarrow.lib.Table":
"""Convert a DataFrame Interchange Protocol compatible object to an Arrow Table"""
import pyarrow as pa

And here:

altair/altair/utils/data.py

Lines 406 to 416 in 76a9ce1

def arrow_table_from_dfi_dataframe(dfi_df: DataFrameLike) -> "pyarrow.lib.Table":
"""Convert a DataFrame Interchange Protocol compatible object to an Arrow Table"""
import pyarrow as pa
# First check if the dataframe object has a method to convert to arrow.
# Give this preference over the pyarrow from_dataframe function since the object
# has more control over the conversion, and may have broader compatibility.
# This is the case for Polars, which supports Date32 columns in direct conversion
# while pyarrow does not yet support this type in from_dataframe
for convert_method_name in ("arrow", "to_arrow", "to_arrow_table"):
convert_method = getattr(dfi_df, convert_method_name, None)

The introduction of the dataframe interchange protocol is not only for polars, but others are using it too, eg ibis ( #3110), or custom objects as in vega/vegafusion#386.

I'm not really in favor of introducing an optional polars dependency, but in line with #3377, we are already working towards a direction where we can use direct arrow conversion methods without using the pyarrow from_dataframe interface. If we can improve this direction even more so that pyarrow is not a hard dependency anymore within the part that serialize objects that are compatible with the dataframe interchange protocol that would be great.

Simultaneously, I also noticed that vegafusion has recently introduced an alternative approach for serializing objects that are compatible with polars, see eg this part:

https://github.com/vega/vegafusion/blob/007bd44188676de7259bc02a61693b3dc7586072/python/vegafusion/vegafusion/runtime.py#L218-L225

Anyways! PRs in this direction are welcome!

@MarcoGorelli
Copy link
Contributor Author

Thanks for your response! 🙏

I think there are some methods which aren't covered by the interchange protocol unfortunately, such as converting a Date / Datetime column to string. So, with regards to

I'm not really in favor of introducing a optional polars dependency

would you be OK with having a couple of specialised Polars branches, where you can do something like

if (pl := sys.modules.get('polars') is not None) and isinstance(df, pl.DataFrame):
    import polars.selectors as cs
    df = df.with_columns(cs.date().dt.to_string('%F'))
    return df

?
I think the diff should be quite minimal

I think there's only 2 places where that would be necessary. Other parts where you use the dataframe interchange protocol are fine, it's just the part where you convert to a pyarrow table which could be skipped for the Polars case

Anyways! PRs in this direction are welcome!

Thanks! I'll put something together to drive this forwards, just wanted to check that you'd be OK with a couple (just a couple!) of specialised Polars paths

@mattijn
Copy link
Contributor

mattijn commented Jun 26, 2024

First, we try to solve things pragmatic if the dataframe interchange protocol is incomplete. There is not always a royal road. But for my own understanding, isn't this what we are doing already in #3377, where we use a direct to_arrow method on a dataframe that supports the dataframe interchange protocol?

@MarcoGorelli
Copy link
Contributor Author

Yeah, and to_arrow requires PyArrow as a dependency :) In the Polars case, Altair is converting Polars to PyArrow, to do things which are already possible (and fairly easy) to do in Polars directly. And that feels like a missed opportunity to keep Altair really lightweight for Polars users who otherwise wouldn't need to bloat their environment up with PyArrow (which is a huge dependency, and not generally required for working with Polars)

@mattijn
Copy link
Contributor

mattijn commented Jun 26, 2024

Ah.. therefor raising this issue:)

So the sanitize_dataframe function:

altair/altair/utils/core.py

Lines 309 to 310 in 62ab14d

def sanitize_dataframe(df: "pd.DataFrame") -> "pd.DataFrame": # noqa: C901
"""Sanitize a DataFrame to prepare it for serialization.

should be renamed to sanitize_pandas_dataframe

And next to sanitize_arrow_table

altair/altair/utils/core.py

Lines 434 to 437 in 62ab14d

def sanitize_arrow_table(pa_table):
"""Sanitize arrow table for JSON serialization"""
import pyarrow as pa
import pyarrow.compute as pc

Will come a new sanitize_polars_dataframe function, with a couple of specialised Polars paths?

I agree that pyarrow is a rather large dependency and we hope to stay lightweight, especially for wasm environments.

@MarcoGorelli
Copy link
Contributor Author

Yup, thanks for appreciating the value in this! Alright, PR incoming this week

@jonmmease
Copy link
Contributor

Hi @MarcoGorelli, just catching up on this thread and on your and @dangotbanned's comments in #3384.

I'm all in favor of streamlining polars support in Altair, but I'm less enthused about having polars specific logic scattered around the code base. On the surface, it seems like narwhals is exactly the kind of abstraction layer we would like in Altair going forward. Could we jump straight to supporting polars without pyarrow using narwhals? Then ideally over time we could move the other DataFrame types that narwhals supports behind this code path as well.

cc @mattijn, I would personally prefer this approach to making pandas optional, but interested to hear your thoughts.

@MarcoGorelli
Copy link
Contributor Author

Thanks for your response!

On the surface, it seems like narwhals is exactly the kind of abstraction layer we would like in Altair going forward. Could we jump straight to supporting polars without pyarrow using narwhals?

Well, I'm not going to make you ask me twice 😄 Thanks, I'll give this a go and open a PR for your consideration! I don't think it'll be too big of an effort, if you then decide you don't want it, I promise no offense will be taken 😇

@binste
Copy link
Contributor

binste commented Jun 27, 2024

Just browsed through the narwhals docs, sounds very interesting! I'm also onboard to allow Polars users to use Altair without the need to have PyArrow or Pandas alongside it.

@MarcoGorelli Thanks for offering to open a PR! Just for you to know, #3431 will introduce a lot of changes to the code base. It's almost ready to be merged so you might want to wait a few more days.

@binste
Copy link
Contributor

binste commented Jun 27, 2024

@MarcoGorelli Altair supports Pandas >= 0.25. I think it would be ok to bump this to 1.1.5 in your PR to be in line with Narwhal.

Btw, really enjoyed your talk at PyData Berlin in April :)

@binste
Copy link
Contributor

binste commented Jun 27, 2024

@MarcoGorelli Thanks for offering to open a PR! Just for you to know, #3431 will introduce a lot of changes to the code base. It's almost ready to be merged so you might want to wait a few more days.

Just fyi, #3431 is merged.

@mattijn
Copy link
Contributor

mattijn commented Jun 27, 2024

cc @mattijn, I would personally prefer this approach to making pandas optional, but interested to hear your thoughts.

We like it when Altair is lightweight and dataframe agnostic.

We have experimented with adopting the dataframe interchange protocol through relying on pyarrow.interchange.from_dataframe to reach that goal.

Eventhough the interface protocol brought us a lot in relation to becoming dataframe agnostic, but not all (ref #3377) and unfortunately pyarrow is not a library that makes Altair lightweight.

If there are better options to become dataframe agnostic and lightweight by not having hard dependencies on heavy dataframe packages, than it seems like something to consider seriously.

If we found out that column type inference and data serialization for objects that are like dataframes can be delegated safely to narwhals, than it at least is something we should consider as a potential candidate.

I say, 'like dataframes', because I don't think narwhals will supports arrays is it? (something we eventually hope to cover one day). Surely we aim not to introduce regressions during this consideration.

@MarcoGorelli
Copy link
Contributor Author

Could we jump straight to supporting polars without pyarrow using narwhals?

Here we go 🚀 #3452

I totally agree that it's an imperative that this would come with zero regressions. I'm not aware of any:

  • pandas supported is unaffected
  • pyarrow.Table is still supported
  • Polars and Modin are supported natively, without needing either of the above
  • Ibis (and other interchange protocol libraries) are still supported by going via PyArrow (so, there's no regression here)

I don't think narwhals will supports arrays is it?

That's right, arrays are a different beast - I'd suggest taking a look at https://data-apis.org/array-api/latest/ for that. Array libraries were sufficiently aligned to begin with that a standard like that was possible and successful, whereas for dataframes the related effort was discontinued (and that's how Narwhals was born - I'm not calling it a "standard", but I am trying to use my position as pandas + Polars dev to enable libraries to support the latter at no cost to the former)


I think a consideration that will come up will be "can we trust Narwhals? Will it stay maintained? Will it make breaking changes". If so, these would be totally valid :)

  • Regarding maintainability, there are other major libraries considering using Narwhals (scikit-learn, shiny), so I think there's enough interested parties such that, if we set up a good governance model, it could thrive even if I disappeared. I've already given write and admin access to some people I trust and who have made really solid contributions
  • Regarding breaking changes: we're aiming for a "perfect backwards compatibility" policy in Narwhals, inspired by Rust Editions, see this RFC

@binste
Copy link
Contributor

binste commented Jul 2, 2024

I think a consideration that will come up will be "can we trust Narwhals? Will it stay maintained? Will it make breaking changes". If so, these would be totally valid :)

  • Regarding maintainability, there are other major libraries considering using Narwhals (scikit-learn, shiny), so I think there's enough interested parties such that, if we set up a good governance model, it could thrive even if I disappeared. I've already given write and admin access to some people I trust and who have made really solid contributions
  • Regarding breaking changes: we're aiming for a "perfect backwards compatibility" policy in Narwhals, inspired by Rust Editions, see this RFC

Thanks for addressing this proactively. If other major libraries use it and if it has such a compatibility policy, that does definitely provide some comfort. In addition, Altair does not have that much data transformation logic built in. So even if for whatever reason we would need to remove the dependency on Narwhals again, I think that would be doable in a weekend :) To me the project looks well structured, documented, and the best shot we have. The alternative would be to implement different code paths ourselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants