Rewrite the Polars datasets to not rely on fsspec unnecessarily #625
Comments
Interesting to learn why Rust chose object_store instead of a general filesystem interface. One thing that comes to mind is how to make Kedro versioning work for this, as we almost use ... To be fair, local + object store is the majority case. We do handle some niche filesystems like HDFS (common 10 years ago, almost non-existent now), SFTP, etc.
The good thing is that, if you drop fsspec by leveraging the underlying mechanism of the target library, writing a new custom dataset is trivial.
Today I showed this fsspec-free dataset to a user and they were happy to see how easy it is to write:

```python
import typing as t

import polars as pl
from kedro.io import AbstractDataset


class DeltaPolarsDataset(AbstractDataset[pl.DataFrame, pl.DataFrame]):
    """``DeltaDataset`` loads/saves data from/to a Delta Table using an underlying
    filesystem (e.g.: local, S3, GCS). It returns a Polars dataframe.
    """

    DEFAULT_LOAD_ARGS: dict[str, t.Any] = {}
    DEFAULT_SAVE_ARGS: dict[str, t.Any] = {}

    def __init__(
        self,
        filepath: str,
        load_args: dict[str, t.Any] | None = None,
        save_args: dict[str, t.Any] | None = None,
        credentials: dict[str, t.Any] | None = None,
        storage_options: dict[str, t.Any] | None = None,
        metadata: dict[str, t.Any] | None = None,
    ):
        self._filepath = filepath
        self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}
        self._save_args = {**self.DEFAULT_SAVE_ARGS, **(save_args or {})}
        self._credentials = credentials or {}
        self._storage_options = storage_options or {}
        self._storage_options.update(self._credentials)
        self._metadata = metadata or {}

    def _load(self) -> pl.DataFrame:
        return pl.read_delta(
            self._filepath, storage_options=self._storage_options, **self._load_args
        )

    def _save(self, data: pl.DataFrame) -> None:
        data.write_delta(
            self._filepath, storage_options=self._storage_options, **self._save_args
        )

    def _describe(self) -> dict[str, t.Any]:
        return dict(
            filepath=self._filepath,
            load_args=self._load_args,
            save_args=self._save_args,
            storage_options=self._storage_options,
            metadata=self._metadata,
        )
```
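For context on how such a dataset would be consumed: if this class lived in a project (the module path, bucket URI, and credentials name below are hypothetical, not part of kedro-datasets), a catalog entry might look like this sketch:

```yaml
# Hypothetical catalog entry; module path, URI, and credentials name are
# illustrative only.
my_delta_table:
  type: my_project.datasets.DeltaPolarsDataset
  filepath: s3://my-bucket/my-table  # hypothetical bucket
  credentials: my_s3_creds
  save_args:
    mode: overwrite
```

Note that `credentials` are merged into `storage_options` by the constructor, so they flow straight through to Polars rather than to fsspec.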
It was noted today in backlog grooming by @noklam that ... For the particular case of Delta format, let's ...
From backlog grooming:
I did some experiments to see where we are in using plain vanilla ...

So if we make any change, we could split up the logic based on the ... Any thoughts?
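One way such a split could be sketched; since the comment above is truncated, the assumption here is that the dispatch is on the filepath's protocol, and the helper names (`get_protocol`, `is_remote`) and the protocol set are illustrative rather than anything from kedro-datasets:

```python
from urllib.parse import urlparse


def get_protocol(filepath: str) -> str:
    """Return the storage protocol of a path ("s3", "gs", ...), or "file"."""
    scheme = urlparse(filepath).scheme
    # Empty schemes (relative/absolute POSIX paths) and one-letter schemes
    # (Windows drive letters like "C:\\data") are treated as local files.
    return scheme if len(scheme) > 1 else "file"


# Illustrative set of protocols that would bypass fsspec and be handed to
# the target library's native object_store support.
REMOTE_PROTOCOLS = {"s3", "gs", "gcs", "az", "abfs", "http", "https"}


def is_remote(filepath: str) -> bool:
    """Decide which code path (native remote vs. local) handles a filepath."""
    return get_protocol(filepath) in REMOTE_PROTOCOLS
```

A dataset's `_load`/`_save` could then branch on `is_remote(self._filepath)` and only fall back to fsspec for the niche filesystems mentioned earlier (HDFS, SFTP, etc.).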
At this point we should start by documenting what the "social contract" of Kedro datasets is. In other words, how far we go in filling gaps like the ones you describe. Somewhat related: kedro-org/kedro#1936
Description
Rewrite the Polars datasets (https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets/polars) to not rely on fsspec, because they don't need it.

Context
Polars can read from remote storage just fine thanks to https://docs.rs/object_store/latest/object_store/, but the kedro-datasets version asks me for fsspec dependencies anyway:
Related: #590
Your Environment
Include as many relevant details as possible about the environment in which you experienced the bug:

- Kedro version used (`pip show kedro` or `kedro -V`):
- Kedro plugin and plugin version used (`pip show kedro-airflow`):
- Python version used (`python -V`):