
feat: add matlab dataset into kedro-datasets #435

Closed
Changes from all commits (105 commits)
f73fcfd
feat(datasets): Migrated `PartitionedDataSet` and `IncrementalDataSet…
PtrBld Oct 11, 2023
6c1b73f
fix: backwards compatibility for `kedro-airflow` (#381)
sbrugman Oct 12, 2023
3283618
fix(datasets): Don't warn for SparkDataset on Databricks when using s…
alamastor Oct 12, 2023
8f782e7
chore: Hot fix for RTD due to bad pip version (#396)
noklam Oct 17, 2023
48fde27
chore: Pin pip version temporarily (#398)
ankatiyar Oct 18, 2023
34011aa
perf(datasets): don't create connection until need (#281)
deepyaman Oct 19, 2023
bb8e6ce
chore: Drop Python 3.7 support for kedro-plugins (#392)
lrcouto Oct 19, 2023
ddef350
feat(datasets): support Polars lazy evaluation (#350)
MatthiasRoels Oct 20, 2023
aa46530
build(datasets): Release `1.8.0` (#406)
merelcht Oct 24, 2023
74cc4de
build(airflow): Release 0.7.0 (#407)
ankatiyar Oct 24, 2023
9c06069
build(telemetry): Release 0.3.0 (#408)
ankatiyar Oct 24, 2023
8af5870
build(docker): Release 0.4.0 (#409)
ankatiyar Oct 24, 2023
8d7838a
style(airflow): blacken README.md of Kedro-Airflow (#418)
deepyaman Oct 25, 2023
0a7bdc2
fix(datasets): Fix missing jQuery (#414)
astrojuanlu Oct 25, 2023
a8e9319
fix(datasets): Fix Lazy Polars dataset to use the new-style base clas…
astrojuanlu Oct 25, 2023
9f7b24c
chore(datasets): lazily load `partitions` classes (#411)
deepyaman Oct 25, 2023
1948c10
docs(datasets): fix code blocks and `data_set` use (#417)
deepyaman Oct 25, 2023
c10e043
fix: TF model load failure when model is saved as a TensorFlow Saved …
Edouard59 Oct 26, 2023
4c09ada
chore: Drop support for Python 3.7 on kedro-datasets (#419)
lrcouto Oct 27, 2023
a7b2967
test(datasets): run doctests to check examples run (#416)
deepyaman Oct 27, 2023
8b538b7
feat(datasets): Add support for `databricks-connect>=13.0` (#352)
MigQ2 Nov 1, 2023
9f10b42
fix(telemetry): remove double execution by moving to after catalog cr…
fdroessler Nov 1, 2023
02d67b7
docs: Add python version support policy to plugin `README.md`s (#425)
merelcht Nov 2, 2023
3782147
matlab_dataset init
samuel-lee-sj Nov 6, 2023
7589c05
Pytest returns: No data to report
samuel-lee-sj Nov 14, 2023
7ba05a8
Completed pytest on the matlab dataset
samuel-lee-sj Nov 17, 2023
00b91cb
Point before pull request.
samuel-lee-sj Nov 17, 2023
0cb9107
docs(airflow): Use new docs link (#393)
astrojuanlu Nov 10, 2023
8209071
style: Add shared CSS and meganav to datasets docs (#400)
stichbury Nov 10, 2023
6685a07
feat(datasets): Add Hugging Face datasets (#344)
astrojuanlu Nov 13, 2023
ffa733c
Reviewed and resolved errors in code 201123
samuel-lee-sj Nov 20, 2023
fc7e7a0
amended docstring for example
samuel-lee-sj Nov 21, 2023
886eef4
added function _invalidate_cache
samuel-lee-sj Nov 21, 2023
411a074
Added credentials, file save args to init
samuel-lee-sj Nov 22, 2023
4fddd0a
test(datasets): fix `dask.ParquetDataset` doctests (#439)
deepyaman Nov 22, 2023
a6a9e43
refactor: Remove `DataSet` aliases and mentions (#440)
merelcht Nov 24, 2023
df4a782
chore(datasets): replace "Pyspark" with "PySpark" (#423)
deepyaman Nov 25, 2023
5ddc210
test(datasets): make `api.APIDataset` doctests run (#448)
deepyaman Nov 27, 2023
4430dca
chore(datasets): Fix `pandas.GenericDataset` doctest (#445)
merelcht Nov 27, 2023
c34fa85
feat(datasets): make datasets arguments keywords only (#358)
felixscherz Nov 27, 2023
7854c85
chore: Drop support for python 3.8 on kedro-datasets (#442)
DimedS Nov 27, 2023
451b2d1
test(datasets): add outputs to matplotlib doctests (#449)
deepyaman Nov 28, 2023
7846c07
chore(datasets): Fix more doctest issues (#451)
merelcht Nov 28, 2023
ff58c81
test(datasets): fix failing doctests in Windows CI (#457)
deepyaman Nov 30, 2023
0381206
chore(datasets): fix accidental reference to NumPy (#450)
deepyaman Nov 30, 2023
d6dcc93
chore(datasets): don't pollute dev env in doctests (#452)
deepyaman Nov 30, 2023
80a1962
feat: Add tools to heap event (#430)
lrcouto Nov 30, 2023
95b6780
ci(datasets): install deps in single `pip install` (#454)
deepyaman Dec 5, 2023
69c7a14
build(datasets): Bump s3fs (#463)
merelcht Dec 7, 2023
06ee879
test(datasets): make SQL dataset examples runnable (#455)
deepyaman Dec 7, 2023
093fc8c
fix(datasets): correct pandas-gbq as py311 dependency (#460)
kuruonur1 Dec 7, 2023
dcbadcc
docs(datasets): Document `IncrementalDataset` (#468)
astrojuanlu Dec 8, 2023
50dffce
chore: Update datasets to be arguments keyword only (#466)
merelcht Dec 8, 2023
85c38ed
chore: Clean up code for old dataset syntax compatibility (#465)
merelcht Dec 8, 2023
c73671d
chore: Update scikit-learn version (#469)
noklam Dec 8, 2023
1ab6611
feat(datasets): support versioning data partitions (#447)
deepyaman Dec 11, 2023
9825160
docs(datasets): Improve documentation index (#428)
astrojuanlu Dec 11, 2023
4bc2e66
docs(datasets): update wrong docstring about `con` (#461)
deepyaman Dec 11, 2023
350085b
build(datasets): Release `2.0.0` (#472)
merelcht Dec 11, 2023
6be91f6
ci(telemetry): Pin `PyYAML` (#474)
ankatiyar Dec 12, 2023
719009b
build(telemetry): Release 0.3.1 (#475)
SajidAlamQB Dec 12, 2023
1377cfe
docs(datasets): Fix broken links in README (#477)
astrojuanlu Dec 12, 2023
827c5f2
chore(datasets): replace more "data_set" instances (#476)
deepyaman Dec 12, 2023
217fe28
chore(datasets): Fix doctests (#488)
merelcht Dec 19, 2023
c02c32d
chore(datasets): Fix delta + incremental dataset docstrings (#489)
merelcht Dec 20, 2023
37c11ea
chore(airflow): Post 0.19 cleanup (#478)
ankatiyar Dec 20, 2023
d3a7995
build(airflow): Release 0.8.0 (#491)
ankatiyar Dec 20, 2023
74d83d2
fix: telemetry metadata (#495)
DimedS Dec 21, 2023
d3a7a22
fix: Update tests on kedro-docker for 0.5.0 release. (#496)
lrcouto Dec 21, 2023
1aa3968
rebase matlab_dataset to incoming
samuel-lee-sj Jan 12, 2024
621ab35
Rebase 9ebe88c
samuel-lee-sj Jan 12, 2024
5a67aa3
Point before pull request.
samuel-lee-sj Nov 17, 2023
f6a01ac
Reviewed and resolved errors in code 201123
samuel-lee-sj Nov 20, 2023
77f4286
Added credentials, file save args to init
samuel-lee-sj Nov 22, 2023
ac314dc
build: Release kedro-docker 0.5.0 (#497)
lrcouto Dec 21, 2023
24c51a7
chore(datasets): Update partitioned dataset docstring (#502)
merelcht Jan 3, 2024
f47e7c9
fix(datasets): Relax pandas.HDFDataSet dependencies which are broken …
Galileo-Galilei Jan 4, 2024
43fb7cb
fix: airflow metadata (#498)
AhdraMeraliQB Jan 4, 2024
2dd8723
amended docstring for example
samuel-lee-sj Nov 21, 2023
3900703
added function _invalidate_cache
samuel-lee-sj Nov 21, 2023
c4cb619
rebase in progress onto d8f1fd5
samuel-lee-sj Jan 12, 2024
bc76e83
docs: Add python version support policy to plugin `README.md`s (#425)
merelcht Nov 2, 2023
dba9dc8
matlab_dataset init
samuel-lee-sj Nov 6, 2023
5a031b7
Pytest returns: No data to report
samuel-lee-sj Nov 14, 2023
a204f6b
Completed pytest on the matlab dataset
samuel-lee-sj Nov 17, 2023
77ef1ee
Point before pull request.
samuel-lee-sj Nov 17, 2023
69029a0
style: Add shared CSS and meganav to datasets docs (#400)
stichbury Nov 10, 2023
20dc8ae
feat(datasets): Add Hugging Face datasets (#344)
astrojuanlu Nov 13, 2023
afe6e7e
Reviewed and resolved errors in code 201123
samuel-lee-sj Nov 20, 2023
9da01f7
amended docstring for example
samuel-lee-sj Nov 21, 2023
ac175c9
added function _invalidate_cache
samuel-lee-sj Nov 21, 2023
76f5130
Added credentials, file save args to init
samuel-lee-sj Nov 22, 2023
df827ee
test(datasets): fix `dask.ParquetDataset` doctests (#439)
deepyaman Nov 22, 2023
94f9ec8
refactor: Remove `DataSet` aliases and mentions (#440)
merelcht Nov 24, 2023
1e4a968
test(datasets): make `api.APIDataset` doctests run (#448)
deepyaman Nov 27, 2023
757ea12
test(datasets): add outputs to matplotlib doctests (#449)
deepyaman Nov 28, 2023
ac388fa
chore(datasets): fix accidental reference to NumPy (#450)
deepyaman Nov 30, 2023
60cec53
feat: Add tools to heap event (#430)
lrcouto Nov 30, 2023
3322c01
fix(datasets): correct pandas-gbq as py311 dependency (#460)
kuruonur1 Dec 7, 2023
6c0c99c
chore: Clean up code for old dataset syntax compatibility (#465)
merelcht Dec 8, 2023
6080b5c
chore: Update scikit-learn version (#469)
noklam Dec 8, 2023
c5730ab
build(telemetry): Release 0.3.1 (#475)
SajidAlamQB Dec 12, 2023
f93ee7b
docs(datasets): Fix broken links in README (#477)
astrojuanlu Dec 12, 2023
7e72302
chore(datasets): replace more "data_set" instances (#476)
deepyaman Dec 12, 2023
be82889
Merge branch 'main' into sslsj
samuel-lee-sj Jan 12, 2024
1 change: 0 additions & 1 deletion kedro-airflow/features/steps/cli_steps.py
@@ -93,7 +93,6 @@ def create_project_from_config_file(context):
"-c",
str(context.config_file),
"--starter",
"astro-airflow-iris",
],
env=context.env,
cwd=str(context.temp_dir),
2 changes: 1 addition & 1 deletion kedro-airflow/pyproject.toml
@@ -24,12 +24,12 @@ Tracker = "https://github.com/kedro-org/kedro-plugins/issues"

[project.optional-dependencies]
test = [
"apache-airflow<3.0",
"bandit",
"behave",
"black~=22.0",
"connexion<3.0.0", # TODO: Temporary fix, connexion has changed their API, but airflow hasn't caught up yet
"kedro-datasets",
"pendulum<3.0.0", # TODO: Also to be removed
"pre-commit>=2.9.2",
"pytest",
"pytest-cov",
3 changes: 1 addition & 2 deletions kedro-datasets/docs/source/conf.py
@@ -25,7 +25,7 @@

# -- Project information -----------------------------------------------------

project = "kedro-datasets"
project = "kedro"
author = "kedro"

# The short X.Y version.
@@ -99,7 +99,6 @@
"py:class": (
"kedro.io.core.AbstractDataset",
"kedro.io.AbstractDataset",
"AbstractDataset",
"kedro.io.core.Version",
"requests.auth.AuthBase",
"google.oauth2.credentials.Credentials",
@@ -31,8 +31,6 @@ class HFDataset(AbstractVersionedDataset):
>>> assert len(yelp_review_full["train"]) == 650000

"""

def __init__(self, *, dataset_name: str):
self.dataset_name = dataset_name

def _load(self):
@@ -37,7 +37,6 @@ class HFTransformerPipelineDataset(AbstractDataset):

def __init__(
self,
*,
task: str | None = None,
model_name: str | None = None,
pipeline_kwargs: dict[t.Any] | None = None,
13 changes: 13 additions & 0 deletions kedro-datasets/kedro_datasets/matlab/__init__.py
@@ -0,0 +1,13 @@
"""``AbstractDataset`` implementation to load/save data from/to a Matlab file."""
from __future__ import annotations

from typing import Any

import lazy_loader as lazy

MatlabDataSet: type[MatlabDataset]
MatlabDataset: Any

__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"matlab_dataset": ["MatlabDataSet", "MatlabDataset"]}
)
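The `lazy.attach` call above defers importing `matlab_dataset` (and with it `scipy`) until `MatlabDataset` is first accessed on the package. A minimal, stdlib-only sketch of the PEP 562 mechanism that `lazy_loader.attach` builds on — the `attach` helper below is a hypothetical stand-in for illustration, not the real `lazy_loader` API:

```python
import importlib


def attach(package_name, submod_attrs):
    """Return (__getattr__, __dir__, __all__) that import submodules on demand."""
    # Map each exported attribute to the submodule that defines it.
    attr_to_module = {
        attr: mod for mod, attrs in submod_attrs.items() for attr in attrs
    }

    def __getattr__(name):
        if name in attr_to_module:
            # The import happens only now, on first attribute access.
            submodule = importlib.import_module(
                f"{package_name}.{attr_to_module[name]}"
            )
            return getattr(submodule, name)
        raise AttributeError(f"module {package_name!r} has no attribute {name!r}")

    def __dir__():
        return sorted(attr_to_module)

    return __getattr__, __dir__, sorted(attr_to_module)
```

Assigning the returned trio to the module-level names `__getattr__`, `__dir__`, and `__all__` (as the diff does) means `from kedro_datasets.matlab import MatlabDataset` pays the `scipy` import cost only when actually used.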
162 changes: 162 additions & 0 deletions kedro-datasets/kedro_datasets/matlab/matlab_dataset.py
@@ -0,0 +1,162 @@
"""``MatlabDataset`` loads/saves data from/to a Matlab file using an underlying
filesystem (e.g. local, S3, GCS). The underlying functionality is provided by
``scipy.io``, so all options supported by ``scipy.io.loadmat`` and
``scipy.io.savemat`` are available for loading and saving Matlab files.
"""
import warnings
from copy import deepcopy
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import numpy as np
from kedro.io.core import Version, get_filepath_str, get_protocol_and_path
from scipy import io

from kedro_datasets import KedroDeprecationWarning
from kedro_datasets._io import AbstractVersionedDataset, DatasetError


class MatlabDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
    """``MatlabDataset`` loads and saves data from/to a MATLAB file using ``scipy.io``.

Example usage for the
`YAML API <https://kedro.readthedocs.io/en/stable/data/\
data_catalog_yaml_examples.html>`_:

    .. code-block:: yaml

        cars:
          type: matlab.MatlabDataset
          filepath: gcs://your_bucket/cars.mat
          fs_args:
            project: my-project
          credentials: my_gcp_credentials

Example usage for the
`Python API <https://kedro.readthedocs.io/en/stable/data/\
advanced_data_catalog_usage.html>`_:

    .. code-block:: pycon

        >>> from kedro_datasets.matlab import MatlabDataset
        >>> data = {"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]}
        >>> dataset = MatlabDataset(filepath="my_data.mat")
        >>> dataset.save(data)
        >>> reloaded = dataset.load()
        >>> assert "data" in reloaded  # variables are stored under the "data" key
    """
    DEFAULT_SAVE_ARGS: Dict[str, Any] = {}

    def __init__(  # noqa: PLR0913
        self,
        filepath: str,
        save_args: Dict[str, Any] = None,
        version: Version = None,
        credentials: Dict[str, Any] = None,
        fs_args: Dict[str, Any] = None,
        metadata: Dict[str, Any] = None,
    ) -> None:
        """Creates a new instance of ``MatlabDataset`` to load and save data
        from/to a MATLAB file.

Args:
filepath: Filepath in POSIX format to a Matlab file prefixed with a protocol like `s3://`.
If prefix is not provided, `file` protocol (local filesystem) will be used.
The prefix should be any protocol supported by ``fsspec``.
Note: `http(s)` doesn't support versioning.
save_args: .mat options for saving .mat files.
version: If specified, should be an instance of
``kedro.io.core.Version``. If its ``load`` attribute is
None, the latest version will be loaded. If its ``save``
attribute is None, save version will be autogenerated.
credentials: Credentials required to get access to the underlying filesystem.
E.g. for ``GCSFileSystem`` it should look like `{"token": None}`.
fs_args: Extra arguments to pass into underlying filesystem class constructor
(e.g. `{"project": "my-project"}` for ``GCSFileSystem``), as well as
to pass to the filesystem's `open` method through nested keys
`open_args_load` and `open_args_save`.
Here you can find all available arguments for `open`:
https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open
                All defaults are preserved, except `mode`, which is set to `rb` when loading
                and to `wb` when saving.
metadata: Any arbitrary metadata.
This is ignored by Kedro, but may be consumed by users or external plugins.
"""
_fs_args = deepcopy(fs_args) or {}
_fs_open_args_load = _fs_args.pop("open_args_load", {})
_fs_open_args_save = _fs_args.pop("open_args_save", {})
_credentials = deepcopy(credentials) or {}

protocol, path = get_protocol_and_path(filepath, version)
self._protocol = protocol
if protocol == "file":
_fs_args.setdefault("auto_mkdir", True)
self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)
self.metadata = metadata

super().__init__(
filepath=PurePosixPath(path),
version=version,
exists_function=self._fs.exists,
glob_function=self._fs.glob,
)
        # Handle default save arguments
self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
if save_args is not None:
self._save_args.update(save_args)

        _fs_open_args_save.setdefault("mode", "wb")
self._fs_open_args_load = _fs_open_args_load
self._fs_open_args_save = _fs_open_args_save

def _describe(self) -> Dict[str, Any]:
return {
"filepath": self._filepath,
"protocol": self._protocol,
"save_args": self._save_args,
"version": self._version,
}

    def _load(self) -> Dict[str, Any]:
        """Load the .mat file and return its variables as a dictionary;
        access a specific variable as ``data["variable_name"]``."""
        load_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(load_path, mode="rb") as f:
            data = io.loadmat(f)
        return data

    def _save(self, data: np.ndarray) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(save_path, mode="wb") as f:
            # The array is stored under the "data" key in the .mat file.
            io.savemat(f, {"data": data}, **self._save_args)
self._invalidate_cache()

    def _exists(self) -> bool:
try:
load_path = get_filepath_str(self._get_load_path(), self._protocol)
except DatasetError:
return False

return self._fs.exists(load_path)

def _release(self) -> None:
super()._release()
self._invalidate_cache()

def _invalidate_cache(self) -> None:
"""Invalidate underlying filesystem caches."""
filepath = get_filepath_str(self._filepath, self._protocol)
self._fs._invalidate_cache(filepath)

_DEPRECATED_CLASSES = {"MatlabDataSet": MatlabDataset}

def __getattr__(name):
if name in _DEPRECATED_CLASSES:
alias = _DEPRECATED_CLASSES[name]
warnings.warn(
f"{repr(name)} has been renamed to {repr(alias.__name__)}, "
f"and the alias will be removed in Kedro-Datasets 2.0.0",
KedroDeprecationWarning,
stacklevel=2,
)
return alias
raise AttributeError(f"module {repr(__name__)} has no attribute {repr(name)}")
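The module-level `__getattr__` at the end of the file is the standard Kedro-Datasets deprecation shim: the old `MatlabDataSet` spelling still resolves, but emits a warning steering users to the new name. A self-contained sketch of the same pattern, using a dummy class and the stdlib `warnings` module (names here are illustrative):

```python
import warnings


class MatlabDataset:
    """Stand-in for the real dataset class."""


_DEPRECATED_CLASSES = {"MatlabDataSet": MatlabDataset}


def module_getattr(name):
    """Plays the role of the module-level __getattr__ hook (PEP 562)."""
    if name in _DEPRECATED_CLASSES:
        alias = _DEPRECATED_CLASSES[name]
        # Warn once per access, pointing callers at the new spelling.
        warnings.warn(
            f"{name!r} has been renamed to {alias.__name__!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return alias
    raise AttributeError(f"module has no attribute {name!r}")
```

Because the hook returns the renamed class itself, old code keeps working unchanged while the warning surfaces in test suites and logs.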
6 changes: 6 additions & 0 deletions kedro-datasets/setup.py
@@ -83,6 +83,12 @@ def _collect_requirements(requires):
"pyarrow>=4.0",
"deltalake >= 0.6.2",
],
"polars.LazyPolarsDataset": [
# Note: there is no Lazy read Excel option, so we exclude xlsx2csv here.
POLARS,
"pyarrow>=4.0",
"deltalake >= 0.6.2",
],
}
redis_require = {"redis.PickleDataset": ["redis~=4.1"]}
snowflake_require = {
Empty file.