[DataCatalog2.0]: Protocol abstraction for DataCatalog (#4160)
* Added a skeleton for AbstractDataCatalog and KedroDataCatalog

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed from_config method

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Implemented _init_datasets method

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Implemented get dataset

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Started resolve_patterns implementation

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Implemented resolve_patterns

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed credentials resolving

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated match pattern

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Implemented add from dict method

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated io __init__

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added list method

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Implemented _validate_missing_keys

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added datasets access logic

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added __contains__ and comments on lazy loading

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Renamed dataset_name to ds_name

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated some docstrings

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed _update_ds_configs

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed _init_datasets

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Implemented add_runtime_patterns

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed runtime patterns usage

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Moved pattern logic out of data catalog, implemented KedroDataCatalog

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* KedroDataCatalog updates

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added property to return config

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added list patterns method

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Renamed and moved ConfigResolver

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Renamed ConfigResolver

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Cleaned KedroDataCatalog

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Cleaned up DataCatalogConfigResolver

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Docs build fix attempt

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed KedroDataCatalog

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated from_config method

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated constructor and add methods

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated _get_dataset method

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated __contains__

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated __eq__ and shallow_copy

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added __iter__ and __getitem__

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed unused imports

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added TODO

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated runner.run()

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated session

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added config_resolver property

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated catalog list command

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated catalog create command

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated catalog rank command

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated catalog resolve command

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed some methods

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed ds configs from catalog

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed lint

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed typo

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added module docstring

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed None from Pattern type

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed docs failing to find class reference

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed docs failing to find class reference

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated Patterns type

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fix tests (#4149)

* Fix most tests

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Fix most tests

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

---------

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Returned constants to avoid breaking changes

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Minor fix

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated test_sorting_order_with_other_dataset_through_extra_pattern

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed odd properties

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated tests

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed None from _fetch_credentials input

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Renamed DataCatalogConfigResolver to CatalogConfigResolver

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Renamed _init_configs to _resolve_config_credentials

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Moved functions to the class

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Refactored resolve_dataset_pattern

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed refactored part

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Changed the order of arguments for DataCatalog constructor

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Replaced __getitem__ with .get()

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated catalog commands

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Moved warm up block outside of the try block

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed linter

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed odd copying

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated release notes

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Returned DatasetError

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added _dataset_patterns and _default_pattern to _config_resolver to avoid breaking change

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Made resolve_dataset_pattern return just dict

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed linter

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added CatalogProtocol draft

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Implemented CatalogProtocol

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated types

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed linter

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added _ImplementsCatalogProtocolValidator

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated docstrings

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed tests

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed docs

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Excluded Protocol from coverage

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed docs

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed reference to DataCatalog in docstrings

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed add_all from protocol

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated docstrings

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated docstrings

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed docstrings

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated RELEASE.md

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

---------

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Co-authored-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
ElenaKhaustova and ankatiyar authored Sep 17, 2024
1 parent b60b02d commit 6bf29f9
Showing 14 changed files with 178 additions and 71 deletions.
1 change: 1 addition & 0 deletions RELEASE.md
@@ -1,6 +1,7 @@
# Upcoming Release

## Major features and improvements
* Implemented `Protocol` abstraction for the current `DataCatalog` and adding new catalog implementations.
* Refactored `kedro run` and `kedro catalog` commands.
* Moved pattern resolution logic from `DataCatalog` to a separate component - `CatalogConfigResolver`. Updated `DataCatalog` to use `CatalogConfigResolver` internally.
* Made packaged Kedro projects return `session.run()` output to be used when running it in the interactive environment.
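The `Protocol` abstraction means the project-level `DATA_CATALOG_CLASS` setting no longer has to be a `DataCatalog` subclass; any class whose instances satisfy `CatalogProtocol` should be accepted. A minimal sketch of a project `settings.py`, assuming a hypothetical `MyCatalog` class that implements the protocol:

# settings.py -- sketch only; my_project.catalog.MyCatalog is an assumed
# example class, not something introduced by this commit.
from my_project.catalog import MyCatalog

# Instances of MyCatalog must provide the CatalogProtocol members
# (load, save, list, add, exists, release, confirm, shallow_copy, ...).
DATA_CATALOG_CLASS = MyCatalog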
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -130,6 +130,7 @@
"kedro.io.catalog_config_resolver.CatalogConfigResolver",
"kedro.io.core.AbstractDataset",
"kedro.io.core.AbstractVersionedDataset",
"kedro.io.core.CatalogProtocol",
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
@@ -170,6 +171,7 @@
"None. Update D from mapping/iterable E and F.",
"Patterns",
"CatalogConfigResolver",
"CatalogProtocol",
),
"py:data": (
"typing.Any",
20 changes: 10 additions & 10 deletions kedro/framework/context/context.py
@@ -14,7 +14,7 @@

from kedro.config import AbstractConfigLoader, MissingConfigException
from kedro.framework.project import settings
from kedro.io import DataCatalog # noqa: TCH001
from kedro.io import CatalogProtocol, DataCatalog # noqa: TCH001
from kedro.pipeline.transcoding import _transcode_split

if TYPE_CHECKING:
@@ -123,7 +123,7 @@ def _convert_paths_to_absolute_posix(
return conf_dictionary


def _validate_transcoded_datasets(catalog: DataCatalog) -> None:
def _validate_transcoded_datasets(catalog: CatalogProtocol) -> None:
"""Validates transcoded datasets are correctly named
Args:
@@ -178,13 +178,13 @@ class KedroContext:
)

@property
def catalog(self) -> DataCatalog:
"""Read-only property referring to Kedro's ``DataCatalog`` for this context.
def catalog(self) -> CatalogProtocol:
"""Read-only property referring to Kedro's catalog` for this context.
Returns:
DataCatalog defined in `catalog.yml`.
catalog defined in `catalog.yml`.
Raises:
KedroContextError: Incorrect ``DataCatalog`` registered for the project.
KedroContextError: Incorrect catalog registered for the project.
"""
return self._get_catalog()
@@ -213,13 +213,13 @@ def _get_catalog(
self,
save_version: str | None = None,
load_versions: dict[str, str] | None = None,
) -> DataCatalog:
"""A hook for changing the creation of a DataCatalog instance.
) -> CatalogProtocol:
"""A hook for changing the creation of a catalog instance.
Returns:
DataCatalog defined in `catalog.yml`.
catalog defined in `catalog.yml`.
Raises:
KedroContextError: Incorrect ``DataCatalog`` registered for the project.
KedroContextError: Incorrect catalog registered for the project.
"""
# '**/catalog*' reads modular pipeline configs
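With `KedroContext.catalog` now typed as `CatalogProtocol`, code that consumes the context relies only on protocol members. A usage sketch, assuming the current directory is a standard Kedro project:

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())  # assumed project location; registers settings and pipelines
with KedroSession.create(project_path=Path.cwd()) as session:
    catalog = session.load_context().catalog  # any CatalogProtocol implementation
    print(catalog.list())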
28 changes: 14 additions & 14 deletions kedro/framework/hooks/specs.py
@@ -11,7 +11,7 @@

if TYPE_CHECKING:
from kedro.framework.context import KedroContext
from kedro.io import DataCatalog
from kedro.io import CatalogProtocol
from kedro.pipeline import Pipeline
from kedro.pipeline.node import Node

@@ -22,7 +22,7 @@ class DataCatalogSpecs:
@hook_spec
def after_catalog_created( # noqa: PLR0913
self,
catalog: DataCatalog,
catalog: CatalogProtocol,
conf_catalog: dict[str, Any],
conf_creds: dict[str, Any],
feed_dict: dict[str, Any],
@@ -53,7 +53,7 @@ class NodeSpecs:
def before_node_run(
self,
node: Node,
catalog: DataCatalog,
catalog: CatalogProtocol,
inputs: dict[str, Any],
is_async: bool,
session_id: str,
@@ -63,7 +63,7 @@ def before_node_run(
Args:
node: The ``Node`` to run.
catalog: A ``DataCatalog`` containing the node's inputs and outputs.
catalog: An implemented instance of ``CatalogProtocol`` containing the node's inputs and outputs.
inputs: The dictionary of inputs dataset.
The keys are dataset names and the values are the actual loaded input data,
not the dataset instance.
@@ -81,7 +81,7 @@ def before_node_run(
def after_node_run( # noqa: PLR0913
self,
node: Node,
catalog: DataCatalog,
catalog: CatalogProtocol,
inputs: dict[str, Any],
outputs: dict[str, Any],
is_async: bool,
@@ -93,7 +93,7 @@ def after_node_run( # noqa: PLR0913
Args:
node: The ``Node`` that ran.
catalog: A ``DataCatalog`` containing the node's inputs and outputs.
catalog: An implemented instance of ``CatalogProtocol`` containing the node's inputs and outputs.
inputs: The dictionary of inputs dataset.
The keys are dataset names and the values are the actual loaded input data,
not the dataset instance.
@@ -110,7 +110,7 @@ def on_node_error( # noqa: PLR0913
self,
error: Exception,
node: Node,
catalog: DataCatalog,
catalog: CatalogProtocol,
inputs: dict[str, Any],
is_async: bool,
session_id: str,
@@ -122,7 +122,7 @@ def on_node_error( # noqa: PLR0913
Args:
error: The uncaught exception thrown during the node run.
node: The ``Node`` to run.
catalog: A ``DataCatalog`` containing the node's inputs and outputs.
catalog: An implemented instance of ``CatalogProtocol`` containing the node's inputs and outputs.
inputs: The dictionary of inputs dataset.
The keys are dataset names and the values are the actual loaded input data,
not the dataset instance.
@@ -137,7 +137,7 @@ class PipelineSpecs:

@hook_spec
def before_pipeline_run(
self, run_params: dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
self, run_params: dict[str, Any], pipeline: Pipeline, catalog: CatalogProtocol
) -> None:
"""Hook to be invoked before a pipeline runs.
@@ -164,7 +164,7 @@ def before_pipeline_run(
}
pipeline: The ``Pipeline`` that will be run.
catalog: The ``DataCatalog`` to be used during the run.
catalog: An implemented instance of ``CatalogProtocol`` to be used during the run.
"""
pass

@@ -174,7 +174,7 @@ def after_pipeline_run(
run_params: dict[str, Any],
run_result: dict[str, Any],
pipeline: Pipeline,
catalog: DataCatalog,
catalog: CatalogProtocol,
) -> None:
"""Hook to be invoked after a pipeline runs.
@@ -202,7 +202,7 @@ def after_pipeline_run(
run_result: The output of ``Pipeline`` run.
pipeline: The ``Pipeline`` that was run.
catalog: The ``DataCatalog`` used during the run.
catalog: An implemented instance of ``CatalogProtocol`` used during the run.
"""
pass

@@ -212,7 +212,7 @@ def on_pipeline_error(
error: Exception,
run_params: dict[str, Any],
pipeline: Pipeline,
catalog: DataCatalog,
catalog: CatalogProtocol,
) -> None:
"""Hook to be invoked if a pipeline run throws an uncaught Exception.
The signature of this error hook should match the signature of ``before_pipeline_run``
@@ -242,7 +242,7 @@ def on_pipeline_error(
}
pipeline: The ``Pipeline`` that will was run.
catalog: The ``DataCatalog`` used during the run.
catalog: An implemented instance of ``CatalogProtocol`` used during the run.
"""
pass

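Since the hook specs now annotate `catalog` as `CatalogProtocol`, project-side hook implementations can be written against the protocol instead of the concrete `DataCatalog`. A sketch (the hook class below is an assumed example, not part of this commit):

from kedro.framework.hooks import hook_impl
from kedro.io import CatalogProtocol


class CatalogLoggingHooks:
    @hook_impl
    def after_catalog_created(self, catalog: CatalogProtocol) -> None:
        # Only protocol members are used, so any compliant catalog works here.
        print(f"Catalog created with {len(catalog.list())} datasets")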
25 changes: 23 additions & 2 deletions kedro/framework/project/__init__.py
@@ -20,6 +20,7 @@
from dynaconf import LazySettings
from dynaconf.validator import ValidationError, Validator

from kedro.io import CatalogProtocol
from kedro.pipeline import Pipeline, pipeline

if TYPE_CHECKING:
@@ -59,6 +60,25 @@ def validate(
)


class _ImplementsCatalogProtocolValidator(Validator):
"""A validator to check if the supplied setting value is a subclass of the default class"""

def validate(
self, settings: dynaconf.base.Settings, *args: Any, **kwargs: Any
) -> None:
super().validate(settings, *args, **kwargs)

protocol = CatalogProtocol
for name in self.names:
setting_value = getattr(settings, name)
if not isinstance(setting_value(), protocol):
raise ValidationError(
f"Invalid value '{setting_value.__module__}.{setting_value.__qualname__}' "
f"received for setting '{name}'. It must implement "
f"'{protocol.__module__}.{protocol.__qualname__}'."
)


class _HasSharedParentClassValidator(Validator):
"""A validator to check that the parent of the default class is an ancestor of
the settings value."""
@@ -115,8 +135,9 @@ class _ProjectSettings(LazySettings):
_CONFIG_LOADER_ARGS = Validator(
"CONFIG_LOADER_ARGS", default={"base_env": "base", "default_run_env": "local"}
)
_DATA_CATALOG_CLASS = _IsSubclassValidator(
"DATA_CATALOG_CLASS", default=_get_default_class("kedro.io.DataCatalog")
_DATA_CATALOG_CLASS = _ImplementsCatalogProtocolValidator(
"DATA_CATALOG_CLASS",
default=_get_default_class("kedro.io.DataCatalog"),
)

def __init__(self, *args: Any, **kwargs: Any):
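The new `_ImplementsCatalogProtocolValidator` works because `CatalogProtocol` is `runtime_checkable`: the validator instantiates the configured class and runs an `isinstance()` check to confirm the required members exist. A small sketch of that check (the `NotACatalog` class is an assumed example):

from kedro.io import CatalogProtocol, DataCatalog


class NotACatalog:
    """Implements none of the protocol members."""


print(isinstance(DataCatalog(), CatalogProtocol))  # True: the default setting passes
print(isinstance(NotACatalog(), CatalogProtocol))  # False: the validator would raise ValidationError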
2 changes: 2 additions & 0 deletions kedro/io/__init__.py
@@ -9,6 +9,7 @@
from .core import (
AbstractDataset,
AbstractVersionedDataset,
CatalogProtocol,
DatasetAlreadyExistsError,
DatasetError,
DatasetNotFoundError,
@@ -23,6 +24,7 @@
"AbstractDataset",
"AbstractVersionedDataset",
"CachedDataset",
"CatalogProtocol",
"DataCatalog",
"CatalogConfigResolver",
"DatasetAlreadyExistsError",
79 changes: 78 additions & 1 deletion kedro/io/core.py
@@ -17,7 +17,15 @@
from glob import iglob
from operator import attrgetter
from pathlib import Path, PurePath, PurePosixPath
from typing import TYPE_CHECKING, Any, Callable, Generic, TypeVar
from typing import (
TYPE_CHECKING,
Any,
Callable,
Generic,
Protocol,
TypeVar,
runtime_checkable,
)
from urllib.parse import urlsplit

from cachetools import Cache, cachedmethod
@@ -29,6 +37,8 @@
if TYPE_CHECKING:
import os

from kedro.io.catalog_config_resolver import CatalogConfigResolver, Patterns

VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"
VERSIONED_FLAG_KEY = "versioned"
VERSION_KEY = "version"
@@ -871,3 +881,70 @@ def validate_on_forbidden_chars(**kwargs: Any) -> None:
raise DatasetError(
f"Neither white-space nor semicolon are allowed in '{key}'."
)


_C = TypeVar("_C")


@runtime_checkable
class CatalogProtocol(Protocol[_C]):
_datasets: dict[str, AbstractDataset]

def __contains__(self, ds_name: str) -> bool:
"""Check if a dataset is in the catalog."""
...

@property
def config_resolver(self) -> CatalogConfigResolver:
"""Return a copy of the datasets dictionary."""
...

@classmethod
def from_config(cls, catalog: dict[str, dict[str, Any]] | None) -> _C:
"""Create a catalog instance from configuration."""
...

def _get_dataset(
self,
dataset_name: str,
version: Any = None,
suggest: bool = True,
) -> AbstractDataset:
"""Retrieve a dataset by its name."""
...

def list(self, regex_search: str | None = None) -> list[str]:
"""List all dataset names registered in the catalog."""
...

def save(self, name: str, data: Any) -> None:
"""Save data to a registered dataset."""
...

def load(self, name: str, version: str | None = None) -> Any:
"""Load data from a registered dataset."""
...

def add(self, ds_name: str, dataset: Any, replace: bool = False) -> None:
"""Add a new dataset to the catalog."""
...

def add_feed_dict(self, datasets: dict[str, Any], replace: bool = False) -> None:
"""Add datasets to the catalog using the data provided through the `feed_dict`."""
...

def exists(self, name: str) -> bool:
"""Checks whether registered data set exists by calling its `exists()` method."""
...

def release(self, name: str) -> None:
"""Release any cached data associated with a dataset."""
...

def confirm(self, name: str) -> None:
"""Confirm a dataset by its name."""
...

def shallow_copy(self, extra_dataset_patterns: Patterns | None = None) -> _C:
"""Returns a shallow copy of the current object."""
...
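The protocol gives downstream and plugin code a structural type to target instead of the concrete catalog class. A short sketch of a helper typed against it (an assumed example, not part of this commit):

from __future__ import annotations

from typing import Any

from kedro.io import CatalogProtocol


def load_existing(catalog: CatalogProtocol, names: list[str]) -> dict[str, Any]:
    """Load the named datasets through whichever catalog implementation is supplied."""
    return {name: catalog.load(name) for name in names if name in catalog}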