
Commit

Merge branch 'develop' into noklam/ci-add-on
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
noklam authored Oct 24, 2023
2 parents 1535c39 + 05f1473 commit c7d9da7
Showing 15 changed files with 35 additions and 1,677 deletions.
2 changes: 2 additions & 0 deletions RELEASE.md
@@ -13,10 +13,12 @@
* Renamed the `data_sets()` method in `Pipeline` and all references to it to `datasets()`.
* Renamed the `create_default_data_set()` method in the `Runner` to `create_default_dataset()`.
* Renamed all other uses of `data_set` and `data_sets` in the codebase to `dataset` and `datasets` respectively.
* Removed deprecated `project_version` from `ProjectMetadata`.

### DataSets
* Removed `kedro.extras.datasets` and tests.
* Reduced constructor arguments for `APIDataSet` by replacing most arguments with a single constructor argument `load_args`. This makes it more consistent with other Kedro DataSets and the underlying `requests` API, and automatically enables the full configuration domain: stream, certificates, proxies, and more.
* Removed `PartitionedDataset` and `IncrementalDataset` from `kedro.io`; both now live in the separate `kedro-datasets` package (see the import sketch below).
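
A minimal migration sketch for the last point, assuming the separate `kedro-datasets` package (which the docs in this commit now point to) is installed:

```python
# Before this release, both classes were importable from kedro.io:
# from kedro.io import PartitionedDataset, IncrementalDataset

# After this release, they are provided by the kedro-datasets package instead:
from kedro_datasets.partitions import PartitionedDataset, IncrementalDataset
```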

### CLI
* Removed deprecated `kedro docs` command.
2 changes: 1 addition & 1 deletion docs/source/data/how_to_create_a_custom_dataset.md
@@ -271,7 +271,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):

Currently, the `ImageDataset` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing.

Kedro's [`PartitionedDataset`](/kedro.io.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.
Kedro's [`PartitionedDataset`](/kedro_datasets.partitions.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.

To use `PartitionedDataset` with `ImageDataset` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataset` loads all PNG files from the data directory using `ImageDataset`:
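
The catalog YAML itself is collapsed in this diff. As a rough illustration only, a Python-API sketch of the same idea might look like the following; the data path and the module path to `ImageDataset` are hypothetical placeholders:

```python
from kedro_datasets.partitions import PartitionedDataset

# Load every PNG in a directory with the custom ImageDataset defined in this guide.
pokemon_images = PartitionedDataset(
    path="data/01_raw/pokemon-images/",             # hypothetical raw-data directory
    dataset="kedro_pokemon.datasets.ImageDataset",  # hypothetical module path to the custom dataset
    filename_suffix=".png",                         # only match PNG partitions
)

partitions = pokemon_images.load()      # mapping of partition id -> lazy load function
first_key = next(iter(partitions))
first_image = partitions[first_key]()   # call the load function to get the image array
```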

6 changes: 3 additions & 3 deletions docs/source/data/partitioned_and_incremental_datasets.md
@@ -4,7 +4,7 @@

Distributed systems play an increasingly important role in ETL data pipelines. They increase the processing throughput, enabling us to work with much larger volumes of input data. A situation may arise where your Kedro node needs to read the data from a directory full of uniform files of the same type, such as JSON or CSV. Tools like `PySpark` and the corresponding [SparkDataset](/kedro_datasets.spark.SparkDataset) cater for such use cases, but using them may not always be possible.

This is why Kedro provides a built-in [PartitionedDataset](/kedro.io.PartitionedDataset), with the following features:
This is why Kedro provides a built-in [PartitionedDataset](/kedro_datasets.partitions.PartitionedDataset), with the following features:

* `PartitionedDataset` can recursively load/save all or specific files from a given location.
* It is platform agnostic, and can work with any filesystem implementation supported by [fsspec](https://filesystem-spec.readthedocs.io/) including local, S3, GCS, and many more.
@@ -240,7 +240,7 @@ When using lazy saving, the dataset will be written _after_ the `after_node_run`

## Incremental datasets

[IncrementalDataset](/kedro.io.IncrementalDataset) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs.
[IncrementalDataset](/kedro_datasets.partitions.IncrementalDataset) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, that is, each subsequent pipeline run should process just the partitions which were not processed by the previous runs.

This checkpoint, by default, is persisted to the location of the data partitions. For example, for `IncrementalDataset` instantiated with path `s3://my-bucket-name/path/to/folder`, the checkpoint will be saved to `s3://my-bucket-name/path/to/folder/CHECKPOINT`, unless [the checkpoint configuration is explicitly overwritten](#checkpoint-configuration).
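
As a minimal sketch of this behaviour — the bucket path is the example path above, while direct Python instantiation, the pandas CSV dataset type, and omitted credentials are illustrative assumptions:

```python
from kedro_datasets.pandas import CSVDataset
from kedro_datasets.partitions import IncrementalDataset

dataset = IncrementalDataset(
    path="s3://my-bucket-name/path/to/folder",  # checkpoint persisted to .../CHECKPOINT by default
    dataset=CSVDataset,
)

partitions = dataset.load()  # only the partitions not yet covered by the checkpoint
# ... process the partitions here ...
dataset.confirm()            # write the new checkpoint so the next run skips these partitions
```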

@@ -309,7 +309,7 @@ pipeline(

Important notes about the confirmation operation:

* Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that take the same partitioned dataset as an input will receive the _same_ partitions. Partitions that are created externally during the run will also not affect the dataset loads and won't appear in the list of loaded partitions until the next run or until the [`release()`](/kedro.io.IncrementalDataset) method is called on the dataset object.
* Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that take the same partitioned dataset as an input will receive the _same_ partitions. Partitions that are created externally during the run will also not affect the dataset loads and won't appear in the list of loaded partitions until the next run or until the [`release()`](/kedro_datasets.partitions.IncrementalDataset) method is called on the dataset object.
* A pipeline cannot contain more than one node confirming the same dataset.
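
As a minimal sketch of node-level confirmation — the dataset and node names are hypothetical, and `my_partitioned_dataset` is assumed to be an `IncrementalDataset` entry in the catalog:

```python
from kedro.pipeline import node, pipeline


def process_partitions(partitions):
    ...  # consume the incremental partitions here


confirmation_pipeline = pipeline(
    [
        node(
            process_partitions,
            inputs="my_partitioned_dataset",
            outputs="processed_output",
            # `confirms` names the dataset whose checkpoint is confirmed once the
            # node runs successfully; per the note above, only one node may confirm it.
            confirms="my_partitioned_dataset",
        )
    ]
)
```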


2 changes: 0 additions & 2 deletions docs/source/kedro.io.rst
@@ -15,10 +15,8 @@ kedro.io
kedro.io.AbstractVersionedDataset
kedro.io.CachedDataset
kedro.io.DataCatalog
kedro.io.IncrementalDataset
kedro.io.LambdaDataset
kedro.io.MemoryDataset
kedro.io.PartitionedDataset
kedro.io.Version

.. rubric:: Exceptions
3 changes: 3 additions & 0 deletions docs/source/kedro_datasets.rst
@@ -36,12 +36,15 @@ kedro_datasets
kedro_datasets.pandas.SQLQueryDataset
kedro_datasets.pandas.SQLTableDataset
kedro_datasets.pandas.XMLDataset
kedro_datasets.partitions.IncrementalDataset
kedro_datasets.partitions.PartitionedDataset
kedro_datasets.pickle.PickleDataset
kedro_datasets.pillow.ImageDataset
kedro_datasets.plotly.JSONDataset
kedro_datasets.plotly.PlotlyDataset
kedro_datasets.polars.CSVDataset
kedro_datasets.polars.GenericDataset
kedro_datasets.polars.EagerPolarsDataset
kedro_datasets.redis.PickleDataset
kedro_datasets.snowflake.SnowparkTableDataset
kedro_datasets.spark.DeltaTableDataset
24 changes: 5 additions & 19 deletions kedro/framework/startup.py
@@ -1,13 +1,11 @@
"""This module provides metadata for a Kedro project."""
import os
import sys
import warnings
from pathlib import Path
from typing import NamedTuple, Union

import anyconfig

from kedro import KedroDeprecationWarning
from kedro import __version__ as kedro_version
from kedro.framework.project import configure_project

@@ -21,9 +19,9 @@ class ProjectMetadata(NamedTuple):
    package_name: str
    project_name: str
    project_path: Path
    project_version: str
    source_dir: Path
    kedro_init_version: str
    add_ons: list


def _version_mismatch_error(kedro_init_version) -> str:
@@ -89,35 +87,23 @@ def _get_project_metadata(project_path: Union[str, Path]) -> ProjectMetadata:
            f"configuration parameters."
        ) from exc

    mandatory_keys = ["package_name", "project_name"]
    mandatory_keys = ["package_name", "project_name", "kedro_init_version"]
    missing_keys = [key for key in mandatory_keys if key not in metadata_dict]
    if missing_keys:
        raise RuntimeError(f"Missing required keys {missing_keys} from '{_PYPROJECT}'.")

    # Temporary solution to keep project_version backwards compatible to be removed in 0.19.0
    if "project_version" in metadata_dict:
        warnings.warn(
            "project_version in pyproject.toml is deprecated, use kedro_init_version instead",
            KedroDeprecationWarning,
        )
        metadata_dict["kedro_init_version"] = metadata_dict["project_version"]
    elif "kedro_init_version" in metadata_dict:
        metadata_dict["project_version"] = metadata_dict["kedro_init_version"]
    else:
        raise RuntimeError(
            f"Missing required key kedro_init_version from '{_PYPROJECT}'."
        )

    mandatory_keys.append("kedro_init_version")
    # check the match for major and minor version (skip patch version)
    if (
        metadata_dict["kedro_init_version"].split(".")[:2]
        != kedro_version.split(".")[:2]
    ):
        raise ValueError(_version_mismatch_error(metadata_dict["kedro_init_version"]))

    # Default settings
    source_dir = Path(metadata_dict.get("source_dir", "src")).expanduser()
    source_dir = (project_path / source_dir).resolve()
    metadata_dict["add_ons"] = metadata_dict.get("add_ons")

    metadata_dict["source_dir"] = source_dir
    metadata_dict["config_file"] = pyproject_toml
    metadata_dict["project_path"] = project_path
6 changes: 0 additions & 6 deletions kedro/io/__init__.py
@@ -15,10 +15,6 @@
from .data_catalog import DataCatalog
from .lambda_dataset import LambdaDataset
from .memory_dataset import MemoryDataset
from .partitioned_dataset import (
    IncrementalDataset,
    PartitionedDataset,
)

__all__ = [
"AbstractDataset",
Expand All @@ -28,9 +24,7 @@
"DatasetAlreadyExistsError",
"DatasetError",
"DatasetNotFoundError",
"IncrementalDataset",
"LambdaDataset",
"MemoryDataset",
"PartitionedDataset",
"Version",
]