
Commit

Merge branch 'develop' into noklam/ci-add-on
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
noklam authored Oct 24, 2023
2 parents 1535c39 + 05f1473 commit c7d9da7
Showing 15 changed files with 35 additions and 1,677 deletions.
2 changes: 2 additions & 0 deletions RELEASE.md
@@ -13,10 +13,12 @@
* Renamed the `data_sets()` method in `Pipeline` and all references to it to `datasets()`.
* Renamed the `create_default_data_set()` method in the `Runner` to `create_default_dataset()`.
* Renamed all other uses of `data_set` and `data_sets` in the codebase to `dataset` and `datasets` respectively.
* Removed deprecated `project_version` from `ProjectMetadata`.

### DataSets
* Removed `kedro.extras.datasets` and tests.
* Reduced constructor arguments for `APIDataSet` by replacing most arguments with a single constructor argument `load_args`. This makes it more consistent with other Kedro DataSets and the underlying `requests` API, and automatically enables the full configuration domain: stream, certificates, proxies, and more.
* Removed `PartitionedDataset` and `IncrementalDataset` from `kedro.io`; both now live in the separate `kedro-datasets` package (see the import sketch below).
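
A minimal migration sketch for the last point, assuming the separate `kedro-datasets` package (which the docs in this commit now point to) is installed:

```python
# Before this release, both classes were importable from kedro.io:
# from kedro.io import PartitionedDataset, IncrementalDataset

# After this release, they are provided by the kedro-datasets package instead:
from kedro_datasets.partitions import PartitionedDataset, IncrementalDataset
```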

### CLI
* Removed deprecated `kedro docs` command.
2 changes: 1 addition & 1 deletion docs/source/data/how_to_create_a_custom_dataset.md
@@ -271,7 +271,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):

Currently, the `ImageDataset` only works with a single image, but this example needs to load all Pokemon images from the raw data directory for further processing.

Kedro's [`PartitionedDataset`](/kedro.io.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.
Kedro's [`PartitionedDataset`](/kedro_datasets.partitions.PartitionedDataset) is a convenient way to load multiple separate data files of the same underlying dataset type into a directory.

To use `PartitionedDataset` with `ImageDataset` to load all Pokemon PNG images, add this to the data catalog YAML so that `PartitionedDataset` loads all PNG files from the data directory using `ImageDataset`:
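
The catalog YAML itself is collapsed in this diff. As a rough illustration only, a Python-API sketch of the same idea might look like the following; the data path and the module path to `ImageDataset` are hypothetical placeholders:

```python
from kedro_datasets.partitions import PartitionedDataset

# Load every PNG in a directory with the custom ImageDataset defined in this guide.
pokemon_images = PartitionedDataset(
    path="data/01_raw/pokemon-images/",             # hypothetical raw-data directory
    dataset="kedro_pokemon.datasets.ImageDataset",  # hypothetical module path to the custom dataset
    filename_suffix=".png",                         # only match PNG partitions
)

partitions = pokemon_images.load()      # mapping of partition id -> lazy load function
first_key = next(iter(partitions))
first_image = partitions[first_key]()   # call the load function to get the image array
```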

6 changes: 3 additions & 3 deletions docs/source/data/partitioned_and_incremental_datasets.md
@@ -4,7 +4,7 @@

Distributed systems play an increasingly important role in ETL data pipelines. They increase the processing throughput, enabling us to work with much larger volumes of input data. A situation may arise where your Kedro node needs to read the data from a directory full of uniform files of the same type, such as JSON or CSV. Tools like `PySpark` and the corresponding [SparkDataset](/kedro_datasets.spark.SparkDataset) cater for such use cases, but using them may not always be possible.

This is why Kedro provides a built-in [PartitionedDataset](/kedro.io.PartitionedDataset), with the following features:
This is why Kedro provides a built-in [PartitionedDataset](/kedro_datasets.partitions.PartitionedDataset), with the following features:

* `PartitionedDataset` can recursively load/save all or specific files from a given location.
* It is platform agnostic, and can work with any filesystem implementation supported by [fsspec](https://filesystem-spec.readthedocs.io/) including local, S3, GCS, and many more.
@@ -240,7 +240,7 @@ When using lazy saving, the dataset will be written _after_ the `after_node_run`

## Incremental datasets

[IncrementalDataset](/kedro.io.IncrementalDataset) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, i.e. each subsequent pipeline run should only process the partitions which were not processed by the previous runs.
[IncrementalDataset](/kedro_datasets.partitions.IncrementalDataset) is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, that is, each subsequent pipeline run should process just the partitions which were not processed by the previous runs.

This checkpoint, by default, is persisted to the location of the data partitions. For example, for `IncrementalDataset` instantiated with path `s3://my-bucket-name/path/to/folder`, the checkpoint will be saved to `s3://my-bucket-name/path/to/folder/CHECKPOINT`, unless [the checkpoint configuration is explicitly overwritten](#checkpoint-configuration).
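
As a minimal sketch of this behaviour — the bucket path is the example path above, while direct Python instantiation, the pandas CSV dataset type, and omitted credentials are illustrative assumptions:

```python
from kedro_datasets.pandas import CSVDataset
from kedro_datasets.partitions import IncrementalDataset

dataset = IncrementalDataset(
    path="s3://my-bucket-name/path/to/folder",  # checkpoint persisted to .../CHECKPOINT by default
    dataset=CSVDataset,
)

partitions = dataset.load()  # only the partitions not yet covered by the checkpoint
# ... process the partitions here ...
dataset.confirm()            # write the new checkpoint so the next run skips these partitions
```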

@@ -309,7 +309,7 @@ pipeline(

Important notes about the confirmation operation:

* Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that take the same partitioned dataset as an input will receive the _same_ partitions. Partitions that are created externally during the run will also not affect the dataset loads and won't appear in the list of loaded partitions until the next run or until the [`release()`](/kedro.io.IncrementalDataset) method is called on the dataset object.
* Confirming a partitioned dataset does not affect any subsequent loads within the same run. All downstream nodes that take the same partitioned dataset as an input will receive the _same_ partitions. Partitions that are created externally during the run will also not affect the dataset loads and won't appear in the list of loaded partitions until the next run or until the [`release()`](/kedro_datasets.partitions.IncrementalDataset) method is called on the dataset object.
* A pipeline cannot contain more than one node confirming the same dataset.
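
As a minimal sketch of node-level confirmation — the dataset and node names are hypothetical, and `my_partitioned_dataset` is assumed to be an `IncrementalDataset` entry in the catalog:

```python
from kedro.pipeline import node, pipeline


def process_partitions(partitions):
    ...  # consume the incremental partitions here


confirmation_pipeline = pipeline(
    [
        node(
            process_partitions,
            inputs="my_partitioned_dataset",
            outputs="processed_output",
            # `confirms` names the dataset whose checkpoint is confirmed once the
            # node runs successfully; per the note above, only one node may confirm it.
            confirms="my_partitioned_dataset",
        )
    ]
)
```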


2 changes: 0 additions & 2 deletions docs/source/kedro.io.rst
@@ -15,10 +15,8 @@ kedro.io
kedro.io.AbstractVersionedDataset
kedro.io.CachedDataset
kedro.io.DataCatalog
kedro.io.IncrementalDataset
kedro.io.LambdaDataset
kedro.io.MemoryDataset
kedro.io.PartitionedDataset
kedro.io.Version

.. rubric:: Exceptions
3 changes: 3 additions & 0 deletions docs/source/kedro_datasets.rst
@@ -36,12 +36,15 @@ kedro_datasets
kedro_datasets.pandas.SQLQueryDataset
kedro_datasets.pandas.SQLTableDataset
kedro_datasets.pandas.XMLDataset
kedro_datasets.partitions.IncrementalDataset
kedro_datasets.partitions.PartitionedDataset
kedro_datasets.pickle.PickleDataset
kedro_datasets.pillow.ImageDataset
kedro_datasets.plotly.JSONDataset
kedro_datasets.plotly.PlotlyDataset
kedro_datasets.polars.CSVDataset
kedro_datasets.polars.GenericDataset
kedro_datasets.polars.EagerPolarsDataset
kedro_datasets.redis.PickleDataset
kedro_datasets.snowflake.SnowparkTableDataset
kedro_datasets.spark.DeltaTableDataset
24 changes: 5 additions & 19 deletions kedro/framework/startup.py
@@ -1,13 +1,11 @@
"""This module provides metadata for a Kedro project."""
import os
import sys
import warnings
from pathlib import Path
from typing import NamedTuple, Union

import anyconfig

from kedro import KedroDeprecationWarning
from kedro import __version__ as kedro_version
from kedro.framework.project import configure_project

@@ -21,9 +19,9 @@ class ProjectMetadata(NamedTuple):
    package_name: str
    project_name: str
    project_path: Path
    project_version: str
    source_dir: Path
    kedro_init_version: str
    add_ons: list


def _version_mismatch_error(kedro_init_version) -> str:
@@ -89,35 +87,23 @@ def _get_project_metadata(project_path: Union[str, Path]) -> ProjectMetadata:
            f"configuration parameters."
        ) from exc

    mandatory_keys = ["package_name", "project_name"]
    mandatory_keys = ["package_name", "project_name", "kedro_init_version"]
    missing_keys = [key for key in mandatory_keys if key not in metadata_dict]
    if missing_keys:
        raise RuntimeError(f"Missing required keys {missing_keys} from '{_PYPROJECT}'.")

    # Temporary solution to keep project_version backwards compatible to be removed in 0.19.0
    if "project_version" in metadata_dict:
        warnings.warn(
            "project_version in pyproject.toml is deprecated, use kedro_init_version instead",
            KedroDeprecationWarning,
        )
        metadata_dict["kedro_init_version"] = metadata_dict["project_version"]
    elif "kedro_init_version" in metadata_dict:
        metadata_dict["project_version"] = metadata_dict["kedro_init_version"]
    else:
        raise RuntimeError(
            f"Missing required key kedro_init_version from '{_PYPROJECT}'."
        )

    mandatory_keys.append("kedro_init_version")
    # check the match for major and minor version (skip patch version)
    if (
        metadata_dict["kedro_init_version"].split(".")[:2]
        != kedro_version.split(".")[:2]
    ):
        raise ValueError(_version_mismatch_error(metadata_dict["kedro_init_version"]))

    # Default settings
    source_dir = Path(metadata_dict.get("source_dir", "src")).expanduser()
    source_dir = (project_path / source_dir).resolve()
    metadata_dict["add_ons"] = metadata_dict.get("add_ons")

    metadata_dict["source_dir"] = source_dir
    metadata_dict["config_file"] = pyproject_toml
    metadata_dict["project_path"] = project_path
6 changes: 0 additions & 6 deletions kedro/io/__init__.py
@@ -15,10 +15,6 @@
from .data_catalog import DataCatalog
from .lambda_dataset import LambdaDataset
from .memory_dataset import MemoryDataset
from .partitioned_dataset import (
    IncrementalDataset,
    PartitionedDataset,
)

__all__ = [
"AbstractDataset",
Expand All @@ -28,9 +24,7 @@
"DatasetAlreadyExistsError",
"DatasetError",
"DatasetNotFoundError",
"IncrementalDataset",
"LambdaDataset",
"MemoryDataset",
"PartitionedDataset",
"Version",
]