- Allowed use of custom cookiecutter templates for creating pipelines with the `--template` flag for `kedro pipeline create`, or via the `template/pipeline` folder.
- Allowed overriding of configuration keys with runtime parameters using the `runtime_params` resolver with `OmegaConfigLoader` (see the catalog sketch after this list).
- Updated dataset factories to resolve nested catalog config properly.
- Updated `OmegaConfigLoader` to handle paths containing dots outside of `conf_source`.
- Made `settings.py` optional.
- Added documentation to clarify the execution order of hooks.
- Added a notebook example for spaceflights to illustrate how to incrementally add Kedro features.
- Moved documentation for the `standalone-datacatalog` starter into its README file.
- Added new documentation about deploying a Kedro project with Amazon EMR.
- Added new documentation about how to publish a Kedro-Viz project to make it shareable.
- New TSC members added to the page, and the organisation of each member is now also listed.
- Plus some minor bug fixes and changes across the documentation.
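A minimal sketch of the `runtime_params` resolver in a catalog entry; the dataset name, path and `folder` parameter are illustrative:

```yaml
# conf/base/catalog.yml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/${runtime_params:folder}/companies.csv
```

The placeholder would then be filled from the command line, e.g. `kedro run --params="folder=2023-09"`.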
- All dataset classes will be removed from the core Kedro repository (`kedro.extras.datasets`). Install and import them from the `kedro-datasets` package instead.
- All dataset classes ending with `DataSet` are deprecated and will be removed in Kedro `0.19.0` and `kedro-datasets` `2.0.0`. Instead, use the updated class names ending with `Dataset`.
- The starters `pandas-iris`, `pyspark-iris`, `pyspark`, and `standalone-datacatalog` are deprecated and will be archived in Kedro 0.19.0.
- `PartitionedDataset` and `IncrementalDataset` have been moved to `kedro-datasets` and will be removed in Kedro `0.19.0`. Install and import them from the `kedro-datasets` package instead.
Many thanks to the following Kedroids for contributing PRs to this release:
- Jason Hite
- IngerMathilde
- Laíza Milena Scheid Parizotto
- Richard
- flpvvvv
- qheuristics
- Miguel Ortiz
- rxm7706
- Iñigo Hidalgo
- harmonys-qb
- Yi Kuang
- Jens Lordén
- Added support for Python 3.11. This includes tackling challenges like dependency pinning and test adjustments to ensure a smooth experience. Detailed migration tips are provided below for further context.
- Added new `OmegaConfigLoader` features:
  - Allowed registering of custom resolvers to `OmegaConfigLoader` through `CONFIG_LOADER_ARGS` (see the `settings.py` sketch after this list).
  - Added support for global variables to `OmegaConfigLoader`.
- Added `kedro catalog resolve` CLI command that resolves dataset factories in the catalog with any explicit entries in the project pipeline.
- Implemented a flat `conf/` structure for modular pipelines, and accordingly updated the `kedro pipeline create` and `kedro catalog create` commands.
- Updated the new Kedro project template and Kedro starters:
  - Changed Kedro starters and new Kedro projects to use `OmegaConfigLoader`.
  - Converted `setup.py` in the new Kedro project template and Kedro starters to `pyproject.toml` and moved the flake8 configuration to a dedicated `.flake8` file.
  - Updated the spaceflights starter to use the new flat `conf/` structure.
- Updated `OmegaConfigLoader` to ignore config from hidden directories like `.ipynb_checkpoints`.
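A minimal sketch of registering a custom resolver through `CONFIG_LOADER_ARGS`; the resolver name and function are illustrative:

```python
# settings.py
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        # usable in config as ${add: 1, 2, 3}
        "add": lambda *numbers: sum(numbers),
    }
}
```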
- Revised the `data` section to restructure beginner and advanced pages about the Data Catalog and datasets.
- Moved contributor documentation to the GitHub wiki.
- Updated the example of using generator functions in nodes.
- Added a migration guide from the `ConfigLoader` and the `TemplatedConfigLoader` to the `OmegaConfigLoader`. The `ConfigLoader` and the `TemplatedConfigLoader` are deprecated and will be removed in the `0.19.0` release.
- PyTables on Windows: Users on Windows with Python >=3.8 should note we've pinned `pytables` to `3.8.0` due to compatibility issues.
- Spark dependency: We've set an upper version limit for `pyspark` at <3.4 due to breaking changes in 3.4.
- Testing with Python 3.10: The latest `moto` version now supports parallel test execution for Python 3.10, resolving previous issues.
- Renamed abstract dataset classes, in accordance with the Kedro lexicon. Dataset classes ending with "DataSet" are deprecated and will be removed in 0.19.0. Note that all of the below classes are also importable from `kedro.io`; only the module where they are defined is listed as the location.

| Type | Deprecated Alias | Location |
| --- | --- | --- |
| `AbstractDataset` | `AbstractDataSet` | `kedro.io.core` |
| `AbstractVersionedDataset` | `AbstractVersionedDataSet` | `kedro.io.core` |

- Using the `layer` attribute at the top level is deprecated; it will be removed in Kedro version 0.19.0. Please move `layer` inside the `metadata` -> `kedro-viz` attributes, as shown in the sketch below.
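A sketch of moving `layer` under the `metadata` -> `kedro-viz` attributes in the catalog; the dataset name and type are illustrative:

```yaml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      layer: raw   # previously a top-level `layer: raw` attribute
```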
Thanks to Laíza Milena Scheid Parizotto and Jonathan Cohen.
- Added the dataset factories feature, which uses pattern matching to reduce the number of catalog entries (see the sketch after this list).
- Activated all built-in resolvers by default for `OmegaConfigLoader` except for `oc.env`.
- Added `kedro catalog rank` CLI command that ranks dataset factories in the catalog by matching priority.
- Consolidated dependencies and optional dependencies in `pyproject.toml`.
- Made validation of unique node outputs much faster.
- Updated `kedro catalog list` to show datasets generated with factories.
- Recommended `ruff` as the linter and removed mentions of `pylint`, `isort` and `flake8`.
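A minimal sketch of a dataset factory entry; the pattern and the datasets that match it (for example `companies#csv`) are illustrative:

```yaml
# catalog.yml
"{name}#csv":
  type: pandas.CSVDataSet
  filepath: data/01_raw/{name}.csv
```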
Thanks to Laíza Milena Scheid Parizotto and Chris Schopp.
- `ConfigLoader` and `TemplatedConfigLoader` will be deprecated. Please use `OmegaConfigLoader` instead.
- Added `databricks-iris` as an official starter.
- Reworked the micropackaging workflow to use standard Python packaging practices.
- Made `kedro micropkg package` accept `--verbose`.
- Compared both protocol and delimiter in `PartitionedDataSet` so that the protocol can be passed to partitions whose paths start with the same characters as the protocol (e.g. `s3://s3-my-bucket`).
- Significant improvements to the documentation that covers working with Databricks and Kedro, including a new page for workspace-only development and a guide to choosing the best workflow for your use case.
- Updated documentation for deploying with Prefect for version 2.0.
- Renamed dataset and error classes, in accordance with the Kedro lexicon. Dataset classes ending with "DataSet" and error classes starting with "DataSet" are deprecated and will be removed in 0.19.0. Note that all of the below classes are also importable from `kedro.io`; only the module where they are defined is listed as the location.

| Type | Deprecated Alias | Location |
| --- | --- | --- |
| `CachedDataset` | `CachedDataSet` | `kedro.io.cached_dataset` |
| `LambdaDataset` | `LambdaDataSet` | `kedro.io.lambda_dataset` |
| `IncrementalDataset` | `IncrementalDataSet` | `kedro.io.partitioned_dataset` |
| `MemoryDataset` | `MemoryDataSet` | `kedro.io.memory_dataset` |
| `PartitionedDataset` | `PartitionedDataSet` | `kedro.io.partitioned_dataset` |
| `DatasetError` | `DataSetError` | `kedro.io.core` |
| `DatasetAlreadyExistsError` | `DataSetAlreadyExistsError` | `kedro.io.core` |
| `DatasetNotFoundError` | `DataSetNotFoundError` | `kedro.io.core` |
Many thanks to the following Kedroids for contributing PRs to this release:
- Rebrand across all documentation and Kedro assets.
- Added support for variable interpolation in the catalog with the `OmegaConfigLoader` (see the sketch after this list).
- `kedro run --params` now updates interpolated parameters correctly when using `OmegaConfigLoader`.
- Added a `metadata` attribute to `kedro.io` datasets. This is ignored by Kedro, but may be consumed by users or external plugins.
- Added `kedro.logging.RichHandler`. This replaces the default `rich.logging.RichHandler` and is more flexible; users can turn off the `rich` traceback if needed.
- `OmegaConfigLoader` will return a `dict` instead of `DictConfig`.
- `OmegaConfigLoader` does not show a `MissingConfigError` when the config files exist but are empty.
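A sketch of catalog variable interpolation with `OmegaConfigLoader`; the `_base_location` key and dataset entry are illustrative (keys starting with `_` are treated as templates and not added to the catalog):

```yaml
# catalog.yml
_base_location: data/01_raw

companies:
  type: pandas.CSVDataSet
  filepath: ${_base_location}/companies.csv
```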
- Added documentation for collaborative experiment tracking within Kedro-Viz.
- Revised the section on deployment to better organise content and reflect how recently docs have been updated.
- Minor improvements to fix typos and revise docs to align with engineering changes.
- `kedro package` no longer produces `.egg` files; it now relies exclusively on `.whl` files.
Many thanks to the following Kedroids for contributing PRs to this release:
- Added the `KEDRO_LOGGING_CONFIG` environment variable, which can be used to configure logging from the beginning of the `kedro` process.
- Removed the logs folder from the `kedro new` project template. File-based logging remains, but is now limited to level INFO and above and goes to the project root instead.
- Improvements to Jupyter E2E tests.
- Added the full `kedro run` CLI command to the session store to improve run reproducibility using `Kedro-Viz` experiment tracking.
- Improvements to documentation about configuration.
- Improvements to the Sphinx toolchain, including incrementing to use a newer version.
- Improvements to documentation on visualising Kedro projects on Databricks, and additional documentation about the development workflow for Kedro projects on Databricks.
- Updated Technical Steering Committee membership documentation.
- Revised the documentation section about linting and formatting and extended it to give details of `flake8` configuration.
- Updated the table of contents for documentation to reduce scrolling.
- Expanded FAQ documentation.
- Added a 404 page to documentation.
- Added deprecation warnings about the removal of `kedro.extras.datasets`.
Many thanks to the following Kedroids for contributing PRs to this release:
- Added a new Kedro CLI command, `kedro jupyter setup`, to set up the Jupyter kernel for Kedro.
- `kedro package` now includes the project configuration in a compressed `tar.gz` file.
- Added functionality to the `OmegaConfigLoader` to load configuration from compressed files of `zip` or `tar` format. This feature requires `fsspec>=2023.1.0`.
- Significant improvements to on-boarding documentation that covers setup for new Kedro users. Also some major changes to the spaceflights tutorial to make it faster to work through. We think it's a better read. Tell us if it's not.
- Added a guide and tooling for developing Kedro for Databricks.
- Implemented the missing dict-like interface for `_ProjectPipeline`.
- Fixed a bug that prevented reading or writing datasets with `s3a` or `s3n` filepaths.
- Fixed a bug with overriding nested parameters using the `--params` flag.
- Fixed a bug that made the session store incompatible with `Kedro-Viz` experiment tracking.
A regression introduced in Kedro version `0.18.5` caused the `Kedro-Viz` console to fail to show experiment tracking correctly. If you experienced this issue, you will need to:

- upgrade to Kedro version `0.18.6`
- delete any erroneous session entries created with Kedro 0.18.5 from your session store, stored at `<project-path>/data/session_store.db`.
Thanks to Kedroids tomohiko kato, tsanikgr and maddataanalyst for very detailed reports about the bug.
This release introduced a bug that causes a failure in experiment tracking within the `Kedro-Viz` console. We recommend that you use Kedro version `0.18.6` in preference.
- Added the new `OmegaConfigLoader`, which uses `OmegaConf` for loading and merging configuration.
- Added the `--conf-source` option to `kedro run`, allowing users to specify a source for project configuration for the run.
- Added `omegaconf` syntax as an option for `--params`. Keys and values can now be separated by colons or equals signs.
- Added support for generator functions as nodes, i.e. using `yield` instead of `return` (see the sketch after this list).
  - Enables chunk-wise processing in nodes with generator functions.
  - Saves node outputs after every `yield` before proceeding with the next chunk.
- Fixed incorrect parsing of Azure Data Lake Storage Gen2 URIs used in datasets.
- Added support for loading credentials from environment variables using `OmegaConfigLoader`.
- Added a new `--namespace` flag to `kedro run` to enable filtering by node namespace.
- Added a new argument `node` for all four dataset hooks.
- Added the `kedro run` flags `--nodes`, `--tags`, and `--load-versions` to replace `--node`, `--tag`, and `--load-version`.
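A minimal sketch of a generator-function node; the dataset names, parameter and chunk size are illustrative:

```python
import pandas as pd

from kedro.pipeline import node


def preprocess_in_chunks(csv_path: str):
    # Each yielded chunk is saved to the output dataset before the
    # next chunk is read, enabling chunk-wise processing.
    for chunk in pd.read_csv(csv_path, chunksize=10_000):
        yield chunk.dropna()


preprocess_node = node(
    func=preprocess_in_chunks,
    inputs="params:raw_csv_path",
    outputs="preprocessed_table",
)
```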
- Commas surrounded by square brackets (only possible for nodes with default names) will no longer split the arguments to `kedro run` options which take a list of nodes as inputs (`--from-nodes` and `--to-nodes`).
- Fixed a bug where the `micropkg` manifest section in `pyproject.toml` wasn't recognised as allowed configuration.
- Fixed a bug causing `load_ipython_extension` not to register the `%reload_kedro` line magic when called in a directory that does not contain a Kedro project.
- Added `anyconfig`'s `ac_context` parameter to `kedro.config.commons` module functions for more flexible `ConfigLoader` customisations.
- Changed references to the `kedro.pipeline.Pipeline` object throughout the test suite to the `kedro.modular_pipeline.pipeline` factory.
- Fixed a bug causing the `after_dataset_saved` hook to be called for only one output dataset when multiple are saved in a single node and async saving is in use.
- Log level for "Credentials not found in your Kedro project config" was changed from `WARNING` to `DEBUG`.
- Added safe extraction of tar files in `micropkg pull` to fix the vulnerability caused by CVE-2007-4559.
- Documentation improvements:
  - Bug fix in table font size
  - Updated API docs links for datasets
  - Improved CLI docs for `kedro run`
  - Revised documentation for visualisation to build plots and for experiment tracking
  - Added an example for loading external credentials to the Hooks documentation
Many thanks to the following Kedroids for contributing PRs to this release:
- `project_version` will be deprecated in `pyproject.toml`; please use `kedro_init_version` instead.
- Deprecated the `kedro run` flags `--node`, `--tag`, and `--load-version` in favour of `--nodes`, `--tags`, and `--load-versions`.
- Make Kedro instantiate datasets from `kedro_datasets` with higher priority than `kedro.extras.datasets`. `kedro_datasets` is the namespace for the new `kedro-datasets` Python package.
- The config loader objects now implement `UserDict` and the configuration is accessed through `conf_loader['catalog']`.
- You can configure config file patterns through `settings.py` without creating a custom config loader (see the sketch after this list).
- Added the following new datasets:

| Type | Description | Location |
| --- | --- | --- |
| `svmlight.SVMLightDataSet` | Work with svmlight/libsvm files using the scikit-learn library | `kedro.extras.datasets.svmlight` |
| `video.VideoDataSet` | Read and write video files from a filesystem | `kedro.extras.datasets.video` |
| `video.video_dataset.SequenceVideo` | Create a video object from an iterable sequence to use with `VideoDataSet` | `kedro.extras.datasets.video` |
| `video.video_dataset.GeneratorVideo` | Create a video object from a generator to use with `VideoDataSet` | `kedro.extras.datasets.video` |

- Implemented support for a functional definition of schema in `dask.ParquetDataSet` to work with the `dask.to_parquet` API.
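A sketch of overriding config file patterns via `settings.py`; the `spark` pattern is illustrative:

```python
# settings.py
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        # pick up conf/<env>/spark* files as "spark" config
        "spark": ["spark*", "spark*/**"],
        "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
    }
}
```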
- Fixed `kedro micropkg pull` for packages on PyPI.
- Fixed `format` in `save_args` for `SparkHiveDataSet`; previously it didn't allow you to save in delta format.
- Fixed save errors in `TensorFlowModelDataset` when used without versioning; previously it wouldn't overwrite an existing model.
- Added support for `tf.device` in `TensorFlowModelDataset`.
- Updated the error message for `VersionNotFoundError` to handle insufficient permission issues for cloud storage.
- Updated Experiment Tracking docs with working examples.
- Updated `MatplotlibWriter`, `text.TextDataSet`, `plotly.PlotlyDataSet` and `plotly.JSONDataSet` docs with working examples.
- Modified the implementation of the Kedro IPython extension to use `local_ns` rather than a global variable.
- Refactored `ShelveStore` into its own module to ensure multiprocessing works with it.
- `kedro.extras.datasets.pandas.SQLQueryDataSet` now takes an optional argument `execution_options`.
- Removed the `attrs` upper bound to support newer versions of Airflow.
- Bumped the lower bound for the `setuptools` dependency to <=61.5.1.
- `kedro test` and `kedro lint` will be deprecated.
- Revised the Introduction to shorten it.
- Revised the Get Started section to remove unnecessary information and clarify the learning path.
- Updated the spaceflights tutorial to simplify the later stages and clarify what the reader needs to do in each phase.
- Moved some pages that covered advanced materials into more appropriate sections.
- Moved visualisation into its own section.
- Fixed a bug that degraded the user experience: the table of contents is now sticky when you navigate between pages.
- Added redirects where needed on ReadTheDocs for legacy links and bookmarks.
We are grateful to the following for submitting PRs that contributed to this release: jstammers, FlorianGD, yash6318, carlaprv, dinotuku, williamcaicedo, avan-sh, Kastakin, amaralbf, BSGalvan, levimjoseph, daniel-falk, clotildeguinard, avsolatorio, and picklejuicedev for comments and input to documentation changes
- Implemented autodiscovery of project pipelines. A pipeline created with `kedro pipeline create <pipeline_name>` can now be accessed immediately without needing to explicitly register it in `src/<package_name>/pipeline_registry.py`, either individually by name (e.g. `kedro run --pipeline=<pipeline_name>`) or as part of the combined default pipeline (e.g. `kedro run`). By default, the simplified `register_pipelines()` function in `pipeline_registry.py` looks like:

  ```python
  from typing import Dict

  from kedro.framework.project import find_pipelines
  from kedro.pipeline import Pipeline


  def register_pipelines() -> Dict[str, Pipeline]:
      """Register the project's pipelines.

      Returns:
          A mapping from pipeline names to ``Pipeline`` objects.
      """
      pipelines = find_pipelines()
      pipelines["__default__"] = sum(pipelines.values())
      return pipelines
  ```

- The Kedro IPython extension should now be loaded with `%load_ext kedro.ipython`.
- The line magic `%reload_kedro` now accepts keyword arguments, e.g. `%reload_kedro --env=prod`.
- Improved the resume pipeline suggestion for `SequentialRunner`; it will now backtrack to the closest persisted inputs to resume.
- Changed the default value of `show_locals` for rich logging to `False`, to make sure credentials and other sensitive data aren't shown in logs.
- Rich traceback handling is disabled on Databricks so that exceptions now halt execution as expected. This is a workaround for a bug in `rich`.
- When using `kedro run -n [some_node]`, if `some_node` is missing a namespace the resulting error message will suggest the correct node name.
- Updated documentation for `rich` logging.
- Updated Prefect deployment documentation to allow for reruns with saved versioned datasets.
- The Kedro IPython extension now surfaces errors when it cannot load a Kedro project.
- Relaxed the `delta-spark` upper bound to allow compatibility with Spark 3.1.x and 3.2.x.
- Added `gdrive` to the list of cloud protocols, enabling Google Drive paths for datasets.
- Added an SVG logo resource for the IPython kernel.
- The Kedro IPython extension will no longer be available as `%load_ext kedro.extras.extensions.ipython`; use `%load_ext kedro.ipython` instead.
- `kedro jupyter convert`, `kedro build-docs`, `kedro build-reqs` and `kedro activate-nbstripout` will be deprecated.
- Added `abfss` to the list of cloud protocols, enabling abfss paths.
- Kedro now uses the Rich library to format terminal logs and tracebacks.
- The file `conf/base/logging.yml` is now optional. See our documentation for details.
- Introduced a `kedro.starters` entry point. This enables plugins to create custom starter aliases used by `kedro starter list` and `kedro new`.
- Reduced the `kedro new` prompts to just one question asking for the project name.
- Bumped the `pyyaml` upper bound to make Kedro compatible with the pyodide stack.
- Updated the project template's Sphinx configuration to use `myst_parser` instead of `recommonmark`.
- Reduced the number of log lines by changing the logging level from `INFO` to `DEBUG` for low priority messages.
- Kedro's framework-side logging configuration no longer performs file-based logging. Hence superfluous `info.log`/`errors.log` files are no longer created in your project root, and running Kedro on read-only file systems such as Databricks Repos is now possible.
- The `root` logger is now set to the Python default level of `WARNING` rather than `INFO`. Kedro's logger is still set to emit `INFO` level messages.
- `SequentialRunner` now has consistent execution order across multiple runs with sorted nodes.
- Bumped the upper bound for the Flake8 dependency to <5.0.
- `kedro jupyter notebook/lab` no longer reuses a Jupyter kernel.
- Required `cookiecutter>=2.1.1` to address a known command injection vulnerability.
- The session store no longer fails if a username cannot be found with `getpass.getuser`.
- Added generic typing for `AbstractDataSet` and `AbstractVersionedDataSet` as well as typing to all datasets.
- Rendered the deployment guide flowchart as a Mermaid diagram, and added Dask.
- The module `kedro.config.default_logger` no longer exists; default logging configuration is now set automatically through `kedro.framework.project.LOGGING`. Unless you explicitly import `kedro.config.default_logger` you do not need to make any changes.
- `kedro.extras.ColorHandler` will be removed in 0.19.0.
- Added a new hook, `after_context_created`, that passes the `KedroContext` instance as `context` (see the sketch after this list).
- Added a new CLI hook, `after_command_run`.
- Added more detail to the YAML `ParserError` exception error message.
- Added an option to `SparkDataSet` to specify a `schema` load argument that allows for supplying a user-defined schema, as opposed to relying on the schema inference of Spark.
- The Kedro package no longer contains a built version of the Kedro documentation, significantly reducing the package size.
- Removed a fatal error from being logged when a Kedro session is created in a directory without git.
- Fixed `CONFIG_LOADER_CLASS` validation so that `TemplatedConfigLoader` can be specified in `settings.py`. Any `CONFIG_LOADER_CLASS` must be a subclass of `AbstractConfigLoader`.
- Added the runner name to the `run_params` dictionary used in pipeline hooks.
- Updated Databricks documentation to include how to get it working with the IPython extension and Kedro-Viz.
- Updated sections on visualisation, namespacing, and experiment tracking in the spaceflights tutorial to correspond to the complete spaceflights starter.
- Fixed `Jinja2` syntax loading with `TemplatedConfigLoader` using `globals.yml`.
- Removed the global `_active_session`, `_activate_session` and `_deactivate_session`. Plugins that need to access objects such as the config loader should now do so through `context` in the new `after_context_created` hook.
- `config_loader` is available as a public read-only attribute of `KedroContext`.
- Made the `hook_manager` argument optional for `runner.run`.
- `kedro docs` now opens an online version of the Kedro documentation instead of a locally built version.
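A minimal sketch of the new `after_context_created` hook; the hook class and log message are illustrative:

```python
# hooks.py
import logging

from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def after_context_created(self, context):
        # `context` is the KedroContext instance created for this session
        logging.getLogger(__name__).info("Project path: %s", context.project_path)
```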
- `kedro docs` will be removed in 0.19.0.
Kedro 0.18.0 strives to reduce the complexity of the project template and get us closer to a stable release of the framework. We've introduced the full micro-packaging workflow 📦, which allows you to import packages, utility functions and existing pipelines into your Kedro project. Integration with IPython and Jupyter has been streamlined in preparation for enhancements to Kedro's interactive workflow. Additionally, the release comes with long-awaited Python 3.9 and 3.10 support 🐍.
- Added `kedro.config.abstract_config.AbstractConfigLoader` as an abstract base class for all `ConfigLoader` implementations. `ConfigLoader` and `TemplatedConfigLoader` now inherit directly from this base class.
- Streamlined the `ConfigLoader.get` and `TemplatedConfigLoader.get` API and delegated the actual `get` method functional implementation to the `kedro.config.common` module.
- The `hook_manager` is no longer a global singleton. The `hook_manager` lifecycle is now managed by the `KedroSession`, and a new `hook_manager` will be created every time a `session` is instantiated.
- Added support for specifying parameters mapping in `pipeline()` without the `params:` prefix.
- Added new API `Pipeline.filter()` (previously in `KedroContext._filter_pipeline()`) to filter parts of a pipeline.
- Added `username` to the session store for logging during Experiment Tracking.
- A packaged Kedro project can now be imported and run from another Python project as follows:

  ```python
  from my_package.__main__ import main

  main(["--pipeline", "my_pipeline"])  # or just main() if no parameters are needed for the run
  ```
- Removed `cli.py` from the Kedro project template. By default, all CLI commands, including `kedro run`, are now defined on the Kedro framework side. You can still define custom CLI commands by creating your own `cli.py`.
- Removed `hooks.py` from the Kedro project template. Registration hooks have been removed in favour of `settings.py` configuration, but you can still define execution timeline hooks by creating your own `hooks.py`.
- Removed the `.ipython` directory from the Kedro project template. The IPython/Jupyter workflow no longer uses IPython profiles; it now uses an IPython extension.
- The default `kedro` run configuration environment names can now be set in `settings.py` using the `CONFIG_LOADER_ARGS` variable. The relevant keyword arguments to supply are `base_env` and `default_run_env`, which are set to `base` and `local` respectively by default. A sketch follows below.
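A sketch of overriding the default run environments in `settings.py`; the `prod` environment name is illustrative:

```python
# settings.py
CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "prod",
}
```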
- Added the following new datasets:

| Type | Description | Location |
| --- | --- | --- |
| `pandas.XMLDataSet` | Read XML into a Pandas DataFrame; write a Pandas DataFrame to XML | `kedro.extras.datasets.pandas` |
| `networkx.GraphMLDataSet` | Work with NetworkX using GraphML files | `kedro.extras.datasets.networkx` |
| `networkx.GMLDataSet` | Work with NetworkX using Graph Modelling Language files | `kedro.extras.datasets.networkx` |
| `redis.PickleDataSet` | Loads/saves data from/to a Redis database | `kedro.extras.datasets.redis` |

- Added `partitionBy` support and exposed `save_args` for `SparkHiveDataSet`.
- Exposed `open_args_save` in `fs_args` for `pandas.ParquetDataSet`.
- Refactored the `load` and `save` operations for `pandas` datasets in order to leverage `pandas`' own API and delegate `fsspec` operations to them. This reduces the need to have our own `fsspec` wrappers.
- Merged `pandas.AppendableExcelDataSet` into `pandas.ExcelDataSet`.
- Added `save_args` to `feather.FeatherDataSet`.
- The only recommended way to work with Kedro in Jupyter or IPython is now the Kedro IPython extension. Managed Jupyter instances should load this via `%load_ext kedro.ipython` and use the line magic `%reload_kedro`.
- `kedro ipython` launches an IPython session that preloads the Kedro IPython extension.
- `kedro jupyter notebook/lab` creates a custom Jupyter kernel that preloads the Kedro IPython extension and launches a notebook with that kernel selected. There is no longer a need to specify `--all-kernels` to show all available kernels.
- Bumped the minimum version of `pandas` to 1.3. Any `storage_options` should continue to be specified under `fs_args` and/or `credentials`.
- Added support for Python 3.9 and 3.10; dropped support for Python 3.6.
- Updated the `black` dependency in the project template to a non pre-release version.
- Documented the distribution of Kedro pipelines with Dask.
- Removed `RegistrationSpecs` and its associated `register_config_loader` and `register_catalog` hook specifications, in favour of `CONFIG_LOADER_CLASS`/`CONFIG_LOADER_ARGS` and `DATA_CATALOG_CLASS` in `settings.py`.
- Removed the deprecated functions `load_context` and `get_project_context`.
- Removed the deprecated `CONF_SOURCE`, `package_name`, `pipeline`, `pipelines`, `config_loader` and `io` attributes from `KedroContext`, as well as the deprecated `KedroContext.run` method.
- Added the `PluginManager` `hook_manager` argument to `KedroContext` and the `Runner.run()` method, which will be provided by the `KedroSession`.
- Removed the public method `get_hook_manager()` and replaced its functionality with `_create_hook_manager()`.
- Enforced that only one run can be successfully executed as part of a `KedroSession`. `run_id` has been renamed to `session_id` as a result.
- The `settings.py` setting `CONF_ROOT` has been renamed to `CONF_SOURCE`. The default value of `conf` remains unchanged.
- The `ConfigLoader` and `TemplatedConfigLoader` argument `conf_root` has been renamed to `conf_source`.
- `extra_params` has been renamed to `runtime_params` in `kedro.config.config.ConfigLoader` and `kedro.config.templated_config.TemplatedConfigLoader`.
- The environment defaulting behaviour has been removed from `KedroContext` and is now implemented in a `ConfigLoader` class (or equivalent) with the `base_env` and `default_run_env` attributes.
- `pandas.ExcelDataSet` now uses the `openpyxl` engine instead of `xlrd`.
- `pandas.ParquetDataSet` now calls `pd.to_parquet()` upon saving. Note that the argument `partition_cols` is not supported.
- The `spark.SparkHiveDataSet` API has been updated to reflect `spark.SparkDataSet`. The `write_mode=insert` option has also been replaced with `write_mode=append` as per the Spark styleguide. This change addresses Issue 725 and Issue 745. Additionally, `upsert` mode now leverages `checkpoint` functionality and requires a valid `checkpointDir` to be set for the current `SparkContext`.
- `yaml.YAMLDataSet` can no longer save a `pandas.DataFrame` directly, but it can save a dictionary. Use `pandas.DataFrame.to_dict()` to convert your `pandas.DataFrame` to a dictionary before you attempt to save it to YAML.
- Removed `open_args_load` and `open_args_save` from the following datasets:
  - `pandas.CSVDataSet`
  - `pandas.ExcelDataSet`
  - `pandas.FeatherDataSet`
  - `pandas.JSONDataSet`
  - `pandas.ParquetDataSet`
- `storage_options` are now dropped if they are specified under `load_args` or `save_args` for the following datasets:
  - `pandas.CSVDataSet`
  - `pandas.ExcelDataSet`
  - `pandas.FeatherDataSet`
  - `pandas.JSONDataSet`
  - `pandas.ParquetDataSet`
- Renamed `lambda_data_set`, `memory_data_set`, and `partitioned_data_set` to `lambda_dataset`, `memory_dataset`, and `partitioned_dataset`, respectively, in `kedro.io`.
- The dataset `networkx.NetworkXDataSet` has been renamed to `networkx.JSONDataSet`.
- Removed `kedro install` in favour of `pip install -r src/requirements.txt` to install project dependencies.
- Removed the `--parallel` flag from `kedro run` in favour of `--runner=ParallelRunner`. The `-p` flag is now an alias for `--pipeline`.
- `kedro pipeline package` has been replaced by `kedro micropkg package` and, in addition to the `--alias` flag used to rename the package, now accepts a module name and path to the pipeline or utility module to package, relative to `src/<package_name>/`. The `--version` CLI option has been removed in favour of setting a `__version__` variable in the micro-package's `__init__.py` file.
- `kedro pipeline pull` has been replaced by `kedro micropkg pull` and now also supports `--destination` to provide a location for pulling the package.
- Removed `kedro pipeline list` and `kedro pipeline describe` in favour of `kedro registry list` and `kedro registry describe`.
- `kedro package` and `kedro micropkg package` now save `egg` and `whl` or `tar` files in the `<project_root>/dist` folder (previously `<project_root>/src/dist`).
- Changed the behaviour of `kedro build-reqs` to compile requirements from `requirements.txt` instead of `requirements.in` and save them to `requirements.lock` instead of `requirements.txt`.
- `kedro jupyter notebook/lab` no longer accept the `--all-kernels` or `--idle-timeout` flags. `--all-kernels` is now the default behaviour.
- `KedroSession.run` now raises `ValueError` rather than `KedroContextError` when the pipeline contains no nodes. The same `ValueError` is raised when there are no matching tags.
- `KedroSession.run` now raises `ValueError` rather than `KedroContextError` when the pipeline name doesn't exist in the pipeline registry.
- Added namespace to parameters in a modular pipeline, which addresses Issue 399.
- Switched from packaging pipelines as wheel files to tar archive files compressed with gzip (`.tar.gz`).
- Removed the decorator API from `Node` and `Pipeline`, as well as the modules `kedro.extras.decorators` and `kedro.pipeline.decorators`.
- Removed the transformer API from `DataCatalog`, as well as the modules `kedro.extras.transformers` and `kedro.io.transformers`.
- Removed the `Journal` and `DataCatalogWithDefault`.
- Removed the `%init_kedro` IPython line magic, with its functionality incorporated into `%reload_kedro`. This means that if `%reload_kedro` is called with a filepath, that will be set as default for subsequent calls.
- Remove any existing `hook_impl` of the `register_config_loader` and `register_catalog` methods from `ProjectHooks` in `hooks.py` (or custom alternatives).
- If you use `run_id` in the `after_catalog_created` hook, replace it with `save_version` instead.
- If you use `run_id` in any of the `before_node_run`, `after_node_run`, `on_node_error`, `before_pipeline_run`, `after_pipeline_run` or `on_pipeline_error` hooks, replace it with `session_id` instead.
- If you use a custom config loader class such as `kedro.config.TemplatedConfigLoader`, alter `CONFIG_LOADER_CLASS` to specify the class and `CONFIG_LOADER_ARGS` to specify keyword arguments. If not set, these default to `kedro.config.ConfigLoader` and an empty dictionary respectively (see the `settings.py` sketch after this list).
- If you use a custom data catalog class, alter `DATA_CATALOG_CLASS` to specify the class. If not set, this defaults to `kedro.io.DataCatalog`.
- If you have a custom config location (i.e. not `conf`), update `CONF_ROOT` to `CONF_SOURCE` and set it to a string with the expected configuration location. If not set, this defaults to `"conf"`.
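A sketch of the resulting `settings.py`; the `TemplatedConfigLoader` arguments shown are illustrative:

```python
# settings.py
from kedro.config import TemplatedConfigLoader
from kedro.io import DataCatalog

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}
DATA_CATALOG_CLASS = DataCatalog
CONF_SOURCE = "conf"
```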
- If you use any modular pipelines with parameters, make sure they are declared with the correct namespace. See the example below.

For a given pipeline:

```python
active_pipeline = pipeline(
    pipe=[
        node(
            func=some_func,
            inputs=["model_input_table", "params:model_options"],
            outputs=["**my_output"],
        ),
        ...,
    ],
    inputs="model_input_table",
    namespace="candidate_modelling_pipeline",
)
```
The parameters should look like this:

```diff
-model_options:
-  test_size: 0.2
-  random_state: 8
-  features:
-    - engines
-    - passenger_capacity
-    - crew
+candidate_modelling_pipeline:
+  model_options:
+    test_size: 0.2
+    random_state: 8
+    features:
+      - engines
+      - passenger_capacity
+      - crew
```
- Optional: You can now remove the `params:` prefix when supplying values to the `parameters` argument in a `pipeline()` call.
- If you pull modular pipelines with `kedro pipeline pull my_pipeline --alias other_pipeline`, now use `kedro micropkg pull my_pipeline --alias pipelines.other_pipeline` instead.
- If you package modular pipelines with `kedro pipeline package my_pipeline`, now use `kedro micropkg package pipelines.my_pipeline` instead.
- Similarly, if you package any modular pipelines using `pyproject.toml`, you should modify the keys to include the full module path, wrapped in double quotes, e.g.:

```diff
 [tool.kedro.micropkg.package]
-data_engineering = {destination = "path/to/here"}
-data_science = {alias = "ds", env = "local"}
+"pipelines.data_engineering" = {destination = "path/to/here"}
+"pipelines.data_science" = {alias = "ds", env = "local"}

 [tool.kedro.micropkg.pull]
-"s3://my_bucket/my_pipeline" = {alias = "aliased_pipeline"}
+"s3://my_bucket/my_pipeline" = {alias = "pipelines.aliased_pipeline"}
```
- If you use `pandas.ExcelDataSet`, make sure you have `openpyxl` installed in your environment. This is automatically installed if you specify `kedro[pandas.ExcelDataSet]==0.18.0` in your `requirements.txt`. You can uninstall `xlrd` if you were only using it for this dataset.
- If you use `pandas.ParquetDataSet`, pass pandas saving arguments directly to `save_args` instead of nested in `from_pandas` (e.g. `save_args = {"preserve_index": False}` instead of `save_args = {"from_pandas": {"preserve_index": False}}`).
- If you use `spark.SparkHiveDataSet` with the `write_mode` option set to `insert`, change this to `append` in line with the Spark styleguide. If you use `spark.SparkHiveDataSet` with the `write_mode` option set to `upsert`, make sure that your `SparkContext` has a valid `checkpointDir` set, either by the `SparkContext.setCheckpointDir` method or directly in the `conf` folder.
- If you use `pandas~=1.2.0` and pass `storage_options` through `load_args` or `save_args`, specify them under `fs_args` or via `credentials` instead.
- If you import from `kedro.io.lambda_data_set`, `kedro.io.memory_data_set`, or `kedro.io.partitioned_data_set`, change the import to `kedro.io.lambda_dataset`, `kedro.io.memory_dataset`, or `kedro.io.partitioned_dataset`, respectively (or import the dataset directly from `kedro.io`).
- If you have any `pandas.AppendableExcelDataSet` entries in your catalog, replace them with `pandas.ExcelDataSet`.
- If you have any `networkx.NetworkXDataSet` entries in your catalog, replace them with `networkx.JSONDataSet`.
- Edit any scripts containing `kedro pipeline package --version` to use `kedro micropkg package` instead. If you wish to set a specific pipeline package version, set the `__version__` variable in the pipeline package's `__init__.py` file.
- To run a pipeline in parallel, use `kedro run --runner=ParallelRunner` rather than `--parallel` or `-p`.
- If you call `ConfigLoader` or `TemplatedConfigLoader` directly, update the keyword arguments `conf_root` to `conf_source` and `extra_params` to `runtime_params`.
- If you use `KedroContext` to access `ConfigLoader`, use `settings.CONFIG_LOADER_CLASS` to access the currently used `ConfigLoader` instead.
- The signature of `KedroContext` has changed and now needs `config_loader` and `hook_manager` as additional arguments of type `ConfigLoader` and `PluginManager` respectively.
- `pipeline` now accepts `tags` and a collection of `Node`s and/or `Pipeline`s rather than just a single `Pipeline` object. `pipeline` should be used in preference to `Pipeline` when creating a Kedro pipeline (see the sketch after this list).
- `pandas.SQLTableDataSet` and `pandas.SQLQueryDataSet` now only open one connection per database, at instantiation time (therefore at catalog creation time), rather than one per load/save operation.
- Added a new command group, `micropkg`, to replace `kedro pipeline pull` and `kedro pipeline package` with `kedro micropkg pull` and `kedro micropkg package` for Kedro 0.18.0. `kedro micropkg package` saves packages to `project/dist` while `kedro pipeline package` saves packages to `project/src/dist`.
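A sketch of the updated `pipeline()` factory accepting tags and a mix of `Node` and `Pipeline` objects; the functions and dataset names are illustrative:

```python
from kedro.pipeline import node, pipeline


def identity(x):
    return x


cleaning = pipeline([node(identity, "raw_data", "clean_data", name="clean_node")])

combined = pipeline(
    [cleaning, node(identity, "clean_data", "final_data", name="final_node")],
    tags="daily",
)
```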
- Added tutorial documentation for experiment tracking.
- Added Plotly dataset documentation.
- Added the upper limit `pandas<1.4` to maintain compatibility with `xlrd~=1.0`.
- Bumped the `Pillow` minimum version requirement to 9.0 (Python 3.7+ only) following CVE-2022-22817.
- Fixed `PickleDataSet` to be copyable and hence work with the parallel runner.
- Upgraded `pip-tools`, which is used by `kedro build-reqs`, to 6.5 (Python 3.7+ only). This `pip-tools` version is compatible with `pip>=21.2`, including the most recent releases of `pip`. Python 3.6 users should continue to use `pip-tools` 6.4 and `pip<22`.
- Added `astro-iris` as an alias for `astro-airflow-iris`, so that old tutorials can still be followed.
- Added details about Kedro's Technical Steering Committee and governance model.
- `kedro pipeline pull` and `kedro pipeline package` will be deprecated. Please use `kedro micropkg` instead.
- Added a `pipelines` global variable to the IPython extension, allowing you to access the project's pipelines in `kedro ipython` or `kedro jupyter notebook`.
- Enabled overriding nested parameters with `params` in the CLI, i.e. `kedro run --params="model.model_tuning.booster:gbtree"` updates parameters to `{"model": {"model_tuning": {"booster": "gbtree"}}}`.
- Added an option to `pandas.SQLQueryDataSet` to specify a `filepath` with a SQL query, in addition to the current method of supplying the query itself in the `sql` argument (see the sketch after this list).
- Extended `ExcelDataSet` to support saving Excel files with multiple sheets.
- Added the following new datasets:

| Type | Description | Location |
| --- | --- | --- |
| `plotly.JSONDataSet` | Works with plotly graph object Figures (saves as json file) | `kedro.extras.datasets.plotly` |
| `pandas.GenericDataSet` | Provides a 'best effort' facility to read / write any format provided by the `pandas` library | `kedro.extras.datasets.pandas` |
| `pandas.GBQQueryDataSet` | Loads data from a Google Bigquery table using provided SQL query | `kedro.extras.datasets.pandas` |
| `spark.DeltaTableDataSet` | Dataset designed to handle Delta Lake Tables and their CRUD-style operations, including `update`, `merge` and `delete` | `kedro.extras.datasets.spark` |
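A sketch of pointing `pandas.SQLQueryDataSet` at a SQL file instead of an inline query; the file path and credentials entry are illustrative:

```yaml
# catalog.yml
shuttles_query_results:
  type: pandas.SQLQueryDataSet
  filepath: queries/shuttles.sql
  credentials: db_credentials   # must provide the `con` connection string
```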
- Fixed an issue where `kedro new --config config.yml` was ignoring the config file when `prompts.yml` didn't exist.
- Added documentation for `kedro viz --autoreload`.
- Added support for arbitrary backends (via importable module paths) that satisfy the `pickle` interface to `PickleDataSet` (see the sketch after this list).
- Added support for `sum` syntax for connecting pipeline objects.
- Upgraded `pip-tools`, which is used by `kedro build-reqs`, to 6.4. This `pip-tools` version requires `pip>=21.2` while adding support for `pip>=21.3`. To upgrade `pip`, please refer to their documentation.
- Relaxed the bounds on the `plotly` requirement for `plotly.PlotlyDataSet` and the `pyarrow` requirement for `pandas.ParquetDataSet`.
- `kedro pipeline package <pipeline>` now raises an error if the `<pipeline>` argument doesn't look like a valid Python module path (e.g. has `/` instead of `.`).
- Added a new `overwrite` argument to `PartitionedDataSet` and `MatplotlibWriter` to enable deletion of existing partitions and plots on dataset `save`.
- `kedro pipeline pull` now works when the project requirements contain entries such as `-r`, `--extra-index-url` and local wheel files (Issue #913).
- Fixed slow startup because of catalog processing by reducing the exponential growth of extra processing during `_FrozenDatasets` creations.
- Removed `.coveragerc` from the Kedro project template. `coverage` settings are now given in `pyproject.toml`.
- Fixed a bug where packaging or pulling a modular pipeline with the same name as the project's package name would throw an error (or silently pass without including the pipeline source code in the wheel file).
- Removed an unintentional dependency on `git`.
- Fixed an issue where nested pipeline configuration was not included in the packaged pipeline.
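A sketch of an arbitrary pickle-compatible backend on `PickleDataSet`; the dataset name and the `compress_pickle` backend are illustrative:

```yaml
# catalog.yml
regressor_model:
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor.pkl
  backend: compress_pickle   # any importable module exposing load/dump
```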
- Deprecated the "Thanks for supporting contributions" section of release notes to simplify the contribution process; Kedro 0.17.6 is the last release that includes this. This process has been replaced with the automatic GitHub feature.
- Fixed a bug where the version on the tracking datasets didn't match the session id and the versions of regular versioned datasets.
- Fixed an issue where datasets in `load_versions` that are not found in the data catalog would silently pass.
- Altered the string representation of nodes so that node inputs/outputs order is preserved rather than being alphabetically sorted.
- Updated `APIDataSet` to accept `auth` through `credentials` and allow any iterable for `auth`.
- `kedro.extras.decorators` and `kedro.pipeline.decorators` are being deprecated in favour of Hooks.
- `kedro.extras.transformers` and `kedro.io.transformers` are being deprecated in favour of Hooks.
- The `--parallel` flag on `kedro run` is being removed in favour of `--runner=ParallelRunner`. The `-p` flag will change to be an alias for `--pipeline`.
- `kedro.io.DataCatalogWithDefault` is being deprecated, to be removed entirely in 0.18.0.
Deepyaman Datta, Brites, Manish Swami, Avaneesh Yembadi, Zain Patel, Simon Brugman, Kiyo Kunii, Benjamin Levy, Louis de Charsonville, Simon Picard
- Added a new CLI group, `registry`, with the associated commands `kedro registry list` and `kedro registry describe`, to replace `kedro pipeline list` and `kedro pipeline describe`.
- Added support for dependency management at a modular pipeline level. When a pipeline with `requirements.txt` is packaged, its dependencies are embedded in the modular pipeline wheel file. Upon pulling the pipeline, Kedro will append dependencies to the project's `requirements.in`. More information is available in our documentation.
- Added support for bulk packaging/pulling modular pipelines using `kedro pipeline package/pull --all` and `pyproject.toml`.
- Removed `cli.py` from the Kedro project template. By default all CLI commands, including `kedro run`, are now defined on the Kedro framework side. These can be overridden in turn by a plugin or a `cli.py` file in your project. A packaged Kedro project will respect the same hierarchy when executed with `python -m my_package`.
- Removed `.ipython/profile_default/startup/` from the Kedro project template in favour of `.ipython/profile_default/ipython_config.py` and the `kedro.extras.extensions.ipython` extension.
- Added support for the `dill` backend to `PickleDataSet`.
- Imports are now refactored at `kedro pipeline package` and `kedro pipeline pull` time, so that aliasing a modular pipeline doesn't break it.
- Added the following new datasets to support basic Experiment Tracking:

| Type | Description | Location |
| --- | --- | --- |
| `tracking.MetricsDataSet` | Dataset to track numeric metrics for experiment tracking | `kedro.extras.datasets.tracking` |
| `tracking.JSONDataSet` | Dataset to track data for experiment tracking | `kedro.extras.datasets.tracking` |
- Bumped the minimum required `fsspec` version to 2021.04.
- Fixed the `kedro install` and `kedro build-reqs` flows when uninstalled dependencies are present in a project's `settings.py`, `context.py` or `hooks.py` (Issue #829).
- Imports are now refactored at `kedro pipeline package` and `kedro pipeline pull` time, so that aliasing a modular pipeline doesn't break it.
- Pinned `dynaconf` to `<3.1.6` because the method signature for `_validate_items`, which is used in Kedro, changed.
- `kedro pipeline list` and `kedro pipeline describe` are being deprecated in favour of the new commands `kedro registry list` and `kedro registry describe`.
- `kedro install` is being deprecated in favour of using `pip install -r src/requirements.txt` to install project dependencies.
- Added the following new datasets:

| Type | Description | Location |
| --- | --- | --- |
| `plotly.PlotlyDataSet` | Works with plotly graph object Figures (saves as json file) | `kedro.extras.datasets.plotly` |

- Defined our set of Kedro Principles! Have a read through our docs.
- `ConfigLoader.get()` now raises a `BadConfigException`, with a more helpful error message, if a configuration file cannot be loaded (for instance due to wrong syntax or poor formatting).
- `run_id` now defaults to `save_version` when `after_catalog_created` is called, similarly to what happens during a `kedro run`.
- Fixed a bug where `kedro ipython` and `kedro jupyter notebook` didn't work if the `PYTHONPATH` was already set.
- Updated the IPython extension to allow passing `env` and `extra_params` to `reload_kedro`, similar to how the IPython script works.
- `kedro info` now outputs if a plugin has any `hooks` or `cli_hooks` implemented.
- `PartitionedDataSet` now supports lazily materializing data on save.
- `kedro pipeline describe` now defaults to the `__default__` pipeline when no pipeline name is provided and also shows the namespace the nodes belong to.
- Fixed an issue where `spark.SparkDataSet` with enabled versioning would throw a `VersionNotFoundError` when using databricks-connect from a remote machine and saving to the dbfs filesystem.
- `EmailMessageDataSet` added to doctree.
- When node inputs do not pass validation, the error message is now shown as the most recent exception in the traceback (Issue #761).
- `kedro pipeline package` now only packages the parameter file that exactly matches the pipeline name specified and the parameter files in a directory with the pipeline name.
- Extended support to newer versions of third-party dependencies (Issue #735).
- Ensured consistent references to `model input` tables in accordance with our Data Engineering convention.
- Changed behaviour where `kedro pipeline package` takes the pipeline package version, rather than the kedro package version. If the pipeline package version is not present, then the package version is used.
- Launched GitHub Discussions and the Kedro Discord Server.
- Improved the error message when versioning is enabled for a dataset previously saved as non-versioned (Issue #625).
- Kedro plugins can now override built-in CLI commands.
- Added a `before_command_run` hook for plugins to add extra behaviour before Kedro CLI commands run (see the sketch after this list).
- `pipelines` from `pipeline_registry.py` and `register_pipeline` hooks are now loaded lazily when they are first accessed, not on startup:

  ```python
  from kedro.framework.project import pipelines

  print(pipelines["__default__"])  # pipeline loading is only triggered here
  ```
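A minimal sketch of the `before_command_run` CLI hook in a plugin (CLI hooks are registered by plugins rather than in the project); the class name and print statement are illustrative:

```python
from kedro.framework.cli.hooks import cli_hook_impl


class MyPluginCLIHooks:
    @cli_hook_impl
    def before_command_run(self, project_metadata, command_args):
        # Runs before any Kedro CLI command in a Kedro project
        print(f"Command called with args: {command_args}")
```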
- `TemplatedConfigLoader` now correctly inserts default values when no globals are supplied.
- Fixed a bug where the `KEDRO_ENV` environment variable had no effect on instantiating the `context` variable in an iPython session or a Jupyter notebook.
- Plugins with empty CLI groups are no longer displayed in the Kedro CLI help screen.
- Duplicate commands will no longer appear twice in the Kedro CLI help screen.
- CLI commands from sources with the same name will show under one list in the help screen.
- The setup of a Kedro project, including adding src to path and configuring settings, is now handled via the `bootstrap_project` method.
- `configure_project` is invoked if a `package_name` is supplied to `KedroSession.create`. This is added for backward-compatibility purposes to support a workflow that creates a `Session` manually. It will be removed in `0.18.0`.
- Stopped swallowing up all `ModuleNotFoundError` if `register_pipelines` is not found, so that a more helpful error message will appear when a dependency is missing, e.g. Issue #722.
- When `kedro new` is invoked using a configuration yaml file, `output_dir` is no longer a required key; by default the current working directory will be used.
- When `kedro new` is invoked using a configuration yaml file, the appropriate `prompts.yml` file is now used for validating the provided configuration. Previously, validation was always performed against the kedro project template `prompts.yml` file.
- When a relative path to a starter template is provided, `kedro new` now generates user prompts to obtain configuration rather than supplying empty configuration.
- Fixed an error when using starters on Windows with Python 3.7 (Issue #722).
- Fixed a decoding error of config files that contain accented characters by opening them for reading in UTF-8.
- Fixed an issue where an `after_dataset_loaded` run would finish before a dataset is actually loaded when using the `--async` flag.
- `kedro.versioning.journal.Journal` will be removed.
- The following properties on `kedro.framework.context.KedroContext` will be removed:
  - `io`, in favour of `KedroContext.catalog`
  - `pipeline` (equivalent to `pipelines["__default__"]`)
  - `pipelines`, in favour of `kedro.framework.project.pipelines`
Added support for
compress_pickle
backend toPickleDataSet
. -
Enabled loading pipelines without creating a
KedroContext
instance:from kedro.framework.project import pipelines print(pipelines)
-
Projects generated with kedro>=0.17.2:
- should define pipelines in
pipeline_registry.py
rather thanhooks.py
. - when run as a package, will behave the same as
kedro run
- should define pipelines in
- If
settings.py
is not importable, the errors will be surfaced earlier in the process, rather than at runtime.
- `kedro pipeline list` and `kedro pipeline describe` no longer accept the redundant `--env` parameter.
- `from kedro.framework.cli.cli import cli` no longer includes the `new` and `starter` commands.
- `kedro.framework.context.KedroContext.run` will be removed in release 0.18.0.
- Added `env` and `extra_params` to the `reload_kedro()` line magic.
- Extended the `pipeline()` API to allow strings and sets of strings as `inputs` and `outputs`, to specify when a dataset name remains the same (not namespaced).
- Added the ability to add custom prompts with regexp validators for starters by repurposing `default_config.yml` as `prompts.yml`.
- Added the `env` and `extra_params` arguments to the `register_config_loader` hook.
- Refactored the way `settings` are loaded. You will now be able to run:

  ```python
  from kedro.framework.project import settings

  print(settings.CONF_ROOT)
  ```

- Added a check on `kedro.runner.parallel_runner.ParallelRunner` which checks datasets for the `_SINGLE_PROCESS` attribute in the `_validate_catalog` method. If this attribute is set to `True` in an instance of a dataset (e.g. `SparkDataSet`), the `ParallelRunner` will raise an `AttributeError`.
- Any user-defined dataset that should not be used with `ParallelRunner` may now have the `_SINGLE_PROCESS` attribute set to `True`, as sketched below.
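A sketch of a user-defined dataset opting out of `ParallelRunner` via `_SINGLE_PROCESS`; the class and its placeholder methods are illustrative:

```python
from kedro.io import AbstractDataSet


class NonSerialisableDataSet(AbstractDataSet):
    _SINGLE_PROCESS = True  # ParallelRunner will refuse to run with this dataset

    def _load(self):
        raise NotImplementedError

    def _save(self, data):
        raise NotImplementedError

    def _describe(self):
        return {}
```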
- The version of a packaged modular pipeline now defaults to the version of the project package.
- Added a fix to prevent new lines being added to pandas CSV datasets.
- Fixed an issue with loading a versioned `SparkDataSet` in the interactive workflow.
- Kedro CLI now checks `pyproject.toml` for a `tool.kedro` section before treating the project as a Kedro project.
- Added a fix to `DataCatalog::shallow_copy` so that it now copies layers.
- `kedro pipeline pull` now uses `pip download` for protocols that are not supported by `fsspec`.
- Cleaned up documentation to fix broken links and rewrite permanently redirected ones.
- Added a `jsonschema` schema definition for the Kedro 0.17 catalog.
- `kedro install` now waits on Windows until all the requirements are installed.
- Exposed the `--to-outputs` option in the CLI, throughout the codebase, and as part of hooks specifications.
- Fixed a bug where `ParquetDataSet` wasn't creating parent directories on the fly.
- Updated documentation.
- This release has broken the `kedro ipython` and `kedro jupyter` workflows. To fix this, follow the instructions in the migration guide below.
- You will also need to upgrade `kedro-viz` to 3.10.1 if you use the `%run_viz` line magic in Jupyter Notebook.

Note: If you're using the `ipython` extension instead, you will not encounter this problem.
You will have to update the file `<your_project>/.ipython/profile_default/startup/00-kedro-init.py` in order to make `kedro ipython` and/or `kedro jupyter` work. Add the following line before the `KedroSession` is created:

```python
configure_project(metadata.package_name)  # to add

session = KedroSession.create(metadata.package_name, path)
```

Make sure that the associated import is provided in the same place as the others in the file:

```python
from kedro.framework.project import configure_project  # to add
from kedro.framework.session import KedroSession
```
Mariana Silva, Kiyohito Kunii, noklam, Ivan Doroshenko, Zain Patel, Deepyaman Datta, Sam Hiscox, Pascal Brokmeier
- In a significant change, we have introduced `KedroSession`, which is responsible for managing the lifecycle of a Kedro run (see the sketch after this list).
- Created a new Kedro starter: `kedro new --starter=mini-kedro`. It is possible to use the DataCatalog as a standalone component in a Jupyter notebook and transition into the rest of the Kedro framework.
- Added `DatasetSpecs` with Hooks to run before and after datasets are loaded from/saved to the catalog.
- Added a command: `kedro catalog create`. For a registered pipeline, it creates a `<conf_root>/<env>/catalog/<pipeline_name>.yml` configuration file with `MemoryDataSet` datasets for each dataset that is missing from `DataCatalog`.
- Added `settings.py` and `pyproject.toml` (to replace `.kedro.yml`) for project configuration, in line with Python best practice.
- `ProjectContext` is no longer needed, unless for very complex customisations. `KedroContext`, `ProjectHooks` and `settings.py` together implement sensible default behaviour. As a result `context_path` is also now an optional key in `pyproject.toml`.
- Removed `ProjectContext` from `src/<package_name>/run.py`.
- `TemplatedConfigLoader` now supports Jinja2 template syntax alongside its original syntax.
- Made registration Hooks mandatory, as the only way to customise the `ConfigLoader` or the `DataCatalog` used in a project. If no such Hook is provided in `src/<package_name>/hooks.py`, a `KedroContextError` is raised. There are sensible defaults defined in any project generated with Kedro >= 0.16.5.
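A minimal sketch of running a project through `KedroSession`, as the workflow looked around this release; the package name is illustrative, and later Kedro versions accept a project path instead:

```python
from kedro.framework.session import KedroSession

with KedroSession.create("my_project") as session:
    session.run()
```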
- `ParallelRunner` no longer results in a run failure, when triggered from a notebook, if the run is started using `KedroSession` (`session.run()`).
- `before_node_run` can now overwrite node inputs by returning a dictionary with the corresponding updates.
- Added a minimal, black-compatible flake8 configuration to the project template.
- Moved `isort` and `pytest` configuration from `<project_root>/setup.cfg` to `<project_root>/pyproject.toml`.
- Extra parameters are no longer incorrectly passed from `KedroSession` to `KedroContext`.
- Relaxed `pyspark` requirements to allow for installation of `pyspark` 3.0.
- Added a `--fs-args` option to the `kedro pipeline pull` command to specify configuration options for the `fsspec` filesystem arguments used when pulling modular pipelines from non-PyPI locations.
- Bumped the maximum required `fsspec` version to 0.9.
- Bumped the maximum supported `s3fs` version to 0.5 (the `S3FileSystem` interface has changed since version 0.4.1).
- In Kedro 0.17.0 we have deleted the deprecated `kedro.cli` and `kedro.context` modules in favour of `kedro.framework.cli` and `kedro.framework.context` respectively.
- `kedro.io.DataCatalog.exists()` returns `False` when the dataset does not exist, as opposed to raising an exception.
- The pipeline-specific `catalog.yml` file is no longer automatically created for modular pipelines when running `kedro pipeline create`. Use `kedro catalog create` to replace this functionality.
- Removed the `include_examples` prompt from `kedro new`. To generate boilerplate example code, you should use a Kedro starter.
- Changed the `--verbose` flag from a global command to a project-specific command flag (e.g. `kedro --verbose new` becomes `kedro new --verbose`).
- Dropped support of the `dataset_credentials` key in credentials in `PartitionedDataSet`.
- `get_source_dir()` was removed from `kedro/framework/cli/utils.py`.
- Dropped support of the `get_config`, `create_catalog`, `create_pipeline`, `template_version`, `project_name` and `project_path` keys by the `get_project_context()` function (`kedro/framework/cli/cli.py`).
- `kedro new --starter` now defaults to fetching the starter template matching the installed Kedro version.
- Renamed `kedro_cli.py` to `cli.py` and moved it inside the Python package (`src/<package_name>/`), for a better packaging and deployment experience.
- Removed `.kedro.yml` from the project template and replaced it with `pyproject.toml`.
- Removed the `KEDRO_CONFIGS` constant (previously residing in `kedro.framework.context.context`).
- Modified the `kedro pipeline create` CLI command to add a boilerplate parameter config file in `conf/<env>/parameters/<pipeline_name>.yml` instead of `conf/<env>/pipelines/<pipeline_name>/parameters.yml`. The CLI commands `kedro pipeline delete`/`package`/`pull` were updated accordingly.
- Removed `get_static_project_data` from `kedro.framework.context`.
- Removed `KedroContext.static_data`.
- The `KedroContext` constructor now takes `package_name` as its first argument.
- Replaced the `context` property on `KedroSession` with the `load_context()` method.
- Renamed `_push_session` and `_pop_session` in `kedro.framework.session.session` to `_activate_session` and `_deactivate_session` respectively.
- The custom context class is set via the `CONTEXT_CLASS` variable in `src/<your_project>/settings.py`.
- Removed the `KedroContext.hooks` attribute. Instead, hooks should be registered in `src/<your_project>/settings.py` under the `HOOKS` key (see the sketch after this list).
- Restricted names given to nodes to match the regex pattern `[\w\.-]+$`.
- Removed `KedroContext._create_config_loader()` and `KedroContext._create_data_catalog()`. They have been replaced by registration hooks, namely `register_config_loader()` and `register_catalog()` (see also upcoming deprecations).
- `kedro.framework.context.load_context` will be removed in release 0.18.0.
- `kedro.framework.cli.get_project_context` will be removed in release 0.18.0.
- We've added a `DeprecationWarning` to the decorator API for both `node` and `pipeline`. These will be removed in release 0.18.0. Use Hooks to extend a node's behaviour instead.
- We've added a `DeprecationWarning` to the Transformers API when adding a transformer to the catalog. These will be removed in release 0.18.0. Use Hooks to customise the `load` and `save` methods.
Deepyaman Datta, Zach Schuster
Reminder: Our documentation on how to upgrade Kedro covers a few key things to remember when updating any Kedro version.
The Kedro 0.17.0 release contains some breaking changes. If you update Kedro to 0.17.0 and then try to work with projects created against earlier versions of Kedro, you may encounter some issues when trying to run `kedro` commands in the terminal for that project. Here's a short guide to getting your projects running against the new version of Kedro.
Note: As always, if you hit any problems, please check out our documentation:
To get an existing Kedro project to work after you upgrade to Kedro 0.17.0, we recommend that you create a new project against Kedro 0.17.0 and move the code from your existing project into it. Let's go through the changes, but first, note that if you create a new Kedro project with Kedro 0.17.0 you will not be asked whether you want to include the boilerplate code for the Iris dataset example. We've removed this option (you should now use a Kedro starter if you want to create a project that is pre-populated with code).
To create a new, blank Kedro 0.17.0 project to drop your existing code into, run `kedro new` as usual. We also recommend creating a new virtual environment for your new project, or you might run into conflicts with existing dependencies.
-   Update `pyproject.toml`: Copy the following three keys from the `.kedro.yml` of your existing Kedro project into the `pyproject.toml` file of your new Kedro 0.17.0 project:

    ```toml
    [tool.kedro]
    package_name = "<package_name>"
    project_name = "<project_name>"
    project_version = "0.17.0"
    ```

    Check your source directory. If you defined a different source directory (`source_dir`), make sure you also move that to `pyproject.toml`.

-   Copy files from your existing project:
    -   Copy subfolders of `project/src/project_name/pipelines` from existing to new project
    -   Copy subfolders of `project/src/test/pipelines` from existing to new project
    -   Copy the requirements your project needs into `requirements.txt` and/or `requirements.in`.
    -   Copy your project configuration from the `conf` folder. Take note of the new locations needed for modular pipeline configuration (move it from `conf/<env>/pipeline_name/catalog.yml` to `conf/<env>/catalog/pipeline_name.yml` and likewise for `parameters.yml`).
    -   Copy from the `data/` folder of your existing project, if needed, into the same location in your new project.
    -   Copy any Hooks from `src/<package_name>/hooks.py`.
-   Update your new project's README and docs as necessary.
-   Update `settings.py`: For example, if you specified additional Hook implementations in `hooks`, or listed plugins under `disable_hooks_by_plugin` in your `.kedro.yml`, you will need to move them to `settings.py` accordingly:

    ```python
    from <package_name>.hooks import MyCustomHooks, ProjectHooks

    HOOKS = (ProjectHooks(), MyCustomHooks())

    DISABLE_HOOKS_FOR_PLUGINS = ("my_plugin1",)
    ```

-   Migration for `node` names. From 0.17.0 the only allowed characters for node names are letters, digits, hyphens, underscores and/or full stops. If you have previously defined node names that have special characters, spaces or other characters that are no longer permitted, you will need to rename those nodes.
-   Copy changes to `kedro_cli.py`. If you previously customised the `kedro run` command or added more CLI commands to your `kedro_cli.py`, you should move them into `<project_root>/src/<package_name>/cli.py`. Note, however, that the new way to run a Kedro pipeline is via a `KedroSession`, rather than using the `KedroContext`:

    ```python
    with KedroSession.create(package_name=...) as session:
        session.run()
    ```

-   Copy changes made to `ConfigLoader`. If you have defined a custom class, such as `TemplatedConfigLoader`, by overriding `ProjectContext._create_config_loader`, you should move the contents of the function into `src/<package_name>/hooks.py`, under `register_config_loader`.
-   Copy changes made to `DataCatalog`. Likewise, if you have `DataCatalog` defined with `ProjectContext._create_catalog`, you should copy-paste the contents into `register_catalog`.
-   Optional: If you have plugins such as Kedro-Viz installed, it's likely that Kedro 0.17.0 won't work with their older versions, so please either upgrade to the plugin's newest version or follow their migration guides.
- Added documentation with a focus on single machine and distributed environment deployment; the series includes Docker, Argo, Prefect, Kubeflow, AWS Batch, AWS Sagemaker and extends our section on Databricks.
- Added kedro-starter-spaceflights alias for generating a project: `kedro new --starter spaceflights`.
- Fixed `TypeError` when converting dict inputs to a node made from a wrapped `partial` function.
- `PartitionedDataSet` improvements:
  - Supported passing arguments to the underlying filesystem.
- Improved handling of non-ASCII word characters in dataset names.
  - For example, a dataset named `jalapeño` will be accessible as `DataCatalog.datasets.jalapeño` rather than `DataCatalog.datasets.jalape__o`.
- Fixed `kedro install` for an Anaconda environment defined in `environment.yml`.
- Fixed backwards compatibility with templates generated with older Kedro versions <0.16.5. No longer need to update `.kedro.yml` to use `kedro lint` and `kedro jupyter notebook convert`.
- Improved documentation.
- Added documentation using MinIO with Kedro.
- Improved error messages for incorrect parameters passed into a node.
- Fixed issue with saving a `TensorFlowModelDataset` in the HDF5 format with versioning enabled.
- Added missing `run_result` argument in `after_pipeline_run` Hooks spec.
- Fixed a bug in IPython script that was causing context hooks to be registered twice. To apply this fix to a project generated with an older Kedro version, apply the same changes made in this PR to your `00-kedro-init.py` file.
- Improved documentation.
Deepyaman Datta, Bhavya Merchant, Lovkush Agarwal, Varun Krishna S, Sebastian Bertoli, noklam, Daniel Petti, Waylon Walker, Saran Balaji C
- Added the following new datasets.

Type | Description | Location
---|---|---
`email.EmailMessageDataSet` | Manage email messages using the Python standard library | `kedro.extras.datasets.email`
- Added support for `pyproject.toml` to configure Kedro. `pyproject.toml` is used if `.kedro.yml` doesn't exist (Kedro configuration should be under the `[tool.kedro]` section).
- Projects created with this version will have no `pipeline.py`, having been replaced by `hooks.py`.
- Added a set of registration hooks, as the new way of registering library components with a Kedro project:
  - `register_pipelines()`, to replace `_get_pipelines()`
  - `register_config_loader()`, to replace `_create_config_loader()`
  - `register_catalog()`, to replace `_create_catalog()`

  These can be defined in `src/<python_package>/hooks.py` and added to `.kedro.yml` (or `pyproject.toml`). The order of execution is: plugin hooks, `.kedro.yml` hooks, hooks in `ProjectContext.hooks`. A minimal sketch of these hooks is shown after this list.
- Added ability to disable auto-registered Hooks using `.kedro.yml` (or `pyproject.toml`) configuration file.
- Added option to run asynchronously via the Kedro CLI.
- Absorbed `.isort.cfg` settings into `setup.cfg`.
- Packaging a modular pipeline raises an error if the pipeline directory is empty or non-existent.
- `project_name`, `project_version` and `package_name` now have to be defined in `.kedro.yml` for projects using Kedro 0.16.5+.
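A minimal sketch of what these registration hooks might look like in `src/<python_package>/hooks.py`. The class name, the empty pipeline and the plain `ConfigLoader` are illustrative assumptions; substitute your own pipelines, config loader and catalog logic:

```python
from typing import Any, Dict, Iterable, Optional

from kedro.config import ConfigLoader
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro.versioning import Journal


class ProjectHooks:
    @hook_impl
    def register_pipelines(self) -> Dict[str, Pipeline]:
        # Replaces _get_pipelines(): return a mapping of pipeline names to Pipeline objects.
        my_pipeline = Pipeline([])  # assemble your real pipeline(s) here
        return {"__default__": my_pipeline}

    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        # Replaces _create_config_loader().
        return ConfigLoader(conf_paths)

    @hook_impl
    def register_catalog(
        self,
        catalog: Optional[Dict[str, Dict[str, Any]]],
        credentials: Dict[str, Dict[str, Any]],
        load_versions: Dict[str, str],
        save_version: str,
        journal: Journal,
    ) -> DataCatalog:
        # Replaces _create_catalog().
        return DataCatalog.from_config(
            catalog, credentials, load_versions, save_version, journal
        )
```

The hooks are picked up once the `ProjectHooks` instance is declared in `.kedro.yml` (or `pyproject.toml`) as described above.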
This release has accidentally broken the usage of `kedro lint` and `kedro jupyter notebook convert` on a project template generated with previous versions of Kedro (<=0.16.4). To amend this, please either upgrade to `kedro==0.16.6` or update `.kedro.yml` within your project root directory to include the following keys:

```yaml
project_name: "<your_project_name>"
project_version: "<kedro_version_of_the_project>"
package_name: "<your_package_name>"
```
Deepyaman Datta, Bas Nijholt, Sebastian Bertoli
- Fixed a bug for using `ParallelRunner` on Windows.
- Enabled auto-discovery of hooks implementations coming from installed plugins.
- Fixed a bug for using `ParallelRunner` on Windows.
- Modified `GBQTableDataSet` to load customized results using customized queries from Google Big Query tables.
- Documentation improvements.
Ajay Bisht, Vijay Sajjanar, Deepyaman Datta, Sebastian Bertoli, Shahil Mawjee, Louis Guitton, Emanuel Ferm
- Added the `kedro pipeline pull` CLI command to extract a packaged modular pipeline, and place the contents in a Kedro project.
- Added the `--version` option to `kedro pipeline package` to allow specifying alternative versions to package under.
- Added the `--starter` option to `kedro new` to create a new project from a local, remote or aliased starter template.
- Added the `kedro starter list` CLI command to list all starter templates that can be used to bootstrap a new Kedro project.
- Added the following new datasets.

Type | Description | Location
---|---|---
`json.JSONDataSet` | Work with JSON files using the Python standard library | `kedro.extras.datasets.json`
- Removed `/src/nodes` directory from the project template and made `kedro jupyter convert` create it on the fly if necessary.
- Fixed a bug in `MatplotlibWriter` which prevented saving lists and dictionaries of plots locally on Windows.
- Closed all pyplot windows after saving in `MatplotlibWriter`.
- Documentation improvements:
  - Added kedro-wings and kedro-great to the list of community plugins.
- Fixed broken versioning for Windows paths.
- Fixed `DataSet` string representation for falsy values.
- Improved the error message when duplicate nodes are passed to the `Pipeline` initializer.
- Fixed a bug where `kedro docs` would fail because the built docs were located in a different directory.
- Fixed a bug where `ParallelRunner` would fail on Windows machines whose reported CPU count exceeded 61.
- Fixed an issue with saving TensorFlow model to `h5` file on Windows.
- Added a `json` parameter to `APIDataSet` for the convenience of generating requests with JSON bodies.
- Fixed dependencies for `SparkDataSet` to include spark.
Deepyaman Datta, Tam-Sanh Nguyen, DataEngineerOne
- Added the following new datasets.

Type | Description | Location
---|---|---
`pandas.AppendableExcelDataSet` | Work with Excel files opened in append mode | `kedro.extras.datasets.pandas`
`tensorflow.TensorFlowModelDataset` | Work with TensorFlow models using TensorFlow 2.X | `kedro.extras.datasets.tensorflow`
`holoviews.HoloviewsWriter` | Work with Holoviews objects (saves as image file) | `kedro.extras.datasets.holoviews`
- `kedro install` will now compile project dependencies (by running `kedro build-reqs` behind the scenes) before the installation if the `src/requirements.in` file doesn't exist.
- Added `only_nodes_with_namespace` in `Pipeline` class to filter only nodes with a specified namespace.
- Added the `kedro pipeline delete` command to help delete unwanted or unused pipelines (it won't remove references to the pipeline in your `create_pipelines()` code).
- Added the `kedro pipeline package` command to help package up a modular pipeline. It will bundle up the pipeline source code, tests, and parameters configuration into a .whl file.
- `DataCatalog` improvements:
  - Introduced regex filtering to the `DataCatalog.list()` method (see the sketch after this list).
  - Non-alphanumeric characters (except underscore) in dataset name are replaced with `__` in `DataCatalog.datasets`, for ease of access to transcoded datasets.
- Dataset improvements:
  - Improved initialization speed of `spark.SparkHiveDataSet`.
  - Improved S3 cache in `spark.SparkDataSet`.
  - Added support of options for building `pyarrow` table in `pandas.ParquetDataSet`.
- `kedro build-reqs` CLI command improvements:
  - `kedro build-reqs` is now called with the `-q` option and will no longer print out compiled requirements to the console for security reasons.
  - All unrecognized CLI options in the `kedro build-reqs` command are now passed to the pip-compile call (e.g. `kedro build-reqs --generate-hashes`).
- `kedro jupyter` CLI command improvements:
  - Improved error message when running `kedro jupyter notebook`, `kedro jupyter lab` or `kedro ipython` with Jupyter/IPython dependencies not being installed.
  - Fixed `%run_viz` line magic for showing kedro viz inside a Jupyter notebook. For the fix to be applied on an existing Kedro project, please see the migration guide.
  - Fixed the bug in IPython startup script (issue 298).
- Documentation improvements:
  - Updated community-generated content in FAQ.
  - Added find-kedro and kedro-static-viz to the list of community plugins.
  - Added missing `pillow.ImageDataSet` entry to the documentation.
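For instance, the regex filtering on `DataCatalog.list()` mentioned above can be used like this (the dataset names are illustrative):

```python
from kedro.io import DataCatalog, MemoryDataSet

catalog = DataCatalog(
    {
        "model_input_table": MemoryDataSet(),
        "model_metrics": MemoryDataSet(),
        "raw_reviews": MemoryDataSet(),
    }
)

print(catalog.list())           # all dataset names
print(catalog.list("^model_"))  # only names matching the regex
```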
Even though this release ships a fix for projects generated with `kedro==0.16.2`, after upgrading, you will still need to make a change in your existing project if it was generated with `kedro>=0.16.0,<=0.16.1` for the fix to take effect. Specifically, please change the content of your project's IPython init script located at `.ipython/profile_default/startup/00-kedro-init.py` with the content of this file. You will also need `kedro-viz>=3.3.1`.
Miguel Rodriguez Gutierrez, Joel Schwarzmann, w0rdsm1th, Deepyaman Datta, Tam-Sanh Nguyen, Marcus Gawronsky
- Fixed deprecation warnings from `kedro.cli` and `kedro.context` when running `kedro jupyter notebook`.
- Fixed a bug where `catalog` and `context` were not available in Jupyter Lab and Notebook.
- Fixed a bug where `kedro build-reqs` would fail if you didn't have your project dependencies installed.
- Added new CLI commands (only available for the projects created using Kedro 0.16.0 or later):
  - `kedro catalog list` to list datasets in your catalog
  - `kedro pipeline list` to list pipelines
  - `kedro pipeline describe` to describe a specific pipeline
  - `kedro pipeline create` to create a modular pipeline
- Improved the CLI speed by up to 50%.
- Improved error handling when making a typo on the CLI. We now suggest some of the possible commands you meant to type, in `git`-style.
- All modules in `kedro.cli` and `kedro.context` have been moved into `kedro.framework.cli` and `kedro.framework.context` respectively. `kedro.cli` and `kedro.context` will be removed in future releases.
- Added `Hooks`, which is a new mechanism for extending Kedro.
- Fixed `load_context` changing user's current working directory.
- Allowed the source directory to be configurable in `.kedro.yml`.
- Added the ability to specify nested parameter values inside your node inputs, e.g. `node(func, "params:a.b", None)`
- Added the following new datasets.

Type | Description | Location
---|---|---
`pillow.ImageDataSet` | Work with image files using Pillow | `kedro.extras.datasets.pillow`
`geopandas.GeoJSONDataSet` | Work with geospatial data using GeoPandas | `kedro.extras.datasets.geopandas`
`api.APIDataSet` | Work with data from HTTP(S) API requests | `kedro.extras.datasets.api`
- Added `joblib` backend support to `pickle.PickleDataSet`.
- Added versioning support to the `MatplotlibWriter` dataset.
- Added the ability to install dependencies for a given dataset with more granularity, e.g. `pip install "kedro[pandas.ParquetDataSet]"`.
- Added the ability to specify extra arguments, e.g. `encoding` or `compression`, for `fsspec.spec.AbstractFileSystem.open()` calls when loading/saving a dataset. See Example 3 under docs.
- Added `namespace` property on `Node`, related to the modular pipeline where the node belongs.
- Added an option to enable asynchronous loading of inputs and saving of outputs in both the `SequentialRunner(is_async=True)` and `ParallelRunner(is_async=True)` classes (see the sketch after this list).
- Added `MemoryProfiler` transformer.
- Removed the requirement to have all dependencies for a dataset module to use only a subset of the datasets within.
- Added support for `pandas>=1.0`.
- Enabled Python 3.8 compatibility. Please note that a Spark workflow may be unreliable for this Python version as `pyspark` is not fully compatible with 3.8 yet.
- Renamed "features" layer to "feature" layer to be consistent with (most) other layers and the relevant FAQ.
- Fixed a bug where a new version created mid-run by an external system caused inconsistencies in the load versions used in the current run.
- Documentation improvements:
  - Added instruction in the documentation on how to create a custom runner.
  - Updated contribution process in `CONTRIBUTING.md` - added Developer Workflow.
  - Documented installation of development version of Kedro in the FAQ section.
  - Added missing `_exists` method to `MyOwnDataSet` example in 04_user_guide/08_advanced_io.
- Fixed a bug where `PartitionedDataSet` and `IncrementalDataSet` were not working with `s3a` or `s3n` protocol.
- Added ability to read partitioned parquet file from a directory in `pandas.ParquetDataSet`.
- Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataSet` and `IncrementalDataSet` for per-instance cache invalidation.
- Implemented custom glob function for `SparkDataSet` when running on Databricks.
- Fixed a bug in `SparkDataSet` not allowing for loading data from DBFS in a Windows machine using Databricks-connect.
- Improved the error message for `DataSetNotFoundError` to suggest possible dataset names the user meant to type.
- Added the option for contributors to run Kedro tests locally without Spark installation with `make test-no-spark`.
- Added option to lint the project without applying the formatting changes (`kedro lint --check-only`).
- Deleted obsolete datasets from `kedro.io`.
- Deleted `kedro.contrib` and `extras` folders.
- Deleted obsolete `CSVBlobDataSet` and `JSONBlobDataSet` dataset types.
- Made `invalidate_cache` method on datasets private.
- `get_last_load_version` and `get_last_save_version` methods are no longer available on `AbstractDataSet`.
- `get_last_load_version` and `get_last_save_version` have been renamed to `resolve_load_version` and `resolve_save_version` on `AbstractVersionedDataSet`, the results of which are cached.
- The `release()` method on datasets extending `AbstractVersionedDataSet` clears the cached load and save version. All custom datasets must call `super()._release()` inside `_release()`.
- `TextDataSet` no longer has `load_args` and `save_args`. These can instead be specified under `open_args_load` or `open_args_save` in `fs_args`.
- `PartitionedDataSet` and `IncrementalDataSet` method `invalidate_cache` was made private: `_invalidate_caches`.
- Removed `KEDRO_ENV_VAR` from `kedro.context` to speed up the CLI run time.
- `Pipeline.name` has been removed in favour of `Pipeline.tag()`.
- Dropped `Pipeline.transform()` in favour of the `kedro.pipeline.modular_pipeline.pipeline()` helper function.
- Made constant `PARAMETER_KEYWORDS` private, and moved it from `kedro.pipeline.pipeline` to `kedro.pipeline.modular_pipeline`.
- Layers are no longer part of the dataset object, as they've moved to the `DataCatalog`.
- Python 3.5 is no longer supported by the current and all future versions of Kedro.
Reminder: Our documentation on how to upgrade Kedro covers a few key things to remember when updating any Kedro version.
Since all the datasets (from `kedro.io` and `kedro.contrib.io`) were moved to `kedro/extras/datasets`, you must update the type of all datasets in your `<project>/conf/base/catalog.yml` file.
Here is how it should be changed: `type: <SomeDataSet>` -> `type: <subfolder of kedro/extras/datasets>.<SomeDataSet>` (e.g. `type: CSVDataSet` -> `type: pandas.CSVDataSet`).
In addition, all the specific datasets like `CSVLocalDataSet`, `CSVS3DataSet` etc. were deprecated. Instead, you must use generalized datasets like `CSVDataSet`.
E.g. `type: CSVS3DataSet` -> `type: pandas.CSVDataSet`.
Note: No changes are required if you are using your own custom dataset.
`Pipeline.transform()` has been dropped in favour of the `pipeline()` constructor. The following changes apply:
- Remember to import `from kedro.pipeline import pipeline`
- The `prefix` argument has been renamed to `namespace`
- And `datasets` has been broken down into more granular arguments:
  - `inputs`: Independent inputs to the pipeline
  - `outputs`: Any output created in the pipeline, whether an intermediary dataset or a leaf output
  - `parameters`: `params:...` or `parameters`

As an example, code that used to look like this with the `Pipeline.transform()` constructor:

```python
result = my_pipeline.transform(
    datasets={"input": "new_input", "output": "new_output", "params:x": "params:y"},
    prefix="pre",
)
```

When used with the new `pipeline()` constructor, becomes:

```python
from kedro.pipeline import pipeline

result = pipeline(
    my_pipeline,
    inputs={"input": "new_input"},
    outputs={"output": "new_output"},
    parameters={"params:x": "params:y"},
    namespace="pre",
)
```
Since some modules were moved to other locations, you need to update import paths appropriately. You can find the list of moved files in the `0.15.6` release notes under the section titled `Files with a new location`.

Note: If you haven't made significant changes to your `kedro_cli.py`, it may be easier to simply copy the updated `kedro_cli.py` and `.ipython/profile_default/startup/00-kedro-init.py` from GitHub or a newly generated project into your old project.
- We've removed `KEDRO_ENV_VAR` from `kedro.context`. To get your existing project template working, you'll need to remove all instances of `KEDRO_ENV_VAR` from your project template:
  - From the imports in `kedro_cli.py` and `.ipython/profile_default/startup/00-kedro-init.py`: `from kedro.context import KEDRO_ENV_VAR, load_context` -> `from kedro.framework.context import load_context`
  - Remove the `envvar=KEDRO_ENV_VAR` line from the click options in `run`, `jupyter_notebook` and `jupyter_lab` in `kedro_cli.py`
  - Replace `KEDRO_ENV_VAR` with `"KEDRO_ENV"` in `_build_jupyter_env`
  - Replace `context = load_context(path, env=os.getenv(KEDRO_ENV_VAR))` with `context = load_context(path)` in `.ipython/profile_default/startup/00-kedro-init.py`
We have upgraded `pip-tools`, which is used by `kedro build-reqs`, to 5.x. This `pip-tools` version requires `pip>=20.0`. To upgrade `pip`, please refer to their documentation.
@foolsgold, Mani Sarkar, Priyanka Shanbhag, Luis Blanche, Deepyaman Datta, Antony Milne, Panos Psimatikas, Tam-Sanh Nguyen, Tomasz Kaczmarczyk, Kody Fischer, Waylon Walker
- Pinned `fsspec>=0.5.1, <0.7.0` and `s3fs>=0.3.0, <0.4.1` to fix incompatibility issues with their latest release.
- Added the additional libraries to our `requirements.txt` so the `pandas.CSVDataSet` class works out of the box with `pip install kedro`.
- Added `pandas` to our `extra_requires` in `setup.py`.
- Improved the error message when dependencies of a `DataSet` class are missing.
- Added in documentation on how to contribute a custom `AbstractDataSet` implementation.
- Fixed the link to the Kedro banner image in the documentation.
TL;DR We're launching `kedro.extras`, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in `kedro.extras.datasets` use `fsspec` to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP), and Hadoop, read more about this here. The change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading the CSV `SparkDataSet` from S3:

```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```

You can also load data incrementally whenever it is dumped into a directory with the extension to `PartitionedDataSet`, a feature that allows you to load a directory of files. The `IncrementalDataSet` stores the information about the last processed partition in a `checkpoint`, read more about this feature here.
- Added `layer` attribute for datasets in `kedro.extras.datasets` to specify the name of a layer according to data engineering convention; this feature will be passed to `kedro-viz` in future releases.
- Enabled loading a particular version of a dataset in Jupyter Notebooks and iPython, using `catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")`.
- Added property `run_id` on `ProjectContext`, used for versioning using the `Journal`. To customise your journal `run_id` you can override the private method `_get_run_id()`.
- Added the ability to install all optional kedro dependencies via `pip install "kedro[all]"`.
- Modified the `DataCatalog`'s load order for datasets, loading order is the following:
  - `kedro.io`
  - `kedro.extras.datasets`
  - Import path, specified in `type`
- Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify (`deepcopy`, `copy` or `assign`) the copy mode to use when loading and saving (see the sketch below).
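A minimal sketch of the `copy_mode` flag described in the last item; the data used here is illustrative:

```python
import pandas as pd
from kedro.io import MemoryDataSet

df = pd.DataFrame({"a": [1, 2, 3]})

# "assign" stores and returns the same object without copying, which can be
# useful for large objects that are expensive to deep-copy.
ds = MemoryDataSet(data=df, copy_mode="assign")
assert ds.load() is df
```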
Type | Description | Location
---|---|---
`dask.ParquetDataSet` | Handles parquet datasets using Dask | `kedro.extras.datasets.dask`
`pickle.PickleDataSet` | Work with Pickle files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pickle`
`pandas.CSVDataSet` | Work with CSV files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas`
`pandas.TextDataSet` | Work with text files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas`
`pandas.ExcelDataSet` | Work with Excel files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas`
`pandas.HDFDataSet` | Work with HDF using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas`
`yaml.YAMLDataSet` | Work with YAML files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.yaml`
`matplotlib.MatplotlibWriter` | Save Matplotlib images using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.matplotlib`
`networkx.NetworkXDataSet` | Work with NetworkX files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.networkx`
`biosequence.BioSequenceDataSet` | Work with bio-sequence objects using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence`
`pandas.GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas`
`pandas.FeatherDataSet` | Work with feather files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas`
`IncrementalDataSet` | Inherits from `PartitionedDataSet` and remembers the last processed partition | `kedro.io`
Type | New Location
---|---
`JSONDataSet` | `kedro.extras.datasets.pandas`
`CSVBlobDataSet` | `kedro.extras.datasets.pandas`
`JSONBlobDataSet` | `kedro.extras.datasets.pandas`
`SQLTableDataSet` | `kedro.extras.datasets.pandas`
`SQLQueryDataSet` | `kedro.extras.datasets.pandas`
`SparkDataSet` | `kedro.extras.datasets.spark`
`SparkHiveDataSet` | `kedro.extras.datasets.spark`
`SparkJDBCDataSet` | `kedro.extras.datasets.spark`
`kedro/contrib/decorators/retry.py` | `kedro/extras/decorators/retry_node.py`
`kedro/contrib/decorators/memory_profiler.py` | `kedro/extras/decorators/memory_profiler.py`
`kedro/contrib/io/transformers/transformers.py` | `kedro/extras/transformers/time_profiler.py`
`kedro/contrib/colors/logging/color_logger.py` | `kedro/extras/logging/color_logger.py`
`extras/ipython_loader.py` | `tools/ipython/ipython_loader.py`
`kedro/contrib/io/cached/cached_dataset.py` | `kedro/io/cached_dataset.py`
`kedro/contrib/io/catalog_with_default/data_catalog_with_default.py` | `kedro/io/data_catalog_with_default.py`
`kedro/contrib/config/templated_config.py` | `kedro/config/templated_config.py`
Category | Type
---|---
Datasets | `BioSequenceLocalDataSet`
 | `CSVGCSDataSet`
 | `CSVHTTPDataSet`
 | `CSVLocalDataSet`
 | `CSVS3DataSet`
 | `ExcelLocalDataSet`
 | `FeatherLocalDataSet`
 | `JSONGCSDataSet`
 | `JSONLocalDataSet`
 | `HDFLocalDataSet`
 | `HDFS3DataSet`
 | `kedro.contrib.io.cached.CachedDataSet`
 | `kedro.contrib.io.catalog_with_default.DataCatalogWithDefault`
 | `MatplotlibLocalWriter`
 | `MatplotlibS3Writer`
 | `NetworkXLocalDataSet`
 | `ParquetGCSDataSet`
 | `ParquetLocalDataSet`
 | `ParquetS3DataSet`
 | `PickleLocalDataSet`
 | `PickleS3DataSet`
 | `TextLocalDataSet`
 | `YAMLLocalDataSet`
Decorators | `kedro.contrib.decorators.memory_profiler`
 | `kedro.contrib.decorators.retry`
 | `kedro.contrib.decorators.pyspark.spark_to_pandas`
 | `kedro.contrib.decorators.pyspark.pandas_to_spark`
Transformers | `kedro.contrib.io.transformers.transformers`
Configuration Loaders | `kedro.contrib.config.TemplatedConfigLoader`
- Added the option to set/overwrite params in `config.yaml` using YAML dict style instead of string CLI formatting only.
- Kedro CLI arguments `--node` and `--tag` support comma-separated values, alternative methods will be deprecated in future releases.
- Fixed a bug in the `invalidate_cache` method of `ParquetGCSDataSet` and `CSVGCSDataSet`.
- `--load-version` now won't break if the version value contains a colon.
- Enabled running `node`s with duplicate inputs.
- Improved error message when empty credentials are passed into `SparkJDBCDataSet`.
- Fixed bug that caused an empty project to fail unexpectedly with ImportError in `template/.../pipeline.py`.
- Fixed bug related to saving dataframe with categorical variables in table mode using `HDFS3DataSet`.
- Fixed bug that caused unexpected behavior when using `from_nodes` and `to_nodes` in pipelines using transcoding.
- Credentials nested in the dataset config are now also resolved correctly.
- Bumped minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`).
- Docs improvements.
- `Pipeline.transform` skips modifying node inputs/outputs containing `params:` or `parameters` keywords.
- Support for the `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
- Datasets can have a new `confirm` function which is called after a successful node function execution if the node contains a `confirms` argument with such dataset name.
- Make the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs` to avoid unnecessarily re-running nodes that had already executed.
- When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use the `--idle-timeout` option to update it.
- Added `kedro-viz` to the Kedro project template `requirements.txt` file.
- Removed the `results` and `references` folder from the project template.
- Updated contribution process in `CONTRIBUTING.md`.
- Existing `MatplotlibWriter` dataset in `contrib` was renamed to `MatplotlibLocalWriter`.
- `kedro/contrib/io/matplotlib/matplotlib_writer.py` was renamed to `kedro/contrib/io/matplotlib/matplotlib_local_writer.py`.
- `kedro.contrib.io.bioinformatics.sequence_dataset.py` was renamed to `kedro.contrib.io.bioinformatics.biosequence_local_dataset.py`.
Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez
- New CLI commands and command flags:
  - Load multiple `kedro run` CLI flags from a configuration file with the `--config` flag (e.g. `kedro run --config run_config.yml`)
  - Run parametrised pipeline runs with the `--params` flag (e.g. `kedro run --params param1:value1,param2:value2`).
  - Lint your project code using the `kedro lint` command; your project is linted with `black` (Python 3.6+), `flake8` and `isort`.
- Load specific environments with Jupyter notebooks using `KEDRO_ENV`, which will globally set the `run`, `jupyter notebook` and `jupyter lab` commands using environment variables.
- Added the following datasets:
  - `CSVGCSDataSet` dataset in `contrib` for working with CSV files in Google Cloud Storage.
  - `ParquetGCSDataSet` dataset in `contrib` for working with Parquet files in Google Cloud Storage.
  - `JSONGCSDataSet` dataset in `contrib` for working with JSON files in Google Cloud Storage.
  - `MatplotlibS3Writer` dataset in `contrib` for saving Matplotlib images to S3.
  - `PartitionedDataSet` for working with datasets split across multiple files.
  - `JSONDataSet` dataset for working with JSON files that uses `fsspec` to communicate with the underlying filesystem. It doesn't support `http(s)` protocol for now.
- Added `s3fs_args` to all S3 datasets.
- Pipelines can be subtracted with `pipeline1 - pipeline2`, as in the sketch below.
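A minimal sketch of pipeline subtraction; the nodes and the `identity` function are illustrative:

```python
from kedro.pipeline import Pipeline, node


def identity(x):
    return x


first = node(identity, "a", "b", name="first")
second = node(identity, "b", "c", name="second")

pipeline1 = Pipeline([first, second])
pipeline2 = Pipeline([second])

remaining = pipeline1 - pipeline2  # only the node named "first" remains
print(remaining.nodes)
```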
- `ParallelRunner` now works with `SparkDataSet`.
- Allowed the use of nulls in `parameters.yml`.
- Fixed an issue where `%reload_kedro` wasn't reloading all user modules.
- Fixed `pandas_to_spark` and `spark_to_pandas` decorators to work with functions with kwargs.
- Fixed a bug where `kedro jupyter notebook` and `kedro jupyter lab` would run a different Jupyter installation to the one in the local environment.
- Implemented Databricks-compatible dataset versioning for `SparkDataSet`.
- Fixed a bug where `kedro package` would fail in certain situations where `kedro build-reqs` was used to generate `requirements.txt`.
- Made the `bucket_name` argument optional for the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet` - the bucket name can now be included in the filepath along with the filesystem protocol (e.g. `s3://bucket-name/path/to/key.csv`).
- Documentation improvements and fixes.
- Renamed entry point for running pip-installed projects to `run_package()` instead of `main()` in `src/<package>/run.py`.
- The `bucket_name` key has been removed from the string representation of the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet`.
- Moved the `mem_profiler` decorator to `contrib` and separated the `contrib` decorators so that dependencies are modular. You may need to update your import paths, for example the pyspark decorators should be imported as `from kedro.contrib.decorators.pyspark import <pyspark_decorator>` instead of `from kedro.contrib.decorators import <pyspark_decorator>`.
Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel
- `kedro jupyter` now gives the default kernel a sensible name.
- `Pipeline.name` has been deprecated in favour of `Pipeline.tags`.
- Reuse pipelines within a Kedro project using `Pipeline.transform`; it simplifies dataset and node renaming.
- Added Jupyter Notebook line magic (`%run_viz`) to run `kedro viz` in a Notebook cell (requires `kedro-viz` version 3.0.0 or later).
- Added the following datasets:
  - `NetworkXLocalDataSet` in `kedro.contrib.io.networkx` to load and save local graphs (JSON format) via NetworkX. (by @josephhaaga)
  - `SparkHiveDataSet` in `kedro.contrib.io.pyspark.SparkHiveDataSet` allowing usage of Spark and insert/upsert on non-transactional Hive tables.
- `kedro.contrib.config.TemplatedConfigLoader` now supports name/dict key templating and default values.
- `get_last_load_version()` method for versioned datasets now returns the exact last load version if the dataset has been loaded at least once and `None` otherwise.
- Fixed a bug in `_exists` method for versioned `SparkDataSet`.
- Enabled the customisation of the ExcelWriter in `ExcelLocalDataSet` by specifying options under the `writer` key in `save_args`.
- Fixed a bug in IPython startup script, attempting to load context from the incorrect location.
- Removed capping the length of a dataset's string representation.
- Fixed `kedro install` command failing on Windows if `src/requirements.txt` contains a different version of Kedro.
- Enabled passing a single tag into a node or a pipeline without having to wrap it in a list (i.e. `tags="my_tag"`).
- Removed `_check_paths_consistency()` method from `AbstractVersionedDataSet`. Version consistency check is now done in `AbstractVersionedDataSet.save()`. Custom versioned datasets should modify the `save()` method implementation accordingly.
Joseph Haaga, Deepyaman Datta, Joost Duisters, Zain Patel, Tom Vigrass
- Narrowed the requirements for `PyTables` so that we maintain support for Python 3.5.
- Added `--load-version`, a `kedro run` argument that allows you to run the pipeline with a particular load version of a dataset.
- Support for modular pipelines in `src/`, break the pipeline into isolated parts with reusability in mind.
- Support for multiple pipelines, an ability to have multiple entry point pipelines and choose one with `kedro run --pipeline NAME`.
- Added a `MatplotlibWriter` dataset in `contrib` for saving Matplotlib images.
- An ability to template/parameterize configuration files with `kedro.contrib.config.TemplatedConfigLoader`.
- Parameters are exposed as a context property for ease of access in iPython / Jupyter Notebooks with `context.params`.
- Added `max_workers` parameter for `ParallelRunner`.
- Users will override the `_get_pipeline` abstract method in `ProjectContext(KedroContext)` in `run.py` rather than the `pipeline` abstract property. The `pipeline` property is not abstract anymore.
- Improved an error message when a versioned local dataset is saved and an unversioned path already exists.
- Added `catalog` global variable to `00-kedro-init.py`, allowing you to load datasets with `catalog.load()`.
- Enabled tuples to be returned from a node.
- Disallowed the `ConfigLoader` loading the same file more than once, and deduplicated the `conf_paths` passed in.
- Added a `--open` flag to `kedro build-docs` that opens the documentation on build.
- Updated the `Pipeline` representation to include the name of the pipeline, also making it readable as a context property.
- `kedro.contrib.io.pyspark.SparkDataSet` and `kedro.contrib.io.azure.CSVBlobDataSet` now support versioning.
- `KedroContext.run()` no longer accepts `catalog` and `pipeline` arguments.
- `node.inputs` now returns the node's inputs in the order required to bind them properly to the node's function.
Deepyaman Datta, Luciano Issoe, Joost Duisters, Zain Patel, William Ashford, Karlson Lee
- Extended `versioning` support to cover the tracking of environment setup, code and datasets.
- Added the following datasets:
  - `FeatherLocalDataSet` in `contrib` for usage with pandas. (by @mdomarsaleem)
- Added `get_last_load_version` and `get_last_save_version` to `AbstractVersionedDataSet`.
- Implemented `__call__` method on `Node` to allow users to execute `my_node(input1=1, input2=2)` as an alternative to `my_node.run(dict(input1=1, input2=2))` (see the sketch after this list).
- Added new `--from-inputs` run argument.
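A minimal sketch of the new `__call__` behaviour; the `add` function and node names are illustrative:

```python
from kedro.pipeline import node


def add(input1, input2):
    return input1 + input2


my_node = node(add, ["input1", "input2"], "sum")

# Both calls bind the given values to the node's inputs and return {"sum": 3}.
assert my_node(input1=1, input2=2) == my_node.run(dict(input1=1, input2=2))
```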
- Fixed a bug in `load_context()` not loading context in non-Kedro Jupyter Notebooks.
- Fixed a bug in `ConfigLoader.get()` not listing nested files for `**`-ending glob patterns.
- Fixed a logging config error in Jupyter Notebook.
- Updated documentation in `03_configuration` regarding how to modify the configuration path.
- Documented the architecture of Kedro showing how we think about library, project and framework components.
- `extras/kedro_project_loader.py` renamed to `extras/ipython_loader.py` and now runs any IPython startup scripts without relying on the Kedro project structure.
- Fixed TypeError when validating partial function's signature.
- After a node failure during a pipeline run, a resume command will be suggested in the logs. This command will not work if the required inputs are MemoryDataSets.
Omar Saleem, Mariana Silva, Anil Choudhary, Craig
- Added `KedroContext` base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config, runner).
- Added a new CLI command `kedro jupyter convert` to facilitate converting Jupyter Notebook cells into Kedro nodes.
- Added support for `pip-compile` and new Kedro command `kedro build-reqs` that generates `requirements.txt` based on `requirements.in`.
- Running `kedro install` will install packages to the conda environment if `src/environment.yml` exists in your project.
- Added a new `--node` flag to `kedro run`, allowing users to run only the nodes with the specified names.
- Added new `--from-nodes` and `--to-nodes` run arguments, allowing users to run a range of nodes from the pipeline.
- Added prefix `params:` to the parameters specified in `parameters.yml`, which allows users to differentiate between their different parameter node inputs and outputs.
- Jupyter Lab/Notebook now starts with only one kernel by default.
- Added the following datasets:
  - `CSVHTTPDataSet` to load CSV using HTTP(s) links.
  - `JSONBlobDataSet` to load json (-delimited) files from Azure Blob Storage.
  - `ParquetS3DataSet` in `contrib` for usage with pandas. (by @mmchougule)
  - `CachedDataSet` in `contrib` which will cache data in memory to avoid io/network operations. It will clear the cache once a dataset is no longer needed by a pipeline. (by @tsanikgr)
  - `YAMLLocalDataSet` in `contrib` to load and save local YAML files. (by @Minyus)
- Documentation improvements including instructions on how to initialise a Spark session using YAML configuration.
- `anyconfig` default log level changed from `INFO` to `WARNING`.
- Added information on installed plugins to `kedro info`.
- Added style sheets for project documentation, so the output of `kedro build-docs` will resemble the style of `kedro docs`.
- Simplified the Kedro template in `run.py` with the introduction of the `KedroContext` class.
- Merged `FilepathVersionMixIn` and `S3VersionMixIn` under one abstract class `AbstractVersionedDataSet` which extends `AbstractDataSet`.
- `name` changed to be a keyword-only argument for `Pipeline`.
- `CSVLocalDataSet` no longer supports URLs. `CSVHTTPDataSet` supports URLs.
This guide assumes that:
- The framework specific code has not been altered significantly
- Your project specific code is stored in the dedicated python package under `src/`.

The breaking changes were introduced in the following project template files:
- `<project-name>/.ipython/profile_default/startup/00-kedro-init.py`
- `<project-name>/kedro_cli.py`
- `<project-name>/src/tests/test_run.py`
- `<project-name>/src/<python_package>/run.py`
- `<project-name>/.kedro.yml` (new file)
The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using `kedro new`) and move code and files bit by bit as suggested in the detailed guide below:
-   Create a new project with the same name by running `kedro new`
-   Copy the following folders to the new project:
    - `results/`
    - `references/`
    - `notebooks/`
    - `logs/`
    - `data/`
    - `conf/`
-   If you customised your `src/<package>/run.py`, make sure you apply the same customisations to `src/<package>/run.py`
    - If you customised `get_config()`, you can override the `config_loader` property in a `ProjectContext` derived class (see the sketch after these migration steps)
    - If you customised `create_catalog()`, you can override the `catalog()` property in a `ProjectContext` derived class
    - If you customised `run()`, you can override the `run()` method in a `ProjectContext` derived class
    - If you customised the default `env`, you can override it in a `ProjectContext` derived class or pass it at construction. By default, `env` is `local`.
    - If you customised the default `root_conf`, you can override the `CONF_ROOT` attribute in a `ProjectContext` derived class. By default, the `KedroContext` base class has the `CONF_ROOT` attribute set to `conf`.
- The following syntax changes are introduced in ipython or Jupyter notebook/labs:
  - `proj_dir` -> `context.project_path`
  - `proj_name` -> `context.project_name`
  - `conf` -> `context.config_loader`
  - `io` -> `context.catalog` (e.g., `io.load()` -> `context.catalog.load()`)
-   If you customised your `kedro_cli.py`, you need to apply the same customisations to your `kedro_cli.py` in the new project.
-   Copy the contents of the old project's `src/requirements.txt` into the new project's `src/requirements.in` and, from the project root directory, run the `kedro build-reqs` command in your terminal window.
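A minimal sketch, with illustrative names, of a `ProjectContext` in `run.py` that overrides the `config_loader` property as described in the steps above; the project name, version and the extra `conf/shared` directory are hypothetical:

```python
from kedro.config import ConfigLoader
from kedro.context import KedroContext
from kedro.pipeline import Pipeline


class ProjectContext(KedroContext):
    project_name = "my-project"    # illustrative
    project_version = "0.15.0"

    @property
    def pipeline(self) -> Pipeline:
        return Pipeline([])  # assemble your real pipeline here

    @property
    def config_loader(self) -> ConfigLoader:
        # Equivalent of a customised get_config(): e.g. add an extra conf directory.
        conf_paths = [
            str(self.project_path / "conf" / "base"),
            str(self.project_path / "conf" / self.env),
            str(self.project_path / "conf" / "shared"),  # hypothetical extra directory
        ]
        return ConfigLoader(conf_paths)
```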
If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes:
- Make sure your dataset inherits from `AbstractVersionedDataSet` only.
- Call `super().__init__()` with the appropriate arguments in the dataset's `__init__`. If storing on local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an `exists_function` and a `glob_function` that emulate `exists` and `glob` in a different filesystem (see `CSVS3DataSet` as an example).
- Remove setting of the `_filepath` and `_version` attributes in the dataset's `__init__`, as this is taken care of in the base abstract class.
- Any calls to `_get_load_path` and `_get_save_path` methods should take no arguments.
- Ensure you convert the output of `_get_load_path` and `_get_save_path` appropriately, as these now return `PurePath`s instead of strings.
- Make sure `_check_paths_consistency` is called with `PurePath`s as input arguments, instead of strings.
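A minimal sketch, following the steps above, of a custom versioned dataset after migration. The class name and text-file format are illustrative; for a local filesystem, passing only the filepath and version to `super().__init__()` is enough:

```python
from pathlib import Path, PurePath

from kedro.io import AbstractVersionedDataSet, Version


class MyTextDataSet(AbstractVersionedDataSet):
    def __init__(self, filepath: str, version: Version = None):
        # _filepath and _version are no longer set manually; the base class handles them.
        super().__init__(PurePath(filepath), version)

    def _load(self) -> str:
        load_path = Path(self._get_load_path())  # takes no arguments, returns a PurePath
        return load_path.read_text()

    def _save(self, data: str) -> None:
        save_path = Path(self._get_save_path())
        save_path.parent.mkdir(parents=True, exist_ok=True)
        save_path.write_text(data)

    def _exists(self) -> bool:
        return Path(self._get_load_path()).exists()

    def _describe(self):
        return dict(filepath=self._filepath, version=self._version)
```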
These steps should have brought your project to Kedro 0.15.0. There might be some more minor tweaks needed as every project is unique, but now you have a pretty solid base to work with. If you run into any problems, please consult the Kedro documentation.
Dmitry Vukolov, Jo Stichbury, Angus Williams, Deepyaman Datta, Mayur Chougule, Marat Kopytjuk, Evan Miller, Yusuke Minami
- Tab completion for catalog datasets in `ipython` or `jupyter` sessions. (Thank you @datajoely and @WaylonWalker)
- Added support for transcoding, an ability to decouple loading/saving mechanisms of a dataset from its storage location, denoted by adding '@' to the dataset name.
- Datasets have a new `release` function that instructs them to free any cached data. The runners will call this when the dataset is no longer needed downstream.
- Add support for pipeline nodes made up from partial functions.
- Expand user home directory `~` for `TextLocalDataSet` (see issue #19).
- Add a `short_name` property to `Node`s for a display-friendly (but not necessarily unique) name.
- Add Kedro project loader for IPython: `extras/kedro_project_loader.py`.
- Fix source file encoding issues with Python 3.5 on Windows.
- Fix local project source not having priority over the same source installed as a package, leading to local updates not being recognised.
- Remove the `max_loads` argument from the `MemoryDataSet` constructor and from the `AbstractRunner.create_default_data_set` method.
Joel Schwarzmann, Alex Kalmikov
- Added Data Set transformer support in the form of `AbstractTransformer` and `DataCatalog.add_transformer`.
- Merged the `ExistsMixin` into `AbstractDataSet`.
- `Pipeline.node_dependencies` returns a dictionary keyed by node, with sets of parent nodes as values; `Pipeline` and `ParallelRunner` were refactored to make use of this for topological sort for node dependency resolution and running pipelines respectively.
- `Pipeline.grouped_nodes` returns a list of sets, rather than a list of lists.
- New I/O module `HDFS3DataSet`.
- Improved API docs.
- Template `run.py` will throw a warning instead of an error if `credentials.yml` is not present.
None
The initial release of Kedro.
Jo Stichbury, Aris Valtazanos, Fabian Peters, Guilherme Braccialli, Joel Schwarzmann, Miguel Beltre, Mohammed ElNabawy, Deepyaman Datta, Shubham Agrawal, Oleg Andreyev, Mayur Chougule, William Ashford, Ed Cannon, Nikhilesh Nukala, Sean Bailey, Vikram Tegginamath, Thomas Huijskens, Musa Bilal
We are also grateful to everyone who advised and supported us, filed issues or helped resolve them, asked and answered questions and were part of inspiring discussions.