feat: Add ManagedTableDataset for managed Delta Lake tables in Databricks (#206)

* committing first version of UnityTableCatalog with unit tests. This dataset lets users both read from and write to Unity Catalog tables in Databricks.

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* renaming dataset

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding mlflow connectors

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixing mlflow imports

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* cleaned up mlflow for initial release

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* cleaned up mlflow references from setup.py for initial release

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixed deps in setup.py

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding comments before initial PR

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* moved validation to dataclass

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* bug fix in type of partition column and cleanup

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* updated docstring for ManagedTableDataSet

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* added backticks to catalog

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixing regex to allow hyphens

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/test_requirements.txt

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding backticks to catalog

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Require pandas < 2.0 for compatibility with spark < 3.4

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Replace use of walrus operator

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test coverage for validation methods

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove unused versioning functions

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix exception catching for invalid schema, add test for invalid schema

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pylint ignore

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add tests/databricks to ignore for no-spark tests

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>

* Remove spurious mlflow test dependency

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add explicit check for database existence

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove character limit for table names

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Refactor validation steps in ManagedTable

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious checks for table and schema name existence

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Co-authored-by: Danny Farah <danny.farah@quantumblack.com>
Co-authored-by: Danny Farah <danny_farah@mckinsey.com>
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>
4 people authored May 22, 2023
1 parent de8b833 commit 1354fc9
Showing 9 changed files with 962 additions and 4 deletions.
4 changes: 2 additions & 2 deletions Makefile
@@ -52,10 +52,10 @@ sign-off:

 # kedro-datasets related only
 test-no-spark:
-	cd kedro-datasets && pytest tests --no-cov --ignore tests/spark --numprocesses 4 --dist loadfile
+	cd kedro-datasets && pytest tests --no-cov --ignore tests/spark --ignore tests/databricks --numprocesses 4 --dist loadfile

 test-no-spark-sequential:
-	cd kedro-datasets && pytest tests --no-cov --ignore tests/spark
+	cd kedro-datasets && pytest tests --no-cov --ignore tests/spark --ignore tests/databricks

 # kedro-datasets/snowflake tests skipped from default scope
 test-snowflake-only:
3 changes: 3 additions & 0 deletions kedro-datasets/.gitignore
@@ -145,3 +145,6 @@ kedro.db
 kedro/html
 docs/tmp-build-artifacts
 docs/build
+spark-warehouse
+metastore_db/
+derby.log
8 changes: 8 additions & 0 deletions kedro-datasets/kedro_datasets/databricks/__init__.py
@@ -0,0 +1,8 @@
+"""Provides interface to Unity Catalog Tables."""
+
+__all__ = ["ManagedTableDataSet"]
+
+from contextlib import suppress
+
+with suppress(ImportError):
+    from .managed_table_dataset import ManagedTableDataSet
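
The new `__init__.py` wraps its import in `suppress(ImportError)`, so the package can still be imported when the dataset's optional dependencies are not installed. The snippet below is a minimal standalone sketch of that pattern, not code from this commit: the helper `optional_import` and the module name `_hypothetical_missing_backend` are assumptions made up for illustration.

```python
import importlib
from contextlib import suppress


def optional_import(module_name, attr):
    """Return `attr` from `module_name`, or None if the module cannot be imported.

    Hypothetical helper for illustration only; it is not part of kedro-datasets.
    """
    with suppress(ImportError):
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # module is swallowed here and execution resumes after the block.
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    return None


# A module from the standard library resolves normally.
loads = optional_import("json", "loads")

# A module that is assumed not to exist yields None instead of raising.
missing = optional_import("_hypothetical_missing_backend", "Thing")
```

With this pattern a missing dependency does not fail at import time. In the `__init__.py` above, the consequence is that `ManagedTableDataSet` is simply not defined in the package namespace when its import fails, so the error only surfaces when the name is actually used.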