feat: Add ManagedTableDataset for managed Delta Lake tables in Databricks (#206)

* committing first version of UnityTableCatalog with unit tests. This dataset allows users to interface with Unity catalog tables in Databricks to both read and write.
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* renaming dataset
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding mlflow connectors
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixing mlflow imports
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* cleaned up mlflow for initial release
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* cleaned up mlflow references from setup.py for initial release
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixed deps in setup.py
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding comments before initial PR
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* moved validation to dataclass
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* bug fix in type of partition column and cleanup
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* updated docstring for ManagedTableDataSet
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* added backticks to catalog
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixing regex to allow hyphens
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/test_requirements.txt
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding backticks to catalog
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Require pandas < 2.0 for compatibility with spark < 3.4
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Replace use of walrus operator
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test coverage for validation methods
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove unused versioning functions
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix exception catching for invalid schema, add test for invalid schema
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pylint ignore
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add tests/databricks to ignore for no-spark tests
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>

* Remove spurious mlflow test dependency
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add explicit check for database existence
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove character limit for table names
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Refactor validation steps in ManagedTable
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious checks for table and schema name existence
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Co-authored-by: Danny Farah <danny.farah@quantumblack.com>
Co-authored-by: Danny Farah <danny_farah@mckinsey.com>
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>
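
Several of the changes listed above concern name validation: moving validation into a dataclass, allowing hyphens in names, dropping the character limit on table names, and wrapping names in backticks. The sketch below illustrates how those pieces might fit together. It is not the code from managed_table_dataset.py; the class name `ManagedTableSketch`, the pattern `_NAME_PATTERN`, the `"default"` database and the method `full_table_location` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern: letters, digits, underscores and hyphens, any length.
_NAME_PATTERN = re.compile(r"[0-9a-zA-Z_-]+")


@dataclass(frozen=True)
class ManagedTableSketch:
    """Illustrative stand-in for a managed table description (not the real class)."""

    table: str
    database: str = "default"  # assumed default, for illustration only
    catalog: Optional[str] = None

    def __post_init__(self) -> None:
        # Validate every name that is set; no length limit is enforced.
        for field_name in ("table", "database", "catalog"):
            value = getattr(self, field_name)
            if value is not None and not _NAME_PATTERN.fullmatch(value):
                raise ValueError(
                    f"{field_name} '{value}' does not match the naming pattern"
                )

    def full_table_location(self) -> str:
        # Backticks quote each part so that hyphenated names stay valid
        # when the location is used as a SQL identifier.
        parts = [self.catalog, self.database, self.table]
        return ".".join(f"`{part}`" for part in parts if part is not None)
```

With this sketch, `ManagedTableSketch(table="my-table", database="dev", catalog="main")` passes validation because hyphens are allowed, while a name containing a space raises `ValueError` when the object is constructed.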