
safer use of "/dbfs" #1931

Closed
wants to merge 22 commits into from

Conversation

@mle-els commented Oct 13, 2022

Avoid errors on file systems that happen to have "/dbfs" paths

Description

I use MLflow to submit a Kedro project to a Databricks cluster, because I need access both to the GPU available on the cluster and to the data locations mounted to it. The "kedro run" command fails because an IPython environment is not available (AttributeError: 'NoneType' object has no attribute 'user_ns'), which I traced to this part of the code, which assumes that a DBUtils object can be initialized.

For my project, the program runs on Databricks but not in a managed way. Moreover, on systems that, for whatever reason, use a /dbfs path, it isn't reasonable to assume that a DBUtils object can be initialized. Ideally, initializing the DBUtils object wouldn't throw an exception, but in reality it does. So, to be safe, it's better to catch the exception.
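
For concreteness, here's a minimal sketch of the failure mode, assuming a Databricks runtime where pyspark.dbutils is importable but no IPython kernel is attached (e.g. a job submitted via mlflow run rather than a notebook):

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # Databricks-only module

spark = SparkSession.builder.getOrCreate()

# DBUtils.__init__ ends up calling IPython.get_ipython().user_ns; with no
# IPython kernel attached, get_ipython() returns None, so this raises
# AttributeError: 'NoneType' object has no attribute 'user_ns'
dbutils = DBUtils(spark)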

@mle-els requested a review from idanov as a code owner October 13, 2022 16:55
@deepyaman (Member)

The "kedro run" command fails because an IPython environment is not available (AttributeError: 'NoneType' object has no attribute 'user_ns') which I traced to this part of the code which assumes that you can initialize a DBUtils object.

I'm not sure how this is happening; can you share a full stack trace?

https://github.com/kedro-org/kedro/blob/0.18.3/kedro/extras/datasets/spark/spark_dataset.py#L93

If ipython is None above, it shouldn't try to access the user_ns attribute. I'm not really sure where _get_dbutils is raising an error from, based on a read-through.
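
For reference, the helper looks roughly like this — a paraphrased sketch of the linked 0.18.3 source, not verbatim:

def _get_dbutils(spark):
    """Return the dbutils instance, or None if one cannot be found."""
    dbutils = globals().get("dbutils")
    if dbutils:
        return dbutils

    try:
        from pyspark.dbutils import DBUtils  # Databricks-only module

        dbutils = DBUtils(spark)
    except ImportError:
        try:
            import IPython
        except ImportError:
            pass
        else:
            ipython = IPython.get_ipython()
            # this is the `if ipython is None` guard mentioned above
            dbutils = ipython.user_ns.get("dbutils") if ipython else None
    return dbutils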

Comment on lines 313 to 318
except:
pass
Member

I wouldn't generally call swallowing the exception safe. 😅 But, as posted just a second ago, I'm not sure why an error (especially that which you shared) is occurring here to begin with.

Author

Hi there, this is the full stack trace. The error occurs when Kedro tries to read a dataset whose path starts with /dbfs. I wouldn't normally swallow exceptions either, but it's warranted here because the assumption that /dbfs implies Databricks is risky in the first place.

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/core.py:155 in from_config                   │
│                                                                              │
│   152 │   │   │   ) from exc                                                 │
│   153 │   │                                                                  │
│   154 │   │   try:                                                           │
│ ❱ 155 │   │   │   data_set = class_obj(**config)  # type: ignore             │
│   156 │   │   except TypeError as err:                                       │
│   157 │   │   │   raise DataSetError(                                        │
│   158 │   │   │   │   f"\n{err}.\nDataSet '{name}' must only contain argumen │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │    class_obj = <class                                                    │ │
│ │                'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'> │ │
│ │          cls = <class 'kedro.io.core.AbstractDataSet'>                   │ │
│ │       config = {                                                         │ │
│ │                │   'filepath':                                           │ │
│ │                '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/t… │ │
│ │                │   'file_format': 'parquet',                             │ │
│ │                │   'version': Version(                                   │ │
│ │                │   │   load=None,                                        │ │
│ │                │   │   save='2022-10-13T16.18.00.115Z'                   │ │
│ │                │   )                                                     │ │
│ │                }                                                         │ │
│ │ load_version = None                                                      │ │
│ │         name = 'ft_labels'                                               │ │
│ │ save_version = '2022-10-13T16.18.00.115Z'                                │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/extras/datasets/spark/spark_dataset.py:308 in   │
│ __init__                                                                     │
│                                                                              │
│   305 │   │   │   path = PurePosixPath(filepath)                             │
│   306 │   │   │                                                              │
│   307 │   │   │   if filepath.startswith("/dbfs"):                           │
│ ❱ 308 │   │   │   │   dbutils = _get_dbutils(self._get_spark())              │
│   309 │   │   │   │   if dbutils:                                            │
│   310 │   │   │   │   │   glob_function = partial(_dbfs_glob, dbutils=dbutil │
│   311 │   │   │   │   │   exists_function = partial(_dbfs_exists, dbutils=db │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       __class__ = <class                                                 │ │
│ │                   'kedro.extras.datasets.spark.spark_dataset.SparkDataS… │ │
│ │     credentials = {}                                                     │ │
│ │ exists_function = None                                                   │ │
│ │     file_format = 'parquet'                                              │ │
│ │        filepath = '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifie… │ │
│ │       fs_prefix = ''                                                     │ │
│ │   glob_function = None                                                   │ │
│ │       load_args = None                                                   │ │
│ │            path = PurePosixPath('/dbfs/mnt/els-nlp-experts1/data/OmniSc… │ │
│ │       save_args = None                                                   │ │
│ │            self = <kedro.extras.datasets.spark.spark_dataset.SparkDataS… │ │
│ │                   object at 0x7f4a7ca602e0>                              │ │
│ │         version = Version(load=None, save='2022-10-13T16.18.00.115Z')    │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/extras/datasets/spark/spark_dataset.py:85 in    │
│ _get_dbutils                                                                 │
│                                                                              │
│    82 │   try:                                                               │
│    83 │   │   from pyspark.dbutils import DBUtils  # pylint: disable=import- │
│    84 │   │                                                                  │
│ ❱  85 │   │   dbutils = DBUtils(spark)                                       │
│    86 │   except ImportError:                                                │
│    87 │   │   try:                                                           │
│    88 │   │   │   import IPython  # pylint: disable=import-outside-toplevel  │
│                                                                              │
│ ╭─────────────────────────────── locals ────────────────────────────────╮    │
│ │ dbutils = None                                                        │    │
│ │ DBUtils = <class 'pyspark.dbutils.DBUtils'>                           │    │
│ │   spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400> │    │
│ ╰───────────────────────────────────────────────────────────────────────╯    │
│                                                                              │
│ /databricks/spark/python/pyspark/dbutils.py:33 in __init__                   │
│                                                                              │
│    30 │   def __init__(self, spark=None):                                    │
│    31 │   │   if spark is None:                                              │
│    32 │   │   │   spark = SparkSession.builder.getOrCreate()                 │
│ ❱  33 │   │   dbutils_obj = self.get_dbutils(spark)                          │
│    34 │   │   self.fs = dbutils_obj.fs                                       │
│    35 │   │   self.secrets = dbutils_obj.secrets                             │
│    36 │   │   if spark.conf.get("spark.databricks.service.client.enabled") = │
│                                                                              │
│ ╭────────────────────────────── locals ───────────────────────────────╮      │
│ │  self = <pyspark.dbutils.DBUtils object at 0x7f4a7ca60400>          │      │
│ │ spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400> │      │
│ ╰─────────────────────────────────────────────────────────────────────╯      │
│                                                                              │
│ /databricks/spark/python/pyspark/dbutils.py:50 in get_dbutils                │
│                                                                              │
│    47 │   │   │   return SparkServiceClientDBUtils(spark.sparkContext)       │
│    48 │   │   else:                                                          │
│    49 │   │   │   import IPython                                             │
│ ❱  50 │   │   │   return IPython.get_ipython().user_ns["dbutils"]            │
│    51                                                                        │
│    52                                                                        │
│    53 class SparkServiceClientDBUtils(object):                               │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ IPython = <module 'IPython' from                                         │ │
│ │           '/databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e9259… │ │
│ │    self = <pyspark.dbutils.DBUtils object at 0x7f4a7ca60400>             │ │
│ │   spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400>    │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'user_ns'

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/bin/k │
│ edro:8 in <module>                                                           │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/cli.py:211 in main                │
│                                                                              │
│   208 │   """                                                                │
│   209 │   _init_plugins()                                                    │
│   210 │   cli_collection = KedroCLI(project_path=Path.cwd())                 │
│ ❱ 211 │   cli_collection()                                                   │
│   212                                                                        │
│                                                                              │
│ ╭───────────── locals ─────────────╮                                         │
│ │ cli_collection = <KedroCLI None> │                                         │
│ ╰──────────────────────────────────╯                                         │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1130 in __call__                        │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/cli.py:139 in main                │
│                                                                              │
│   136 │   │   )                                                              │
│   137 │   │                                                                  │
│   138 │   │   try:                                                           │
│ ❱ 139 │   │   │   super().main(                                              │
│   140 │   │   │   │   args=args,                                             │
│   141 │   │   │   │   prog_name=prog_name,                                   │
│   142 │   │   │   │   complete_var=complete_var,                             │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       __class__ = <class 'kedro.framework.cli.cli.KedroCLI'>             │ │
│ │            args = [                                                      │ │
│ │                   │   'run',                                             │ │
│ │                   │   '--env',                                           │ │
│ │                   │   'dbfs',                                            │ │
│ │                   │   '--node',                                          │ │
│ │                   │   'train_contrastive'                                │ │
│ │                   ]                                                      │ │
│ │    complete_var = None                                                   │ │
│ │           extra = {                                                      │ │
│ │                   │   'obj': ProjectMetadata(                            │ │
│ │                   │   │                                                  │ │
│ │                   config_file=PosixPath('/databricks/mlflow/projects/1b… │ │
│ │                   │   │   package_name='omnieval',                       │ │
│ │                   │   │   project_name='OmniScience classification       │ │
│ │                   evaluation framework',                                 │ │
│ │                   │   │                                                  │ │
│ │                   project_path=PosixPath('/databricks/mlflow/projects/1… │ │
│ │                   │   │   project_version='0.18.2',                      │ │
│ │                   │   │                                                  │ │
│ │                   source_dir=PosixPath('/databricks/mlflow/projects/1b1… │ │
│ │                   │   )                                                  │ │
│ │                   }                                                      │ │
│ │       prog_name = None                                                   │ │
│ │            self = <KedroCLI None>                                        │ │
│ │ standalone_mode = True                                                   │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1055 in main                            │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1657 in invoke                          │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1404 in invoke                          │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:760 in invoke                           │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/project.py:352 in run             │
│                                                                              │
│   349 │   node_names = _get_values_as_tuple(node_names) if node_names else n │
│   350 │                                                                      │
│   351 │   with KedroSession.create(env=env, extra_params=params) as session: │
│ ❱ 352 │   │   session.run(                                                   │
│   353 │   │   │   tags=tag,                                                  │
│   354 │   │   │   runner=runner(is_async=is_async),                          │
│   355 │   │   │   node_names=node_names,                                     │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       config = None                                                      │ │
│ │          env = 'dbfs'                                                    │ │
│ │  from_inputs = []                                                        │ │
│ │   from_nodes = []                                                        │ │
│ │     is_async = False                                                     │ │
│ │ load_version = {}                                                        │ │
│ │   node_names = ('train_contrastive',)                                    │ │
│ │       params = {}                                                        │ │
│ │     pipeline = None                                                      │ │
│ │       runner = <class 'kedro.runner.sequential_runner.SequentialRunner'> │ │
│ │      session = <kedro.framework.session.session.KedroSession object at   │ │
│ │                0x7f4ad70048e0>                                           │ │
│ │          tag = ()                                                        │ │
│ │     to_nodes = []                                                        │ │
│ │   to_outputs = []                                                        │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/session/session.py:389 in run         │
│                                                                              │
│   386 │   │   │   "runner": getattr(runner, "__name__", str(runner)),        │
│   387 │   │   }                                                              │
│   388 │   │                                                                  │
│ ❱ 389 │   │   catalog = context._get_catalog(                                │
│   390 │   │   │   save_version=save_version,                                 │
│   391 │   │   │   load_versions=load_versions,                               │
│   392 │   │   )                                                              │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │           context = <kedro.framework.context.context.KedroContext object │ │
│ │                     at 0x7f4ad7118c10>                                   │ │
│ │      extra_params = {}                                                   │ │
│ │ filtered_pipeline = Pipeline([                                           │ │
│ │                     Node(train_contrastive,                              │ │
│ │                     ['embeddings_on_train_specter_sampled',              │ │
│ │                     'label_encoder',                                     │ │
│ │                     'params:specter_contrastive_hparams'],               │ │
│ │                     'specter_contrastive_model', 'train_contrastive')    │ │
│ │                     ])                                                   │ │
│ │       from_inputs = []                                                   │ │
│ │        from_nodes = []                                                   │ │
│ │     load_versions = {}                                                   │ │
│ │              name = '__default__'                                        │ │
│ │        node_names = ('train_contrastive',)                               │ │
│ │          pipeline = Pipeline([                                           │ │
│ │                     Node(count_ds, ['train_test_set'], None,             │ │
│ │                     'count_train_test'),                                 │ │
│ │                     Node(download_file_s3, ['params:ft_model_path_s3',   │ │
│ │                     'params:ft_model_path_local'], None, 'download_ft'), │ │
│ │                     Node(get_ft_labels, ['params:ft_model_path_local'],  │ │
│ │                     'ft_labels', 'get_ft_labels'),                       │ │
│ │                     Node(get_labels, 'train_test_set', 'all_labels',     │ │
│ │                     'get_labels_all'),                                   │ │
│ │                     Node(split_train_test, ['train_test_set'],           │ │
│ │                     ['ds_train', 'ds_test'], 'split_train_test'),        │ │
│ │                     Node(clean_up_dataset, ['ds_test'],                  │ │
│ │                     'ds_test_cleaned', 'clean_up_test_set'),             │ │
│ │                     Node(gen_embeddings_specter, ['ds_train',            │ │
│ │                     'specter_tokenizer', 'specter_model',                │ │
│ │                     'params:specter_hparams'],                           │ │
│ │                     'embeddings_on_train_specter',                       │ │
│ │                     'embed_train_specter'),                              │ │
│ │                     Node(fit_label_encoder, 'all_labels',                │ │
│ │                     'label_encoder', 'fit_label_encoder'),               │ │
│ │                     Node(get_labels, 'ds_train', 'train_labels',         │ │
│ │                     'get_labels'),                                       │ │
│ │                     Node(sample_training_data, ['ds_train',              │ │
│ │                     'params:training_data_sample_approx_size',           │ │
│ │                     'params:training_data_sample_min_examples_per_class… │ │
│ │                     'params:training_data_sample_upsampling'],           │ │
│ │                     'ds_train_sample_index', 'sample_train'),            │ │
│ │                     ...                                                  │ │
│ │                     ])                                                   │ │
│ │     pipeline_name = None                                                 │ │
│ │       record_data = {                                                    │ │
│ │                     │   'session_id': '2022-10-13T16.18.00.115Z',        │ │
│ │                     │   'project_path':                                  │ │
│ │                     '/databricks/mlflow/projects/1b104a49cbf61560a7e5fe… │ │
│ │                     │   'env': 'dbfs',                                   │ │
│ │                     │   'kedro_version': '0.18.2',                       │ │
│ │                     │   'tags': (),                                      │ │
│ │                     │   'from_nodes': [],                                │ │
│ │                     │   'to_nodes': [],                                  │ │
│ │                     │   'node_names': ('train_contrastive',),            │ │
│ │                     │   'from_inputs': [],                               │ │
│ │                     │   'to_outputs': [],                                │ │
│ │                     │   ... +4                                           │ │
│ │                     }                                                    │ │
│ │            runner = <kedro.runner.sequential_runner.SequentialRunner     │ │
│ │                     object at 0x7f4b00a77520>                            │ │
│ │      save_version = '2022-10-13T16.18.00.115Z'                           │ │
│ │              self = <kedro.framework.session.session.KedroSession object │ │
│ │                     at 0x7f4ad70048e0>                                   │ │
│ │        session_id = '2022-10-13T16.18.00.115Z'                           │ │
│ │              tags = ()                                                   │ │
│ │          to_nodes = []                                                   │ │
│ │        to_outputs = []                                                   │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/context/context.py:286 in             │
│ _get_catalog                                                                 │
│                                                                              │
│   283 │   │   )                                                              │
│   284 │   │   conf_creds = self._get_config_credentials()                    │
│   285 │   │                                                                  │
│ ❱ 286 │   │   catalog = settings.DATA_CATALOG_CLASS.from_config(             │
│   287 │   │   │   catalog=conf_catalog,                                      │
│   288 │   │   │   credentials=conf_creds,                                    │
│   289 │   │   │   load_versions=load_versions,                               │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │  conf_catalog = {                                                        │ │
│ │                 │   'ft_labels': {                                       │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'specter_tokenizer': {                               │ │
│ │                 │   │   'type': 'pickle.PickleDataSet',                  │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │                 │   │   'backend': 'joblib'                              │ │
│ │                 │   },                                                   │ │
│ │                 │   'specter_model': {                                   │ │
│ │                 │   │   'type': 'pickle.PickleDataSet',                  │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │                 │   │   'backend': 'joblib'                              │ │
│ │                 │   },                                                   │ │
│ │                 │   'metrics_on_test_fasttext': {                        │ │
│ │                 │   │   'type':                                          │ │
│ │                 'omnieval.kedro_utils.MlflowMetricsDataSet',             │ │
│ │                 │   │   'prefix': 'test',                                │ │
│ │                 │   │   'params': {'model_type': 'fasttext'}             │ │
│ │                 │   },                                                   │ │
│ │                 │   'metrics_on_test_specter': {                         │ │
│ │                 │   │   'type':                                          │ │
│ │                 'omnieval.kedro_utils.MlflowMetricsDataSet',             │ │
│ │                 │   │   'prefix': 'test',                                │ │
│ │                 │   │   'params': {'model_type': 'specter_svm'}          │ │
│ │                 │   },                                                   │ │
│ │                 │   'train_test_set': {                                  │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet'                         │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_train': {                                        │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_train_sample_index': {                           │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_test': {                                         │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_test_cleaned': {                                 │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   ... +9                                               │ │
│ │                 }                                                        │ │
│ │    conf_creds = {}                                                       │ │
│ │ load_versions = {}                                                       │ │
│ │  save_version = '2022-10-13T16.18.00.115Z'                               │ │
│ │          self = <kedro.framework.context.context.KedroContext object at  │ │
│ │                 0x7f4ad7118c10>                                          │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/data_catalog.py:277 in from_config           │
│                                                                              │
│   274 │   │   │   │   layers[ds_layer].add(ds_name)                          │
│   275 │   │   │                                                              │
│   276 │   │   │   ds_config = _resolve_credentials(ds_config, credentials)   │
│ ❱ 277 │   │   │   data_sets[ds_name] = AbstractDataSet.from_config(          │
│   278 │   │   │   │   ds_name, ds_config, load_versions.get(ds_name), save_v │
│   279 │   │   │   )                                                          │
│   280                                                                        │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       catalog = {                                                        │ │
│ │                 │   'ft_labels': {                                       │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'specter_tokenizer': {                               │ │
│ │                 │   │   'type': 'pickle.PickleDataSet',                  │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │                 │   │   'backend': 'joblib'                              │ │
│ │                 │   },                                                   │ │
│ │                 │   'specter_model': {                                   │ │
│ │                 │   │   'type': 'pickle.PickleDataSet',                  │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │                 │   │   'backend': 'joblib'                              │ │
│ │                 │   },                                                   │ │
│ │                 │   'metrics_on_test_fasttext': {                        │ │
│ │                 │   │   'type':                                          │ │
│ │                 'omnieval.kedro_utils.MlflowMetricsDataSet',             │ │
│ │                 │   │   'prefix': 'test',                                │ │
│ │                 │   │   'params': {'model_type': 'fasttext'}             │ │
│ │                 │   },                                                   │ │
│ │                 │   'metrics_on_test_specter': {                         │ │
│ │                 │   │   'type':                                          │ │
│ │                 'omnieval.kedro_utils.MlflowMetricsDataSet',             │ │
│ │                 │   │   'prefix': 'test',                                │ │
│ │                 │   │   'params': {'model_type': 'specter_svm'}          │ │
│ │                 │   },                                                   │ │
│ │                 │   'train_test_set': {                                  │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet'                         │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_train': {                                        │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_train_sample_index': {                           │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_test': {                                         │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_test_cleaned': {                                 │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   ... +9                                               │ │
│ │                 }                                                        │ │
│ │           cls = <class 'kedro.io.data_catalog.DataCatalog'>              │ │
│ │   credentials = {}                                                       │ │
│ │     data_sets = {}                                                       │ │
│ │     ds_config = {                                                        │ │
│ │                 │   'type': 'spark.SparkDataSet',                        │ │
│ │                 │   'filepath':                                          │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   'file_format': 'parquet',                            │ │
│ │                 │   'versioned': True                                    │ │
│ │                 }                                                        │ │
│ │      ds_layer = None                                                     │ │
│ │       ds_name = 'ft_labels'                                              │ │
│ │        layers = defaultdict(<class 'set'>, {})                           │ │
│ │ load_versions = {}                                                       │ │
│ │  missing_keys = set()                                                    │ │
│ │  save_version = '2022-10-13T16.18.00.115Z'                               │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/core.py:162 in from_config                   │
│                                                                              │
│   159 │   │   │   │   f"constructor of '{class_obj.__module__}.{class_obj.__ │
│   160 │   │   │   ) from err                                                 │
│   161 │   │   except Exception as err:                                       │
│ ❱ 162 │   │   │   raise DataSetError(                                        │
│   163 │   │   │   │   f"\n{err}.\nFailed to instantiate DataSet '{name}' "   │
│   164 │   │   │   │   f"of type '{class_obj.__module__}.{class_obj.__qualnam │
│   165 │   │   │   ) from err                                                 │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │    class_obj = <class                                                    │ │
│ │                'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'> │ │
│ │          cls = <class 'kedro.io.core.AbstractDataSet'>                   │ │
│ │       config = {                                                         │ │
│ │                │   'filepath':                                           │ │
│ │                '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/t… │ │
│ │                │   'file_format': 'parquet',                             │ │
│ │                │   'version': Version(                                   │ │
│ │                │   │   load=None,                                        │ │
│ │                │   │   save='2022-10-13T16.18.00.115Z'                   │ │
│ │                │   )                                                     │ │
│ │                }                                                         │ │
│ │ load_version = None                                                      │ │
│ │         name = 'ft_labels'                                               │ │
│ │ save_version = '2022-10-13T16.18.00.115Z'                                │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
DataSetError: 
'NoneType' object has no attribute 'user_ns'.
Failed to instantiate DataSet 'ft_labels' of type 
'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'.
2022/10/13 16:18:13 ERROR mlflow.cli: === Run (ID 'dd72997a9cfc4044836dedee6aeef61d') failed ===

Member

Interesting...

I got confused because the stack trace you shared has:

48 │   │   else:                                                          │
│    49 │   │   │   import IPython                                             │
│ ❱  50 │   │   │   return IPython.get_ipython().user_ns["dbutils"]   

Based on the error message, I previously thought it was choking on this very similar Kedro code.

I'm inclined to think Databricks should add a safeguard here (or figure out why this is happening). In the interim, I suppose we could catch AttributeError with a comment? I think that resolves your issue, while not potentially opening the door to swallowing other exceptions unknowingly.

@yetudada @idanov Is there anybody at Databricks we can verify this behavior with? :)

@mle-els (Author) commented Oct 22, 2022

I made the try-catch narrower.

@merelcht mentioned this pull request Nov 7, 2022
@merelcht (Member) left a comment

Thanks for your contribution @mle-els. This looks like a reasonable solution to me 🙂 Would you mind adding a note to the release notes about this change?

I also noticed the DCO check is failing. You can click on the check and follow the instructions to make it pass. For more info about this check see: https://kedro.readthedocs.io/en/stable/contribution/developer_contributor_guidelines.html#developer-certificate-of-origin

mle-els and others added 12 commits November 7, 2022 16:56
avoid error in a file system that happen to have "/dbfs" paths

Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
* Update dependabot.yml

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* pin jupyterlab_services to requirments

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* lint

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…o-org#1957)

Updates the requirements on [pip-tools](https://github.com/jazzband/pip-tools) to permit the latest version.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/master/CHANGELOG.md)
- [Commits](jazzband/pip-tools@6.5.0...6.9.0)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…-org#1956)

Updates the requirements on [toposort]() to permit the latest version.

---
updated-dependencies:
- dependency-name: toposort
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…edro-org#1953)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
* remove a redundant function call

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Remove redundant resolove_load_version & fix test

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Fix HoloviewWriter tests with more specific error message pattern & Lint

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Rename tests

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Jo Stichbury <jo.stichbury@quantumblack.com>

Signed-off-by: Jo Stichbury <jo.stichbury@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…support for `tf.device` (kedro-org#1915)

* Fix issue with save operation. Add gpu option

Signed-off-by: William Caicedo <williamc@movio.co>

* Add tests

Signed-off-by: William Caicedo <williamc@movio.co>

* Update RELEASE.md

Signed-off-by: William Caicedo <williamc@movio.co>

* Update test description

Signed-off-by: William Caicedo <williamc@movio.co>

* Remove double slash and overwrite flag in fsspec.put method invocation

Signed-off-by: William Caicedo <williamc@movio.co>

* Allow to explicitly set device name

Signed-off-by: William Caicedo <williamc@movio.co>

* Update RELEASE.md

Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: William Caicedo <williamc@movio.co>

* Update docs

Signed-off-by: William Caicedo <williamc@movio.co>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
avoid error in a file system that happen to have "/dbfs" paths

Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
@mle-els (Author) commented Nov 7, 2022

Thanks for your contribution @mle-els. This looks like a reasonable solution to me 🙂 Would you mind adding a note to the release notes about this change?

I also noticed the DCO check is failing. You can click on the check and follow the instructions to make it pass. For more info about this check see: https://kedro.readthedocs.io/en/stable/contribution/developer_contributor_guidelines.html#developer-certificate-of-origin

@merelcht: Thanks for replying. After signing off, I'm getting this error: Author: Minh Le, Committer: Minh Le; Expected "Minh Le <57996662+mle-els@users.noreply.github.com>", but got "Minh Le <m.le@elsevier.com>".

I wrote the changes in the web interface; that's probably why the email doesn't match. Is there any way to fix this other than closing this PR and creating a new one?

@merelcht (Member) commented Nov 7, 2022

I had written the changes on the web interface, probably that's why the email doesn't match. Is there any way to fix this other than closing this PR and creating a new one?

Let me have a look at this!

mle-els and others added 4 commits November 7, 2022 16:12
avoid error in a file system that happen to have "/dbfs" paths

Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht (Member) commented Nov 8, 2022

@mle-els The remaining failing checks are because of coverage not reaching 100%. Can you add a test for your code?

Signed-off-by: Minh Le <m.le@elsevier.com>
mle-els and others added 3 commits November 9, 2022 09:52
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht (Member) left a comment

Thanks for adding a test and updating the release notes! 👍 ⭐

@merelcht requested a review from deepyaman November 9, 2022 10:48
@noklam self-requested a review November 10, 2022 13:44
Comment on lines +312 to +318
dbutils = None
try:
dbutils = _get_dbutils(self._get_spark())
except AttributeError:
# Databricks is known to raise AttributeError when called
# on an unsupported environment
pass
Contributor

I am not sure I understand the root cause entirely. Is this a bug in Databricks' pyspark.dbutils module, or is it because we check the filepath too eagerly in Kedro?

The _get_dbutils function is supposed to try aggressively to get dbutils, and just return None if it can't. This solution adds yet another try-except layer outside, which is a bit hacky, but maybe necessary in this case? I want to make sure I understand the problem before coming to a conclusion.

Would it be better to have this try-except block inside the _get_dbutils function, if it's necessary?

Author

I believe it's a bug in Databricks' code. It assumes that IPython.get_ipython() returns an object; when it happens to return None, we get an AttributeError.

/databricks/spark/python/pyspark/dbutils.py:50 in get_dbutils                │
│                                                                              │
│    47 │   │   │   return SparkServiceClientDBUtils(spark.sparkContext)       │
│    48 │   │   else:                                                          │
│    49 │   │   │   import IPython                                             │
│ ❱  50 │   │   │   return IPython.get_ipython().user_ns["dbutils"]            │
│    51                                                                        │
│    52                                                                        │
│    53 class SparkServiceClientDBUtils(object):                               │

I think having the try-except block inside _get_dbutils is a better solution indeed. Thanks for pointing that out!
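
Something like this, perhaps — a sketch assuming the helper otherwise keeps its current shape (not tested):

def _get_dbutils(spark):
    try:
        from pyspark.dbutils import DBUtils  # Databricks-only module

        return DBUtils(spark)
    except ImportError:
        # fall back to the IPython user namespace, guarding against
        # get_ipython() returning None
        try:
            import IPython
        except ImportError:
            return None
        ipython = IPython.get_ipython()
        return ipython.user_ns.get("dbutils") if ipython else None
    except AttributeError:
        # pyspark.dbutils.DBUtils.__init__ itself calls
        # IPython.get_ipython().user_ns and raises AttributeError when no
        # IPython kernel is attached (e.g. jobs submitted via MLflow)
        return None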

Contributor

Is this pyspark.dbutils module only available on the Databricks runtime? If so, I think that's why it assumes you have IPython. Also, you mentioned you are running on Databricks but not in a managed way; I'm not aware of an on-premise option. What kind of environment are you running on?

Author

I was trying to run on a normal Databricks cluster, just with MLflow instead of a notebook. I managed to run pipelines via a notebook too, but it would have been better to do it through the command line. When I run mlflow run, MLflow packages my project into a zip file, sends it to a new Databricks cluster, and runs it there. Apparently, because it's not in a notebook, there's no IPython.

If you think this use case is worth supporting, I can make the change you proposed.

Contributor

@mle-els I prefer moving the try-except block into _get_dbutils. For IPython I'm unsure: even when running a .py file, IPython is normally available, but Databricks doesn't document this.

Contributor

@mle-els I won't be able to test it myself since I don't have the environment configured. My guess is that you have a relatively old Databricks runtime.

We tested this recently with dbx, which packages up a project and runs it as a Databricks Job, and IPython was available in that case.

This suggests Databricks runtime 11+ always runs on IPython; the page below mentions notebooks only, but as we tested a couple of months ago, the same holds for .py files.
https://docs.databricks.com/notebooks/ipython-kernel.html#how-to-use-the-ipython-kernel-with-databricks

Author

I'll try running the code on a newer runtime when I find some free time.

Contributor

Hi @mle-els, we'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details).

Do you think you can find time this week? Otherwise, we'll close this PR and ask you to re-open it on the new repo when it's ready.

Author

This week I'm swamped, unfortunately :( Please feel free to close it.

Contributor

@mle-els No worries! Feel free to re-open the PR in the kedro-plugins repository when you are free to work on it again. :)

@noklam self-assigned this Nov 15, 2022
Comment on lines +617 to +626
def test_ds_init_get_dbutils_raises_exception(self, mocker):
get_dbutils_mock = mocker.Mock()
get_dbutils_mock.side_effect = AttributeError
get_dbutils_mock = mocker.patch(
"kedro.extras.datasets.spark.spark_dataset._get_dbutils", get_dbutils_mock
)

data_set = SparkDataSet(filepath="/dbfs/tmp/data")
assert data_set._glob_function.__name__ == "iglob"

Contributor

The test and the assertion don't seem to match here. This would also become obsolete if the try-except is moved into _get_dbutils, so it would need some modification.
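
If the try-except does move, the test could target the helper directly — a sketch, assuming open-source pyspark (which ships no dbutils module), so a fake module is injected for the constructor to raise from:

import sys
import types

def test_get_dbutils_returns_none_on_attribute_error(self, mocker):
    # inject a fake pyspark.dbutils whose DBUtils constructor raises the
    # way Databricks does when IPython.get_ipython() returns None
    fake = types.ModuleType("pyspark.dbutils")
    fake.DBUtils = mocker.Mock(side_effect=AttributeError)
    mocker.patch.dict(sys.modules, {"pyspark.dbutils": fake})

    assert _get_dbutils(mocker.Mock()) is None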

@noklam closed this Nov 17, 2022