safer use of "/dbfs" #1931
I'm not sure how this is happening; can you share a full stack trace? If
    except:
        pass
I wouldn't generally call swallowing the exception safe. 😅 But, as posted just a second ago, I'm not sure why an error (especially that which you shared) is occurring here to begin with.
Hi there, this is the full stack trace. The error occurs when Kedro tries to read a dataset whose path starts with `/dbfs`. I wouldn't normally skip exceptions either, but this is warranted because the assumption that `/dbfs` has to do with Databricks is risky in the first place.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/core.py:155 in from_config │
│ │
│ 152 │ │ │ ) from exc │
│ 153 │ │ │
│ 154 │ │ try: │
│ ❱ 155 │ │ │ data_set = class_obj(**config) # type: ignore │
│ 156 │ │ except TypeError as err: │
│ 157 │ │ │ raise DataSetError( │
│ 158 │ │ │ │ f"\n{err}.\nDataSet '{name}' must only contain argumen │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ class_obj = <class │ │
│ │ 'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'> │ │
│ │ cls = <class 'kedro.io.core.AbstractDataSet'> │ │
│ │ config = { │ │
│ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/t… │ │
│ │ │ 'file_format': 'parquet', │ │
│ │ │ 'version': Version( │ │
│ │ │ │ load=None, │ │
│ │ │ │ save='2022-10-13T16.18.00.115Z' │ │
│ │ │ ) │ │
│ │ } │ │
│ │ load_version = None │ │
│ │ name = 'ft_labels' │ │
│ │ save_version = '2022-10-13T16.18.00.115Z' │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/extras/datasets/spark/spark_dataset.py:308 in │
│ __init__ │
│ │
│ 305 │ │ │ path = PurePosixPath(filepath) │
│ 306 │ │ │ │
│ 307 │ │ │ if filepath.startswith("/dbfs"): │
│ ❱ 308 │ │ │ │ dbutils = _get_dbutils(self._get_spark()) │
│ 309 │ │ │ │ if dbutils: │
│ 310 │ │ │ │ │ glob_function = partial(_dbfs_glob, dbutils=dbutil │
│ 311 │ │ │ │ │ exists_function = partial(_dbfs_exists, dbutils=db │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ __class__ = <class │ │
│ │ 'kedro.extras.datasets.spark.spark_dataset.SparkDataS… │ │
│ │ credentials = {} │ │
│ │ exists_function = None │ │
│ │ file_format = 'parquet' │ │
│ │ filepath = '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifie… │ │
│ │ fs_prefix = '' │ │
│ │ glob_function = None │ │
│ │ load_args = None │ │
│ │ path = PurePosixPath('/dbfs/mnt/els-nlp-experts1/data/OmniSc… │ │
│ │ save_args = None │ │
│ │ self = <kedro.extras.datasets.spark.spark_dataset.SparkDataS… │ │
│ │ object at 0x7f4a7ca602e0> │ │
│ │ version = Version(load=None, save='2022-10-13T16.18.00.115Z') │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/extras/datasets/spark/spark_dataset.py:85 in │
│ _get_dbutils │
│ │
│ 82 │ try: │
│ 83 │ │ from pyspark.dbutils import DBUtils # pylint: disable=import- │
│ 84 │ │ │
│ ❱ 85 │ │ dbutils = DBUtils(spark) │
│ 86 │ except ImportError: │
│ 87 │ │ try: │
│ 88 │ │ │ import IPython # pylint: disable=import-outside-toplevel │
│ │
│ ╭─────────────────────────────── locals ────────────────────────────────╮ │
│ │ dbutils = None │ │
│ │ DBUtils = <class 'pyspark.dbutils.DBUtils'> │ │
│ │ spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400> │ │
│ ╰───────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/spark/python/pyspark/dbutils.py:33 in __init__ │
│ │
│ 30 │ def __init__(self, spark=None): │
│ 31 │ │ if spark is None: │
│ 32 │ │ │ spark = SparkSession.builder.getOrCreate() │
│ ❱ 33 │ │ dbutils_obj = self.get_dbutils(spark) │
│ 34 │ │ self.fs = dbutils_obj.fs │
│ 35 │ │ self.secrets = dbutils_obj.secrets │
│ 36 │ │ if spark.conf.get("spark.databricks.service.client.enabled") = │
│ │
│ ╭────────────────────────────── locals ───────────────────────────────╮ │
│ │ self = <pyspark.dbutils.DBUtils object at 0x7f4a7ca60400> │ │
│ │ spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400> │ │
│ ╰─────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/spark/python/pyspark/dbutils.py:50 in get_dbutils │
│ │
│ 47 │ │ │ return SparkServiceClientDBUtils(spark.sparkContext) │
│ 48 │ │ else: │
│ 49 │ │ │ import IPython │
│ ❱ 50 │ │ │ return IPython.get_ipython().user_ns["dbutils"] │
│ 51 │
│ 52 │
│ 53 class SparkServiceClientDBUtils(object): │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ IPython = <module 'IPython' from │ │
│ │ '/databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e9259… │ │
│ │ self = <pyspark.dbutils.DBUtils object at 0x7f4a7ca60400> │ │
│ │ spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'user_ns'
The above exception was the direct cause of the following exception:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/bin/k │
│ edro:8 in <module> │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/cli.py:211 in main │
│ │
│ 208 │ """ │
│ 209 │ _init_plugins() │
│ 210 │ cli_collection = KedroCLI(project_path=Path.cwd()) │
│ ❱ 211 │ cli_collection() │
│ 212 │
│ │
│ ╭───────────── locals ─────────────╮ │
│ │ cli_collection = <KedroCLI None> │ │
│ ╰──────────────────────────────────╯ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1130 in __call__ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/cli.py:139 in main │
│ │
│ 136 │ │ ) │
│ 137 │ │ │
│ 138 │ │ try: │
│ ❱ 139 │ │ │ super().main( │
│ 140 │ │ │ │ args=args, │
│ 141 │ │ │ │ prog_name=prog_name, │
│ 142 │ │ │ │ complete_var=complete_var, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ __class__ = <class 'kedro.framework.cli.cli.KedroCLI'> │ │
│ │ args = [ │ │
│ │ │ 'run', │ │
│ │ │ '--env', │ │
│ │ │ 'dbfs', │ │
│ │ │ '--node', │ │
│ │ │ 'train_contrastive' │ │
│ │ ] │ │
│ │ complete_var = None │ │
│ │ extra = { │ │
│ │ │ 'obj': ProjectMetadata( │ │
│ │ │ │ │ │
│ │ config_file=PosixPath('/databricks/mlflow/projects/1b… │ │
│ │ │ │ package_name='omnieval', │ │
│ │ │ │ project_name='OmniScience classification │ │
│ │ evaluation framework', │ │
│ │ │ │ │ │
│ │ project_path=PosixPath('/databricks/mlflow/projects/1… │ │
│ │ │ │ project_version='0.18.2', │ │
│ │ │ │ │ │
│ │ source_dir=PosixPath('/databricks/mlflow/projects/1b1… │ │
│ │ │ ) │ │
│ │ } │ │
│ │ prog_name = None │ │
│ │ self = <KedroCLI None> │ │
│ │ standalone_mode = True │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1055 in main │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1657 in invoke │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1404 in invoke │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:760 in invoke │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/project.py:352 in run │
│ │
│ 349 │ node_names = _get_values_as_tuple(node_names) if node_names else n │
│ 350 │ │
│ 351 │ with KedroSession.create(env=env, extra_params=params) as session: │
│ ❱ 352 │ │ session.run( │
│ 353 │ │ │ tags=tag, │
│ 354 │ │ │ runner=runner(is_async=is_async), │
│ 355 │ │ │ node_names=node_names, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ config = None │ │
│ │ env = 'dbfs' │ │
│ │ from_inputs = [] │ │
│ │ from_nodes = [] │ │
│ │ is_async = False │ │
│ │ load_version = {} │ │
│ │ node_names = ('train_contrastive',) │ │
│ │ params = {} │ │
│ │ pipeline = None │ │
│ │ runner = <class 'kedro.runner.sequential_runner.SequentialRunner'> │ │
│ │ session = <kedro.framework.session.session.KedroSession object at │ │
│ │ 0x7f4ad70048e0> │ │
│ │ tag = () │ │
│ │ to_nodes = [] │ │
│ │ to_outputs = [] │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/session/session.py:389 in run │
│ │
│ 386 │ │ │ "runner": getattr(runner, "__name__", str(runner)), │
│ 387 │ │ } │
│ 388 │ │ │
│ ❱ 389 │ │ catalog = context._get_catalog( │
│ 390 │ │ │ save_version=save_version, │
│ 391 │ │ │ load_versions=load_versions, │
│ 392 │ │ ) │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ context = <kedro.framework.context.context.KedroContext object │ │
│ │ at 0x7f4ad7118c10> │ │
│ │ extra_params = {} │ │
│ │ filtered_pipeline = Pipeline([ │ │
│ │ Node(train_contrastive, │ │
│ │ ['embeddings_on_train_specter_sampled', │ │
│ │ 'label_encoder', │ │
│ │ 'params:specter_contrastive_hparams'], │ │
│ │ 'specter_contrastive_model', 'train_contrastive') │ │
│ │ ]) │ │
│ │ from_inputs = [] │ │
│ │ from_nodes = [] │ │
│ │ load_versions = {} │ │
│ │ name = '__default__' │ │
│ │ node_names = ('train_contrastive',) │ │
│ │ pipeline = Pipeline([ │ │
│ │ Node(count_ds, ['train_test_set'], None, │ │
│ │ 'count_train_test'), │ │
│ │ Node(download_file_s3, ['params:ft_model_path_s3', │ │
│ │ 'params:ft_model_path_local'], None, 'download_ft'), │ │
│ │ Node(get_ft_labels, ['params:ft_model_path_local'], │ │
│ │ 'ft_labels', 'get_ft_labels'), │ │
│ │ Node(get_labels, 'train_test_set', 'all_labels', │ │
│ │ 'get_labels_all'), │ │
│ │ Node(split_train_test, ['train_test_set'], │ │
│ │ ['ds_train', 'ds_test'], 'split_train_test'), │ │
│ │ Node(clean_up_dataset, ['ds_test'], │ │
│ │ 'ds_test_cleaned', 'clean_up_test_set'), │ │
│ │ Node(gen_embeddings_specter, ['ds_train', │ │
│ │ 'specter_tokenizer', 'specter_model', │ │
│ │ 'params:specter_hparams'], │ │
│ │ 'embeddings_on_train_specter', │ │
│ │ 'embed_train_specter'), │ │
│ │ Node(fit_label_encoder, 'all_labels', │ │
│ │ 'label_encoder', 'fit_label_encoder'), │ │
│ │ Node(get_labels, 'ds_train', 'train_labels', │ │
│ │ 'get_labels'), │ │
│ │ Node(sample_training_data, ['ds_train', │ │
│ │ 'params:training_data_sample_approx_size', │ │
│ │ 'params:training_data_sample_min_examples_per_class… │ │
│ │ 'params:training_data_sample_upsampling'], │ │
│ │ 'ds_train_sample_index', 'sample_train'), │ │
│ │ ... │ │
│ │ ]) │ │
│ │ pipeline_name = None │ │
│ │ record_data = { │ │
│ │ │ 'session_id': '2022-10-13T16.18.00.115Z', │ │
│ │ │ 'project_path': │ │
│ │ '/databricks/mlflow/projects/1b104a49cbf61560a7e5fe… │ │
│ │ │ 'env': 'dbfs', │ │
│ │ │ 'kedro_version': '0.18.2', │ │
│ │ │ 'tags': (), │ │
│ │ │ 'from_nodes': [], │ │
│ │ │ 'to_nodes': [], │ │
│ │ │ 'node_names': ('train_contrastive',), │ │
│ │ │ 'from_inputs': [], │ │
│ │ │ 'to_outputs': [], │ │
│ │ │ ... +4 │ │
│ │ } │ │
│ │ runner = <kedro.runner.sequential_runner.SequentialRunner │ │
│ │ object at 0x7f4b00a77520> │ │
│ │ save_version = '2022-10-13T16.18.00.115Z' │ │
│ │ self = <kedro.framework.session.session.KedroSession object │ │
│ │ at 0x7f4ad70048e0> │ │
│ │ session_id = '2022-10-13T16.18.00.115Z' │ │
│ │ tags = () │ │
│ │ to_nodes = [] │ │
│ │ to_outputs = [] │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/context/context.py:286 in │
│ _get_catalog │
│ │
│ 283 │ │ ) │
│ 284 │ │ conf_creds = self._get_config_credentials() │
│ 285 │ │ │
│ ❱ 286 │ │ catalog = settings.DATA_CATALOG_CLASS.from_config( │
│ 287 │ │ │ catalog=conf_catalog, │
│ 288 │ │ │ credentials=conf_creds, │
│ 289 │ │ │ load_versions=load_versions, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ conf_catalog = { │ │
│ │ │ 'ft_labels': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ 'specter_tokenizer': { │ │
│ │ │ │ 'type': 'pickle.PickleDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │ │ │ 'backend': 'joblib' │ │
│ │ │ }, │ │
│ │ │ 'specter_model': { │ │
│ │ │ │ 'type': 'pickle.PickleDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │ │ │ 'backend': 'joblib' │ │
│ │ │ }, │ │
│ │ │ 'metrics_on_test_fasttext': { │ │
│ │ │ │ 'type': │ │
│ │ 'omnieval.kedro_utils.MlflowMetricsDataSet', │ │
│ │ │ │ 'prefix': 'test', │ │
│ │ │ │ 'params': {'model_type': 'fasttext'} │ │
│ │ │ }, │ │
│ │ │ 'metrics_on_test_specter': { │ │
│ │ │ │ 'type': │ │
│ │ 'omnieval.kedro_utils.MlflowMetricsDataSet', │ │
│ │ │ │ 'prefix': 'test', │ │
│ │ │ │ 'params': {'model_type': 'specter_svm'} │ │
│ │ │ }, │ │
│ │ │ 'train_test_set': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet' │ │
│ │ │ }, │ │
│ │ │ 'ds_train': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ 'ds_train_sample_index': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ 'ds_test': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ 'ds_test_cleaned': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ ... +9 │ │
│ │ } │ │
│ │ conf_creds = {} │ │
│ │ load_versions = {} │ │
│ │ save_version = '2022-10-13T16.18.00.115Z' │ │
│ │ self = <kedro.framework.context.context.KedroContext object at │ │
│ │ 0x7f4ad7118c10> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/data_catalog.py:277 in from_config │
│ │
│ 274 │ │ │ │ layers[ds_layer].add(ds_name) │
│ 275 │ │ │ │
│ 276 │ │ │ ds_config = _resolve_credentials(ds_config, credentials) │
│ ❱ 277 │ │ │ data_sets[ds_name] = AbstractDataSet.from_config( │
│ 278 │ │ │ │ ds_name, ds_config, load_versions.get(ds_name), save_v │
│ 279 │ │ │ ) │
│ 280 │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ catalog = { │ │
│ │ │ 'ft_labels': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ 'specter_tokenizer': { │ │
│ │ │ │ 'type': 'pickle.PickleDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │ │ │ 'backend': 'joblib' │ │
│ │ │ }, │ │
│ │ │ 'specter_model': { │ │
│ │ │ │ 'type': 'pickle.PickleDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │ │ │ 'backend': 'joblib' │ │
│ │ │ }, │ │
│ │ │ 'metrics_on_test_fasttext': { │ │
│ │ │ │ 'type': │ │
│ │ 'omnieval.kedro_utils.MlflowMetricsDataSet', │ │
│ │ │ │ 'prefix': 'test', │ │
│ │ │ │ 'params': {'model_type': 'fasttext'} │ │
│ │ │ }, │ │
│ │ │ 'metrics_on_test_specter': { │ │
│ │ │ │ 'type': │ │
│ │ 'omnieval.kedro_utils.MlflowMetricsDataSet', │ │
│ │ │ │ 'prefix': 'test', │ │
│ │ │ │ 'params': {'model_type': 'specter_svm'} │ │
│ │ │ }, │ │
│ │ │ 'train_test_set': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet' │ │
│ │ │ }, │ │
│ │ │ 'ds_train': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ 'ds_train_sample_index': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ 'ds_test': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ 'ds_test_cleaned': { │ │
│ │ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ │ 'file_format': 'parquet', │ │
│ │ │ │ 'versioned': True │ │
│ │ │ }, │ │
│ │ │ ... +9 │ │
│ │ } │ │
│ │ cls = <class 'kedro.io.data_catalog.DataCatalog'> │ │
│ │ credentials = {} │ │
│ │ data_sets = {} │ │
│ │ ds_config = { │ │
│ │ │ 'type': 'spark.SparkDataSet', │ │
│ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │ │ 'file_format': 'parquet', │ │
│ │ │ 'versioned': True │ │
│ │ } │ │
│ │ ds_layer = None │ │
│ │ ds_name = 'ft_labels' │ │
│ │ layers = defaultdict(<class 'set'>, {}) │ │
│ │ load_versions = {} │ │
│ │ missing_keys = set() │ │
│ │ save_version = '2022-10-13T16.18.00.115Z' │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/core.py:162 in from_config │
│ │
│ 159 │ │ │ │ f"constructor of '{class_obj.__module__}.{class_obj.__ │
│ 160 │ │ │ ) from err │
│ 161 │ │ except Exception as err: │
│ ❱ 162 │ │ │ raise DataSetError( │
│ 163 │ │ │ │ f"\n{err}.\nFailed to instantiate DataSet '{name}' " │
│ 164 │ │ │ │ f"of type '{class_obj.__module__}.{class_obj.__qualnam │
│ 165 │ │ │ ) from err │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ class_obj = <class │ │
│ │ 'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'> │ │
│ │ cls = <class 'kedro.io.core.AbstractDataSet'> │ │
│ │ config = { │ │
│ │ │ 'filepath': │ │
│ │ '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/t… │ │
│ │ │ 'file_format': 'parquet', │ │
│ │ │ 'version': Version( │ │
│ │ │ │ load=None, │ │
│ │ │ │ save='2022-10-13T16.18.00.115Z' │ │
│ │ │ ) │ │
│ │ } │ │
│ │ load_version = None │ │
│ │ name = 'ft_labels' │ │
│ │ save_version = '2022-10-13T16.18.00.115Z' │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
DataSetError:
'NoneType' object has no attribute 'user_ns'.
Failed to instantiate DataSet 'ft_labels' of type
'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'.
2022/10/13 16:18:13 ERROR mlflow.cli: === Run (ID 'dd72997a9cfc4044836dedee6aeef61d') failed ===
Interesting...
I got confused because the stack trace you shared has:
│ 48 │ │ else: │
│ 49 │ │ │ import IPython │
│ ❱ 50 │ │ │ return IPython.get_ipython().user_ns["dbutils"]
Based on the error message, I previously thought it was choking on this very similar Kedro code.
I'm inclined to think Databricks should add a safeguard here (or figure out why this is happening). In the interim, I suppose we could catch `AttributeError` with a comment? I think that resolves your issue without opening the door to unknowingly swallowing other exceptions.
@yetudada @idanov Is there anybody at Databricks we can verify this behavior with? :)
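For illustration, here is a minimal sketch of the difference between the bare `except:` and the narrower `except AttributeError` discussed above. All function names are hypothetical stand-ins, not Kedro or Databricks APIs; `missing_ipython_shell` only mimics the failing line from the traceback.

```python
def call_swallowing_everything(func):
    """Bare `except:` hides every failure, including unrelated bugs."""
    try:
        return func()
    except:  # noqa: E722 - shown only to illustrate the problem
        return None


def call_catching_attribute_error(func):
    """Only the known failure is suppressed; anything else propagates."""
    try:
        return func()
    except AttributeError:
        # Known failure: no IPython shell, so get_ipython() returned None.
        return None


def missing_ipython_shell():
    """Mimics `IPython.get_ipython().user_ns` when get_ipython() is None."""
    shell = None
    return shell.user_ns["dbutils"]


def unrelated_bug():
    """Stands in for any other error that should not be hidden."""
    raise ValueError("misconfigured dataset")
```

With the narrow handler, `call_catching_attribute_error(unrelated_bug)` still raises `ValueError`, whereas the bare handler would return `None` and hide it.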
I made the try-catch narrower.
Thanks for your contribution @mle-els. This looks like a reasonable solution to me 🙂 Would you mind adding a note to the release notes about this change?
I also noticed the DCO check is failing. You can click on the check and follow the instructions to make it pass. For more info about this check see: https://kedro.readthedocs.io/en/stable/contribution/developer_contributor_guidelines.html#developer-certificate-of-origin
avoid error in a file system that happen to have "/dbfs" paths Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
* Update dependabot.yml Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com> * pin jupyterlab_services to requirments Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com> * lint Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com> Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com> Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…o-org#1957) Updates the requirements on [pip-tools](https://github.com/jazzband/pip-tools) to permit the latest version. - [Release notes](https://github.com/jazzband/pip-tools/releases) - [Changelog](https://github.com/jazzband/pip-tools/blob/master/CHANGELOG.md) - [Commits](jazzband/pip-tools@6.5.0...6.9.0) --- updated-dependencies: - dependency-name: pip-tools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Minh Le <m.le@elsevier.com>
…-org#1956) Updates the requirements on [toposort]() to permit the latest version. --- updated-dependencies: - dependency-name: toposort dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com> Signed-off-by: Minh Le <m.le@elsevier.com>
…edro-org#1953) Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Signed-off-by: Minh Le <m.le@elsevier.com>
* remove a redundant function call Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com> * Remove redundant resolove_load_version & fix test Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com> * Fix HoloviewWriter tests with more specific error message pattern & Lint Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com> * Rename tests Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com> Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com> Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Jo Stichbury <jo.stichbury@quantumblack.com> Signed-off-by: Jo Stichbury <jo.stichbury@quantumblack.com> Signed-off-by: Minh Le <m.le@elsevier.com>
…support for `tf.device` (kedro-org#1915) * Fix issue with save operation. Add gpu option Signed-off-by: William Caicedo <williamc@movio.co> * Add tests Signed-off-by: William Caicedo <williamc@movio.co> * Update RELEASE.md Signed-off-by: William Caicedo <williamc@movio.co> * Update test description Signed-off-by: William Caicedo <williamc@movio.co> * Remove double slash and overwrite flag in fsspec.put method invocation Signed-off-by: William Caicedo <williamc@movio.co> * Allow to explicitly set device name Signed-off-by: William Caicedo <williamc@movio.co> * Update RELEASE.md Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: William Caicedo <williamc@movio.co> * Update docs Signed-off-by: William Caicedo <williamc@movio.co> Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
avoid error in a file system that happen to have "/dbfs" paths Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
@merelcht: Thanks for replying. After signing off, I'm getting this error: I had written the changes in the web interface, which is probably why the email doesn't match. Is there any way to fix this other than closing this PR and creating a new one?
Let me have a look at this!
avoid error in a file system that happen to have "/dbfs" paths Signed-off-by: Minh Le <m.le@elsevier.com> Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com> Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@mle-els The remaining failing checks are because coverage is not reaching 100%. Can you add a test for your code?
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
Thanks for adding a test and updating the release notes! 👍 ⭐
    dbutils = None
    try:
        dbutils = _get_dbutils(self._get_spark())
    except AttributeError:
        # Databricks is known to raise AttributeError when called
        # on an unsupported environment
        pass
I am not sure I understand the root cause entirely. Is this a bug in the Databricks `pyspark.dbutils` module, or is it because we check the filepath too eagerly in Kedro?
The `_get_dbutils` function is supposed to try aggressively to get the dbutils object and, if it can't, just return `None`. Adding yet another try-except layer outside it is a bit hacky, but maybe necessary in this case? I want to make sure I understand the problem before I come to a conclusion.
Would it be better to have this try-except block inside the `_get_dbutils` function, if it is necessary?
I believe it's a bug in the Databricks code. It assumes that `IPython.get_ipython()` returns an object. When it happens to return `None`, we get an `AttributeError`.
│ /databricks/spark/python/pyspark/dbutils.py:50 in get_dbutils │
│ │
│ 47 │ │ │ return SparkServiceClientDBUtils(spark.sparkContext) │
│ 48 │ │ else: │
│ 49 │ │ │ import IPython │
│ ❱ 50 │ │ │ return IPython.get_ipython().user_ns["dbutils"] │
│ 51 │
│ 52 │
│ 53 class SparkServiceClientDBUtils(object): │
I think having the try-except block inside `_get_dbutils` is indeed a better solution. Thanks for pointing that out!
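As a rough sketch of that idea (not the actual Kedro implementation), the guard could live in the helper itself so that callers only ever see a dbutils object or `None`. The name `get_dbutils_safely` and the injected `factory` argument are assumptions made so the example runs without a Databricks runtime; `factory` stands in for `pyspark.dbutils.DBUtils`.

```python
from typing import Any, Callable, Optional


def get_dbutils_safely(spark: Any, factory: Callable[[Any], Any]) -> Optional[Any]:
    """Return a dbutils-like object, or None if it cannot be created.

    `factory` is a hypothetical stand-in for `pyspark.dbutils.DBUtils`.
    """
    try:
        return factory(spark)
    except AttributeError:
        # Seen when IPython.get_ipython() returns None, i.e. the code is
        # not running inside an IPython shell.
        return None


def broken_factory(spark: Any) -> Any:
    """Mimic the failing line: attribute access on a missing IPython shell."""
    shell = None
    return shell.user_ns["dbutils"]
```

Callers can then keep the existing `if dbutils:` check and fall back to the local file system functions when the result is `None`.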
Is this `pyspark.dbutils` module only available on the Databricks runtime? If so, I think that's why it assumes you have `IPython`. Also, you mentioned you are running on Databricks but not in a managed way. I am not aware of an on-premise option; what kind of environment are you running on?
I was trying to run on a normal Databricks cluster, just with MLflow instead of a notebook. I managed to run pipelines via a notebook too, but it would have been better to do it through the command line. So, when I run `mlflow run`, MLflow packages my project into a zip file, sends it to a new Databricks cluster, and runs it there. Apparently, because it's not in a notebook, there's no IPython.
If you think this use case is worth supporting, I can make the change that you proposed.
@mle-els I prefer moving the try-except block to `_get_dbutils`. As for `IPython`, I am unsure: even when running a `.py` file, it will normally have `IPython`, but Databricks doesn't document this.
@mle-els I won't be able to test it myself since I don't have the environment configured. My guess is that you have a relatively old Databricks runtime.
We tested it recently with `dbx`, which packages up a project and runs it as a Databricks Job, and IPython was available in that case.
This suggests that Databricks runtime >11 always runs on IPython. It mentions notebooks only, but when we tested a couple of months ago, it was the same with a `.py` file.
https://docs.databricks.com/notebooks/ipython-kernel.html#how-to-use-the-ipython-kernel-with-databricks
I'll try running the code on a newer runtime when I find some free time.
Hi @mle-els, we'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details).
Do you think you can find time this week? Otherwise, we'll close this PR and ask you to re-open it on the new repo when it's ready.
This week I'm swamped, unfortunately :( Please feel free to close it.
@mle-els No worries! Feel free to re-open the PR in the `kedro-plugins` repository when you are free to work on it again. :)
def test_ds_init_get_dbutils_raises_exception(self, mocker):
    get_dbutils_mock = mocker.Mock()
    get_dbutils_mock.side_effect = AttributeError
    get_dbutils_mock = mocker.patch(
        "kedro.extras.datasets.spark.spark_dataset._get_dbutils", get_dbutils_mock
    )

    data_set = SparkDataSet(filepath="/dbfs/tmp/data")
    assert data_set._glob_function.__name__ == "iglob"
The test and the assertion don't seem to match here. This would also be obsoleted if the try-except is moved to `_get_dbutils`, so it would need some modification.
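One possible shape for such a modified test, sketched with the standard library's `unittest.mock` rather than the `mocker` fixture so that it runs on its own. The helpers `get_dbutils` and `choose_glob_function` are hypothetical simplifications of the dataset logic, and `factory` stands in for the Databricks `DBUtils` constructor; none of these names come from Kedro.

```python
from glob import iglob
from unittest import mock


def get_dbutils(spark, factory):
    """Hypothetical helper: return None when the factory cannot build dbutils."""
    try:
        return factory(spark)
    except AttributeError:
        return None


def choose_glob_function(filepath, spark, factory, dbfs_glob):
    """Use the DBFS-aware glob only when a dbutils object is available."""
    if filepath.startswith("/dbfs"):
        dbutils = get_dbutils(spark, factory)
        if dbutils:
            return dbfs_glob
    return iglob


def test_falls_back_to_iglob_when_dbutils_unavailable():
    # The factory fails the same way the Databricks code does without IPython.
    factory = mock.Mock(side_effect=AttributeError)
    chosen = choose_glob_function(
        "/dbfs/tmp/data", None, factory, dbfs_glob=mock.sentinel.dbfs_glob
    )
    factory.assert_called_once_with(None)
    assert chosen is iglob
```

Here the test name, the simulated failure, and the assertion all describe the same behaviour: a `/dbfs` path without a usable dbutils object falls back to the plain `iglob`.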
avoid error in a file system that happen to have "/dbfs" paths
Description
I use MLflow to submit a Kedro project to a Databricks cluster. This is because I need access to both the GPU that's available on the cluster and the data locations that are mounted to the cluster. The `kedro run` command fails because an IPython environment is not available (`AttributeError: 'NoneType' object has no attribute 'user_ns'`), which I traced to this part of the code, which assumes that you can initialize a DBUtils object.
For my project, the program is running on Databricks but not in a managed way. Also, for systems that, for whatever reason, use a `/dbfs` path, it's not reasonable to assume that one can initialize a DBUtils object. Ideally the initialization of the DBUtils object shouldn't throw an exception, but in reality it does. So, to be safe, it's better to catch the exception.