
safer use of "/dbfs" #1931

Closed
wants to merge 22 commits into from

Conversation

@mle-els commented Oct 13, 2022

Avoid errors on file systems that happen to have "/dbfs" paths

Description

I use MLflow to submit a Kedro project to a Databricks cluster, because I need access both to the GPU available on the cluster and to the data locations mounted to it. The "kedro run" command fails because an IPython environment is not available (AttributeError: 'NoneType' object has no attribute 'user_ns'), which I traced to this part of the code, which assumes that a DBUtils object can be initialized.

For my project, the program runs on Databricks but not in a managed way. Moreover, on systems that, for whatever reason, use a /dbfs path, it isn't reasonable to assume that a DBUtils object can be initialized. Ideally, initializing the DBUtils object wouldn't throw an exception, but in reality it does. So, to be safe, it's better to catch the exception.
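
For concreteness, here's a minimal sketch of the failure mode, assuming a Databricks runtime where pyspark.dbutils is importable but no IPython kernel is attached (e.g. a job submitted via mlflow run rather than a notebook):

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # Databricks-only module

spark = SparkSession.builder.getOrCreate()

# DBUtils.__init__ ends up calling IPython.get_ipython().user_ns; with no
# IPython kernel attached, get_ipython() returns None, so this raises
# AttributeError: 'NoneType' object has no attribute 'user_ns'
dbutils = DBUtils(spark)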

@mle-els requested a review from idanov as a code owner October 13, 2022 16:55
@deepyaman (Member)

The "kedro run" command fails because an IPython environment is not available (AttributeError: 'NoneType' object has no attribute 'user_ns') which I traced to this part of the code which assumes that you can initialize a DBUtils object.

I'm not sure how this is happening; can you share a full stack trace?

https://github.com/kedro-org/kedro/blob/0.18.3/kedro/extras/datasets/spark/spark_dataset.py#L93

If ipython is None above, it shouldn't try to access the user_ns attribute. I'm not really sure where _get_dbutils is raising an error from, based on a read-through.
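
For reference, the helper looks roughly like this — a paraphrased sketch of the linked 0.18.3 source, not verbatim:

def _get_dbutils(spark):
    """Return the dbutils instance, or None if one cannot be found."""
    dbutils = globals().get("dbutils")
    if dbutils:
        return dbutils

    try:
        from pyspark.dbutils import DBUtils  # Databricks-only module

        dbutils = DBUtils(spark)
    except ImportError:
        try:
            import IPython
        except ImportError:
            pass
        else:
            ipython = IPython.get_ipython()
            # this is the `if ipython is None` guard mentioned above
            dbutils = ipython.user_ns.get("dbutils") if ipython else None
    return dbutils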

Comment on lines 313 to 318
except:
pass
Member

I wouldn't generally call swallowing the exception safe. 😅 But, as posted just a second ago, I'm not sure why an error (especially that which you shared) is occurring here to begin with.

Author

Hi there, this is the full stack trace. The error occurs when Kedro tries to read a dataset whose path starts with /dbfs. I wouldn't normally swallow exceptions either, but it's warranted here because the assumption that /dbfs implies Databricks is risky in the first place.

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/core.py:155 in from_config                   │
│                                                                              │
│   152 │   │   │   ) from exc                                                 │
│   153 │   │                                                                  │
│   154 │   │   try:                                                           │
│ ❱ 155 │   │   │   data_set = class_obj(**config)  # type: ignore             │
│   156 │   │   except TypeError as err:                                       │
│   157 │   │   │   raise DataSetError(                                        │
│   158 │   │   │   │   f"\n{err}.\nDataSet '{name}' must only contain argumen │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │    class_obj = <class                                                    │ │
│ │                'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'> │ │
│ │          cls = <class 'kedro.io.core.AbstractDataSet'>                   │ │
│ │       config = {                                                         │ │
│ │                │   'filepath':                                           │ │
│ │                '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/t… │ │
│ │                │   'file_format': 'parquet',                             │ │
│ │                │   'version': Version(                                   │ │
│ │                │   │   load=None,                                        │ │
│ │                │   │   save='2022-10-13T16.18.00.115Z'                   │ │
│ │                │   )                                                     │ │
│ │                }                                                         │ │
│ │ load_version = None                                                      │ │
│ │         name = 'ft_labels'                                               │ │
│ │ save_version = '2022-10-13T16.18.00.115Z'                                │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/extras/datasets/spark/spark_dataset.py:308 in   │
│ __init__                                                                     │
│                                                                              │
│   305 │   │   │   path = PurePosixPath(filepath)                             │
│   306 │   │   │                                                              │
│   307 │   │   │   if filepath.startswith("/dbfs"):                           │
│ ❱ 308 │   │   │   │   dbutils = _get_dbutils(self._get_spark())              │
│   309 │   │   │   │   if dbutils:                                            │
│   310 │   │   │   │   │   glob_function = partial(_dbfs_glob, dbutils=dbutil │
│   311 │   │   │   │   │   exists_function = partial(_dbfs_exists, dbutils=db │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       __class__ = <class                                                 │ │
│ │                   'kedro.extras.datasets.spark.spark_dataset.SparkDataS… │ │
│ │     credentials = {}                                                     │ │
│ │ exists_function = None                                                   │ │
│ │     file_format = 'parquet'                                              │ │
│ │        filepath = '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifie… │ │
│ │       fs_prefix = ''                                                     │ │
│ │   glob_function = None                                                   │ │
│ │       load_args = None                                                   │ │
│ │            path = PurePosixPath('/dbfs/mnt/els-nlp-experts1/data/OmniSc… │ │
│ │       save_args = None                                                   │ │
│ │            self = <kedro.extras.datasets.spark.spark_dataset.SparkDataS… │ │
│ │                   object at 0x7f4a7ca602e0>                              │ │
│ │         version = Version(load=None, save='2022-10-13T16.18.00.115Z')    │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/extras/datasets/spark/spark_dataset.py:85 in    │
│ _get_dbutils                                                                 │
│                                                                              │
│    82 │   try:                                                               │
│    83 │   │   from pyspark.dbutils import DBUtils  # pylint: disable=import- │
│    84 │   │                                                                  │
│ ❱  85 │   │   dbutils = DBUtils(spark)                                       │
│    86 │   except ImportError:                                                │
│    87 │   │   try:                                                           │
│    88 │   │   │   import IPython  # pylint: disable=import-outside-toplevel  │
│                                                                              │
│ ╭─────────────────────────────── locals ────────────────────────────────╮    │
│ │ dbutils = None                                                        │    │
│ │ DBUtils = <class 'pyspark.dbutils.DBUtils'>                           │    │
│ │   spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400> │    │
│ ╰───────────────────────────────────────────────────────────────────────╯    │
│                                                                              │
│ /databricks/spark/python/pyspark/dbutils.py:33 in __init__                   │
│                                                                              │
│    30 │   def __init__(self, spark=None):                                    │
│    31 │   │   if spark is None:                                              │
│    32 │   │   │   spark = SparkSession.builder.getOrCreate()                 │
│ ❱  33 │   │   dbutils_obj = self.get_dbutils(spark)                          │
│    34 │   │   self.fs = dbutils_obj.fs                                       │
│    35 │   │   self.secrets = dbutils_obj.secrets                             │
│    36 │   │   if spark.conf.get("spark.databricks.service.client.enabled") = │
│                                                                              │
│ ╭────────────────────────────── locals ───────────────────────────────╮      │
│ │  self = <pyspark.dbutils.DBUtils object at 0x7f4a7ca60400>          │      │
│ │ spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400> │      │
│ ╰─────────────────────────────────────────────────────────────────────╯      │
│                                                                              │
│ /databricks/spark/python/pyspark/dbutils.py:50 in get_dbutils                │
│                                                                              │
│    47 │   │   │   return SparkServiceClientDBUtils(spark.sparkContext)       │
│    48 │   │   else:                                                          │
│    49 │   │   │   import IPython                                             │
│ ❱  50 │   │   │   return IPython.get_ipython().user_ns["dbutils"]            │
│    51                                                                        │
│    52                                                                        │
│    53 class SparkServiceClientDBUtils(object):                               │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ IPython = <module 'IPython' from                                         │ │
│ │           '/databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e9259… │ │
│ │    self = <pyspark.dbutils.DBUtils object at 0x7f4a7ca60400>             │ │
│ │   spark = <pyspark.sql.session.SparkSession object at 0x7f4afb216400>    │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'user_ns'

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/bin/k │
│ edro:8 in <module>                                                           │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/cli.py:211 in main                │
│                                                                              │
│   208 │   """                                                                │
│   209 │   _init_plugins()                                                    │
│   210 │   cli_collection = KedroCLI(project_path=Path.cwd())                 │
│ ❱ 211 │   cli_collection()                                                   │
│   212                                                                        │
│                                                                              │
│ ╭───────────── locals ─────────────╮                                         │
│ │ cli_collection = <KedroCLI None> │                                         │
│ ╰──────────────────────────────────╯                                         │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1130 in __call__                        │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/cli.py:139 in main                │
│                                                                              │
│   136 │   │   )                                                              │
│   137 │   │                                                                  │
│   138 │   │   try:                                                           │
│ ❱ 139 │   │   │   super().main(                                              │
│   140 │   │   │   │   args=args,                                             │
│   141 │   │   │   │   prog_name=prog_name,                                   │
│   142 │   │   │   │   complete_var=complete_var,                             │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       __class__ = <class 'kedro.framework.cli.cli.KedroCLI'>             │ │
│ │            args = [                                                      │ │
│ │                   │   'run',                                             │ │
│ │                   │   '--env',                                           │ │
│ │                   │   'dbfs',                                            │ │
│ │                   │   '--node',                                          │ │
│ │                   │   'train_contrastive'                                │ │
│ │                   ]                                                      │ │
│ │    complete_var = None                                                   │ │
│ │           extra = {                                                      │ │
│ │                   │   'obj': ProjectMetadata(                            │ │
│ │                   │   │                                                  │ │
│ │                   config_file=PosixPath('/databricks/mlflow/projects/1b… │ │
│ │                   │   │   package_name='omnieval',                       │ │
│ │                   │   │   project_name='OmniScience classification       │ │
│ │                   evaluation framework',                                 │ │
│ │                   │   │                                                  │ │
│ │                   project_path=PosixPath('/databricks/mlflow/projects/1… │ │
│ │                   │   │   project_version='0.18.2',                      │ │
│ │                   │   │                                                  │ │
│ │                   source_dir=PosixPath('/databricks/mlflow/projects/1b1… │ │
│ │                   │   )                                                  │ │
│ │                   }                                                      │ │
│ │       prog_name = None                                                   │ │
│ │            self = <KedroCLI None>                                        │ │
│ │ standalone_mode = True                                                   │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1055 in main                            │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1657 in invoke                          │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:1404 in invoke                          │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/click/core.py:760 in invoke                           │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/cli/project.py:352 in run             │
│                                                                              │
│   349 │   node_names = _get_values_as_tuple(node_names) if node_names else n │
│   350 │                                                                      │
│   351 │   with KedroSession.create(env=env, extra_params=params) as session: │
│ ❱ 352 │   │   session.run(                                                   │
│   353 │   │   │   tags=tag,                                                  │
│   354 │   │   │   runner=runner(is_async=is_async),                          │
│   355 │   │   │   node_names=node_names,                                     │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       config = None                                                      │ │
│ │          env = 'dbfs'                                                    │ │
│ │  from_inputs = []                                                        │ │
│ │   from_nodes = []                                                        │ │
│ │     is_async = False                                                     │ │
│ │ load_version = {}                                                        │ │
│ │   node_names = ('train_contrastive',)                                    │ │
│ │       params = {}                                                        │ │
│ │     pipeline = None                                                      │ │
│ │       runner = <class 'kedro.runner.sequential_runner.SequentialRunner'> │ │
│ │      session = <kedro.framework.session.session.KedroSession object at   │ │
│ │                0x7f4ad70048e0>                                           │ │
│ │          tag = ()                                                        │ │
│ │     to_nodes = []                                                        │ │
│ │   to_outputs = []                                                        │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/session/session.py:389 in run         │
│                                                                              │
│   386 │   │   │   "runner": getattr(runner, "__name__", str(runner)),        │
│   387 │   │   }                                                              │
│   388 │   │                                                                  │
│ ❱ 389 │   │   catalog = context._get_catalog(                                │
│   390 │   │   │   save_version=save_version,                                 │
│   391 │   │   │   load_versions=load_versions,                               │
│   392 │   │   )                                                              │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │           context = <kedro.framework.context.context.KedroContext object │ │
│ │                     at 0x7f4ad7118c10>                                   │ │
│ │      extra_params = {}                                                   │ │
│ │ filtered_pipeline = Pipeline([                                           │ │
│ │                     Node(train_contrastive,                              │ │
│ │                     ['embeddings_on_train_specter_sampled',              │ │
│ │                     'label_encoder',                                     │ │
│ │                     'params:specter_contrastive_hparams'],               │ │
│ │                     'specter_contrastive_model', 'train_contrastive')    │ │
│ │                     ])                                                   │ │
│ │       from_inputs = []                                                   │ │
│ │        from_nodes = []                                                   │ │
│ │     load_versions = {}                                                   │ │
│ │              name = '__default__'                                        │ │
│ │        node_names = ('train_contrastive',)                               │ │
│ │          pipeline = Pipeline([                                           │ │
│ │                     Node(count_ds, ['train_test_set'], None,             │ │
│ │                     'count_train_test'),                                 │ │
│ │                     Node(download_file_s3, ['params:ft_model_path_s3',   │ │
│ │                     'params:ft_model_path_local'], None, 'download_ft'), │ │
│ │                     Node(get_ft_labels, ['params:ft_model_path_local'],  │ │
│ │                     'ft_labels', 'get_ft_labels'),                       │ │
│ │                     Node(get_labels, 'train_test_set', 'all_labels',     │ │
│ │                     'get_labels_all'),                                   │ │
│ │                     Node(split_train_test, ['train_test_set'],           │ │
│ │                     ['ds_train', 'ds_test'], 'split_train_test'),        │ │
│ │                     Node(clean_up_dataset, ['ds_test'],                  │ │
│ │                     'ds_test_cleaned', 'clean_up_test_set'),             │ │
│ │                     Node(gen_embeddings_specter, ['ds_train',            │ │
│ │                     'specter_tokenizer', 'specter_model',                │ │
│ │                     'params:specter_hparams'],                           │ │
│ │                     'embeddings_on_train_specter',                       │ │
│ │                     'embed_train_specter'),                              │ │
│ │                     Node(fit_label_encoder, 'all_labels',                │ │
│ │                     'label_encoder', 'fit_label_encoder'),               │ │
│ │                     Node(get_labels, 'ds_train', 'train_labels',         │ │
│ │                     'get_labels'),                                       │ │
│ │                     Node(sample_training_data, ['ds_train',              │ │
│ │                     'params:training_data_sample_approx_size',           │ │
│ │                     'params:training_data_sample_min_examples_per_class… │ │
│ │                     'params:training_data_sample_upsampling'],           │ │
│ │                     'ds_train_sample_index', 'sample_train'),            │ │
│ │                     ...                                                  │ │
│ │                     ])                                                   │ │
│ │     pipeline_name = None                                                 │ │
│ │       record_data = {                                                    │ │
│ │                     │   'session_id': '2022-10-13T16.18.00.115Z',        │ │
│ │                     │   'project_path':                                  │ │
│ │                     '/databricks/mlflow/projects/1b104a49cbf61560a7e5fe… │ │
│ │                     │   'env': 'dbfs',                                   │ │
│ │                     │   'kedro_version': '0.18.2',                       │ │
│ │                     │   'tags': (),                                      │ │
│ │                     │   'from_nodes': [],                                │ │
│ │                     │   'to_nodes': [],                                  │ │
│ │                     │   'node_names': ('train_contrastive',),            │ │
│ │                     │   'from_inputs': [],                               │ │
│ │                     │   'to_outputs': [],                                │ │
│ │                     │   ... +4                                           │ │
│ │                     }                                                    │ │
│ │            runner = <kedro.runner.sequential_runner.SequentialRunner     │ │
│ │                     object at 0x7f4b00a77520>                            │ │
│ │      save_version = '2022-10-13T16.18.00.115Z'                           │ │
│ │              self = <kedro.framework.session.session.KedroSession object │ │
│ │                     at 0x7f4ad70048e0>                                   │ │
│ │        session_id = '2022-10-13T16.18.00.115Z'                           │ │
│ │              tags = ()                                                   │ │
│ │          to_nodes = []                                                   │ │
│ │        to_outputs = []                                                   │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/framework/context/context.py:286 in             │
│ _get_catalog                                                                 │
│                                                                              │
│   283 │   │   )                                                              │
│   284 │   │   conf_creds = self._get_config_credentials()                    │
│   285 │   │                                                                  │
│ ❱ 286 │   │   catalog = settings.DATA_CATALOG_CLASS.from_config(             │
│   287 │   │   │   catalog=conf_catalog,                                      │
│   288 │   │   │   credentials=conf_creds,                                    │
│   289 │   │   │   load_versions=load_versions,                               │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │  conf_catalog = {                                                        │ │
│ │                 │   'ft_labels': {                                       │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'specter_tokenizer': {                               │ │
│ │                 │   │   'type': 'pickle.PickleDataSet',                  │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │                 │   │   'backend': 'joblib'                              │ │
│ │                 │   },                                                   │ │
│ │                 │   'specter_model': {                                   │ │
│ │                 │   │   'type': 'pickle.PickleDataSet',                  │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │                 │   │   'backend': 'joblib'                              │ │
│ │                 │   },                                                   │ │
│ │                 │   'metrics_on_test_fasttext': {                        │ │
│ │                 │   │   'type':                                          │ │
│ │                 'omnieval.kedro_utils.MlflowMetricsDataSet',             │ │
│ │                 │   │   'prefix': 'test',                                │ │
│ │                 │   │   'params': {'model_type': 'fasttext'}             │ │
│ │                 │   },                                                   │ │
│ │                 │   'metrics_on_test_specter': {                         │ │
│ │                 │   │   'type':                                          │ │
│ │                 'omnieval.kedro_utils.MlflowMetricsDataSet',             │ │
│ │                 │   │   'prefix': 'test',                                │ │
│ │                 │   │   'params': {'model_type': 'specter_svm'}          │ │
│ │                 │   },                                                   │ │
│ │                 │   'train_test_set': {                                  │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet'                         │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_train': {                                        │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_train_sample_index': {                           │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_test': {                                         │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_test_cleaned': {                                 │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   ... +9                                               │ │
│ │                 }                                                        │ │
│ │    conf_creds = {}                                                       │ │
│ │ load_versions = {}                                                       │ │
│ │  save_version = '2022-10-13T16.18.00.115Z'                               │ │
│ │          self = <kedro.framework.context.context.KedroContext object at  │ │
│ │                 0x7f4ad7118c10>                                          │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/data_catalog.py:277 in from_config           │
│                                                                              │
│   274 │   │   │   │   layers[ds_layer].add(ds_name)                          │
│   275 │   │   │                                                              │
│   276 │   │   │   ds_config = _resolve_credentials(ds_config, credentials)   │
│ ❱ 277 │   │   │   data_sets[ds_name] = AbstractDataSet.from_config(          │
│   278 │   │   │   │   ds_name, ds_config, load_versions.get(ds_name), save_v │
│   279 │   │   │   )                                                          │
│   280                                                                        │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │       catalog = {                                                        │ │
│ │                 │   'ft_labels': {                                       │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'specter_tokenizer': {                               │ │
│ │                 │   │   'type': 'pickle.PickleDataSet',                  │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │                 │   │   'backend': 'joblib'                              │ │
│ │                 │   },                                                   │ │
│ │                 │   'specter_model': {                                   │ │
│ │                 │   │   'type': 'pickle.PickleDataSet',                  │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 's3://els-nlp-experts1/data/OmniScienceClassifier/spect… │ │
│ │                 │   │   'backend': 'joblib'                              │ │
│ │                 │   },                                                   │ │
│ │                 │   'metrics_on_test_fasttext': {                        │ │
│ │                 │   │   'type':                                          │ │
│ │                 'omnieval.kedro_utils.MlflowMetricsDataSet',             │ │
│ │                 │   │   'prefix': 'test',                                │ │
│ │                 │   │   'params': {'model_type': 'fasttext'}             │ │
│ │                 │   },                                                   │ │
│ │                 │   'metrics_on_test_specter': {                         │ │
│ │                 │   │   'type':                                          │ │
│ │                 'omnieval.kedro_utils.MlflowMetricsDataSet',             │ │
│ │                 │   │   'prefix': 'test',                                │ │
│ │                 │   │   'params': {'model_type': 'specter_svm'}          │ │
│ │                 │   },                                                   │ │
│ │                 │   'train_test_set': {                                  │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet'                         │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_train': {                                        │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_train_sample_index': {                           │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_test': {                                         │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   'ds_test_cleaned': {                                 │ │
│ │                 │   │   'type': 'spark.SparkDataSet',                    │ │
│ │                 │   │   'filepath':                                      │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   │   'file_format': 'parquet',                        │ │
│ │                 │   │   'versioned': True                                │ │
│ │                 │   },                                                   │ │
│ │                 │   ... +9                                               │ │
│ │                 }                                                        │ │
│ │           cls = <class 'kedro.io.data_catalog.DataCatalog'>              │ │
│ │   credentials = {}                                                       │ │
│ │     data_sets = {}                                                       │ │
│ │     ds_config = {                                                        │ │
│ │                 │   'type': 'spark.SparkDataSet',                        │ │
│ │                 │   'filepath':                                          │ │
│ │                 '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/… │ │
│ │                 │   'file_format': 'parquet',                            │ │
│ │                 │   'versioned': True                                    │ │
│ │                 }                                                        │ │
│ │      ds_layer = None                                                     │ │
│ │       ds_name = 'ft_labels'                                              │ │
│ │        layers = defaultdict(<class 'set'>, {})                           │ │
│ │ load_versions = {}                                                       │ │
│ │  missing_keys = set()                                                    │ │
│ │  save_version = '2022-10-13T16.18.00.115Z'                               │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│                                                                              │
│ /databricks/conda/envs/mlflow-72c7f5ff04b648c18c79250e9e92591063bbb83d/lib/p │
│ ython3.8/site-packages/kedro/io/core.py:162 in from_config                   │
│                                                                              │
│   159 │   │   │   │   f"constructor of '{class_obj.__module__}.{class_obj.__ │
│   160 │   │   │   ) from err                                                 │
│   161 │   │   except Exception as err:                                       │
│ ❱ 162 │   │   │   raise DataSetError(                                        │
│   163 │   │   │   │   f"\n{err}.\nFailed to instantiate DataSet '{name}' "   │
│   164 │   │   │   │   f"of type '{class_obj.__module__}.{class_obj.__qualnam │
│   165 │   │   │   ) from err                                                 │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │    class_obj = <class                                                    │ │
│ │                'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'> │ │
│ │          cls = <class 'kedro.io.core.AbstractDataSet'>                   │ │
│ │       config = {                                                         │ │
│ │                │   'filepath':                                           │ │
│ │                '/dbfs/mnt/els-nlp-experts1/data/OmniScienceClassifier/t… │ │
│ │                │   'file_format': 'parquet',                             │ │
│ │                │   'version': Version(                                   │ │
│ │                │   │   load=None,                                        │ │
│ │                │   │   save='2022-10-13T16.18.00.115Z'                   │ │
│ │                │   )                                                     │ │
│ │                }                                                         │ │
│ │ load_version = None                                                      │ │
│ │         name = 'ft_labels'                                               │ │
│ │ save_version = '2022-10-13T16.18.00.115Z'                                │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
DataSetError: 
'NoneType' object has no attribute 'user_ns'.
Failed to instantiate DataSet 'ft_labels' of type 
'kedro.extras.datasets.spark.spark_dataset.SparkDataSet'.
2022/10/13 16:18:13 ERROR mlflow.cli: === Run (ID 'dd72997a9cfc4044836dedee6aeef61d') failed ===

Member

Interesting...

I got confused because the stack trace you shared has:

48 │   │   else:                                                          │
│    49 │   │   │   import IPython                                             │
│ ❱  50 │   │   │   return IPython.get_ipython().user_ns["dbutils"]   

Based on the error message, I previously thought it was choking on this very similar Kedro code.

I'm inclined to think Databricks should add a safeguard here (or figure out why this is happening). In the interim, I suppose we could catch AttributeError with a comment? I think that resolves your issue, while not potentially opening the door to swallowing other exceptions unknowingly.

@yetudada @idanov Is there anybody at Databricks we can verify this behavior with? :)

@mle-els (Author) commented Oct 22, 2022

I made the try-catch narrower.

@merelcht mentioned this pull request Nov 7, 2022
@merelcht (Member) left a comment

Thanks for your contribution @mle-els. This looks like a reasonable solution to me 🙂 Would you mind adding a note to the release notes about this change?

I also noticed the DCO check is failing. You can click on the check and follow the instructions to make it pass. For more info about this check see: https://kedro.readthedocs.io/en/stable/contribution/developer_contributor_guidelines.html#developer-certificate-of-origin

mle-els and others added 12 commits November 7, 2022 16:56
avoid error in a file system that happen to have "/dbfs" paths

Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
* Update dependabot.yml

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* pin jupyterlab_services to requirments

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

* lint

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>

Signed-off-by: SajidAlamQB <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…o-org#1957)

Updates the requirements on [pip-tools](https://github.com/jazzband/pip-tools) to permit the latest version.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/master/CHANGELOG.md)
- [Commits](jazzband/pip-tools@6.5.0...6.9.0)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…-org#1956)

Updates the requirements on [toposort]() to permit the latest version.

---
updated-dependencies:
- dependency-name: toposort
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…edro-org#1953)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
* remove a redundant function call

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Remove redundant resolove_load_version & fix test

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Fix HoloviewWriter tests with more specific error message pattern & Lint

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Rename tests

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Jo Stichbury <jo.stichbury@quantumblack.com>

Signed-off-by: Jo Stichbury <jo.stichbury@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
…support for `tf.device` (kedro-org#1915)

* Fix issue with save operation. Add gpu option

Signed-off-by: William Caicedo <williamc@movio.co>

* Add tests

Signed-off-by: William Caicedo <williamc@movio.co>

* Update RELEASE.md

Signed-off-by: William Caicedo <williamc@movio.co>

* Update test description

Signed-off-by: William Caicedo <williamc@movio.co>

* Remove double slash and overwrite flag in fsspec.put method invocation

Signed-off-by: William Caicedo <williamc@movio.co>

* Allow to explicitly set device name

Signed-off-by: William Caicedo <williamc@movio.co>

* Update RELEASE.md

Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: William Caicedo <williamc@movio.co>

* Update docs

Signed-off-by: William Caicedo <williamc@movio.co>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
avoid error in a file system that happen to have "/dbfs" paths

Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
@mle-els (Author) commented Nov 7, 2022

Thanks for your contribution @mle-els. This looks like a reasonable solution to me 🙂 Would you mind adding a note to the release notes about this change?

I also noticed the DCO check is failing. You can click on the check and follow the instructions to make it pass. For more info about this check see: https://kedro.readthedocs.io/en/stable/contribution/developer_contributor_guidelines.html#developer-certificate-of-origin

@merelcht: Thanks for replying. After signing off, I'm getting this error: Author: Minh Le, Committer: Minh Le; Expected "Minh Le <57996662+mle-els@users.noreply.github.com>", but got "Minh Le <m.le@elsevier.com>".

I wrote the changes in the web interface; that's probably why the email doesn't match. Is there any way to fix this other than closing this PR and creating a new one?

@merelcht (Member) commented Nov 7, 2022

I had written the changes on the web interface, probably that's why the email doesn't match. Is there any way to fix this other than closing this PR and creating a new one?

Let me have a look at this!

mle-els and others added 4 commits November 7, 2022 16:12
avoid error in a file system that happen to have "/dbfs" paths

Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht (Member) commented Nov 8, 2022

@mle-els The remaining failing checks are because of coverage not reaching 100%. Can you add a test for your code?

Signed-off-by: Minh Le <m.le@elsevier.com>
mle-els and others added 3 commits November 9, 2022 09:52
Signed-off-by: Minh Le <m.le@elsevier.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht (Member) left a comment

Thanks for adding a test and updating the release notes! 👍 ⭐

@merelcht requested a review from deepyaman November 9, 2022 10:48
@noklam self-requested a review November 10, 2022 13:44
Comment on lines +312 to +318
dbutils = None
try:
dbutils = _get_dbutils(self._get_spark())
except AttributeError:
# Databricks is known to raise AttributeError when called
# on an unsupported environment
pass
Contributor

I am not sure I understand the root cause entirely. Is this a bug in Databricks' pyspark.dbutils module, or is it because we check the filepath too eagerly in Kedro?

The _get_dbutils function is supposed to try aggressively to get dbutils, and just return None if it can't. This solution adds yet another try-except layer outside, which is a bit hacky, but maybe necessary in this case? I want to make sure I understand the problem before coming to a conclusion.

Would it be better to have this try-except block inside the _get_dbutils function, if it's necessary?

Author

I believe it's a bug in Databricks' code. It assumes that IPython.get_ipython() returns an object; when it happens to return None, we get an AttributeError.

/databricks/spark/python/pyspark/dbutils.py:50 in get_dbutils                │
│                                                                              │
│    47 │   │   │   return SparkServiceClientDBUtils(spark.sparkContext)       │
│    48 │   │   else:                                                          │
│    49 │   │   │   import IPython                                             │
│ ❱  50 │   │   │   return IPython.get_ipython().user_ns["dbutils"]            │
│    51                                                                        │
│    52                                                                        │
│    53 class SparkServiceClientDBUtils(object):                               │

I think having the try-except block inside _get_dbutils is a better solution indeed. Thanks for pointing that out!
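
Something like this, perhaps — a sketch assuming the helper otherwise keeps its current shape (not tested):

def _get_dbutils(spark):
    try:
        from pyspark.dbutils import DBUtils  # Databricks-only module

        return DBUtils(spark)
    except ImportError:
        # fall back to the IPython user namespace, guarding against
        # get_ipython() returning None
        try:
            import IPython
        except ImportError:
            return None
        ipython = IPython.get_ipython()
        return ipython.user_ns.get("dbutils") if ipython else None
    except AttributeError:
        # pyspark.dbutils.DBUtils.__init__ itself calls
        # IPython.get_ipython().user_ns and raises AttributeError when no
        # IPython kernel is attached (e.g. jobs submitted via MLflow)
        return None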

Contributor

Is this pyspark.dbutils module only available on the Databricks runtime? If so, I think that's why it assumes you have IPython. Also, you mentioned you are running on Databricks but not in a managed way; I'm not aware of an on-premise option. What kind of environment are you running on?

Author

I was trying to run on a normal Databricks cluster, just with MLflow instead of a notebook. I managed to run pipelines via a notebook too, but it would have been better to do it through the command line. When I run mlflow run, MLflow packages my project into a zip file, sends it to a new Databricks cluster, and runs it there. Apparently, because it's not in a notebook, there's no IPython.

If you think this use case is worth supporting, I can make the change you proposed.

Contributor

@mle-els I prefer moving the try-except block into _get_dbutils. For IPython I'm unsure: even when running a .py file, IPython is normally available, but Databricks doesn't document this.

Contributor

@mle-els I won't be able to test it myself since I don't have the environment configured. My guess is that you have a relatively old Databricks runtime.

We tested this recently with dbx, which packages up a project and runs it as a Databricks Job, and IPython was available in that case.

This suggests Databricks runtime 11+ always runs on IPython; the page below mentions notebooks only, but as we tested a couple of months ago, the same holds for .py files.
https://docs.databricks.com/notebooks/ipython-kernel.html#how-to-use-the-ipython-kernel-with-databricks

Author

I'll try running the code on a newer runtime when I find some free time.

Contributor

Hi @mle-els, we'd like to get all PRs related to datasets merged soon, now that we're moving our datasets code to a different package (see our Medium blog post for more details).

Do you think you can find time this week? Otherwise, we'll close this PR and ask you to re-open it on the new repo when it's ready.

Author

This week I'm swamped, unfortunately :( Please feel free to close it.

Contributor

@mle-els No worries! Feel free to re-open the PR in the kedro-plugins repository when you are free to work on it again. :)

@noklam self-assigned this Nov 15, 2022
Comment on lines +617 to +626
def test_ds_init_get_dbutils_raises_exception(self, mocker):
get_dbutils_mock = mocker.Mock()
get_dbutils_mock.side_effect = AttributeError
get_dbutils_mock = mocker.patch(
"kedro.extras.datasets.spark.spark_dataset._get_dbutils", get_dbutils_mock
)

data_set = SparkDataSet(filepath="/dbfs/tmp/data")
assert data_set._glob_function.__name__ == "iglob"

Contributor

The test and the assertion don't seem to match here. This would also become obsolete if the try-except is moved into _get_dbutils, so it would need some modification.
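
If the try-except does move, the test could target the helper directly — a sketch, assuming open-source pyspark (which ships no dbutils module), so a fake module is injected for the constructor to raise from:

import sys
import types

def test_get_dbutils_returns_none_on_attribute_error(self, mocker):
    # inject a fake pyspark.dbutils whose DBUtils constructor raises the
    # way Databricks does when IPython.get_ipython() returns None
    fake = types.ModuleType("pyspark.dbutils")
    fake.DBUtils = mocker.Mock(side_effect=AttributeError)
    mocker.patch.dict(sys.modules, {"pyspark.dbutils": fake})

    assert _get_dbutils(mocker.Mock()) is None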

@noklam closed this Nov 17, 2022