
ParallelRunner raises AttributeError: The following data sets cannot be used by multiprocessing... on datasets not involved in --pipeline being run #3804

Closed
yury-fedotov opened this issue Apr 11, 2024 · 7 comments


@yury-fedotov
Contributor

Description

Using ParallelRunner puts some restrictions on datasets involved in the run, as the logs mention:

In order to utilize multiprocessing you need to make sure all data sets are serialisable, i.e. data sets should not make use of lambda functions, nested functions, closures etc.
If you are using custom decorators ensure they are correctly decorated using functools.wraps().
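
For reference, the functools.wraps() pattern that message points to is standard Python; a minimal, generic sketch (not tied to any Kedro API):

    import functools

    def log_calls(func):
        # functools.wraps copies the wrapped function's metadata
        # (__name__, __doc__, __module__, ...) onto the wrapper.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print(f"calling {func.__name__}")
            return func(*args, **kwargs)
        return wrapper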

Having this constraint on datasets that are involved in the pipeline that's executed with ParallelRunner makes total sense.

However, I found out that if any dataset in the catalog doesn't adhere to this, usage of ParallelRunner becomes impossible even for pipelines that have nothing to do with those datasets.

In other words, the following raises AttributeError: The following data sets cannot be used by multiprocessing...:

kedro run --pipeline pipeline_that_doesnt_involve_problematic_datasets --runner=ParallelRunner
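
For illustration, a catalog entry backed by a hypothetical dataset like the one below (all names made up) reproduces this even when no node in the selected pipeline reads or writes it, because the instance holds a lambda and therefore cannot be pickled:

    from kedro.io import AbstractDataset

    class UnpicklableDataset(AbstractDataset):
        """Hypothetical dataset that violates the serialisability requirement."""

        def __init__(self, filepath: str):
            self._filepath = filepath
            self._formatter = lambda df: df  # lambdas cannot be pickled

        def _load(self):
            raise NotImplementedError

        def _save(self, data):
            raise NotImplementedError

        def _describe(self):
            return {"filepath": self._filepath}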

Context

This error prevents large projects from leveraging the advantages ParallelRunner can bring whenever any dataset in the catalog, even one the run never touches, doesn't adhere to the runner's requirements.

Steps to Reproduce

  1. Create a pipeline that uses datasets not adhering to ParallelRunner requirements but that runs fine with SequentialRunner. Let this pipeline have 2 outputs, e.g. pandas DataFrames.
  2. Create a second pipeline that does some profiling of those tables, like df.describe(). It should consist of a modular pipeline instantiated under 2 namespaces, one per table (see the sketch after this list).
  3. Run the first pipeline with SequentialRunner to produce those 2 outputs.
  4. Try running the second pipeline with ParallelRunner, since it should be able to process those 2 namespaces in parallel, and see the error raised.
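
A rough sketch of step 2 with illustrative names (only the two namespaces matter, not the exact structure):

    from kedro.pipeline import Pipeline, node, pipeline

    def profile(df):
        # e.g. pandas' built-in summary statistics
        return df.describe()

    def create_profiling_pipeline() -> Pipeline:
        template = pipeline([node(profile, inputs="table", outputs="profile")])
        # Instantiate the modular pipeline once per table, each under its own namespace.
        return pipeline(template, namespace="table_a") + pipeline(template, namespace="table_b")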

Expected Result

The second pipeline involves no datasets that violate ParallelRunner requirements, so it should execute without errors. ParallelRunner should not check requirements for datasets that are not involved in the pipeline being run.
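
In other words, the expectation is that the check is scoped roughly like this (an illustrative sketch of the intent, not Kedro's actual implementation):

    def datasets_to_validate(pipeline, catalog):
        # Only datasets that the selected pipeline actually reads or writes
        # should be subject to the ParallelRunner serialisability check.
        registered = set(catalog.list())   # every entry in the catalog
        used = pipeline.datasets()         # inputs/outputs of this pipeline only
        return used & registered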

Actual Result

ParallelRunner raises AttributeError: The following data sets cannot be used by multiprocessing... on datasets not involved in --pipeline being run

Your Environment

  • Kedro version used (pip show kedro or kedro -V): 0.19.3
  • Python version used (python -V): 3.10
  • Operating system and version: Windows 10, Spark 3.5
@noklam
Contributor

noklam commented Apr 11, 2024

Haven't read the full thing. Was this working prior to 0.19? In general we recommend ThreadRunner because multiprocessing doesn't work with Spark. The computation doesn't happen locally anyway, so it does not make sense to use multiprocessing.
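
i.e. run with something like:

    kedro run --runner=ThreadRunner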

Would you be able to provide a demo repository that we can run on our side? Something modified from the existing starter would be good enough.

@yury-fedotov
Contributor Author

yury-fedotov commented Apr 19, 2024

@noklam Hey! Sorry for late reply.

  1. I haven't tested in < 0.19 tbh.
  2. ThreadRunner has limitations too - e.g. matplotlib does not work with it, since that package is thread-unsafe. That's a limitation in my use case, since the whole point of moving away from SequentialRunner is to parallelize nodes that generate big partitioned datasets of plt.Figures.
  3. ParallelRunner does not work with Spark - that I get. So the fact that it's not able to run pipelines involving SparkDataset or SparkHiveDataset is clear. But the problem I described is a bit different: if your catalog has any Spark datasets, ParallelRunner cannot be used even for pipelines that have nothing to do with those catalog datasets.

On providing the repo - unfortunately I'm not sure I'll have time for that in the near future, but I will post here if I manage to.

@noklam
Contributor

noklam commented Apr 19, 2024

Got it, would this be resolved if the dataset is somehow lazily initialised?

@yury-fedotov
Contributor Author

Got it, would this be resolved if the dataset is somehow lazily initialised?

Yeah, lazy initialization would resolve this. That's my understanding, since if I comment out those datasets, it works fine.
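
To be clear, by lazy initialization I mean deferring construction of the dataset object until it is first needed, along these lines (a generic sketch, not an existing Kedro API):

    class LazyDataset:
        """Hypothetical wrapper: builds the real dataset only on first use."""

        def __init__(self, factory):
            self._factory = factory   # callable returning the real dataset
            self._dataset = None

        def resolve(self):
            if self._dataset is None:
                self._dataset = self._factory()
            return self._dataset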

Does Kedro support lazy initialization somehow?

@noklam
Contributor

noklam commented Apr 27, 2024

Kedro-datasets is lazily imported, but I think during initialisation the DataCatalog would create instances for the entire catalog.
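
For example, something along these lines instantiates every entry up front, regardless of which pipeline will later run (made-up config, assuming kedro-datasets is installed):

    from kedro.io import DataCatalog

    catalog = DataCatalog.from_config(
        {
            "used_table": {
                "type": "pandas.CSVDataset",
                "filepath": "data/used.csv",
            },
            "unused_spark_table": {
                "type": "spark.SparkDataset",
                "filepath": "data/unused.parquet",
                "file_format": "parquet",
            },
        }
    )
    # Both dataset objects now exist in memory, so a check that walks the whole
    # catalog will also see the Spark-backed entry the run never touches.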

@merelcht
Member

This seems to be related to #2829

@astrojuanlu
Member

Indeed, closing this as a duplicate of #2829; they are the same problem.
