ParallelRunner raises AttributeError: The following data sets cannot be used by multiprocessing... on datasets not involved in --pipeline being run #3804
Comments
Haven't read the full thing. Was this working prior to 0.19? In general we recommend [...] Would you be able to provide a demo repository that we can run on our side? Something modified from the existing starter would be good enough.
@noklam Hey! Sorry for the late reply.
On providing the repo: unfortunately I'm not sure I'll have time for that in the near future, but I will post here if I manage to.
Got it. Would this be resolved if the datasets were somehow lazily initialised?
Yeah, lazy initialization would resolve this. That's my understanding, since it works fine if I comment out those datasets. Does Kedro support lazy initialization somehow?
Kedro-datasets is lazily imported, but I think that during initialisation the Data Catalog creates an instance of every dataset in the catalog.
This seems to be related to #2829
Indeed, closing this as a duplicate of #2829; they are the same problem.
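To make the lazy initialisation idea from the comments concrete, here is a minimal sketch in plain Python. It is not Kedro's DataCatalog API: `LazyCatalog`, `get`, `materialised` and `make_factory` are hypothetical names used only for illustration. The assumption is that each catalog entry can be represented as a zero-argument factory, so a dataset is only built when something asks for it.

```python
class LazyCatalog:
    """Hypothetical catalog: datasets are built on first access, not up front."""

    def __init__(self, factories):
        # Map of dataset name -> zero-argument callable that builds the dataset.
        self._factories = dict(factories)
        self._instances = {}

    def get(self, name):
        # Instantiate only when a pipeline node actually asks for the dataset.
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]

    def materialised(self):
        # Names of the datasets that have been instantiated so far.
        return sorted(self._instances)


created = []  # records which factories actually ran


def make_factory(name):
    def factory():
        created.append(name)
        return {"name": name}  # placeholder for a real dataset object
    return factory


catalog = LazyCatalog(
    {name: make_factory(name) for name in ("table_a", "table_b", "db_session")}
)
catalog.get("table_a")
catalog.get("table_a")  # second access reuses the cached instance
print(catalog.materialised())  # ['table_a']
print(created)  # ['table_a']
```

In this sketch the `db_session` entry is never constructed, so a runner that only inspects materialised datasets would not trip over it.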
Description
Using ParallelRunner puts some restrictions on datasets involved in the run, as the logs mention:

[...]

Having this constraint on datasets that are involved in the pipeline that's executed with ParallelRunner makes total sense.

However, I found out that if any dataset in the catalog doesn't adhere to this, usage of ParallelRunner becomes impossible even for pipelines that have nothing to do with those datasets.

In other words, the following raises AttributeError: The following data sets cannot be used by multiprocessing...:

[...]

Context
This error prevents leveraging the amazing advantages that ParallelRunner can bring to large projects in cases where any dataset doesn't adhere to the runner's requirements.

Steps to Reproduce
1. [...] ParallelRunner requirements, but runs fine with SequentialRunner. Let this pipeline have 2 outputs, e.g. pandas dataframes.
2. [...] profiling of those tables, like df.describe(). There should be a modular pipeline and 2 namespaced pipelines created for the 2 tables respectively.
3. [...] SequentialRunner and produce those 2 outputs.
4. [...] ParallelRunner, since it should be able to process those 2 namespaces in parallel, and see the error raised.

Expected Result
The second pipeline involves no datasets that violate the ParallelRunner requirements, so it should be executed without errors. The runner should not check requirements for datasets that the pipeline does not use.

Actual Result
ParallelRunner raises AttributeError: The following data sets cannot be used by multiprocessing... on datasets not involved in the --pipeline being run.

Your Environment
Kedro version used (pip show kedro or kedro -V): 0.19.3
Python version used (python -V): 3.10
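The difference between the actual and the expected behaviour described in this issue can be sketched in plain Python. This is an illustrative model, not Kedro's implementation: `unserialisable`, `validate` and the dataset names are hypothetical, and it assumes the runner's requirement boils down to each dataset object surviving `pickle.dumps`.

```python
import pickle
from types import SimpleNamespace

# Stand-in dataset objects (hypothetical). The lambda makes "db_session"
# impossible to pickle, which is a common reason for the multiprocessing error.
catalog = {
    "table_a": SimpleNamespace(filepath="data/a.csv"),
    "table_b": SimpleNamespace(filepath="data/b.csv"),
    "db_session": SimpleNamespace(connect=lambda: None),
}


def unserialisable(datasets):
    """Return the sorted names of datasets that cannot be pickled."""
    bad = []
    for name, dataset in datasets.items():
        try:
            pickle.dumps(dataset)
        except (AttributeError, TypeError, pickle.PicklingError):
            bad.append(name)
    return sorted(bad)


def validate(catalog, pipeline_datasets=None):
    """Check the whole catalog, or only `pipeline_datasets` when given."""
    if pipeline_datasets is not None:
        catalog = {n: d for n, d in catalog.items() if n in pipeline_datasets}
    bad = unserialisable(catalog)
    if bad:
        raise AttributeError(
            f"The following datasets cannot be used with multiprocessing: {bad}"
        )


# Expected behaviour: only the datasets the pipeline touches are checked.
validate(catalog, pipeline_datasets={"table_a", "table_b"})  # passes

# Reported behaviour: the whole catalog is checked, so an unrelated entry fails.
try:
    validate(catalog)
except AttributeError as err:
    print(err)
```

With the scoped check, the `db_session` entry would only block pipelines that actually use it.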