[Investigate] ParallelRunner does not work with S3-linked data catalog #2162
Comments
Hey, thanks so much for this detailed report. This is an important problem to investigate; I'll push for it to be added to our next sprint.
We'll try to reproduce this error; it should appear with any dataset and an S3 bucket when using the parallel runner.
Closing this issue as it hasn't had any recent activity.
Hi! I had the same problem. I've been using `ParallelRunner` happily with two datasets. Then I added a dataset stored in S3 and exactly the same issue described by the OP happened to me. The pickle error is:
Thanks @BielStela, reopening. Can you confirm you were using the latest Kedro version? Also let us know your Python version and operating system.
Sure, I'm using:
For more context, a fix that worked is what the OP did to the dataset. I'm using a custom dataset and adding just this to its `__init__`:

```python
if self._protocol == "s3":
    glob_func = s3fs.core.S3FileSystem._glob
else:
    glob_func = self._fs.glob

super().__init__(
    glob_function=glob_func,
    ...
)
```
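For context on where such a snippet sits, below is a minimal sketch of a custom versioned dataset written along the lines of Kedro 0.18.x's bundled datasets. The class name, the pandas-based `_load`/`_save` bodies, and the credentials handling are illustrative assumptions, not the commenter's actual code:

```python
# Sketch of a custom dataset applying the workaround above; class name and
# _load/_save bodies are illustrative, imports follow the Kedro 0.18.x layout.
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec
import pandas as pd
import s3fs

from kedro.io.core import (
    AbstractVersionedDataSet,
    Version,
    get_filepath_str,
    get_protocol_and_path,
)


class WorkaroundCSVDataSet(AbstractVersionedDataSet):
    def __init__(
        self,
        filepath: str,
        version: Version = None,
        credentials: Dict[str, Any] = None,
    ) -> None:
        protocol, path = get_protocol_and_path(filepath, version)
        self._protocol = protocol
        self._fs = fsspec.filesystem(protocol, **(credentials or {}))

        # The workaround from this comment: for S3, pass the class-level
        # function rather than a bound method of the filesystem instance.
        if self._protocol == "s3":
            glob_func = s3fs.core.S3FileSystem._glob
        else:
            glob_func = self._fs.glob

        super().__init__(
            filepath=PurePosixPath(path),
            version=version,
            exists_function=self._fs.exists,
            glob_function=glob_func,
        )

    def _load(self) -> pd.DataFrame:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
        with self._fs.open(load_path, mode="r") as f:
            return pd.read_csv(f)

    def _save(self, data: pd.DataFrame) -> None:
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        with self._fs.open(save_path, mode="w") as f:
            data.to_csv(f, index=False)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "protocol": self._protocol}
```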
Hey, here is a workaround:

```python
class NoFsspecProblemCSVDataset(CSVDataset):
    def __init__(self, ...) -> None:
        super().__init__(...)

    @property
    def _glob_function(self):
        return self._fs.glob

    @_glob_function.setter
    def _glob_function(self, value):
        pass
```

This solution prevents the problem where `glob_function` is non-serializable.
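As a quick sanity check, a dataset instance can be round-tripped through `ForkingPickler`, which mirrors the serialization step that trips inside `ParallelRunner`. The class and S3 path below are illustrative, and this assumes the subclass forwards `filepath` to `CSVDataset`:

```python
# Illustrative check: mirrors the pickling step ParallelRunner performs
# before dispatching work to child processes.
from multiprocessing.reduction import ForkingPickler

dataset = NoFsspecProblemCSVDataset(filepath="s3://my-bucket/data.csv")
ForkingPickler.dumps(dataset)  # raises if glob_function (or anything else) cannot be pickled
```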
Description
The `ParallelRunner` fails if data catalog entries point to Amazon S3.

Context
We use the `ParallelRunner` to run a large and highly parallelized pipeline. When our data catalog is connected to the local disk filesystem, everything works. When we attempt to switch the file locations to a functionally identical S3 bucket (using the out-of-the-box location specifications as documented here), we see errors. Further details are below, but I believe this is caused by some tricky imports and a pickling failure.

Steps to Reproduce
The code is a bit too involved to wireframe directly here, but in general I believe any session that couples `ParallelRunner` with S3 catalog objects will throw errors; a minimal sketch of such a session is given below.
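The sketch uses illustrative bucket, dataset, and node names, and import paths that follow the Kedro 0.18.x layout; it is a reconstruction, not the actual pipeline:

```python
# Minimal sketch of the kind of session described above.
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node
from kedro.runner import ParallelRunner


def identity(df):
    return df


if __name__ == "__main__":  # required on Windows, where worker processes are spawned
    catalog = DataCatalog(
        {
            "cars": CSVDataSet(filepath="s3://my-bucket/cars.csv"),
            "cars_copy": CSVDataSet(filepath="s3://my-bucket/cars_copy.csv"),
        }
    )
    repro = Pipeline([node(identity, inputs="cars", outputs="cars_copy", name="copy_cars")])

    # Expected: the run completes. Reported behaviour: a pickling error while
    # ParallelRunner checks that the catalog entries can be serialized.
    # (Depending on the exact Kedro version, run() may also need a hook_manager.)
    ParallelRunner().run(repro, catalog)
```

With the same two entries pointed at local paths instead of `s3://` URIs, the identical pipeline is reported to run fine under `ParallelRunner`.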
Expected Result
The pipeline should run to completion.
Actual Result
We see errors related to the serializability of the catalog objects. Namely:
This error is accompanied by the following message:
Further up the traceback, we see that the error was tripped here, in `runner.parallel_runner.py`:

Your Environment
Kedro version 0.18.2
Python version 3.9.1
Running on Windows 10 Pro 21H2 (also replicated on a Linux instance, although I don't have the distro / version details at the moment).
Temporary fix
I have found a way to fix this problem, i.e. allow `ParallelRunner` to work with S3 datasets, by modifying the Kedro source code locally. I am not sure that this fix is the correct approach, but I am sharing it in case it is a helpful head start.

I found that the S3FS-enabled catalog objects could not be serialized by `ForkingPickler`. The specific problem seems to be in the creation of `glob_func`, which uses `s3fs.core.S3FileSystem._glob` in the case of S3 files, but (I think because of the sequence of imports, somehow) the inherited function's signature does not match what the pickler expects from `s3fs.core.S3FileSystem._glob`. In general, my solution involves re-instantiating that `glob_func` in various places so that the signatures match and serialization is possible. (I think. I don't really fully understand what's going on here, and my knowledge / vocabulary of the underlying dynamics is not very good, but the following is what worked for me.)

Changes to individual datasets
First, I modified the individual datasets as follows. I did this for each dataset type that I used (e.g. `CSVDataSet`, `ParquetDataSet`, etc.). In `__init__()`, I added the change sketched below.
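A sketch of that addition, mirroring the snippet quoted earlier in this thread; the other keyword arguments passed through to `super().__init__()` are elided:

```python
# Per-dataset change (sketch): select a class-level function for S3 rather
# than a bound method of the fsspec filesystem instance.
if self._protocol == "s3":
    glob_func = s3fs.core.S3FileSystem._glob
else:
    glob_func = self._fs.glob

super().__init__(
    glob_function=glob_func,
    ...
)
```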
Theoretically, I could have just defined my own runners without submitting an issue if the above were sufficient. But I found I also needed to make a small modification to `core` to get things to run:

Changes to `io.core.py`
In `__init__()`, in `_fetch_latest_load_version(self)`, I changed the following line:

to:
Again, I have no conviction that the above changes were the right way to do this, but it did get multiprocessing working with S3.