TFX with Dataflow Python Version Error #1216
Comments
I don't have access to your colab. I just requested it. |
Just saw this, try it again and there should be no restrictions! |
Hello, @luischinchillagarcia. Thank you for reporting this issue. I tried the colab, but it failed to run due to other issues. By the way, I found an issue in the beam mailing thread which looks very similar to this one. Can you try again after pinning the cloudpickle version? |
I'll try it now! |
Unfortunately, the suggestions in the beam mailing thread did not resolve the issue. Perhaps a better question to ask is: what is the suggested list of packages and versions to include under the install_requires parameter in the setup.py file required for Dataflow (including the python version, since anything 3.7 fails as well)? Here is a new, cleaned up Colab which I used to test these suggestions. Something to add: the pipeline works perfectly in a local environment. |
You can find package dependencies in the Google Cloud documentation, but when I tried to run your colab, Dataflow workers used Beam SDK version 2.19 which is not listed in the doc. It looks like it uses python 3.6.10. |
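A minimal sketch of how the local interpreter could be compared against the worker's Python version before submitting a job. It assumes that a major/minor mismatch between the two is relevant to the pickling failure, which this thread does not confirm. `same_minor_version` and `WORKER_PYTHON` are hypothetical names; the value 3.6.10 is the one reported in the comment above.

```python
import sys


def same_minor_version(local, worker):
    """Return True when two 'X.Y.Z' version strings share major and minor."""
    return local.split(".")[:2] == worker.split(".")[:2]


# Assumption: worker version as reported in this thread, not queried from Dataflow.
WORKER_PYTHON = "3.6.10"
local_python = "{}.{}.{}".format(*sys.version_info[:3])

if not same_minor_version(local_python, WORKER_PYTHON):
    print("Local Python {} differs from worker Python {}".format(
        local_python, WORKER_PYTHON))
```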
I tried running it now with all the compatible versions as shown in the above documentation, and BigQueryExampleGen runs, but the rest doesn't. The logs still show an issue with pickling. |
Can you also paste the result of |
For tfx==0.15.0 and apache-beam==2.17.0, here's what I have
|
Can you please post complete output of pip freeze without filtering any dependencies? |
Local env:
Colab env:
|
As we can see, both 'dill' and 'cloudpickle==1.2.2' are installed, which may be causing your issue as per the discussion on Beam mailing list. |
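The check described above can be sketched as a small parser over `pip freeze` output that reports which of the two pickling libraries are present. `installed_picklers` is a hypothetical helper, and the sample text is illustrative only: `cloudpickle==1.2.2` comes from this thread, while the dill version shown is a placeholder.

```python
def installed_picklers(freeze_text):
    """Map 'dill' / 'cloudpickle' to their pinned versions in pip-freeze text."""
    found = {}
    for line in freeze_text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep and name.lower() in ("dill", "cloudpickle"):
            found[name.lower()] = version
    return found


# Illustrative sample; the dill version is a placeholder, not taken from the thread.
sample = """apache-beam==2.17.0
cloudpickle==1.2.2
dill==0.3.1.1
tfx==0.15.0"""

picklers = installed_picklers(sample)
if len(picklers) == 2:
    print("Both dill and cloudpickle are installed:", picklers)
```

To run it against a real environment, pass in the text captured from `pip freeze` instead of `sample`.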
I'll create a clean virtualenv and only have dill to see if that works! Also, when installing the necessary dependencies of tfx into a clean virtualenv, I get this error. It's strange that it asks for pyarrow to satisfy two incompatible ranges. Perhaps this might lead to something as well:
|
I would try to use a more recent version of beam and/or tfx to find a combination that does not have a conflict with pyarrow. In general, it may be difficult to achieve complete compatibility between any two versions of two packages that both have a complex dependency chain. |
The issue is, I have tried with the package combinations listed for tfx 0.14.0, 0.15.0, 0.21.0rc and they all have the exact same error. They run locally, but not in Dataflow specifically, and the error always leads to a pickling issue. |
You are dealing with two separate issues here:
Can you also please paste the output of |
A quick update: The SDK dependencies for apache-beam=2.17.0 (the version compatible with tfx==0.21.0rc) contain a list of packages already installed (assuming Python 3.7.4) on Dataflow workers. However, the only way I got the Dataflow job to work was to experiment with different versions of python (and to change pyarrow slightly, to 0.15.1, to be compatible). Python 3.7.5 seemed to be the only successful one.
Doing the same thing with tfx 0.15.0, I wasn't able to run successfully with any recommended version of python or of these packages.
Possible reasons: Cloudpickle was one source of error, since I had done this testing on different versions of python before with no luck. Now that it is absent in both setups, and with the complete set of compatible libraries, only tfx==0.21.0rc works. I don't know why nothing has worked for tfx==0.15.0, since the errors still point to dill being the issue. |
Here's the pipdeptree for the environment with tfx==0.15.0: |
We expect that the pickling issue will be fixed in cloudpickle==1.3.0. Per [1], this release may be available next week. Once available, you can install the fixed version of cloudpickle until the AI notebook image picks up the new version. In the meantime, you can uninstall cloudpickle as a workaround. |
Thanks for reporting this issue, @luischinchillagarcia |
Thanks for all the help! @tvalentyn |
One last question related to this issue that others may also share: When running a Dataflow job, does Dataflow have containers for different versions of Beam? If so, how do we choose one that is not the latest version? Thanks! |
You can ask this question on the Beam user mailing list (user@beam.apache.org), Stack Overflow, or via Dataflow customer support channels. The TFX bug tracker is not the best forum for this discussion. |
Thanks for the help and resources! |
I have run into this issue with TFX == 1.2.0 and TF == 2.5.1. I also tried removing cloudpickle. Any other suggestions as to why I would be seeing the same error? |
I am not seeing cloudpickle in the list of deps, but if you see it, it should be cloudpickle==1.3.0 or higher. If you have a minimally reproducible example, that could help debug it. |
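A rough sketch of the version check suggested above, assuming plain numeric `X.Y.Z` version strings. `version_tuple` and `cloudpickle_ok` are hypothetical helpers; pre-release suffixes are not handled, so `1.3.0rc1` would be treated like `1.3.0`.

```python
import re


def version_tuple(version):
    """First three numeric components of a version string, e.g. '1.2.2' -> (1, 2, 2)."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def cloudpickle_ok(version, minimum="1.3.0"):
    """Return True when `version` is at least `minimum` (numeric comparison)."""
    return version_tuple(version) >= version_tuple(minimum)


for candidate in ("1.2.2", "1.3.0"):
    print(candidate, "ok" if cloudpickle_ok(candidate) else "too old")
```

On Python 3.8 or later, the installed version can be read with `importlib.metadata.version("cloudpickle")` and passed to `cloudpickle_ok`.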
I had uninstalled cloudpickle as a test and I wind up with the same error |
When running a pipeline with Dataflow, the BigQueryExampleGen job always works. However, StatisticsGen, SchemaGen, and ExampleValidator always fail, with an error saying that the python version in 'setup.py' is the issue. Here is a Colab file reproducing the issue.
This Beam issue briefly addresses the problem. However, upon testing the recommended Python versions >=3.7.4, I get an error that Dataflow requires Python 3.6.9. Is there any way to circumvent this?
The exact error on the Dataflow logs appears as follows: