
Dataflow workers not able to install tfx from requirements file due to no-binary option from beam stager #649

Closed
andrewsmartin opened this issue Sep 19, 2019 · 18 comments

Comments

@andrewsmartin

When no Beam packaging arguments are provided by the user, TFX generates a requirements file with the tfx package inside.

This ends up failing on Dataflow, because the Beam stager uses pip's --no-binary flag: https://github.com/apache/beam/blob/v2.15.0/sdks/python/apache_beam/runners/portability/stager.py#L483.

Indeed, in a fresh virtualenv (Python 3.6.3):

pip download tfx==0.14.0 --no-binary :all:
Collecting tfx==0.14.0
  ERROR: Could not find a version that satisfies the requirement tfx==0.14.0 (from versions: none)
ERROR: No matching distribution found for tfx==0.14.0

Whereas if I remove the --no-binary flag, it works just fine.

I'm not all that knowledgeable about Python packaging, but is this because TFX is built as a wheel? Is there some Beam option I can pass to make this work?
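For context, the command Beam's stager runs when populating its requirements cache can be sketched like this. This is a reconstruction based on the tracebacks quoted later in this thread, not Beam's actual source; the paths are placeholders:

```python
import sys

def requirements_cache_cmd(requirements_file, cache_dir):
    """Approximate shape of the pip invocation Beam's stager builds."""
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", cache_dir,
        "-r", requirements_file,
        "--exists-action", "i",
        "--no-binary", ":all:",  # rejects wheels; tfx publishes no sdist, so this fails
    ]
```

Because --no-binary :all: tells pip to ignore wheels entirely, any requirement that exists on PyPI only as a wheel (like tfx) cannot be resolved.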

@charlesccychen
Contributor

Thanks @andrewsmartin. I think we can consider this mainly a bug in Beam. We currently do not upload the source package to PyPI, and it is not trivial to set up the correct environment to build the package from source.

@angoenka: is there a particular reason we use --no-binary at this line (https://github.com/apache/beam/blob/v2.15.0/sdks/python/apache_beam/runners/portability/stager.py#L483)? Should we remove it?

As a workaround, you can try downloading the wheel file from PyPI (https://pypi.org/project/tfx/#files) and specify it as an --extra_package to Beam.
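A hedged sketch of what those pipeline arguments might look like (all paths, project names, and bucket names below are placeholders, not values from this thread):

```python
# Hypothetical Dataflow pipeline arguments passing the tfx wheel explicitly.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",                              # placeholder
    "--temp_location=gs://my-bucket/tmp",                    # placeholder
    "--extra_package=/path/to/tfx-0.14.0-py3-none-any.whl",  # wheel downloaded from PyPI
]
```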

CC: @zhitaoli

@andrewsmartin
Author

Hi @charlesccychen, thanks for the response! Agreed that this seems like more of an issue in Beam itself. I raised it here just because the default behaviour in TFX does not work.

I was able to work around this by providing a minimal setup file with install_requires=["tfx==0.14.0"], so this isn't a blocker or anything. It would just be nice to be able to use a requirements file, and let TFX take care of it.
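For anyone else trying this, the minimal setup file might look roughly like the following. The project name and version are placeholders; only install_requires matters for the workaround:

```python
# setup.py -- minimal sketch of the workaround described above.
import setuptools

setuptools.setup(
    name="my-tfx-pipeline",   # placeholder
    version="0.1.0",          # placeholder
    install_requires=["tfx==0.14.0"],
)
```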

@angoenka

A similar issue is tracked at https://jira.apache.org/jira/browse/BEAM-4032.
The reason is that binary packages are environment-dependent: the packages are downloaded on the client machine and then shipped to the worker machines, so they might not be compatible with the workers.

We can look into it, but at the moment it's not prioritized.
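The environment-dependence point can be illustrated with wheel filenames, which encode the Python version, ABI, and platform a wheel was built for (per the standard wheel naming convention; the parsing below is a simplified sketch):

```python
def wheel_tags(filename):
    """Return the (python, abi, platform) tags from a wheel filename (simplified)."""
    stem = filename[: -len(".whl")]
    return tuple(stem.split("-")[-3:])

# A pure-Python wheel runs on any interpreter and OS:
wheel_tags("tfx-0.14.0-py3-none-any.whl")  # ('py3', 'none', 'any')

# A compiled wheel is tied to one interpreter ABI and platform, so a wheel
# downloaded on a client machine may not run on a worker:
wheel_tags("tensorflow-2.3.1-cp36-cp36m-manylinux2010_x86_64.whl")
```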

@tejaslodaya

tejaslodaya commented Sep 28, 2019

Hi @andrewsmartin and @charlesccychen

I was not able to get either of your workarounds working. I am still trying to get the Chicago Beam pipeline running on Spark.

I tried using the --extra_package argument and pointed it to the wheel file like this:

additional_pipeline_args={
    'beam_pipeline_args': [
        '--runner=PortableRunner',
        '--extra_package=/Users/tejas.lodaya/Downloads/tfx-0.14.0-py3-none-any.whl',
        ....
        ....

I got this error:
Output from execution of subprocess: b'Collecting tfx==0.14.0 ERROR: Could not find a version that satisfies the requirement tfx==0.14.0 (from versions: none)\nERROR: No matching distribution found for tfx==0.14.0

I feel that the wheel file was ignored.

I then used the second suggestion, with

additional_pipeline_args={
    'beam_pipeline_args': [
        '--runner=PortableRunner',
        '--setup_file=/Users/tejas.lodaya/setup.py',
        ....
        ....

and setup.py contains:

install_requires=["tfx==0.14.0"]

It throws the error below:
'File %s not found.' % os.path.join(temp_dir, '*.tar.gz'))

Please help me with the correct way of installing the tfx package on Beam executors.

@andrewsmartin
Author

Hi @tejaslodaya, for the second case (trying with a provided setup.py file), do you have a more detailed stacktrace? Can you also share the full contents of your setup.py?

@tejaslodaya

Hi @andrewsmartin and @charlesccychen

I managed to solve this issue by doing these steps:

  1. Go to site-packages inside your virtual environment and open the apache_beam/runners/portability/stager.py file.
  2. In the _populate_requirements_cache function, remove these two lines:
    '--no-binary',
    ':all:'
  3. Reload the package inside your Jupyter notebook / main call.

In my case, I had created a conda environment and changed this file: ~/miniconda3/envs/tfx_test/lib/python3.7/site-packages/apache_beam/runners/portability/stager.py, where my environment name is tfx_test.

This solves the issue.
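The effect of steps 1 and 2 is just to drop the flag and its value from pip's argument list, which can be sketched as (an illustrative helper, not Beam code):

```python
def strip_no_binary(cmd_args):
    """Remove a '--no-binary' flag and the value that follows it."""
    out, skip = [], False
    for arg in cmd_args:
        if skip:
            skip = False       # drop the flag's value (e.g. ':all:')
        elif arg == "--no-binary":
            skip = True        # drop the flag itself
        else:
            out.append(arg)
    return out
```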

@tejaslodaya

@andrewsmartin please close this issue

@andrewsmartin
Author

Hi @tejaslodaya - glad you found a workaround, but it is just that: a workaround. That said, I'm going to keep this open.

@yantriks-edi-bice

yantriks-edi-bice commented Mar 25, 2020

@andrewsmartin I believe I ran into this issue running a TFX pipeline on Kubeflow (based on the Taxi template). I could not tell exactly what was happening from the logs I could find; for one thing, I could not find the worker-startup log mentioned here:
"A setup error was detected in beamapp-root-0325160453-2-03250905-ygwy-harness-n7qk. Please refer to the worker-startup log for detailed information."

Eventually I found the worker-startup log in Stackdriver by filtering logs (there's a googleapis worker-startup choice). The error was "Failed to install packages: failed to install workflow".

The job still failed, but I made it a lot further once I applied the change @tejaslodaya described above.

I ran into this error again when running a similar pipeline, this time from my Mac. I'm not sure whether commenting out the no-binary option is appropriate in this case, given the differences between my laptop and the Dataflow workers.

@ucdmkt
Contributor

ucdmkt commented Apr 1, 2020

I hit the same issue when trying to run DataflowRunner from locally running BeamDagRunner.

(error from BigQueryExampleGen on DataflowRunner)

  File "/usr/local/google/home/muchida/miniconda3/envs/tfx-kfp-2/lib/python3.7/site-packages/apache_beam/utils/processes.py", line 83, in check_output
    out = subprocess.check_output(*args, **kwargs)
  File "/usr/local/google/home/muchida/miniconda3/envs/tfx-kfp-2/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/usr/local/google/home/muchida/miniconda3/envs/tfx-kfp-2/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/local/google/home/muchida/miniconda3/envs/tfx-kfp-2/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/tmp/tmp6s75wqpi/requirement.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.
$ pip list | grep -P 'beam|tensorflow|tfx'
apache-beam                2.17.0             
tensorflow                 2.1.0              
tensorflow-data-validation 0.21.5             
tensorflow-estimator       2.1.0              
tensorflow-metadata        0.21.1             
tensorflow-model-analysis  0.21.6             
tensorflow-serving-api     2.1.0              
tensorflow-transform       0.21.2             
tfx                        0.21.2             
tfx-bsl                    0.21.4             

@Ark-kun
Contributor

Ark-kun commented Apr 28, 2020

I'm hitting the same issues with tfx==0.21.2 and tfx==0.21.4.

@Ark-kun
Contributor

Ark-kun commented Apr 28, 2020

Here is the log I'm getting:

  File "/tfx-src/tfx/components/example_gen/base_example_gen_executor.py", line 235, in Do
    artifact_utils.get_split_uri(output_dict['examples'], split_name)))
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 426, in __exit__
    self.run().wait_until_finish()
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 406, in run
    self._options).run(False)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 419, in run
    return self.runner.run_pipeline(self, self._options)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 488, in run_pipeline
    self.dataflow_client.create_job(self.job), self)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/utils/retry.py", line 206, in wrapper
    return fun(*args, **kwargs)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 530, in create_job
    self.create_job_description(job)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 560, in create_job_description
    resources = self._stage_resources(job.options)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 490, in _stage_resources
    staging_location=google_cloud_options.staging_location)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/portability/stager.py", line 168, in stage_job_resources
    requirements_cache_path)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/utils/retry.py", line 206, in wrapper
    return fun(*args, **kwargs)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/portability/stager.py", line 487, in _populate_requirements_cache
    processes.check_output(cmd_args, stderr=processes.STDOUT)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/utils/processes.py", line 91, in check_output
    .format(traceback.format_exc(), args[0][6], error.output))
RuntimeError: Full traceback: Traceback (most recent call last):
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/utils/processes.py", line 83, in check_output
    out = subprocess.check_output(*args, **kwargs)
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/opt/venv/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/tmp/tmpogyhgwkv/requirement.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1.
 
 Pip install failed for package: -r         
 Output from execution of subprocess: b"ERROR: Could not find a version that satisfies the requirement tfx==0.21.4 (from -r /tmp/tmpogyhgwkv/requirement.txt (line 1)) (from versions: none)\nERROR: No matching distribution found for tfx==0.21.4 (from -r /tmp/tmpogyhgwkv/requirement.txt (line 1))\nWARNING: You are using pip version 20.0.2; however, version 20.1 is available.\nYou should consider upgrading via the '/opt/venv/bin/python3 -m pip install --upgrade pip' command.\n"

@gfournier

gfournier commented Oct 25, 2020

I have the same issue here with onnxruntime==1.4.0 and tensorflow==2.3.1.
Is there a way to bypass installing dependencies from source and use wheels instead?

Runner: Dataflow

Error message:

ERROR: Could not find a version that satisfies the requirement tensorflow==2.3.1 (from -r /app/requirements-dataflow.txt (line 12)) (from versions: none)
ERROR: No matching distribution found for tensorflow==2.3.1 (from -r /app/requirements-dataflow.txt (line 12))

@pindinagesh
Contributor

@andrewsmartin

Could you please confirm whether this is still an issue; otherwise, we will move this to closed status. Thanks

@andrewsmartin
Author

andrewsmartin commented Oct 20, 2021

This is no longer an issue for us, but only because we are using a different workaround, so unfortunately I cannot confirm whether it is in fact still an issue. That said, it may be less relevant going forward, as there is now better support for custom containers on Dataflow workers. I'd personally be OK with this being closed, but other users may still be hitting it.

@pindinagesh pindinagesh self-assigned this Oct 28, 2021
@pindinagesh
Contributor

@andrewsmartin

Closing this issue; please feel free to reopen if it still exists. Thanks

@pindinagesh pindinagesh removed their assignment Nov 12, 2021
@davidcavazos

This still happens with tensorflow 2.8.0. I have tensorflow as a requirement and I still get this error. Is there any reason to keep the --no-binary option?

I've had many issues with it, including that it takes a really long time to recompile every single direct and indirect requirement. I've also had timeouts at startup on Flex Templates, because most Google client libraries depend on pyarrow, which takes a very long time to compile from source.

I think removing --no-binary and using the pre-compiled packages would be faster, less wasteful of resources, and would get rid of these kinds of errors.
