Improve PyTorch support #18293

Open
chrisjrn opened this issue Feb 17, 2023 · 9 comments
Labels: backend: Python (Python backend-related issues), enhancement

Comments

@chrisjrn
Contributor

Survey submitter said:

An easy-to-maintain solution for PyTorch with Pants lockfiles, and documentation on how to solve this efficiently for multi-platform dev machines (mac/linux)

@chrisjrn chrisjrn changed the title Improved PyTorch support Improve PyTorch support Feb 17, 2023
@tdyas tdyas added the backend: Python Python backend-related issues label Feb 18, 2023
@vrazdalovschi

vrazdalovschi commented Feb 21, 2023

I recently spent some time adding PyTorch. Here is the hack I had to apply:

  1. Some context: the lock file doesn't support multiple platforms, so we cannot specify mac/linux and get all the wheels in one lock file. Generating a lock file gives us a version for either mac or linux, but never one that covers both environments. I see there is an open ticket in the pipenv repo and I hope it can be solved in a reasonable time. I also tried converting the poetry lock file with the --platform flag, but it was resolved to the pip lockfile and didn't solve the problem of specifying different platforms :(
  2. I tried the latest version (1.13.1) for mac and the 1.13.1+cpu version for linux. Each lock file generation gave me one of two variants: wheels for mac only, or wheels for linux only, so I couldn't run tests or export on both environments. So I downloaded the required wheels: torch-1.13.1%2Bcpu-cp39-cp39-linux_x86_64.whl for linux and torch-1.13.1-cp39-none-macosx_11_0_arm64.whl for mac. I took the mac wheel and did the following: unzip, rename everything from 1.13.1 to 1.13.1+cpu, and zip it back up as torch-1.13.1+cpu-cp39-none-macosx_11_0_arm64.whl. That gave me two wheels with the same version, one for linux and one for mac, which I uploaded to my private PyPI repo. On the next lock file generation, Pants included both wheels and I could export.

The result is a working setup for both Linux and Mac. (I hope it's a temporary solution.)

PS: I think the real issue is not PyTorch itself but multi-platform support: either adding it or documenting how to do it right. The same approach could then be applied to any library that requires different versions on different platforms.
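
For anyone wanting to script the repacking from step 2, here is a minimal sketch of one way it might be done in Python. It is not the exact procedure used above: it only renames the .dist-info directory, rewrites the Version field in METADATA, and regenerates RECORD, whereas the comment describes renaming every occurrence of the version. The function name `retag_wheel_version` is made up for illustration, and the sketch assumes the project name in the wheel filename matches the .dist-info directory name and that METADATA uses LF line endings.

```python
import base64
import csv
import hashlib
import io
import zipfile
from pathlib import Path


def retag_wheel_version(src: Path, old: str, new: str) -> Path:
    """Write a copy of the wheel `src` whose version is `new` instead of `old`."""
    # Wheel file names look like {name}-{version}-{python}-{abi}-{platform}.whl
    name, version, rest = src.name.split("-", 2)
    if version != old:
        raise ValueError(f"{src.name} has version {version!r}, expected {old!r}")
    dst = src.with_name(f"{name}-{new}-{rest}")
    old_info = f"{name}-{old}.dist-info/"
    new_info = f"{name}-{new}.dist-info/"

    rows = []  # (path, hash, size) entries for the regenerated RECORD
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            if item.is_dir():
                continue
            path, data = item.filename, zin.read(item.filename)
            if path.startswith(old_info):
                path = new_info + path[len(old_info):]
            if path == new_info + "RECORD":
                continue  # stale hashes; rewritten below
            if path == new_info + "METADATA":
                data = data.replace(
                    f"\nVersion: {old}\n".encode(), f"\nVersion: {new}\n".encode(), 1
                )
            digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest())
            rows.append((path, "sha256=" + digest.rstrip(b"=").decode(), len(data)))
            info = zipfile.ZipInfo(path, date_time=item.date_time)
            info.external_attr = item.external_attr  # keep file permissions
            info.compress_type = zipfile.ZIP_DEFLATED
            zout.writestr(info, data)
        # RECORD lists itself with an empty hash and size.
        rows.append((new_info + "RECORD", "", ""))
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        zout.writestr(new_info + "RECORD", buf.getvalue())
    return dst
```

Calling `retag_wheel_version(Path("torch-1.13.1-cp39-none-macosx_11_0_arm64.whl"), "1.13.1", "1.13.1+cpu")` would write the retagged wheel next to the original. Any version string embedded inside the package itself is left untouched by this sketch.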

@tgolsson
Contributor

It goes beyond platform. For example, what if some devs want the +cu113 tag for GPU, but CI doesn't have GPUs, so it'd make more sense to run +cpu there for faster builds/tests? What about TensorFlow, where there's tensorflow-cpu, tensorflow, and tensorflow-macos?

Both also have supporting libraries that are API tagged in the same way, so if you use torch+cu113 you also need torchvision+cu113. I think the same occurs for TensorFlow with the distributions package, and maybe others.

And... now for my CI I want to run tests with CPU, but I actually want to publish Docker images with GPU support. It's madness.

(Not to mention that all these local tags are screwy when it comes to PEP 440, and force users to use == exact pins.)
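
To make the pinning problem concrete, here is a small sketch of how one set of releases fans out into a separate list of exact pins per variant. The helper name `pin_with_local_tag` is made up, and the version numbers and tags are purely illustrative; which tagged builds actually exist depends on the index you use.

```python
def pin_with_local_tag(versions: dict[str, str], tag: str) -> list[str]:
    """Return exact `==` pins that all carry the same local version tag."""
    return [f"{name}=={version}+{tag}" for name, version in versions.items()]


# Illustrative versions only; check which builds your index really provides.
RELEASES = {"torch": "1.13.1", "torchvision": "0.14.1"}

ci_requirements = pin_with_local_tag(RELEASES, "cpu")     # CPU-only CI
gpu_requirements = pin_with_local_tag(RELEASES, "cu117")  # GPU images
```

Each variant ends up needing its own full set of pins, and the supporting libraries have to move in lockstep with torch.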

@JoostvDoorn

The solution we are exploring now is to use multiple Pants .toml files, chosen by the operating system in use. Each environment we support in our monorepo gets its own lockfile and its own toml file. This way we can easily set things like Python indexes separately per environment. It's just something we thought of today, and we have yet to see whether it works in practice.
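
A rough sketch of how the per-OS selection might be wired up from a small Python wrapper. The file names `pants.linux.toml` and `pants.macos.toml` and the helper `pants_env` are hypothetical. It relies on the `PANTS_CONFIG_FILES` environment variable, which as far as I know is read in addition to the default pants.toml, so treat that layering as an assumption to verify against your Pants version.

```python
import os
import platform
from typing import Optional

# Hypothetical file names: one extra config file per supported OS.
CONFIG_BY_SYSTEM = {
    "Linux": "pants.linux.toml",
    "Darwin": "pants.macos.toml",
}


def pants_env(system: Optional[str] = None) -> dict[str, str]:
    """Return a copy of the environment that selects the per-OS Pants config."""
    system = system or platform.system()
    try:
        config = CONFIG_BY_SYSTEM[system]
    except KeyError:
        raise RuntimeError(f"no Pants config defined for {system!r}") from None
    return {**os.environ, "PANTS_CONFIG_FILES": config}
```

A wrapper could then run something like `subprocess.run(["pants", "generate-lockfiles"], env=pants_env(), check=True)`, with each per-OS file pointing at its own lockfile and indexes.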

@SimonBiggs
Contributor

SimonBiggs commented Mar 27, 2023

I'm a little stumped at the moment. I've been adding torch to my requirements file and running generate-lockfiles, but then when I run:

pants export --symlink-python-virtualenv --resolve=python-default

I get the following issue:

11:34:21.39 [INFO] Completed: Build pex for resolve `python-default`
11:34:21.40 [ERROR] 1 Exception encountered:

Engine traceback:
  in `export` goal

ProcessExecutionFailure: Process 'Build pex for resolve `python-default`' failed with exit code 1.
stdout:

stderr:
Failed to resolve requirements from PEX environment @ /home/simon/.cache/pants/named_caches/pex_root/unzipped_pexes/4f8efa9195abcc55754bc6bb5fd04d4bb535374e.
Needed cp310-cp310-manylinux_2_35_x86_64 compatible dependencies for:
 1: nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cuda-nvrtc-cu11', normalized='nvidia-cuda-nvrtc-cu11') distributions.
 2: nvidia-cuda-runtime-cu11==11.7.99; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cuda-runtime-cu11', normalized='nvidia-cuda-runtime-cu11') distributions.
 3: nvidia-cuda-cupti-cu11==11.7.101; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cuda-cupti-cu11', normalized='nvidia-cuda-cupti-cu11') distributions.
 4: nvidia-cudnn-cu11==8.5.0.96; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cudnn-cu11', normalized='nvidia-cudnn-cu11') distributions.
 5: nvidia-cublas-cu11==11.10.3.66; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cublas-cu11', normalized='nvidia-cublas-cu11') distributions.
 6: nvidia-cufft-cu11==10.9.0.58; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cufft-cu11', normalized='nvidia-cufft-cu11') distributions.
 7: nvidia-curand-cu11==10.2.10.91; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-curand-cu11', normalized='nvidia-curand-cu11') distributions.
 8: nvidia-cusolver-cu11==11.4.0.1; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cusolver-cu11', normalized='nvidia-cusolver-cu11') distributions.
 9: nvidia-cusparse-cu11==11.7.4.91; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cusparse-cu11', normalized='nvidia-cusparse-cu11') distributions.
 10: nvidia-nccl-cu11==2.14.3; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-nccl-cu11', normalized='nvidia-nccl-cu11') distributions.
 11: nvidia-nvtx-cu11==11.7.91; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-nvtx-cu11', normalized='nvidia-nvtx-cu11') distributions.
 12: triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='triton', normalized='triton') distributions.



Use `--keep-sandboxes=on_failure` to preserve the process chroot for inspection.

Is this the same issue as the one being reported here?

My computer runs Linux, with a CUDA-enabled GPU.

Running pip install torch or poetry add torch both work as expected, with no issues.

I've tried Pants versions 2.15.0, 2.16.0a0 and 2.16.0a1.

@SimonBiggs
Contributor

SimonBiggs commented Mar 27, 2023

My solution for now is to not use symlink-python-virtualenv, and then run export, and then manually install torch with pip after the export...

pants export --resolve=python-default
. dist/export/python/virtualenvs/python-default/3.10.6/bin/activate
pip install torch

@ryan-minato

> My solution for now is to not use symlink-python-virtualenv, and then run export, and then manually install torch with pip after the export...
>
> pants export --resolve=python-default
> . dist/export/python/virtualenvs/python-default/3.10.6/bin/activate
> pip install torch

In my environment, I was able to resolve the problem by manually adding the dependencies that caused the error.
I added a third-party dependency called "fix_cuda" at /3rdparty/python/fix_cuda/ in the python-default resolve. This directory contains a requirements.txt file that lists the packages PyTorch needs:

nvidia-cuda-nvrtc-cu11
nvidia-cuda-runtime-cu11
nvidia-cuda-cupti-cu11
nvidia-cublas-cu11
nvidia-cufft-cu11
nvidia-cudnn-cu11
nvidia-curand-cu11
nvidia-cusolver-cu11
nvidia-cusparse-cu11
nvidia-nccl-cu11
nvidia-nvtx-cu11

triton==2.0.0

Ugly, but it works.
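
If you'd rather not copy that list by hand, something like the following could pull it out of the Pex error text. This is only a sketch: the regex is fitted to the stderr format shown earlier in this thread and may not hold for other Pex versions, and `missing_requirements` is a made-up helper name. Unlike the list above, it keeps the exact pins and environment markers from the error.

```python
import re

# Matches lines such as:
#  1: nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux" and ...
_MISSING = re.compile(r"^\s*\d+:\s*([^;\s]+)\s*;\s*(.+?)\s*$", re.MULTILINE)


def missing_requirements(stderr: str) -> list[str]:
    """Extract 'requirement; marker' lines from a Pex resolve failure."""
    return [f"{req}; {marker}" for req, marker in _MISSING.findall(stderr)]


# Two entries copied from the error output earlier in this thread.
SAMPLE = """\
Needed cp310-cp310-manylinux_2_35_x86_64 compatible dependencies for:
 1: nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
 12: triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64"
"""

print("\n".join(missing_requirements(SAMPLE)))
```

The printed lines can go straight into a requirements.txt like the one above.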

@silverwestK

@SimonBiggs @minato-ellie

It can be solved simply by adding this option to the pants.toml file.

[python-repos]
indexes = ["https://pypi.org/simple/", "https://download.pytorch.org/whl/cu117"]

Then Pants automatically resolves these PyTorch transitive dependencies. (This example is for CUDA version 11.7.)

Well, that is what worked in my case :)
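
For other CUDA versions, the `indexes` line could be generated with a small helper like the sketch below. The `cu<major><minor>` URL pattern is inferred from the cu117 example above, and not every CUDA version necessarily has an index, so check that the URL exists before relying on it. The function names are made up for illustration.

```python
import json
from typing import Optional

PYPI = "https://pypi.org/simple/"


def pytorch_index(cuda_version: Optional[str]) -> str:
    """Return the PyTorch wheel index for a CUDA version, or the CPU index for None."""
    if cuda_version is None:
        return "https://download.pytorch.org/whl/cpu"
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


def indexes_line(cuda_version: Optional[str]) -> str:
    """Build the `indexes = [...]` line for the [python-repos] section."""
    # A JSON list of plain strings is also a valid TOML array.
    return "indexes = " + json.dumps([PYPI, pytorch_index(cuda_version)])
```

`indexes_line("11.7")` reproduces the line shown above, and `indexes_line(None)` would point at the CPU-only index instead.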

@ryan-minato

> @SimonBiggs @minato-ellie
>
> It can be simply solved by adding this option in pants.toml file.
>
> [python-repos]
> indexes = ["https://pypi.org/simple/", "https://download.pytorch.org/whl/cu117"]
>
> Then, Pants automatically resolves such Pytorch transitive dependencies. (this example for cuda version 11.7.)
>
> well, this is my case :)

This issue seems to be specific to torch==2.0.1, and it appears to be a known problem.

You can refer to this Issue for more information: pytorch/pytorch#100974 (comment)

huonw added a commit that referenced this issue Jan 9, 2024
All changes:

- https://github.com/pantsbuild/pex/releases/tag/v2.1.153
- https://github.com/pantsbuild/pex/releases/tag/v2.1.154
- https://github.com/pantsbuild/pex/releases/tag/v2.1.155

Highlights:

- `--no-pre-install-wheels` (and `--max-install-jobs`) that likely helps with:
  - #15062
  - (the root cause of) #20227
  - _maybe_ arguably #18293, #18965, #19681
- improved shebang selection, helping with #19514, but probably not the full solution (#19925)
- performance improvements
huonw added a commit that referenced this issue Jan 11, 2024
https://github.com/pantsbuild/pex/releases/tag/v2.1.156

Continuing from #20347, this brings additional performance optimisations, particularly for large wheels like PyTorch, and so may help with #18293, #18965, #19681
9 participants