Improve PyTorch support #18293

chrisjrn · 2023-02-17T20:11:09Z

Survey submitter said:

An easy to maintain solution for the pytorch with pants lockfiles and documentation how to solve this efficiently for multi platform dev machines (mac/linux)

vrazdalovschi · 2023-02-21T23:17:45Z

I recently spent some time adding PyTorch; here is the hack I had to apply:

Some context: the lock file doesn't support multiple platforms, which is why we cannot specify mac/linux and see all the wheels in the lock file. I see the open ticket in the pipenv repo and I hope it can be solved in a reasonable time. That's why, when we generate lock files, we see a version for mac or for linux, but cannot generate a lock file for both envs. I tried converting the poetry lock file with the --platform flag, but it was resolved to the pip lockfile and it didn't solve the issue of specifying different platforms :(
I tried the latest version (1.13.1) for mac and the 1.13.1+cpu version for linux. When generating the lock file I got one of two variants: either wheels for mac, or wheels for linux. I wasn't able to run tests or do an export on both environments. So, I downloaded the required wheels: torch-1.13.1%2Bcpu-cp39-cp39-linux_x86_64.whl for linux and torch-1.13.1-cp39-none-macosx_11_0_arm64.whl for mac. I took the mac wheel and did these steps: unzip, rename everything from 1.13.1 to 1.13.1+cpu, zip back to torch-1.13.1+cpu-cp39-none-macosx_11_0_arm64.whl. Now I have 2 wheels, one for linux and one for mac, and I uploaded them to my private PyPI repo. On the next lock file generation, pants includes both wheels and I can do an export.

The result is that we have a working version for Linux/Mac. (I hope it's a temporary solution)

PS: I think the real issue is not "pytorch", but adding multi-platform support or explaining how to do it right, so that it could be applied to all libraries that require specific versions for different platforms.
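The repacking step described above (unzip, rename 1.13.1 to 1.13.1+cpu, zip back) could be scripted roughly as follows. This is only a sketch under assumptions: `relabel_wheel` and `_set_version` are hypothetical names, and the code rewrites just the `.dist-info` directory name, the `Version:` header in METADATA, and the RECORD file. Whether that covers everything the comment means by "rename everything" is an assumption; for example, it leaves any version string inside the package's own modules untouched.

```python
import base64
import csv
import hashlib
import io
import zipfile


def relabel_wheel(src_path, dst_path, project, old_version, new_version):
    """Copy a wheel to dst_path, changing its version (e.g. 1.13.1 -> 1.13.1+cpu)."""
    old_dir = f"{project}-{old_version}.dist-info/"
    new_dir = f"{project}-{new_version}.dist-info/"
    record_path = new_dir + "RECORD"
    rows = []
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(
        dst_path, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for info in src.infolist():
            name = info.filename
            if name.startswith(old_dir):
                name = new_dir + name[len(old_dir):]
            if name.endswith("/") or name == record_path:
                continue  # skip directory entries; RECORD is rebuilt below
            data = src.read(info.filename)
            if name == new_dir + "METADATA":
                data = _set_version(data, old_version, new_version)
            out = zipfile.ZipInfo(name, date_time=info.date_time)
            out.external_attr = info.external_attr  # keep file permissions
            out.compress_type = zipfile.ZIP_DEFLATED
            dst.writestr(out, data)
            digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest())
            rows.append((name, "sha256=" + digest.rstrip(b"=").decode(), len(data)))
        # RECORD lists itself with an empty hash and size.
        rows.append((record_path, "", ""))
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        dst.writestr(record_path, buf.getvalue())


def _set_version(metadata, old_version, new_version):
    """Rewrite only the 'Version:' header line of a METADATA file."""
    old_line = f"Version: {old_version}".encode()
    new_line = f"Version: {new_version}".encode()
    lines = metadata.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.rstrip(b"\r\n") == old_line:
            lines[i] = line.replace(old_line, new_line, 1)
            break
    return b"".join(lines)
```

With the file names from the comment, the call would look like `relabel_wheel("torch-1.13.1-cp39-none-macosx_11_0_arm64.whl", "torch-1.13.1+cpu-cp39-none-macosx_11_0_arm64.whl", "torch", "1.13.1", "1.13.1+cpu")`.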

tgolsson · 2023-02-22T14:39:27Z

It goes beyond platform. For example, what if some devs want the +cu113 tag for GPU but the CI doesn't have GPUs, so it'd make more sense to run +cpu for faster builds/tests? What about tensorflow, where there's tensorflow-cpu, tensorflow, and tensorflow-macos?

And both of them have supporting libraries that are also API tagged, so if you use torch+cu113 you also need torchvision+cu113. I think for Tensorflow the same occurs with the distributions package, maybe others.

And... now for my CI I want to run tests with CPU but I actually want to publish docker images with GPU support. It's madness.

(Not to mention that all these local tags are screwy when it comes to PEP440, and force users to use == exact pins.)

JoostvDoorn · 2023-02-23T12:49:06Z

The solution we are exploring now is to use multiple .toml files for pants depending on the operating system that is used. There will be a separate lockfile and separate toml file for each environment that we support for our monorepo. This way we can easily set things like python indexes and such separately per environment. It's just something we thought of today, and we just have to see if this works in practice.
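One way the per-environment selection could be wired up is a small helper that picks the extra config file for the current machine, which is then passed to Pants through the `PANTS_CONFIG_FILES` environment variable. This is a sketch, not the setup described in the comment: the file names and the `pants_config_file` helper are hypothetical, and only two platforms are listed.

```python
import platform

# Hypothetical file names; the thread does not say how the files are named.
CONFIG_BY_PLATFORM = {
    ("Linux", "x86_64"): "pants.linux_x86_64.toml",
    ("Darwin", "arm64"): "pants.macos_arm64.toml",
}


def pants_config_file(system=None, machine=None):
    """Return the extra Pants config file for a platform (default: this machine)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        return CONFIG_BY_PLATFORM[(system, machine)]
    except KeyError:
        raise LookupError(f"no Pants config defined for {system}/{machine}") from None
```

A wrapper script could then run something like `PANTS_CONFIG_FILES=pants.linux_x86_64.toml pants test ::`, with each per-platform file setting its own lockfile path and indexes.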

SimonBiggs · 2023-03-27T00:35:02Z

I'm a little stumped at the moment. I've been adding torch to my requirements file, running generate lockfiles, but then when I run:

pants export --symlink-python-virtualenv --resolve=python-default

I get the following issue:

11:34:21.39 [INFO] Completed: Build pex for resolve `python-default`
11:34:21.40 [ERROR] 1 Exception encountered:

Engine traceback:
  in `export` goal

ProcessExecutionFailure: Process 'Build pex for resolve `python-default`' failed with exit code 1.
stdout:

stderr:
Failed to resolve requirements from PEX environment @ /home/simon/.cache/pants/named_caches/pex_root/unzipped_pexes/4f8efa9195abcc55754bc6bb5fd04d4bb535374e.
Needed cp310-cp310-manylinux_2_35_x86_64 compatible dependencies for:
 1: nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cuda-nvrtc-cu11', normalized='nvidia-cuda-nvrtc-cu11') distributions.
 2: nvidia-cuda-runtime-cu11==11.7.99; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cuda-runtime-cu11', normalized='nvidia-cuda-runtime-cu11') distributions.
 3: nvidia-cuda-cupti-cu11==11.7.101; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cuda-cupti-cu11', normalized='nvidia-cuda-cupti-cu11') distributions.
 4: nvidia-cudnn-cu11==8.5.0.96; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cudnn-cu11', normalized='nvidia-cudnn-cu11') distributions.
 5: nvidia-cublas-cu11==11.10.3.66; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cublas-cu11', normalized='nvidia-cublas-cu11') distributions.
 6: nvidia-cufft-cu11==10.9.0.58; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cufft-cu11', normalized='nvidia-cufft-cu11') distributions.
 7: nvidia-curand-cu11==10.2.10.91; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-curand-cu11', normalized='nvidia-curand-cu11') distributions.
 8: nvidia-cusolver-cu11==11.4.0.1; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cusolver-cu11', normalized='nvidia-cusolver-cu11') distributions.
 9: nvidia-cusparse-cu11==11.7.4.91; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-cusparse-cu11', normalized='nvidia-cusparse-cu11') distributions.
 10: nvidia-nccl-cu11==2.14.3; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-nccl-cu11', normalized='nvidia-nccl-cu11') distributions.
 11: nvidia-nvtx-cu11==11.7.91; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='nvidia-nvtx-cu11', normalized='nvidia-nvtx-cu11') distributions.
 12: triton==2.0.0; platform_system == "Linux" and platform_machine == "x86_64"
    Required by:
      torch 2.0.0
    But this pex had no ProjectName(raw='triton', normalized='triton') distributions.



Use `--keep-sandboxes=on_failure` to preserve the process chroot for inspection.

Is this the same issue as what is being reported here?

My computer is Linux, with a CUDA enabled GPU.

Running pip install torch or poetry add torch both have no issues and work as expected.

I've tried on pants versions 2.15.0, 2.16.0a0 and 2.16.0a1

SimonBiggs · 2023-03-27T00:57:09Z

My solution for now is to not use symlink-python-virtualenv, and then run export, and then manually install torch with pip after the export...

pants export --resolve=python-default
. dist/export/python/virtualenvs/python-default/3.10.6/bin/activate
pip install torch

ryan-minato · 2023-05-25T03:11:03Z

> My solution for now is to not use symlink-python-virtualenv, and then run export, and then manually install torch with pip after the export...
> pants export --resolve=python-default
> . dist/export/python/virtualenvs/python-default/3.10.6/bin/activate
> pip install torch

In my environment, I was able to resolve the problem by manually adding the dependencies that caused the error.
I added a third-party dependency called "fix_cuda" under the path /3rdparty/python/fix_cuda/ in the python-default resolve. This directory contains a requirements.txt file that lists the packages PyTorch needs:

nvidia-cuda-nvrtc-cu11
nvidia-cuda-runtime-cu11
nvidia-cuda-cupti-cu11
nvidia-cublas-cu11
nvidia-cufft-cu11
nvidia-cudnn-cu11
nvidia-curand-cu11
nvidia-cusolver-cu11
nvidia-cusparse-cu11
nvidia-nccl-cu11
nvidia-nvtx-cu11

triton==2.0.0

Ugly, but it works.
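Rather than typing the list by hand, the missing requirements could be pulled out of the Pex error text shown earlier in the thread. The sketch below assumes the error keeps the same numbered `N: name==version; marker` layout as in that log. `missing_requirements` is a hypothetical helper, and unlike the list above it keeps the exact pins from the error.

```python
import re

# Matches numbered lines from the Pex error, for example:
#  1: nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux" and ...
_MISSING = re.compile(
    r"^\s*\d+:\s*(?P<req>[A-Za-z0-9_.\-]+==[^;\s]+)(?:;\s*(?P<marker>.+))?$"
)


def missing_requirements(stderr_text, keep_markers=True):
    """Extract pinned requirements from the numbered lines of a Pex resolve error."""
    reqs = []
    for line in stderr_text.splitlines():
        match = _MISSING.match(line)
        if not match:
            continue  # e.g. "Required by:" or "torch 2.0.0" lines
        req = match.group("req")
        marker = match.group("marker")
        if keep_markers and marker:
            req = f"{req}; {marker.strip()}"
        reqs.append(req)
    return reqs
```

Writing `"\n".join(missing_requirements(stderr_text))` to the requirements.txt would give one requirement per line.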

huonw · 2023-05-25T04:34:11Z

A user on Slack shared their script for handling PyTorch too:

silverwestK · 2023-08-01T09:22:31Z

@SimonBiggs @minato-ellie

It can be solved simply by adding this option to the pants.toml file:

[python-repos]
indexes = ["https://pypi.org/simple/", "https://download.pytorch.org/whl/cu117"]

Then Pants automatically resolves these PyTorch transitive dependencies. (This example is for CUDA version 11.7.)

well, this is my case :)
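Since the index URL ends in the build variant, the same setting could be generated for other variants, such as the CPU-only builds mentioned earlier in the thread. This is a rough sketch: `python_repos_section` is a hypothetical helper, and it assumes the variant name maps directly onto a path under https://download.pytorch.org/whl, as it does for `cu117`.

```python
PYPI_INDEX = "https://pypi.org/simple/"
PYTORCH_WHEEL_ROOT = "https://download.pytorch.org/whl"


def python_repos_section(variant):
    """Render a [python-repos] section for a PyTorch variant such as 'cu117' or 'cpu'."""
    indexes = [PYPI_INDEX, f"{PYTORCH_WHEEL_ROOT}/{variant}"]
    quoted = ", ".join(f'"{url}"' for url in indexes)
    return f"[python-repos]\nindexes = [{quoted}]\n"
```

For `cu117` this renders the same two lines as the snippet above. Each variant would still need its own lockfile, since the resolved wheels differ.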

ryan-minato · 2023-08-17T07:13:10Z

> @SimonBiggs @minato-ellie
>
> It can be simply solved by adding this option in pants.toml file.
> [python-repos]
> indexes = ["https://pypi.org/simple/", "https://download.pytorch.org/whl/cu117"]
> Then, Pants automatically resolves such Pytorch transitive dependencies. (this example for cuda version 11.7.)
>
> well, this is my case :)

This issue seems to be specific to torch==2.0.1, and it appears to be a known problem.

You can refer to this Issue for more information: pytorch/pytorch#100974 (comment)

All changes:
- https://github.com/pantsbuild/pex/releases/tag/v2.1.153
- https://github.com/pantsbuild/pex/releases/tag/v2.1.154
- https://github.com/pantsbuild/pex/releases/tag/v2.1.155

Highlights:
- `--no-pre-install-wheels` (and `--max-install-jobs`) that likely helps with:
  - #15062
  - (the root cause of) #20227
  - _maybe_ arguably #18293, #18965, #19681
- improved shebang selection, helping with #19514, but probably not the full solution (#19925)
- performance improvements

https://github.com/pantsbuild/pex/releases/tag/v2.1.156 Continuing from #20347, this brings additional performance optimisations, particularly for large wheels like PyTorch, and so may help with #18293, #18965, #19681

chrisjrn added the enhancement label Feb 17, 2023

chrisjrn changed the title ~~Improved PyTorch support~~ Improve PyTorch support Feb 17, 2023

tdyas added the backend: Python Python backend-related issues label Feb 18, 2023

chrisjrn added this to 2023 Priorities Mar 28, 2023

tgolsson mentioned this issue May 10, 2023

Improved workflows for Torch (and Tensorflow?) #18965

Open

huonw mentioned this issue Dec 30, 2023

Update to Pex 2.1.155 #20347

Merged

huonw mentioned this issue Jan 11, 2024

Upgrade to PEX 2.1.156 #20391

Merged
