Please document that PJRT_DEVICE=CPU is required #326

Open
artemisart opened this issue Oct 29, 2024 · 3 comments
artemisart commented Oct 29, 2024

Description of the bug:

All examples crash with a log like the one below (this trace came from the README code that converts resnet18):

RuntimeError: torch_xla/csrc/runtime/pjrt_registry.cc:214 : Check failed: client 
*** Begin stack trace ***
        tsl::CurrentStackTrace()
        torch_xla::runtime::InitializePjRt(std::string const&)
        torch_xla::runtime::PjRtComputationClient::PjRtComputationClient()

        torch_xla::runtime::GetComputationClient()
        torch_xla::bridge::GetDefaultDevice()
        torch_xla::bridge::GetCurrentDevice()
        torch_xla::bridge::GetCurrentAtenDevice()

        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyObject_Vectorcall
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        PyObject_Vectorcall
        _PyEval_EvalFrameDefault

        PyObject_Call
        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        _PyRun_SimpleFileObject
        _PyRun_AnyFileObject
        Py_RunMain
        Py_BytesMain
        __libc_start_main

*** End stack trace ***
Unknown PJRT_DEVICE 'CUDA'

After digging through other issues I discovered that os.environ["PJRT_DEVICE"] = "CPU" absolutely has to be set before running the conversion; I could not find any mention of this in the documentation.
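For reference, a minimal sketch of the workaround, assuming the README-style resnet18 conversion (the exact ai_edge_torch API details may differ between releases). The important part seems to be setting the variable before torch_xla initializes its PJRT client, i.e. before the conversion runs:

import os

# Workaround: force the PJRT backend to CPU before torch_xla initializes its client
# (ai_edge_torch pulls in torch_xla under the hood).
os.environ["PJRT_DEVICE"] = "CPU"

import torch
import torchvision
import ai_edge_torch

# README-style resnet18 conversion.
resnet18 = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
sample_inputs = (torch.randn(1, 3, 224, 224),)

# Convert the model in evaluation mode and serialize it to a .tflite file.
edge_model = ai_edge_torch.convert(resnet18.eval(), sample_inputs)
edge_model.export("resnet18.tflite")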

Actual vs expected behavior:

Actual: any model conversion, including the official examples, crashes with no explanation.
Expected: the conversions should work.

Any other information you'd like to share?

I tried both "stable" and nightly and I think I hit the same issue (but I'm not sure, as I also had many ipykernel crashes and segfaults that I still don't really understand).

uv pip tree --package ai-edge-torch
ai-edge-torch v0.2.0
├── numpy v1.26.4
├── scipy v1.13.1
│   └── numpy v1.26.4
├── safetensors v0.4.3
├── tabulate v0.9.0
├── torch v2.5.0+cu121
│   ├── filelock v3.15.1
│   ├── typing-extensions v4.12.2
│   ├── networkx v3.2.1
│   ├── jinja2 v3.1.4
│   │   └── markupsafe v2.1.5
│   ├── fsspec v2024.9.0
│   ├── nvidia-cuda-nvrtc-cu12 v12.1.105
│   ├── nvidia-cuda-runtime-cu12 v12.1.105
│   ├── nvidia-cuda-cupti-cu12 v12.1.105
│   ├── nvidia-cudnn-cu12 v9.1.0.70
│   │   └── nvidia-cublas-cu12 v12.1.3.1
│   ├── nvidia-cublas-cu12 v12.1.3.1
│   ├── nvidia-cufft-cu12 v11.0.2.54
│   ├── nvidia-curand-cu12 v10.3.2.106
│   ├── nvidia-cusolver-cu12 v11.4.5.107
│   │   ├── nvidia-cublas-cu12 v12.1.3.1
│   │   ├── nvidia-nvjitlink-cu12 v12.5.82
│   │   └── nvidia-cusparse-cu12 v12.1.0.106
│   │       └── nvidia-nvjitlink-cu12 v12.5.82
│   ├── nvidia-cusparse-cu12 v12.1.0.106 (*)
│   ├── nvidia-nccl-cu12 v2.21.5
│   ├── nvidia-nvtx-cu12 v12.1.105
│   ├── triton v3.1.0
│   │   └── filelock v3.15.1
│   └── sympy v1.13.1
│       └── mpmath v1.3.0
├── torch-xla v2.5.0
│   ├── absl-py v2.1.0
│   ├── numpy v1.26.4
│   ├── pyyaml v6.0.1
│   └── requests v2.32.3
│       ├── charset-normalizer v3.3.2
│       ├── idna v3.7
│       ├── urllib3 v2.2.2
│       └── certifi v2024.6.2
├── tf-nightly v2.19.0.dev20241026
│   ├── absl-py v2.1.0
│   ├── astunparse v1.6.3
│   │   ├── wheel v0.44.0
│   │   └── six v1.16.0
│   ├── flatbuffers v24.3.25
│   ├── gast v0.6.0
│   ├── google-pasta v0.2.0
│   │   └── six v1.16.0
│   ├── libclang v18.1.1
│   ├── opt-einsum v3.4.0
│   ├── packaging v24.1
│   ├── protobuf v4.25.3
│   ├── requests v2.32.3 (*)
│   ├── setuptools v66.1.1
│   ├── six v1.16.0
│   ├── termcolor v2.4.0
│   ├── typing-extensions v4.12.2
│   ├── wrapt v1.16.0
│   ├── grpcio v1.64.1
│   ├── tb-nightly v2.19.0a20241029
│   │   ├── absl-py v2.1.0
│   │   ├── grpcio v1.64.1
│   │   ├── markdown v3.7
│   │   ├── numpy v1.26.4
│   │   ├── packaging v24.1
│   │   ├── protobuf v4.25.3
│   │   ├── setuptools v66.1.1
│   │   ├── six v1.16.0
│   │   ├── tensorboard-data-server v0.7.2
│   │   └── werkzeug v3.0.3
│   │       └── markupsafe v2.1.5
│   ├── keras-nightly v3.7.0.dev2024102903
│   │   ├── absl-py v2.1.0
│   │   ├── numpy v1.26.4
│   │   ├── rich v13.7.1
│   │   │   ├── markdown-it-py v3.0.0
│   │   │   │   └── mdurl v0.1.2
│   │   │   └── pygments v2.18.0
│   │   ├── namex v0.0.8
│   │   ├── h5py v3.12.1
│   │   │   └── numpy v1.26.4
│   │   ├── optree v0.13.0
│   │   │   └── typing-extensions v4.12.2
│   │   ├── ml-dtypes v0.5.0
│   │   │   ├── numpy v1.26.4
│   │   │   ├── numpy v1.26.4
│   │   │   └── numpy v1.26.4
│   │   └── packaging v24.1
│   ├── numpy v1.26.4
│   ├── h5py v3.12.1 (*)
│   ├── ml-dtypes v0.5.0 (*)
│   └── tensorflow-io-gcs-filesystem v0.37.1
└── ai-edge-quantizer-nightly v0.0.1.dev20240718
    ├── immutabledict v4.2.0
    ├── numpy v1.26.4
    └── tf-nightly v2.19.0.dev20241026 (*)
(*) Package tree already displayed
@artemisart added the type:bug label Oct 29, 2024
@artemisart changed the title from "Please document that JPRT_DEVICE=CPU is required" to "Please document that PJRT_DEVICE=CPU is required" Oct 29, 2024
pkgoogle (Contributor) commented:
Hi @artemisart, I am able to do the conversion example w/o explicitly setting this variable on the latest code. I do see this warning:

WARNING:root:Defaulting to PJRT_DEVICE=CPU

What version of AI-Edge-Torch and torch-xla are you using? Also please describe your CPU/GPU/TPU environment. Thanks.

@pkgoogle self-assigned this Oct 29, 2024
@pkgoogle added the status:awaiting user response, status:more data needed, and type:support labels and removed the type:bug label Oct 29, 2024
artemisart commented Oct 30, 2024

The package versions are in the <details> tag of my first message. I'm on a GCP n1-standard-8 VM with a T4 GPU (NVIDIA driver 535.86.10, CUDA 12.2).
Other people seem to hit the same bug, judging by the code I see in other issues: https://github.com/search?q=repo%3Agoogle-ai-edge%2Fai-edge-torch+PJRT_DEVICE&type=issues

pkgoogle (Contributor) commented:
Hi @artemisart, it seems like you are using the latest stable version (0.2.0)... can you try the nightly versions? If it works there, the issue is already fixed and the fix will ship with the next release.

@pkgoogle removed the status:more data needed label Oct 30, 2024
copybara-service bot pushed a commit that referenced this issue Nov 1, 2024
Mitigation for #326

PiperOrigin-RevId: 692227947
copybara-service bot pushed a commit that referenced this issue Nov 1, 2024
Mitigation for #326

PiperOrigin-RevId: 692234642