Please document that PJRT_DEVICE=CPU is required #326

Open
artemisart opened this issue Oct 29, 2024 · 3 comments
artemisart commented Oct 29, 2024

Description of the bug:

All examples crash with a log like the one below (this trace came from the README code that converts resnet18):

RuntimeError: torch_xla/csrc/runtime/pjrt_registry.cc:214 : Check failed: client 
*** Begin stack trace ***
        tsl::CurrentStackTrace()
        torch_xla::runtime::InitializePjRt(std::string const&)
        torch_xla::runtime::PjRtComputationClient::PjRtComputationClient()

        torch_xla::runtime::GetComputationClient()
        torch_xla::bridge::GetDefaultDevice()
        torch_xla::bridge::GetCurrentDevice()
        torch_xla::bridge::GetCurrentAtenDevice()

        _PyObject_MakeTpCall
        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        PyObject_Vectorcall
        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        _PyEval_EvalFrameDefault

        PyObject_Vectorcall
        _PyEval_EvalFrameDefault

        PyObject_Call
        _PyEval_EvalFrameDefault

        PyEval_EvalCode

        _PyRun_SimpleFileObject
        _PyRun_AnyFileObject
        Py_RunMain
        Py_BytesMain
        __libc_start_main

*** End stack trace ***
Unknown PJRT_DEVICE 'CUDA'

After digging through other issues I discovered that os.environ["PJRT_DEVICE"] = "CPU" absolutely has to be set before running the conversion; I could not find any mention of this in the documentation.
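For reference, a minimal sketch of the workaround, assuming the README-style resnet18 conversion (the exact ai_edge_torch API details may differ between releases). The important part seems to be setting the variable before torch_xla initializes its PJRT client, i.e. before the conversion runs:

import os

# Workaround: force the PJRT backend to CPU before torch_xla initializes its client
# (ai_edge_torch pulls in torch_xla under the hood).
os.environ["PJRT_DEVICE"] = "CPU"

import torch
import torchvision
import ai_edge_torch

# README-style resnet18 conversion.
resnet18 = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
sample_inputs = (torch.randn(1, 3, 224, 224),)

# Convert the model in evaluation mode and serialize it to a .tflite file.
edge_model = ai_edge_torch.convert(resnet18.eval(), sample_inputs)
edge_model.export("resnet18.tflite")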

Actual vs expected behavior:

Actual: any model conversion, including the official examples, crashes with no explanation.
Expected: the conversions should work.

Any other information you'd like to share?

I tried both "stable" and nightly and I think I hit the same issue (but I'm not sure, as I also had many ipykernel crashes and segfaults that I still don't really understand).

uv pip tree --package ai-edge-torch
ai-edge-torch v0.2.0
├── numpy v1.26.4
├── scipy v1.13.1
│   └── numpy v1.26.4
├── safetensors v0.4.3
├── tabulate v0.9.0
├── torch v2.5.0+cu121
│   ├── filelock v3.15.1
│   ├── typing-extensions v4.12.2
│   ├── networkx v3.2.1
│   ├── jinja2 v3.1.4
│   │   └── markupsafe v2.1.5
│   ├── fsspec v2024.9.0
│   ├── nvidia-cuda-nvrtc-cu12 v12.1.105
│   ├── nvidia-cuda-runtime-cu12 v12.1.105
│   ├── nvidia-cuda-cupti-cu12 v12.1.105
│   ├── nvidia-cudnn-cu12 v9.1.0.70
│   │   └── nvidia-cublas-cu12 v12.1.3.1
│   ├── nvidia-cublas-cu12 v12.1.3.1
│   ├── nvidia-cufft-cu12 v11.0.2.54
│   ├── nvidia-curand-cu12 v10.3.2.106
│   ├── nvidia-cusolver-cu12 v11.4.5.107
│   │   ├── nvidia-cublas-cu12 v12.1.3.1
│   │   ├── nvidia-nvjitlink-cu12 v12.5.82
│   │   └── nvidia-cusparse-cu12 v12.1.0.106
│   │       └── nvidia-nvjitlink-cu12 v12.5.82
│   ├── nvidia-cusparse-cu12 v12.1.0.106 (*)
│   ├── nvidia-nccl-cu12 v2.21.5
│   ├── nvidia-nvtx-cu12 v12.1.105
│   ├── triton v3.1.0
│   │   └── filelock v3.15.1
│   └── sympy v1.13.1
│       └── mpmath v1.3.0
├── torch-xla v2.5.0
│   ├── absl-py v2.1.0
│   ├── numpy v1.26.4
│   ├── pyyaml v6.0.1
│   └── requests v2.32.3
│       ├── charset-normalizer v3.3.2
│       ├── idna v3.7
│       ├── urllib3 v2.2.2
│       └── certifi v2024.6.2
├── tf-nightly v2.19.0.dev20241026
│   ├── absl-py v2.1.0
│   ├── astunparse v1.6.3
│   │   ├── wheel v0.44.0
│   │   └── six v1.16.0
│   ├── flatbuffers v24.3.25
│   ├── gast v0.6.0
│   ├── google-pasta v0.2.0
│   │   └── six v1.16.0
│   ├── libclang v18.1.1
│   ├── opt-einsum v3.4.0
│   ├── packaging v24.1
│   ├── protobuf v4.25.3
│   ├── requests v2.32.3 (*)
│   ├── setuptools v66.1.1
│   ├── six v1.16.0
│   ├── termcolor v2.4.0
│   ├── typing-extensions v4.12.2
│   ├── wrapt v1.16.0
│   ├── grpcio v1.64.1
│   ├── tb-nightly v2.19.0a20241029
│   │   ├── absl-py v2.1.0
│   │   ├── grpcio v1.64.1
│   │   ├── markdown v3.7
│   │   ├── numpy v1.26.4
│   │   ├── packaging v24.1
│   │   ├── protobuf v4.25.3
│   │   ├── setuptools v66.1.1
│   │   ├── six v1.16.0
│   │   ├── tensorboard-data-server v0.7.2
│   │   └── werkzeug v3.0.3
│   │       └── markupsafe v2.1.5
│   ├── keras-nightly v3.7.0.dev2024102903
│   │   ├── absl-py v2.1.0
│   │   ├── numpy v1.26.4
│   │   ├── rich v13.7.1
│   │   │   ├── markdown-it-py v3.0.0
│   │   │   │   └── mdurl v0.1.2
│   │   │   └── pygments v2.18.0
│   │   ├── namex v0.0.8
│   │   ├── h5py v3.12.1
│   │   │   └── numpy v1.26.4
│   │   ├── optree v0.13.0
│   │   │   └── typing-extensions v4.12.2
│   │   ├── ml-dtypes v0.5.0
│   │   │   ├── numpy v1.26.4
│   │   │   ├── numpy v1.26.4
│   │   │   └── numpy v1.26.4
│   │   └── packaging v24.1
│   ├── numpy v1.26.4
│   ├── h5py v3.12.1 (*)
│   ├── ml-dtypes v0.5.0 (*)
│   └── tensorflow-io-gcs-filesystem v0.37.1
└── ai-edge-quantizer-nightly v0.0.1.dev20240718
    ├── immutabledict v4.2.0
    ├── numpy v1.26.4
    └── tf-nightly v2.19.0.dev20241026 (*)
(*) Package tree already displayed
@artemisart added the type:bug label Oct 29, 2024
@artemisart changed the title from "Please document that JPRT_DEVICE=CPU is required" to "Please document that PJRT_DEVICE=CPU is required" Oct 29, 2024
pkgoogle (Contributor) commented:
Hi @artemisart, I am able to do the conversion example w/o explicitly setting this variable on the latest code. I do see this warning:

WARNING:root:Defaulting to PJRT_DEVICE=CPU

What version of AI-Edge-Torch and torch-xla are you using? Also please describe your CPU/GPU/TPU environment. Thanks.

@pkgoogle self-assigned this Oct 29, 2024
@pkgoogle added the status:awaiting user response, status:more data needed, and type:support labels and removed the type:bug label Oct 29, 2024
artemisart commented Oct 30, 2024

The package versions are in the <details> tag of my first message. I'm on a GCP n1-standard-8 VM with a T4 GPU (NVIDIA driver 535.86.10, CUDA 12.2).
Other people seem to hit the same bug, judging by the code I see in other issues: https://github.com/search?q=repo%3Agoogle-ai-edge%2Fai-edge-torch+PJRT_DEVICE&type=issues

pkgoogle (Contributor) commented:
Hi @artemisart, it seems like you are using the latest stable version (0.2.0)... can you try the nightly versions? If it works there, the issue is already fixed and the fix will ship with the next release.

@pkgoogle removed the status:more data needed label Oct 30, 2024
copybara-service bot pushed a commit that referenced this issue Nov 1, 2024
Mitigation for #326

PiperOrigin-RevId: 692227947
copybara-service bot pushed a commit that referenced this issue Nov 1, 2024
Mitigation for #326

PiperOrigin-RevId: 692234642