
Error running Benchmark for SD2 #9

Open
radna0 opened this issue Nov 19, 2024 · 0 comments

Comments

radna0 commented Nov 19, 2024

Coming from pytorch/xla#8388, I tried running the SD2 perf-optimized implementation and ran into the following error:
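
For context before the full log: the traceback points at the backward pass of the Pallas TPU flash-attention kernel, reached through PyTorch/XLA. A minimal sketch of that call path, under the assumption that the script routes attention through `torch_xla.experimental.custom_kernel.flash_attention` (names and shapes here are illustrative, not taken from the script):

```python
# Sketch only -- assumes torch_xla.experimental.custom_kernel.flash_attention is
# the wrapper the perf-optimized SD2 script uses; requires a TPU device.
import torch
import torch_xla.core.xla_model as xm
from torch_xla.experimental.custom_kernel import flash_attention

device = xm.xla_device()
# (batch, num_heads, seq_len, head_dim) -- illustrative shapes
q = torch.randn(1, 8, 1024, 128, device=device, requires_grad=True)
k = torch.randn(1, 8, 1024, 128, device=device)
v = torch.randn(1, 8, 1024, 128, device=device)

out = flash_attention(q, k, v, causal=False)  # forward Pallas kernel
out.sum().backward()                          # backward (dq/dkv) kernels are where Mosaic reportedly fails
xm.mark_step()                                # materializes the graph; the compile error surfaces at execution
```

The full worker log: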

[worker:] + sudo groupadd docker
[worker:] groupadd: group 'docker' already exists
[worker:] + sudo usermod -aG docker kojoe
[worker:] + newgrp docker
[worker:] WARNING: Your config file at [/home/kojoe/.docker/config.json] contains these credential helper entries:
[worker:]
[worker:] {
[worker:]   "credHelpers": {
[worker:]     "us-central1-docker.pkg.dev": "gcloud"
[worker:]   }
[worker:] }
[worker:] Adding credentials for: us-central1-docker.pkg.dev
[worker:] gcloud credential helpers already registered correctly.
[worker:] Error response from daemon: Cannot kill container: kojoe-test: No such container: kojoe-test
[worker:] v4: Pulling from deeplearning-images/reproducibility/pytorch-tpu-diffusers
[worker:] Digest: sha256:38a7ced3c82f8288fa6c8a33f8c8c7ef9dfa403e4a76bbcab5ea453c0eced862
[worker:] Status: Image is up to date for us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-tpu-diffusers:v4
[worker:] us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-tpu-diffusers:v4
[worker:] WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
[worker:] Repo card metadata block was not found. Setting CardData to empty.
[worker:] WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Generating train split: 100%|██████████| 1221/1221 [00:02<00:00, 447.38 examples/s]
[worker:] ***** Running training *****
[worker:] Instantaneous batch size per device = 16
[worker:] Total train batch size (w. parallel, distributed & accumulation) = 128
[worker:]   Total optimization steps = 50
[worker:] Traceback (most recent call last):
[worker:]   File "/workspace/diffusers/examples/text_to_image/train_text_to_image_xla.py", line 579, in <module>
[worker:] Bad StatusOr access: INTERNAL: Mosaic failed to compile TPU kernel: failed to legalize operation 'tpu.mask_cast'
[worker:]
[worker:] at location: loc("/dot_general"(callsite("body"("/usr/local/lib/python3.10/site-packages/jax/experimental/pallas/ops/tpu/flash_attention.py":1211:0) at callsite("run"("/usr/local/lib/python3.10/site-packages/jax/experimental/pallas/ops/tpu/flash_attention.py":1302:0) at callsite("_flash_attention_dq_kernel"("/usr/local/lib/python3.10/site-packages/jax/experimental/pallas/ops/tpu/flash_attention.py":1301:0) at "_flash_attention_bwd_dq"("/usr/local/lib/python3.10/site-packages/jax/experimental/pallas/ops/tpu/flash_attention.py":1484:0))))))
[worker:]
[worker:] The MLIR operation involved:
[worker:]   %8069 = "tpu.mask_cast"(%8068) : (vector<8x128xi1>) -> vector<8x128x2xi1>
[worker:]
[worker:] Please report a bug at: https://github.com/google/jax/issues/new?assignees=apaszke
[worker:]
[worker:]     main(args)
[worker:]   File "/workspace/diffusers/examples/text_to_image/train_text_to_image_xla.py", line 561, in main
[worker:]     trainer.start_training()
[worker:]   File "/workspace/diffusers/examples/text_to_image/train_text_to_image_xla.py", line 86, in start_training
[worker:]     xm.mark_step()
[worker:]   File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1046, in mark_step
[worker:]     torch_xla._XLAC._xla_step_marker(
[worker:] RuntimeError: torch_xla/csrc/xla_graph_executor.cpp:689 : Check failed: tensor_data
[worker:] *** Begin stack trace ***
[worker:]       tsl::CurrentStackTrace[abi:cxx11]()
[worker:]       torch_xla::XLAGraphExecutor::CollectSyncTensors(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > > const&, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&)
[worker:]       torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&, bool)
[worker:]       torch_xla::XLAGraphExecutor::SyncTensorsGraph(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, bool, bool, bool)
[worker:]       torch_xla::XLAGraphExecutor::SyncLiveTensorsGraph(torch::lazy::BackendDevice const*, c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, bool)
[worker:]
[worker:]
[worker:]
[worker:]
[worker:]       _PyObject_MakeTpCall
[worker:]       _PyEval_EvalFrameDefault
[worker:]
[worker:]       _PyEval_EvalFrameDefault
[worker:]
[worker:]       _PyEval_EvalFrameDefault
[worker:]
[worker:]       _PyEval_EvalFrameDefault
[worker:]
[worker:]       PyEval_EvalCode
[worker:]
[worker:]
[worker:]
[worker:]       _PyRun_SimpleFileObject
[worker:]       _PyRun_AnyFileObject
[worker:]       Py_RunMain
[worker:]       Py_BytesMain
[worker:]       __libc_start_main
[worker:]       _start
[worker:] *** End stack trace ***
[worker:]
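
Since the Mosaic message asks for the bug to be reported against JAX, a torch-free repro sketch along these lines may be useful (untested assumption: a plain causal backward pass through the same `jax.experimental.pallas.ops.tpu.flash_attention` module compiles the `_flash_attention_bwd_dq` kernel named in the location above; shapes are illustrative and a TPU is required):

```python
# Sketch only -- attempts to exercise the same Pallas flash-attention backward
# kernel (_flash_attention_bwd_dq) directly from JAX, without torch_xla.
import jax
import jax.numpy as jnp
from jax.experimental.pallas.ops.tpu.flash_attention import flash_attention

def loss(q, k, v):
    # Summing the output and differentiating w.r.t. q forces compilation of the
    # backward dq kernel, where 'tpu.mask_cast' fails to legalize in the log.
    return flash_attention(q, k, v, causal=True).sum()

# (batch, num_heads, seq_len, head_dim) -- illustrative shapes
q = jnp.zeros((1, 8, 1024, 128), jnp.float32)
k = jnp.zeros((1, 8, 1024, 128), jnp.float32)
v = jnp.zeros((1, 8, 1024, 128), jnp.float32)

dq = jax.grad(loss)(q, k, v)
print(dq.shape)
```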
radna0 changed the title from "Error running" to "Error running Benchmark for SD2" on Nov 19, 2024