
Stable_diffusion_unet out of memory on XPU #505

Open

mengfei25 opened this issue Jun 28, 2024 · 4 comments

@mengfei25 (Contributor)

🐛 Describe the bug

torchbench_float32_training
xpu train stable_diffusion_unet
Traceback (most recent call last):
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2294, in validate_model
self.model_iter_fn(model, example_inputs)
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 456, in forward_and_backward_pass
pred = mod(*cloned_inputs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 985, in forward
sample = upsample_block(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 2187, in forward
hidden_states = attn(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/transformer_2d.py", line 309, in forward
hidden_states = block(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/attention.py", line 194, in forward
attn_output = self.attn1(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 322, in forward
return self.processor(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1117, in __call__
hidden_states = F.scaled_dot_product_attention(
RuntimeError: XPU out of memory, please use empty_cache to release all unoccupied cached memory.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4177, in run
) = runner.load_model(
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 380, in load_model
self.validate_model(model, example_inputs)
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2296, in validate_model
raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed

eager_fail_to_run

loading model: 0it [00:00, ?it/s]
loading model: 0it [01:21, ?it/s]

Versions

torch-xpu-ops: 31c4001
pytorch: 0f81473d7b4a1bf09246410712df22541be7caf3 + PRs: 127277,129120
device: PVC 1100, 803.61, 0.5.1
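The error message suggests releasing cached allocator memory and retrying. A minimal sketch of that "release cache and retry" pattern is below; `release_cache` stands in for `torch.xpu.empty_cache()` on XPU, and `run_with_cache_release` is a hypothetical helper, not the benchmark harness's actual code.

```python
def run_with_cache_release(fn, release_cache):
    """Run fn(); on an out-of-memory RuntimeError, release the allocator's
    cached blocks and retry once before giving up."""
    try:
        return fn()
    except RuntimeError as err:
        if "out of memory" not in str(err):
            raise
        release_cache()  # e.g. torch.xpu.empty_cache() on XPU
        return fn()

# Simulated usage: the first call "runs out of memory", the retry succeeds.
state = {"calls": 0}

def flaky_forward():
    state["calls"] += 1
    if state["calls"] == 1:
        raise RuntimeError("XPU out of memory")
    return "ok"

print(run_with_cache_release(flaky_forward, lambda: None))  # prints "ok"
```

Note that this only helps when the failure comes from allocator fragmentation; if the model genuinely exceeds device memory (as the later comment about "pass_due_to_skip" suggests), retrying after `empty_cache` will fail again.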

@retonym (Contributor) commented Jul 15, 2024

This model is marked "pass_due_to_skip" on CUDA because it is too large to run. We need to confirm it shows the same behaviour on XPU with fp32.

@chuanqi129 chuanqi129 self-assigned this Jul 18, 2024
@chuanqi129 chuanqi129 modified the milestones: PT2.5, PT2.6 Oct 14, 2024
@chuanqi129 (Contributor)

@retonym @mengfei25 may I know the latest status of this model on XPU?

@retonym (Contributor) commented Nov 19, 2024

In the latest weekly test, the Stable_diffusion_unet model fails to load due to an environment issue.

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4813, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 243, in load_model
    module = importlib.import_module(c)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/models/stable_diffusion_unet/__init__.py", line 11, in <module>
    from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/__init__.py", line 3, in <module>
    from .configuration_utils import ConfigMixin
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 34, in <module>
    from .utils import (
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/utils/__init__.py", line 37, in <module>
    from .dynamic_modules_utils import get_class_from_dynamic_module
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/utils/dynamic_modules_utils.py", line 28, in <module>
    from huggingface_hub import HfFolder, cached_download, hf_hub_download, model_info
ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/huggingface_hub/__init__.py)
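This ImportError is the classic symptom of an older diffusers release importing `cached_download`, which newer huggingface_hub releases removed. One common mitigation, besides pinning huggingface_hub or upgrading diffusers, is an import-fallback shim. The sketch below shows the generic pattern with a stdlib module so it runs anywhere; `resolve_attr` is a hypothetical helper, and note that a real shim for this case would also need an adapter, since `cached_download` and `hf_hub_download` have different signatures.

```python
import importlib

def resolve_attr(module_name, preferred, fallback):
    """Fetch `preferred` from a module, falling back to `fallback` when a
    newer release has removed the preferred name (hypothetical helper)."""
    mod = importlib.import_module(module_name)
    name = preferred if hasattr(mod, preferred) else fallback
    return getattr(mod, name)

# Demonstrated with the stdlib: 'math' has no 'isqrt_removed', so we fall
# back to 'isqrt'. For the issue above, the analogous call would be
# resolve_attr("huggingface_hub", "cached_download", "hf_hub_download").
isqrt = resolve_attr("math", "isqrt_removed", "isqrt")
print(isqrt(81))  # prints 9
```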

@riverliuintel riverliuintel assigned mengfei25 and unassigned chuanqi129 and retonym Nov 21, 2024
@mengfei25 (Contributor, Author)

Fixed the import of 'cached_download' from 'huggingface_hub' in #1218.
