
Stable_diffusion_unet out of memory on XPU #505

Open

mengfei25 opened this issue Jun 28, 2024 · 4 comments

@mengfei25 (Contributor)

🐛 Describe the bug

torchbench_float32_training
xpu train stable_diffusion_unet
Traceback (most recent call last):
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2294, in validate_model
self.model_iter_fn(model, example_inputs)
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 456, in forward_and_backward_pass
pred = mod(*cloned_inputs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 985, in forward
sample = upsample_block(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 2187, in forward
hidden_states = attn(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/transformer_2d.py", line 309, in forward
hidden_states = block(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/attention.py", line 194, in forward
attn_output = self.attn1(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 322, in forward
return self.processor(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1117, in __call__
hidden_states = F.scaled_dot_product_attention(
RuntimeError: XPU out of memory, please use empty_cache to release all unoccupied cached memory.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4177, in run
) = runner.load_model(
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 380, in load_model
self.validate_model(model, example_inputs)
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2296, in validate_model
raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed

eager_fail_to_run

loading model: 0it [00:00, ?it/s]
loading model: 0it [01:21, ?it/s]

Versions

torch-xpu-ops: 31c4001
pytorch: 0f81473d7b4a1bf09246410712df22541be7caf3 + PRs: 127277,129120
device: PVC 1100, 803.61, 0.5.1
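The error message suggests releasing cached allocator memory and retrying. A minimal sketch of that "release cache and retry" pattern is below; `release_cache` stands in for `torch.xpu.empty_cache()` on XPU, and `run_with_cache_release` is a hypothetical helper, not the benchmark harness's actual code.

```python
def run_with_cache_release(fn, release_cache):
    """Run fn(); on an out-of-memory RuntimeError, release the allocator's
    cached blocks and retry once before giving up."""
    try:
        return fn()
    except RuntimeError as err:
        if "out of memory" not in str(err):
            raise
        release_cache()  # e.g. torch.xpu.empty_cache() on XPU
        return fn()

# Simulated usage: the first call "runs out of memory", the retry succeeds.
state = {"calls": 0}

def flaky_forward():
    state["calls"] += 1
    if state["calls"] == 1:
        raise RuntimeError("XPU out of memory")
    return "ok"

print(run_with_cache_release(flaky_forward, lambda: None))  # prints "ok"
```

Note that this only helps when the failure comes from allocator fragmentation; if the model genuinely exceeds device memory (as the later comment about "pass_due_to_skip" suggests), retrying after `empty_cache` will fail again.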

@retonym (Contributor) commented Jul 15, 2024

This model is marked "pass_due_to_skip" on CUDA because it is too large to run. We need to confirm it shows the same behaviour on XPU with fp32.

@chuanqi129 chuanqi129 self-assigned this Jul 18, 2024
@chuanqi129 chuanqi129 modified the milestones: PT2.5, PT2.6 Oct 14, 2024
@chuanqi129 (Contributor)

@retonym @mengfei25 may I know the latest status of this model on XPU?

@retonym (Contributor) commented Nov 19, 2024

In the latest weekly test, the Stable_diffusion_unet model fails to load due to an environment issue.

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4813, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 243, in load_model
    module = importlib.import_module(c)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/models/stable_diffusion_unet/__init__.py", line 11, in <module>
    from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/__init__.py", line 3, in <module>
    from .configuration_utils import ConfigMixin
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 34, in <module>
    from .utils import (
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/utils/__init__.py", line 37, in <module>
    from .dynamic_modules_utils import get_class_from_dynamic_module
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/diffusers/utils/dynamic_modules_utils.py", line 28, in <module>
    from huggingface_hub import HfFolder, cached_download, hf_hub_download, model_info
ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/huggingface_hub/__init__.py)
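This ImportError is the classic symptom of an older diffusers release importing `cached_download`, which newer huggingface_hub releases removed. One common mitigation, besides pinning huggingface_hub or upgrading diffusers, is an import-fallback shim. The sketch below shows the generic pattern with a stdlib module so it runs anywhere; `resolve_attr` is a hypothetical helper, and note that a real shim for this case would also need an adapter, since `cached_download` and `hf_hub_download` have different signatures.

```python
import importlib

def resolve_attr(module_name, preferred, fallback):
    """Fetch `preferred` from a module, falling back to `fallback` when a
    newer release has removed the preferred name (hypothetical helper)."""
    mod = importlib.import_module(module_name)
    name = preferred if hasattr(mod, preferred) else fallback
    return getattr(mod, name)

# Demonstrated with the stdlib: 'math' has no 'isqrt_removed', so we fall
# back to 'isqrt'. For the issue above, the analogous call would be
# resolve_attr("huggingface_hub", "cached_download", "hf_hub_download").
isqrt = resolve_attr("math", "isqrt_removed", "isqrt")
print(isqrt(81))  # prints 9
```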

@riverliuintel riverliuintel assigned mengfei25 and unassigned chuanqi129 and retonym Nov 21, 2024
@mengfei25 (Contributor, Author)

Fixed the import of 'cached_download' from 'huggingface_hub' in #1218.
