Flux.1 Dev, memory issue #4271

Open
iKurama opened this issue Aug 8, 2024 · 113 comments
Labels: Potential Bug (User is reporting a bug. This should be tested.)

Comments

@iKurama

iKurama commented Aug 8, 2024

Expected Behavior

I expect no issues. I had installed comfyui anew a couple days ago, no issues, 4.6 seconds per iteration~

Actual Behavior

After updating, I'm now experiencing 20 seconds per iteration.

Steps to Reproduce

Install the newest version + update the python deps.

Debug Logs

E:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
Total VRAM 12281 MB, total RAM 16296 MB
pytorch version: 2.4.0+cu118
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: E:\ComfyUI_windows_portable\ComfyUI\web
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
   0.0 seconds: E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
E:\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9763.075
 55%|█████████████████████████████████████████████                                     | 11/20 [03:36<02:56, 19.58s/it]

Other

I have no idea why it has f'd up.

iKurama added the "Potential Bug" label Aug 8, 2024
@BigBanje

BigBanje commented Aug 8, 2024

Having this issue as well. I updated yesterday, and now everything runs at half speed compared to before. I was using fp16, and even after downgrading to fp8 it's still half speed.

To be clear, my generation speed itself is still the same it/s, but the time spent loading the models is dramatically longer. My terminal shows the same output as OP's.

@JorgeR81

JorgeR81 commented Aug 8, 2024

I noticed about 50% performance degradation, with a lower-end GPU but more RAM.
I have a GTX 1070 (8 GB VRAM) and 32 GB RAM.

On a 1024 x 1024 img, I was at 30 s/it and now I'm at 45 s/it
On a 512 x 768 img, I was at 20 s/it and now I'm at 30 s/it

@iKurama
Author

iKurama commented Aug 8, 2024

After some testing, I re-downloaded Portable v0.0.2 and went back to commit f123328, and I went down to ~6.2 s/it.
Then I tried to install xformers and update Comfy + the Python deps, and I was back to 20+ s/it.

So it's pretty safe to say it's a Python issue.

@comfyanonymous
Owner

Can you update and test the latest commit to see if things are better?

@JorgeR81

JorgeR81 commented Aug 8, 2024

For me, there was no improvement.
I still have about the same generation times.

@BigBanje

BigBanje commented Aug 8, 2024

I was able to resolve my issue by uninstalling two old versions of Python (didn't realize I had them) and reinstalling Python & ComfyUI from scratch via zip extraction.

It was immediately able to do fp8 at the previous speed, but still had serious bottlenecking on fp16. I then enabled the --highvram flag and it works as fast as before (however, I did not need this flag before).
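If anyone else suspects a stale Python install is getting picked up, here is a quick sanity check of which interpreter and site-packages ComfyUI is actually using (just a sketch; run it with the same python.exe that launches ComfyUI, e.g. the one in python_embeded for the portable build):

import sys
print("interpreter   :", sys.executable)   # should point into python_embeded, not a system-wide install
print("site-packages :", [p for p in sys.path if "site-packages" in p])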

@San4itos

San4itos commented Aug 8, 2024

I use ComfyUI with ROCm.

Python version: 3.10.14
Total VRAM 16368 MB, total RAM 31713 MB
pytorch version: 2.4.0+rocm6.1
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7800 XT : native

With the latest version I have

loading in lowvram mode 9011.532500000001
20/20 [01:41<00:00,  5.07s/it]

On the commit mentioned here, f123328, I have

loading in lowvram mode 9493.052499771118
 20/20 [01:34<00:00,  4.73s/it]

I tried earlier commit eca962c which was faster for me and got

loading in lowvram mode 10977.61249923706
20/20 [01:12<00:00,  3.64s/it]

I see degradation of speed each time I update ComfyUI.

@davoodice

davoodice commented Aug 8, 2024

Same here. I noticed that the CPU and GPU are both working in a ping-pong pattern.
[three screenshots of CPU and GPU usage]

64 GB RAM
RTX 4070 Ti

Before this, it took some time for the model to load into RAM and then into VRAM, but now it is loaded into VRAM immediately.
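If it helps, a rough way to see how much actually ends up resident in VRAM around a generation (a minimal sketch, assuming a CUDA build of PyTorch; the number ComfyUI prints after "loading in lowvram mode" is its own estimate, not this measurement):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one generation here ...
print("allocated now  MB:", torch.cuda.memory_allocated() // 2**20)
print("peak allocated MB:", torch.cuda.max_memory_allocated() // 2**20)
print("reserved       MB:", torch.cuda.memory_reserved() // 2**20)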

@iKurama
Author

iKurama commented Aug 8, 2024

E:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
Total VRAM 12281 MB, total RAM 16296 MB
pytorch version: 2.4.0+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: E:\ComfyUI_windows_portable\ComfyUI\web
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
0.0 seconds: E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
E:\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9763.075
10%|████████▎ | 2/20 [00:42<06:20, 21.12s/it]

This was a fresh reinstall and a fresh pull, including the one after the "fix" push.

Gonna try BigBanje's solution next.

@btibor91

btibor91 commented Aug 8, 2024

I have experienced the same problems (extreme slowdown and very frequent OOM crashes) after the recent update. It returned to normal after rolling back to commit 1aa9cf3 (20G VRAM).

@JorgeR81

JorgeR81 commented Aug 8, 2024

I've been updating every day, but I'm not sure when my performance degradation started.
I know I was fine on August 3, because I posted some data here about performance.
I was probably fine 2 or 3 days after that, but then I stopped paying attention to times, until I saw this issue.
I was trying different settings and resolutions so times were always different ...

@iKurama
Author

iKurama commented Aug 8, 2024

BigBanje's solution did not work.

E:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --lowvram
Total VRAM 12281 MB, total RAM 16296 MB
pytorch version: 2.4.0+cu121
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: E:\ComfyUI_windows_portable\ComfyUI\web
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
0.0 seconds: E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9823.2
0%| | 0/20 [00:00<?, ?it/s]E:\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
5%|████▏ | 1/20 [00:20<06:25, 20.27s/it]

///////////

Loading on commit 1aa9cf3 gives me this error:

When loading the graph, the following node types were not found:

FluxGuidance
ModelSamplingFlux

Nodes that have failed to load will show as red on the graph.

///

I do not have any issue whatsoever with my CPU, like others do.
I went back to f123328, but I essentially go OOM, except it just freezes instead.

E:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --lowvram
Total VRAM 12281 MB, total RAM 16296 MB
pytorch version: 2.4.0+cu121
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: E:\ComfyUI_windows_portable\ComfyUI\web
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
0.0 seconds: E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model_type FLUX
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9823.199999809265
0%| | 0/20 [00:00<?, ?it/s]E:\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
5%|████▏ | 1/20 [00:52<16:39, 52.59s/it]

[screenshot]

@iKurama
Author

iKurama commented Aug 8, 2024

Gonna try to load up v0.0.2 again, on f123328, and see if I can replicate my own solution, because I suspect Python (PyTorch 2.4) is the issue.

@comfyanonymous
Owner

Can you all post your full console output if you have not already?

@iKurama
Author

iKurama commented Aug 8, 2024

It's the full console, sadly

@iKurama
Author

iKurama commented Aug 8, 2024

I'll even screenshot it, given how little there is.

@BigBanje

BigBanje commented Aug 8, 2024

I will disable --highvram and post my fp16 output, one moment.

@davoodice

davoodice commented Aug 8, 2024

got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLOW
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Model doesn't have a device attribute.
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9725.074980926514
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:22<00:00, 5.63s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 38.51 seconds
[screenshot of GPU usage]

The GPU can't stay at 100%.

@BigBanje

BigBanje commented Aug 8, 2024

fp16 without --highvram

C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
Total VRAM 24575 MB, total RAM 16268 MB
pytorch version: 2.4.0+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\ComfyUI\web
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
0.0 seconds: C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e5m2, manual cast: torch.bfloat16
model_type FLUX
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
100%|████████| 20/20 [00:27<00:00, 1.36s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 199.66 seconds

fp16 with --highvram (only pasting from "Starting server" and below as the rest was identical)

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e5m2, manual cast: torch.bfloat16
model_type FLUX
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
100%|███████████| 20/20 [00:26<00:00, 1.32s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 128.00 seconds

I know I'm on a low-RAM, high-VRAM machine, so --highvram fixing my issue isn't too surprising... but I didn't need to do that before. Not sure what the "1Torch" warning is about; I seem to get it no matter what I do since an update a few days ago.
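On the "1Torch was not compiled with flash attention" warning, a quick way to check which scaled-dot-product-attention backends the installed wheel will even consider (a minimal sketch; as far as I can tell, some Windows wheels simply ship without the flash kernels, so PyTorch falls back to the other backends and prints that warning):

import torch
print("torch", torch.__version__, "| cuda", torch.version.cuda)
# which SDPA backends this build is willing to use (enabled is not the same as compiled in,
# but a wheel built without flash attention will warn once and fall back)
print("flash        :", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math         :", torch.backends.cuda.math_sdp_enabled())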

@JorgeR81

JorgeR81 commented Aug 8, 2024

Here is my full console output, after opening ComfyUI and doing a few Flux generations.

I'm at 512 x 768 resolution, with the fp8 versions, at ~30 s/it.
But on August 3, I was at ~20 s/it.

console output
C:\Cui\cu_121_2\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --force-fp16 --windows-standalone-build
[START] Security scan
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2024-08-08 20:23:12.203838
** Platform: Windows
** Python version: 3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]
** Python executable: C:\Cui\cu_121_2\ComfyUI_windows_portable\python_embeded\python.exe
** ComfyUI Path: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI
** Log path: C:\Cui\cu_121_2\ComfyUI_windows_portable\comfyui.log

Prestartup times for custom nodes:
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\rgthree-comfy
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Marigold
   2.1 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager

Total VRAM 8192 MB, total RAM 32727 MB
pytorch version: 2.1.0+cu121
Forcing FP16.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce GTX 1070 : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\web
Adding extra search path checkpoints d:/ComfyUI/models/checkpoints/
Adding extra search path clip d:/ComfyUI/models/clip/
Adding extra search path clip_vision d:/ComfyUI/models/clip_vision/
Adding extra search path configs d:/ComfyUI/models/configs/
Adding extra search path controlnet d:/ComfyUI/models/controlnet/
Adding extra search path embeddings d:/ComfyUI/models/embeddings/
Adding extra search path loras d:/ComfyUI/models/loras/
Adding extra search path unet d:/ComfyUI/models/unet/
Adding extra search path upscale_models d:/ComfyUI/models/upscale_models/
Adding extra search path vae d:/ComfyUI/models/vae/
[ComfyUI- ] Loaded all nodes and apis.
### Loading: ComfyUI-Impact-Pack (V6.1)
### Loading: ComfyUI-Impact-Pack (Subpack: V0.6)
[Impact Pack] Wildcards loading done.
### Loading: ComfyUI-Inspire-Pack (V0.83)
### Loading: ComfyUI-Manager (V2.48.6)
### ComfyUI Revision: 2492 [66d42332] | Released on '2024-08-08'
C:\Cui\cu_121_2\ComfyUI_windows_portable\python_embeded\Lib\site-packages\diffusers\models\transformers\transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
Total VRAM 8192 MB, total RAM 32727 MB
pytorch version: 2.1.0+cu121
Forcing FP16.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce GTX 1070 : cudaMallocAsync
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
[ReActor] - STATUS - Running v0.5.1-a6 in ComfyUI
C:\Cui\cu_121_2\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torchvision\transforms\functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
Torch version: 2.1.0+cu121
[comfyui_controlnet_aux] | INFO -> Using ckpts path: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\comfyui_controlnet_aux\ckpts
[comfyui_controlnet_aux] | INFO -> Using symlinks: False
[comfyui_controlnet_aux] | INFO -> Using ort providers: ['CUDAExecutionProvider', 'DirectMLExecutionProvider', 'OpenVINOExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider', 'CoreMLExecutionProvider']
DWPose: Onnxruntime with acceleration providers detected
Please 'pip install xformers'
Nvidia APEX normalization not installed, using PyTorch LayerNorm

[rgthree] Loaded 39 extraordinary nodes.

WAS Node Suite: BlenderNeko's Advanced CLIP Text Encode found, attempting to enable `CLIPTextEncode` support.
WAS Node Suite: `CLIPTextEncode (BlenderNeko Advanced + NSP)` node enabled under `WAS Suite/Conditioning` menu.
WAS Node Suite: OpenCV Python FFMPEG support is enabled
WAS Node Suite Warning: `ffmpeg_bin_path` is not set in `C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\was-node-suite-comfyui\was_suite_config.json` config file. Will attempt to use system ffmpeg binaries if available.
WAS Node Suite: Finished. Loaded 219 nodes successfully.

        "Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work." - Steve Jobs


Import times for custom nodes:
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_Noise
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\cg-use-everywhere
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_Cutoff
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_ADV_CLIP_emb
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_TiledKSampler
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_InstantID
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-AutomaticCFG
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_IPAdapter_plus
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Custom-Scripts
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_UltimateSDUpscale
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\rgthree-comfy
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-0246
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_essentials
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\aegisflow_utility_nodes
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\comfyui_controlnet_aux
   0.1 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Inspire-Pack
   0.2 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_smZNodes
   0.2 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Marigold
   0.3 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\PuLID_ComfyUI
   0.3 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager
   0.3 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_FaceAnalysis
   0.6 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\comfyui-reactor-node
   0.6 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Impact-Pack
   2.1 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\was-node-suite-comfyui

Starting server

To see the GUI go to: http://127.0.0.1:8188
FETCH DATA from: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager\extension-node-map.json [DONE]
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type FLUX
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loading in lowvram mode 5938.199950408935
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [06:05<00:00, 30.50s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 492.76 seconds
got prompt
loading in lowvram mode 5928.199950408935
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [06:00<00:00, 30.01s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 363.30 seconds
got prompt
loading in lowvram mode 5928.199950408935
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [06:01<00:00, 30.11s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 364.53 seconds
got prompt
loading in lowvram mode 5928.199935150146
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [08:00<00:00, 30.00s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 483.37 seconds

@iKurama
Author

iKurama commented Aug 8, 2024

Gonna try to load up v0.0.2 again, on f123328, and see if I can replicate my own solution, because I suspect Python (PyTorch 2.4) is the issue.

And naturally, a screenshot as promised
[screenshot of the console output]

@Bortus-AI

Bortus-AI commented Aug 8, 2024

Same issue here. Was working great but something updated and now it takes 10x longer and goes into lowvram mode. Still trying to figure out what changed

@comfyanonymous
Owner

Can you test if the latest commit improves things?

@davoodice

davoodice commented Aug 8, 2024

[SKIP] Downgrading pip package isn't allowed: transformers (cur=4.38.2)
[SKIP] Downgrading pip package isn't allowed: tokenizers (cur=0.15.2)
[SKIP] Downgrading pip package isn't allowed: safetensors (cur=0.4.3)
[SKIP] Downgrading pip package isn't allowed: kornia (cur=0.7.2)

Anyway, I think things got worse.

[screenshot of GPU usage]

see gpu usage

@JorgeR81

JorgeR81 commented Aug 8, 2024

No improvement for me.
I still have the same times.

got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type FLUX
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loaded in lowvram mode 5838.19995803833
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [05:10<00:00, 31.01s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 434.71 seconds

@davoodice

No improvement for me. I still have the same times.

got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type FLUX
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loaded in lowvram mode 5838.19995803833
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [05:10<00:00, 31.01s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 434.71 seconds

Can you share your GPU graph while rendering?

@Bortus-AI

Bortus-AI commented Aug 8, 2024

Can you test if the latest commit improves things?

Massive improvement. I also nuked and redid ComfyUI from source instead of portable with the latest commits and it took 15 seconds vs 2.5 minutes for 10 steps.

It still goes into lowvram mode when it shouldn't

@JorgeR81

JorgeR81 commented Aug 8, 2024

Can you share your GPU graph while rendering?

[screenshot]

@davoodice

Are you sure you can use Flux at full speed with 8 GB of VRAM? I think it's impossible.

@Kinglord

Kinglord commented Aug 12, 2024

I don't want to open a new issue for this but maybe I should - I just want @comfyanonymous to know this is far from an issue only affecting Flux models. My ComfyUI was working perfectly fine, but I decided to update today to try out some Flux models (it had been about a week since I upgraded) and now ComfyUI is all but unusable on my standard SDXL workflows.

My iterations per second have dropped to about 1/5th of what they were before the update, and whenever I run batches I now OOM and crash completely on any batch >10. I am also used to being able to do things that are not graphics-intensive when using ComfyUI - like web browsing, Discord, etc. - but now ComfyUI murders even my system FPS down to around 5. It doesn't appear to really be using my GPU, but it slowly eats up all my system RAM until it OOMs (it's happy to try and eat 60 GB of system RAM!).

I don't have to use Flux for anything, thankfully, so I'm going to just downgrade back to a version that worked OK. I'm just a bit shocked there isn't more noise about this outside of Flux but I guess everyone is just trying Flux these days. 😀

FWIW I don't know where things went sideways for me, but I tested and can confirm that the tagged version in the 0.0.4 standalone release works without any issues (just the code; I'm not messing with the actual Python libs, etc.).

Edit: I don't think it matters at all but just in case

  • Standalone install
  • 64GB system ram
  • 4080 16GB Vram
  • v0.0.2-162-gce37c11

@comfyanonymous
Owner

If you still have issues on the latest master I need:

Your system specs.

Your exact workflow.

@Kinglord

Kinglord commented Aug 12, 2024

If you still have issues on the latest master I need:

Your system specs.
Attached as dxdiag
Your exact workflow.
Attached as json

I'm running the Python & libs from the 0.0.3 standalone install, but if there's any particular version info you need for any packages or anything else I haven't included, just let me know. The workflow is one I use to test prompts against a collection of checkpoints I have, which seems to cause a good chunk of the issues (obviously, since it doesn't OOM until around 10 in).

DxDiag.txt
PonyChkpntTest.json

@comfyanonymous
Owner

Can you try without custom nodes to see if it's a core problem: --disable-all-custom-nodes

@Kinglord

Kinglord commented Aug 12, 2024

Alright, sorry for the late reply but I tried to test things as extensively as I could here. tl;dr is that it appears to be a core problem, with no custom nodes running (using the flag you provided) I still run into this issue with massive slowdowns, OOMs, crashes, etc.

I was able to troubleshoot and find what appears to be the smoking gun on the issue, which is the LoRA loaders. The normal Lora Loader using CLIP seems to be worse than the Model Only Lora Loader, but they both eventually lead to the same OOM errors, GPU crashing, etc.

If there's any more data I can provide to help here just let me know. The console window usually dies with these crashes but if there's a log I can provide happy to do so if it would help at all.

These tests were done on the v0.0.2-163-gb8ffb29 tag (release 0.0.7). This issue isn't present on release 0.0.4. I had no intention of hijacking this thread; if you'd like me to open a new ticket, since this appears related to LoRAs and not the base checkpoint flows, I'm happy to do so.

@MDMAchine

MDMAchine commented Aug 12, 2024

Updated to b8ffb29 and disabled all custom nodes. I would normally expect this to run in about 19-25 seconds. On the bright side, it appears the VRAM releases the memory upon completion.

Made a basic workflow, with no loras or anything extra.
Checkpoint loader > speedFP8_e5m2.safetensors
DualCLIPLoader> t5xxl_fp3_e4m3fn.safetensors & clip_l.safetensors
Sampler: euler
Scheduler: beta
Steps: 4

System info:
Total VRAM 8192 MB, total RAM 48735 MB
pytorch version: 2.2.2+cu121
xformers version: 0.0.25.post1
Set vram state to: NORMAL_VRAM
Disabling smart memory management
Device: cuda:0 NVIDIA GeForce RTX 3060 Ti : cudaMallocAsync
Using xformers cross attention

Run info:
Requested to load Flux
Loading 1 new model
loaded partially 5831.075 0
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:53<00:00, 13.40s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 59.79 seconds

DxDiag.txt
workflow.json

@hypervoxel

hypervoxel commented Aug 13, 2024

RTX 4090, 128 GB RAM, Windows 10. Brand new install of ComfyUI portable, latest Flux; it took 4 minutes to generate the demo image (the anime girl with bunny ears) at 11.13 s/it using the dev model. Feels very slow. It seems to always load in low-VRAM mode (that is not correct, right?).

@bryancpe

Use the package linked on here: https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.0.4, it has pytorch 2.3.1. Everything returned to normal

Thanks! You are my hero! I updated everything and it was almost a 10x slowdown. I uninstalled pytorch and then reinstalled 2.3.1.
Here is my command sequence:

pip3 uninstall torch
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118

Before:
100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [06:56<00:00, 104.02s/it]
After:
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:10<00:00, 17.73s/it]

@hypervoxel

Use the package linked on here: https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.0.4, it has pytorch 2.3.1. Everything returned to normal

Thanks! You are my hero! I updated everything and it was almost a 10x slowdown. I uninstalled pytorch and then reinstalled 2.3.1. Here is my command sequence:

pip3 uninstall torch pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118

Before: 100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [06:56<00:00, 104.02s/it] After: 100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:10<00:00, 17.73s/it]

Thank you. I downloaded the new portable version. It does not seem any faster. Still runs in low vram mode
I did get this error " UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)"

@bryancpe

bryancpe commented Aug 13, 2024


I didn't use that portable installer, but that note clued me into torch being an issue.
Did you first uninstall torch?
pip3 uninstall torch

then:
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
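After the reinstall, a quick check that the expected build actually got picked up (a minimal sketch; run it with the same interpreter ComfyUI uses):

import torch
print("torch    :", torch.__version__)   # should report 2.3.1+cu118 after the commands above
print("cuda     :", torch.version.cuda)
print("cuda ok  :", torch.cuda.is_available())
print("device   :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")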

@comfyanonymous
Owner

Can you print the full log and tell me exactly which portable version you downloaded (there are 3 of them).

@hypervoxel


I think pytorch is included with the portable version? The log states pytorch version: 2.3.1+cu121

@hypervoxel

hypervoxel commented Aug 13, 2024

Can you print the full log and tell me exactly which portable version you downloaded (there are 3 of them).

I've tried two portables so far, though they might actually have been the same

https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.0.4

I just set the weight type to fp8 and it's so much faster (like 50x, though now it's not fp16, right?).

Will continue to test. Next time I restart Comfy I will upload the log.

@Creepybits

I willingly admit that I don't know much about these things. But Forge has released an update so it's possible to run the nf4 model there, and I tried it for a bit.

I seem to get the same or similar issues on Forge that many seem to get on Comfy. It acted as if it just continued to eat memory nonstop and never cleaned up.

I offloaded to my system page file, so together with my GPU I had 60-70 GB of memory. I could generate maybe 3-4 images, and then I got an error saying that CUDA ran out of memory, and I had to restart my system.

Then I could generate another 3-4 images, and then had to restart.

That's how it went on.

My thought is that maybe the issue is with BitsAndBytes? Forge pretty much copied that part from Comfy. And since the issues seem similar, it kind of makes sense. To me, at least.

@comfyanonymous
Owner

If anyone has issues, make sure you run update/update_comfyui.bat to update ComfyUI first.

@MDMAchine

MDMAchine commented Aug 13, 2024

Updated Observations with ComfyUI (2527[b8ffb2]):

After further testing, I've found that loading the flux model (either the 16-bit flux1_schnell UNET or speedFP8_e5m2 checkpoint) using the Load Diffusion Model node, and keeping the weight type at its default, resolves the issue. Speeds are similar for both models.

However, switching the weight type for either of these models to any of the FP8 options results in a significant slowdown—processing time increases from 19 seconds to 54 seconds per 4 steps.

Additionally, if I load the FP8 model (speedFP8_e5m2) in the Checkpoint Loader node, it also becomes 3x slower. I haven't yet tested a non-FP8 flux model since the only versions I've downloaded are FP8 and NF4, besides the UNETs.

VRAM releases after generation.

The latest CheckpointLoaderNF4 node also releases VRAM, but:

Using NF4 node:
Res 1024x1024 - Any batch size over 1 results in an OOM error.
Generating a larger image (1080x1920) also results in an OOM error.

Load Diffusion Model (default weights):
Res 1024x1024 - Batch of 4 @ 6 steps: 1:29
Res 1024x1024 - Single image @ 4 steps: 0:25
Res 1080x1920 @ 4 steps: 0:32
Batch of 2 Res 1080x1920 @ 4 steps: 0:54

Basic SDXL workflow seems to be functioning well. I’ll continue testing with a562c1 to see how it performs.
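For what it's worth, the fp8 slowdown lines up with the "model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16" line in the logs: as far as I understand it, weights stored as float8 get upcast to bf16 at compute time. A rough, self-contained illustration of that cast overhead (just a sketch; it needs a PyTorch recent enough to have float8 dtypes, and the sizes and timings are only illustrative, not ComfyUI's actual code path):

import time
import torch

dev = "cuda"
w_bf16 = torch.randn(4096, 4096, device=dev, dtype=torch.bfloat16)
w_fp8 = w_bf16.to(torch.float8_e4m3fn)   # fp8 storage, half the memory of bf16
x = torch.randn(4096, 4096, device=dev, dtype=torch.bfloat16)

def bench(fn, n=50):
    torch.cuda.synchronize()
    t = time.time()
    for _ in range(n):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t) / n

native = bench(lambda: x @ w_bf16)                    # weights already in the compute dtype
casted = bench(lambda: x @ w_fp8.to(torch.bfloat16))  # upcast on every use ("manual cast")
print(f"bf16 weights: {native * 1000:.2f} ms per matmul")
print(f"fp8 -> bf16 : {casted * 1000:.2f} ms per matmul")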

@MDMAchine

No change on a562c1 for the results of the NF4 node or using FP8 weights.

@iKurama
Author

iKurama commented Aug 13, 2024

Did some misc stuff on and off, running backups and such. Still on f123328.

I went with

pip install torch torchvision torchaudio xformers --extra-index-url https://download.pytorch.org/whl/cu121

And now I'm back to around 5.3s/it, in both lowvram & normalvram mode - it no longer forces me to use lowvram. I'll take the loss of 0.6s/it and be happy with it as is.

@Thireus

Thireus commented Aug 13, 2024

If anyone has issues, make sure you run update/update_comfyui.bat to update ComfyUI first.

Isn't it the same as running "Update ComfyUI" or "Update All" through the manager and restarting?

@ltdrdata
Collaborator

FYI, the ComfyUI update in ComfyUI-Manager does not perform a torch update for safety reasons.

@benzstation

I have the same issue (~20s/it), when using an all-in-one flux model (flux1.dev fp8 model with fp8 e4m3fn weights and t5xxl fp8 e4m3fn and vae baked in) on regular checkpoint loaders.

Although, it works flawlessly when using the original flux model (flux1.dev fp8 with default weights, clip_l and t5xxl fp8 e4m3fn clips loaded separately, and vae loaded separately) on the unet loader node (~2s/it):
[screenshot]

When using the regular loader, I see this in the console:
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16

When using the unet loader, I see this instead:
model weight dtype torch.bfloat16, manual cast: None

Why is the regular checkpoint loader forcing manual cast to use torch.bfloat16? The model is built with fp8 e4m3fn weights.

Happy to share more details.
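One way to compare the two files is to read the tensor dtypes straight out of the safetensors header, without loading any weights (a minimal sketch; the filename is just a placeholder for whichever checkpoint you want to inspect):

import json
import struct
from collections import Counter

path = "flux1-dev-fp8-all-in-one.safetensors"   # placeholder filename
with open(path, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]   # safetensors files start with an 8-byte header size
    header = json.loads(f.read(header_len))

dtypes = Counter(v["dtype"] for k, v in header.items() if k != "__metadata__")
print(dtypes)   # e.g. counts of F8_E4M3 vs BF16 / F16 tensors

If both files really store the UNet weights as F8_E4M3, the difference presumably comes from how the two loaders pick the compute dtype rather than from the weights themselves.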

@Thireus

Thireus commented Aug 15, 2024

Thanks @ltdrdata!

--

FYI, with the recent update that disables cuda malloc by default, I've had to add --cuda-malloc back because performance with Flux was just terrible without it.

@johnr14

johnr14 commented Aug 15, 2024

Threadripper 1900
32 GB RAM
Vega 64 8 GB
Fedora 40 latest

Was running flux and all last week. Was slow but running.
Now, I can't run anything, not even SDXL.

I always get :
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16

I tried rolling back ComfyUI git commits, but to no avail.
b8ffb2 -> same dtype issue

tried flags:
python main.py --listen --preview-method auto --cuda-device 1 --verbose --fp8_e4m3fn-text-enc --fp8_e4m3fn-unet --lowvram --disable-cuda-malloc --disable-smart-memory

Could it be related to Python updates (torch or other) or ROCm updates (I rolled back to 2.3)?
Will try a few more rollbacks.

+1 for adding testing and release git branches, with pinned pip package versions for a given release, and GitHub Actions to TEST releases on a cloud runner before release and monitor performance. This would help spot general performance regressions.

Summary of logs:
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2024-08-15 08:22:49.352379
** Platform: Linux
** Python version: 3.12.4 (main, Jun  7 2024, 00:00:00) [GCC 14.1.1 20240607 (Red Hat 14.1.1-5)]
** Python executable: /usr/bin/python
** ComfyUI Path: ~/github/ComfyUI
** Log path: ~/github/ComfyUI/comfyui.log

Prestartup times for custom nodes:
   2.8 seconds: ~/github/ComfyUI/custom_nodes/ComfyUI-Manager

Set cuda device to: 1
Total VRAM 8176 MB, total RAM 31960 MB
pytorch version: 2.3.0+rocm6.0
Set vram state to: LOW_VRAM
Device: cuda:0 AMD Radeon RX Vega : native
Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --use-split-cross-attention
Using selector: EpollSelector
### ComfyUI Revision: 2542 [0f9c2a78] | Released on '2024-08-14'
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
adm 0
~/.local/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Model doesn't have a device attribute.
CLIP model load device: cpu, offload device: cpu
clip unexpected: ['encoder.embed_tokens.weight']
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
pip freeze:
aiohappyeyeballs==2.3.5
aiohttp==3.10.3
aiosignal==1.3.1
albucore==0.0.13
albumentations==1.4.13
annotated-types==0.7.0
ansible==9.8.0
ansible-core==2.16.9
antlr4-python3-runtime==4.9.3
anyio==4.4.0
appdirs==1.4.4
argcomplete==3.3.0
arrow==1.3.0
attrs==23.2.0
beautifulsoup4==4.12.3
binaryornot==0.4.4
bitsandbytes==0.43.3
black==24.8.0
borgbackup==1.2.8
borgmatic==1.8.13
boto3==1.34.153
botocore==1.34.153
Brlapi==0.8.5
Brotli==1.1.0
certifi==2023.5.7
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cockpit @ file:///builddir/build/BUILD/cockpit-322/tmp/wheel/cockpit-322-py3-none-any.whl#sha256=5587d8c988d8b9ebe77b0cca42347ff1cf40338aac6721b9251be1479e6ca19c
colorama==0.4.6
coloredlogs==15.0.1
colour-science==0.4.4
configobj==5.0.8
contourpy==1.2.0
cookiecutter==2.6.0
cryptography==41.0.7
cupshelpers==1.0
cycler==0.11.0
Cython==3.0.11
dasbus==1.7
dbus-python==1.3.2
deepdiff==7.0.1
Deprecated==1.2.14
discover-overlay==0.7.2
distro==1.9.0
dnspython==2.6.1
easydict==1.13
einops==0.8.0
email_validator==2.2.0
eval_type_backport==0.2.0
evdev==1.6.1
fastapi==0.112.0
fedora-third-party==0.10
fido2==1.1.2
file-magic==0.4.0
filelock==3.13.1
flatbuffers==24.3.25
flet==0.23.2
flet-core==0.23.2
flet-runtime==0.23.2
fonttools==4.50.0
fros==1.1
frozenlist==1.4.1
fs==2.4.16
fsspec==2024.6.1
gbinder-python==1.1.2
gbulb==0.6.4
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.24.5
humanfriendly==10.0
icoextract==0.1.4
idna==3.7
imageio==2.35.0
input-remapper==2.0.1
insightface==0.7.3
jaraco.classes==3.3.0
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
joystickwake==0.4.2
jsonschema==4.19.1
jsonschema-specifications==2023.11.2
keyring==24.3.1
kiwisolver==1.4.5
kornia==0.7.3
kornia_rs==0.1.5
langtable==0.0.68
lazy_loader==0.4
libdnf5==5.1.17
libvirt-python==10.1.0
llfuse==1.5.0
llvmlite==0.43.0
louis==3.28.0
lutris==0.5.17
lxml==5.1.0
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.8.0
matrix-client==0.4.0
mdurl==0.1.2
moddb==0.11.0
more-itertools==10.1.0
mpmath==1.3.0
msgpack==1.0.6
multidict==6.0.5
mutagen==1.47.0
mypy-extensions==1.0.0
networkx==3.3
nftables==0.1
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
olefile==0.47
omegaconf==2.3.0
onnx==1.16.2
onnxruntime==1.18.1
open-fprintd==0.6
opencv-python==4.10.0.84
opencv-python-headless==4.10.0.84
ordered-set==4.1.0
packaging==23.2
pandas==2.2.2
pathspec==0.12.1
pefile==2023.2.7
pexpect==4.9.0
piexif==1.1.3
pillow==10.3.0
pixeloe==0.0.10
platformdirs==3.11.0
ply==3.11
podman-compose==1.2.0
pooch==1.8.2
prettytable==3.11.0
protobuf==5.27.3
protontricks==1.11.1
psutil==5.9.8
ptyprocess==0.7.0
py-cpuinfo==9.0.0
pycairo==1.25.1
pyclip==0.7.0
pycparser==2.20
pycryptodomex==3.20.0
pycups==2.0.4
pydantic==2.8.2
pydantic_core==2.20.1
pydbus==0.6.0
pyenchant==3.2.2
pygdbmi==0.11.0.0
PyGithub==2.3.0
Pygments==2.17.2
PyGObject==3.48.2
PyJWT==2.9.0
PyMatting==1.1.12
PyNaCl==1.5.0
pynvml==11.5.3
pyparsing==3.1.2
pypng==0.20220715.0
pypresence==4.3.0
pyscard==2.0.5
PySocks==1.7.1
python-dateutil==2.8.2
python-dotenv==1.0.1
python-pidfile==3.0.0
python-slugify==8.0.4
python-validity==0.14
python-xlib==0.33
pytorch-triton-rocm==3.0.0+21eae954ef
pytz==2024.1
pyudev==0.24.1
pyusb==1.2.1
pyxdg==0.27
PyYAML==6.0.1
qrcode==7.4.2
ranger-fm==1.9.3
referencing==0.31.1
regex==2024.4.16
rembg==2.0.58
repath==0.9.0
requests==2.31.0
requests-file==2.0.0
resolvelib==1.0.1
rich==13.7.0
rpds-py==0.18.1
rpm==4.19.1.1
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.7
s3transfer==0.10.2
safetensors==0.4.4
scikit-image==0.24.0
scikit-learn==1.5.1
scipy==1.14.0
seaborn==0.13.2
SecretStorage==3.3.3
segment-anything==1.0
selinux @ file:///builddir/build/BUILD/libselinux-3.6/src
sentencepiece==0.2.0
sentry-sdk==2.11.0
sepolicy @ file:///builddir/build/BUILD/selinux-3.6/python/sepolicy
setools==4.5.1
setproctitle==1.2.3
setroubleshoot @ file:///builddir/build/BUILD/setroubleshoot-3.3.33/src
setuptools==69.0.3
shtab==1.6.1
simpleaudio==1.0.4
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sos==4.7.2
soundfile==0.12.1
soupsieve==2.5
spandrel==0.3.4
starlette==0.37.2
sympy==1.13.1
systemd-python==235
termcolor==2.3.0
text-unidecode==1.3
threadpoolctl==3.5.0
tifffile==2024.8.10
timm==1.0.8
tldextract==3.5.0
tldr==3.3.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.3.0+rocm6.0
torchaudio==2.3.0+rocm6.0
torchsde==0.2.6
torchvision==0.18.0+rocm6.0
tqdm==4.66.5
trampoline==0.1.2
transformers==4.44.0
transparent-background==1.3.1
trash-cli==0.22.10.20
triton==3.0.0
typer==0.9.0
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.2
tzdata==2024.1
ublue_update==1.0.0
udica==0.2.8
ultralytics==8.2.77
ultralytics-thop==2.0.0
umu-launcher==0.0.1
urllib3==1.26.19
uvicorn==0.30.6
uvloop==0.19.0
vdf==3.4
watchdog==4.0.2
watchfiles==0.23.0
wcwidth==0.2.13
websocket-client==1.3.3
websockets==12.0
wget==3.2
wrapt==1.16.0
yafti==0.9.0
yarl==1.9.4
yt-dlp==2024.8.1
ytmusicapi==1.3.0
yubikey-manager==5.5.0

@hartmark

Threadripper 1900 32gb ram Vega64 8gb Fedora 40 latest

Was running flux and all last week. Was slow but running. Now, I can't run anything, not even SDXL.

I always get : model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16

I tried rolling back ComfyUI git commits, but to no avail. b8ffb2 -> same dtype issue

tried flags: python main.py --listen --preview-method auto --cuda-device 1 --verbose --fp8_e4m3fn-text-enc --fp8_e4m3fn-unet --lowvram --disable-cuda-malloc --disable-smart-memory

Could it be related to python updates (torch or other), rocm updates (I rolled back to 2.3) ? Will try a few more rollback again.

+1 for adding a testing and release git branch with version revision of pip packages to install for a given version and github actions to TEST releases on a cloud before release and monitor performance. This would help spot general performance regression.
summary of logs :

pip freeze :

You can try my Docker Compose container and see how it works.
I've been exploring Stable Diffusion and it's quite fun what you can do locally.

I have posted a docker-compose recipe for getting ComfyUI easily up and running:
https://github.com/hartmark/sd-rocm

@johnr14

johnr14 commented Aug 15, 2024

@hartmark Thanks.
Reinstalled the OS, from Fedora -> Arch.

Installed docker and got it up and running. It works now that way. Too lazy to go back to Fedora to troubleshoot or convert it to podman ...

@hartmark

Glad you got it working.

@TheJoeSparks

I'm on this page because my Flux is suddenly taking 10+ minutes to render. If this matters: Flux was working fast in Pinokio-managed Comfy yesterday morning, faster than ever on my Windows RTX 4080 16 GB VRAM, all models! I was primarily using Dev.1, then I noticed my NVIDIA driver needed the update from early August. I ran that update, and now my best Purz-inspired workflow will not run: there's a low-VRAM error in the terminal, and Flux takes minutes instead of seconds to produce an image.

@kairin

kairin commented Sep 15, 2024

Are you able to share what it says in the terminal where it is running? Sometimes I see errors there and try to backtrack the changes I made.
