Flux.1 Dev, memory issue #4271

Open
iKurama opened this issue Aug 8, 2024 · 113 comments
Labels: Potential Bug (User is reporting a bug. This should be tested.)

Comments

@iKurama

iKurama commented Aug 8, 2024

Expected Behavior

I expect no issues. I had installed comfyui anew a couple days ago, no issues, 4.6 seconds per iteration~

Actual Behavior

After updating, I'm now experiencing 20 seconds per iteration.

Steps to Reproduce

Install the newest version + update the python deps.

Debug Logs

E:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
Total VRAM 12281 MB, total RAM 16296 MB
pytorch version: 2.4.0+cu118
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: E:\ComfyUI_windows_portable\ComfyUI\web
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
   0.0 seconds: E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
E:\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9763.075
 55%|█████████████████████████████████████████████                                     | 11/20 [03:36<02:56, 19.58s/it]

Other

I have no idea why it has f'd up.

iKurama added the "Potential Bug" label Aug 8, 2024
@BigBanje

BigBanje commented Aug 8, 2024

Having this issue as well. I updated yesterday, and now everything runs at half speed compared to before. I was using fp16, and even after downgrading to fp8 it's still half speed.

To be clear, my generation speed itself is still the same it/s, but the time spent loading the models is dramatically longer. My terminal shows the same output as OP's.

@JorgeR81

JorgeR81 commented Aug 8, 2024

I noticed about 50% performance degradation, with a lower-end GPU but more RAM.
I have a GTX 1070 (8 GB VRAM) and 32 GB RAM.

On a 1024 x 1024 img, I was at 30 s/it and now I'm at 45 s/it
On a 512 x 768 img, I was at 20 s/it and now I'm at 30 s/it

@iKurama
Author

iKurama commented Aug 8, 2024

After some testing, I re-downloaded Portable v0.0.2 and went back to commit f123328, and I went down to ~6.2 s/it.
Then I tried to install xformers and update Comfy + the Python deps, and I was back to 20+ s/it.

So it's pretty safe to say it's a Python issue.

@comfyanonymous
Owner

Can you update and test the latest commit to see if things are better?

@JorgeR81

JorgeR81 commented Aug 8, 2024

For me, there was no improvement.
I still have about the same generation times.

@BigBanje

BigBanje commented Aug 8, 2024

I was able to resolve my issue by uninstalling two old versions of Python (didn't realize I had them) and reinstalling Python & ComfyUI from scratch via zip extraction.

It was immediately able to do fp8 at the previous speed, but still had serious bottlenecking on fp16. I then enabled the --highvram flag and it works as fast as before (however, I did not need this flag before).
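If anyone else suspects a stale Python install is getting picked up, here is a quick sanity check of which interpreter and site-packages ComfyUI is actually using (just a sketch; run it with the same python.exe that launches ComfyUI, e.g. the one in python_embeded for the portable build):

import sys
print("interpreter   :", sys.executable)   # should point into python_embeded, not a system-wide install
print("site-packages :", [p for p in sys.path if "site-packages" in p])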

@San4itos

San4itos commented Aug 8, 2024

I use ComfyUI with ROCm.

Python version: 3.10.14
Total VRAM 16368 MB, total RAM 31713 MB
pytorch version: 2.4.0+rocm6.1
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7800 XT : native

With the latest version I have

loading in lowvram mode 9011.532500000001
20/20 [01:41<00:00,  5.07s/it]

On the commit mentioned here, f123328, I have

loading in lowvram mode 9493.052499771118
 20/20 [01:34<00:00,  4.73s/it]

I tried earlier commit eca962c which was faster for me and got

loading in lowvram mode 10977.61249923706
20/20 [01:12<00:00,  3.64s/it]

I see degradation of speed each time I update ComfyUI.

@davoodice

davoodice commented Aug 8, 2024

Same here. I noticed that the CPU and GPU are both working in a ping-pong pattern.
[three screenshots of CPU and GPU usage]

64 GB RAM
RTX 4070 Ti

Before this, it took some time for the model to load into RAM and then into VRAM, but now it is loaded into VRAM immediately.
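If it helps, a rough way to see how much actually ends up resident in VRAM around a generation (a minimal sketch, assuming a CUDA build of PyTorch; the number ComfyUI prints after "loading in lowvram mode" is its own estimate, not this measurement):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one generation here ...
print("allocated now  MB:", torch.cuda.memory_allocated() // 2**20)
print("peak allocated MB:", torch.cuda.max_memory_allocated() // 2**20)
print("reserved       MB:", torch.cuda.memory_reserved() // 2**20)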

@iKurama
Author

iKurama commented Aug 8, 2024

E:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
Total VRAM 12281 MB, total RAM 16296 MB
pytorch version: 2.4.0+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: E:\ComfyUI_windows_portable\ComfyUI\web
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
0.0 seconds: E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
E:\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9763.075
10%|████████▎ | 2/20 [00:42<06:20, 21.12s/it]

This was a fresh reinstall and a fresh pull, including the one after the "fix" push.

Gonna try BigBanje's solution next.

@btibor91

btibor91 commented Aug 8, 2024

I have experienced the same problems (extreme slowdown and very frequent OOM crashes) after the recent update. It returned to normal after rolling back to commit 1aa9cf3 (20G VRAM).

@JorgeR81

JorgeR81 commented Aug 8, 2024

I've been updating every day, but I'm not sure when my performance degradation started.
I know I was fine on August 3, because I posted some data here about performance.
I was probably fine 2 or 3 days after that, but then I stopped paying attention to times, until I saw this issue.
I was trying different settings and resolutions so times were always different ...

@iKurama
Author

iKurama commented Aug 8, 2024

BigBanje's solution did not work.

E:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --lowvram
Total VRAM 12281 MB, total RAM 16296 MB
pytorch version: 2.4.0+cu121
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: E:\ComfyUI_windows_portable\ComfyUI\web
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
0.0 seconds: E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9823.2
0%| | 0/20 [00:00<?, ?it/s]E:\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
5%|████▏ | 1/20 [00:20<06:25, 20.27s/it]

///////////

Loading on commit 1aa9cf3 gives me this error:

When loading the graph, the following node types were not found:

FluxGuidance
ModelSamplingFlux

Nodes that have failed to load will show as red on the graph.

///

I do not have any issue whatsoever with my CPU, like others do.
I went back to f123328, but I essentially go OOM, except it just freezes instead.

E:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --lowvram
Total VRAM 12281 MB, total RAM 16296 MB
pytorch version: 2.4.0+cu121
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 Ti : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: E:\ComfyUI_windows_portable\ComfyUI\web
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
0.0 seconds: E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model_type FLUX
E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9823.199999809265
0%| | 0/20 [00:00<?, ?it/s]E:\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
5%|████▏ | 1/20 [00:52<16:39, 52.59s/it]

[screenshot]

@iKurama
Author

iKurama commented Aug 8, 2024

Gonna try to load up v0.0.2 again, on f123328, and see if I can replicate my own solution, because I suspect Python (PyTorch 2.4) is the issue.

@comfyanonymous
Owner

Can you all post your full console output if you have not already?

@iKurama
Author

iKurama commented Aug 8, 2024

It's the full console, sadly

@iKurama
Author

iKurama commented Aug 8, 2024

I'll even screenshot it, given how little there is.

@BigBanje

BigBanje commented Aug 8, 2024

I will disable --highvram and post my fp16 output, one moment.

@davoodice

davoodice commented Aug 8, 2024

got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLOW
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Model doesn't have a device attribute.
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loading in lowvram mode 9725.074980926514
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:22<00:00, 5.63s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 38.51 seconds
[screenshot of GPU usage]

The GPU can't stay at 100%.

@BigBanje

BigBanje commented Aug 8, 2024

fp16 without --highvram

C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
Total VRAM 24575 MB, total RAM 16268 MB
pytorch version: 2.4.0+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\ComfyUI\web
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
0.0 seconds: C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e5m2, manual cast: torch.bfloat16
model_type FLUX
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
100%|████████| 20/20 [00:27<00:00, 1.36s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 199.66 seconds

fp16 with --highvram (only pasting from "Starting server" and below as the rest was identical)

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e5m2, manual cast: torch.bfloat16
model_type FLUX
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
C:\Users\Grayscale\Documents\ComfyUI\ComfyUI_windows_portable\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load Flux
Loading 1 new model
100%|███████████| 20/20 [00:26<00:00, 1.32s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 128.00 seconds

I know I'm on a low-RAM, high-VRAM machine, so --highvram fixing my issue isn't too surprising... but I didn't need to do that before. Not sure what the "1Torch" warning is about; I seem to get it no matter what I do since an update a few days ago.
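On the "1Torch was not compiled with flash attention" warning, a quick way to check which scaled-dot-product-attention backends the installed wheel will even consider (a minimal sketch; as far as I can tell, some Windows wheels simply ship without the flash kernels, so PyTorch falls back to the other backends and prints that warning):

import torch
print("torch", torch.__version__, "| cuda", torch.version.cuda)
# which SDPA backends this build is willing to use (enabled is not the same as compiled in,
# but a wheel built without flash attention will warn once and fall back)
print("flash        :", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math         :", torch.backends.cuda.math_sdp_enabled())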

@JorgeR81

JorgeR81 commented Aug 8, 2024

Here is my full console output, after opening ComfyUI and doing a few Flux generations.

I'm at 512 x 768 resolution, with the fp8 versions, at ~30 s/it.
But on August 3, I was at ~20 s/it.

console output
C:\Cui\cu_121_2\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --force-fp16 --windows-standalone-build
[START] Security scan
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2024-08-08 20:23:12.203838
** Platform: Windows
** Python version: 3.11.6 (tags/v3.11.6:8b6ee5b, Oct  2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]
** Python executable: C:\Cui\cu_121_2\ComfyUI_windows_portable\python_embeded\python.exe
** ComfyUI Path: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI
** Log path: C:\Cui\cu_121_2\ComfyUI_windows_portable\comfyui.log

Prestartup times for custom nodes:
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\rgthree-comfy
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Marigold
   2.1 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager

Total VRAM 8192 MB, total RAM 32727 MB
pytorch version: 2.1.0+cu121
Forcing FP16.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce GTX 1070 : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\web
Adding extra search path checkpoints d:/ComfyUI/models/checkpoints/
Adding extra search path clip d:/ComfyUI/models/clip/
Adding extra search path clip_vision d:/ComfyUI/models/clip_vision/
Adding extra search path configs d:/ComfyUI/models/configs/
Adding extra search path controlnet d:/ComfyUI/models/controlnet/
Adding extra search path embeddings d:/ComfyUI/models/embeddings/
Adding extra search path loras d:/ComfyUI/models/loras/
Adding extra search path unet d:/ComfyUI/models/unet/
Adding extra search path upscale_models d:/ComfyUI/models/upscale_models/
Adding extra search path vae d:/ComfyUI/models/vae/
[ComfyUI- ] Loaded all nodes and apis.
### Loading: ComfyUI-Impact-Pack (V6.1)
### Loading: ComfyUI-Impact-Pack (Subpack: V0.6)
[Impact Pack] Wildcards loading done.
### Loading: ComfyUI-Inspire-Pack (V0.83)
### Loading: ComfyUI-Manager (V2.48.6)
### ComfyUI Revision: 2492 [66d42332] | Released on '2024-08-08'
C:\Cui\cu_121_2\ComfyUI_windows_portable\python_embeded\Lib\site-packages\diffusers\models\transformers\transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
Total VRAM 8192 MB, total RAM 32727 MB
pytorch version: 2.1.0+cu121
Forcing FP16.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce GTX 1070 : cudaMallocAsync
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
[ReActor] - STATUS - Running v0.5.1-a6 in ComfyUI
C:\Cui\cu_121_2\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torchvision\transforms\functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
Torch version: 2.1.0+cu121
[comfyui_controlnet_aux] | INFO -> Using ckpts path: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\comfyui_controlnet_aux\ckpts
[comfyui_controlnet_aux] | INFO -> Using symlinks: False
[comfyui_controlnet_aux] | INFO -> Using ort providers: ['CUDAExecutionProvider', 'DirectMLExecutionProvider', 'OpenVINOExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider', 'CoreMLExecutionProvider']
DWPose: Onnxruntime with acceleration providers detected
Please 'pip install xformers'
Nvidia APEX normalization not installed, using PyTorch LayerNorm

[rgthree] Loaded 39 extraordinary nodes.

WAS Node Suite: BlenderNeko's Advanced CLIP Text Encode found, attempting to enable `CLIPTextEncode` support.
WAS Node Suite: `CLIPTextEncode (BlenderNeko Advanced + NSP)` node enabled under `WAS Suite/Conditioning` menu.
WAS Node Suite: OpenCV Python FFMPEG support is enabled
WAS Node Suite Warning: `ffmpeg_bin_path` is not set in `C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\was-node-suite-comfyui\was_suite_config.json` config file. Will attempt to use system ffmpeg binaries if available.
WAS Node Suite: Finished. Loaded 219 nodes successfully.

        "Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work." - Steve Jobs


Import times for custom nodes:
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_Noise
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\cg-use-everywhere
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_Cutoff
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_ADV_CLIP_emb
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_TiledKSampler
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_InstantID
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-AutomaticCFG
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_IPAdapter_plus
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Custom-Scripts
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_UltimateSDUpscale
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\rgthree-comfy
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-0246
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_essentials
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\aegisflow_utility_nodes
   0.0 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\comfyui_controlnet_aux
   0.1 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Inspire-Pack
   0.2 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_smZNodes
   0.2 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Marigold
   0.3 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\PuLID_ComfyUI
   0.3 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager
   0.3 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI_FaceAnalysis
   0.6 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\comfyui-reactor-node
   0.6 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Impact-Pack
   2.1 seconds: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\was-node-suite-comfyui

Starting server

To see the GUI go to: http://127.0.0.1:8188
FETCH DATA from: C:\Cui\cu_121_2\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-Manager\extension-node-map.json [DONE]
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type FLUX
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loading in lowvram mode 5938.199950408935
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [06:05<00:00, 30.50s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 492.76 seconds
got prompt
loading in lowvram mode 5928.199950408935
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [06:00<00:00, 30.01s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 363.30 seconds
got prompt
loading in lowvram mode 5928.199950408935
100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [06:01<00:00, 30.11s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 364.53 seconds
got prompt
loading in lowvram mode 5928.199935150146
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [08:00<00:00, 30.00s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 483.37 seconds

@iKurama
Author

iKurama commented Aug 8, 2024

Gonna try to load up v0.0.2 again, on f123328, and see if I can replicate my own solution, because I suspect Python (PyTorch 2.4) is the issue.

And naturally, a screenshot as promised
[screenshot of the console output]

@Bortus-AI

Bortus-AI commented Aug 8, 2024

Same issue here. Was working great but something updated and now it takes 10x longer and goes into lowvram mode. Still trying to figure out what changed

@comfyanonymous
Owner

Can you test if the latest commit improves things?

@davoodice

davoodice commented Aug 8, 2024

[SKIP] Downgrading pip package isn't allowed: transformers (cur=4.38.2)
[SKIP] Downgrading pip package isn't allowed: tokenizers (cur=0.15.2)
[SKIP] Downgrading pip package isn't allowed: safetensors (cur=0.4.3)
[SKIP] Downgrading pip package isn't allowed: kornia (cur=0.7.2)

Anyway, I think things got worse.

[screenshot of GPU usage]

see gpu usage

@JorgeR81

JorgeR81 commented Aug 8, 2024

No improvement for me.
I still have the same times.

got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type FLUX
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loaded in lowvram mode 5838.19995803833
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [05:10<00:00, 31.01s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 434.71 seconds

@davoodice

No improvement for me. I still have the same times.

got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type FLUX
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loaded in lowvram mode 5838.19995803833
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [05:10<00:00, 31.01s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 434.71 seconds

Can you share your GPU graph while rendering?

@Bortus-AI

Bortus-AI commented Aug 8, 2024

Can you test if the latest commit improves things?

Massive improvement. I also nuked and redid ComfyUI from source instead of portable with the latest commits and it took 15 seconds vs 2.5 minutes for 10 steps.

It still goes into lowvram mode when it shouldn't

@JorgeR81

JorgeR81 commented Aug 8, 2024

Can you share your GPU graph while rendering?

[screenshot]

@davoodice

Are you sure you can use Flux at full speed with 8 GB of VRAM? I think it's impossible.

@Kinglord

Kinglord commented Aug 12, 2024

I don't want to open a new issue for this but maybe I should - I just want @comfyanonymous to know this is far from an issue only affecting Flux models. My ComfyUI was working perfectly fine, but I decided to update today to try out some Flux models (it had been about a week since I upgraded) and now ComfyUI is all but unusable on my standard SDXL workflows.

My iterations per second have dropped to about 1/5th of what they were before the update, and whenever I run batches I now OOM and crash completely on any batch >10. I am also used to being able to do things that are not graphics-intensive when using ComfyUI - like web browsing, Discord, etc. - but now ComfyUI murders even my system FPS down to around 5. It doesn't appear to really be using my GPU, but it slowly eats up all my system RAM until it OOMs (it's happy to try and eat 60 GB of system RAM!).

I don't have to use Flux for anything, thankfully, so I'm going to just downgrade back to a version that worked OK. I'm just a bit shocked there isn't more noise about this outside of Flux but I guess everyone is just trying Flux these days. 😀

FWIW I don't know where things went sideways for me, but I tested and can confirm that the tagged version in the 0.0.4 standalone release works without any issues (just the code; I'm not messing with the actual Python libs, etc.).

Edit: I don't think it matters at all but just in case

  • Standalone install
  • 64GB system ram
  • 4080 16GB Vram
  • v0.0.2-162-gce37c11

@comfyanonymous
Owner

If you still have issues on the latest master I need:

Your system specs.

Your exact workflow.

@Kinglord

Kinglord commented Aug 12, 2024

If you still have issues on the latest master I need:

Your system specs.
Attached as dxdiag
Your exact workflow.
Attached as json

I'm running the Python & libs from the 0.0.3 standalone install, but if there's any particular version info you need for any packages or anything else I haven't included, just let me know. The workflow is one I use to test prompts against a collection of checkpoints I have, which seems to cause a good chunk of the issues (obviously, since it doesn't OOM until around 10 in).

DxDiag.txt
PonyChkpntTest.json

@comfyanonymous
Owner

Can you try without custom nodes to see if it's a core problem: --disable-all-custom-nodes

@Kinglord

Kinglord commented Aug 12, 2024

Alright, sorry for the late reply but I tried to test things as extensively as I could here. tl;dr is that it appears to be a core problem, with no custom nodes running (using the flag you provided) I still run into this issue with massive slowdowns, OOMs, crashes, etc.

I was able to troubleshoot and find what appears to be the smoking gun on the issue, which is the LoRA loaders. The normal Lora Loader using CLIP seems to be worse than the Model Only Lora Loader, but they both eventually lead to the same OOM errors, GPU crashing, etc.

If there's any more data I can provide to help here just let me know. The console window usually dies with these crashes but if there's a log I can provide happy to do so if it would help at all.

These tests were done on the v0.0.2-163-gb8ffb29 tag (release 0.0.7). This issue isn't present on release 0.0.4. I had no intention of hijacking this thread; if you'd like me to open a new ticket, since this appears related to LoRAs and not the base checkpoint flows, I'm happy to do so.

@MDMAchine

MDMAchine commented Aug 12, 2024

Updated to b8ffb29 and disabled all custom nodes. I would normally expect this to run in about 19-25 seconds. On the bright side, it appears the VRAM releases the memory upon completion.

Made a basic workflow, with no loras or anything extra.
Checkpoint loader > speedFP8_e5m2.safetensors
DualCLIPLoader> t5xxl_fp3_e4m3fn.safetensors & clip_l.safetensors
Sampler: euler
Scheduler: beta
Steps: 4

System info:
Total VRAM 8192 MB, total RAM 48735 MB
pytorch version: 2.2.2+cu121
xformers version: 0.0.25.post1
Set vram state to: NORMAL_VRAM
Disabling smart memory management
Device: cuda:0 NVIDIA GeForce RTX 3060 Ti : cudaMallocAsync
Using xformers cross attention

Run info:
Requested to load Flux
Loading 1 new model
loaded partially 5831.075 0
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:53<00:00, 13.40s/it]
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 59.79 seconds

DxDiag.txt
workflow.json

@hypervoxel

hypervoxel commented Aug 13, 2024

RTX 4090, 128 GB RAM, Windows 10. Brand new install of ComfyUI portable, latest Flux; it took 4 minutes to generate the demo image (the anime girl with bunny ears) at 11.13 s/it using the dev model. Feels very slow. It seems to always load in low-VRAM mode (that is not correct, right?).

@bryancpe

Use the package linked on here: https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.0.4, it has pytorch 2.3.1. Everything returned to normal

Thanks! You are my hero! I updated everything and it was almost a 10x slowdown. I uninstalled pytorch and then reinstalled 2.3.1.
Here is my command sequence:

pip3 uninstall torch
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118

Before:
100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [06:56<00:00, 104.02s/it]
After:
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:10<00:00, 17.73s/it]

@hypervoxel

Use the package linked on here: https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.0.4, it has pytorch 2.3.1. Everything returned to normal

Thanks! You are my hero! I updated everything and it was almost a 10x slowdown. I uninstalled pytorch and then reinstalled 2.3.1. Here is my command sequence:

pip3 uninstall torch pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118

Before: 100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [06:56<00:00, 104.02s/it] After: 100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:10<00:00, 17.73s/it]

Thank you. I downloaded the new portable version. It does not seem any faster. Still runs in low vram mode
I did get this error " UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)"

@bryancpe

bryancpe commented Aug 13, 2024


I didn't use that portable installer, but that note clued me into torch being an issue.
Did you first uninstall torch?
pip3 uninstall torch

then:
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
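After the reinstall, a quick check that the expected build actually got picked up (a minimal sketch; run it with the same interpreter ComfyUI uses):

import torch
print("torch    :", torch.__version__)   # should report 2.3.1+cu118 after the commands above
print("cuda     :", torch.version.cuda)
print("cuda ok  :", torch.cuda.is_available())
print("device   :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")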

@comfyanonymous
Owner

Can you print the full log and tell me exactly which portable version you downloaded (there are 3 of them).

@hypervoxel


I think pytorch is included with the portable version? The log states pytorch version: 2.3.1+cu121

@hypervoxel

hypervoxel commented Aug 13, 2024

Can you print the full log and tell me exactly which portable version you downloaded (there are 3 of them).

I've tried two portables so far, though they might actually have been the same

https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.0.4

I just set the weight type to fp8 and it's so much faster (like 50x, though now it's not fp16, right?).

Will continue to test. Next time I restart Comfy I will upload the log.

@Creepybits

I willingly admit that I don't know much about these things. But Forge has released an update so it's possible to run the nf4 model there, and I tried it for a bit.

I seem to get the same or similar issues on Forge that many seem to get on Comfy. It acted as if it just continued to eat memory nonstop and never cleaned up.

I offloaded to my system page file, so together with my GPU I had 60-70 GB of memory. I could generate maybe 3-4 images, and then I got an error saying that CUDA ran out of memory, and I had to restart my system.

Then I could generate another 3-4 images, and then had to restart.

That's how it went on.

My thought is that maybe the issue is with BitsAndBytes? Forge pretty much copied that part from Comfy. And since the issues seem similar, it kind of makes sense. To me, at least.

@comfyanonymous
Owner

If anyone has issues, make sure you run update/update_comfyui.bat to update ComfyUI first.

@MDMAchine

MDMAchine commented Aug 13, 2024

Updated Observations with ComfyUI (2527[b8ffb2]):

After further testing, I've found that loading the flux model (either the 16-bit flux1_schnell UNET or speedFP8_e5m2 checkpoint) using the Load Diffusion Model node, and keeping the weight type at its default, resolves the issue. Speeds are similar for both models.

However, switching the weight type for either of these models to any of the FP8 options results in a significant slowdown—processing time increases from 19 seconds to 54 seconds per 4 steps.

Additionally, if I load the FP8 model (speedFP8_e5m2) in the Checkpoint Loader node, it also becomes 3x slower. I haven't yet tested a non-FP8 flux model since the only versions I've downloaded are FP8 and NF4, besides the UNETs.

VRAM releases after generation.

The latest CheckpointLoaderNF4 node also releases VRAM, but:

Using NF4 node:
Res 1024x1024 - Any batch size over 1 results in an OOM error.
Generating a larger image (1080x1920) also results in an OOM error.

Load Diffusion Model (default weights):
Res 1024x1024 - Batch of 4 @ 6 steps: 1:29
Res 1024x1024 - Single image @ 4 steps: 0:25
Res 1080x1920 @ 4 steps: 0:32
Batch of 2 Res 1080x1920 @ 4 steps: 0:54

Basic SDXL workflow seems to be functioning well. I’ll continue testing with a562c1 to see how it performs.
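For what it's worth, the fp8 slowdown lines up with the "model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16" line in the logs: as far as I understand it, weights stored as float8 get upcast to bf16 at compute time. A rough, self-contained illustration of that cast overhead (just a sketch; it needs a PyTorch recent enough to have float8 dtypes, and the sizes and timings are only illustrative, not ComfyUI's actual code path):

import time
import torch

dev = "cuda"
w_bf16 = torch.randn(4096, 4096, device=dev, dtype=torch.bfloat16)
w_fp8 = w_bf16.to(torch.float8_e4m3fn)   # fp8 storage, half the memory of bf16
x = torch.randn(4096, 4096, device=dev, dtype=torch.bfloat16)

def bench(fn, n=50):
    torch.cuda.synchronize()
    t = time.time()
    for _ in range(n):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t) / n

native = bench(lambda: x @ w_bf16)                    # weights already in the compute dtype
casted = bench(lambda: x @ w_fp8.to(torch.bfloat16))  # upcast on every use ("manual cast")
print(f"bf16 weights: {native * 1000:.2f} ms per matmul")
print(f"fp8 -> bf16 : {casted * 1000:.2f} ms per matmul")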

@MDMAchine

No change on a562c1 for the results of the NF4 node or using FP8 weights.

@iKurama
Author

iKurama commented Aug 13, 2024

Did some misc stuff on and off, running backups and such. Still on f123328.

I went with

pip install torch torchvision torchaudio xformers --extra-index-url https://download.pytorch.org/whl/cu121

And now I'm back to around 5.3s/it, in both lowvram & normalvram mode - it no longer forces me to use lowvram. I'll take the loss of 0.6s/it and be happy with it as is.

@Thireus

Thireus commented Aug 13, 2024

If anyone has issues, make sure you run update/update_comfyui.bat to update ComfyUI first.

Isn't it the same as running "Update ComfyUI" or "Update All" through the manager and restarting?

@ltdrdata
Collaborator

FYI, the ComfyUI update in ComfyUI-Manager does not perform a torch update for safety reasons.

@benzstation

I have the same issue (~20s/it), when using an all-in-one flux model (flux1.dev fp8 model with fp8 e4m3fn weights and t5xxl fp8 e4m3fn and vae baked in) on regular checkpoint loaders.

Although, it works flawlessly when using the original flux model (flux1.dev fp8 with default weights, clip_l and t5xxl fp8 e4m3fn clips loaded separately, and vae loaded separately) on the unet loader node (~2s/it):
[screenshot]

When using the regular loader, I see this in the console:
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16

When using the unet loader, I see this instead:
model weight dtype torch.bfloat16, manual cast: None

Why is the regular checkpoint loader forcing manual cast to use torch.bfloat16? The model is built with fp8 e4m3fn weights.

Happy to share more details.
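One way to compare the two files is to read the tensor dtypes straight out of the safetensors header, without loading any weights (a minimal sketch; the filename is just a placeholder for whichever checkpoint you want to inspect):

import json
import struct
from collections import Counter

path = "flux1-dev-fp8-all-in-one.safetensors"   # placeholder filename
with open(path, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]   # safetensors files start with an 8-byte header size
    header = json.loads(f.read(header_len))

dtypes = Counter(v["dtype"] for k, v in header.items() if k != "__metadata__")
print(dtypes)   # e.g. counts of F8_E4M3 vs BF16 / F16 tensors

If both files really store the UNet weights as F8_E4M3, the difference presumably comes from how the two loaders pick the compute dtype rather than from the weights themselves.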

@Thireus

Thireus commented Aug 15, 2024

Thanks @ltdrdata!

--

FYI, with the recent update that disables cuda malloc by default, I've had to add --cuda-malloc back because performance with Flux was just terrible without it.

@johnr14

johnr14 commented Aug 15, 2024

Threadripper 1900
32 GB RAM
Vega 64 8 GB
Fedora 40 latest

Was running flux and all last week. Was slow but running.
Now, I can't run anything, not even SDXL.

I always get :
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16

I tried rolling back ComfyUI git commits, but to no avail.
b8ffb2 -> same dtype issue

tried flags:
python main.py --listen --preview-method auto --cuda-device 1 --verbose --fp8_e4m3fn-text-enc --fp8_e4m3fn-unet --lowvram --disable-cuda-malloc --disable-smart-memory

Could it be related to Python updates (torch or other) or ROCm updates (I rolled back to 2.3)?
Will try a few more rollbacks.

+1 for adding testing and release git branches, with pinned pip package versions for a given release, and GitHub Actions to TEST releases on a cloud runner before release and monitor performance. This would help spot general performance regressions.

Summary of logs:
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2024-08-15 08:22:49.352379
** Platform: Linux
** Python version: 3.12.4 (main, Jun  7 2024, 00:00:00) [GCC 14.1.1 20240607 (Red Hat 14.1.1-5)]
** Python executable: /usr/bin/python
** ComfyUI Path: ~/github/ComfyUI
** Log path: ~/github/ComfyUI/comfyui.log

Prestartup times for custom nodes:
   2.8 seconds: ~/github/ComfyUI/custom_nodes/ComfyUI-Manager

Set cuda device to: 1
Total VRAM 8176 MB, total RAM 31960 MB
pytorch version: 2.3.0+rocm6.0
Set vram state to: LOW_VRAM
Device: cuda:0 AMD Radeon RX Vega : native
Using sub quadratic optimization for cross attention, if you have memory or speed issues try using: --use-split-cross-attention
Using selector: EpollSelector
### ComfyUI Revision: 2542 [0f9c2a78] | Released on '2024-08-14'
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
adm 0
~/.local/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Model doesn't have a device attribute.
CLIP model load device: cpu, offload device: cpu
clip unexpected: ['encoder.embed_tokens.weight']
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
pip freeze:
aiohappyeyeballs==2.3.5
aiohttp==3.10.3
aiosignal==1.3.1
albucore==0.0.13
albumentations==1.4.13
annotated-types==0.7.0
ansible==9.8.0
ansible-core==2.16.9
antlr4-python3-runtime==4.9.3
anyio==4.4.0
appdirs==1.4.4
argcomplete==3.3.0
arrow==1.3.0
attrs==23.2.0
beautifulsoup4==4.12.3
binaryornot==0.4.4
bitsandbytes==0.43.3
black==24.8.0
borgbackup==1.2.8
borgmatic==1.8.13
boto3==1.34.153
botocore==1.34.153
Brlapi==0.8.5
Brotli==1.1.0
certifi==2023.5.7
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cockpit @ file:///builddir/build/BUILD/cockpit-322/tmp/wheel/cockpit-322-py3-none-any.whl#sha256=5587d8c988d8b9ebe77b0cca42347ff1cf40338aac6721b9251be1479e6ca19c
colorama==0.4.6
coloredlogs==15.0.1
colour-science==0.4.4
configobj==5.0.8
contourpy==1.2.0
cookiecutter==2.6.0
cryptography==41.0.7
cupshelpers==1.0
cycler==0.11.0
Cython==3.0.11
dasbus==1.7
dbus-python==1.3.2
deepdiff==7.0.1
Deprecated==1.2.14
discover-overlay==0.7.2
distro==1.9.0
dnspython==2.6.1
easydict==1.13
einops==0.8.0
email_validator==2.2.0
eval_type_backport==0.2.0
evdev==1.6.1
fastapi==0.112.0
fedora-third-party==0.10
fido2==1.1.2
file-magic==0.4.0
filelock==3.13.1
flatbuffers==24.3.25
flet==0.23.2
flet-core==0.23.2
flet-runtime==0.23.2
fonttools==4.50.0
fros==1.1
frozenlist==1.4.1
fs==2.4.16
fsspec==2024.6.1
gbinder-python==1.1.2
gbulb==0.6.4
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.24.5
humanfriendly==10.0
icoextract==0.1.4
idna==3.7
imageio==2.35.0
input-remapper==2.0.1
insightface==0.7.3
jaraco.classes==3.3.0
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
joystickwake==0.4.2
jsonschema==4.19.1
jsonschema-specifications==2023.11.2
keyring==24.3.1
kiwisolver==1.4.5
kornia==0.7.3
kornia_rs==0.1.5
langtable==0.0.68
lazy_loader==0.4
libdnf5==5.1.17
libvirt-python==10.1.0
llfuse==1.5.0
llvmlite==0.43.0
louis==3.28.0
lutris==0.5.17
lxml==5.1.0
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.8.0
matrix-client==0.4.0
mdurl==0.1.2
moddb==0.11.0
more-itertools==10.1.0
mpmath==1.3.0
msgpack==1.0.6
multidict==6.0.5
mutagen==1.47.0
mypy-extensions==1.0.0
networkx==3.3
nftables==0.1
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
olefile==0.47
omegaconf==2.3.0
onnx==1.16.2
onnxruntime==1.18.1
open-fprintd==0.6
opencv-python==4.10.0.84
opencv-python-headless==4.10.0.84
ordered-set==4.1.0
packaging==23.2
pandas==2.2.2
pathspec==0.12.1
pefile==2023.2.7
pexpect==4.9.0
piexif==1.1.3
pillow==10.3.0
pixeloe==0.0.10
platformdirs==3.11.0
ply==3.11
podman-compose==1.2.0
pooch==1.8.2
prettytable==3.11.0
protobuf==5.27.3
protontricks==1.11.1
psutil==5.9.8
ptyprocess==0.7.0
py-cpuinfo==9.0.0
pycairo==1.25.1
pyclip==0.7.0
pycparser==2.20
pycryptodomex==3.20.0
pycups==2.0.4
pydantic==2.8.2
pydantic_core==2.20.1
pydbus==0.6.0
pyenchant==3.2.2
pygdbmi==0.11.0.0
PyGithub==2.3.0
Pygments==2.17.2
PyGObject==3.48.2
PyJWT==2.9.0
PyMatting==1.1.12
PyNaCl==1.5.0
pynvml==11.5.3
pyparsing==3.1.2
pypng==0.20220715.0
pypresence==4.3.0
pyscard==2.0.5
PySocks==1.7.1
python-dateutil==2.8.2
python-dotenv==1.0.1
python-pidfile==3.0.0
python-slugify==8.0.4
python-validity==0.14
python-xlib==0.33
pytorch-triton-rocm==3.0.0+21eae954ef
pytz==2024.1
pyudev==0.24.1
pyusb==1.2.1
pyxdg==0.27
PyYAML==6.0.1
qrcode==7.4.2
ranger-fm==1.9.3
referencing==0.31.1
regex==2024.4.16
rembg==2.0.58
repath==0.9.0
requests==2.31.0
requests-file==2.0.0
resolvelib==1.0.1
rich==13.7.0
rpds-py==0.18.1
rpm==4.19.1.1
ruamel.yaml==0.18.5
ruamel.yaml.clib==0.2.7
s3transfer==0.10.2
safetensors==0.4.4
scikit-image==0.24.0
scikit-learn==1.5.1
scipy==1.14.0
seaborn==0.13.2
SecretStorage==3.3.3
segment-anything==1.0
selinux @ file:///builddir/build/BUILD/libselinux-3.6/src
sentencepiece==0.2.0
sentry-sdk==2.11.0
sepolicy @ file:///builddir/build/BUILD/selinux-3.6/python/sepolicy
setools==4.5.1
setproctitle==1.2.3
setroubleshoot @ file:///builddir/build/BUILD/setroubleshoot-3.3.33/src
setuptools==69.0.3
shtab==1.6.1
simpleaudio==1.0.4
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sos==4.7.2
soundfile==0.12.1
soupsieve==2.5
spandrel==0.3.4
starlette==0.37.2
sympy==1.13.1
systemd-python==235
termcolor==2.3.0
text-unidecode==1.3
threadpoolctl==3.5.0
tifffile==2024.8.10
timm==1.0.8
tldextract==3.5.0
tldr==3.3.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.3.0+rocm6.0
torchaudio==2.3.0+rocm6.0
torchsde==0.2.6
torchvision==0.18.0+rocm6.0
tqdm==4.66.5
trampoline==0.1.2
transformers==4.44.0
transparent-background==1.3.1
trash-cli==0.22.10.20
triton==3.0.0
typer==0.9.0
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.2
tzdata==2024.1
ublue_update==1.0.0
udica==0.2.8
ultralytics==8.2.77
ultralytics-thop==2.0.0
umu-launcher==0.0.1
urllib3==1.26.19
uvicorn==0.30.6
uvloop==0.19.0
vdf==3.4
watchdog==4.0.2
watchfiles==0.23.0
wcwidth==0.2.13
websocket-client==1.3.3
websockets==12.0
wget==3.2
wrapt==1.16.0
yafti==0.9.0
yarl==1.9.4
yt-dlp==2024.8.1
ytmusicapi==1.3.0
yubikey-manager==5.5.0

@hartmark

Threadripper 1900 32gb ram Vega64 8gb Fedora 40 latest

Was running flux and all last week. Was slow but running. Now, I can't run anything, not even SDXL.

I always get : model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16

I tried rolling back ComfyUI git commits, but to no avail. b8ffb2 -> same dtype issue

tried flags: python main.py --listen --preview-method auto --cuda-device 1 --verbose --fp8_e4m3fn-text-enc --fp8_e4m3fn-unet --lowvram --disable-cuda-malloc --disable-smart-memory

Could it be related to python updates (torch or other), rocm updates (I rolled back to 2.3) ? Will try a few more rollback again.

+1 for adding a testing and release git branch with version revision of pip packages to install for a given version and github actions to TEST releases on a cloud before release and monitor performance. This would help spot general performance regression.
summary of logs :

pip freeze :

You can try my Docker Compose container and see how it works.
I've been exploring Stable Diffusion and it's quite fun what you can do locally.

I have posted a docker-compose recipe for getting ComfyUI easily up and running:
https://github.com/hartmark/sd-rocm

@johnr14

johnr14 commented Aug 15, 2024

@hartmark Thanks.
Reinstalled the OS, from Fedora -> Arch.

Installed docker and got it up and running. It works now that way. Too lazy to go back to Fedora to troubleshoot or convert it to podman ...

@hartmark

Glad you got it working.

@TheJoeSparks

I'm on this page because my Flux is suddenly taking 10+ minutes to render. If this matters: Flux was working fast in Pinokio-managed Comfy yesterday morning, faster than ever on my Windows RTX 4080 16 GB VRAM, all models! I was primarily using Dev.1, then I noticed my NVIDIA driver needed the update from early August. I ran that update, and now my best Purz-inspired workflow will not run: there's a low-VRAM error in the terminal, and Flux takes minutes instead of seconds to produce an image.

@kairin

kairin commented Sep 15, 2024

Are you able to share what it says in the terminal where it is running? Sometimes I see errors there and try to backtrack the changes I made.
