FLUX memory management improvements #6791

Merged (11 commits into main on Aug 29, 2024)

Conversation

@RyanJDick (Collaborator) commented Aug 28, 2024

Summary

This PR contains several improvements to memory management for FLUX workflows.

It is now possible to achieve better FLUX model caching performance, but this still requires users to manually configure their ram/vram settings. For example, a vram setting of 16.0 should allow all of the quantized FLUX models to be kept in memory on the GPU.
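As a rough illustration of the kind of manual configuration meant here, the cache budgets would typically be set in invokeai.yaml. Treat the snippet below as an assumption rather than the canonical format; exact key names and file layout can differ between InvokeAI versions.

```yaml
# invokeai.yaml (illustrative snippet only; key names and location may vary by version)
# Values are in GB.
ram: 16.0   # CPU-side model cache budget
vram: 16.0  # GPU-side model cache budget; ~16 GB should keep the quantized FLUX models resident
```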

Changes:

  • Check the size of a model on disk and free the requisite space in the model cache before loading it. (This behaviour existed previously, but was removed in https://github.com/invoke-ai/InvokeAI/pull/6072/files; the removal does not appear to have been intentional.) A minimal sketch of this pattern follows this list.
  • Removed the hack to free 24GB of space in the cache before loading the FLUX model.
  • Split the T5 embedding and CLIP embedding steps into separate functions so that the two models don't both have to be held in RAM at the same time (see the second sketch after this list).
  • Fix a bug in InvokeLinear8bitLt that was causing some tensors to be left on the GPU when the model was offloaded to the CPU. (This class is getting very messy due to the non-standard state_dict handling in bnb.nn.Linear8bitLt.)
  • Tidy up some dtype handling in FluxTextToImageInvocation to avoid situations where we hold references to two copies of the same tensor unnecessarily.
  • (minor) Misc cleanup of ModelCache: improve docs and remove unused vars.
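To make the first change above concrete, here is a minimal, self-contained sketch of the "measure the model on disk, then make room in the cache before loading" pattern. This is not the InvokeAI ModelCache API: ToyModelCache, size_on_disk, and the loader callable are hypothetical names used only for illustration.

```python
from pathlib import Path


def size_on_disk(model_path: Path) -> int:
    """Total size of a model file or directory, in bytes."""
    if model_path.is_file():
        return model_path.stat().st_size
    return sum(p.stat().st_size for p in model_path.rglob("*") if p.is_file())


class ToyModelCache:
    """Tiny cache that evicts entries until a new model of `needed` bytes fits."""

    def __init__(self, max_ram_bytes: int) -> None:
        self._max_ram_bytes = max_ram_bytes
        self._entries: dict[str, tuple[object, int]] = {}  # key -> (model, size in bytes)

    def _used_bytes(self) -> int:
        return sum(size for _, size in self._entries.values())

    def make_room(self, needed: int) -> None:
        # Evict in insertion order (a stand-in for a real LRU policy) until the
        # new model fits under the RAM budget.
        for key in list(self._entries):
            if self._used_bytes() + needed <= self._max_ram_bytes:
                break
            del self._entries[key]

    def load(self, key: str, model_path: Path, loader) -> object:
        if key in self._entries:
            return self._entries[key][0]
        needed = size_on_disk(model_path)
        self.make_room(needed)  # free space *before* reading the weights into RAM
        model = loader(model_path)
        self._entries[key] = (model, needed)
        return model
```

The T5/CLIP split described above amounts to loading the two text encoders one after the other and releasing the first before loading the second. A hedged sketch of that pattern follows; load_t5 and load_clip are hypothetical callables returning (tokenizer, encoder) pairs, and the Hugging Face-style attribute names are assumptions, not InvokeAI code.

```python
import gc

import torch


def encode_prompt_sequentially(prompt: str, load_t5, load_clip):
    """Encode `prompt` with T5 and CLIP without holding both models in RAM at once."""
    # Encode with T5, then drop every reference to it before touching CLIP.
    tokenizer, t5 = load_t5()
    with torch.no_grad():
        t5_ids = tokenizer(prompt, return_tensors="pt").input_ids
        t5_embeds = t5(t5_ids).last_hidden_state
    del t5, tokenizer
    gc.collect()
    torch.cuda.empty_cache()  # harmless no-op when CUDA is not in use

    # Only now load CLIP, so the two encoders are never resident together.
    tokenizer, clip = load_clip()
    with torch.no_grad():
        clip_ids = tokenizer(prompt, return_tensors="pt").input_ids
        pooled_embeds = clip(clip_ids).pooler_output
    del clip, tokenizer
    gc.collect()

    return t5_embeds, pooled_embeds
```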

Future:
We should revisit our default ram/vram configs. The current defaults are very conservative, and users could see major performance improvements from tuning these values.

QA Instructions

I tested the FLUX workflow with the following configurations and verified that the cache hit rates and memory usage matched the expected behaviour:

  • ram = 16 and vram = 16
  • ram = 16 and vram = 1
  • ram = 1 and vram = 1

Note that the changes in this PR are not isolated to FLUX. Since we now check the size of models on disk, we may see slight changes in model cache offload patterns for other models as well.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

The github-actions bot added the python (PRs that change python files), invocations (PRs that change invocations), and backend (PRs that change backend files) labels on Aug 28, 2024.
@brandonrising (Collaborator) left a comment:

Lgtm

@RyanJDick force-pushed the ryan/flux-model-cache-improvements branch from 42992cc to 4e4b6c6 on August 29, 2024 at 19:08.
@RyanJDick merged commit 87261bd into main on Aug 29, 2024 (14 checks passed).
@RyanJDick deleted the ryan/flux-model-cache-improvements branch on August 29, 2024 at 19:17.