Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ModelPatcher Overhaul and Hook Support #5583

Merged
merged 102 commits into from
Dec 2, 2024

Conversation

Kosinkadink
Copy link
Collaborator

@Kosinkadink Kosinkadink commented Nov 11, 2024

This PR merges all changes from improved_memory branch, and expands ModelPatcher + transformer_options to allow for different weights and properties to be applied for selected conditioning. This is done by introducing a hook design pattern, where conditioning and CLIP can have hooks attached to change their behavior at sample time; before, this was hardcoded for specific things like controlnet and gligen.

I did not find any memory or performance regression in my testing, but more testing would be good; I will try to get some folks to test out this branch alongside the corresponding rework_modelpatcher branches in AnimateDiff-Evolved and Advanced-ControlNet that make use of the new functionality.

Related PRs in those repos that will be merged when this does:
Kosinkadink/ComfyUI-AnimateDiff-Evolved#498
Kosinkadink/ComfyUI-Advanced-ControlNet#198

Remaining TODO:

  • Fix VRAM usage exceeding expectations when applying weight hooks
    • SOLVED: weights needed to be copied in place, and the weight backups should be copied to avoid being overriden
  • Figure out why flux LoRAs do not produce the same results when applied via Hook compared to with Load LoRA node (issue does not exist for other model loras as far as I can tell)
    • SOLVED: I accidentally left an extra .to call on the calculate_weight results from way back before I added the stochastic_rounding call, which screws up results when fp8 weights are used. Removing the .to fixed it.
  • Make sure weights loaded in lowvram mode get modified appropriately by WeightHooks
    • I haven't found any glaring issues through testing; any problems here can be resolved as they are reported. Main thing to be revisited here is RAM usage when the weights are backed up for big models (flux) with very little VRAM.

Breaking Changes:

  • ControlNet's get_control function now takes transformer_options as a required parameter; if a custom node wrote its own function to overwrite the built-in calc_cond_batch function, it will result in an error when executing. It will be an easy fix for any affected nodes; only one I can think of on the top of my head is TiledDiffusion.

Features:

  • Hooks
    • In the long term, this will be a way for individual conds to have different weights, wrappers, callbacks, patches, attachments, etc. that are currently impossible for custom nodes to do in a way that doesn't require excessive hacks that break compatibility between nodes. Currently, two types of hooks are fully implemented, but I will be expanding this system in a near-future PR:
      • Weight Hooks: Different lora and model-as-lora weights can be attached and scheduled on CLIP/conditioning.
      • Wrapper Hooks: Different wrapper functions can be applied.
  • ModelPatcher additions
    • Wrappers
      • Instead of requiring nodes to overwrite or hack into important functions to change their functionality, ModelPatcher and model_options support wrappers functions that will be automatically handle passing an executor into wrapper functions to facilitate wrapping in a predictable manner. Since there is no limitation on names of wrapper functions, some custom nodes could decide to expose extending their own functionality with other nodes through the wrapper system. Cost of wrapping is imperceptibly low, so more wrapper support can be added upon need/request.
    • Callbacks
      • Similar to wrappers, but callbacks instead can be used to extend ModelPatcher functions to avoid the need for hacky ModelPatcher inheritance, for cases where wrapping wouldn't make sense. Same as wrappers, more callbacks can be added upon need/request.
    • Additional Models
      • A dictionary stores models that should be loaded alongside the main model; they will be cloned when the main ModelPatcher is cloned.
    • Attachments
      • A dictionary stores objects that will not be deep copied, unlike model_options. Only clones objects stored inside it that have a on_model_patcher_clone() callable,
    • Injections
      • A dictionary storing a list of PatcherInjection objects, which allow for modifying ("injecting") anything on the ModelPatcher/model in a way that doesn't break the patching system. Biggest current user of this feature will be AnimateDiff, as it's implemented by injecting extra blocks into the base SD unets.
  • transformer_options as an execution context
    • This PR begins to take transformer_options to its natural conclusion - a way to track the context of execution that can be modified by both native components and third-party extensions. At sampling time, it collects all patches, wrappers, callbacks, etc., and all exposed dicts and lists are copied each time it is passed down a layer where it is modified to make sure there is no accidental poisoning of the ModelPatcher's model_options. For callbacks and wrappers in particular, helper functions were added in comfy.patcher_extension to allow for easy modification and classification via CallbacksMP and WrappersMP. In a future PR, patches should be exposed in a similar way.
    • ControlNets's get_control functions now take in transformer_options as an input, allowing them to add their own patches, wrappers, etc. as desired.

…d added call prepare current keyframe on hooks in calc_cond_batch
… for loras/model-as-loras, small renaming/refactoring
…n work on better Create Hook Model As LoRA node
…implement ModelPatcher callbacks, attachments, and additional_models
…ed additional_models support in ModelPatcher, conds, and hooks
…er properties, improved AutoPatcherEjector usage in partially_load
…delPatcher for emb_patch and forward_timestep_embed_patch, added helper functions for removing callbacks/wrappers/additional_models by key, added custom_should_register prop to hooks
…s due to hooks should be offloaded in hooks_backup
…ormer_options as additional parameter, made the model_options stored in extra_args in inner_sample be a clone of the original model_options instead of same ref
…se __future__ so that I can use the better type annotations
@Kosinkadink
Copy link
Collaborator Author

All of my manual testing is complete - the PR can be merged at any time if all looks fine with @comfyanonymous

@asagi4
Copy link
Contributor

asagi4 commented Nov 26, 2024

@Kosinkadink If I create a workflow that sets a LoRA hook with no scheduling, it seems that it reloads the hook on every sampling run even if just the sampler seed changes. It causes pretty significant latency. I think it's because the patches get unloaded immediately after sampling even though there's no need to do so.

Is there a way to use the hook mechanism in a way that avoids this at the moment? My quick testing shows a 13.8s to 16.5s increase when only changing the seed after warmup (sometimes even up to 18s). --highvram reduces the latency quite a bit but still doesn't remove it

@Kosinkadink
Copy link
Collaborator Author

A lot of my current optimization was focused on the worst case scenarios of different hooks needing to be applied to different conditioning, so to prevent any memory issues from cached weights not being cleared, I currently have the model purge newly registered (AKA, added at sample time) hook patches and clear cached hooked weight calculations, always.

In cases where there is only a single hook group to apply, I could make it not revert the model to its unhooked state at the end of sampling, so that if nothing gets changed with the hooks/ModelPatcher, it would not need to redo hooked weight application. However, that introduces some extra complexity that could introduce bugs I don't want to deal with currently - I've been working on this for 3 months, and in its current state it hasn't even been released to be tested by a wide variety of peeps. Once it gets merged and it appears to be working fine in general, I'd be down to add an optimization for that edge case.

@asagi4
Copy link
Contributor

asagi4 commented Nov 26, 2024

@Kosinkadink Fair. This PR is definitely big enough already.

nodes.py Outdated Show resolved Hide resolved
@comfyanonymous comfyanonymous merged commit 0ee322e into master Dec 2, 2024
6 checks passed
@comfyanonymous comfyanonymous deleted the patch_hooks_improved_memory branch December 3, 2024 05:22
@wbclark
Copy link

wbclark commented Dec 16, 2024

Thanks for the considerable effort it must have taken to make this so readable. That is tremendously helpful.

@andreszs
Copy link

andreszs commented Dec 20, 2024

@Kosinkadink I've tried your lorahookmasking workflow, using a single checkpoint and one lora file for each Set Clip Hooks node.

While it did manage to perfectly avoid lora bleeding, the generation time is 10x longer than usual for a 12-step 880x768px image:

got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using xformers attention in VAE
Using xformers attention in VAE
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
Requested to load SDXL
loaded completely 9.5367431640625e+25 4897.0483474731445 True
100%|█████████████████████████████████| 12/12 [01:43<00:00,  8.65s/it]
Requested to load AutoencoderKL
loaded completely 9.5367431640625e+25 159.55708122253418 True
Prompt executed in 132.59 seconds

Also, RAM usage skyrockets immediately as shown here.

ram usage

The model is PrefectPonyV2XL (6.4GB) and the loras have been shrinked to 70 MB each, but this didn't help at all. Is there some explanation on why this is so slow? The workflow is exactly your original one, but loading 2 loras instead of 1 lora and 1 checkpoint as lora.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants