
save_pipe and load_pipe do not work #717

Closed
forestlet opened this issue Mar 9, 2024 · 10 comments · Fixed by #734

Comments

@forestlet

Describe the bug

I use OneDiffX (for HF diffusers) to compile, save, and load a pipeline. After I run the save_pipe example, there is nothing in cached_pipe.

Your environment

Ubuntu LTS

OneDiff git commit id

500459f

OneFlow version info

libibverbs not available, ibv_fork_init skipped
path: ['/home/ubuntu/.local/lib/python3.10/site-packages/oneflow']
version: 0.9.1.dev20240307+cu121
git_commit: 88ece9e
cmake_build_type: Release
rdma: True
mlir: True
enterprise: False

How To Reproduce

Steps to reproduce the behavior(code or script):

import torch
from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, save_pipe
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

save_pipe(pipe, dir="cached_pipe")

Additional context

Compilation takes a long time each run; however, the save_pipe function doesn't seem to work.

@forestlet
Author

@strint

@strint
Collaborator

strint commented Mar 13, 2024

@forestlet Got it. We will try to reproduce this and get back here.

@strint strint added Request-bug Something isn't working Response-need_hours This issue need some hours to be solved labels Mar 13, 2024
@strint strint added this to the v0.13.0 milestone Mar 13, 2024
@hjchen2

This comment was marked as off-topic.

@forestlet
Author

please set env variable ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION=0 or ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_SCORE_ACCUMULATION_MAX_M=0

didn't work for me either :(
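
For reference, since these variables are presumably read when oneflow initializes, here is a minimal sketch of setting them before any oneflow/onediffx import (assuming that timing matters):

import os

# Assumed: these variables are read at oneflow initialization, so they must
# be set (or exported in the shell) before oneflow/onediffx are imported.
os.environ["ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION"] = "0"
# or:
# os.environ["ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_SCORE_ACCUMULATION_MAX_M"] = "0"

import torch
from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, save_pipe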

@strint
Collaborator

strint commented Mar 15, 2024

@forestlet Please check this branch and have a try: #734

The pipe needs to run once to trigger the real compilation:

import torch
from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, save_pipe

pipe = StableDiffusionXLPipeline.from_pretrained(
    "/share_nfs/hf_models/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

# run once to trigger compilation
image = pipe(
    prompt="street style, detailed, raw photo, woman, face, shot on CineStill 800T",
    height=512,
    width=512,
    num_inference_steps=30,
    output_type="pil",
).images

image[0].save("test_image.png")

# save the compiled pipe
save_pipe(pipe, dir="cached_pipe")

@forestlet
Author

The save_pipe function works and saves the model to cached_pipe. 🎉
However, when I use load_pipe, it fails 🥹 and outputs:

[ERROR](GRAPH:OneflowGraph_3:OneflowGraph) run got error: <class 'oneflow._oneflow_internal.exception.Exception'> InferDataType Failed. Expected kFloat, but got kFloat16
  File "oneflow/core/job/job_interpreter.cpp", line 312, in InterpretJob
    RunNormalOp(launch_context, launch_op, inputs)
  File "oneflow/core/job/job_interpreter.cpp", line 224, in RunNormalOp
    it.Apply(*op, inputs, &outputs, OpExprInterpContext(empty_attr_map, JUST(launch_op.device)))
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 84, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... mut_local_tensor_infer_cache()->GetOrInfer(infer_args)); }()
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 87, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 210, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 178, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "oneflow/core/framework/op_expr.cpp", line 603, in InferPhysicalTensorDesc
    dtype_infer_fn_(&infer_ctx)
  File "oneflow/user/ops/group_norm_op.cpp", line 85, in InferDataType
    CHECK_EQ_OR_RETURN(gamma.data_type(), x.data_type())
Error Type: oneflow.ErrorProto.check_failed_error
Traceback (most recent call last):
  File "/home/ubuntu/filmacton/t2.py", line 15, in <module>
    load_pipe(pipe, dir="cached_pipe")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediffx/compilers/diffusion_pipeline_compiler.py", line 100, in load_pipe
    obj.load_graph(os.path.join(dir, part))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/with_oneflow_compile.py", line 322, in load_graph
    self.get_graph().load_graph(file_path, device, run_warmup)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/utils/cost_util.py", line 48, in clocked
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/with_oneflow_compile.py", line 349, in load_graph
    self.load_runtime_state_dict(state_dict, warmup_with_run=run_warmup)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1188, in load_runtime_state_dict
    return self._dynamic_input_graph_cache.load_runtime_state_dict(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/cache.py", line 242, in load_runtime_state_dict
    graph.load_runtime_state_dict(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1348, in load_runtime_state_dict
    self.__run(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1865, in __run
    _eager_outputs = oneflow._oneflow_internal.nn.graph.RunLazyNNGraphByVM(
oneflow._oneflow_internal.exception.Exception: InferDataType Failed. Expected kFloat, but got kFloat16
  File "oneflow/core/job/job_interpreter.cpp", line 312, in InterpretJob
    RunNormalOp(launch_context, launch_op, inputs)
  File "oneflow/core/job/job_interpreter.cpp", line 224, in RunNormalOp
    it.Apply(*op, inputs, &outputs, OpExprInterpContext(empty_attr_map, JUST(launch_op.device)))
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 84, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... mut_local_tensor_infer_cache()->GetOrInfer(infer_args)); }()
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 87, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 210, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 178, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "oneflow/core/framework/op_expr.cpp", line 603, in InferPhysicalTensorDesc
    dtype_infer_fn_(&infer_ctx)
  File "oneflow/user/ops/group_norm_op.cpp", line 85, in InferDataType
    CHECK_EQ_OR_RETURN(gamma.data_type(), x.data_type())
Error Type: oneflow.ErrorProto.check_failed_error

My save_pipe code is:

from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, save_pipe
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

# run once to trigger compilation
image = pipe(
    prompt="street style, detailed, raw photo, woman, face, shot on CineStill 800T",
    height=512,
    width=512,
    num_inference_steps=30,
    output_type="pil",
).images

image[0].save("test_image.png")

# save the compiled pipe
save_pipe(pipe, dir="cached_pipe")

and my load_pipe code is:

from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, load_pipe
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

# load the compiled pipe
load_pipe(pipe, dir="cached_pipe")

# no compilation now
image = pipe(
    prompt="street style, detailed, raw photo, woman, face, shot on CineStill 800T",
    height=512,
    width=512,
    num_inference_steps=30,
    output_type="pil",
).images

image[0].save("test_image.png")

@clackhan
Contributor

clackhan commented Mar 15, 2024

@forestlet This is because of the force_upcast of the VAE. You need to execute the following code before load_pipe:

if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.upcast_vae()

We will integrate this behavior into the load_pipe function in PR #734.
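
For reference, a minimal sketch of a full load script with that check applied, assuming the same SDXL setup as above:

import torch
from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, load_pipe

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

# Match the dtype the graph was compiled and saved with: SDXL upcasts the
# VAE when force_upcast is set, so apply the same upcast before loading.
if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.upcast_vae()

load_pipe(pipe, dir="cached_pipe")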

clackhan added a commit that referenced this issue Mar 15, 2024
Fix: #717

---------

Co-authored-by: binbinHan <han_binbin@163.com>
@forestlet
Author

@forestlet This is because of the force_upcast of the VAE. You need to execute the following code before load_pipe:

if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.upcast_vae()

We will integrate this behavior into the load_pipe function in PR #734.

Thanks! However...
😢 I tried SVD and found there is no upcast_vae() for the SVD pipe.
So I checked onediff_diffusers_extensions/onediffx/deep_cache/pipeline_stable_video_diffusion.py and tried:

if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.vae.to(dtype=torch.float32)

load_pipe(pipe, dir="cached_pipe")

And I got this:

/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
libibverbs not available, ibv_fork_init skipped
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  9.09it/s]
[ERROR](GRAPH:OneflowGraph_3:OneflowGraph) run got error: <class 'oneflow._oneflow_internal.exception.Exception'> InferDataType Failed. Expected kFloat16, but got kFloat
  File "oneflow/core/job/job_interpreter.cpp", line 312, in InterpretJob
    RunNormalOp(launch_context, launch_op, inputs)
  File "oneflow/core/job/job_interpreter.cpp", line 224, in RunNormalOp
    it.Apply(*op, inputs, &outputs, OpExprInterpContext(empty_attr_map, JUST(launch_op.device)))
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 84, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... mut_local_tensor_infer_cache()->GetOrInfer(infer_args)); }()
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 87, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 210, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 178, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "oneflow/core/framework/op_expr.cpp", line 603, in InferPhysicalTensorDesc
    dtype_infer_fn_(&infer_ctx)
  File "oneflow/user/ops/group_norm_op.cpp", line 85, in InferDataType
    CHECK_EQ_OR_RETURN(gamma.data_type(), x.data_type())
Error Type: oneflow.ErrorProto.check_failed_error
Traceback (most recent call last):
  File "/home/ubuntu/filmacton/video_gen/load_compiled_pipe.py", line 18, in <module>
    load_pipe(pipe, dir="cached_pipe")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediffx/compilers/diffusion_pipeline_compiler.py", line 100, in load_pipe
    obj.load_graph(os.path.join(dir, part))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/with_oneflow_compile.py", line 322, in load_graph
    self.get_graph().load_graph(file_path, device, run_warmup)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/utils/cost_util.py", line 48, in clocked
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/with_oneflow_compile.py", line 349, in load_graph
    self.load_runtime_state_dict(state_dict, warmup_with_run=run_warmup)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1188, in load_runtime_state_dict
    return self._dynamic_input_graph_cache.load_runtime_state_dict(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/cache.py", line 242, in load_runtime_state_dict
    graph.load_runtime_state_dict(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1348, in load_runtime_state_dict
    self.__run(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1865, in __run
    _eager_outputs = oneflow._oneflow_internal.nn.graph.RunLazyNNGraphByVM(
oneflow._oneflow_internal.exception.Exception: InferDataType Failed. Expected kFloat16, but got kFloat
  File "oneflow/core/job/job_interpreter.cpp", line 312, in InterpretJob
    RunNormalOp(launch_context, launch_op, inputs)
  File "oneflow/core/job/job_interpreter.cpp", line 224, in RunNormalOp
    it.Apply(*op, inputs, &outputs, OpExprInterpContext(empty_attr_map, JUST(launch_op.device)))
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 84, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... mut_local_tensor_infer_cache()->GetOrInfer(infer_args)); }()
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 87, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 210, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 178, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "oneflow/core/framework/op_expr.cpp", line 603, in InferPhysicalTensorDesc
    dtype_infer_fn_(&infer_ctx)
  File "oneflow/user/ops/group_norm_op.cpp", line 85, in InferDataType
    CHECK_EQ_OR_RETURN(gamma.data_type(), x.data_type())
Error Type: oneflow.ErrorProto.check_failed_error
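
Note the mismatch is flipped here ("Expected kFloat16, but got kFloat") relative to the SDXL case, which suggests the saved SVD graph was compiled with the VAE still in float16, so the manual float32 upcast before load_pipe introduces the mismatch rather than fixing it. A minimal sketch of the apparent rule, assuming the VAE dtype simply has to match between save_pipe and load_pipe:

# Assumed rule: before load_pipe, reproduce exactly the VAE dtype the graph
# had when it was compiled and saved. No upcast was applied before save_pipe
# for this SVD pipe, so none should be applied before load_pipe either.
assert pipe.vae.dtype == torch.float16  # same dtype as at save time
load_pipe(pipe, dir="cached_pipe")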

@strint strint reopened this Apr 21, 2024
@strint strint modified the milestones: v1.0.0(0.13.0), v1.1 Apr 21, 2024
@strint strint modified the milestones: v1.1, v1.2 Jun 9, 2024
@strint
Collaborator

strint commented Jul 5, 2024

@forestlet Is there a full example for your error, so we can give it a try?

@strint
Collaborator

strint commented Jul 12, 2024

Too old to follow; please feel free to reopen it.

@strint strint closed this as completed Jul 12, 2024