
save_pipe and load_pipe do not work #717

Closed
forestlet opened this issue Mar 9, 2024 · 10 comments · Fixed by #734

Comments

@forestlet

Describe the bug

I use OneDiffX (for HF diffusers) to compile, save, and load a pipeline. After I run the save_pipe example, there is nothing in cached_pipe.

Your environment

Ubuntu LTS

OneDiff git commit id

500459f

OneFlow version info

libibverbs not available, ibv_fork_init skipped
path: ['/home/ubuntu/.local/lib/python3.10/site-packages/oneflow']
version: 0.9.1.dev20240307+cu121
git_commit: 88ece9e
cmake_build_type: Release
rdma: True
mlir: True
enterprise: False

How To Reproduce

Steps to reproduce the behavior(code or script):

import torch
from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, save_pipe
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

save_pipe(pipe, dir="cached_pipe")

Additional context

Compilation takes a long time each run; however, the save_pipe function doesn't seem to work.

@forestlet
Author

@strint

@strint
Collaborator

strint commented Mar 13, 2024

@forestlet Got it. We will try to reproduce this and get back here.

@strint strint added Request-bug Something isn't working Response-need_hours This issue need some hours to be solved labels Mar 13, 2024
@strint strint added this to the v0.13.0 milestone Mar 13, 2024
@hjchen2

This comment was marked as off-topic.

@forestlet
Author

please set env variable ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION=0 or ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_SCORE_ACCUMULATION_MAX_M=0

didn't work for me either :(
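
For reference, since these variables are presumably read when oneflow initializes, here is a minimal sketch of setting them before any oneflow/onediffx import (assuming that timing matters):

import os

# Assumed: these variables are read at oneflow initialization, so they must
# be set (or exported in the shell) before oneflow/onediffx are imported.
os.environ["ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION"] = "0"
# or:
# os.environ["ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_SCORE_ACCUMULATION_MAX_M"] = "0"

import torch
from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, save_pipe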

@strint
Collaborator

strint commented Mar 15, 2024

@forestlet Please check this branch and have a try: #734

The pipe needs to run once to trigger the real compilation:

import torch
from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, save_pipe

pipe = StableDiffusionXLPipeline.from_pretrained(
    "/share_nfs/hf_models/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

# run once to trigger compilation
image = pipe(
    prompt="street style, detailed, raw photo, woman, face, shot on CineStill 800T",
    height=512,
    width=512,
    num_inference_steps=30,
    output_type="pil",
).images

image[0].save("test_image.png")

# save the compiled pipe
save_pipe(pipe, dir="cached_pipe")

@forestlet
Author

The save_pipe function works and saves the model to cached_pipe. 🎉
However, when I use load_pipe, it fails 🥹 and outputs:

[ERROR](GRAPH:OneflowGraph_3:OneflowGraph) run got error: <class 'oneflow._oneflow_internal.exception.Exception'> InferDataType Failed. Expected kFloat, but got kFloat16
  File "oneflow/core/job/job_interpreter.cpp", line 312, in InterpretJob
    RunNormalOp(launch_context, launch_op, inputs)
  File "oneflow/core/job/job_interpreter.cpp", line 224, in RunNormalOp
    it.Apply(*op, inputs, &outputs, OpExprInterpContext(empty_attr_map, JUST(launch_op.device)))
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 84, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... mut_local_tensor_infer_cache()->GetOrInfer(infer_args)); }()
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 87, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 210, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 178, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "oneflow/core/framework/op_expr.cpp", line 603, in InferPhysicalTensorDesc
    dtype_infer_fn_(&infer_ctx)
  File "oneflow/user/ops/group_norm_op.cpp", line 85, in InferDataType
    CHECK_EQ_OR_RETURN(gamma.data_type(), x.data_type())
Error Type: oneflow.ErrorProto.check_failed_error
Traceback (most recent call last):
  File "/home/ubuntu/filmacton/t2.py", line 15, in <module>
    load_pipe(pipe, dir="cached_pipe")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediffx/compilers/diffusion_pipeline_compiler.py", line 100, in load_pipe
    obj.load_graph(os.path.join(dir, part))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/with_oneflow_compile.py", line 322, in load_graph
    self.get_graph().load_graph(file_path, device, run_warmup)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/utils/cost_util.py", line 48, in clocked
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/with_oneflow_compile.py", line 349, in load_graph
    self.load_runtime_state_dict(state_dict, warmup_with_run=run_warmup)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1188, in load_runtime_state_dict
    return self._dynamic_input_graph_cache.load_runtime_state_dict(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/cache.py", line 242, in load_runtime_state_dict
    graph.load_runtime_state_dict(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1348, in load_runtime_state_dict
    self.__run(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1865, in __run
    _eager_outputs = oneflow._oneflow_internal.nn.graph.RunLazyNNGraphByVM(
oneflow._oneflow_internal.exception.Exception: InferDataType Failed. Expected kFloat, but got kFloat16
  File "oneflow/core/job/job_interpreter.cpp", line 312, in InterpretJob
    RunNormalOp(launch_context, launch_op, inputs)
  File "oneflow/core/job/job_interpreter.cpp", line 224, in RunNormalOp
    it.Apply(*op, inputs, &outputs, OpExprInterpContext(empty_attr_map, JUST(launch_op.device)))
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 84, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... mut_local_tensor_infer_cache()->GetOrInfer(infer_args)); }()
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 87, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 210, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 178, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "oneflow/core/framework/op_expr.cpp", line 603, in InferPhysicalTensorDesc
    dtype_infer_fn_(&infer_ctx)
  File "oneflow/user/ops/group_norm_op.cpp", line 85, in InferDataType
    CHECK_EQ_OR_RETURN(gamma.data_type(), x.data_type())
Error Type: oneflow.ErrorProto.check_failed_error

My save_pipe code is:

from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, save_pipe
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

# run once to trigger compilation
image = pipe(
    prompt="street style, detailed, raw photo, woman, face, shot on CineStill 800T",
    height=512,
    width=512,
    num_inference_steps=30,
    output_type="pil",
).images

image[0].save("test_image.png")

# save the compiled pipe
save_pipe(pipe, dir="cached_pipe")

and my load_pipe code is:

from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, load_pipe
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

# load the compiled pipe
load_pipe(pipe, dir="cached_pipe")

# no compilation now
image = pipe(
    prompt="street style, detailed, raw photo, woman, face, shot on CineStill 800T",
    height=512,
    width=512,
    num_inference_steps=30,
    output_type="pil",
).images

image[0].save("test_image.png")

@clackhan
Contributor

clackhan commented Mar 15, 2024

@forestlet This is because of the force_upcast of the VAE. You need to execute the following code before load_pipe:

if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.upcast_vae()

We will integrate this behavior into the load_pipe function in PR #734.
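
For reference, a minimal sketch of a full load script with that check applied, assuming the same SDXL setup as above:

import torch
from diffusers import StableDiffusionXLPipeline
from onediffx import compile_pipe, load_pipe

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
pipe.to("cuda")

pipe = compile_pipe(pipe)

# Match the dtype the graph was compiled and saved with: SDXL upcasts the
# VAE when force_upcast is set, so apply the same upcast before loading.
if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.upcast_vae()

load_pipe(pipe, dir="cached_pipe")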

clackhan added a commit that referenced this issue Mar 15, 2024
Fix: #717

---------

Co-authored-by: binbinHan <han_binbin@163.com>
@forestlet
Author

@forestlet This is because of the force_upcast of the VAE. You need to execute the following code before load_pipe:

if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.upcast_vae()

We will integrate this behavior into the load_pipe function in PR #734.

Thanks! However...
😢 I tried SVD and found there is no upcast_vae() for the SVD pipe.
So I checked onediff_diffusers_extensions/onediffx/deep_cache/pipeline_stable_video_diffusion.py and tried:

if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.vae.to(dtype=torch.float32)

load_pipe(pipe, dir="cached_pipe")

And I got this:

/home/ubuntu/.local/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
libibverbs not available, ibv_fork_init skipped
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  9.09it/s]
[ERROR](GRAPH:OneflowGraph_3:OneflowGraph) run got error: <class 'oneflow._oneflow_internal.exception.Exception'> InferDataType Failed. Expected kFloat16, but got kFloat
  File "oneflow/core/job/job_interpreter.cpp", line 312, in InterpretJob
    RunNormalOp(launch_context, launch_op, inputs)
  File "oneflow/core/job/job_interpreter.cpp", line 224, in RunNormalOp
    it.Apply(*op, inputs, &outputs, OpExprInterpContext(empty_attr_map, JUST(launch_op.device)))
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 84, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... mut_local_tensor_infer_cache()->GetOrInfer(infer_args)); }()
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 87, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 210, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 178, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "oneflow/core/framework/op_expr.cpp", line 603, in InferPhysicalTensorDesc
    dtype_infer_fn_(&infer_ctx)
  File "oneflow/user/ops/group_norm_op.cpp", line 85, in InferDataType
    CHECK_EQ_OR_RETURN(gamma.data_type(), x.data_type())
Error Type: oneflow.ErrorProto.check_failed_error
Traceback (most recent call last):
  File "/home/ubuntu/filmacton/video_gen/load_compiled_pipe.py", line 18, in <module>
    load_pipe(pipe, dir="cached_pipe")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediffx/compilers/diffusion_pipeline_compiler.py", line 100, in load_pipe
    obj.load_graph(os.path.join(dir, part))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/with_oneflow_compile.py", line 322, in load_graph
    self.get_graph().load_graph(file_path, device, run_warmup)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/utils/cost_util.py", line 48, in clocked
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onediff/infer_compiler/with_oneflow_compile.py", line 349, in load_graph
    self.load_runtime_state_dict(state_dict, warmup_with_run=run_warmup)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1188, in load_runtime_state_dict
    return self._dynamic_input_graph_cache.load_runtime_state_dict(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/cache.py", line 242, in load_runtime_state_dict
    graph.load_runtime_state_dict(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1348, in load_runtime_state_dict
    self.__run(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/oneflow/nn/graph/graph.py", line 1865, in __run
    _eager_outputs = oneflow._oneflow_internal.nn.graph.RunLazyNNGraphByVM(
oneflow._oneflow_internal.exception.Exception: InferDataType Failed. Expected kFloat16, but got kFloat
  File "oneflow/core/job/job_interpreter.cpp", line 312, in InterpretJob
    RunNormalOp(launch_context, launch_op, inputs)
  File "oneflow/core/job/job_interpreter.cpp", line 224, in RunNormalOp
    it.Apply(*op, inputs, &outputs, OpExprInterpContext(empty_attr_map, JUST(launch_op.device)))
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 84, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... mut_local_tensor_infer_cache()->GetOrInfer(infer_args)); }()
  File "oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 87, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 210, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "oneflow/core/framework/local_tensor_infer_cache.cpp", line 178, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "oneflow/core/framework/op_expr.cpp", line 603, in InferPhysicalTensorDesc
    dtype_infer_fn_(&infer_ctx)
  File "oneflow/user/ops/group_norm_op.cpp", line 85, in InferDataType
    CHECK_EQ_OR_RETURN(gamma.data_type(), x.data_type())
Error Type: oneflow.ErrorProto.check_failed_error
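
Note the mismatch is flipped here ("Expected kFloat16, but got kFloat") relative to the SDXL case, which suggests the saved SVD graph was compiled with the VAE still in float16, so the manual float32 upcast before load_pipe introduces the mismatch rather than fixing it. A minimal sketch of the apparent rule, assuming the VAE dtype simply has to match between save_pipe and load_pipe:

# Assumed rule: before load_pipe, reproduce exactly the VAE dtype the graph
# had when it was compiled and saved. No upcast was applied before save_pipe
# for this SVD pipe, so none should be applied before load_pipe either.
assert pipe.vae.dtype == torch.float16  # same dtype as at save time
load_pipe(pipe, dir="cached_pipe")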

@strint strint reopened this Apr 21, 2024
@strint strint modified the milestones: v1.0.0(0.13.0), v1.1 Apr 21, 2024
@strint strint modified the milestones: v1.1, v1.2 Jun 9, 2024
@strint
Collaborator

strint commented Jul 5, 2024

@forestlet Is there a full example for your error, so we can give it a try?

@strint
Collaborator

strint commented Jul 12, 2024

Too old to follow; please feel free to reopen it.

@strint strint closed this as completed Jul 12, 2024