Describe the bug
Passing ZeRO stage-3 CPU offload parameters via the args argument of .init_inference() has no effect: the entire model is still placed in GPU memory, which for a large model consumes the whole GPU and throws an error.
Following the blog post on ZeRO-Inference, I tried to load a GPT-J model with DeepSpeed Inference and got a CUDA OOM error.
How should I pass the DeepSpeed config parameters to the init_inference method? Or should I just use .initialize() even though the model is only used for inference?
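For reference, my understanding from the ZeRO-Inference blog post is that the offload config goes through deepspeed.initialize() rather than init_inference(). The sketch below is what I believe that path would look like; the train_micro_batch_size_per_gpu value is only there because the config schema requires a batch size, and I have not confirmed that this actually avoids the OOM:
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float32", torch_dtype=torch.float32
)

# ZeRO stage-3 config: keep parameters in CPU memory and stream them
# to the GPU only when a layer actually runs.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required by the config schema, unused for inference
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
    },
}

# deepspeed.initialize() takes the config dict directly via `config`.
ds_engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
ds_engine.module.eval()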
To Reproduce
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"

# Load GPT-J in full fp32 from the float32 revision of the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_name, revision="float32", torch_dtype=torch.float32
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ZeRO stage-3 config with parameter offload to CPU.
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu"
        },
    }
}

# Passing the ZeRO config via `args` appears to have no effect: the whole
# model is still moved onto the single GPU, which triggers the OOM below.
model = deepspeed.init_inference(
    model,
    args=deepspeed_config,
    dtype=model.dtype,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
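For completeness, this is the generation call I intended to run after initialization; it is never reached because init_inference() already OOMs. The prompt is just a placeholder, and I am assuming the underlying Hugging Face module still exposes generate() after kernel injection:
# Never reached in practice: the OOM below happens inside init_inference().
prompt = "DeepSpeed is"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.module.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))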
Error:
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 14.56 GiB total capacity; 13.52 GiB already allocated; 52.44 MiB free; 13.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Expected behavior
Only the layers currently needed for inference should be loaded onto the GPU, with the remaining parameters kept on the CPU and fetched dynamically as computation proceeds.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
which: no hipcc in (/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/home/ec2-user/anaconda3/condabin:/home/ec2-user/.dl_binaries/bin:/opt/aws/neuron/bin:/usr/libexec/gcc/x86_64-redhat-linux/7:/opt/aws/bin:/home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/bin:/home/ec2-user/anaconda3/condabin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:/home/ec2-user/anaconda3/condabin:/home/ec2-user/.dl_binaries/bin:/opt/aws/neuron/bin:/usr/libexec/gcc/x86_64-redhat-linux/7:/opt/aws/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/torch']
torch version .................... 1.5.1
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/amazonei_pytorch_latest_p37/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.7.3, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.5, cuda 10.2
Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
- OS: Amazon Linux 2
- 1 machine (g4dn.8xlarge on Amazon SageMaker with 1 Tesla T4 GPU)
- Transformers version: 4.22.2
- Python version: 3.8.12