
bigscience/T0 multi-gpu inference exits with return code -9 #16616

Closed · 2 of 4 tasks

gportill opened this issue Apr 5, 2022 · 20 comments

gportill commented Apr 5, 2022

Environment info

  • transformers version: 4.17.0.dev0
  • Platform: Linux-5.13.0-37-generic-x86_64-with-glibc2.10
  • Python version: 3.8.0
  • PyTorch version (GPU?): 1.10.1 (True)
  • Tensorflow version (GPU?): 2.8.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes (deepspeed)

Who can help

Library:

Information

Model I am using: T0

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I want to load T0 across two 24GB GPUs with DeepSpeed in order to run inference. I followed the example code given here in issue #15399.

When running the code below, the log gets as far as finished initializing model with 11.14B parameters and then the program quits without producing a model response. It does not give an error or traceback, just a return code of -9:

[2022-04-05 16:18:09,845] [WARNING] [runner.py:155:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-04-05 16:18:09,912] [INFO] [runner.py:438:main] cmd = /home/aadelucia/miniconda3/envs/fda_cersi_tobacco/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 multi_gpu_T0.py
[2022-04-05 16:18:10,635] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-04-05 16:18:10,635] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-04-05 16:18:10,635] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-04-05 16:18:10,635] [INFO] [launch.py:123:main] dist_world_size=2
[2022-04-05 16:18:10,635] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2022-04-05 16:18:11,702] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2022-04-05 16:18:56,295] [INFO] [partition_parameters.py:456:__exit__] finished initializing model with 11.14B parameters
[2022-04-05 16:19:40,754] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 406939
[2022-04-05 16:19:40,754] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 406940
[2022-04-05 16:19:40,754] [ERROR] [launch.py:184:sigkill_handler] ['/home/aadelucia/miniconda3/envs/fda_cersi_tobacco/bin/python', '-u', 'multi_gpu_T0.py', '--local_rank=1'] exits with return code = -9

Here is the code. Run with deepspeed --num_gpus 2 <script.py>

"""
Example code to load a PyTorch model across GPUs

Code from https://github.com/huggingface/transformers/issues/15399
"""
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import torch
import pdb
import os
from tqdm import tqdm
import re

seed = 42
torch.manual_seed(seed)

###
# Deepspeed setup
###
# To avoid warnings about parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# distributed setup
local_rank = int(os.getenv('LOCAL_RANK', '0'))  # TODO use this
world_size = int(os.getenv('WORLD_SIZE', '1'))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "bigscience/T0"
config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model

ds_config = {
    "fp16": {
        "enabled": False,
    },
    "bf16": {
        "enabled": True,
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    # batch size has to be divisible by world_size, but can be bigger than world_size
    "train_batch_size": 1 * world_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}

# Initialize model
# must setup HfDeepSpeedConfig before instantiating the model
# ds_config is deepspeed config object or path to the file
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)  # should be 1024
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# we are ready to initialise deepspeed ZeRO now
ds_engine = deepspeed.initialize(model=model,
                                 config_params=ds_config,
                                 model_parameters=None,
                                 optimizer=None,
                                 lr_scheduler=None)[0]
ds_engine.module.eval()  # inference
rank = torch.distributed.get_rank()
text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"

inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)

# Generation options
# https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/model#transformers.generation_utils.GenerationMixin.generate
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, max_length=256)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

Expected behavior

T0 should load across 2 GPUs, generate an answer, and then quit.


stas00 commented Apr 5, 2022

Please try again with the exact code from https://huggingface.co/docs/transformers/main/main_classes/deepspeed#custom-deepspeed-zero-inference as I cleaned it up a bit more - I have just re-tested with it - it works just fine on 2x rtx 3090 gpus.

It uses the smaller bigscience/T0_3B, but let's still validate that it works for you as a baseline.

bigscience/T0 is ~4x bigger (11B) - I will try it next once the 42GB checkpoint has downloaded.

I'm using master/main version of transformers/deepspeed and pt-1.11.

stas00 self-assigned this Apr 5, 2022

stas00 commented Apr 6, 2022

OK, I managed to crash my system with the 11B version with 2 gpus.

Need to figure out cgroup v2 as I moved to Ubuntu 21.10 and my v1 setup no longer works.

Meanwhile I figured out how to run a shell in which no process it starts can use more memory than I allow, and thus cannot kill the host:

systemd-run --user --scope -p MemoryHigh=100G -p MemoryMax=110G -p MemorySwapMax=60G bash

But since this is a huge 42GB checkpoint, I don't have enough RAM to load it twice in 2 processes. We have just added sharded checkpoints, so T0 needs to be switched over to them.

And meanwhile I'm trying to figure out how to get this to run with nvme offload.

I will update more once I have something running.


gportill commented Apr 6, 2022

Please try again with the exact code from https://huggingface.co/docs/transformers/main/main_classes/deepspeed#custom-deepspeed-zero-inference as I cleaned it up a bit more - I have just re-tested with it - it works just fine on 2x rtx 3090 gpus.

Thanks for your help!

I tried to run the example at the link, and now I get another error related to Ninja (full traceback below). I have seen this error before when trying to run the script I provided in my initial post. The errors seemed to alternate between the return code -9 and this Ninja error, without my changing anything in the code.

If the example works for you, I can't figure out what's going wrong on my end. Ninja is installed in my environment, and pip install ninja says that the requirement is already satisfied.

I am going to set up a new environment and see if that has better results.

Traceback (most recent call last):
  File "deepspeed_example.py", line 116, in <module>
    ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 296, in __init__
    self.optimizer = self._configure_zero_optimizer(optimizer=None)
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1394, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 608, in __init__
    util_ops = UtilsBuilder().load()
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 403, in load
    return self.jit_load(verbose)
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 435, in jit_load
    op_module = load(
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
    return _jit_compile(
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1337, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1418, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/home/username/miniconda3/envs/my_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1474, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
Loading extension module utils...
Time to load utils op: 0.10305166244506836 seconds
[2022-04-05 20:50:51,338] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 419852
[2022-04-05 20:50:51,338] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 419853
[2022-04-05 20:50:51,339] [ERROR] [launch.py:184:sigkill_handler] ['/home/username/miniconda3/envs/my_env/bin/python', '-u', 'deepspeed_example.py', '--local_rank=1'] exits with return code = 1


stas00 commented Apr 6, 2022

What's the output of this? I included the output on my conda env:

$ which ninja
/home/stas/anaconda3/envs/py38-pt111/bin/ninja

Perhaps your PATH env var is broken. Check that it includes your conda env's bin path, /home/stas/anaconda3/envs/py38-pt111/bin in my example.

try echo $PATH
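
As an extra sanity check you can also ask Python itself where it picks ninja up from - a minimal sketch using only the standard library (shutil.which simply searches the same $PATH the shell uses):

import shutil

# None means ninja is not visible on the current $PATH
print(shutil.which("ninja"))

If this prints None in the same environment where pip says ninja is installed, the problem is the $PATH rather than the package itself.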


stas00 commented Apr 6, 2022

One of the deepspeed devs was able to reproduce your original error - it seems to be related to the deepspeed launcher.

So until they figure it out, the quick fix is not to use it ;) Instead use the torch launcher:

python -m torch.distributed.run --nproc_per_node=2 <script.py>

cc: @jeffra


gportill commented Apr 7, 2022

What's the output of this? I included the output on my conda env:

$ which ninja
/home/stas/anaconda3/envs/py38-pt111/bin/ninja

I don't see any output when I give this command, and echo $PATH does list the environment I'm working in, but it doesn't list a ninja directory. I'll look into adding it to the PATH.

Thanks for the python -m torch.distributed.run --nproc_per_node=2 <script.py> suggestion. I tried it out and got the same Ninja issue. Once I fix the PATH, I'll try it again.

I will say that I was able to launch T0 and get it working several times last week and early this week, so I'm not sure why the Ninja error is suddenly appearing.


stas00 commented Apr 7, 2022

I don't see any output when I give this command, and echo $PATH does list the environment I'm working in, but it doesn't list a ninja directory

No output means it can't find it in $PATH.

There could be 2 issues:

  1. ninja is installed but your $PATH is incorrect
  2. ninja is not fully installed

Let's look at each case:

  1. What is your conda environment's path? You can get all the envs with:
conda info --envs

e.g. in my case:

$ conda info --envs | grep py38-pt111
py38-pt111            *  /home/stas/anaconda3/envs/py38-pt111

So the bin path that should be in $PATH is /home/stas/anaconda3/envs/py38-pt111/bin

Typically conda pushes that path into $PATH when you activate your environment.

  2. If your $PATH is correct, then you can also try forcing the reinstall:
pip install ninja --force


gportill commented Apr 8, 2022

I was finally able to get past the Ninja problem by force-installing it (pip install ninja --force) in my original environment. Thanks for that.

I also made a new environment and installed all the necessary packages. Here's the information for the new environment:

- `transformers` version: 4.19.0.dev0
- Platform: Linux-5.13.0-37-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes, DeepSpeed

For both my original and new environment, I can get T0_3B to work on the Custom DeepSpeed ZeRO Inference example.

However, the Custom DeepSpeed ZeRO Inference example with the T0 model still finishes with exit code -9 and now mentions ChildFailedError. I'm running it with python -m torch.distributed.run --nproc_per_node=2 <script.py>:

[2022-04-07 20:19:38,188] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 480668 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 480669) of binary: /home/gwightman/miniconda3/envs/confidence_estimation/bin/python
Traceback (most recent call last):
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
deepspeed_example.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-07_20:20:20
  host      : lambda1
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 480669)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 480669
=======================================================

Something else to note is that I was able to successfully run T0 and get output last week, around March 31st. In that case, I had two processes running at the same time, sending the same example to both processes, and output would be generated. When I sent different examples to each process, it appeared that rank=0 would finish before rank=1, and the input at rank=1 would be hanging.


stas00 commented Apr 8, 2022

Glad to hear you figured out ninja.

The traceback you pasted is from the launcher, not the actual program. There are 2 independent programs: the launcher starts your actual program, and its traceback only says that it detected that your program failed. Do you have the traceback from your program itself?

In that case, I had two processes running at the same time, sending the same example to both processes, and output would be generated. When I sent different examples to each process, it appeared that rank=0 would finish before rank=1, and the input at rank=1 would be hanging.

I understand the symptom. It means that the GPU synchronisation code in generate didn't kick in, so one GPU finished running while the other was still waiting for its data shard from the GPU that had already finished. When you send the same input to both ranks, or inputs that produce outputs of the same token length, it happens to work.

So we need to figure out why the sync didn't kick in. The sync is enabled here:

"synced_gpus": True if is_deepspeed_zero3_enabled() else False,

which tells me that is_deepspeed_zero3_enabled() returned false, which tells me that either:

  1. you have used a config file which didn't have stage: 3 set
  2. or you haven't created or kept alive dschf = HfDeepSpeedConfig(ds_config) which tells transformers that deepspeed is used and its stage.

Could you insert:

from transformers.deepspeed import is_deepspeed_zero3_enabled
print(f"Deepspeed 3 is enabled: {is_deepspeed_zero3_enabled()}")

before: ds_engine.module.generate, so that your code looks like:

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
from transformers.deepspeed import is_deepspeed_zero3_enabled
print(f"Deepspeed 3 is enabled: {is_deepspeed_zero3_enabled()}")
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

and see if it reports: "Deepspeed 3 is enabled: True"


stas00 commented Apr 8, 2022

Further, for now please switch to this branch of deepspeed, deepspeedai/DeepSpeed#1884, as it contains essential fixes for inference offloading bugs. As the Deepspeed team is on spring break it'll take a few weeks before it's merged into master.

You can install it directly like so:

pip install git+https://github.com/microsoft/DeepSpeed@olruwase/zero_inference_type_mismatch

Please install this branch and then try again.

Note: this branch is a bit slow at the moment as prefetch is currently not working, but it'll get fixed once the Deepspeed team is back to work. So it'll be faster once it's enabled again.


stas00 commented Apr 8, 2022

Here is the nvme offload version that I tested with. Works great even with 1x or 2x tiny gpu - I didn't see more than 3GB used on each, but it's slow of course.

#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py


from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "bigscience/T0"
#model_name = "bigscience/T0_3B"

config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you
#   don't want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to
#   control which params should remain on gpus - the larger the value the smaller the offload size
#
# For indepth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# XXX: modified this script to use nvme offload so need to explain the new configs, but the key is
# to change the path to `nvme_path`

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme0/offload",
            "pin_memory": True,
            "buffer_count": 6,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "aio": {
            "block_size": 262144,
            "queue_depth": 32,
            "thread_count": 1,
            "single_submit": False,
            "overlap_events": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
#from transformers.deepspeed import is_deepspeed_zero3_enabled
#print(f"Deepspeed 3 is enabled: {is_deepspeed_zero3_enabled()}")
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

gportill (Author) commented:

The traceback you pasted is from the launcher, not the actual program. There are 2 independent programs: the launcher starts your actual program, and its traceback only says that it detected that your program failed. Do you have the traceback from your program itself?

That is the full output when I run the program to use the T0 model. There are a few additional lines above what I posted, but there is no additional traceback info. I'll post the full output here (this is before I executed pip install git+https://github.com/microsoft/DeepSpeed@olruwase/zero_inference_type_mismatch):

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2022-04-12 09:16:43,200] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 621696 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 621695) of binary: /home/gwightman/miniconda3/envs/confidence_estimation/bin/python
Traceback (most recent call last):
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
deepspeed_example.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-12_09:17:25
  host      : lambda1
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 621695)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 621695
=======================================================

Further, for now please switch to this branch of deepspeed microsoft/DeepSpeed#1884 as it has essential inference offloading bugs that have been fixed in this branch.

I've switched over to this branch.

Could you insert:

from transformers.deepspeed import is_deepspeed_zero3_enabled
print(f"Deepspeed 3 is enabled: {is_deepspeed_zero3_enabled()}")

before: ds_engine.module.generate
...
and see if it reports: "Deepspeed 3 is enabled: True"

When running the zero inference example with T0_3B, the program outputs "Deepspeed 3 is enabled: True" (twice) and successfully returns predictions for the two examples.

When I try to use the same zero inference example with T0, I get the same error as above (still without any extra traceback info). It does not output "Deepspeed 3 is enabled: True", so it must be exiting the program before it reaches that line.

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2022-04-12 09:27:00,049] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 622851 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 622850) of binary: /home/gwightman/miniconda3/envs/confidence_estimation/bin/python
Traceback (most recent call last):
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
deepspeed_example.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-12_09:27:43
  host      : lambda1
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 622850)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 622850
=======================================================

gportill (Author) commented:

Here is the nvme offload version that I tested with. Works great even with 1x or 2x tiny gpu - I didn't see more than 3GB used on each, but it's slow of course.

I tried to run this example and got another error when running it as python -m torch.distributed.run --nproc_per_node=2 <script.py>:

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2022-04-12 09:35:17,339] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
  File "nvme_offload_example.py", line 130, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1838, in from_pretrained
    with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 702, in __init__
    self.param_swapper = AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 40, in __init__
    aio_op = AsyncIOBuilder().load(verbose=False)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 463, in load
    return self.jit_load(verbose)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 467, in jit_load
    raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
Traceback (most recent call last):
  File "nvme_offload_example.py", line 130, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1838, in from_pretrained
    with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 702, in __init__
    self.param_swapper = AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 40, in __init__
    aio_op = AsyncIOBuilder().load(verbose=False)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 463, in load
    return self.jit_load(verbose)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 467, in jit_load
    raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 623509) of binary: /home/gwightman/miniconda3/envs/confidence_estimation/bin/python
Traceback (most recent call last):
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
nvme_offload_example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-04-12_09:35:49
  host      : lambda1
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 623510)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-12_09:35:49
  host      : lambda1
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 623509)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Now that I've switched to a new branch of deepspeed, can I once again use the deepspeed command to run the program?

If so, here's the output I get when running that example with deepspeed --num_gpus 2 <script.py>:

[2022-04-12 09:43:19,472] [WARNING] [runner.py:155:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-04-12 09:43:19,524] [INFO] [runner.py:453:main] cmd = /home/gwightman/miniconda3/envs/confidence_estimation/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 nvme_offload_example.py
[2022-04-12 09:43:19,983] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-04-12 09:43:19,983] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-04-12 09:43:19,983] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-04-12 09:43:19,983] [INFO] [launch.py:123:main] dist_world_size=2
[2022-04-12 09:43:19,983] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2022-04-12 09:43:22,140] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
  File "nvme_offload_example.py", line 130, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1838, in from_pretrained
    with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 702, in __init__
    self.param_swapper = AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 40, in __init__
    aio_op = AsyncIOBuilder().load(verbose=False)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 463, in load
Traceback (most recent call last):
  File "nvme_offload_example.py", line 130, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1838, in from_pretrained
    return self.jit_load(verbose)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 467, in jit_load
    raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
    with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 702, in __init__
    self.param_swapper = AsyncPartitionedParameterSwapper(_ds_config, self.dtype)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 40, in __init__
    aio_op = AsyncIOBuilder().load(verbose=False)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 463, in load
    return self.jit_load(verbose)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 467, in jit_load
    raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
[2022-04-12 09:44:02,034] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 624172
[2022-04-12 09:44:02,034] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 624173
[2022-04-12 09:44:02,034] [ERROR] [launch.py:184:sigkill_handler] ['/home/gwightman/miniconda3/envs/confidence_estimation/bin/python', '-u', 'nvme_offload_example.py', '--local_rank=1'] exits with return code = 1

I saw your post DeepSpeed #1037 saying that I might need to do apt install libaio-dev, but I see this:

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?

I'm going to check whether this is just a permissions issue; hopefully that will fix it.


stas00 commented Apr 12, 2022

I saw your post deepspeedai/DeepSpeed#1037 saying that I might need to do apt install libaio-dev, but I see this:

Yes, you need to sudo apt install libaio-dev

If for any reason you have an issue with installing libaio system-wide here is how to install it via conda if you use the latter: deepspeedai/DeepSpeed#1890

So let's try the nvme solution once you installed libaio-dev


Regarding the failure to start with T0: I wonder if your kernel kills the program because it tries to use about 4x the CPU memory of the 3B model that works, and on 2 GPUs that's a huge amount of additional memory (64GB more). Perhaps something gets logged in /var/log/syslog?

How much cpu memory do you have on this host?
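
If it helps, here is a rough diagnostic sketch (it assumes a Linux host with an Ubuntu-style /var/log/syslog, which may require sudo to read) that prints the total RAM and any recent OOM-killer entries in one go:

import re

# total CPU RAM as reported by the kernel
with open("/proc/meminfo") as f:
    mem_total_kb = int(re.search(r"MemTotal:\s+(\d+)", f.read()).group(1))
print(f"MemTotal: {mem_total_kb / 2**20:.1f} GiB")

# recent OOM-killer activity, if any
with open("/var/log/syslog", errors="ignore") as f:
    for line in f:
        if "Out of memory" in line or "oom-kill" in line:
            print(line.rstrip())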

Perhaps, try the low_cpu_mem approach:

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, low_cpu_mem_usage=True)

but the main branch currently doesn't work for all models (it silently fails to use less memory). I have this PR that fixes the problem for all models:
#16657

Deepspeed should really have a parameter that defines how much CPU memory can be used.

gportill (Author) commented:

So let's try the nvme solution once you installed libaio-dev

I was able to install libaio-dev and tried to run the nvme offload solution again.

I'm getting a permission error related to nvme: PermissionError: [Errno 13] Permission denied: '/mnt/nvme0'

PermissionError: [Errno 13] Permission denied: '/mnt/nvme0'
Traceback (most recent call last):
  File "nvme_offload_example.py", line 130, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)#, low_cpu_mem_usage=True)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 446, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1838, in from_pretrained
    with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
  File "/home/gwightman/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 702, in __init__
    self.param_swapper = AsyncPartitionedParameterSwapper(_ds_config)
  File "/home/gwightman/DeepSpeed/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 43, in __init__
    self._configure_aio(ds_config)
  File "/home/gwightman/DeepSpeed/deepspeed/runtime/swap_tensor/partitioned_param_swapper.py", line 90, in _configure_aio
    os.makedirs(self.swap_folder, exist_ok=True)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 1 more time]
  File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/mnt/nvme0'
[2022-04-18 17:18:00,169] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 776505
[2022-04-18 17:18:00,169] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 776506
[2022-04-18 17:18:00,169] [ERROR] [launch.py:184:sigkill_handler] ['/home/gwightman/miniconda3/envs/confidence_estimation/bin/python', '-u', 'nvme_offload_example.py', '--local_rank=1'] exits with return code = 1

How much cpu memory do you have on this host?

128663 M total memory
19816 M used memory
41032 M active memory
31464 M inactive memory
54663 M free memory
751 M buffer memory
53431 M swap cache
2047 M total swap
2033 M used swap
14 M free swap

Here's the output of /var/log/syslog:

Apr 18 17:25:16 lambda1 kernel: [1661615.493653] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1010.slice/session-744.scope,task=python,pid=777012,uid=1010
Apr 18 17:25:16 lambda1 kernel: [1661615.493753] Out of memory: Killed process 777012 (python) total-vm:73502680kB, anon-rss:46787484kB, file-rss:68712kB, shmem-rss:8063664kB, UID:1010 pgtables:112820kB oom_score_adj:0
Apr 18 17:25:16 lambda1 kernel: [1661615.496313] Cannot map memory with base addr 0x7f9adc000000 and size of 0x8000 pages
Apr 18 17:25:17 lambda1 kernel: [1661616.474204] oom_reaper: reaped process 777012 (python), now anon-rss:0kB, file-rss:68876kB, shmem-rss:8063796kB

Perhaps, try the low_cpu_mem approach:

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, low_cpu_mem_usage=True)

but the main branch currently doesn't work for all models (it silently fails to use less memory). I have this PR that fixes the problem for all models: #16657

Deepspeed should really have a parameter that defines how much CPU memory can be used.

I tried the low-memory approach, and I got a message saying that low_cpu_mem_usage is not available with DeepSpeed ZeRO stage 3, so I changed it to stage 2. I got this error:

 File "/home/gwightman/miniconda3/envs/confidence_estimation/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError    : return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)CUDA out of memory. Tried to allocate 64.00 MiB (GPU 1; 23.70 GiB total capacity; 21.99 GiB already allocated; 52.81 MiB free; 21.99 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
RuntimeError
: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.69 GiB total capacity; 21.93 GiB already allocated; 48.44 MiB free; 21.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2022-04-18 17:36:36,198] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 777975
[2022-04-18 17:36:36,198] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 777976
[2022-04-18 17:36:36,198] [ERROR] [launch.py:184:sigkill_handler] ['/home/gwightman/miniconda3/envs/confidence_estimation/bin/python', '-u', 'deepspeed_example.py', '--local_rank=1'] exits with return code = 1


stas00 commented Apr 19, 2022

So let's try the nvme solution once you installed libaio-dev

I was able to install libaio-dev and tried to run the nvme offload solution again.

I'm getting a permission error related to nvme: PermissionError: [Errno 13] Permission denied: '/mnt/nvme0'

Apologies if it wasn't obvious - you were meant to edit that path to one on your own filesystem. It just happened to be /mnt/nvme0 on my setup.

How much cpu memory do you have on this host?

128663 M total memory

So ~128GB of CPU RAM.

When dealing with huge models it always helps to have some swap memory, which extends your effective CPU memory.

Here's the output of /var/log/syslog:

Apr 18 17:25:16 lambda1 kernel: [1661615.493653] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1010.slice/session-744.scope,task=python,pid=777012,uid=1010
Apr 18 17:25:16 lambda1 kernel: [1661615.493753] Out of memory: Killed process 777012 (python) total-vm:73502680kB, anon-rss:46787484kB, file-rss:68712kB, shmem-rss:8063664kB, UID:1010 pgtables:112820kB oom_score_adj:0

So yes, as expected your system kills the process, as it consumes too much CPU memory.

I tried the low-memory approach, and I got a message saying that low_cpu_mem_usage is not available with DeepSpeed ZeRO stage 3, so I changed it to stage 2. I got this error:

Ah, yes, sorry, that is still a work in progress. I will need to work on having low_cpu_mem_usage support Deepspeed stage 3; we are also discussing other ways of loading directly onto the GPU so that 2x the model size is not required in CPU memory, which gets further multiplied by the number of GPUs. So here it tries to allocate 40GB * 2 * 2 = 160GB of CPU memory and of course it fails.
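
To spell out that arithmetic (rough numbers: ~40GB of fp32 weights, 2 copies held in CPU RAM while from_pretrained runs, and one process per GPU under the launcher):

# back-of-the-envelope sketch, not a measurement
checkpoint_gb = 40        # approximate size of the T0 weights
copies_per_process = 2    # loaded state_dict + instantiated model, both in CPU RAM
num_processes = 2         # one python process per GPU
print(f"~{checkpoint_gb * copies_per_process * num_processes}GB CPU RAM at peak")  # ~160GB > the 128GB available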

Hmm, staggered loading should overcome this issue as well, basically having the 2nd instance of the script insert a delay before from_pretrained so that both don't try to load at the same time. It may or may not run into barriers. Have to have a look. But something like:

import time
[...]
if local_rank == 1: # stagger the loading
    time.sleep(120)    # should be long enough for rank 0 to finish `from_pretrained`
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Actually the staggering most likely won't work, since deepspeed's zero.Init will want all gpus to run in sync to shard the weights across GPUs. So don't waste time on trying this one.


stas00 commented Apr 19, 2022

Here is a recipe to add swap memory, of course, edit the path and the desired amount of GBs

### Add a new swap file or extend one ###

# turn off all swap processes
sudo swapoff -a

# add 128GB file (or resize it if it already exists)
sudo dd if=/dev/zero of=/mnt/nvme0/swapfile bs=1G count=128

# prep as swap
sudo chmod 600 /mnt/nvme0/swapfile
sudo chown root.root /mnt/nvme0/swapfile
sudo mkswap /mnt/nvme0/swapfile

# activate the swap file
sudo swapon /mnt/nvme0/swapfile

# check the amount of swap available
grep SwapTotal /proc/meminfo

# to make permanent add to /etc/fstab if it isn’t already there
/mnt/nvme0/swapfile none swap sw 0 0


stas00 commented Apr 20, 2022

  1. OK, so first we want to shard the T0 checkpoint which then allows us to have smaller chunks to keep in memory while loading the model.
# shard it to 10GB / shard
python -c "from transformers import AutoModelForSeq2SeqLM; model = AutoModelForSeq2SeqLM.from_pretrained('bigscience/T0'); model.save_pretrained('t0-sharded')"

now use "t0-sharded" as a model name

(at some point we will have a sharded version on the hub)

you can shard it into even smaller chunks, say of 5GB:

python -c 'from transformers import AutoModelForSeq2SeqLM; \
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B"); \
model.save_pretrained("t0-sharded", max_shard_size="5GB")'

I'd say do the latter for this experiment.

and of course 't0-sharded' is where it gets saved - so you can play with the path.
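
In the inference script this just means pointing from_pretrained at the local directory - a small sketch, assuming the shards were saved to ./t0-sharded (note that model.save_pretrained does not save the tokenizer, so that still comes from the hub, and in the ZeRO script the HfDeepSpeedConfig setup must still happen before this call, as in the examples above):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0")     # tokenizer still from the hub
model = AutoModelForSeq2SeqLM.from_pretrained("./t0-sharded")  # weights from the local shards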

  2. Then we want this PR, "[modeling_utils] use less cpu memory with sharded checkpoint loading" #16844, which uses even less CPU memory; I hope it will get merged shortly.

With these 2 fixes we will still need 42+10GB (or 42+5 if you sharded to 5GB) per process - which may just be enough for what you need - i.e. at 47*2 = 94GB => should fit into 128GB.

Please let me know if this unblocks you.


another way:

With Deepspeed nvme + cpu offload, 1 GPU should be enough, as you only need to be able to load the single largest layer. And if you don't care about parallel input processing you're not gaining anything from 2 GPUs anyway when using nvme offload (I think - I haven't measured, so I could be wrong).


and I still want to try to work out low_cpu_mem_usage=True with deepspeed zero-3

AADeLucia commented:

@stas00, thank you so much for your help! I'm answering for @gportill since we were working on this issue together.

Summary of what worked:

  1. Install transformers and DeepSpeed from GitHub

    pip install git+http://github.com/huggingface/transformers.git#egg=transformers
    pip install git+http://github.com/microsoft/DeepSpeed.git#egg=deepspeed
  2. If using NVME offload, set up Linux-native asynchronous I/O facility:

    sudo apt install libaio-dev
  3. If using CPU offload, increase swap memory with Stas' directions: bigscience/T0 multi-gpu inference exits with return code -9 #16616 (comment)

  4. Load sharded model (only some models are available sharded, T0 and T0pp included)

    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, revision="sharded")

Full working example:

This example was modified from #15399 (comment) and assumes all of the "summary of what worked" steps were taken.

#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it. or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
# run faster if you don't want offload to CPU - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py

# Imports
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.deepspeed import HfDeepSpeedConfig
import deepspeed
import os
# To avoid warnings about parallelism in tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import torch
from argparse import ArgumentParser


#################
# DeepSpeed Config
#################
def generate_ds_config(args):
    """
    ds_config notes

    - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
    faster.

    - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
    all official t5 models are bf16-pretrained

    - set offload_param.device to "none" or completely remove the `offload_param` section if you
    don't want CPU offload

    - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to
    control which params should remain on gpus - the larger the value the smaller the offload size

    For indepth info on Deepspeed config see
    https://huggingface.co/docs/transformers/main/main_classes/deepspeed
    keeping the same format as json for consistency, except it uses lower case for true/false
    fmt: off
    """

    model_config = AutoConfig.from_pretrained(args.model_name)
    world_size = int(os.getenv("WORLD_SIZE", "1"))
    model_hidden_size = model_config.d_model

    # batch size has to be divisible by world_size, but can be bigger than world_size
    train_batch_size = args.batch_size * world_size

    config = {
        "fp16": {
            "enabled": False
        },
        "bf16": {
            "enabled": False
        },
        "zero_optimization": {
            "stage": 3,
            "offload_param": {
                "device": args.offload,
                "nvme_path": args.nvme_offload_path,
                "pin_memory": True,
                "buffer_count": 6,
                "buffer_size": 1e8,
                "max_in_cpu": 1e9
            },
            "aio": {
                "block_size": 262144,
                "queue_depth": 32,
                "thread_count": 1,
                "single_submit": False,
                "overlap_events": True
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": model_hidden_size * model_hidden_size,
            "stage3_prefetch_bucket_size": 0.1 * model_hidden_size * model_hidden_size,
            "stage3_max_live_parameters": 1e8,
            "stage3_max_reuse_distance": 1e8,
            "stage3_param_persistence_threshold": 10 * model_hidden_size
        },
        "steps_per_print": 2000,
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": 1,
        "wall_clock_breakdown": False
    }
    return config


#################
# Helper Methods
#################
def parse_args():
    """Parse program options"""
    parser = ArgumentParser()
    parser.add_argument("--model-name", default="bigscience/T0", help="Name of model to load.")
    parser.add_argument("--offload", choices=["nvme", "cpu", "none"], default="none",
                        help="DeepSpeed optimization offload choices for ZeRO stage 3.")
    parser.add_argument("--nvme-offload-path", default="/tmp/nvme-offload",
                        help="Path for NVME offload. Ensure path exists with correct write permissions.")
    parser.add_argument("--batch-size", default=1, help="Effective batch size is batch-size * # GPUs")
    return parser.parse_args()


#################
# Main
#################
# Distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
args = parse_args()
ds_config = generate_ds_config(args)

# fmt: on
# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
# Special version of T0
revision = None
if args.model_name in ["bigscience/T0", "bigscience/T0pp"]:
    revision = "sharded"
model = AutoModelForSeq2SeqLM.from_pretrained(args.model_name, revision=revision)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
# And of course if you have just one input to process you then need to pass the same string to both gpus
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(args.model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)

# synced_gpus (bool, optional, defaults to False) —
# Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs —
# Additional model specific keyword arguments will be forwarded to the forward function of the model.
# If model is an encoder-decoder model the kwargs should include encoder_outputs.
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}\n")

And the command used to run it:

export CUDA_LAUNCH_BLOCKING=0
export OMP_NUM_THREADS=1
python -m torch.distributed.run --nproc_per_node=2 T0_inference.py 

@stas00 (Contributor) commented May 6, 2022

That's a really neat summary and code parametrization, @AADeLucia - great work!

Just to add that with the sharded model it's now possible to infer T0 (42GB) and other similar models in fp32 using just 2x 24GB gpus, w/ deepspeed w/o any offload.

But if you have smaller GPUs, only one GPU, or a larger model, the above script lets you offload to CPU RAM if you have a lot of it, or otherwise to an NVMe device; each option makes performance progressively slower.
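
For example, with the parametrized script above saved as T0_inference.py, the three modes would be launched roughly like this:

# 2x 24GB GPUs, sharded T0 in fp32, no offload
python -m torch.distributed.run --nproc_per_node=2 T0_inference.py --offload none

# 1 GPU + CPU offload (needs plenty of CPU RAM and/or swap)
python -m torch.distributed.run --nproc_per_node=1 T0_inference.py --offload cpu

# 1 GPU + NVMe offload (the path must exist on an NVMe drive and be writable)
python -m torch.distributed.run --nproc_per_node=1 T0_inference.py --offload nvme --nvme-offload-path /path/to/nvme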

And once:

  1. transformers>4.18.0
  2. deepspeed>0.6.3

are available you can install the released versions instead of the git versions.
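
e.g. once those releases are out, something along the lines of:

pip install "transformers>4.18.0" "deepspeed>0.6.3"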
