Error: BFloat16 Unsupported scalar when trying to execute across multiple GPUs with BFloat16 & 8-Bits #79

FTuma opened this issue Oct 18, 2022 · 2 comments


FTuma commented Oct 18, 2022

I tried to run BLOOM distributed across multiple A100 GPUs in 8-bit mode with BFloat16, but ran into this error while executing a slightly adjusted version of the example script:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
For effortless bug reporting copy-paste your error into this form:
CUDA SETUP: CUDA runtime path found: /datadrive/miniconda3/envs/petals/lib/
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/bitsandbytes/
Oct 18 09:52:07.795 [WARN] [/datadrive/repos/petals/src/client/] RemoteSequential is in active development; expect adventures
Some weights of DistributedBloomForCausalLM were not initialized from the model checkpoint at bloom-testing/test-bloomd-560m-main and are newly initialized: ['lm_head.word_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/datadrive/repos/petals/", line 17, in <module>
    remote_outputs = model.generate(inputs, max_length=100)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/torch/autograd/", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/datadrive/repos/petals/src/client/", line 113, in generate
    hidden_state = sess.step(embs, prompts=intermediate_prompts, hypo_ids=hypo_ids)[:, -1]
  File "/datadrive/repos/petals/src/client/", line 200, in step
    outputs = session.step(inputs, prompts[self.chosen_spans[0].start : self.chosen_spans[0].end], **kwargs)
  File "/datadrive/repos/petals/src/client/", line 109, in step
  File "/datadrive/repos/petals/src/client/", line 110, in <listcomp>
    serialize_torch_tensor(, proto.compression)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/hivemind/compression/", line 41, in serialize_torch_tensor
    return compression.compress(tensor, info, allow_inplace)
  File "/datadrive/miniconda3/envs/petals/lib/python3.9/site-packages/hivemind/compression/", line 83, in compress
    array = tensor.detach().numpy()
TypeError: Got unsupported ScalarType BFloat16

The code of simple_example_script:

import os
import torch
import torch.nn.functional as F
import transformers
from src import DistributedBloomForCausalLM

MODEL_NAME = "bloom-testing/test-bloomd-560m-main" #"bigscience/bloom-petals"
initial_peer = os.getenv("initial_peer")
initial_peers = [initial_peer]  # e.g. ["/ip4/"]
tokenizer = transformers.BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(
  MODEL_NAME, initial_peers=initial_peers, low_cpu_mem_usage=True, torch_dtype=torch.float32
)  # this model has only embeddings / logits, all transformer blocks rely on remote servers

# model ='cuda')
inputs = tokenizer("a cat sat", return_tensors="pt")["input_ids"]
remote_outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(remote_outputs[0]))  # "a cat sat in the back of the car,"

# "train" input embeddings by backprop through distributed transformer blocks
model.transformer.word_embeddings.weight.requires_grad = True
outputs = model.forward(input_ids=inputs)
loss = F.cross_entropy(outputs.logits.flatten(0, 1), inputs.flatten())
loss.backward()  # populate .grad before reading it below
print("Gradients (norm):", model.transformer.word_embeddings.weight.grad.norm())

Server launched via commands:

python -m cli.run_server bloom-testing/test-bloomd-560m-main --num_blocks 12 --torch_dtype bfloat16 --host_maddrs /ip4/ --load_in_8bit

python -m cli.run_server bloom-testing/test-bloomd-560m-main  --torch_dtype bfloat16 --host_maddrs /ip4/ --load_in_8bit --initial_peers /ip4/ --block_indices 12:24 --device cuda:1

The packages in the environment were installed via requirements.txt.

I only used the small version for debugging purposes; I need to distribute the model across multiple GPUs because I intend to run the 176B-parameter BLOOM version. I naively tried converting the tensor at that line to a supported dtype, but then another error occurred somewhere else down the line.

Since I want to do prompt tuning on 8x 40GB A100s, I think I have to use BFloat16 and 8-bit. Or is there another solution or workaround with good performance?
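A note on the naive dtype conversion mentioned above: bfloat16 shares the sign and exponent layout of float32 and keeps only the upper 16 bits, so widening to float32 before the `.numpy()` call loses no information. The receiving side, however, must be told to cast back, which may be why a bare cast at that one line fails later. The following is a minimal standard-library sketch of that idea, not hivemind's actual implementation. The `serialize` and `deserialize` helpers and the dict-based message are hypothetical, and narrowing is done by truncation here, whereas PyTorch rounds to nearest.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bfloat16_bits(x: float) -> int:
    """Keep the upper 16 bits of the float32 pattern (truncation, an assumption)."""
    return f32_bits(x) >> 16


def from_bfloat16_bits(b: int) -> float:
    """Widen a bfloat16 pattern to float32 by appending 16 zero bits (exact)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def serialize(bf16_patterns):
    """Hypothetical wire format: float32 payload plus the original dtype name."""
    widened = [from_bfloat16_bits(b) for b in bf16_patterns]
    payload = struct.pack(f"<{len(widened)}f", *widened)
    return {"dtype": "bfloat16", "buffer": payload}


def deserialize(message):
    """Unpack the float32 payload and narrow it again if the tag says bfloat16."""
    count = len(message["buffer"]) // 4
    floats = struct.unpack(f"<{count}f", message["buffer"])
    if message["dtype"] == "bfloat16":
        return [to_bfloat16_bits(v) for v in floats]
    return list(floats)


original = [to_bfloat16_bits(v) for v in (1.0, -2.5, 0.15625)]
restored = deserialize(serialize(original))
print(restored == original)
```

In PyTorch terms the widening step corresponds to `tensor.to(torch.float32)` before `.numpy()`. The dtype tag is the part a one-line cast leaves out, since without it the receiver has no way to know the payload was originally bfloat16.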

Hi there! Will look into that later today (AOE) and try to reproduce. On the surface, we should not have bfloat16 at that stage, so it should be easy to fix. brb.

borzunov added a commit to learning-at-home/hivemind that referenced this issue Nov 28, 2022
This PR implements bfloat16 support for `CompressionType.NONE` and `CompressionType.BLOCKWISE_8BIT`.

This is important for the Petals client, see bigscience-workshop/petals#79

Hi @FTuma!

Sorry for taking so long to look into this. The issue should be fixed now (don't forget to pull the latest main commit before trying it out).

For reference, here are the two PRs where we did that:

I will close this issue for now, but feel free to reopen it or make a new one if you run into other issues.

mryab pushed a commit to learning-at-home/hivemind that referenced this issue Nov 29, 2022
(cherry picked from commit 1e4af43)
