
Memory usage is doubled when loading a fp16 model into bf16 #164

Open

skyser2003 opened this issue Sep 6, 2023 · 2 comments

Labels: bug (Something isn't working)

skyser2003 commented Sep 6, 2023

Description

Model: GPT-NeoX
GPU: A100
Tritonserver version: 22.12

Hello, I'm not sure whether this is an issue with FasterTransformer itself or with the backend, but I'm reporting it here.

As the title says, my model was originally trained in fp16 on Hugging Face, and I converted it to the FasterTransformer weight format.

This is the command I used for the conversion, and the size of the resulting folder.

$ python huggingface_gptneox_convert.py -o {output_dir} -i {hf_model_dir} -infer_gpu_num 1 -model_name neox_model -weight_data_type fp16
$ du -h -d 1
25G     ./1-gpu
25G     .

As the du output shows, the converted FasterTransformer weight folder is 25 GB, the same size as the original Hugging Face model.
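For reference, a quick back-of-the-envelope check (my own arithmetic, nothing printed by the tools): at 2 bytes per fp16 parameter, 25 GB implies roughly 13B parameters, so bf16 should also take about 25 GB and fp32 about 50 GB.

# Rough per-dtype size estimate for a 25 GB fp16 checkpoint (illustrative only)
disk_bytes = 25 * 1024**3                 # 25 GB on disk
params = disk_bytes // 2                  # fp16 = 2 bytes per parameter -> ~13.4e9
for name, width in [("fp16", 2), ("bf16", 2), ("fp32", 4)]:
    print(f"{name}: {params * width / 1024**3:.0f} GB")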

The problem occurs when I load the model with tritonserver and the fastertransformer_backend.
When I load it as fp16, it loads fine:

I0906 06:12:34.269131 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation    : free: 78.56 GB, total: 79.15 GB, used:  0.60 GB
I0906 06:12:56.704958 83 libfastertransformer.cc:448] After Loading Weights:
after allocation    : free: 54.54 GB, total: 79.15 GB, used: 24.61 GB

But when I load it as bf16, it suddenly takes up twice the memory:

I0906 06:10:11.016121 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation    : free: 78.56 GB, total: 79.15 GB, used:  0.60 GB
I0906 06:11:07.674020 83 libfastertransformer.cc:448] After Loading Weights:
after allocation    : free: 30.52 GB, total: 79.15 GB, used: 48.63 GB

I guess taking twice the memory means it is loaded as fp32. Does that mean you can't load a model saved as fp16 into bf16, or is it just that the GPT-NeoX model doesn't support the bf16 format?
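For what it's worth, fp16 and bf16 are not bit-compatible (fp16 is 1 sign / 5 exponent / 10 mantissa bits, bf16 is 1 / 8 / 7), so the loader can't just reinterpret the buffer; the usual conversion path widens fp16 to fp32 and keeps the top 16 bits. A minimal numpy sketch of that path (my own illustration, not FasterTransformer code):

import numpy as np

x16 = np.float16(3.14159)
x32 = np.float32(x16)              # widening fp16 -> fp32 is exact
bits32 = x32.view(np.uint32)
bf16 = np.uint16(bits32 >> 16)     # bf16 = top 16 bits of the fp32 pattern (truncating)
print(hex(int(x16.view(np.uint16))), hex(int(bf16)))

If the backend materializes that fp32 intermediate for the whole model (pure speculation on my part, I haven't checked the source), the ~48 GB figure above is exactly what you'd expect.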

Reproduced Steps

In config.pbtxt

For fp16

parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}

For bf16

parameters {
  key: "data_type"
  value: {
    string_value: "bf16"
  }
}
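In case it helps triage: the .bin files FasterTransformer writes are raw buffers with no dtype header, so whatever reads them has to assume a dtype. A quick way to inspect what's on disk (the filename below is illustrative and may differ for GPT-NeoX):

import numpy as np

# Raw FT weight dump: the dtype must be supplied by the caller and should
# match the -weight_data_type fp16 used at conversion time above.
w = np.fromfile("neox_model/1-gpu/model.wte.bin", dtype=np.float16)
print(w.size, "values,", w.nbytes / 1024**2, "MiB as fp16")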
devin12422 commented

Did you ever find a fix for this?

skyser2003 (Author) commented

@devin12422 No. Since FasterTransformer is deprecated and TensorRT-LLM succeeded it, I just used the tensorrtllm_backend instead, and it seemed to work fine.
