
Memory usage is doubled when loading a fp16 model into bf16 #164

Open

skyser2003 opened this issue Sep 6, 2023 · 2 comments

Labels: bug (Something isn't working)

skyser2003 commented Sep 6, 2023

Description

Model: GPT-NeoX
GPU: A100
Tritonserver version: 22.12

Hello, I'm not sure whether this is an issue with FasterTransformer itself or with the backend, but I'm reporting it here.

As the title says, my model was originally trained in fp16 on Hugging Face, and I converted it to the FasterTransformer weight format.

This is the command I used for the conversion, and the size of the resulting folder.

$ python huggingface_gptneox_convert.py -o {output_dir} -i {hf_model_dir} -infer_gpu_num 1 -model_name neox_model -weight_data_type fp16
$ du -h -d 1
25G     ./1-gpu
25G     .

As the du output shows, the converted FasterTransformer weight folder is 25 GB, the same size as the original Hugging Face model.
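For reference, a quick back-of-the-envelope check (my own arithmetic, nothing printed by the tools): at 2 bytes per fp16 parameter, 25 GB implies roughly 13B parameters, so bf16 should also take about 25 GB and fp32 about 50 GB.

# Rough per-dtype size estimate for a 25 GB fp16 checkpoint (illustrative only)
disk_bytes = 25 * 1024**3                 # 25 GB on disk
params = disk_bytes // 2                  # fp16 = 2 bytes per parameter -> ~13.4e9
for name, width in [("fp16", 2), ("bf16", 2), ("fp32", 4)]:
    print(f"{name}: {params * width / 1024**3:.0f} GB")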

The problem occurs when I load the model with tritonserver and the fastertransformer_backend.
When I load it as fp16, it loads fine:

I0906 06:12:34.269131 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation    : free: 78.56 GB, total: 79.15 GB, used:  0.60 GB
I0906 06:12:56.704958 83 libfastertransformer.cc:448] After Loading Weights:
after allocation    : free: 54.54 GB, total: 79.15 GB, used: 24.61 GB

But when I load it as bf16, it suddenly takes up twice the memory:

I0906 06:10:11.016121 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation    : free: 78.56 GB, total: 79.15 GB, used:  0.60 GB
I0906 06:11:07.674020 83 libfastertransformer.cc:448] After Loading Weights:
after allocation    : free: 30.52 GB, total: 79.15 GB, used: 48.63 GB

I guess taking twice the memory means it is loaded as fp32. Does that mean you can't load a model saved as fp16 into bf16, or is it just that the GPT-NeoX model doesn't support the bf16 format?
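For what it's worth, fp16 and bf16 are not bit-compatible (fp16 is 1 sign / 5 exponent / 10 mantissa bits, bf16 is 1 / 8 / 7), so the loader can't just reinterpret the buffer; the usual conversion path widens fp16 to fp32 and keeps the top 16 bits. A minimal numpy sketch of that path (my own illustration, not FasterTransformer code):

import numpy as np

x16 = np.float16(3.14159)
x32 = np.float32(x16)              # widening fp16 -> fp32 is exact
bits32 = x32.view(np.uint32)
bf16 = np.uint16(bits32 >> 16)     # bf16 = top 16 bits of the fp32 pattern (truncating)
print(hex(int(x16.view(np.uint16))), hex(int(bf16)))

If the backend materializes that fp32 intermediate for the whole model (pure speculation on my part, I haven't checked the source), the ~48 GB figure above is exactly what you'd expect.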

Reproduced Steps

In config.pbtxt

For fp16

parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}

For bf16

parameters {
  key: "data_type"
  value: {
    string_value: "bf16"
  }
}
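In case it helps triage: the .bin files FasterTransformer writes are raw buffers with no dtype header, so whatever reads them has to assume a dtype. A quick way to inspect what's on disk (the filename below is illustrative and may differ for GPT-NeoX):

import numpy as np

# Raw FT weight dump: the dtype must be supplied by the caller and should
# match the -weight_data_type fp16 used at conversion time above.
w = np.fromfile("neox_model/1-gpu/model.wte.bin", dtype=np.float16)
print(w.size, "values,", w.nbytes / 1024**2, "MiB as fp16")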
devin12422 commented

Did you ever find a fix for this?

skyser2003 (Author) commented

@devin12422 No. Since FasterTransformer is deprecated and TensorRT-LLM succeeded it, I just used the tensorrtllm_backend instead, and it seemed to work fine.
