added chat model support for yi
guocuimi committed Nov 23, 2023
1 parent 83cf084 commit 68854da
Showing 9 changed files with 116 additions and 47 deletions.
63 changes: 38 additions & 25 deletions README.md
@@ -19,12 +19,14 @@ In the coming weeks, we have exciting plans to focus on [**_speculative decoding
## Table of contents

- [Overview](#overview)
- [Supported Models](#supported-models)
- [Get Started](#get-started)
- [Docker Container](#docker-container)
- [ScaleLLM server](#scalellm-server)
- [Rest API Server](#rest-api-server)
- [Chatbot UI](#chatbot-ui)
- [Docker Compose](#docker-compose)
- [Usage Examples](#usage-examples)
- [Supported Models](#supported-models)
- [Quatization](#quatization)
- [Quantization](#quantization)
- [Limitations](#limitations)
- [Contributing](#Contributing)
- [Acknowledgements](#acknowledgements)
@@ -44,6 +46,33 @@ ScaleLLM is a cutting-edge inference system engineered for large language models
- [Customizable](): Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
- [Production Ready](): Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.


## Supported Models

Please note that to use Yi models, you need to add `--model_type=Yi` to the command line. For example:
```bash
docker run -it --gpus=all --net=host --shm-size=1g \
-v $HOME/.cache/huggingface/hub:/models \
-e HF_MODEL_ID=01-ai/Yi-34B-Chat-4bits \
-e DEVICE=auto \
docker.io/vectorchai/scalellm:latest --logtostderr --model_type=Yi
```

| Models | Tensor Parallel | Quantization | Chat API | HF models examples |
| :--------: | :-------------: | :----------: | :------: | :---------------------------:|
| Yi | Yes | Yes | Yes |[01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B), [01-ai/Yi-34B-Chat-4bits](https://huggingface.co/01-ai/Yi-34B-Chat-4bits), [01-ai/Yi-6B-200K](https://huggingface.co/01-ai/Yi-6B-200K), [casperhansen/yi-6b-awq](https://huggingface.co/casperhansen/yi-6b-awq), [TheBloke/Yi-34B-GPTQ](https://huggingface.co/TheBloke/Yi-34B-GPTQ) |
| Llama2 | Yes | Yes | Yes | [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b), [TheBloke/Llama-2-13B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ), [TheBloke/Llama-2-70B-AWQ](https://huggingface.co/TheBloke/Llama-2-70B-AWQ) |
| Aquila | Yes | Yes | Yes | [BAAI/Aquila-7B](https://huggingface.co/BAAI/Aquila-7B), [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) |
| Bloom | Yes | Yes | No | [bigscience/bloom](https://huggingface.co/bigscience/bloom) |
| GPT_j | Yes | Yes | No | [EleutherAI/gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) |
| GPT_NeoX | Yes | Yes | No | [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) |
| GPT2 | Yes | Yes | No | [gpt2](https://huggingface.co/gpt2)|
| InternLM | Yes | Yes | Yes | [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) |
| Mistral | Yes | Yes | Yes | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| MPT | Yes | Yes | No | [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b) |

If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on [GitHub Issues](https://github.com/vectorch-ai/ScaleLLM/issues).

## Getting Started

The easiest way to get started with our project is by using the official Docker images. If you don't have Docker installed, please follow the installation instructions for your platform.
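
For example, you can pull the server image ahead of time (the image name is the same one used in the commands below):

```bash
docker pull docker.io/vectorchai/scalellm:latest
```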
@@ -55,9 +84,9 @@ You can download and install Docker from the official website: [Docker Installat
> **Note**<br />
> To use GPUs, you also need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
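
As a quick sanity check, you can confirm the toolkit is wired into Docker before pulling any ScaleLLM images. This is a sketch based on NVIDIA's documented sample workload; the base image and runtime flags may differ on your setup:

```bash
# Should print the same GPU table as running nvidia-smi on the host.
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```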
### Docker Container
### ScaleLLM server

Once you have Docker installed, you can run our project's Docker container using the following command:
Once you have Docker installed, you can run the ScaleLLM Docker container using the following command:

```bash
docker run -it --gpus=all --net=host --shm-size=1g \
@@ -80,7 +109,7 @@ This command starts the Docker container with GPU support and various configurat
> **Note**<br />
> Although ScaleLLM supports both `CPU` and `GPU`, we recommend using GPU for better performance. CPU support is mainly for debugging and testing purposes, so the performance might be sub-optimal. If you want to use CPU, please set `DEVICE=cpu` in the command.
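
For instance, here is a minimal sketch of the earlier command adapted for a CPU-only run (the model ID is reused from the table above, and `--gpus=all` is dropped since no GPU is needed):

```bash
docker run -it --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=01-ai/Yi-6B \
  -e DEVICE=cpu \
  docker.io/vectorchai/scalellm:latest --logtostderr --model_type=Yi
```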
### Ports and Endpoints
#### Ports and Endpoints

After running the Docker container, two ports are exposed:

@@ -108,7 +137,7 @@ docker run -it --net=host \

The REST API Server is available on `localhost:8080`. You can use REST API requests to interact with the system. Check out the [Usage Examples](#usage-examples) section for more details.
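
As a quick illustration, a chat request could look like the sketch below. The route and payload fields are assumptions based on the OpenAI-compatible Python client shown in the Usage Examples; adjust them to the actual API:

```bash
# Hypothetical request against an OpenAI-style chat route on port 8080.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "01-ai/Yi-34B-Chat-4bits",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```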

### Local Chatbot UI
### Chatbot UI

A local Chatbot UI is also available on [localhost:3000](localhost:3000). You can start it with the following command:

@@ -119,7 +148,7 @@ docker run -it --net=host \
docker.io/vectorchai/chatbot-ui:latest
```

## Docker Compose
### Docker Compose

Using Docker Compose is the easiest way to run ScaleLLM with all the services together. If you don't have Docker Compose installed, please follow the [installation doc](https://docs.docker.com/compose/install/) for your platform.
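
A minimal sketch of the workflow (the compose file name here is hypothetical; use the file shipped in the ScaleLLM repository):

```bash
# Start the ScaleLLM server, REST API server, and Chatbot UI together.
docker compose -f docker-compose.yml up -d
# Stop and remove the services when finished.
docker compose -f docker-compose.yml down
```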

@@ -231,23 +260,6 @@ for chunk in completion:
print(content, end="")
```

## Supported Models

| Models | Tensor Parallel | Quantization | Chat API | HF models examples |
| :--------: | :-------------: | :----------: | :------: | :---------------------------:|
| Yi | Yes | Yes | No |[01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B), [01-ai/Yi-6B-200K](https://huggingface.co/01-ai/Yi-6B-200K), [casperhansen/yi-6b-awq](https://huggingface.co/casperhansen/yi-6b-awq), [TheBloke/Yi-34B-GPTQ](https://huggingface.co/TheBloke/Yi-34B-GPTQ) |
| Llama2 | Yes | Yes | Yes | [meta-llama/Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b), [TheBloke/Llama-2-13B-chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ), [TheBloke/Llama-2-70B-AWQ](https://huggingface.co/TheBloke/Llama-2-70B-AWQ) |
| Aquila | Yes | Yes | Yes | [BAAI/Aquila-7B](https://huggingface.co/BAAI/Aquila-7B), [BAAI/AquilaChat-7B](https://huggingface.co/BAAI/AquilaChat-7B) |
| Bloom | Yes | Yes | No | [bigscience/bloom](https://huggingface.co/bigscience/bloom) |
| GPT_j | Yes | Yes | No | [EleutherAI/gpt-j-6b](https://huggingface.co/EleutherAI/gpt-j-6b) |
| GPT_NeoX | Yes | Yes | No | [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) |
| GPT2 | Yes | Yes | No | [gpt2](https://huggingface.co/gpt2)|
| InternLM | Yes | Yes | Yes | [internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) |
| Mistral | Yes | Yes | Yes | [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) |
| MPT | Yes | Yes | No | [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b) |

If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on [GitHub Issues](https://github.com/vectorch-ai/ScaleLLM/issues).

## Quantization
Quantization is a crucial process for reducing the memory footprint of models. ScaleLLM offers support for two quantization techniques: Accurate Post-Training Quantization ([GPTQ](https://arxiv.org/abs/2210.17323)) and Activation-aware Weight Quantization ([AWQ](https://arxiv.org/abs/2306.00978)), with seamless integration into the following libraries: autogptq, exllama, exllamav2, and awq.
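
Serving a pre-quantized checkpoint works the same way as the commands above; only the model ID changes. Here is a sketch reusing a GPTQ model from the supported-models table:

```bash
docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ \
  -e DEVICE=auto \
  docker.io/vectorchai/scalellm:latest --logtostderr
```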

@@ -258,6 +270,7 @@ Quantization is a crucial process for reducing the memory footprint of models. S
There are several known limitations we are looking to address in the coming months, including:

- Only supports Hugging Face models with [fast tokenizers](https://github.com/huggingface/tokenizers).
- Only supports GPUs newer than the Turing architecture.

## Contributing

11 changes: 8 additions & 3 deletions src/model_loader/model_loader.cpp
@@ -180,17 +180,22 @@ bool HFModelLoader::load_model_args(const std::string& model_weights_path) {
return false;
}

std::string model_type;
if (auto data = reader.value<std::string>("model_type")) {
args_.model_type() = data.value();
model_type = data.value();
} else {
GLOG(ERROR) << "Failed to find model_type in " << args_file_path;
return false;
}

auto args_loader = ModelRegistry::get_model_args_loader(args_.model_type());
// override model type from gflag if exists
if (!FLAGS_model_type.empty()) {
model_type = FLAGS_model_type;
}
auto args_loader = ModelRegistry::get_model_args_loader(model_type);
if (args_loader == nullptr) {
GLOG(ERROR) << "Failed to find model args loader for model type "
<< args_.model_type();
<< model_type;
return false;
}
args_loader(reader, &args_);
4 changes: 4 additions & 0 deletions src/models/args.h
@@ -1,6 +1,7 @@
#pragma once

#include <optional>
#include <unordered_set>

#include "common/arg.h"
#include "common/process_group.h"
@@ -83,6 +84,9 @@ struct ModelArgs {

// whether to apply residual connection post layernorm
DEFINE_ARG(bool, residual_post_layernorm) = false;

// Stop token ids
DEFINE_ARG(std::unordered_set<int32_t>, stop_token_ids);
};

inline std::ostream& operator<<(std::ostream& os, const ModelArgs& args) {
1 change: 0 additions & 1 deletion src/models/huggingface/llama.h
@@ -141,7 +141,6 @@ class LlamaAttentionImpl : public torch::nn::Module {
torch::Tensor positions,
KVCache& kv_cache,
const InputParameters& input_params) {
const auto num_tokens = x.size(0);
// (num_tokens, dim) x (dim, n_local_heads * head_dim)
// => (num_tokens, n_local_heads * head_dim)
auto qkv = qkv_proj_(x).split(/*split_size=*/qkv_sizes_, /*dim=*/-1);
62 changes: 50 additions & 12 deletions src/models/huggingface/yi.h
@@ -2,6 +2,8 @@

#include <torch/torch.h>

#include <unordered_set>

#include "layers/activation.h"
#include "layers/attention_rope.h"
#include "layers/embedding.h"
@@ -190,34 +192,39 @@ class YiDecoderLayerImpl : public torch::nn::Module {
YiAttention(args, quant_args, parallel_args, dtype, device));
mlp_ = register_module(
"mlp", YiMLP(args, quant_args, parallel_args, dtype, device));
ln1_ = register_module(
"ln1", RMSNorm(args.hidden_size(), args.rms_norm_eps(), dtype, device));
ln2_ = register_module(
"ln2", RMSNorm(args.hidden_size(), args.rms_norm_eps(), dtype, device));
input_layernorm_ = register_module(
"input_layernorm",
RMSNorm(args.hidden_size(), args.rms_norm_eps(), dtype, device));
post_attention_layernorm_ = register_module(
"post_attention_layernorm",
RMSNorm(args.hidden_size(), args.rms_norm_eps(), dtype, device));
}

torch::Tensor forward(torch::Tensor x,
torch::Tensor positions,
KVCache& kv_cache,
const InputParameters& input_params) {
auto h = x + self_attn_(ln1_(x), positions, kv_cache, input_params);
return h + mlp_(ln2_(h));
auto h =
x + self_attn_(input_layernorm_(x), positions, kv_cache, input_params);
return h + mlp_(post_attention_layernorm_(h));
}

// load the weight from the checkpoint
void load_state_dict(const StateDict& state_dict) {
// call each submodule's load_state_dict function
self_attn_->load_state_dict(state_dict.select("self_attn."));
mlp_->load_state_dict(state_dict.select("mlp."));
ln1_->load_state_dict(state_dict.select("ln1."));
ln2_->load_state_dict(state_dict.select("ln2."));
input_layernorm_->load_state_dict(state_dict.select("input_layernorm."));
post_attention_layernorm_->load_state_dict(
state_dict.select("post_attention_layernorm."));
}

void verify_loaded_weights(const std::string& prefix) const {
self_attn_->verify_loaded_weights(prefix + "self_attn.");
mlp_->verify_loaded_weights(prefix + "mlp.");
ln1_->verify_loaded_weights(prefix + "ln1.");
ln2_->verify_loaded_weights(prefix + "ln2.");
input_layernorm_->verify_loaded_weights(prefix + "input_layernorm.");
post_attention_layernorm_->verify_loaded_weights(
prefix + "post_attention_layernorm.");
}

private:
@@ -226,9 +233,9 @@ class YiDecoderLayerImpl : public torch::nn::Module {

YiMLP mlp_{nullptr};

RMSNorm ln1_{nullptr};
RMSNorm input_layernorm_{nullptr};

RMSNorm ln2_{nullptr};
RMSNorm post_attention_layernorm_{nullptr};
};
TORCH_MODULE(YiDecoderLayer);

@@ -357,8 +364,36 @@ class YiForCausalLMImpl : public torch::nn::Module {
};
TORCH_MODULE(YiForCausalLM);

class YiDialog final : public Dialog {
public:
// generate prompt from dialogs
// https://huggingface.co/01-ai/Yi-34B-Chat/blob/main/tokenizer_config.json#L60
// Prompt template:
// <|im_start|>user\n {message} <|im_end|>\n
// <|im_start|>assistant\n
std::optional<std::string> get_prompt() const override {
// at least one user message
if (messages_.size() % 2 == 0) {
return std::nullopt;
}

std::stringstream ss;
// It seems that the Yi model does not support a system message.

// then user and assistant message pairs (u/a/u/a/u...)
for (size_t i = 0; i < messages_.size(); ++i) {
const char* role = (i % 2) == 0 ? "user" : "assistant";
ss << "<|im_start|>" << role << "\n" << messages_[i] << "<|im_end|>\n";
}
// end with assistant message
ss << "<|im_start|>assistant\n";
return ss.str();
}
};

// register the causal model
REGISTER_CAUSAL_MODEL(Yi, YiForCausalLM);
REGISTER_DIALOG(Yi, YiDialog);
// register the model args
// example config:
// https://huggingface.co/01-ai/Yi-6B/blob/main/config.json
@@ -377,6 +412,9 @@ REGISTER_MODEL_ARGS(Yi, [&] {
LOAD_ARG_OR(eos_token_id, "eos_token_id", 2);
LOAD_ARG_OR(rope_theta, "rope_theta", 5000000.0f);
LOAD_ARG_OR(rope_scaling, "rope_scaling", 1.0f);

// stop token ids: "<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"
LOAD_ARG_OR(stop_token_ids, "", std::unordered_set<int32_t>({2, 6, 7, 8}));
});

} // namespace llm::hf
9 changes: 7 additions & 2 deletions src/request/sequence.cpp
@@ -46,12 +46,17 @@ bool Sequence::check_stopping_creteria() {
return is_finished_ = true;
}

const auto last_token_id = token_ids_.back();
if (!stopping_criteria_->ignore_eos_token &&
token_ids_.back() == stopping_criteria_->eos_token_id) {
last_token_id == stopping_criteria_->eos_token_id) {
finish_reason_ = FinishReason::STOP;
return is_finished_ = true;
}
// check against stop tokens ids
if (stopping_criteria_->stop_token_ids.count(last_token_id) > 0) {
finish_reason_ = FinishReason::STOP;
return is_finished_ = true;
}
// TODO: Add other stopping criterias

return false;
}
10 changes: 7 additions & 3 deletions src/request/stopping_criteria.h
@@ -1,12 +1,14 @@
#pragma once

#include <cstdint>
#include <vector>
#include <string>
#include <unordered_set>
#include <vector>

namespace llm {

// StoppingCriteria is used to specify stopping criterias for a request/sequence.
// StoppingCriteria is used to specify stopping criterias for a
// request/sequence.
struct StoppingCriteria {
// maximum number of generated tokens
size_t max_tokens = 0;
@@ -17,9 +19,11 @@ struct StoppingCriteria {
// whether to ignore eos token when checking stopping criterias
bool ignore_eos_token = false;

// stop token ids
std::unordered_set<int32_t> stop_token_ids;

// stop sequences
// std::vector<std::string> stop_sequences;

};

} // namespace llm
2 changes: 1 addition & 1 deletion src/sampling/logits_processor_test.cpp
@@ -15,7 +15,7 @@ TEST(LogitsProcessorTest, Temperature) {
TemperatureLogitsProcessor processor(temperatures, dtype, device);

int64_t batch_size = 2;
int64_t vocab_size = 5;
int64_t vocab_size = 32000;
auto logits = torch::randn({batch_size, vocab_size},
torch::dtype(dtype).device(device));
auto token_ids = torch::randint(/*high=*/vocab_size,
1 change: 1 addition & 0 deletions src/server/handlers/chat_handler.cpp
@@ -161,6 +161,7 @@ std::unique_ptr<Request> grpc_request_to_request(ChatCallData* call_data,
stopping_criteria.max_tokens = max_tokens;
// stopping_criteria.ignore_eos_token = false;
stopping_criteria.eos_token_id = model_args.eos_token_id();
stopping_criteria.stop_token_ids = model_args.stop_token_ids();

if (grpc_request.has_stream()) {
request->stream = grpc_request.stream();
Expand Down
