Optimizing LLMs
- How to cache parts of the computation
Key metrics: time to first token and throughput. These help when selecting a vendor for serving your LLMs.
There are already cloud-hosted APIs, but if we want to use open-source LLMs fine-tuned on our own data, they need to be hosted and served ourselves, and the server has to handle many requests from multiple users.
An open-source LLM from Hugging Face wrapped in Flask, even when served on powerful GPUs, handles only one request at a time.
vectorization -> allows multiple user inputs to be processed in a single operation, and multiple fine-tuned models to be served at once.
Multiple fine-tuned models at once; KV caching (speeds up inference) by storing some results of the attention calculation in memory for each token, so these calculations don't have to be redone when generating later tokens. This improves latency and throughput.
KV caching helps reduce the latency of each subsequent token.
- How autoregressive language models generate text one token at a time.
- Implement the KV caching technique.
- Batch prompts into a single tensor.
- Continuous batching: new requests join the batch as existing requests complete.
- A quantization function, which transforms a model's weights into a lower-precision representation.
- Parameter-efficient fine-tuning techniques like LoRA (Low-Rank Adaptation), which make it possible to dynamically load specialized adapters at runtime.
- Combining multiple LoRAs with continuous batching while maintaining high throughput.
The goal is to learn text generation using autoregressive models. This process can be divided into two phases, prefill and decode; KV caching is used to speed it up.
Loading the LLM:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "./models/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
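As a quick sanity check, the loaded model can be used end to end with `generate` (a minimal sketch; the prompt string is just an example):
```python
# Generate a short completion with the loaded model and tokenizer.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```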
GPT-2 is very practical for low-latency use cases.
GPT-2 is a transformer-based model (what does this mean?).
Early LLMs used an encoder-decoder architecture, where the input goes through an encoder that creates embeddings, which are then fed to the decoder.
Encoders take tokens and map them to an embedding space. BERT is an example of an encoder model that is useful for generating embeddings, e.g. for classification and similarity search.
GPT-2 doesn't have an encoder; it's a decoder-only model. We take the input, pass it through an embedding layer, and then repeatedly through attention and a multi-layer perceptron to generate the outputs. Essentially it generates one token at a time; this is called an autoregressive model. It's called autoregressive because it depends on its own past values; it can be denoted AR(p), where the order p can be read as how many lagged values it depends on. Most LLMs follow this structure. see (image)
What are logits?
When the model returns its outputs, it also returns logits, a 3D tensor whose shape has three dimensions: the batch size (1 if there is only one sentence), the number of tokens in the input, and the vocabulary size (the number of possible tokens the output can be).
Accessing the logits for the last input token of the first batch element: logits[0, -1, :] — 0 selects the first batch element, -1 selects the last token position, and : selects every vocabulary entry.
```python
import torch

# logits comes from the model's forward pass, e.g. model(**inputs).logits
last_logits = logits[0, -1, :]
next_token_id = last_logits.argmax()   # argmax picks the most likely token, e.g. tensor(13990)
tokenizer.decode(next_token_id)

# We can also look at the top 10 candidate next tokens:
top_k = torch.topk(last_logits, k=10)
tokens = [tokenizer.decode(tk) for tk in top_k.indices]
```
Does the model know how many tokens it's going to generate? Most likely not.
For KV caching you only need to pass the newly generated token; the previous tokens' keys and values are only needed during the attention computation, and those are cached. This is the separation into the two phases: the prefill phase, where we process the prompt and generate the first token, and the decode phase, where we generate the subsequent tokens using the cache.
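A minimal sketch of this prefill/decode split using the `past_key_values` cache returned by Hugging Face models (the prompt and the 10-step loop are just placeholders):
```python
import torch

# Prefill: run the whole prompt once and keep the KV cache.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values
next_token_id = out.logits[0, -1, :].argmax()

# Decode: feed only the newly generated token plus the cache at each step.
generated = [next_token_id.item()]
for _ in range(10):
    with torch.no_grad():
        out = model(input_ids=next_token_id.reshape(1, 1),
                    past_key_values=past_key_values,
                    use_cache=True)
    past_key_values = out.past_key_values
    next_token_id = out.logits[0, -1, :].argmax()
    generated.append(next_token_id.item())

print(tokenizer.decode(generated))
```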
Batching improves throughput.
When multiple inputs are given at once, padding tokens are used to make the token length the same for each input; this is needed for the matrix computation.
You'll have to configure the padding token to attach to the tokenizer and model (GPT-2 doesn't define one by default, so the EOS token is commonly reused).
Attention mask: an array of ones and zeros that tells the model which tokens to attend to (1) and which padding tokens to ignore (0).
For multiple inputs, there is one next_token_id per sequence (n of them for a batch of n).
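A small sketch of batching several prompts into one padded tensor (the prompts are illustrative; left padding keeps the last position of each row a real token, which is convenient for decoder-only models):
```python
import torch

# GPT-2 has no pad token by default, so reuse EOS and pad on the left.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

prompts = ["The quick brown fox", "Once upon a time", "Hello"]
batch = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)                    # input_ids and attention_mask go in together
next_token_ids = out.logits[:, -1, :].argmax(dim=-1)   # one id per prompt
print([tokenizer.decode(t) for t in next_token_ids])
```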
Latency is how fast a single request is processed; throughput is how many requests get processed per unit of time.
We can choose to wait and batch requests together; this hurts latency but improves throughput.
The goal is a high-throughput and low-latency architecture.
Since tokens are generated one at a time, at each step we get to choose whether to incorporate a new request into our existing batch.
A normal (static) batch takes 71 s for 32 inputs.
There's a filter step which removes requests that have either completed or reached their max tokens; it also removes padding for the remaining requests to reduce memory overhead.
Continuous batching only took 21 s.
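A toy sketch of the continuous-batching loop (the `Request` class, the `generate_next_token` stub, and all sizes here are hypothetical stand-ins for the real batched decode step):
```python
import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_tokens: int
    tokens: list = field(default_factory=list)

def generate_next_token(req):
    # Stand-in for one decode step of the real model.
    return random.randint(0, 50256)

MAX_BATCH_SIZE = 8
queue = deque(Request(f"prompt {i}", max_tokens=random.randint(3, 10)) for i in range(32))
active = []

while queue or active:
    # Admit new requests whenever the batch has free slots.
    while queue and len(active) < MAX_BATCH_SIZE:
        active.append(queue.popleft())
    # One decode step per active request (in a real server this is a single
    # batched forward pass over all active sequences).
    for req in active:
        req.tokens.append(generate_next_token(req))
    # Filter step: drop finished requests so new ones can join next iteration.
    active = [r for r in active if len(r.tokens) < r.max_tokens]
```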
Using quantization we can compress very large models to run on commodity hardware. Model size = size(datatype) * number of weights; training needs even more memory than the model size.
Floating point representation, 32 bits (FP32): 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fractional part.
Sometimes that much precision isn't needed while running training or inference of a large model, so we consider a smaller format: FP16 (half precision) uses 5 bits for the exponent and 10 bits for the fraction. Lower-precision models also take less time to train.
Or we can use BF16 (brain float 16), which uses 8 bits for the exponent but only 7 bits for the fraction; this causes values to be rounded, so they are no longer exactly "correct".
FP32: 3.1415927, FP16: 3.141, BF16: 3.140625
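A quick way to see this rounding is to cast a value through the torch dtypes (a small sketch):
```python
import torch

pi = torch.tensor(3.14159265358979, dtype=torch.float64)
print(pi.to(torch.float32).item())   # ~3.1415927
print(pi.to(torch.float16).item())   # 3.140625
print(pi.to(torch.bfloat16).item())  # 3.140625
```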
Instead of using smaller and smaller floating-point formats, we can use quantization to compress the data and keep some metadata, which can be used to recreate the data during the forward pass. This causes a small overhead on compute but a massive benefit in terms of memory.
Zero-point quantization.
Overview of quantization: compute the min and max values of the tensor, then compute the metadata, which is the scale and the zero point.
scale = (max - min) / 255, zero point = min
quantize: (tensor - zeroPoint) / scale, rounded to an integer
state = (scale, zeroPoint)
This essentially converts all numbers to unsigned integers in the range 0 to 255; it is a lossy compression.
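A minimal sketch of these two functions (the function names and the test tensor are illustrative):
```python
import torch

def quantize(t: torch.Tensor):
    # Metadata: scale and zero point from the tensor's range.
    t_min, t_max = t.min(), t.max()
    scale = (t_max - t_min) / 255
    zero_point = t_min
    # Map values into the unsigned 0..255 range and round (lossy).
    q = ((t - zero_point) / scale).round().to(torch.uint8)
    return q, (scale, zero_point)

def dequantize(q: torch.Tensor, state):
    scale, zero_point = state
    return q.to(torch.float32) * scale + zero_point

w = torch.randn(4, 4)
q, state = quantize(w)
print((w - dequantize(q, state)).abs().max())   # small reconstruction error
```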
To quantize the entire model, a new state dictionary is used to maintain the (scale, zero point) state of each named parameter of the model.
Dequantization does affect the quality of the model; there are different quantization methods.
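A sketch of applying that per parameter, reusing the `quantize` function from the previous snippet (keeping the quantized weights and their states in two separate dicts is just one possible layout):
```python
# Quantize every parameter, keeping per-parameter (scale, zero_point) metadata
# so each weight can be dequantized again at forward time.
quantized_weights = {}
quant_states = {}
for name, param in model.named_parameters():
    q, state = quantize(param.data)
    quantized_weights[name] = q
    quant_states[name] = state
```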
Serving fine-tuned LLMs trained using Low-Rank Adaptation (LoRA). This is needed to customize LLMs without changing the existing weights of the base model.
LoRA takes a certain set of parameters, like the layers surrounding the attention computation or all the layers in the model, and introduces a new set of low-rank weights (two small matrices whose product is added to the frozen original weight).
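A toy sketch of a LoRA forward pass (the dimensions, rank, and scaling here are illustrative):
```python
import torch

d, r, alpha = 1024, 8, 16
W = torch.randn(d, d)            # frozen base weight
A = torch.randn(r, d) * 0.01     # trainable low-rank factor, rank r
B = torch.zeros(d, r)            # trainable, initialized to zero

x = torch.randn(1, d)
base_out = x @ W.T
lora_out = (x @ A.T) @ B.T * (alpha / r)    # two small matmuls instead of another d x d one
y = base_out + lora_out          # same as x @ (W + (alpha / r) * B @ A).T
```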
Batching + LoRA.
There can be different LoRAs, for example: a code completion assistant trained on different portions of a code repository (each adapter trained on a different segment of data); chaining several related tasks such as classification, prioritization, and routing; or supporting multiple tenants. The goal is to serve many fine-tuned models simultaneously.
Theoretically, how does batching improve throughput in LLMs?
Vectorization -> being able to process multiple inputs in the same operation: the weights are read from memory once and applied to the whole batch, so the per-request cost drops.
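A rough illustration of that effect (the sizes and the 64-request count are arbitrary):
```python
import time
import torch

W = torch.randn(4096, 4096)                      # stand-in for a weight matrix
xs = [torch.randn(1, 4096) for _ in range(64)]   # 64 separate "requests"

start = time.time()
for x in xs:
    _ = x @ W.T                                  # weights re-read for every request
print("one at a time:", time.time() - start)

start = time.time()
_ = torch.cat(xs) @ W.T                          # one batched matmul
print("batched:      ", time.time() - start)
```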
What's the theory behind the multi-LoRA architecture? The base model weights are kept in memory once, and each request in a batch selects its own small adapter (e.g. via an index select over a stack of adapter weights), so many fine-tuned variants can share one batched forward pass.
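A toy sketch of that indexed, batched multi-LoRA computation (all shapes, the number of adapters, and the adapter assignments are illustrative):
```python
import torch

d, r, n_adapters, batch = 64, 4, 3, 5
W = torch.randn(d, d)                        # shared frozen base weight
A = torch.randn(n_adapters, r, d) * 0.01     # one A per adapter
B = torch.randn(n_adapters, d, r) * 0.01     # one B per adapter

x = torch.randn(batch, d)
adapter_idx = torch.tensor([0, 2, 1, 0, 2])  # which adapter each request uses

base_out = x @ W.T
A_sel = A[adapter_idx]                       # index select: (batch, r, d)
B_sel = B[adapter_idx]                       # index select: (batch, d, r)
lora_out = torch.bmm(torch.bmm(x.unsqueeze(1), A_sel.transpose(1, 2)),
                     B_sel.transpose(1, 2)).squeeze(1)
y = base_out + lora_out                      # mixed-adapter batch, one pass
```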
Further reading:
- Support LoRA adapters of different ranks.
- Support mixed batches of requests with and without LoRA adapters.
- Improve efficiency of the index-select step where we find the index of which LoRA to use.
- Organize by "segments" of requests using the same LoRA adapter to reduce copies in memory.
- Implement as a CUDA kernel rather than PyTorch for improved performance.
Open-source inference server.
Optimizations:
- Fine-tune the last layer
- Adapter layers (leads to increased latency)
- Prefix tuning (lightweight) (prompting by prepending data)
- LoRA