Optimizing LLMs
- How to cache parts of the computation
Key metrics: time to first token and throughput. These help when selecting a vendor for serving your LLMs.
There are already cloud-hosted APIs, but if we want to use open-source LLMs fine-tuned on our own data, they need to be hosted and served ourselves, and the server has to handle many requests from multiple users.
An open-source LLM from Hugging Face wrapped in Flask, even when served on powerful GPUs, handles only one request at a time.
vectorization -> allows multiple user inputs to be processed in a single operation, and multiple fine-tuned models to be served at once.
Multiple fine-tuned models at once; KV caching (speeds up inference) by storing some results of the attention calculation in memory for each token, so these calculations don't have to be redone when generating later tokens. This improves latency and throughput.
KV caching helps reduce the latency of each subsequent token.
- How autoregressive language models generate text one token at a time.
- Implement the KV caching technique.
- Batch prompts into a single tensor.
- Continuous batching: new requests join the batch as existing requests complete.
- A quantization function, which transforms a model's weights into a lower-precision representation.
- Parameter-efficient fine-tuning techniques like LoRA (Low-Rank Adaptation), which make it possible to dynamically load specialized adapters at runtime.
- Combining multiple LoRAs with continuous batching while maintaining high throughput.
The goal is to learn text generation using autoregressive models. This process can be divided into two phases, prefill and decode; KV caching is used to speed it up.
Loading the LLM:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "./models/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
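As a quick sanity check, the loaded model can be used end to end with `generate` (a minimal sketch; the prompt string is just an example):
```python
# Generate a short completion with the loaded model and tokenizer.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```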
GPT-2 is very practical for low-latency use cases.
GPT-2 is a transformer-based model (what does this mean?).
Early LLMs used an encoder-decoder architecture, where the input goes through an encoder that creates embeddings, which are then fed to the decoder.
Encoders take tokens and map them to an embedding space. BERT is an example of an encoder model that is useful for generating embeddings, e.g. for classification and similarity search.
GPT-2 doesn't have an encoder; it's a decoder-only model. We take the input, pass it through an embedding layer, and then repeatedly through attention and a multi-layer perceptron to generate the outputs. Essentially it generates one token at a time; this is called an autoregressive model. It's called autoregressive because it depends on its own past values; it can be denoted AR(p), where the order p can be read as how many lagged values it depends on. Most LLMs follow this structure. see (image)
What are logits?
When the model returns its outputs, it also returns logits, a 3D tensor whose shape has three dimensions: the batch size (1 if there is only one sentence), the number of tokens in the input, and the vocabulary size (the number of possible tokens the output can be).
Accessing the logits for the last input token of the first batch element: logits[0, -1, :] — 0 selects the first batch element, -1 selects the last token position, and : selects every vocabulary entry.
```python
import torch

# logits comes from the model's forward pass, e.g. model(**inputs).logits
last_logits = logits[0, -1, :]
next_token_id = last_logits.argmax()   # argmax picks the most likely token, e.g. tensor(13990)
tokenizer.decode(next_token_id)

# We can also look at the top 10 candidate next tokens:
top_k = torch.topk(last_logits, k=10)
tokens = [tokenizer.decode(tk) for tk in top_k.indices]
```
Does the model know how many tokens it's going to generate? Most likely not.
For KV caching you only need to pass the newly generated token; the previous tokens' keys and values are only needed during the attention computation, and those are cached. This is the separation into the two phases: the prefill phase, where we process the prompt and generate the first token, and the decode phase, where we generate the subsequent tokens using the cache.
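A minimal sketch of this prefill/decode split using the `past_key_values` cache returned by Hugging Face models (the prompt and the 10-step loop are just placeholders):
```python
import torch

# Prefill: run the whole prompt once and keep the KV cache.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values
next_token_id = out.logits[0, -1, :].argmax()

# Decode: feed only the newly generated token plus the cache at each step.
generated = [next_token_id.item()]
for _ in range(10):
    with torch.no_grad():
        out = model(input_ids=next_token_id.reshape(1, 1),
                    past_key_values=past_key_values,
                    use_cache=True)
    past_key_values = out.past_key_values
    next_token_id = out.logits[0, -1, :].argmax()
    generated.append(next_token_id.item())

print(tokenizer.decode(generated))
```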
Batching improves throughput.
When multiple inputs are given at once, padding tokens are used to make the token length the same for each input; this is needed for the matrix computation.
You'll have to configure the padding token to attach to the tokenizer and model (GPT-2 doesn't define one by default, so the EOS token is commonly reused).
Attention mask: an array of ones and zeros that tells the model which tokens to attend to (1) and which padding tokens to ignore (0).
For multiple inputs, there is one next_token_id per sequence (n of them for a batch of n).
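A small sketch of batching several prompts into one padded tensor (the prompts are illustrative; left padding keeps the last position of each row a real token, which is convenient for decoder-only models):
```python
import torch

# GPT-2 has no pad token by default, so reuse EOS and pad on the left.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

prompts = ["The quick brown fox", "Once upon a time", "Hello"]
batch = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)                    # input_ids and attention_mask go in together
next_token_ids = out.logits[:, -1, :].argmax(dim=-1)   # one id per prompt
print([tokenizer.decode(t) for t in next_token_ids])
```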
Latency is how fast a single request is processed; throughput is how many requests get processed per unit of time.
We can choose to wait and batch requests together; this hurts latency but improves throughput.
The goal is a high-throughput and low-latency architecture.
Since tokens are generated one at a time, at each step we get to choose whether to incorporate a new request into our existing batch.
A normal (static) batch takes 71 s for 32 inputs.
There's a filter step which removes requests that have either completed or reached their max tokens; it also removes padding for the remaining requests to reduce memory overhead.
Continuous batching only took 21 s.
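A toy sketch of the continuous-batching loop (the `Request` class, the `generate_next_token` stub, and all sizes here are hypothetical stand-ins for the real batched decode step):
```python
import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_tokens: int
    tokens: list = field(default_factory=list)

def generate_next_token(req):
    # Stand-in for one decode step of the real model.
    return random.randint(0, 50256)

MAX_BATCH_SIZE = 8
queue = deque(Request(f"prompt {i}", max_tokens=random.randint(3, 10)) for i in range(32))
active = []

while queue or active:
    # Admit new requests whenever the batch has free slots.
    while queue and len(active) < MAX_BATCH_SIZE:
        active.append(queue.popleft())
    # One decode step per active request (in a real server this is a single
    # batched forward pass over all active sequences).
    for req in active:
        req.tokens.append(generate_next_token(req))
    # Filter step: drop finished requests so new ones can join next iteration.
    active = [r for r in active if len(r.tokens) < r.max_tokens]
```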
Using quantization we can compress very large models to run on commodity hardware. Model size = size(datatype) * number of weights; training needs even more memory than the model size.
Floating point representation, 32 bits (FP32): 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fractional part.
Sometimes that much precision isn't needed while running training or inference of a large model, so we consider a smaller format: FP16 (half precision) uses 5 bits for the exponent and 10 bits for the fraction. Lower-precision models also take less time to train.
Or we can use BF16 (brain float 16), which uses 8 bits for the exponent but only 7 bits for the fraction; this causes values to be rounded, so they are no longer exactly "correct".
FP32: 3.1415927, FP16: 3.141, BF16: 3.140625
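A quick way to see this rounding is to cast a value through the torch dtypes (a small sketch):
```python
import torch

pi = torch.tensor(3.14159265358979, dtype=torch.float64)
print(pi.to(torch.float32).item())   # ~3.1415927
print(pi.to(torch.float16).item())   # 3.140625
print(pi.to(torch.bfloat16).item())  # 3.140625
```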
Instead of using smaller and smaller floating-point formats, we can use quantization to compress the data and keep some metadata, which can be used to recreate the data during the forward pass. This causes a small overhead on compute but a massive benefit in terms of memory.
Zero-point quantization.
Overview of quantization: compute the min and max values of the tensor, then compute the metadata, which is the scale and the zero point.
scale = (max - min) / 255, zero point = min
quantize: (tensor - zeroPoint) / scale, rounded to an integer
state = (scale, zeroPoint)
This essentially converts all numbers to unsigned integers in the range 0 to 255; it is a lossy compression.
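A minimal sketch of these two functions (the function names and the test tensor are illustrative):
```python
import torch

def quantize(t: torch.Tensor):
    # Metadata: scale and zero point from the tensor's range.
    t_min, t_max = t.min(), t.max()
    scale = (t_max - t_min) / 255
    zero_point = t_min
    # Map values into the unsigned 0..255 range and round (lossy).
    q = ((t - zero_point) / scale).round().to(torch.uint8)
    return q, (scale, zero_point)

def dequantize(q: torch.Tensor, state):
    scale, zero_point = state
    return q.to(torch.float32) * scale + zero_point

w = torch.randn(4, 4)
q, state = quantize(w)
print((w - dequantize(q, state)).abs().max())   # small reconstruction error
```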
To quantize the entire model, a new state dictionary is used to maintain the (scale, zero point) state of each named parameter of the model.
Dequantization does affect the quality of the model; there are different quantization methods.
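A sketch of applying that per parameter, reusing the `quantize` function from the previous snippet (keeping the quantized weights and their states in two separate dicts is just one possible layout):
```python
# Quantize every parameter, keeping per-parameter (scale, zero_point) metadata
# so each weight can be dequantized again at forward time.
quantized_weights = {}
quant_states = {}
for name, param in model.named_parameters():
    q, state = quantize(param.data)
    quantized_weights[name] = q
    quant_states[name] = state
```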
Serving fine-tuned LLMs trained using Low-Rank Adaptation (LoRA). This is needed to customize LLMs without changing the existing weights of the base model.
LoRA takes a certain set of parameters, like the layers surrounding the attention computation or all the layers in the model, and introduces a new set of low-rank weights (two small matrices whose product is added to the frozen original weight).
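A toy sketch of a LoRA forward pass (the dimensions, rank, and scaling here are illustrative):
```python
import torch

d, r, alpha = 1024, 8, 16
W = torch.randn(d, d)            # frozen base weight
A = torch.randn(r, d) * 0.01     # trainable low-rank factor, rank r
B = torch.zeros(d, r)            # trainable, initialized to zero

x = torch.randn(1, d)
base_out = x @ W.T
lora_out = (x @ A.T) @ B.T * (alpha / r)    # two small matmuls instead of another d x d one
y = base_out + lora_out          # same as x @ (W + (alpha / r) * B @ A).T
```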
Batching + LoRA.
There can be different LoRAs, for example: a code completion assistant trained on different portions of a code repository (each adapter trained on a different segment of data); chaining several related tasks such as classification, prioritization, and routing; or supporting multiple tenants. The goal is to serve many fine-tuned models simultaneously.
Theoretically, how does batching improve throughput in LLMs?
Vectorization -> being able to process multiple inputs in the same operation: the weights are read from memory once and applied to the whole batch, so the per-request cost drops.
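A rough illustration of that effect (the sizes and the 64-request count are arbitrary):
```python
import time
import torch

W = torch.randn(4096, 4096)                      # stand-in for a weight matrix
xs = [torch.randn(1, 4096) for _ in range(64)]   # 64 separate "requests"

start = time.time()
for x in xs:
    _ = x @ W.T                                  # weights re-read for every request
print("one at a time:", time.time() - start)

start = time.time()
_ = torch.cat(xs) @ W.T                          # one batched matmul
print("batched:      ", time.time() - start)
```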
What's the theory behind the multi-LoRA architecture? The base model weights are kept in memory once, and each request in a batch selects its own small adapter (e.g. via an index select over a stack of adapter weights), so many fine-tuned variants can share one batched forward pass.
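A toy sketch of that indexed, batched multi-LoRA computation (all shapes, the number of adapters, and the adapter assignments are illustrative):
```python
import torch

d, r, n_adapters, batch = 64, 4, 3, 5
W = torch.randn(d, d)                        # shared frozen base weight
A = torch.randn(n_adapters, r, d) * 0.01     # one A per adapter
B = torch.randn(n_adapters, d, r) * 0.01     # one B per adapter

x = torch.randn(batch, d)
adapter_idx = torch.tensor([0, 2, 1, 0, 2])  # which adapter each request uses

base_out = x @ W.T
A_sel = A[adapter_idx]                       # index select: (batch, r, d)
B_sel = B[adapter_idx]                       # index select: (batch, d, r)
lora_out = torch.bmm(torch.bmm(x.unsqueeze(1), A_sel.transpose(1, 2)),
                     B_sel.transpose(1, 2)).squeeze(1)
y = base_out + lora_out                      # mixed-adapter batch, one pass
```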
Further reading:
- Support LoRA adapters of different ranks.
- Support mixed batches of requests with and without LoRA adapters.
- Improve efficiency of the index-select step where we find the index of which LoRA to use.
- Organize by "segments" of requests using the same LoRA adapter to reduce copies in memory.
- Implement as a CUDA kernel rather than PyTorch for improved performance.
Open-source inference server.
Optimizations:
- Fine-tune the last layer
- Adapter layers (leads to increased latency)
- Prefix tuning (lightweight) (prompting by prepending data)
- LoRA