TinyLlama 1.1B 4-bit quantized 3 trillion token chat model and Qwen 2 beta 0.5B 8-bit quantized chat model running on Azure Functions consumption plan.
This project is not intended for production use. It is a technology demonstration to show that it is possible to run large language models on a cheap and scalable serverless platform.
A demo of the app is available at https://tiny-serverless-llms.azurewebsites.net. In the demo, you can enter a prompt and the model will generate a completion.
Any abuse of the service will result in the service being taken down.
To run the serverless function locally, please refer to Azure Functions documentation.
You will need following environment variables:
{
"IsEncrypted": false,
"Values": {
"AzureWebJobsStorage": "UseDevelopmentStorage=true",
"FUNCTIONS_WORKER_RUNTIME": "python",
"AzureSignalRBase": "https://{signal_r_service_name}.service.signalr.net",
"AzureSignalRAccessKey": "{signal_r_service_key}",
"MODEL_BASE": "{absolute_path_to_model_directory_including_trailing_slash}",
"LLAMA_BASE": "{absolute_path_to_llama_binary_directory_including_trailing_slash}"
}
}
Copy your llama.cpp binary to the LLAMA_BASE
directory and the models to the MODEL_BASE
directory.
You'll need the following resources in Azure:
- Azure Functions in Consumption Plan
- Azure SignalR Service in serverless mode
You'll need to set the following application setting in Azure Functions:
POST_BUILD_SCRIPT_PATH=post_build.sh
MODEL_BASE=/home/site/wwwroot/
LLAMA_BASE=/home/site/wwwroot/llama.cpp/
AzureSignalRBase=https://{signal_r_service_name}.service.signalr.net
AzureSignalRAccessKey={signal_r_service_key}
The post_build.sh
script will be executed during Oryx build. It is used to build the llama.cpp binary and add the models to the deployment package.
About Softlandia
Softlandia is a software consultancy based in Finland. We specialize in AI, especially in generative AI, cloud architecture and IoT and in building software in general, from web applications to embedded systems.