- Nvidia and CUDA driver
  - Run `nvidia-smi` to verify the driver installation.
  - May need to disable secure boot in BIOS.
  - Currently, Liquid cannot provide technical support for driver or CUDA installation.
- Docker
  - If the current user has no permission to run Docker commands, run the following commands (a combined check is sketched after this list):

    ```bash
    sudo usermod -aG docker $USER
    sudo systemctl restart docker
    # IMPORTANT: after the restart, log out and log back in
    # verify the permission
    ls -l /var/run/docker.sock
    ```
- Docker compose plugin
  - Run `docker compose version` to verify installation.
- Nvidia container toolkit
  - Run `nvidia-ctk --version` to verify installation.
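
For convenience, the sketch below bundles the verification commands from the list above into one quick check; the `id -nG` group test is an extra step that only confirms the Docker permission setup described earlier:

```bash
# Quick prerequisite check
nvidia-smi                   # NVIDIA driver
docker compose version       # Docker Compose plugin
nvidia-ctk --version         # NVIDIA Container Toolkit
id -nG "$USER" | grep -qw docker && echo "docker group OK"
```
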
To launch the stack, run the following commands:
```bash
# Authenticate to the Docker registry
# Paste the password when prompted
docker login -u <docker-username>

# Launch the stack
./launch.sh

# Wait for 2 minutes, and run the API test
./test-api.sh
```
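
If the API test fails, a reasonable first step is to check that all containers are up and inspect their logs. This sketch assumes it is run from the repository directory that contains `docker-compose.yaml`:

```bash
# List the stack's containers and follow recent logs
docker compose ps
docker compose logs --tail=100 -f
```
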
When running for the first time, the launch script will do the following:
- Create a `.env` file and populate all the environment variables used by the stack.
- Create a Docker volume `postgres_data` for the Postgres database.
- Run the `docker-compose.yaml` file and start the stack.
On subsequent runs, the launch script will consume the environment variables from the `.env` file and restart the stack.

Two environment variables are constructed from other variables: `DATABASE_URL` and `MODEL_NAME`. Please do not modify them directly in the `.env` file.
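
To inspect the derived values without editing the file, they can simply be read out of `.env`, for example:

```bash
# View (but do not edit) the derived variables generated by launch.sh
grep -E '^(DATABASE_URL|MODEL_NAME)=' .env
```
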
The repository contains the following files:

File | Description |
---|---|
`README.md` | This file |
`docker-compose.yaml` | Docker compose file to launch the stack |
`launch.sh` | Script to launch the stack |
`.env` | Environment variables file created by the `launch.sh` script |
`shutdown.sh` | Script to shut down the stack |
`connect-db.sh` | Script to connect to the Postgres database |
`test-api.sh` | Script to test the inference server API |
`run-vllm.sh` | Script to launch any model from Hugging Face |
`rm-vllm.sh` | Script to remove a model launched by `run-vllm.sh` |
`run-checkpoint.sh` | Script to serve fine-tuned Liquid model checkpoints |
`run-cf-tunnel.sh` | Script to run a Cloudflare tunnel |
`purge.sh` | Script to remove all containers, volumes, and networks |
To update the stack or model to the latest version, pull the latest changes from this repository, and run the launch script with `--upgrade-stack` and/or `--upgrade-model`:
```bash
./shutdown.sh
./launch.sh [--upgrade-stack] [--upgrade-model]
```
To update the stack or model manually to a specific version, change `STACK_VERSION` and/or `MODEL_IMAGE` in the `.env` file and run:
```bash
./shutdown.sh
./launch.sh
```
To connect to the Postgres database:
- Install `pgcli` first.
- Run `connect-db.sh`.
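
For reference, one common way to install `pgcli` (among the options in its own documentation) is via pip; the connect script then opens an interactive session:

```bash
# Install the pgcli client, then connect to the stack's Postgres database
pip install pgcli
./connect-db.sh
```
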
To shut down the stack, run:

```bash
./shutdown.sh
```
To expose the web UI through a Cloudflare tunnel, the default command provided by Cloudflare does not work. Instead, run the following command with the `--network` and `--protocol h2mux` options:
```bash
# add --protocol h2mux
docker run -d --network liquid_labs_network cloudflare/cloudflared:latest tunnel --no-autoupdate run --protocol h2mux --token <tunnel-token>
```
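
To confirm the tunnel is running, the container's logs can be checked; the filter below assumes the container was started from the `cloudflare/cloudflared:latest` image as in the command above:

```bash
# Locate the tunnel container and follow its logs
docker ps --filter ancestor=cloudflare/cloudflared:latest
docker logs -f "$(docker ps -q --filter ancestor=cloudflare/cloudflared:latest | head -n 1)"
```
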
Run the `run-vllm.sh` script to launch models from Hugging Face. The script requires the `--model-name`, `--hf-model-path`, and `--hf-token` parameters. For example, the following command launches the Hugging Face model `meta-llama/Llama-2-7b-chat-hf` as `llama-7b`:
```bash
./run-vllm.sh --model-name llama-7b --hf-model-path "meta-llama/Llama-2-7b-chat-hf" --hf-token <hugging-face-token>
```
When accessing a gated repository, please ensure:
- You have been granted permission to access the repository.
- The access token has the following permission scope: `Read access to contents of all public gated repos you can access`.
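
As an optional sanity check (not required by the script), the token can be verified against the Hugging Face Hub `whoami-v2` API endpoint before launching:

```bash
# Returns your account details if the token is valid
curl -s -H "Authorization: Bearer <hugging-face-token>" https://huggingface.co/api/whoami-v2
```
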
The launched vLLM container has no authentication and exposes port 9000 by default. Example API calls:
```bash
# show model ID:
curl http://0.0.0.0:9000/v1/models

# run chat completion:
curl http://0.0.0.0:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-7b",
    "messages": [
      {
        "role": "user",
        "content": "At which temperature does silver melt?"
      }
    ],
    "max_tokens": 128,
    "temperature": 0
  }'
```
Full list of parameters for the launch script:
Parameter | Required | Default | Description |
---|---|---|---|
`--model-name` | Yes | | Name for the docker container and model ID for API calls |
`--hf-model-path` | Yes | | Hugging Face model path (e.g. `meta-llama/Llama-2-7b-chat-hf`) |
`--hf-token` | Required for private or gated repositories | | Hugging Face API token |
`--port` | No | `9000` | Port number for the inference server |
`--gpu` | No | `all` | GPU device to use (e.g. to use the first GPU: `0`, to use the second GPU: `1`) |
`--gpu-memory-utilization` | No | `0.6` | GPU memory utilization for the inference server |
`--max-num-seqs` | No | `600` | Maximum number of sequences per iteration. Decrease this value when running into out-of-memory issues. |
`--max-model-len` | No | `32768` | Model context length. Decrease this value when running into out-of-memory issues. |
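
For illustration, a launch that overrides several of these defaults could look like the following; the model path and token are the same placeholders used above:

```bash
./run-vllm.sh \
  --model-name llama-7b \
  --hf-model-path "meta-llama/Llama-2-7b-chat-hf" \
  --hf-token <hugging-face-token> \
  --port 9001 \
  --gpu 0 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 8192
```
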
Troubleshooting:
Missing chat template
When chatting with a model, if you see the following error:
```
As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
```
This means the model does not define a default `chat_template` in its `tokenizer_config.json`. It is possible that the model is not trained for chat input. The solution is to run a chat-compatible model instead. For example, `meta-llama/Llama-3.2-3B` has no chat template, but `meta-llama/Llama-3.2-3B-Instruct` does.
The `run-vllm.sh` script does not support passing in a custom chat template. You can modify the script yourself if needed.
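
To check ahead of time whether a model defines a chat template, one option is to fetch its `tokenizer_config.json` from the Hub and look for the `chat_template` key. This is a convenience check, not part of the script; it assumes `jq` is installed, and for gated repositories the Authorization header is required:

```bash
# Prints true if the tokenizer config defines a chat template, false otherwise
curl -s -H "Authorization: Bearer <hugging-face-token>" \
  https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/raw/main/tokenizer_config.json \
  | jq 'has("chat_template")'
```
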
Unknown or invalid runtime name: nvidia
- Ensure the NVIDIA Container Toolkit is installed:

  ```bash
  sudo apt update
  sudo apt install -y nvidia-container-toolkit nvidia-container-runtime
  ```

- Configure Docker to use the NVIDIA runtime:

  ```bash
  sudo nano /etc/docker/daemon.json
  ```

  Ensure the file contains the following:

  ```json
  {
    "runtimes": {
      "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }
  ```

  Then restart Docker:

  ```bash
  sudo systemctl restart docker
  ```
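
After the restart, it is worth confirming that the `nvidia` runtime is registered:

```bash
# The output should include a runtime named "nvidia"
docker info | grep -i runtimes
```
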
Install `jq`, then run the `run-checkpoint.sh` script.
For example, the following command will launch the checkpoint files in `~/finetuned-lfm-3b-output` on port `9000`:
```bash
./run-checkpoint.sh --model-checkpoint "~/finetuned-lfm-3b-output"
```
The model name is extracted from the `model_metadata.json` file in the checkpoint directory. The launched vLLM container has no authentication and exposes port 9000 by default. Example API calls:
```bash
# show model ID:
curl http://0.0.0.0:9000/v1/models

# run chat completion:
curl http://0.0.0.0:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lfm-3b-ft",
    "messages": [
      {
        "role": "user",
        "content": "At which temperature does silver melt?"
      }
    ],
    "max_tokens": 128,
    "temperature": 0
  }'
```
Full list of parameters for the launch script:
Parameter | Required | Default | Description |
---|---|---|---|
`--model-checkpoint` | Yes | | Local path to the fine-tuned Liquid model checkpoint |
`--port` | No | `9000` | Port number for the inference server |
`--gpu` | No | `all` | GPU device to use (e.g. to use the first GPU: `0`, to use the second GPU: `1`) |
`--gpu-memory-utilization` | No | `0.6` | GPU memory utilization for the inference server. Decrease this value when running into out-of-memory issues. |
`--max-num-seqs` | No | | Maximum number of sequences per iteration. Decrease this value when running into out-of-memory issues. |
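
For illustration, a launch that pins the server to a specific GPU and port could look like the following; the checkpoint path is the same placeholder used above:

```bash
./run-checkpoint.sh \
  --model-checkpoint "~/finetuned-lfm-3b-output" \
  --port 9001 \
  --gpu 0 \
  --gpu-memory-utilization 0.8
```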