Generative AI is an exciting and rapidly emerging space. Running large language models (LLMs) in the cloud can be costly and can expose proprietary data in unexpected ways. These issues can be avoided by deploying your AI workload in a private data centre on modern compute infrastructure. The purpose of this generative AI toolkit is to automate the full installation of some of the most popular open source software tools on Cisco UCS X-Series. The toolkit makes extensive use of UCS X-Fabric technology, the X440p PCIe node and GPU acceleration.
- Overview
- Installing the AI Toolkit
- Running the TextGen Server Software
- Performing Inference on Private Documents
- Performance Tuning
This solution guide will assist you with the full installation of:
- Ubuntu Linux operating system, including various common utilities
- GCC compiler, required for development using the NVIDIA parallel computing and programming environment (CUDA)
- NVIDIA GPU drivers as well as CUDA
- Miniconda package, dependency and environment manager for programming languages (e.g., Python and C++). Miniconda is a minimal distribution of Anaconda that includes only conda, Python, pip and a few other useful packages, making it well suited to data science work without the bulk of the full Anaconda distribution.
- AI Monitor for monitoring CPU, memory, GPU and VRAM utilization on your system
- WebUI, a simple user interface for testing and fine-tuning large language models
- OpenAI-compatible API (see the example after this list)
- Various LLMs, such as Vicuna and the Meta Open Pre-trained Transformer (OPT) models; a utility to download additional models from Hugging Face is included. Many Llama 2 based models have been tested and work.
- Software to perform inferencing on locally hosted private documents using LangChain and Chroma with popular Hugging Face embedding models and LLMs
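For example, once the server software is running with the API enabled, you can exercise the OpenAI-compatible endpoint with a simple HTTP request. The sketch below is a minimal example only; the port (5000) and the /v1/completions route are assumptions based on common text-generation-webui defaults, so check your server output for the actual values:

```
# Minimal completion request against the OpenAI-compatible API
# (port 5000 and the /v1/completions route are assumed defaults)
curl http://10.0.0.10:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Cisco UCS X-Series?", "max_tokens": 64}'
```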
To deploy this toolkit, you will need:
- Cisco UCS X-Series w/ X440p PCIe node and NVIDIA A100 GPU
- Cisco Intersight account
In Intersight, derive and deploy a server profile from a bare-metal Linux template to a UCS X-Series X210c compute node. All that is required is:
- Boot from M.2 RAID
- Single ethernet NIC with fabric failover (for redundancy)
From Intersight, select the server and perform an automated OS install. Use the custom OS configuration file from this repo called llm-bmaas.cfg. You will want to modify the cloud-init settings for password, address, gateway4 and nameservers.
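As a rough illustration only (the exact layout of llm-bmaas.cfg may differ, and every value below is a hypothetical placeholder), the network portion of the cloud-init settings follows the netplan schema:

```
# Hypothetical placeholder values -- substitute your own environment's settings
network:
  ethernets:
    eno1:                          # interface name is illustrative
      addresses: [10.0.0.10/24]    # static address for the server
      gateway4: 10.0.0.1           # default gateway
      nameservers:
        addresses: [10.0.0.2]      # DNS server(s)
```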
The following combination has been tested:
- OS Image - ubuntu-22.04.2-live-server-amd64.iso as version Ubuntu Server 22.04 LTS
- SCU Image - ucs-scu-6.3.1a.iso as version 6.3.1a
- OS Configuration File - llm-bmaas.cfg as version Ubuntu Server 22.04 LTS
Other combinations may work, but please try these before asking for assistance.
SSH into the server for the first time as username ubuntu and run the following one-time commands:
```
wget https://github.com/pl247/ai-install/raw/main/ai-install.sh
chmod a+x ai-install.sh
./ai-install.sh
```
Answer yes when asked whether you want to proceed during the Miniconda install.
**You will need to reboot** to activate your NVIDIA GPU drivers:

```
sudo reboot
```
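After the reboot, it is worth confirming that the GPU driver and CUDA toolkit are active before starting any workloads:

```
# Confirm the GPU is visible to the NVIDIA driver
nvidia-smi
# Confirm the CUDA compiler is installed (if it is not found, the CUDA
# bin directory may not be on your PATH yet)
nvcc --version
```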
Now that the system is fully installed, you can run the server software using either CPU or GPU (if installed).
To run on the CPU, activate the textgen environment in conda, move to the correct directory and start the text generation server:
```
conda activate textgen
cd text-generation-webui
python server.py --listen --auto-devices --chat --model-menu --cpu
```
To access the application, open a web browser to your server's IP address on port 7860, for example: http://10.0.0.10:7860
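If the page does not load, a quick HTTP check from another machine can confirm the web UI is answering (substitute your server's address):

```
# Expect a 200 status code when the web UI is up
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.10:7860
```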
To run with GPU acceleration instead, specify the amount of GPU memory (in GiB) to allocate for the model:

```
conda activate textgen
cd text-generation-webui
python server.py --listen --auto-devices --chat --model-menu --gpu-memory 76
```
If you have an NVIDIA GPU, you can also monitor the system in a separate session using the ai-monitor tool that was installed:
```
/ai/ai-monitor/ai-monitor
```
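If you prefer a standard tool, the nvidia-smi utility installed with the GPU drivers gives a similar rolling view:

```
# Refresh GPU and VRAM utilization every second
watch -n 1 nvidia-smi
```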
Check out the Hugging Face leaderboard at https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, then download any of the models you would like to try using the following commands:
```
cd text-generation-webui
python3 download-model.py TheBloke/Wizard-Vicuna-13B-Uncensored-HF
```
Substitute TheBloke/Wizard-Vicuna-13B-Uncensored-HF with the name of any Hugging Face model you would like.
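Downloaded models are placed under the models directory of text-generation-webui (the default download target, assuming you have not changed it), which you can confirm with:

```
# List locally available models (run from inside text-generation-webui)
ls models
```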
Note - the OPT-350 LLM was included mostly to show how far things have progressed in less than 1 year.
Performing inference on your own documents is often called Retrieval Augmented Generation (RAG). To perform inferencing on private, localized data on your system, perform the following tasks:
- Move to the doc-inferencing directory
- Activate the docs environment in conda
- Place any documents (type pdf, doc, docx, txt, xls, xlsx, csv, md or py) you would like to query in the SOURCE_DOCUMENTS directory
- Ingest the documents using ingest.py
- Run the doc inferencing using run_localGPT.py
```
cd doc-inferencing
conda activate docs
# Ingest docs
python ingest.py
# Run inferencing
python run_localGPT.py
```
To place documents in the SOURCE_DOCUMENTS folder, try using wget:
```
wget https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/x210cm7-specsheet.pdf
wget https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/x9508-specsheet.pdf
wget https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/cisco-ucs-6536-fabric-interconnect-spec-sheet.pdf
wget https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/x440p-specsheet.pdf
```
If wget fails with the error message `unsafe legacy renegotiation disabled`, try the following workaround:
```
sudo vi /usr/lib/ssl/openssl.cnf
# Add the following option to openssl.cnf under the [system_default_sect] section
Options = UnsafeLegacyRenegotiation
```
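If you prefer a non-interactive edit, a sed one-liner can append the option directly after the section header (this assumes the [system_default_sect] section already exists in openssl.cnf):

```
# Append the option immediately after the [system_default_sect] header
sudo sed -i '/\[system_default_sect\]/a Options = UnsafeLegacyRenegotiation' /usr/lib/ssl/openssl.cnf
```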
To set the timezone on your system correctly:
```
# show current timezone with offset
date +"%Z %z"
# show timezone options for America
timedatectl list-timezones | grep America
# set timezone
sudo timedatectl set-timezone America/Winnipeg
```
One of the nice things about Cisco UCS and Intersight is the ability to create specific policies for your desired configurations. For generative AI workloads, you may wish to create a BIOS policy for your servers with changes from the defaults as described in the following document: