Generative AI is an exciting and rapidly emerging space. Running large language models (LLMs) in the cloud can be costly and can expose proprietary data in unexpected ways. These issues can be avoided by deploying your AI workload in a private data centre on modern compute infrastructure. The purpose of this generative AI toolkit is to automate the full installation of some of the most popular open source software tools on Cisco UCS X-Series. The toolkit makes extensive use of UCS X-Fabric technology, the X440p PCIe node and GPU acceleration.
- Overview
- Installing the AI Toolkit
- Running the TextGen Server Software
- Performing Inference on Private Documents
- Performance Tuning
This solution guide will assist you with the full installation of:
- Ubuntu Linux operating system, including various common utilities
- GCC compiler, required for development using the NVIDIA parallel computing and programming environment (CUDA)
- NVIDIA GPU drivers as well as CUDA
- Miniconda package, dependency and environment manager for programming languages (e.g., Python and C++). Miniconda is a minimal distribution of Anaconda that includes only conda, Python, pip and a few other useful packages, making it well suited to data science work without the bulk of the full Anaconda distribution.
- AI Monitor for monitoring CPU, memory, GPU and VRAM utilization on your system
- WebUI, a simple user interface for testing and fine-tuning large language models
- OpenAI-compatible API (see the example after this list)
- Various LLMs, such as Vicuna and the Meta Open Pre-trained Transformer (OPT) models; a utility to download additional models from Hugging Face is included. Many Llama 2 based models have been tested and work.
- Software to perform inferencing on locally hosted private documents using LangChain and Chroma with popular Hugging Face embedding models and LLMs
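For example, once the server software is running with the API enabled, you can exercise the OpenAI-compatible endpoint with a simple HTTP request. The sketch below is a minimal example only; the port (5000) and the /v1/completions route are assumptions based on common text-generation-webui defaults, so check your server output for the actual values:

```
# Minimal completion request against the OpenAI-compatible API
# (port 5000 and the /v1/completions route are assumed defaults)
curl http://10.0.0.10:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Cisco UCS X-Series?", "max_tokens": 64}'
```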
To deploy this toolkit, you will need:
- Cisco UCS X-Series w/ X440p PCIe node and NVIDIA A100 GPU
- Cisco Intersight account
In Intersight, derive and deploy a server profile from a bare-metal Linux template to a UCS X-Series X210c compute node. All that is required is:
- Boot from M.2 RAID
- Single ethernet NIC with fabric failover (for redundancy)
From Intersight, select the server and perform an automated OS install. Use the custom OS configuration file from this repo called llm-bmaas.cfg. You will want to modify the cloud-init settings for password, address, gateway4 and nameservers.
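As a rough illustration only (the exact layout of llm-bmaas.cfg may differ, and every value below is a hypothetical placeholder), the network portion of the cloud-init settings follows the netplan schema:

```
# Hypothetical placeholder values -- substitute your own environment's settings
network:
  ethernets:
    eno1:                          # interface name is illustrative
      addresses: [10.0.0.10/24]    # static address for the server
      gateway4: 10.0.0.1           # default gateway
      nameservers:
        addresses: [10.0.0.2]      # DNS server(s)
```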
The following combination has been tested:
- OS Image - ubuntu-22.04.2-live-server-amd64.iso as version Ubuntu Server 22.04 LTS
- SCU Image - ucs-scu-6.3.1a.iso as version 6.3.1a
- OS Configuration File - llm-bmaas.cfg as version Ubuntu Server 22.04 LTS
Other combinations may work, but please try these before asking for assistance.
SSH into the server for the first time as username ubuntu and run the following one-time commands:
```
wget https://github.com/pl247/ai-install/raw/main/ai-install.sh
chmod a+x ai-install.sh
./ai-install.sh
```
Answer yes when asked whether you want to proceed during the Miniconda install.
**You will need to reboot** to activate your NVIDIA GPU drivers:

```
sudo reboot
```
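After the reboot, it is worth confirming that the GPU driver and CUDA toolkit are active before starting any workloads:

```
# Confirm the GPU is visible to the NVIDIA driver
nvidia-smi
# Confirm the CUDA compiler is installed (if it is not found, the CUDA
# bin directory may not be on your PATH yet)
nvcc --version
```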
Now that the system is fully installed, you can run the server software using either CPU or GPU (if installed).
To run on the CPU, activate the textgen environment in conda, move to the correct directory and start the text generation server:
```
conda activate textgen
cd text-generation-webui
python server.py --listen --auto-devices --chat --model-menu --cpu
```
To access the application, open a web browser to your server's IP address on port 7860, for example: http://10.0.0.10:7860
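If the page does not load, a quick HTTP check from another machine can confirm the web UI is answering (substitute your server's address):

```
# Expect a 200 status code when the web UI is up
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.10:7860
```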
To run with GPU acceleration instead, specify the amount of GPU memory (in GiB) to allocate for the model:

```
conda activate textgen
cd text-generation-webui
python server.py --listen --auto-devices --chat --model-menu --gpu-memory 76
```
If you have an NVIDIA GPU, you can also monitor the system in a separate session using the ai-monitor tool that was installed:
```
/ai/ai-monitor/ai-monitor
```
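If you prefer a standard tool, the nvidia-smi utility installed with the GPU drivers gives a similar rolling view:

```
# Refresh GPU and VRAM utilization every second
watch -n 1 nvidia-smi
```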
Check out the Hugging Face leaderboard at https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, then download any of the models you would like to try using the following commands:
```
cd text-generation-webui
python3 download-model.py TheBloke/Wizard-Vicuna-13B-Uncensored-HF
```
Substitute TheBloke/Wizard-Vicuna-13B-Uncensored-HF with the name of any Hugging Face model you would like.
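Downloaded models are placed under the models directory of text-generation-webui (the default download target, assuming you have not changed it), which you can confirm with:

```
# List locally available models (run from inside text-generation-webui)
ls models
```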
Note - the OPT-350 LLM was included mostly to show how far things have progressed in less than 1 year.
Performing inference on your own documents is often called Retrieval Augmented Generation (RAG). To perform inferencing on private, localized data on your system, perform the following tasks:
- Move to the doc-inferencing directory
- Activate the docs environment in conda
- Place any documents (type pdf, doc, docx, txt, xls, xlsx, csv, md or py) you would like to query in the SOURCE_DOCUMENTS directory
- Ingest the documents using ingest.py
- Run the doc inferencing using run_localGPT.py
```
cd doc-inferencing
conda activate docs
# Ingest docs
python ingest.py
# Run inferencing
python run_localGPT.py
```
To place documents in the SOURCE_DOCUMENTS folder, try using wget:
```
wget https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/x210cm7-specsheet.pdf
wget https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/x9508-specsheet.pdf
wget https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/cisco-ucs-6536-fabric-interconnect-spec-sheet.pdf
wget https://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/x440p-specsheet.pdf
```
If wget fails with the error message `unsafe legacy renegotiation disabled`, try the following workaround:
```
sudo vi /usr/lib/ssl/openssl.cnf
# Add the following option to openssl.cnf under the [system_default_sect] section
Options = UnsafeLegacyRenegotiation
```
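If you prefer a non-interactive edit, a sed one-liner can append the option directly after the section header (this assumes the [system_default_sect] section already exists in openssl.cnf):

```
# Append the option immediately after the [system_default_sect] header
sudo sed -i '/\[system_default_sect\]/a Options = UnsafeLegacyRenegotiation' /usr/lib/ssl/openssl.cnf
```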
To set the timezone on your system correctly:
```
# show current timezone with offset
date +"%Z %z"
# show timezone options for America
timedatectl list-timezones | grep America
# set timezone
sudo timedatectl set-timezone America/Winnipeg
```
One of the nice things about Cisco UCS and Intersight is the ability to create specific policies for your desired configurations. For generative AI workloads, you may wish to create a BIOS policy for your servers with changes from the defaults as described in the following document: