CPU Support

Aphrodite supports CPU-only inference at relatively fast speeds. Currently, only CPUs with AVX512 instructions are supported. To check whether your CPU qualifies, run the following in a terminal:

cat /proc/cpuinfo | grep avx512

If your CPU does not support AVX512 instructions, the command will not output anything.
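For scripted setups, the same check can be wrapped in a small function that prints a clear verdict instead of raw flag lines. This is a minimal sketch: the function name `check_avx512` is made up for illustration, and it assumes that the `avx512f` (AVX512 Foundation) flag is the baseline to look for; the actual build may rely on additional AVX512 subsets.

```shell
# Report whether a cpuinfo-style file lists the avx512f flag.
# $1 (optional): file to inspect; defaults to /proc/cpuinfo.
# The function name and the choice of avx512f as the baseline are assumptions.
check_avx512() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -q: no output, -w: match avx512f only as a whole word
    if grep -qw avx512f "$cpuinfo"; then
        echo "AVX512 supported"
    else
        echo "AVX512 not supported"
        return 1
    fi
}
```

Calling `check_avx512` with no arguments inspects the current machine; the non-zero return status on failure makes it usable in an `if` or before a build step.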

Building

Install system-wide dependencies

$ sudo apt-get update -y
$ sudo apt-get install -y gcc-12 g++-12  # you can skip this if you already have a gcc/g++>=12.3.0 installed
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
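To decide whether the gcc install step can be skipped, you can compare the installed compiler version against 12.3.0. The sketch below is one way to do that; the helper name `version_ge` is hypothetical, and it assumes a `sort` that supports `-V` (as GNU coreutils does) and a gcc recent enough to understand `-dumpfullversion`.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Hypothetical helper; relies on version sorting via "sort -V".
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query gcc if it is installed; -dumpfullversion prints the full version string.
if command -v gcc >/dev/null 2>&1; then
    if version_ge "$(gcc -dumpfullversion)" 12.3.0; then
        echo "gcc is new enough"
    else
        echo "gcc is older than 12.3.0"
    fi
fi
```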

Install the Python dependencies

$ pip install -U pip
$ pip install wheel packaging ninja "setuptools>=49.4.0" numpy  # quote the specifier so the shell does not treat ">" as a redirect
$ pip install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu

Build Aphrodite Engine

$ APHRODITE_TARGET_DEVICE=cpu python setup.py install

Usage

You can run the engine as usual, with a few points to note:

Use the environment variable APHRODITE_CPU_KVCACHE_SPACE to specify the amount of memory (in GiB) allocated for the KV cache. A larger value lets the engine process more requests in parallel.
The CPU backend uses OpenMP for thread-parallel computation. For the best CPU performance, it is critical to isolate the CPU cores used by OpenMP threads from other thread pools (such as a web service's event loop) to avoid CPU oversubscription.
If running on bare metal, it is recommended to disable hyper-threading.
If you're on a multi-socket machine with NUMA, make sure the process uses only a single socket to avoid remote memory access. You can use numactl to do this.
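The points above can be combined into a single launch line. The sketch below only prints such a line so you can inspect it before running it. The function name `pinned_launch_cmd`, the numeric values, and `your_launch_script.py` are placeholders, not part of Aphrodite. `OMP_NUM_THREADS` is the standard OpenMP variable for capping the thread count, and using it here is an assumption about how you want to size the thread pool. `--cpunodebind` and `--membind` are standard numactl options.

```shell
# Print (do not run) a launch line pinned to a single NUMA node.
# $1: NUMA node index
# $2: KV cache size in GiB (APHRODITE_CPU_KVCACHE_SPACE)
# $3: number of OpenMP threads (OMP_NUM_THREADS)
# remaining arguments: whatever command you normally use to start the engine
pinned_launch_cmd() {
    node="$1"; kv_gib="$2"; threads="$3"; shift 3
    echo "APHRODITE_CPU_KVCACHE_SPACE=$kv_gib OMP_NUM_THREADS=$threads numactl --cpunodebind=$node --membind=$node $*"
}

# Example with placeholder values: node 0, 40 GiB of KV cache, 32 threads.
pinned_launch_cmd 0 40 32 python your_launch_script.py
```

Binding both CPU and memory to the same node keeps the process on one socket, which is what avoids remote memory access. Leaving a few cores outside the OpenMP thread count is one way to reserve capacity for other thread pools.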