Skip to content

Commit

Permalink
docs: add federation (#2929)
Browse files Browse the repository at this point in the history
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  • Loading branch information
mudler authored Jul 20, 2024
1 parent f9f8379 commit 87bd831
Showing 1 changed file with 59 additions and 30 deletions.
89 changes: 59 additions & 30 deletions docs/content/docs/features/distributed_inferencing.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,66 +5,93 @@ weight = 15
url = "/features/distribute/"
+++

{{% alert note %}}
This feature is available exclusively with llama-cpp compatible models.

This feature was introduced in [LocalAI pull request #2324](https://github.com/mudler/LocalAI/pull/2324) and is based on the upstream work in [llama.cpp pull request #6829](https://github.com/ggerganov/llama.cpp/pull/6829).
{{% /alert %}}
This functionality enables LocalAI to distribute inference requests across multiple worker nodes, improving efficiency and performance. Nodes are automatically discovered and connect via p2p by using a shared token which makes sure the communication is secure and private between the nodes of the network.

LocalAI supports two modes of distributed inferencing via p2p:

This functionality enables LocalAI to distribute inference requests across multiple worker nodes, improving efficiency and performance.
- **Federated Mode**: Requests are shared between the cluster and routed to a single worker node in the network based on the load balancer's decision.
- **Worker Mode**: Requests are processed by all the workers which contributes to the final inference result (by sharing the model weights).

## Usage

### Starting Workers
Starting LocalAI with `--p2p` generates a shared token for connecting multiple instances: and that's all you need to create AI clusters, eliminating the need for intricate network setups.

To start workers for distributing the computational load, run:
Simply navigate to the "Swarm" section in the WebUI and follow the on-screen instructions.

For fully shared instances, initiate LocalAI with --p2p --federated and adhere to the Swarm section's guidance. This feature, while still experimental, offers a tech preview quality experience.

### Federated mode

Federated mode allows to launch multiple LocalAI instances and connect them together in a federated network. This mode is useful when you want to distribute the load of the inference across multiple nodes, but you want to have a single point of entry for the API. In the Swarm section of the WebUI, you can see the instructions to connect multiple instances together.

![346663124-1d2324fd-8b55-4fa2-9856-721a467969c2](https://github.com/user-attachments/assets/19ebd44a-20ff-412c-b92f-cfb8efbe4b21)

To start a LocalAI server in federated mode, run:

```bash
local-ai worker llama-cpp-rpc <listening_address> <listening_port>
local-ai run --p2p --federated
```

This will generate a token that you can use to connect other LocalAI instances to the network or others can use to join the network. If you already have a token, you can specify it using the `TOKEN` environment variable.

To start a load balanced server that routes the requests to the network, run with the `TOKEN`:

```bash
local-ai federated
```

Alternatively, you can build the RPC server following the llama.cpp [README](https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md), which is compatible with LocalAI.
To see all the available options, run `local-ai federated --help`.

The instructions are displayed in the "Swarm" section of the WebUI, guiding you through the process of connecting multiple instances.

### Starting LocalAI
### Workers mode

To start the LocalAI server, which handles API requests, specify the worker addresses using the `LLAMACPP_GRPC_SERVERS` environment variable:
{{% alert note %}}
This feature is available exclusively with llama-cpp compatible models.

This feature was introduced in [LocalAI pull request #2324](https://github.com/mudler/LocalAI/pull/2324) and is based on the upstream work in [llama.cpp pull request #6829](https://github.com/ggerganov/llama.cpp/pull/6829).
{{% /alert %}}

To connect multiple workers to a single LocalAI instance, start first a server in p2p mode:

```bash
LLAMACPP_GRPC_SERVERS="address1:port,address2:port" local-ai run
local-ai run --p2p
```

The workload on the LocalAI server will then be distributed across the specified nodes.
And navigate the WebUI to the "Swarm" section to see the instructions to connect multiple workers to the network.

## Peer-to-Peer Networking
![346663124-1d2324fd-8b55-4fa2-9856-721a467969c2](https://github.com/user-attachments/assets/b8cadddf-a467-49cf-a1ed-8850de95366d)

![output](https://github.com/mudler/LocalAI/assets/2420543/8ca277cf-c208-4562-8929-808b2324b584)
### Without P2P

Workers can also connect to each other in a peer-to-peer network, distributing the workload in a decentralized manner.
To start workers for distributing the computational load, run:

A shared token between the server and the workers is required for communication within the peer-to-peer network. This feature supports both local network (using mDNS discovery) and DHT for communication across different networks.
```bash
local-ai worker llama-cpp-rpc <listening_address> <listening_port>
```

The token is automatically generated when starting the server with the `--p2p` flag. Workers can be started with the token using `local-ai worker p2p-llama-cpp-rpc` and specifying the token via the environment variable `TOKEN` or with the `--token` argument.
And you can specify the address of the workers when starting LocalAI with the `LLAMACPP_GRPC_SERVERS` environment variable:

A network is established between the server and workers using DHT and mDNS discovery protocols. The llama.cpp RPC server is automatically started and exposed to the peer-to-peer network, allowing the API server to connect.
```bash
LLAMACPP_GRPC_SERVERS="address1:port,address2:port" local-ai run
```
The workload on the LocalAI server will then be distributed across the specified nodes.

When the HTTP server starts, it discovers workers in the network and creates port forwards to the local service. Llama.cpp is configured to use these services. For more details on the implementation, refer to [LocalAI pull request #2343](https://github.com/mudler/LocalAI/pull/2343).
Alternatively, you can build the RPC workers/server following the llama.cpp [README](https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md), which is compatible with LocalAI.

### Usage
## Manual example (worker)

Use the WebUI to guide you in the process of starting new workers. This example shows the manual steps to highlight the process.

1. Start the server with `--p2p`:

```bash
./local-ai run --p2p
# 1:02AM INF loading environment variables from file envFile=.env
# 1:02AM INF Setting logging to info
# 1:02AM INF P2P mode enabled
# 1:02AM INF No token provided, generating one
# 1:02AM INF Generated Token:
# XXXXXXXXXXX
# 1:02AM INF Press a button to proceed
# Get the token in the Swarm section of the WebUI
```

Copy the displayed token and press Enter.
Copy the token from the WebUI or via API call (e.g., `curl http://localhost:8000/p2p/token`) and save it for later use.

To reuse the same token later, restart the server with `--p2ptoken` or `P2P_TOKEN`.

Expand Down Expand Up @@ -93,12 +120,14 @@ The server logs should indicate that new workers are being discovered.

3. Start inference as usual on the server initiated in step 1.

![output](https://github.com/mudler/LocalAI/assets/2420543/8ca277cf-c208-4562-8929-808b2324b584)

## Notes

- If running in p2p mode with container images, make sure you start the container with `--net host` or `network_mode: host` in the docker-compose file.
- Only a single model is supported currently.
- Ensure the server detects new workers before starting inference. Currently, additional workers cannot be added once inference has begun.

- For more details on the implementation, refer to [LocalAI pull request #2343](https://github.com/mudler/LocalAI/pull/2343)

## Environment Variables

Expand Down

0 comments on commit 87bd831

Please sign in to comment.