Add Docs For SGLang Native Router #2308

Merged · 8 commits · Dec 4, 2024
7 changes: 7 additions & 0 deletions docs/index.rst
@@ -39,6 +39,13 @@ The core features include:
frontend/choices_methods.md


.. toctree::
:maxdepth: 1
:caption: SGLang Router

router/router.md


.. toctree::
:maxdepth: 1
:caption: References
110 changes: 110 additions & 0 deletions docs/router/router.md
@@ -0,0 +1,110 @@
# Router for Data Parallelism

Given multiple GPUs running multiple SGLang Runtimes, the SGLang Router distributes incoming requests across those Runtimes with a cache-aware load-balancing algorithm.

The router is an independent Python package, and it can be used as a drop-in replacement for the OpenAI API.

## Installation

```bash
pip install sglang-router
```

Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang/launch_server.py). You can also run the following commands to see the usage of the router:

```bash
python -m sglang_router.launch_server --help
python -m sglang_router.launch_router --help
```

The router supports two working modes:

1. Co-launch Router and Runtimes
2. Launch Runtimes and Router separately

## Co-launch Router and Runtimes

This mode is a drop-in replacement for the existing `--dp-size` argument of the SGLang Runtime. Under the hood, it uses multiple processes to launch the workers, waits for them to become ready, and then connects the router to all of them.

```bash
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1
```

After the server is ready, you can send requests to the router in the same way you would send them to a single worker.

```python
import requests

url = "http://localhost:30000/generate"
data = {"text": "What is the capital of France?"}

response = requests.post(url, json=data)
print(response.json())
```

## Launch Runtimes and Router Separately

This mode is useful for multi-node data parallelism: first launch workers on multiple nodes, then launch a router on the main node and connect it to all workers.

```bash
python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2
```
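
Once the router is running, you can send requests to it exactly as in the co-launch mode. A minimal sketch, assuming the router listens on port 30000 on the main node (the port is an assumption; check `--help` for the actual default):

```python
import requests

# Assumed router address; substitute the host and port your router reports.
router_url = "http://localhost:30000"

response = requests.post(
    f"{router_url}/generate",
    json={"text": "What is the capital of France?"},
)
print(response.json())
```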

## Strategies

### Cache-Aware Load-Balancing Router

The native router combines two strategies to optimize both cache utilization and request distribution:

1. Cache-Aware Routing (Approximate Tree)
2. Load-Balancing Routing (Shortest Queue with Balance Thresholds)

The router dynamically switches between these strategies based on load conditions:

- Uses load balancing when the system is imbalanced
- Uses cache-aware routing when the system is balanced

A system is considered imbalanced only if both of the following conditions hold (a minimal check is sketched in code after the list):

1. `(max_load - min_load) > balance_abs_threshold`
2. `max_load > balance_rel_threshold * min_load`
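
A minimal sketch of this check, assuming `loads` holds the pending request count of each worker (the names are illustrative, not the router's actual internals):

```python
def is_imbalanced(loads, balance_abs_threshold=32, balance_rel_threshold=1.0001):
    """True only when the load gap is large both absolutely and relatively."""
    max_load, min_load = max(loads), min(loads)
    return ((max_load - min_load) > balance_abs_threshold
            and max_load > balance_rel_threshold * min_load)

# With the defaults, a gap of 40 pending requests trips both conditions.
print(is_imbalanced([50, 10]))  # True
print(is_imbalanced([50, 40]))  # False: the absolute gap of 10 is below 32
```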

***Cache-Aware Routing (Approximate Tree)***

When the workers are considered balanced, the router maintains an approximate radix tree for each worker based on request history, eliminating the need to query each worker's cache state directly. The tree stores raw text characters instead of token IDs to avoid tokenization overhead.

Process (step 1 is sketched in code after this list):

1. For each request, find the worker with the highest prefix match.

   - If the match rate > `cache_threshold`, route the request to the worker with the highest match rate (it likely has the relevant data cached)
   - If the match rate ≤ `cache_threshold`, route the request to the worker with the smallest tree size (the most available cache capacity)

2. Background maintenance: periodically evict least recently used leaf nodes from the approximate tree to prevent unbounded memory growth.
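
A minimal sketch of the per-request decision in step 1. A flat per-worker history of recently routed texts stands in for the approximate radix tree, and all names are illustrative:

```python
import os

def pick_worker(histories, text, cache_threshold=0.5):
    """histories: worker URL -> list of recently routed texts."""
    def match_rate(worker):
        longest = max(
            (len(os.path.commonprefix([text, seen])) for seen in histories[worker]),
            default=0,
        )
        return longest / max(len(text), 1)

    best = max(histories, key=match_rate)
    if match_rate(best) > cache_threshold:
        return best  # likely has the relevant prefix cached
    # Weak match: pick the worker with the least recorded text,
    # i.e. the most available cache capacity.
    return min(histories, key=lambda w: sum(len(s) for s in histories[w]))

histories = {"http://w1": ["What is the capital of France?"], "http://w2": []}
print(pick_worker(histories, "What is the capital of Italy?"))  # http://w1
```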

***Load-Balancing (Shortest Queue)***

For imbalanced systems, this strategy tracks the number of pending requests on each worker and routes new requests to the least busy one, keeping the load evenly distributed across workers.
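
A minimal sketch of the shortest-queue rule, assuming `pending` maps each worker URL to its in-flight request count:

```python
def pick_least_busy(pending):
    """pending: worker URL -> number of in-flight requests."""
    return min(pending, key=pending.get)

pending = {"http://w1": 7, "http://w2": 3}
print(pick_least_busy(pending))  # http://w2
```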

## Configuration Parameters

1. `cache_threshold`: (float, 0.0 to 1.0, default: 0.5)
- Minimum prefix match ratio to use highest-match routing.
- Below this threshold, the request is routed to the worker with the most available cache space.

2. `balance_abs_threshold`: (integer, default: 32)
- Absolute difference threshold for load imbalance detection.
- The system is potentially imbalanced if (max_load - min_load) > balance_abs_threshold.

3. `balance_rel_threshold`: (float, default: 1.0001)
- Relative ratio threshold for load imbalance detection.
- The system is potentially imbalanced if max_load > min_load * balance_rel_threshold.
- Used in conjunction with `balance_abs_threshold` to determine the final imbalance state.

4. `eviction_interval`: (integer, default: 60)
- Interval in seconds between LRU eviction cycles for the approximate trees.
- A background thread periodically evicts least recently used nodes to keep the tree within `max_tree_size` (see the sketch after this list).

5. `max_tree_size`: (integer, default: 16777216)
- Maximum number of nodes in each approximate tree.
- When exceeded, LRU leaf nodes are evicted during the next eviction cycle.
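
A minimal sketch of the eviction cycle controlled by `eviction_interval` and `max_tree_size`. The `ApproxTree` class is a toy placeholder for the router's actual tree:

```python
import time
from collections import OrderedDict

class ApproxTree:
    """Toy stand-in for one worker's approximate radix tree:
    an LRU-ordered collection of leaf nodes."""
    def __init__(self):
        self.leaves = OrderedDict()  # leaf key -> payload, oldest first

    def node_count(self):
        return len(self.leaves)

    def evict_lru_leaf(self):
        self.leaves.popitem(last=False)  # drop the least recently used leaf

def eviction_loop(tree, eviction_interval=60, max_tree_size=16_777_216):
    """Every eviction_interval seconds, evict LRU leaves until the tree
    is back under max_tree_size nodes."""
    while True:
        time.sleep(eviction_interval)
        while tree.node_count() > max_tree_size:
            tree.evict_lru_leaf()
```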
@@ -177,18 +177,7 @@
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
- "outputs": [
-  {
-   "data": {
-    "text/plain": [
-     "'The World Health Organization formally declared an end to the COVID-19 global health emergency'"
-    ]
-   },
-   "execution_count": null,
-   "metadata": {},
-   "output_type": "execute_result"
-  }
- ],
+ "outputs": [],
  "source": [
   "@trace\n",
   "def rag_pipeline(question: str) -> str:\n",
@@ -307,18 +296,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The World Health Organization formally declared an end to the COVID-19 global health emergency in May 2023.'"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"@trace\n",
"def rag_pipeline(question: str) -> str:\n",
Expand Down Expand Up @@ -355,15 +333,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: nest-asyncio in /Users/joschkabraun/miniconda3/envs/sglang/lib/python3.10/site-packages (1.6.0)\r\n"
]
}
],
"outputs": [],
"source": [
"!pip install nest-asyncio\n",
"import nest_asyncio\n",
Expand All @@ -382,45 +352,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Run name set to: sneak-weal, since a name was not provided.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 100/100 [00:27<00:00, 3.63it/s]\n",
"Waiting for evaluations to finish: 100%|██████████| 19/19 [00:10<00:00, 1.89it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Experiment RAG Run sneak-weal stats:\n",
"{\n",
" \"latency\": \"2.69\",\n",
" \"input_tokens\": \"61.26\",\n",
" \"output_tokens\": \"75.88\",\n",
" \"total_tokens\": \"137.14\",\n",
" \"cost\": \"0.00\",\n",
" \"answer_context_faithfulness_statement_level\": \"0.26\",\n",
" \"answer_matches_target_llm_grader\": \"0.22\",\n",
" \"context_query_relevancy\": \"0.27\",\n",
" \"percent_target_supported_by_context\": \"0.40\"\n",
"}\n",
"\n",
"\n",
"View experiment & traces at: https://app.parea.ai/experiments/RAG/30f0244a-d56c-44ff-bdfb-8f47626304b6\n",
"\n"
]
}
],
"outputs": [],
"source": [
"e = p.experiment(\n",
" \"RAG\",\n",