diff --git a/README.md b/README.md
index 48561cbe08..5515f0c17d 100644
--- a/README.md
+++ b/README.md
@@ -120,7 +120,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router.
# Pass the TLS certificate and key paths to use HTTPS instead of HTTP.
-python -m dynamo.frontend --http-port 8080 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
+python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
# Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
@@ -130,7 +130,7 @@ python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
#### Send a Request
```bash
-curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [
{
diff --git a/components/backends/mocker/README.md b/components/backends/mocker/README.md
index 8b4ad23d6b..32594b26ab 100644
--- a/components/backends/mocker/README.md
+++ b/components/backends/mocker/README.md
@@ -37,7 +37,7 @@ python -m dynamo.mocker \
--enable-prefix-caching
# Start frontend server
-python -m dynamo.frontend --http-port 8080
+python -m dynamo.frontend --http-port 8000
```
### Legacy JSON file support:
diff --git a/components/backends/vllm/deepseek-r1.md b/components/backends/vllm/deepseek-r1.md
index dc5b0596a0..9170c4159c 100644
--- a/components/backends/vllm/deepseek-r1.md
+++ b/components/backends/vllm/deepseek-r1.md
@@ -26,7 +26,7 @@ node 1
On node 0 (where the frontend was started) send a test request to verify your deployment:
```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
diff --git a/components/backends/vllm/deploy/README.md b/components/backends/vllm/deploy/README.md
index dec2dbaf69..52f3d9d0b3 100644
--- a/components/backends/vllm/deploy/README.md
+++ b/components/backends/vllm/deploy/README.md
@@ -197,7 +197,7 @@ See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/se
Send a test request to verify your deployment:
```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
diff --git a/components/frontend/README.md b/components/frontend/README.md
index 193191e4a5..27d6a01d0c 100644
--- a/components/frontend/README.md
+++ b/components/frontend/README.md
@@ -1,6 +1,6 @@
# Dynamo frontend node.
-Usage: `python -m dynamo.frontend [--http-port 8080]`.
+Usage: `python -m dynamo.frontend [--http-port 8000]`.
This runs an OpenAI compliant HTTP server, a pre-processor, and a router in a single process. Engines / workers are auto-discovered when they call `register_llm`.
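A quick way to confirm that a worker was discovered after calling `register_llm` is to ask the frontend for its model list. A minimal sketch, assuming the frontend exposes the standard OpenAI `/v1/models` route on its default port 8000 and that the `requests` package is installed:

```python
import requests

# Each registered engine/worker should appear as a model entry.
models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
for model in models.get("data", []):
    print(model["id"])
```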
diff --git a/container/launch_message.txt b/container/launch_message.txt
index 73b4ac56db..87f62440c4 100644
--- a/container/launch_message.txt
+++ b/container/launch_message.txt
@@ -48,7 +48,7 @@ tools.
Try the following to begin interacting with a model:
> dynamo --help
-> python -m dynamo.frontend [--http-port 8080]
+> python -m dynamo.frontend [--http-port 8000]
> python -m dynamo.vllm Qwen/Qwen2.5-3B-Instruct
To run more complete deployment examples, instances of etcd and nats need to be
diff --git a/deploy/metrics/README.md b/deploy/metrics/README.md
index 054981aedd..16a6d502ca 100644
--- a/deploy/metrics/README.md
+++ b/deploy/metrics/README.md
@@ -23,7 +23,7 @@ graph TD
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
- PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+ PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS
diff --git a/docs/_includes/quick_start_local.rst b/docs/_includes/quick_start_local.rst
index 8d74d3d2ba..2f3ebcb010 100644
--- a/docs/_includes/quick_start_local.rst
+++ b/docs/_includes/quick_start_local.rst
@@ -24,7 +24,7 @@ Get started with Dynamo locally in just a few commands:
.. code-block:: bash
- # Start the OpenAI compatible frontend (default port is 8080)
+ # Start the OpenAI compatible frontend (default port is 8000)
python -m dynamo.frontend
# In another terminal, start an SGLang worker
@@ -34,7 +34,7 @@ Get started with Dynamo locally in just a few commands:
.. code-block:: bash
- curl localhost:8080/v1/chat/completions \
+ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
diff --git a/docs/architecture/dynamo_flow.md b/docs/architecture/dynamo_flow.md
index 23240fab5b..865c98ab5c 100644
--- a/docs/architecture/dynamo_flow.md
+++ b/docs/architecture/dynamo_flow.md
@@ -23,7 +23,7 @@ This diagram shows the NVIDIA Dynamo disaggregated inference system as implement
The primary user journey through the system:
1. **Discovery (S1)**: Client discovers the service endpoint
-2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8080)
+2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
@@ -84,7 +84,7 @@ graph TD
%% Top Layer - Client & Frontend
Client["HTTP Client"]
S1[["1 DISCOVERY"]]
- Frontend["Frontend
OpenAI Compatible Server
Port 8080"]
+ Frontend["Frontend
OpenAI Compatible Server
Port 8000"]
S2[["2 REQUEST"]]
%% Processing Layer
diff --git a/docs/components/router/README.md b/docs/components/router/README.md
index ef6656fda1..d309f5faa5 100644
--- a/docs/components/router/README.md
+++ b/docs/components/router/README.md
@@ -14,12 +14,12 @@ The Dynamo KV Router intelligently routes requests by evaluating their computati
To launch the Dynamo frontend with the KV Router:
```bash
-python -m dynamo.frontend --router-mode kv --http-port 8080
+python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
-- Exposes the service on port 8080 (configurable)
+- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
diff --git a/docs/guides/dynamo_deploy/create_deployment.md b/docs/guides/dynamo_deploy/create_deployment.md
index 50007a096a..a34865314c 100644
--- a/docs/guides/dynamo_deploy/create_deployment.md
+++ b/docs/guides/dynamo_deploy/create_deployment.md
@@ -88,7 +88,7 @@ Here's a template structure based on the examples:
Consult the corresponding `.sh` file. Each of the Python commands that launches a component goes into your YAML spec under
`extraPodSpec: -> mainContainer: -> args:`
-The front end is launched with "python3 -m dynamo.frontend [--http-port 8080] [--router-mode kv]"
+The frontend is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
Each worker is launched with a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
If you are a Dynamo contributor, see the [dynamo run guide](../dynamo_run.md) for details on how to run this command.
diff --git a/docs/guides/metrics.md b/docs/guides/metrics.md
index 73699777d3..c0499b0bf6 100644
--- a/docs/guides/metrics.md
+++ b/docs/guides/metrics.md
@@ -79,7 +79,7 @@ graph TD
PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
- PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+ PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
DYNAMOFE --> DYNAMOBACKEND
GRAFANA -->|:9090/query API| PROMETHEUS
diff --git a/docs/guides/planner_benchmark/README.md b/docs/guides/planner_benchmark/README.md
index 4332c3cdb5..9e74117f43 100644
--- a/docs/guides/planner_benchmark/README.md
+++ b/docs/guides/planner_benchmark/README.md
@@ -46,7 +46,7 @@ genai-perf profile \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--endpoint-type chat \
- --url http://localhost:8080 \
+ --url http://localhost:8000 \
--streaming \
--input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
```
@@ -76,7 +76,7 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
# TODO
# in terminal 2
-genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8080 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
+genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
```
## Results
diff --git a/docs/support_matrix.md b/docs/support_matrix.md
index 340dd7a6eb..f6019c003a 100644
--- a/docs/support_matrix.md
+++ b/docs/support_matrix.md
@@ -85,7 +85,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
> [!Caution]
-> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8080 for frontend).
+> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
## Build Support
diff --git a/examples/multimodal/README.md b/examples/multimodal/README.md
index f2c0f96d2c..16a3c1cc54 100644
--- a/examples/multimodal/README.md
+++ b/examples/multimodal/README.md
@@ -73,7 +73,7 @@ bash launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
@@ -146,7 +146,7 @@ bash launch/disagg.sh --model llava-hf/llava-1.5-7b-hf
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
@@ -223,7 +223,7 @@ bash launch/agg_llama.sh
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
@@ -295,7 +295,7 @@ bash launch/disagg_llama.sh
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
@@ -366,7 +366,7 @@ bash launch/video_agg.sh
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
@@ -455,7 +455,7 @@ bash launch/video_disagg.sh
In another terminal:
```bash
-curl http://localhost:8080/v1/chat/completions \
+curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-hf/LLaVA-NeXT-Video-7B-hf",
diff --git a/lib/runtime/examples/system_metrics/README.md b/lib/runtime/examples/system_metrics/README.md
index 6ab654da41..dfbd4291d0 100644
--- a/lib/runtime/examples/system_metrics/README.md
+++ b/lib/runtime/examples/system_metrics/README.md
@@ -185,7 +185,7 @@ DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 cargo run --bin system_server
This starts a system status server on the specified port (8081 in this example) that exposes the Prometheus metrics endpoint at `/metrics`.
-To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8080.
+To run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens on port 8000.
```
python -m dynamo.frontend &
@@ -202,5 +202,5 @@ Once running, you can query the metrics:
curl http://localhost:8081/metrics | grep -E "dynamo_component"
# Get all frontend metrics
-curl http://localhost:8080/metrics | grep -E "dynamo_frontend"
+curl http://localhost:8000/metrics | grep -E "dynamo_frontend"
```
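The two queries above can also be scripted. A small sketch (assuming the `requests` package) that fetches both endpoints and counts the matching metric lines:

```python
import requests

# Metrics endpoints and the prefixes the curl examples above grep for.
ENDPOINTS = {
    "http://localhost:8081/metrics": "dynamo_component",  # system status server
    "http://localhost:8000/metrics": "dynamo_frontend",   # HTTP frontend
}

for url, prefix in ENDPOINTS.items():
    lines = requests.get(url, timeout=10).text.splitlines()
    matches = [line for line in lines if prefix in line]
    print(f"{url}: {len(matches)} lines containing '{prefix}'")
```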
diff --git a/tests/lmcache/README.md b/tests/lmcache/README.md
index afb8f4545c..37ee3389cb 100644
--- a/tests/lmcache/README.md
+++ b/tests/lmcache/README.md
@@ -62,13 +62,13 @@ python3 summarize_scores_dynamo.py
### Baseline Architecture (deploy-baseline-dynamo.sh)
```
-HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → Direct Inference
+HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → Direct Inference
Environment: ENABLE_LMCACHE=0
```
### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh)
```
-HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → LMCache-enabled Inference
+HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → LMCache-enabled Inference
Environment: ENABLE_LMCACHE=1
LMCACHE_CHUNK_SIZE=256
LMCACHE_LOCAL_CPU=True
@@ -80,7 +80,7 @@ Environment: ENABLE_LMCACHE=1
Test scripts use Dynamo's Chat Completions API:
```bash
-curl -X POST http://localhost:8080/v1/chat/completions \
+curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": Qwen/Qwen3-0.6B,
diff --git a/tests/lmcache/mmlu-baseline-dynamo.py b/tests/lmcache/mmlu-baseline-dynamo.py
index 943d411206..d8cfc22016 100644
--- a/tests/lmcache/mmlu-baseline-dynamo.py
+++ b/tests/lmcache/mmlu-baseline-dynamo.py
@@ -18,7 +18,7 @@
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py
# ASSUMPTIONS:
-# 1. dynamo is running (default: localhost:8080) without LMCache
+# 1. dynamo is running (default: localhost:8000) without LMCache
# 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory
# (for later consolidation)
diff --git a/tests/lmcache/mmlu-lmcache_enabled-dynamo.py b/tests/lmcache/mmlu-lmcache_enabled-dynamo.py
index 405ff6d5db..a07ef27750 100644
--- a/tests/lmcache/mmlu-lmcache_enabled-dynamo.py
+++ b/tests/lmcache/mmlu-lmcache_enabled-dynamo.py
@@ -17,7 +17,7 @@
# Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py
# ASSUMPTIONS:
-# 1. dynamo is running (default: localhost:8080) with LMCache enabled
+# 1. dynamo is running (default: localhost:8000) with LMCache enabled
# 2. the mmlu dataset is in a "data" directory
# 3. all invocations of this script should be run in the same directory
# (for later consolidation)