
Commit 30e26fc

first commit
Signed-off-by: PeaBrane <yanrpei@gmail.com>
1 parent 3c7c1d6 commit 30e26fc

File tree: 16 files changed (+29 / -29 lines)


README.md

Lines changed: 2 additions & 2 deletions

@@ -120,7 +120,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl
 ```
 # Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router.
 # Pass the TLS certificate and key paths to use HTTPS instead of HTTP.
-python -m dynamo.frontend --http-port 8080 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
+python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem]
 
 # Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these,
 # both for the same model and for multiple models. The frontend node will discover them.

@@ -130,7 +130,7 @@ python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
 #### Send a Request
 
 ```bash
-curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
 "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
 "messages": [
 {
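The `curl` request in the hunk above can also be issued from Python against the new default port 8000. A minimal sketch, assuming a frontend is already listening locally; the helper names (`build_chat_payload`, `send_chat`) are illustrative, not part of Dynamo:

```python
import json
from urllib import request

def build_chat_payload(model: str, prompt: str) -> bytes:
    """Build an OpenAI-style chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def send_chat(payload: bytes, port: int = 8000) -> dict:
    """POST the payload to a locally running frontend (requires a live server)."""
    req = request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Building the payload needs no server; sending it does.
payload = build_chat_payload("deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "Hello")
```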

components/backends/mocker/README.md

Lines changed: 1 addition & 1 deletion

@@ -37,7 +37,7 @@ python -m dynamo.mocker \
   --enable-prefix-caching
 
 # Start frontend server
-python -m dynamo.frontend --http-port 8080
+python -m dynamo.frontend --http-port 8000
 ```
 
 ### Legacy JSON file support:

components/backends/vllm/deepseek-r1.md

Lines changed: 1 addition & 1 deletion

@@ -26,7 +26,7 @@ node 1
 On node 0 (where the frontend was started) send a test request to verify your deployment:
 
 ```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "deepseek-ai/DeepSeek-R1",

components/backends/vllm/deploy/README.md

Lines changed: 1 addition & 1 deletion

@@ -197,7 +197,7 @@ See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/se
 Send a test request to verify your deployment:
 
 ```bash
-curl localhost:8080/v1/chat/completions \
+curl localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "Qwen/Qwen3-0.6B",

components/frontend/README.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # Dynamo frontend node.
 
-Usage: `python -m dynamo.frontend [--http-port 8080]`.
+Usage: `python -m dynamo.frontend [--http-port 8000]`.
 
 This runs an OpenAI compliant HTTP server, a pre-processor, and a router in a single process. Engines / workers are auto-discovered when they call `register_llm`.
 

deploy/metrics/README.md

Lines changed: 1 addition & 1 deletion

@@ -23,7 +23,7 @@ graph TD
     PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
     PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
     PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
-    PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080]
+    PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
     PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081]
     DYNAMOFE --> DYNAMOBACKEND
     GRAFANA -->|:9090/query API| PROMETHEUS

docs/architecture/dynamo_flow.md

Lines changed: 2 additions & 2 deletions

@@ -23,7 +23,7 @@ This diagram shows the NVIDIA Dynamo disaggregated inference system as implement
 The primary user journey through the system:
 
 1. **Discovery (S1)**: Client discovers the service endpoint
-2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8080)
+2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
 3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing
 4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker
 

@@ -84,7 +84,7 @@ graph TD
     %% Top Layer - Client & Frontend
     Client["<b>HTTP Client</b>"]
     S1[["<b>1 DISCOVERY</b>"]]
-    Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8080</i>"]
+    Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
     S2[["<b>2 REQUEST</b>"]]
 
     %% Processing Layer

docs/components/router/README.md

Lines changed: 2 additions & 2 deletions

@@ -14,12 +14,12 @@ The Dynamo KV Router intelligently routes requests by evaluating their computati
 To launch the Dynamo frontend with the KV Router:
 
 ```bash
-python -m dynamo.frontend --router-mode kv --http-port 8080
+python -m dynamo.frontend --router-mode kv --http-port 8000
 ```
 
 This command:
 - Launches the Dynamo frontend service with KV routing enabled
-- Exposes the service on port 8080 (configurable)
+- Exposes the service on port 8000 (configurable)
 - Automatically handles all backend workers registered to the Dynamo endpoint
 
 Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:

docs/guides/dynamo_deploy/create_deployment.md

Lines changed: 1 addition & 1 deletion

@@ -88,7 +88,7 @@ Here's a template structure based on the examples:
 Consult the corresponding sh file. Each of the python commands to launch a component will go into your yaml spec under the
 `extraPodSpec: -> mainContainer: -> args:`
 
-The front end is launched with "python3 -m dynamo.frontend [--http-port 8080] [--router-mode kv]"
+The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]"
 Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command.
 If you are a Dynamo contributor the [dynamo run guide](../dynamo_run.md) for details on how to run this command.
 

docs/guides/dynamo_run.md

Lines changed: 2 additions & 2 deletions

@@ -72,12 +72,12 @@ You can also list models or send a request:
 
 *List the models*
 ```
-curl localhost:8080/v1/models
+curl localhost:8000/v1/models
 ```
 
 *Send a request*
 ```
-curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8080/v1/chat/completions
+curl -d '{"model": "Llama-3.2-3B-Instruct-Q4_K_M", "max_completion_tokens": 2049, "messages":[{"role":"user", "content": "What is the capital of South Africa?" }]}' -H 'Content-Type: application/json' http://localhost:8000/v1/chat/completions
 ```
 
 ## Distributed System
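The `/v1/models` endpoint touched by this hunk returns an OpenAI-style model list. A small sketch of checking the served models on the updated port 8000; the helper names (`list_model_ids`, `fetch_models`) are illustrative, not part of Dynamo:

```python
import json
from urllib import request

def list_model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]

def fetch_models(port: int = 8000) -> dict:
    """GET the model list from a locally running frontend (requires a live server)."""
    with request.urlopen(f"http://localhost:{port}/v1/models") as resp:
        return json.load(resp)

# Shape of the response the helper expects (sample data, not a real server reply):
sample = {"object": "list", "data": [{"id": "Llama-3.2-3B-Instruct-Q4_K_M", "object": "model"}]}
print(list_model_ids(sample))  # -> ['Llama-3.2-3B-Instruct-Q4_K_M']
```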
