diff --git a/README.md b/README.md index 48561cbe08..5515f0c17d 100644 --- a/README.md +++ b/README.md @@ -120,7 +120,7 @@ Dynamo provides a simple way to spin up a local set of inference components incl ``` # Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router. # Pass the TLS certificate and key paths to use HTTPS instead of HTTP. -python -m dynamo.frontend --http-port 8080 [--tls-cert-path cert.pem] [--tls-key-path key.pem] +python -m dynamo.frontend --http-port 8000 [--tls-cert-path cert.pem] [--tls-key-path key.pem] # Start the SGLang engine, connecting to NATS and etcd to receive requests. You can run several of these, # both for the same model and for multiple models. The frontend node will discover them. @@ -130,7 +130,7 @@ python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B #### Send a Request ```bash -curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ +curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "messages": [ { diff --git a/components/backends/mocker/README.md b/components/backends/mocker/README.md index 8b4ad23d6b..32594b26ab 100644 --- a/components/backends/mocker/README.md +++ b/components/backends/mocker/README.md @@ -37,7 +37,7 @@ python -m dynamo.mocker \ --enable-prefix-caching # Start frontend server -python -m dynamo.frontend --http-port 8080 +python -m dynamo.frontend --http-port 8000 ``` ### Legacy JSON file support: diff --git a/components/backends/vllm/deepseek-r1.md b/components/backends/vllm/deepseek-r1.md index dc5b0596a0..9170c4159c 100644 --- a/components/backends/vllm/deepseek-r1.md +++ b/components/backends/vllm/deepseek-r1.md @@ -26,7 +26,7 @@ node 1 On node 0 (where the frontend was started) send a test request to verify your deployment: ```bash -curl localhost:8080/v1/chat/completions \ +curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "deepseek-ai/DeepSeek-R1", diff --git a/components/backends/vllm/deploy/README.md b/components/backends/vllm/deploy/README.md index dec2dbaf69..52f3d9d0b3 100644 --- a/components/backends/vllm/deploy/README.md +++ b/components/backends/vllm/deploy/README.md @@ -197,7 +197,7 @@ See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/se Send a test request to verify your deployment: ```bash -curl localhost:8080/v1/chat/completions \ +curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-0.6B", diff --git a/components/frontend/README.md b/components/frontend/README.md index 193191e4a5..27d6a01d0c 100644 --- a/components/frontend/README.md +++ b/components/frontend/README.md @@ -1,6 +1,6 @@ # Dynamo frontend node. -Usage: `python -m dynamo.frontend [--http-port 8080]`. +Usage: `python -m dynamo.frontend [--http-port 8000]`. This runs an OpenAI compliant HTTP server, a pre-processor, and a router in a single process. Engines / workers are auto-discovered when they call `register_llm`. diff --git a/container/launch_message.txt b/container/launch_message.txt index 73b4ac56db..87f62440c4 100644 --- a/container/launch_message.txt +++ b/container/launch_message.txt @@ -48,7 +48,7 @@ tools. 
Try the following to begin interacting with a model: > dynamo --help -> python -m dynamo.frontend [--http-port 8080] +> python -m dynamo.frontend [--http-port 8000] > python -m dynamo.vllm Qwen/Qwen2.5-3B-Instruct To run more complete deployment examples, instances of etcd and nats need to be diff --git a/deploy/metrics/README.md b/deploy/metrics/README.md index 054981aedd..16a6d502ca 100644 --- a/deploy/metrics/README.md +++ b/deploy/metrics/README.md @@ -23,7 +23,7 @@ graph TD PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380] PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401] PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP - PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080] + PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000] PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081] DYNAMOFE --> DYNAMOBACKEND GRAFANA -->|:9090/query API| PROMETHEUS diff --git a/docs/_includes/quick_start_local.rst b/docs/_includes/quick_start_local.rst index 8d74d3d2ba..2f3ebcb010 100644 --- a/docs/_includes/quick_start_local.rst +++ b/docs/_includes/quick_start_local.rst @@ -24,7 +24,7 @@ Get started with Dynamo locally in just a few commands: .. code-block:: bash - # Start the OpenAI compatible frontend (default port is 8080) + # Start the OpenAI compatible frontend (default port is 8000) python -m dynamo.frontend # In another terminal, start an SGLang worker @@ -34,7 +34,7 @@ Get started with Dynamo locally in just a few commands: .. code-block:: bash - curl localhost:8080/v1/chat/completions \ + curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}], diff --git a/docs/architecture/dynamo_flow.md b/docs/architecture/dynamo_flow.md index 23240fab5b..865c98ab5c 100644 --- a/docs/architecture/dynamo_flow.md +++ b/docs/architecture/dynamo_flow.md @@ -23,7 +23,7 @@ This diagram shows the NVIDIA Dynamo disaggregated inference system as implement The primary user journey through the system: 1. **Discovery (S1)**: Client discovers the service endpoint -2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8080) +2. **Request (S2)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000) 3. **Validate (S3)**: Frontend forwards request to Processor for validation and routing 4. **Route (S3)**: Processor routes the validated request to appropriate Decode Worker @@ -84,7 +84,7 @@ graph TD %% Top Layer - Client & Frontend Client["HTTP Client"] S1[["1 DISCOVERY"]] - Frontend["Frontend
<br/>OpenAI Compatible Server<br/>Port 8080<br/>"]
+    Frontend["Frontend<br/>OpenAI Compatible Server<br/>Port 8000<br/>
"] S2[["2 REQUEST"]] %% Processing Layer diff --git a/docs/components/router/README.md b/docs/components/router/README.md index ef6656fda1..d309f5faa5 100644 --- a/docs/components/router/README.md +++ b/docs/components/router/README.md @@ -14,12 +14,12 @@ The Dynamo KV Router intelligently routes requests by evaluating their computati To launch the Dynamo frontend with the KV Router: ```bash -python -m dynamo.frontend --router-mode kv --http-port 8080 +python -m dynamo.frontend --router-mode kv --http-port 8000 ``` This command: - Launches the Dynamo frontend service with KV routing enabled -- Exposes the service on port 8080 (configurable) +- Exposes the service on port 8000 (configurable) - Automatically handles all backend workers registered to the Dynamo endpoint Backend workers register themselves using the `register_llm` API, after which the KV Router automatically: diff --git a/docs/guides/dynamo_deploy/create_deployment.md b/docs/guides/dynamo_deploy/create_deployment.md index 50007a096a..a34865314c 100644 --- a/docs/guides/dynamo_deploy/create_deployment.md +++ b/docs/guides/dynamo_deploy/create_deployment.md @@ -88,7 +88,7 @@ Here's a template structure based on the examples: Consult the corresponding sh file. Each of the python commands to launch a component will go into your yaml spec under the `extraPodSpec: -> mainContainer: -> args:` -The front end is launched with "python3 -m dynamo.frontend [--http-port 8080] [--router-mode kv]" +The front end is launched with "python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]" Each worker will launch `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags `command. If you are a Dynamo contributor the [dynamo run guide](../dynamo_run.md) for details on how to run this command. diff --git a/docs/guides/metrics.md b/docs/guides/metrics.md index 73699777d3..c0499b0bf6 100644 --- a/docs/guides/metrics.md +++ b/docs/guides/metrics.md @@ -79,7 +79,7 @@ graph TD PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380] PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401] PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP - PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080] + PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000] PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081] DYNAMOFE --> DYNAMOBACKEND GRAFANA -->|:9090/query API| PROMETHEUS diff --git a/docs/guides/planner_benchmark/README.md b/docs/guides/planner_benchmark/README.md index 4332c3cdb5..9e74117f43 100644 --- a/docs/guides/planner_benchmark/README.md +++ b/docs/guides/planner_benchmark/README.md @@ -46,7 +46,7 @@ genai-perf profile \ --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \ --endpoint-type chat \ - --url http://localhost:8080 \ + --url http://localhost:8000 \ --streaming \ --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl ``` @@ -76,7 +76,7 @@ In this example, we use a fixed 2p2d engine as baseline. 
Planner provides a `--n # TODO # in terminal 2 -genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8080 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl +genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl ``` ## Results diff --git a/docs/support_matrix.md b/docs/support_matrix.md index 340dd7a6eb..f6019c003a 100644 --- a/docs/support_matrix.md +++ b/docs/support_matrix.md @@ -85,7 +85,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo > [!Caution] -> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8080 for frontend). +> ¹ There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend). 
## Build Support diff --git a/examples/multimodal/README.md b/examples/multimodal/README.md index f2c0f96d2c..16a3c1cc54 100644 --- a/examples/multimodal/README.md +++ b/examples/multimodal/README.md @@ -73,7 +73,7 @@ bash launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct In another terminal: ```bash -curl http://localhost:8080/v1/chat/completions \ +curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "llava-hf/llava-1.5-7b-hf", @@ -146,7 +146,7 @@ bash launch/disagg.sh --model llava-hf/llava-1.5-7b-hf In another terminal: ```bash -curl http://localhost:8080/v1/chat/completions \ +curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "llava-hf/llava-1.5-7b-hf", @@ -223,7 +223,7 @@ bash launch/agg_llama.sh In another terminal: ```bash -curl http://localhost:8080/v1/chat/completions \ +curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", @@ -295,7 +295,7 @@ bash launch/disagg_llama.sh In another terminal: ```bash -curl http://localhost:8080/v1/chat/completions \ +curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", @@ -366,7 +366,7 @@ bash launch/video_agg.sh In another terminal: ```bash -curl http://localhost:8080/v1/chat/completions \ +curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "llava-hf/LLaVA-NeXT-Video-7B-hf", @@ -455,7 +455,7 @@ bash launch/video_disagg.sh In another terminal: ```bash -curl http://localhost:8080/v1/chat/completions \ +curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "llava-hf/LLaVA-NeXT-Video-7B-hf", diff --git a/lib/runtime/examples/system_metrics/README.md b/lib/runtime/examples/system_metrics/README.md index 6ab654da41..dfbd4291d0 100644 --- a/lib/runtime/examples/system_metrics/README.md +++ b/lib/runtime/examples/system_metrics/README.md @@ -185,7 +185,7 @@ DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 cargo run --bin system_server The server will start an system status server on the specified port (8081 in this example) that exposes the Prometheus metrics endpoint at `/metrics`. -To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8080. +To Run an actual LLM frontend + server (aggregated example), launch both of them. By default, the frontend listens to port 8000. 
``` python -m dynamo.frontend & @@ -202,5 +202,5 @@ Once running, you can query the metrics: curl http://localhost:8081/metrics | grep -E "dynamo_component" # Get all frontend metrics -curl http://localhost:8080/metrics | grep -E "dynamo_frontend" +curl http://localhost:8000/metrics | grep -E "dynamo_frontend" ``` diff --git a/tests/lmcache/README.md b/tests/lmcache/README.md index afb8f4545c..37ee3389cb 100644 --- a/tests/lmcache/README.md +++ b/tests/lmcache/README.md @@ -62,13 +62,13 @@ python3 summarize_scores_dynamo.py ### Baseline Architecture (deploy-baseline-dynamo.sh) ``` -HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → Direct Inference +HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → Direct Inference Environment: ENABLE_LMCACHE=0 ``` ### LMCache Architecture (deploy-lmcache_enabled-dynamo.sh) ``` -HTTP Request → Dynamo Ingress(8080) → Dynamo Worker → LMCache-enabled Inference +HTTP Request → Dynamo Ingress(8000) → Dynamo Worker → LMCache-enabled Inference Environment: ENABLE_LMCACHE=1 LMCACHE_CHUNK_SIZE=256 LMCACHE_LOCAL_CPU=True @@ -80,7 +80,7 @@ Environment: ENABLE_LMCACHE=1 Test scripts use Dynamo's Chat Completions API: ```bash -curl -X POST http://localhost:8080/v1/chat/completions \ +curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": Qwen/Qwen3-0.6B, diff --git a/tests/lmcache/mmlu-baseline-dynamo.py b/tests/lmcache/mmlu-baseline-dynamo.py index 943d411206..d8cfc22016 100644 --- a/tests/lmcache/mmlu-baseline-dynamo.py +++ b/tests/lmcache/mmlu-baseline-dynamo.py @@ -18,7 +18,7 @@ # Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/1-mmlu.py # ASSUMPTIONS: -# 1. dynamo is running (default: localhost:8080) without LMCache +# 1. dynamo is running (default: localhost:8000) without LMCache # 2. the mmlu dataset is in a "data" directory # 3. all invocations of this script should be run in the same directory # (for later consolidation) diff --git a/tests/lmcache/mmlu-lmcache_enabled-dynamo.py b/tests/lmcache/mmlu-lmcache_enabled-dynamo.py index 405ff6d5db..a07ef27750 100644 --- a/tests/lmcache/mmlu-lmcache_enabled-dynamo.py +++ b/tests/lmcache/mmlu-lmcache_enabled-dynamo.py @@ -17,7 +17,7 @@ # Reference: https://github.com/LMCache/LMCache/blob/dev/.buildkite/correctness/2-mmlu.py # ASSUMPTIONS: -# 1. dynamo is running (default: localhost:8080) with LMCache enabled +# 1. dynamo is running (default: localhost:8000) with LMCache enabled # 2. the mmlu dataset is in a "data" directory # 3. all invocations of this script should be run in the same directory # (for later consolidation)
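
A quick local sanity check of the new default port is sketched below. This is a reviewer aid, not part of the patch: it assumes etcd and NATS are already running, and it reuses the SGLang worker and model name from the top-level README quick start (swap in whatever backend and model you actually serve; the `max_tokens` value is illustrative).

```bash
# The frontend now binds to port 8000 by default, so no --http-port flag is needed.
python -m dynamo.frontend &

# In another terminal: any registered worker will do; this mirrors the SGLang
# example from the top-level README (model name taken from there).
python -m dynamo.sglang.worker --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B &

# The OpenAI-compatible API should now answer on 8000 instead of 8080 ...
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'

# ... and so should the frontend Prometheus metrics endpoint.
curl localhost:8000/metrics | grep -E "dynamo_frontend"
```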