Commit a8a9bd4 - Merge branch 'main' into jacky-ft-complete-final
2 parents 540342b + dda59e3

2 files changed: +102 -0 lines

docs/dynamo_glossary.md (96 additions, 0 deletions)
# NVIDIA Dynamo Glossary

## B

**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.
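As a back-of-the-envelope illustration of block accounting (a hypothetical helper, not a Dynamo API), the number of blocks a sequence occupies is a ceiling division over the block size:

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Number of fixed-size KV cache blocks a token sequence occupies."""
    return -(-num_tokens // block_size)  # ceiling division

# A 100-token prompt with 16-token blocks spans 7 blocks;
# the last block is only partially filled (4 of 16 slots used).
print(blocks_needed(100))  # -> 7
```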
## C

**Component** - The fundamental deployable unit in Dynamo. A discoverable service entity that can host multiple endpoints and typically maps to a Docker container (such as VllmWorker, Router, Processor).

**Conditional Disaggregation** - Dynamo's decision-making process within disaggregated serving that determines whether a request is processed locally or sent to a remote prefill engine, based on prefill length and queue status.

## D

**Decode Phase** - The second phase of LLM inference, which generates output tokens one at a time.

**depends()** - A Dynamo function that declares dependencies between services, enabling automatic client generation and service discovery.

**Disaggregated Serving** - Dynamo's core architecture, which separates the prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.

**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.

**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.

**Dynamo Artifact** - A packaged archive containing an inference graph and its dependencies, created with `dynamo build`. It is the containerized, deployable version of a Graph.

**Dynamo Cloud** - A Kubernetes platform providing a managed deployment experience for Dynamo inference graphs.

**dynamo build** - The CLI command that containerizes inference graphs, or parts of graphs, into Docker containers.

**dynamo deploy** - The CLI command that deploys inference graphs to Kubernetes using Helm charts or custom operators.

**dynamo run** - The CLI command for quickly experimenting with and testing models against various LLM engines.

**dynamo serve** - The CLI command that composes and serves inference graphs locally.

## E

**@endpoint** - A Python decorator used to define service endpoints within a Dynamo component.

**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.
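The @service and @endpoint decorators, together with depends(), give a component its shape. The sketch below is self-contained: the decorators are no-op stand-ins defined inline, since the real ones ship in Dynamo's Python SDK and their exact import path and arguments are assumptions here.

```python
# Stand-in decorators so this sketch runs on its own; the real
# @service, @endpoint, and depends() come from Dynamo's Python SDK.
def service(**config):
    def wrap(cls):
        cls.dynamo_config = config  # hypothetical attribute, for illustration
        return cls
    return wrap

def endpoint(fn):
    fn.is_endpoint = True  # mark the method as network-accessible
    return fn

def depends(svc_cls):
    # The real depends() returns an auto-generated client wired
    # through service discovery; here we just instantiate directly.
    return svc_cls()

@service(namespace="demo")  # a namespace groups related components
class Processor:
    @endpoint
    async def generate(self, prompt: str):
        yield f"echo: {prompt}"  # a real worker streams model tokens

class Frontend:
    processor = depends(Processor)
```

Endpoints stream responses, so `generate` is written as an async generator: one request in, many chunks out.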
## F

**Frontend** - Dynamo's API server component, which receives user requests and exposes OpenAI-compatible HTTP endpoints.

## G

**Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline, with single-in request paths and many-out (streaming) response paths. A graph can be packaged into a Dynamo Artifact for deployment.

## I

**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.

## K

**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.

**KV Cache** - The Key-Value cache, which stores computed attention states from previous tokens to avoid recomputation during inference.

**KV Router** - Dynamo's routing system, which directs each request to the worker with the highest cache overlap to maximize KV cache reuse, deciding routes from KV cache hit rates and worker metrics.

**KVIndexer** - Dynamo component that maintains a global view of cached blocks across all workers, using a prefix tree structure to calculate cache hit rates.
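A toy version of this cache-overlap routing (illustrative only: block hashing is simplified and the worker-metric terms of the real scoring are omitted) can be sketched as:

```python
def block_hashes(tokens, block_size=16):
    """Hash each complete block of tokens -- a stand-in for KV block IDs."""
    return [hash(tuple(tokens[i:i + block_size]))
            for i in range(0, len(tokens) - block_size + 1, block_size)]

def best_worker(request_tokens, worker_cached_blocks, block_size=16):
    """Route to the worker whose cached blocks share the longest prefix
    with the request's blocks (ties broken by dict order)."""
    request = block_hashes(request_tokens, block_size)

    def overlap(cached):
        n = 0
        for a, b in zip(request, cached):
            if a != b:
                break
            n += 1
        return n

    return max(worker_cached_blocks,
               key=lambda w: overlap(worker_cached_blocks[w]))

# Worker "a" has the first two 16-token blocks cached, "b" only the first.
caches = {
    "a": block_hashes(list(range(32))),
    "b": block_hashes(list(range(16))),
}
print(best_worker(list(range(48)), caches))  # -> a
```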
**KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.

## N

**Namespace** - Dynamo's logical grouping mechanism for related components. Like directories in a file system, namespaces prevent collisions between different deployments.

**NIXL (NVIDIA Inference tranXfer Library)** - High-performance data transfer library optimized for inference workloads, supporting direct GPU-to-GPU transfers and multiple memory hierarchies.

## P

**PagedAttention** - Memory management technique from vLLM that manages the KV cache efficiently by chunking requests into blocks.

**Planner** - Dynamo component that performs dynamic resource scaling based on real-time demand signals and system metrics.

**Prefill Phase** - The first phase of LLM inference, which processes the input prompt and populates the KV cache.

**Prefix Caching** - Optimization technique that reuses previously computed KV cache entries for common prompt prefixes.

**Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.

## R

**RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.

**RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.

## S

**@service** - A Python decorator used to define a Dynamo service class.

**SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.

## T

**Tensor Parallelism (TP)** - Model parallelism technique in which a model's weights are distributed across multiple GPUs.
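The column-parallel variant can be shown in a few lines of NumPy (a sketch of the idea only; real TP also shards attention heads and uses NCCL collectives rather than a concatenate):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 6))   # full weight matrix

# Column-parallel sharding across two "GPUs": each holds half the columns.
shards = np.split(W, 2, axis=1)
partials = [x @ w for w in shards]    # each device computes its output slice
y = np.concatenate(partials, axis=1)  # "all-gather" the slices

assert np.allclose(y, x @ W)          # identical to the unsharded matmul
```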
**TensorRT-LLM** - NVIDIA's optimized LLM inference engine, with multinode distributed support via MPI.

**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.
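Measured naively, TTFT is the wall-clock time until the first streamed token arrives. A minimal sketch against a dummy token generator (the 50 ms sleep stands in for the prefill phase):

```python
import time

def token_stream():
    """Dummy generator standing in for an engine's streamed output."""
    time.sleep(0.05)   # pretend this is prefill
    yield "Hello"      # the first token ends the TTFT window
    yield ","
    yield " world"

stream = token_stream()
start = time.perf_counter()
first = next(stream)   # generator body runs on the first next(), not at creation
ttft = time.perf_counter() - start
print(f"first token {first!r} after {ttft * 1000:.1f} ms")
```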
## V

**vLLM** - High-throughput LLM serving engine with Ray-based distributed support and PagedAttention.

## X

**xPyD (x Prefill y Decode)** - Dynamo notation for disaggregated serving configurations in which x prefill workers serve y decode workers. Dynamo supports runtime-reconfigurable xPyD.
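A hypothetical helper (not part of Dynamo) that reads the notation:

```python
import re

def parse_xpyd(spec: str) -> tuple[int, int]:
    """Parse xPyD notation, e.g. '2P4D' -> (2 prefill, 4 decode) workers."""
    m = re.fullmatch(r"(\d+)[Pp](\d+)[Dd]", spec)
    if m is None:
        raise ValueError(f"not xPyD notation: {spec!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_xpyd("2P4D"))  # -> (2, 4)
```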

docs/index.rst (6 additions, 0 deletions)

Appended after the existing examples toctree entries (`Multinode Examples <examples/multinode.md>`, `LLM Deployment Examples using TensorRT-LLM <examples/trtllm.md>`):

.. toctree::
   :hidden:
   :caption: Reference

   Glossary <dynamo_glossary.md>
   KVBM Reading <architecture/kvbm_reading.md>