# NVIDIA Dynamo Glossary

## B
**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.
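To make the idea concrete, a block is just a fixed-size partition of a token sequence. The helper below is an illustrative sketch (not Dynamo code); the function name and block size are chosen for the example:

```python
def chunk_into_blocks(token_ids, block_size=16):
    """Partition a token sequence into fixed-size blocks.

    The final block may be partially filled, just as the last
    KV cache block of a sequence is in PagedAttention.
    """
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids), block_size)]

# A 40-token prompt occupies 3 blocks: two full, one partial.
blocks = chunk_into_blocks(list(range(40)), block_size=16)
```

Allocating memory block-by-block rather than per-sequence is what lets the KV cache grow incrementally during decode without large contiguous reservations.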

## C
**Component** - The fundamental deployable unit in Dynamo. A discoverable service entity that can host multiple endpoints and typically maps to a Docker container (such as VllmWorker, Router, Processor).

**Conditional Disaggregation** - Dynamo's decision-making process within disaggregated serving that determines whether a request is processed locally or sent to a remote prefill engine, based on prefill length and queue status.
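The shape of that decision can be sketched as follows. This is an illustrative simplification, not Dynamo's actual policy: the function name, thresholds, and signals are all hypothetical.

```python
def should_prefill_remotely(prompt_len, prefill_queue_depth,
                            min_remote_len=512, max_queue_depth=4):
    """Decide whether to offload prefill to a remote engine.

    Short prompts are cheap enough to prefill locally; long prompts
    are offloaded unless the remote prefill queue is already backed
    up. All thresholds here are hypothetical.
    """
    if prompt_len < min_remote_len:
        return False  # local prefill: too short to justify the transfer
    return prefill_queue_depth < max_queue_depth  # avoid a congested remote queue

# A long prompt with an idle remote prefill queue gets offloaded.
decision = should_prefill_remotely(2048, prefill_queue_depth=0)
```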

## D
**Decode Phase** - The second phase of LLM inference, which generates output tokens one at a time.

**depends()** - A Dynamo function that declares dependencies between services, enabling automatic client generation and service discovery.

**Disaggregated Serving** - Dynamo's core architecture, which separates the prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.

**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.

**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.

**Dynamo Artifact** - A packaged archive containing an inference graph and its dependencies, created with `dynamo build`. It is the containerized, deployable form of a Graph.

**Dynamo Cloud** - A Kubernetes platform that provides a managed deployment experience for Dynamo inference graphs.

**dynamo build** - The CLI command that containerizes an inference graph, or part of one, into a Docker container.

**dynamo deploy** - The CLI command that deploys inference graphs to Kubernetes using Helm charts or custom operators.

**dynamo run** - The CLI command for quickly experimenting with and testing models across various LLM engines.

**dynamo serve** - The CLI command that composes and serves inference graphs locally.

## E
**@endpoint** - A Python decorator used to define service endpoints within a Dynamo component.

**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.

## F
**Frontend** - Dynamo's API server component, which receives user requests and exposes OpenAI-compatible HTTP endpoints.

## G
**Graph** - A collection of interconnected Dynamo components that forms a complete inference pipeline, with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.

## I
**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.

## K
**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.

**KV Cache** - The Key-Value cache, which stores computed attention states from previous tokens to avoid recomputation during inference.

**KV Router** - Dynamo's routing system, which directs requests to the workers with the highest cache overlap to maximize KV cache reuse. Routing decisions are based on KV cache hit rates and worker metrics.

**KVIndexer** - The Dynamo component that maintains a global view of cached blocks across all workers, using a prefix tree structure to calculate cache hit rates.

**KVPublisher** - The Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.
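The overlap computation at the heart of KV-aware routing can be sketched as below. This is an illustrative simplification: the real KVIndexer matches hashed blocks in a prefix tree, while this toy version compares raw per-worker block lists directly.

```python
def block_overlap(request_blocks, cached_blocks):
    """Count how many leading blocks of the request a worker
    already holds in its KV cache (prefix match only)."""
    n = 0
    for req, cached in zip(request_blocks, cached_blocks):
        if req != cached:
            break
        n += 1
    return n

def pick_worker(request_blocks, worker_caches):
    """Route to the worker whose cache shares the longest prefix
    with the incoming request."""
    return max(worker_caches,
               key=lambda w: block_overlap(request_blocks, worker_caches[w]))

caches = {"worker-a": ["sys", "doc1"],
          "worker-b": ["sys", "doc2", "q7"]}
best = pick_worker(["sys", "doc2", "q9"], caches)
```

Only the matching prefix counts, because the KV cache for a token depends on every token before it; a block that matches mid-sequence after a mismatch cannot be reused.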

## N
**Namespace** - Dynamo's logical grouping mechanism for related components. Like directories in a file system, namespaces prevent collisions between different deployments.

**NIXL (NVIDIA Inference Xfer Library)** - A high-performance data transfer library optimized for inference workloads, supporting direct GPU-to-GPU transfers and multiple memory hierarchies.

## P
**PagedAttention** - A memory management technique from vLLM that manages the KV cache efficiently by chunking requests into blocks.

**Planner** - The Dynamo component that performs dynamic resource scaling based on real-time demand signals and system metrics.

**Prefill Phase** - The first phase of LLM inference, which processes the input prompt and populates the KV cache.

**Prefix Caching** - An optimization technique that reuses previously computed KV cache entries for common prompt prefixes.

**Processor** - The Dynamo component that handles request preprocessing, tokenization, and routing decisions.

## R
**RadixAttention** - A technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.

**RDMA (Remote Direct Memory Access)** - A technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.

## S
**@service** - A Python decorator used to define a Dynamo service class.

**SGLang** - A fast LLM inference framework with native embedding support and RadixAttention.

## T
**Tensor Parallelism (TP)** - A model parallelism technique in which model weights are distributed across multiple GPUs.
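A minimal illustration of the idea, in pure Python with hypothetical shapes (real TP shards attention heads and MLP weight matrices, with collective communication between GPUs rather than a simple concatenation):

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def tp_matvec(weights, x, num_gpus=2):
    """Row-sharded matvec: each simulated 'GPU' holds a slice of rows.

    Each shard computes its slice of the output independently; the
    slices are then concatenated (in real TP, via an all-gather).
    """
    shard = len(weights) // num_gpus
    out = []
    for rank in range(num_gpus):
        out.extend(matvec(weights[rank * shard:(rank + 1) * shard], x))
    return out

W = [[1, 0], [0, 1], [2, 2], [3, -1]]
x = [5, 7]
# Sharded and unsharded results agree: [5, 7, 24, 8]
```

Because each shard needs only its own rows, the per-GPU memory footprint shrinks as the TP degree grows, which is what lets models larger than a single GPU's memory be served at all.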

**TensorRT-LLM** - NVIDIA's optimized LLM inference engine, with multinode distributed support via MPI.

**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.

## V
**vLLM** - A high-throughput LLM serving engine featuring PagedAttention, with distributed support via Ray.

## X
**xPyD (x Prefill y Decode)** - Dynamo notation for disaggregated serving configurations in which x prefill workers serve y decode workers. Dynamo supports runtime-reconfigurable xPyD.