Updates to architecture pages
aws-maens authored and ivashkst committed Dec 7, 2024
1 parent dc98b44 commit 619d496
Showing 19 changed files with 110 additions and 110 deletions.
@@ -128,7 +128,7 @@ Available Commands:

- ``llm-training``: Enable the compiler to perform optimizations applicable to large language model (LLM) training runs that shard parameters, gradients, and optimizer states across data-parallel workers. This is equivalent to the previously documented option argument value of ``NEMO``, which will be deprecated in a future release.

- :option:`--logical-nc-config <shard_degree>`: Instructs the compiler to shard the input graph across physical NeuronCore devices. Possible numeric values are {1, 2}. (only available on trn2; Default: ``2``)
- :option:`--logical-nc-config <shard_degree>`: Instructs the compiler to shard the input graph across physical NeuronCore accelerators. Possible numeric values are {1, 2}. (only available on trn2; Default: ``2``)

Valid values:

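As a usage illustration, here is a minimal sketch of how the two flags documented above might be passed to the Neuron compiler from a PyTorch training script, assuming the common ``NEURON_CC_FLAGS`` mechanism; the surrounding script and launch details are assumptions, not part of this commit.

.. code-block:: python

   import os

   # Hedged sketch: forward the flags described above to neuronx-cc through
   # NEURON_CC_FLAGS, which the Neuron framework integration reads when it
   # JIT-compiles the graph. Set this before the first compilation triggers.
   os.environ["NEURON_CC_FLAGS"] = " ".join([
       "--distribution-strategy=llm-training",  # shard params/grads/optimizer state
       "--logical-nc-config=2",                 # trn2 only; valid values are 1 and 2
   ])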
16 changes: 8 additions & 8 deletions general/arch/neuron-hardware/inf1-arch.rst
@@ -4,8 +4,8 @@ Amazon EC2 Inf1 Architecture
==============================

On this page, we provide an architectural overview of the Amazon EC2 Inf1
instance and the corresponding :ref:`Inferentia <inferentia-arch>` NeuronDevices that power
them (:ref:`Inferentia <inferentia-arch>` devices from here on).
instance and the corresponding :ref:`Inferentia <inferentia-arch>` NeuronChips that power
them (:ref:`Inferentia <inferentia-arch>` chips from here on).

.. contents:: Table of Contents
:local:
@@ -16,7 +16,7 @@ them (:ref:`Inferentia <inferentia-arch>` devices from here on).
Inf1 Architecture
-----------------

The EC2 Inf1 instance is powered by 16 :ref:`Inferentia <inferentia-arch>` devices, allowing
The EC2 Inf1 instance is powered by 16 :ref:`Inferentia <inferentia-arch>` chips, allowing
customers to choose between four instance sizes:

.. list-table::
@@ -27,14 +27,14 @@


* - Instance size
- # of Inferentia devices
- # of Inferentia chips
- vCPUs
- Host Memory (GiB)
- FP16/BF16 TFLOPS
- INT8 TOPS
- Device Memory (GiB)
- Device Memory bandwidth (GiB/sec)
- NeuronLink-v1 device-to-device bandwidth (GiB/sec/device)
- Chip Memory (GiB)
- Chip Memory bandwidth (GiB/sec)
- NeuronLink-v1 chip-to-chip bandwidth (GiB/sec/chip)
- EFA bandwidth (Gbps)

* - Inf1.xlarge
@@ -84,7 +84,7 @@



Inf1 offers a direct device-to-device interconnect called NeuronLink-v1,
Inf1 offers a direct chip-to-chip interconnect called NeuronLink-v1,
which enables co-optimizing latency and throughput via the :ref:`Neuron Core Pipeline <neuroncore-pipeline>` technology.

.. image:: /images/inf1-server-arch.png
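To illustrate the Neuron Core Pipeline mentioned above, a hedged sketch using the Inf1 ``torch-neuron`` package follows; the model, input shape, and core count are placeholders, and the ``--neuroncore-pipeline-cores`` flag comes from the Inf1 compiler documentation rather than this commit.

.. code-block:: python

   import torch
   import torch_neuron  # Inf1-only package; provides the trace entry point

   # Placeholder model and example input; real workloads trace their own model.
   model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU()).eval()
   example = torch.rand(1, 128)

   # Ask the compiler to shard the model across the four NeuronCores of one
   # Inferentia chip; larger core counts would span chips over NeuronLink-v1.
   pipelined = torch_neuron.trace(
       model,
       example,
       compiler_args=["--neuroncore-pipeline-cores", "4"],
   )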
14 changes: 7 additions & 7 deletions general/arch/neuron-hardware/inf2-arch.rst
@@ -4,13 +4,13 @@ Amazon EC2 Inf2 Architecture
=============================

On this page we provide an architectural overview of the Amazon EC2 Inf2
instances and the corresponding Inferentia2 NeuronDevices that power
them (Inferentia2 devices from here on).
instances and the corresponding Inferentia2 NeuronChips that power
them (Inferentia2 chips from here on).

Inf2 Architecture
-----------------

The EC2 Inf2 instance is powered by up to 12 :ref:`Inferentia2 devices <inferentia2-arch>`, and allows
The EC2 Inf2 instance is powered by up to 12 :ref:`Inferentia2 chips <inferentia2-arch>`, and allows
customers to choose between four instance sizes:

.. list-table::
@@ -20,14 +20,14 @@
:align: left

* - Instance size
- # of Inferentia2 devices
- # of Inferentia2 chips
- vCPUs
- Host Memory (GiB)
- FP8/FP16/BF16/TF32 TFLOPS
- FP32 TFLOPS
- Device Memory (GiB)
- Chip Memory (GiB)
- Instance Memory Bandwidth (GiB/sec)
- NeuronLink-v2 device-to-device (GiB/sec/device)
- NeuronLink-v2 chip-to-chip (GiB/sec/chip)

* - Inf2.xlarge
- 1
@@ -73,7 +73,7 @@
Inf2 offers a low-latency, high-bandwidth chip-to-chip interconnect
called NeuronLink-v2, which enables high-performance collective communication operations (e.g., AllReduce and AllGather).

This allows sharding large models across Inferentia2 devices (e.g., via
This allows sharding large models across Inferentia2 chips (e.g., via
Tensor Parallelism), thus optimizing latency and throughput. This
capability is especially useful when deploying Large Generative Models.

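To make the sharding claim concrete, here is a back-of-envelope sketch; the 70B-parameter model size is an assumption chosen for illustration, while the 32 GiB per-chip figure comes from the table above.

.. code-block:: python

   # Does a 70B-parameter model in BF16 fit across the 12 Inferentia2 chips
   # of the largest Inf2 instance when sharded with tensor parallelism?
   params = 70e9
   bytes_per_param = 2      # BF16
   tp_degree = 12           # tensor-parallel degree = number of chips
   chip_memory_gib = 32     # per-chip memory, from the table above

   per_chip_gib = params * bytes_per_param / tp_degree / 2**30
   print(f"{per_chip_gib:.1f} GiB needed of {chip_memory_gib} GiB per chip")
   # ~10.9 GiB, leaving headroom for activations and KV cache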
6 changes: 3 additions & 3 deletions general/arch/neuron-hardware/inferentia.rst
@@ -4,22 +4,22 @@
Inferentia Architecture
-----------------------

At the heart of each Inf1 instance are sixteen Inferentia devices, each with four :ref:`NeuronCore-v1 <neuroncores-v1-arch>`, as depicted
At the heart of each Inf1 instance are sixteen Inferentia chips, each with four :ref:`NeuronCore-v1 <neuroncores-v1-arch>`, as depicted
below:

.. image:: /images/inferentia-neurondevice.png



Each Inferentia device consists of:
Each Inferentia chip consists of:

+---------------+-------------------------------------------+
| Compute | Four |
| | :ref:`NeuronCore-v1 <neuroncores-v1-arch>`|
| | cores, delivering 128 INT8 TOPS and 64 |
| | FP16/BF16 TFLOPS |
+---------------+-------------------------------------------+
| Device Memory | 8GiB of device DRAM memory (for storing |
| Chip Memory | 8GiB of chip DRAM memory (for storing |
| | parameters and intermediate state), with |
| | 50 GiB/sec of bandwidth |
+---------------+-------------------------------------------+
10 changes: 5 additions & 5 deletions general/arch/neuron-hardware/inferentia2.rst
@@ -3,13 +3,13 @@
Inferentia2 Architecture
------------------------

At the heart of each Inf2 instance are up to twelve Inferentia2 devices (each with two :ref:`NeuronCore-v2 <neuroncores-v2-arch>` cores). Inferentia2 is the second
generation AWS purpose-built Machine Learning inference accelerator. The Inferentia2 device architecture is depicted below:
At the heart of each Inf2 instance are up to twelve Inferentia2 chips (each with two :ref:`NeuronCore-v2 <neuroncores-v2-arch>` cores). Inferentia2 is the second
generation AWS purpose-built Machine Learning inference accelerator. The Inferentia2 chip architecture is depicted below:

.. image:: /images/inferentia2.png


Each Inferentia2 device consists of:
Each Inferentia2 chip consists of:

+----------------------------------+----------------------------------+
| Compute | Two :ref:`NeuronCore-v2 |
@@ -18,7 +18,7 @@ Each Inferentia2 device consists of:
| | 190 FP16/BF16/cFP8/TF32 TFLOPS, |
| | and 47.5 FP32 TFLOPS. |
+----------------------------------+----------------------------------+
| Device Memory | 32GiB of high-bandwidth device |
| Chip Memory | 32GiB of high-bandwidth chip |
| | memory (HBM) (for storing model |
| | state), with 820 GiB/sec of |
| | bandwidth. |
Expand All @@ -28,7 +28,7 @@ Each Inferentia2 device consists of:
| | compression/decompression. |
+----------------------------------+----------------------------------+
| NeuronLink | NeuronLink-v2 for |
| | device-to-device interconnect |
| | chip-to-chip interconnect |
| | enables high-performance |
| | collective compute for |
| | co-optimization of latency and |
2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuron-core-v1.rst
@@ -5,7 +5,7 @@ NeuronCore-v1 Architecture
--------------------------

NeuronCore-v1 is the first generation NeuronCore engine, powering
the Inferentia NeuronDevices. Each NeuronCore-v1 is a fully-independent
the Inferentia chips. Each NeuronCore-v1 is a fully-independent
heterogeneous compute-unit, with three main engines (Tensor/Vector/Scalar
Engines), and on-chip software-managed SRAM memory, for
maximizing data locality (compiler managed, for maximum data locality
2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuron-core-v2.rst
@@ -4,7 +4,7 @@ NeuronCore-v2 Architecture
--------------------------

NeuronCore-v2 is the second generation of the NeuronCore engine,
powering the Trainium NeuronDevices. Each NeuronCore-v2 is a
powering the Trainium chips. Each NeuronCore-v2 is a
fully-independent heterogeneous compute-unit, with 4 main engines
(Tensor/Vector/Scalar/GPSIMD Engines), and on-chip
software-managed SRAM memory, for maximizing data locality (compiler
2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuron-core-v3.rst
@@ -3,7 +3,7 @@
NeuronCore-v3 Architecture
--------------------------

NeuronCore-v3 is the third-generation NeuronCore that powers Trainium2 devices. It is a fully-independent heterogeneous compute
NeuronCore-v3 is the third-generation NeuronCore that powers Trainium2 chips. It is a fully-independent heterogeneous compute
unit consisting of 4 main engines: Tensor, Vector, Scalar, and GPSIMD, with on-chip software-managed SRAM memory to maximize data
locality and optimize data prefetch. The following diagram shows a high-level overview of the NeuronCore-V3 architecture.

2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuron-devices.rst
@@ -5,7 +5,7 @@ Amazon EC2 AI Chips Architecture

Amazon EC2 AI Chips (Neuron Devices) are the accelerated machine learning chips (e.g., Inferentia or Trainium) that enable Trn and Inf instances.

For a detailed description of current Neuron Devices:
For a detailed description of current Neuron chips:

* :ref:`trainium2-arch`
* :ref:`trainium-arch`
4 changes: 2 additions & 2 deletions general/arch/neuron-hardware/neuron-instances.rst
@@ -1,7 +1,7 @@
.. _neuroninstances-arch:

Trn and Inf Instances Architecture
==================================
Instance and UltraServer Architecture
=====================================

For a detailed description of Trn Instances:

2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuroncores-arch.rst
@@ -3,7 +3,7 @@
AWS NeuronCore Architecture
===========================

NeuronCores are fully-independent heterogeneous compute-units that power Trainium, Trainium2, Inferentia, and Inferentia2 NeuronDevices.
NeuronCores are fully-independent heterogeneous compute-units that power Trainium, Trainium2, Inferentia, and Inferentia2 chips.
For a detailed description of current generation NeuronCore (NeuronCore-v3) hardware engines, see:

* :ref:`neuroncores-v3-arch`
12 changes: 6 additions & 6 deletions general/arch/neuron-hardware/trainium.rst
@@ -4,14 +4,14 @@
Trainium Architecture
----------------------

At the heart of the Trn1 instance are 16 x Trainium devices (each Trainium includes 2 x :ref:`NeuronCore-v2 <neuroncores-v2-arch>`). Trainium is the second
At the heart of the Trn1 instance are 16 x Trainium chips (each Trainium includes 2 x :ref:`NeuronCore-v2 <neuroncores-v2-arch>`). Trainium is the second
generation purpose-built Machine Learning accelerator from AWS. The
Trainium device architecture is depicted below:
Trainium chip architecture is depicted below:

.. image:: /images/trainium-neurondevice.png


Each Trainium device consists of:
Each Trainium chip consists of:

+----------------------------------+----------------------------------+
| Compute | Two :ref:`NeuronCore-v2 |
Expand All @@ -20,7 +20,7 @@ Each Trainium device consists of:
| | 190 FP16/BF16/cFP8/TF32 TFLOPS, |
| | and 47.5 FP32 TFLOPS. |
+----------------------------------+----------------------------------+
| Device Memory | 32 GiB of device memory (for |
| Chip Memory | 32 GiB of chip memory (for |
| | storing model state), with 820 |
| | GiB/sec of bandwidth. |
+----------------------------------+----------------------------------+
@@ -29,11 +29,11 @@ Each Trainium device consists of:
| | compression/decompression. |
+----------------------------------+----------------------------------+
| NeuronLink | NeuronLink-v2 for |
| | device-to-device interconnect |
| | chip-to-chip interconnect |
| | enables efficient scale-out |
| | training, as well as memory |
| | pooling between the different |
| | Trainium devices. |
| | Trainium chips. |
+----------------------------------+----------------------------------+
| Programmability | Trainium supports dynamic shapes |
| | and control flow, via ISA |
40 changes: 20 additions & 20 deletions general/arch/neuron-hardware/trainium2.rst
@@ -4,39 +4,39 @@
Trainium2 Architecture
######################

Trainium2 is the third generation, purpose-built Machine Learning chip from AWS. It powers Amazon EC2 trn2-16.48xlarge instances and
the u-trn2x64 UltraServer. Every Trainium2 device contains eight NeuronCore-V3. Beginning with Trainium2, AWS Neuron adds support for Logical
Trainium2 is the third generation, purpose-built Machine Learning chip from AWS. Every Trainium2 chip contains eight NeuronCore-V3. Beginning with Trainium2, AWS Neuron adds support for Logical
NeuronCore Configuration (LNC), which lets you combine the compute and memory resources of multiple physical NeuronCores into a
single logical NeuronCore. The following diagram shows the architecture overview of a Trainium2 device.
single logical NeuronCore. The following diagram shows the architecture overview of a Trainium2 chip.

.. image:: /images/architecture/Trainium2/trainium2.png
:align: center
:width: 400

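A short arithmetic sketch of what LNC means in practice; the core counts come from this page, and the mapping below is illustrative rather than an API.

.. code-block:: python

   # With LNC=2 (the trn2 default), pairs of physical NeuronCore-v3 are fused,
   # so software addresses half as many, twice-as-large logical cores.
   physical_cores_per_chip = 8      # eight NeuronCore-v3 per Trainium2 chip
   for lnc in (1, 2):               # valid --logical-nc-config values
       logical = physical_cores_per_chip // lnc
       print(f"LNC={lnc}: {logical} logical NeuronCores per chip")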
===========================
Trainium2 device components
Trainium2 chip components
===========================

Each Trainium2 device consists of the following components:
Each Trainium2 chip consists of the following components:

+----------------------------------+-----------------------------------------------------+
| Compute | Eight NeuronCore-v3 that collectively deliver: |
| | |
| | * 1,287 FP8 TFLOPS |
| | * 655 BF16/FP16/TF32 TFLOPS |
| | * 2,551 FP8/FP16/BF16/TF32 sparse TFLOPS |
| | * 1,299 FP8 TFLOPS |
| | * 667 BF16/FP16/TF32 TFLOPS |
| | * 2,563 FP8/FP16/BF16/TF32 sparse TFLOPS |
| | * 181 FP32 TFLOPS |
| | |
+----------------------------------+-----------------------------------------------------+
| Device Memory | 96 GiB of device memory with 2.9 TB/sec of |
| Chip Memory | 96 GiB of chip memory with 2.9 TB/sec of |
| | bandwidth. |
+----------------------------------+-----------------------------------------------------+
| Data Movement | 3.5 TB/sec of DMA bandwidth, with inline |
| | memory compression and decompression. |
+----------------------------------+-----------------------------------------------------+
| NeuronLink | NeuronLink-v3 for device-to-device interconnect |
| | provides 1.28 TB/sec bandwidth per device. It allows|
| NeuronLink | NeuronLink-v3 for chip-to-chip interconnect |
| | provides 1.28 TB/sec bandwidth per chip. It allows |
| | for efficient scale-out training and inference, as |
| | well as memory pooling between Trainium2 devices. |
| | well as memory pooling between Trainium2 chips. |
+----------------------------------+-----------------------------------------------------+
| Programmability | Trainium2 supports dynamic shapes and control flow |
| | via NeuronCore-v3 ISA extensions. Trainium2 also |
@@ -46,14 +46,14 @@ Each Trainium2 device consists of the following components:
| | custom operators via deeply embedded GPSIMD engines.|
+----------------------------------+-----------------------------------------------------+
| Collective communication | 20 CC-Cores orchestrate collective communication |
| | among Trainium2 devices within and across instances.|
| | among Trainium2 chips within and across instances. |
+----------------------------------+-----------------------------------------------------+

==================================
Trainium2 performance improvements
==================================

The following set of tables offers a comparison between Trainium and Trainium2 devices.
The following set of tables offers a comparison between Trainium and Trainium2 chips.

Compute
"""""""
@@ -71,19 +71,19 @@

* - FP8 (TFLOPS)
- 191
- 1287
- 1299
- 6.7x
* - BF16/FP16/TF32 (TFLOPS)
- 191
- 655
- 667
- 3.4x
* - FP32 (TFLOPS)
- 48
- 181
- 3.7x
* - FP8/FP16/BF16/TF32 Sparse (TFLOPS)
- Not applicable
- 2551
- 2563
- Not applicable

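A quick arithmetic check of the quoted improvement factors, using the TFLOPS values from the table above; note that the 6.7x and 3.4x factors match the pre-update figures (1,287 and 655) rather than the updated ones.

.. code-block:: python

   # Recompute the improvement factors from the updated TFLOPS values.
   trainium1 = {"FP8": 191, "BF16/FP16/TF32": 191, "FP32": 48}
   trainium2 = {"FP8": 1299, "BF16/FP16/TF32": 667, "FP32": 181}

   for dtype in trainium1:
       print(f"{dtype}: {trainium2[dtype] / trainium1[dtype]:.1f}x")
   # FP8: 6.8x, BF16/FP16/TF32: 3.5x, FP32: 3.8x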
Memory
@@ -113,8 +113,8 @@ Memory
- 224
- 4.7x
* - Memory Pool Size
- Up to 16 devices
- Up to 64 devices
- Up to 16 chips
- Up to 64 chips
- 4x

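Illustrative arithmetic for the pool sizes above; the per-chip memory figures (32 GiB for Trainium, 96 GiB for Trainium2) come from the component tables earlier on this page.

.. code-block:: python

   # Maximum memory pool implied by the table: chips in the pool times
   # per-chip memory.
   for name, chips, gib in [("Trainium", 16, 32), ("Trainium2", 64, 96)]:
       print(f"{name}: up to {chips * gib / 1024:.1f} TiB pooled")
   # Trainium: 0.5 TiB, Trainium2: 6.0 TiB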
Interconnect
@@ -131,7 +131,7 @@ Interconnect
- Trainium2
- Improvement factor

* - Inter-chip Interconnect (GB/sec/device)
* - Inter-chip Interconnect (GB/sec/chip)
- 384
- 1280
- 3.3x