Updates to architecture pages
aws-maens authored and ivashkst committed Dec 7, 2024
1 parent dc98b44 commit 619d496
Showing 19 changed files with 110 additions and 110 deletions.
@@ -128,7 +128,7 @@ Available Commands:

- ``llm-training``: Enable the compiler to perform optimizations applicable to large language model (LLM) training runs that shard parameters, gradients, and optimizer states across data-parallel workers. This is equivalent to the previously documented option argument value of ``NEMO``, which will be deprecated in a future release.

- :option:`--logical-nc-config <shard_degree>`: Instructs the compiler to shard the input graph across physical NeuronCore devices. Possible numeric values are {1, 2}. (only available on trn2; Default: ``2``)
- :option:`--logical-nc-config <shard_degree>`: Instructs the compiler to shard the input graph across physical NeuronCore accelerators. Possible numeric values are {1, 2}. (only available on trn2; Default: ``2``)

Valid values:

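As a usage illustration, here is a minimal sketch of how the two flags documented above might be passed to the Neuron compiler from a PyTorch training script, assuming the common ``NEURON_CC_FLAGS`` mechanism; the surrounding script and launch details are assumptions, not part of this commit.

.. code-block:: python

   import os

   # Hedged sketch: forward the flags described above to neuronx-cc through
   # NEURON_CC_FLAGS, which the Neuron framework integration reads when it
   # JIT-compiles the graph. Set this before the first compilation triggers.
   os.environ["NEURON_CC_FLAGS"] = " ".join([
       "--distribution-strategy=llm-training",  # shard params/grads/optimizer state
       "--logical-nc-config=2",                 # trn2 only; valid values are 1 and 2
   ])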
16 changes: 8 additions & 8 deletions general/arch/neuron-hardware/inf1-arch.rst
@@ -4,8 +4,8 @@ Amazon EC2 Inf1 Architecture
==============================

On this page, we provide an architectural overview of the Amazon EC2 Inf1
instance and the corresponding :ref:`Inferentia <inferentia-arch>` NeuronDevices that power
them (:ref:`Inferentia <inferentia-arch>` devices from here on).
instance and the corresponding :ref:`Inferentia <inferentia-arch>` NeuronChips that power
them (:ref:`Inferentia <inferentia-arch>` chips from here on).

.. contents:: Table of Contents
:local:
@@ -16,7 +16,7 @@ them (:ref:`Inferentia <inferentia-arch>` devices from here on).
Inf1 Architecture
-----------------

The EC2 Inf1 instance is powered by 16 :ref:`Inferentia <inferentia-arch>` devices, allowing
The EC2 Inf1 instance is powered by 16 :ref:`Inferentia <inferentia-arch>` chips, allowing
customers to choose between four instance sizes:

.. list-table::
@@ -27,14 +27,14 @@


* - Instance size
- # of Inferentia devices
- # of Inferentia chips
- vCPUs
- Host Memory (GiB)
- FP16/BF16 TFLOPS
- INT8 TOPS
- Device Memory (GiB)
- Device Memory bandwidth (GiB/sec)
- NeuronLink-v1 device-to-device bandwidth (GiB/sec/device)
- Chip Memory (GiB)
- Chip Memory bandwidth (GiB/sec)
- NeuronLink-v1 chip-to-chip bandwidth (GiB/sec/chip)
- EFA bandwidth (Gbps)

* - Inf1.xlarge
@@ -84,7 +84,7 @@



Inf1 offers a direct device-to-device interconnect called NeuronLink-v1,
Inf1 offers a direct chip-to-chip interconnect called NeuronLink-v1,
which enables co-optimizing latency and throughput via the :ref:`Neuron Core Pipeline <neuroncore-pipeline>` technology.

.. image:: /images/inf1-server-arch.png
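To illustrate the Neuron Core Pipeline mentioned above, a hedged sketch using the Inf1 ``torch-neuron`` package follows; the model, input shape, and core count are placeholders, and the ``--neuroncore-pipeline-cores`` flag comes from the Inf1 compiler documentation rather than this commit.

.. code-block:: python

   import torch
   import torch_neuron  # Inf1-only package; provides the trace entry point

   # Placeholder model and example input; real workloads trace their own model.
   model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU()).eval()
   example = torch.rand(1, 128)

   # Ask the compiler to shard the model across the four NeuronCores of one
   # Inferentia chip; larger core counts would span chips over NeuronLink-v1.
   pipelined = torch_neuron.trace(
       model,
       example,
       compiler_args=["--neuroncore-pipeline-cores", "4"],
   )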
14 changes: 7 additions & 7 deletions general/arch/neuron-hardware/inf2-arch.rst
@@ -4,13 +4,13 @@ Amazon EC2 Inf2 Architecture
=============================

On this page we provide an architectural overview of the Amazon EC2 Inf2
instances and the corresponding Inferentia2 NeuronDevices that power
them (Inferentia2 devices from here on).
instances and the corresponding Inferentia2 NeuronChips that power
them (Inferentia2 chips from here on).

Inf2 Architecture
-----------------

The EC2 Inf2 instance is powered by up to 12 :ref:`Inferentia2 devices <inferentia2-arch>`, and allows
The EC2 Inf2 instance is powered by up to 12 :ref:`Inferentia2 chips <inferentia2-arch>`, and allows
customers to choose between four instance sizes:

.. list-table::
@@ -20,14 +20,14 @@
:align: left

* - Instance size
- # of Inferentia2 devices
- # of Inferentia2 chips
- vCPUs
- Host Memory (GiB)
- FP8/FP16/BF16/TF32 TFLOPS
- FP32 TFLOPS
- Device Memory (GiB)
- Chip Memory (GiB)
- Instance Memory Bandwidth (GiB/sec)
- NeuronLink-v2 device-to-device (GiB/sec/device)
- NeuronLink-v2 chip-to-chip (GiB/sec/chip)

* - Inf2.xlarge
- 1
@@ -73,7 +73,7 @@
Inf2 offers a low-latency, high-bandwidth chip-to-chip interconnect
called NeuronLink-v2, which enables high-performance collective communication operations (e.g., AllReduce and AllGather).

This allows sharding large models across Inferentia2 devices (e.g., via
This allows sharding large models across Inferentia2 chips (e.g., via
Tensor Parallelism), thus optimizing latency and throughput. This
capability is especially useful when deploying Large Generative Models.

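To make the sharding claim concrete, here is a back-of-envelope sketch; the 70B-parameter model size is an assumption chosen for illustration, while the 32 GiB per-chip figure comes from the table above.

.. code-block:: python

   # Does a 70B-parameter model in BF16 fit across the 12 Inferentia2 chips
   # of the largest Inf2 instance when sharded with tensor parallelism?
   params = 70e9
   bytes_per_param = 2      # BF16
   tp_degree = 12           # tensor-parallel degree = number of chips
   chip_memory_gib = 32     # per-chip memory, from the table above

   per_chip_gib = params * bytes_per_param / tp_degree / 2**30
   print(f"{per_chip_gib:.1f} GiB needed of {chip_memory_gib} GiB per chip")
   # ~10.9 GiB, leaving headroom for activations and KV cache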
6 changes: 3 additions & 3 deletions general/arch/neuron-hardware/inferentia.rst
@@ -4,22 +4,22 @@
Inferentia Architecture
-----------------------

At the heart of each Inf1 instance are sixteen Inferentia devices, each with four :ref:`NeuronCore-v1 <neuroncores-v1-arch>`, as depicted
At the heart of each Inf1 instance are sixteen Inferentia chips, each with four :ref:`NeuronCore-v1 <neuroncores-v1-arch>`, as depicted
below:

.. image:: /images/inferentia-neurondevice.png



Each Inferentia device consists of:
Each Inferentia chip consists of:

+---------------+-------------------------------------------+
| Compute | Four |
| | :ref:`NeuronCore-v1 <neuroncores-v1-arch>`|
| | cores, delivering 128 INT8 TOPS and 64 |
| | FP16/BF16 TFLOPS |
+---------------+-------------------------------------------+
| Device Memory | 8GiB of device DRAM memory (for storing |
| Chip Memory | 8GiB of chip DRAM memory (for storing |
| | parameters and intermediate state), with |
| | 50 GiB/sec of bandwidth |
+---------------+-------------------------------------------+
10 changes: 5 additions & 5 deletions general/arch/neuron-hardware/inferentia2.rst
@@ -3,13 +3,13 @@
Inferentia2 Architecture
------------------------

At the heart of each Inf2 instance are up to twelve Inferentia2 devices (each with two :ref:`NeuronCore-v2 <neuroncores-v2-arch>` cores). Inferentia2 is the second
generation AWS purpose-built Machine Learning inference accelerator. The Inferentia2 device architecture is depicted below:
At the heart of each Inf2 instance are up to twelve Inferentia2 chips (each with two :ref:`NeuronCore-v2 <neuroncores-v2-arch>` cores). Inferentia2 is the second
generation AWS purpose-built Machine Learning inference accelerator. The Inferentia2 chip architecture is depicted below:

.. image:: /images/inferentia2.png


Each Inferentia2 device consists of:
Each Inferentia2 chip consists of:

+----------------------------------+----------------------------------+
| Compute | Two :ref:`NeuronCore-v2 |
@@ -18,7 +18,7 @@ Each Inferentia2 device consists of:
| | 190 FP16/BF16/cFP8/TF32 TFLOPS, |
| | and 47.5 FP32 TFLOPS. |
+----------------------------------+----------------------------------+
| Device Memory | 32GiB of high-bandwidth device |
| Chip Memory | 32GiB of high-bandwidth chip |
| | memory (HBM) (for storing model |
| | state), with 820 GiB/sec of |
| | bandwidth. |
Expand All @@ -28,7 +28,7 @@ Each Inferentia2 device consists of:
| | compression/decompression. |
+----------------------------------+----------------------------------+
| NeuronLink | NeuronLink-v2 for |
| | device-to-device interconnect |
| | chip-to-chip interconnect |
| | enables high-performance |
| | collective compute for |
| | co-optimization of latency and |
2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuron-core-v1.rst
@@ -5,7 +5,7 @@ NeuronCore-v1 Architecture
--------------------------

NeuronCore-v1 is the first generation NeuronCore engine, powering
the Inferentia NeuronDevices. Each NeuronCore-v1 is a fully-independent
the Inferentia chips. Each NeuronCore-v1 is a fully-independent
heterogeneous compute-unit, with three main engines (Tensor/Vector/Scalar
Engines), and on-chip software-managed SRAM memory, for
maximizing data locality (compiler managed, for maximum data locality
2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuron-core-v2.rst
@@ -4,7 +4,7 @@ NeuronCore-v2 Architecture
--------------------------

NeuronCore-v2 is the second generation of the NeuronCore engine,
powering the Trainium NeuronDevices. Each NeuronCore-v2 is a
powering the Trainium chips. Each NeuronCore-v2 is a
fully-independent heterogeneous compute-unit, with 4 main engines
(Tensor/Vector/Scalar/GPSIMD Engines), and on-chip
software-managed SRAM memory, for maximizing data locality (compiler
2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuron-core-v3.rst
@@ -3,7 +3,7 @@
NeuronCore-v3 Architecture
--------------------------

NeuronCore-v3 is the third-generation NeuronCore that powers Trainium2 devices. It is a fully-independent heterogeneous compute
NeuronCore-v3 is the third-generation NeuronCore that powers Trainium2 chips. It is a fully-independent heterogeneous compute
unit consisting of 4 main engines: Tensor, Vector, Scalar, and GPSIMD, with on-chip software-managed SRAM memory to maximize data
locality and optimize data prefetch. The following diagram shows a high-level overview of the NeuronCore-V3 architecture.

2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuron-devices.rst
@@ -5,7 +5,7 @@ Amazon EC2 AI Chips Architecture

Amazon EC2 AI Chips (Neuron Devices) are the accelerated machine learning chips (e.g., Inferentia or Trainium) that enable Trn and Inf instances.

For a detailed description of current Neuron Devices:
For a detailed description of current Neuron chips:

* :ref:`trainium2-arch`
* :ref:`trainium-arch`
4 changes: 2 additions & 2 deletions general/arch/neuron-hardware/neuron-instances.rst
@@ -1,7 +1,7 @@
.. _neuroninstances-arch:

Trn and Inf Instances Architecture
==================================
Instance and UltraServer Architecture
=====================================

For a detailed description of Trn Instances:

2 changes: 1 addition & 1 deletion general/arch/neuron-hardware/neuroncores-arch.rst
@@ -3,7 +3,7 @@
AWS NeuronCore Architecture
===========================

NeuronCores are fully-independent heterogeneous compute-units that power Trainium, Trainium2, Inferentia, and Inferentia2 NeuronDevices.
NeuronCores are fully-independent heterogeneous compute-units that power Trainium, Trainium2, Inferentia, and Inferentia2 chips.
For a detailed description of current generation NeuronCore (NeuronCore-v3) hardware engines, see:

* :ref:`neuroncores-v3-arch`
12 changes: 6 additions & 6 deletions general/arch/neuron-hardware/trainium.rst
@@ -4,14 +4,14 @@
Trainium Architecture
----------------------

At the heart of the Trn1 instance are 16 x Trainium devices (each Trainium includes 2 x :ref:`NeuronCore-v2 <neuroncores-v2-arch>`). Trainium is the second
At the heart of the Trn1 instance are 16 x Trainium chips (each Trainium includes 2 x :ref:`NeuronCore-v2 <neuroncores-v2-arch>`). Trainium is the second
generation purpose-built Machine Learning accelerator from AWS. The
Trainium device architecture is depicted below:
Trainium chip architecture is depicted below:

.. image:: /images/trainium-neurondevice.png


Each Trainium device consists of:
Each Trainium chip consists of:

+----------------------------------+----------------------------------+
| Compute | Two :ref:`NeuronCore-v2 |
Expand All @@ -20,7 +20,7 @@ Each Trainium device consists of:
| | 190 FP16/BF16/cFP8/TF32 TFLOPS, |
| | and 47.5 FP32 TFLOPS. |
+----------------------------------+----------------------------------+
| Device Memory | 32 GiB of device memory (for |
| Chip Memory | 32 GiB of chip memory (for |
| | storing model state), with 820 |
| | GiB/sec of bandwidth. |
+----------------------------------+----------------------------------+
@@ -29,11 +29,11 @@ Each Trainium device consists of:
| | compression/decompression. |
+----------------------------------+----------------------------------+
| NeuronLink | NeuronLink-v2 for |
| | device-to-device interconnect |
| | chip-to-chip interconnect |
| | enables efficient scale-out |
| | training, as well as memory |
| | pooling between the different |
| | Trainium devices. |
| | Trainium chips. |
+----------------------------------+----------------------------------+
| Programmability | Trainium supports dynamic shapes |
| | and control flow, via ISA |
40 changes: 20 additions & 20 deletions general/arch/neuron-hardware/trainium2.rst
@@ -4,39 +4,39 @@
Trainium2 Architecture
######################

Trainium2 is the third generation, purpose-built Machine Learning chip from AWS. It powers Amazon EC2 trn2-16.48xlarge instances and
the u-trn2x64 UltraServer. Every Trainium2 device contains eight NeuronCore-V3. Beginning with Trainium2, AWS Neuron adds support for Logical
Trainium2 is the third generation, purpose-built Machine Learning chip from AWS. Every Trainium2 chip contains eight NeuronCore-V3. Beginning with Trainium2, AWS Neuron adds support for Logical
NeuronCore Configuration (LNC), which lets you combine the compute and memory resources of multiple physical NeuronCores into a
single logical NeuronCore. The following diagram shows the architecture overview of a Trainium2 device.
single logical NeuronCore. The following diagram shows the architecture overview of a Trainium2 chip.

.. image:: /images/architecture/Trainium2/trainium2.png
:align: center
:width: 400

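A short arithmetic sketch of what LNC means in practice; the core counts come from this page, and the mapping below is illustrative rather than an API.

.. code-block:: python

   # With LNC=2 (the trn2 default), pairs of physical NeuronCore-v3 are fused,
   # so software addresses half as many, twice-as-large logical cores.
   physical_cores_per_chip = 8      # eight NeuronCore-v3 per Trainium2 chip
   for lnc in (1, 2):               # valid --logical-nc-config values
       logical = physical_cores_per_chip // lnc
       print(f"LNC={lnc}: {logical} logical NeuronCores per chip")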
===========================
Trainium2 device components
Trainium2 chip components
===========================

Each Trainium2 device consists of the following components:
Each Trainium2 chip consists of the following components:

+----------------------------------+-----------------------------------------------------+
| Compute | Eight NeuronCore-v3 that collectively deliver: |
| | |
| | * 1,287 FP8 TFLOPS |
| | * 655 BF16/FP16/TF32 TFLOPS |
| | * 2,551 FP8/FP16/BF16/TF32 sparse TFLOPS |
| | * 1,299 FP8 TFLOPS |
| | * 667 BF16/FP16/TF32 TFLOPS |
| | * 2,563 FP8/FP16/BF16/TF32 sparse TFLOPS |
| | * 181 FP32 TFLOPS |
| | |
+----------------------------------+-----------------------------------------------------+
| Device Memory | 96 GiB of device memory with 2.9 TB/sec of |
| Chip Memory | 96 GiB of chip memory with 2.9 TB/sec of |
| | bandwidth. |
+----------------------------------+-----------------------------------------------------+
| Data Movement | 3.5 TB/sec of DMA bandwidth, with inline |
| | memory compression and decompression. |
+----------------------------------+-----------------------------------------------------+
| NeuronLink | NeuronLink-v3 for device-to-device interconnect |
| | provides 1.28 TB/sec bandwidth per device. It allows|
| NeuronLink | NeuronLink-v3 for chip-to-chip interconnect |
| | provides 1.28 TB/sec bandwidth per chip. It allows |
| | for efficient scale-out training and inference, as |
| | well as memory pooling between Trainium2 devices. |
| | well as memory pooling between Trainium2 chips. |
+----------------------------------+-----------------------------------------------------+
| Programmability | Trainium2 supports dynamic shapes and control flow |
| | via NeuronCore-v3 ISA extensions. Trainium2 also |
@@ -46,14 +46,14 @@ Each Trainium2 device consists of the following components:
| | custom operators via deeply embedded GPSIMD engines.|
+----------------------------------+-----------------------------------------------------+
| Collective communication | 20 CC-Cores orchestrate collective communication |
| | among Trainium2 devices within and across instances.|
| | among Trainium2 chips within and across instances. |
+----------------------------------+-----------------------------------------------------+

==================================
Trainium2 performance improvements
==================================

The following set of tables offers a comparison between Trainium and Trainium2 devices.
The following set of tables offers a comparison between Trainium and Trainium2 chips.

Compute
"""""""
@@ -71,19 +71,19 @@

* - FP8 (TFLOPS)
- 191
- 1287
- 1299
- 6.7x
* - BF16/FP16/TF32 (TFLOPS)
- 191
- 655
- 667
- 3.4x
* - FP32 (TFLOPS)
- 48
- 181
- 3.7x
* - FP8/FP16/BF16/TF32 Sparse (TFLOPS)
- Not applicable
- 2551
- 2563
- Not applicable

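A quick arithmetic check of the quoted improvement factors, using the TFLOPS values from the table above; note that the 6.7x and 3.4x factors match the pre-update figures (1,287 and 655) rather than the updated ones.

.. code-block:: python

   # Recompute the improvement factors from the updated TFLOPS values.
   trainium1 = {"FP8": 191, "BF16/FP16/TF32": 191, "FP32": 48}
   trainium2 = {"FP8": 1299, "BF16/FP16/TF32": 667, "FP32": 181}

   for dtype in trainium1:
       print(f"{dtype}: {trainium2[dtype] / trainium1[dtype]:.1f}x")
   # FP8: 6.8x, BF16/FP16/TF32: 3.5x, FP32: 3.8x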
Memory
@@ -113,8 +113,8 @@ Memory
- 224
- 4.7x
* - Memory Pool Size
- Up to 16 devices
- Up to 64 devices
- Up to 16 chips
- Up to 64 chips
- 4x

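Illustrative arithmetic for the pool sizes above; the per-chip memory figures (32 GiB for Trainium, 96 GiB for Trainium2) come from the component tables earlier on this page.

.. code-block:: python

   # Maximum memory pool implied by the table: chips in the pool times
   # per-chip memory.
   for name, chips, gib in [("Trainium", 16, 32), ("Trainium2", 64, 96)]:
       print(f"{name}: up to {chips * gib / 1024:.1f} TiB pooled")
   # Trainium: 0.5 TiB, Trainium2: 6.0 TiB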
Interconnect
@@ -131,7 +131,7 @@ Interconnect
- Trainium2
- Improvement factor

* - Inter-chip Interconnect (GB/sec/device)
* - Inter-chip Interconnect (GB/sec/chip)
- 384
- 1280
- 3.3x