Modern GPU Architectures

GPU hardware vendors continually refine their microarchitectures and release improved product lineups. Oftentimes the changes in new architectures are incremental or scoped to a particular domain (e.g. ray tracing); however, having a general understanding of how different vendors implement the same concepts is definitely useful. These differences become important when you start to optimize compute shaders.

There is little point in dissecting every architecture in detail here since most vendors publish comprehensive architecture whitepapers and optimization guides. This page will simply summarize the architectures that support DirectX 12, which limits the scope to hardware released around 2012 or later. I'll include an example GPU built on each architecture, its number of shader units, theoretical single-precision throughput (this is what most GPUs specialize in), and memory bandwidth numbers as a frame of reference; however, keep in mind that two GPUs with the same numbers may perform very differently because of architectural differences.

AMD
Intel
- Execution Units (EU)
- Resources
NVIDIA
Resources

AMD

AMD is one of the largest vendors of discrete graphics cards. See this page for a list of AMD GPUs.

Year	Architecture	Example	Example: Shader Units	Example: FP32 Throughput	Example: Bandwidth	Notable features for compute
2012	GCN 1	HD 7990	32 CU (2048 ALUs)	3.89 TFLOPS	288 GB/s
2013	GCN 2	R9 390X	44 CU (2816 ALUs)	5.91 TFLOPS	384 GB/s
2015	GCN 3	R9 Fury	56 CU (3584 ALUs)	7.17 TFLOPS	512 GB/s	FP16 support, GPU preemption
2016	GCN 4	RX 590	36 CU (2304 ALUs)	6.77 TFLOPS	256 GB/s
2017	GCN 5 (Vega)	Radeon VII	60 CU (3840 ALUs)	11.14 TFLOPS	1024 GB/s	2x FP16 per SP (double throughput)
2019	RDNA 1	RX 5700 XT	20 WGP (2560 ALUs)	8.22 TFLOPS	448 GB/s	WGPs double resources per thread group; Wave32/Wave64
2020	RDNA 2	RX 6900 XT	40 WGP (5120 ALUs)	18.69 TFLOPS	512 GB/s	Infinity cache
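As a rough sketch of where the FP32 throughput column comes from: the usual convention is to count one fused multiply-add (2 FLOPs) per ALU per clock. The table doesn't list clock speeds, so the snippet below works backwards and infers the clock each figure implies. The function names are mine, and the one-FMA-per-clock model is an assumption rather than something stated by the vendors here.

```python
# Theoretical FP32 throughput, assuming every ALU retires one fused
# multiply-add (counted as 2 FLOPs) per clock. Clock speeds are NOT listed
# in the table; they are inferred here from the table's own figures.


def fp32_tflops(alus: int, clock_ghz: float) -> float:
    """Peak FP32 TFLOPS for `alus` lanes running at `clock_ghz`."""
    return alus * 2 * clock_ghz / 1000.0


def implied_clock_ghz(alus: int, tflops: float) -> float:
    """Clock (GHz) that a quoted TFLOPS figure implies for `alus` lanes."""
    return tflops * 1000.0 / (alus * 2)


# (ALUs, quoted FP32 TFLOPS) copied from the AMD table above.
AMD_EXAMPLES = {
    "HD 7990": (2048, 3.89),
    "R9 390X": (2816, 5.91),
    "R9 Fury": (3584, 7.17),
    "RX 590": (2304, 6.77),
    "Radeon VII": (3840, 11.14),
    "RX 5700 XT": (2560, 8.22),
    "RX 6900 XT": (5120, 18.69),
}


def main() -> None:
    for name, (alus, tflops) in AMD_EXAMPLES.items():
        clock = implied_clock_ghz(alus, tflops)
        print(f"{name}: {tflops} TFLOPS implies a clock of about {clock:.2f} GHz")


if __name__ == "__main__":
    main()
```

Since the quoted figure depends on which clock (base or boost) was used, treat these numbers as ballpark values only.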

There are currently two main architectures that are relevant to DirectX 12: Graphics Core Next (GCN) and Radeon DNA (RDNA).

Compute Units (CUs)

In AMD's GCN architecture, a shader unit maps to a compute unit (CU). Each CU comprises:

4x 16-wide SIMD units. Instructions can be issued once per 4 cycles on these SIMDs.
64 KiB of Local Data Share (LDS) (thread group shared memory).
A scalar ALU (SALU) separate from the SIMD vector processors.
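The CU layout above can be sanity-checked against the AMD table with a little arithmetic. This is a minimal sketch: the 64-thread wavefront size is an assumption taken from AMD's GCN documentation (it isn't stated in the list above), and the helper names are hypothetical.

```python
from math import ceil

SIMDS_PER_CU = 4      # from the CU description above
LANES_PER_SIMD = 16   # from the CU description above
GCN_WAVE_SIZE = 64    # assumption: GCN wavefronts are 64 threads wide


def alus_per_cu() -> int:
    """Vector ALU lanes in one GCN compute unit."""
    return SIMDS_PER_CU * LANES_PER_SIMD


def cycles_per_wave_instruction(wave_size: int = GCN_WAVE_SIZE,
                                lanes: int = LANES_PER_SIMD) -> int:
    """Cycles a SIMD needs to push one wave-wide instruction through its lanes."""
    return ceil(wave_size / lanes)


# (CUs, ALUs) copied from the GCN rows of the AMD table.
GCN_EXAMPLES = {
    "HD 7990": (32, 2048),
    "R9 390X": (44, 2816),
    "R9 Fury": (56, 3584),
    "RX 590": (36, 2304),
    "Radeon VII": (60, 3840),
}


def table_matches_layout() -> bool:
    """True if every table row equals CU count times lanes per CU."""
    return all(cus * alus_per_cu() == alus for cus, alus in GCN_EXAMPLES.values())


if __name__ == "__main__":
    print("ALUs per CU:", alus_per_cu())
    print("Cycles per wave instruction:", cycles_per_wave_instruction())
    print("Table consistent with layout:", table_matches_layout())
```

Under these assumptions, a 64-wide wave on a 16-wide SIMD takes 4 cycles per instruction, which lines up with the "once per 4 cycles" issue rate.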

Workgroup Processors (WGP)

In AMD's RDNA architecture, a shader unit maps to a workgroup processor (WGP). Each WGP comprises:

2x CUs, each of which now contains 2x 32-wide SIMD units. Instructions can be issued every cycle on these SIMDs.
128 KiB of Local Data Share (LDS) (thread group shared memory).
4x scalar ALUs (SALU); one per SIMD.
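To make the Wave32/Wave64 distinction a bit more concrete, here is a small sketch of how a thread group breaks down into waves on a WGP. The helper names are hypothetical, and the assumption that a Wave64 instruction takes two issue cycles on a 32-wide SIMD comes from AMD's RDNA material rather than from the list above.

```python
from math import ceil

CUS_PER_WGP = 2       # from the WGP description above
SIMDS_PER_CU = 2      # from the WGP description above
LANES_PER_SIMD = 32   # from the WGP description above


def alus_per_wgp() -> int:
    """Vector ALU lanes in one RDNA workgroup processor."""
    return CUS_PER_WGP * SIMDS_PER_CU * LANES_PER_SIMD


def waves_per_thread_group(threads: int, wave_size: int) -> int:
    """Number of waves needed to cover a thread group of `threads` threads."""
    if wave_size not in (32, 64):
        raise ValueError("RDNA supports Wave32 and Wave64 only")
    return ceil(threads / wave_size)


def issue_cycles(wave_size: int) -> int:
    """Assumed cycles to issue one wave-wide instruction on a 32-wide SIMD."""
    if wave_size not in (32, 64):
        raise ValueError("RDNA supports Wave32 and Wave64 only")
    return ceil(wave_size / LANES_PER_SIMD)


# (WGPs, ALUs) copied from the RDNA rows of the AMD table.
RDNA_EXAMPLES = {"RX 5700 XT": (20, 2560), "RX 6900 XT": (40, 5120)}

if __name__ == "__main__":
    threads = 8 * 8 * 1  # hypothetical thread group, e.g. [numthreads(8, 8, 1)]
    for wave_size in (32, 64):
        print(f"Wave{wave_size}: {waves_per_thread_group(threads, wave_size)} wave(s), "
              f"{issue_cycles(wave_size)} issue cycle(s) per instruction")
```

A thread group whose size isn't a multiple of the wave size still occupies a whole number of waves, so the leftover lanes in the last wave sit idle.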

Resources

RDNA Whitepaper
RDNA Architecture
Optimizing for the RDNA Architecture
GCN: Optimizing GPU occupancy and resource usage with large thread groups
An Architectural Deep-Dive into AMD's GCN & RDNA
Vega Instruction Set Architecture

Intel

Intel is well-known for CPUs, and most modern consumer CPUs incorporate integrated graphics processors. They are also starting to produce discrete graphics cards. See this page for a list of Intel GPUs.

Year	Architecture	Example	Example: Shader Units	Example: FP32 Throughput	Example: Bandwidth	Notable features for compute
2013	Gen7	HD Graphics 4600 (Haswell GT2)	20 EU (160 ALUs)	0.35 TFLOPS	25.6 GB/s
2015	Gen8	HD Graphics 5600 (Broadwell GT2)	24 EU (192 ALUs)	0.40 TFLOPS	25.6 GB/s
2018	Gen9	UHD Graphics 630 (Coffee Lake GT2)	24 EU (192 ALUs)	0.40 TFLOPS	42.7 GB/s
2019	Gen11	Iris Plus Graphics (Ice Lake GT2)	64 EU (512 ALUs)	1.08 TFLOPS	59.7 GB/s
2020	Gen12	Intel Xe MAX (DG1)	96 EU (768 ALUs)	2.53 TFLOPS	68 GB/s	Shared memory on subslice (no longer L3 cache)

Execution Units (EU)

With Intel graphics architectures, a shader unit maps to an Execution Unit (EU).
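The Intel table implies a fixed number of FP32 lanes per EU, which the snippet below derives from the table's own numbers. This is only a sketch with hypothetical names. Intel's Gen9 and Gen11 architecture overviews describe each EU as having a pair of SIMD-4 FPUs, which is where a value of 8 would come from; I'm treating that as background rather than something this page states.

```python
# (EUs, ALUs) copied from the Intel table above.
INTEL_EXAMPLES = {
    "Gen7 HD Graphics 4600": (20, 160),
    "Gen8 HD Graphics 5600": (24, 192),
    "Gen9 UHD Graphics 630": (24, 192),
    "Gen11 Iris Plus Graphics": (64, 512),
    "Gen12 Intel Xe MAX": (96, 768),
}


def alus_per_eu(eus: int, alus: int) -> int:
    """FP32 lanes per EU implied by one table row."""
    if eus <= 0 or alus % eus != 0:
        raise ValueError("ALU count is not a whole multiple of the EU count")
    return alus // eus


def implied_lane_counts() -> set:
    """Distinct per-EU lane counts across every row of the table."""
    return {alus_per_eu(eus, alus) for eus, alus in INTEL_EXAMPLES.values()}


if __name__ == "__main__":
    for name, (eus, alus) in INTEL_EXAMPLES.items():
        print(f"{name}: {alus_per_eu(eus, alus)} FP32 lanes per EU")
```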

Below is a diagram of Intel's Gen11 graphics architecture, which illustrates how EUs are collected into groups known as subslices. Refer to the specific generation's architecture doc for specifics on the number of EUs and slices in a given graphics processor.

Resources

Architecture Overview for Intel Processor Graphics Gen9
Architecture Overview for Intel Processor Graphics Gen11
Developer and Optimization Guide for Intel Processor Graphics Gen11

NVIDIA

NVIDIA is one of the largest vendors of discrete graphics cards. See this page for a list of NVIDIA GPUs.

Year	Architecture	Example	Example: Shader Units	Example: FP32 Throughput	Example: Bandwidth	Notable features for compute
2010	Fermi	GTX 580	16 SM (512 ALUs)	1.58 TFLOPS	192 GB/s
2012	Kepler	GTX 780	12 SM (2304 ALUs)	3.98 TFLOPS	288 GB/s
2014	Maxwell	GTX 980	16 SM (2048 ALUs)	4.62 TFLOPS	224 GB/s
2016	Pascal	GTX 1080	20 SM (2560 ALUs)	8.23 TFLOPS	320 GB/s	FP16 support (1:64 throughput of FP32 for consumer cards)
2018	Turing	RTX 2080S	48 SM (3072 ALUs)	10.14 TFLOPS	496 GB/s	Tensor Cores, FP16 at 2:1 throughput relative to FP32 for consumer cards, concurrent INT32/FP32 math
2020	Ampere	RTX 3080	68 SM (8704 ALUs)	25.07 TFLOPS	760 GB/s	bfloat16, TensorFloat-32 support
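One way to use the throughput and bandwidth columns together is to compute how many FLOPs the GPU can perform per byte of memory traffic. The sketch below does that for the NVIDIA table. It's a simplified, roofline-style estimate that ignores caches entirely, and the function names are hypothetical.

```python
def flops_per_byte(tflops: float, bandwidth_gb_s: float) -> float:
    """Peak FP32 FLOPs available per byte of memory bandwidth."""
    return (tflops * 1e12) / (bandwidth_gb_s * 1e9)


def flops_per_fp32_load(tflops: float, bandwidth_gb_s: float) -> float:
    """Same ratio expressed per 4-byte FP32 value fetched from memory."""
    return flops_per_byte(tflops, bandwidth_gb_s) * 4


# (FP32 TFLOPS, bandwidth in GB/s) copied from the NVIDIA table above.
NVIDIA_EXAMPLES = {
    "GTX 580": (1.58, 192),
    "GTX 780": (3.98, 288),
    "GTX 980": (4.62, 224),
    "GTX 1080": (8.23, 320),
    "RTX 2080S": (10.14, 496),
    "RTX 3080": (25.07, 760),
}

if __name__ == "__main__":
    for name, (tflops, bandwidth) in NVIDIA_EXAMPLES.items():
        print(f"{name}: {flops_per_byte(tflops, bandwidth):.1f} FLOPs per byte")
```

Roughly speaking, a compute shader that does less math per byte fetched than this ratio is more likely to be limited by memory bandwidth than by ALU throughput.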

Streaming Multiprocessors (SM)

In all of NVIDIA's recent architectures, a shader unit maps to a streaming multiprocessor (SM). Unlike AMD compute units, however, the exact configuration of an SM changes every generation. You should refer to each architecture's design document for details. Later generations also incorporate new types of cores (tensor cores) that are complicated to summarize neatly in a table.

One thing that can be cleanly summarized is the number of dedicated single-precision FPUs in each SM by generation:

Fermi	Kepler	Maxwell	Pascal	Turing	Ampere
32	192	128	128	64	64
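Multiplying these per-SM counts by the SM counts in the table above reproduces the ALU column for every generation except Ampere. The sketch below shows the gap. My understanding from the Ampere GA10x whitepaper (not stated on this page) is that each SM also has a second set of 64 lanes that can run either FP32 or INT32, and those are included in the advertised ALU count. Names here are hypothetical.

```python
# Dedicated FP32 units per SM, copied from the small table above.
DEDICATED_FP32_PER_SM = {
    "Fermi": 32, "Kepler": 192, "Maxwell": 128,
    "Pascal": 128, "Turing": 64, "Ampere": 64,
}

# (SMs, ALUs) copied from the example GPUs in the NVIDIA table.
TABLE_EXAMPLES = {
    "Fermi": (16, 512), "Kepler": (12, 2304), "Maxwell": (16, 2048),
    "Pascal": (20, 2560), "Turing": (48, 3072), "Ampere": (68, 8704),
}


def fp32_lanes_per_sm(arch: str) -> int:
    """FP32-capable lanes per SM implied by the example GPU's ALU count."""
    sms, alus = TABLE_EXAMPLES[arch]
    return alus // sms


def lanes_beyond_dedicated(arch: str) -> int:
    """Lanes counted in the ALU column that are not dedicated FP32 units."""
    return fp32_lanes_per_sm(arch) - DEDICATED_FP32_PER_SM[arch]


if __name__ == "__main__":
    for arch in TABLE_EXAMPLES:
        print(f"{arch}: {fp32_lanes_per_sm(arch)} lanes per SM, "
              f"{lanes_beyond_dedicated(arch)} beyond the dedicated FP32 units")
```

This is one reason the peak Ampere FP32 figure is only reachable when a shader isn't also issuing integer work on the shared lanes.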

Below is a diagram of an SM from the Ampere architecture (specifically GA10x GPUs, which are found in the consumer-grade graphics cards).

Tensor Cores

Tensor cores are a unique unit in NVIDIA's latest architectures that accelerate specific types of matrix multiplications. These units offer exceptional throughput for certain AI applications, but unfortunately they're not accessible through HLSL at the moment; the only way we can leverage this hardware with DirectX is through metacommands.

Resources

Ampere Architecture Whitepaper
Turing Architecture Whitepaper
Programming Tensor Cores

Resources

The GPU Database. Amazing site for browsing AMD and NVIDIA graphics cards by architecture. Includes per-architecture diagrams of shader units (compute units), performance characteristics, references to ISA documentation, and more.
GPU Specs Database. Lists basic stats on GPUs from all vendors.
