Valhall is the 4th generation of the Mali GPU architecture.
Content:
- Mali-G57
- Mali-G77
1.1. Arm's New Mali-G77 & Valhall GPU Architecture
1.2. Arm Mali-G77 Performance Counters Reference Guide, [backup]
1.3. Vulkan features for Mali-G57
1.4. Mali-G57 Benchmarks
-
2 work queues: non-fragment, fragment. [1.2]
-
fragment density map
-
One subgroup can fill multiple triangles, but only with the same instanceIndex. [1.4]
-
New AFBC formats: [1.4]
- B10G11R11_UFLOAT_PACK32
- RG16F
- RG16_UNorm
-
core config [4]:
- 2 ALU
- 128 fp16/cy (64 per ALU)
- 64 fp32/cy (32 per ALU)
- 2 frag/cy
- 2 pix/cy
- 4 tex/cy
- Mali-G68
- Mali-G78
- Google Tensor (Mali-G78 MP20)
2.1. Arm Announces The Mali-G78 GPU
2.2. Mali-G78 Performance Counters Reference Guide, [backup]
2.3. Vulkan features for Mali-G78
2.4. Reverse-engineering the Mali G78
-
core config [4]:
- 2 ALU
- 128 fp16/cy (64 per ALU)
- 64 fp32/cy (32 per ALU)
- 2 frag/cy
- 2 pix/cy
- 4 tex/cy
-
Mali-G78 MP20 VK_ARM_shader_core_builtins, VK_ARM_shader_core_properties [2.3]:
- shaderCoreCount: 20
- shaderWarpsPerCore: 32 -- maximum number of simultaneously executing warps on a shader core
- fmaRate: 32 -- maximum number of single-precision fused multiply-add operations per clock per shader core.
- pixelRate: 2 -- maximum number of pixels output per clock per shader core.
- texelRate: 4 -- maximum number of texels per clock per shader core.
- Mali-G310
- Mali-G510
- Mali-G610
- Mali-G710
- Rockchip RK3588 (Mali-G610 MC4)
- Google Tensor G2 (Mali-G710 MP7)
3.1. Arm Mali-G610 Performance Counters Reference Guide, [backup]
3.2. Arm Announces New Mali-G710, G610, G510 & G310 Mobile GPU Families
3.3. Mali-G510
3.4. Vulkan features for Mali-G710, G710 MC10, G610, G610 MC6
-
G610, G710 L2 cache: configurable from 512 KB to 2 MB, as 2 or 4 slices of 256 KB or 512 KB.
-
Scalability: 7 to 16 cores
-
The Command Stream Frontend (CSF) replaces the Job Manager.
-
3 hardware work queues: compute, vertex, fragment.
-
Arm Fixed Rate Compression (AFRC), 4x4 block lossy compression for textures and framebuffer. [3.3]
-
All RGBA16 formats are compatible with AFBC. [3]
-
128 GFlops per core at 1000MHz
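The per-core figure follows from the FMA rate, since one fused multiply-add counts as two floating-point operations. A minimal sketch of that arithmetic, assuming the fmaRate of 64 listed for the Mali-G710 (the function name is illustrative, not part of any Arm tool):

```python
def peak_fp32_gflops(fma_per_clock: int, clock_mhz: float, cores: int = 1) -> float:
    """Theoretical peak FP32 throughput in GFLOPS.

    Each fused multiply-add is counted as two floating-point operations.
    """
    flops_per_clock = 2 * fma_per_clock * cores
    return flops_per_clock * clock_mhz / 1000.0


# One G710 core (64 FMA per clock) at 1000 MHz.
print(peak_fp32_gflops(64, 1000))  # 128.0
```

This is a theoretical peak only; sustained throughput depends on the workload.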
-
G710 core config [4]:
- 4 ALU
- 256 fp16/cy (64 per ALU)
- 128 fp32/cy (32 per ALU)
- 4 frag/cy
- 4 pix/cy
- 8 tex/cy
-
Mali-G710 MP7 VK_ARM_shader_core_builtins, VK_ARM_shader_core_properties [3.4]:
- shaderCoreCount: 7
- shaderWarpsPerCore: 64 -- maximum number of simultaneously executing warps on a shader core
- fmaRate: 64 -- maximum number of single-precision fused multiply-add operations per clock per shader core.
- pixelRate: 4 -- maximum number of pixels output per clock per shader core.
- texelRate: 8 -- maximum number of texels per clock per shader core.
- Mali-G615
- Mali-G715
- Immortalis-G715
- Google Tensor G3, G4 (Mali-G715 MP7)
4.1. Arm Mali-G615 Performance Counters Reference Guide, [backup]
4.2. The Valhall Shader Core, [backup]
4.3. Vulkan features for Mali-G715, Mali-G615 MC6
-
The FMA and CVT pipelines are 16-wide; the SFU pipeline is 4-wide and runs at one quarter of the throughput of the other two. [4.2]
-
Valhall maintains native support for int8, int16, and fp16 data types. These data types can be packed using SIMD instructions to fill each 32-bit data processing lane. This arrangement maintains the power efficiency and performance that is provided by the types that are narrower than 32-bits. [4.2]
-
A single 16-wide warp maths unit can therefore perform 32x fp16/int16 operations per clock cycle, or 64x int8 operations per clock cycle. [4.2]
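The packed-SIMD numbers above can be reproduced with a small calculation: each of the 16 lanes is 32 bits wide, so the number of elements per lane is 32 divided by the element width. The helper below is an illustrative sketch, not an Arm API:

```python
LANE_BITS = 32   # width of one data processing lane
WARP_WIDTH = 16  # lanes in the FMA and CVT pipelines


def ops_per_clock(element_bits: int, warp_width: int = WARP_WIDTH) -> int:
    """Operations per clock when elements are packed into 32-bit lanes."""
    if element_bits <= 0 or LANE_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the 32-bit lane")
    return warp_width * (LANE_BITS // element_bits)


for name, bits in [("fp32", 32), ("fp16/int16", 16), ("int8", 8)]:
    print(name, ops_per_clock(bits))
```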
-
Fragment shading rate (VRS)
-
LSU: [4.2]
- 64-byte cache line.
- 16KB L1 data cache per core
- Warp unit accesses are optimized to reduce unique cache access requests. Data can be returned in a single cycle if all threads access data inside the same cache line.
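To judge whether a warp's memory access pattern fits the single-cache-line case, one can count the distinct 64-byte lines touched by the 16 threads. This is a rough model that assumes every access is small enough to sit inside one line; the function name and the example addresses are hypothetical:

```python
CACHE_LINE_BYTES = 64
WARP_SIZE = 16


def unique_cache_lines(addresses, line_bytes=CACHE_LINE_BYTES):
    """Number of distinct cache lines touched by a list of byte addresses."""
    return len({addr // line_bytes for addr in addresses})


base = 0x1000  # hypothetical 64-byte-aligned buffer address

# Each thread reads one consecutive 4-byte value: 16 * 4 = 64 bytes, one line.
coalesced = [base + 4 * lane for lane in range(WARP_SIZE)]
# Each thread reads from its own 64-byte row: one line per thread.
strided = [base + 64 * lane for lane in range(WARP_SIZE)]

print(unique_cache_lines(coalesced))  # 1
print(unique_cache_lines(strided))    # 16
```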
-
The varying unit can interpolate 32 bits for every thread in a warp. [4.2]
-
G715 core config [4]:
- 4 ALU
- 512 fp16/cy (128 per ALU)
- 256 fp32/cy (64 per ALU)
- 4 frag/cy
- 4 pix/cy
- 8 tex/cy
-
Mali-G715 MP7 VK_ARM_shader_core_builtins, VK_ARM_shader_core_properties [4.3]:
- shaderCoreCount: 7
- shaderWarpsPerCore: 64 -- maximum number of simultaneously executing warps on a shader core
- fmaRate: 128 -- maximum number of single-precision fused multiply-add operations per clock per shader core.
- pixelRate: 4 -- maximum number of pixels output per clock per shader core.
- texelRate: 8 -- maximum number of texels per clock per shader core.
-
Mali-G615 MC6 VK_ARM_shader_core_builtins, VK_ARM_shader_core_properties [4.3]:
- shaderCoreCount: 6
- shaderWarpsPerCore: 64 -- maximum number of simultaneously executing warps on a shader core
- fmaRate: 128 -- maximum number of single-precision fused multiply-add operations per clock per shader core.
- pixelRate: 4 -- maximum number of pixels output per clock per shader core.
- texelRate: 8 -- maximum number of texels per clock per shader core.
- Instruction Set Architecture, [backup]
- Mesa driver details
- Arm GPU Best Practices Developer Guide, [backup]
- Arm GPU Datasheet, [backup]
- PanCSF: A new DRM driver for Mali CSF-based GPUs
- Writing an open source GPU driver - without the hardware
- reverse-engineered Mali Valhall ISA
-
Scalar ISA.
-
16 threads per warp. [1.2, 4.2]
-
Fragment Task with 32x32 pixels region. [1.2]
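As a rough illustration of the 32x32 region size, the number of fragment tasks needed to cover a framebuffer can be estimated as below. The sketch assumes that regions are laid out on a grid starting at the framebuffer origin and that partially covered regions still count as whole tasks; neither detail is stated in the source:

```python
TASK_SIZE = 32  # pixels per side of one fragment task region


def fragment_task_count(width: int, height: int, task_size: int = TASK_SIZE) -> int:
    """Number of task_size x task_size regions needed to cover the framebuffer."""
    tasks_x = -(-width // task_size)   # ceiling division
    tasks_y = -(-height // task_size)
    return tasks_x * tasks_y


print(fragment_task_count(1920, 1080))  # 60 * 34 = 2040
```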
-
MSAA: 4x, 8x, 16x
-
AFBC (v1.3) with 4x4 block.
-
Transaction Elimination with 16x16 pixel block size.
-
All Valhall GPU cores implement a 4 texel-per-clock and 2 pixel-per-clock shader core.
-
Mali Valhall GPU shader cores allow variable numbers of threads to be created, depending on the number of work registers that are used by the in-flight shader programs.
- 0-32 registers - Maximum thread capacity
- 33-64 registers - Half thread capacity
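The register thresholds can be turned into a quick occupancy estimate. The sketch below assumes that the shaderWarpsPerCore value reported by VK_ARM_shader_core_properties corresponds to the maximum thread capacity; that mapping is an assumption, and the function names are illustrative:

```python
def thread_capacity_fraction(work_registers: int) -> float:
    """Fraction of the maximum thread capacity for a given work register count."""
    if not 0 <= work_registers <= 64:
        raise ValueError("work register count must be in the range 0-64")
    return 1.0 if work_registers <= 32 else 0.5


def max_resident_warps(work_registers: int, warps_per_core: int) -> int:
    """Estimated warps in flight per core (assumes warps_per_core is the full capacity)."""
    return int(warps_per_core * thread_capacity_fraction(work_registers))


# Mali-G710 reports shaderWarpsPerCore = 64.
print(max_resident_warps(32, 64))  # 64
print(max_resident_warps(33, 64))  # 32
```

A shader that needs 33 registers therefore halves the number of warps available to hide latency, so trimming it to 32 or fewer may be worthwhile.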
-
A Valhall core can perform 32 FP32 FMAs, read 4 bilinear filtered texture samples, blend 2 fragments, and write 2 pixels per clock. [4.2]
-
Each Processing Engine (PE) executes the programmable shader instructions. [4.2]
-
Each PE includes 3 arithmetic processing pipelines: [4.2]
- FMA pipeline, which is used for complex maths operations
- CVT pipeline, which is used for simple maths operations
- SFU pipeline, which is used for special functions
-
Blending is hardware-accelerated for FP16 and R11G11B10 formats. Simple blends of those formats are accelerated, but advanced blends (logic/min/max) are not. [3]
-
AFBC is compatible with any format of 32 bits or smaller. [3]
-
Branching:
- Divergence of threads within a warp carries a performance penalty. Divergence is handled in hardware, but the compiler must insert some hints to ensure divergence is handled correctly. [7]
- Indirect access to attributes and texture handles must not be divergent. If divergent access is required, the compiler must lower to an if-chain predicated on lane ID. [7]
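The if-chain lowering can be pictured with a small simulation of one warp. This is only a conceptual sketch in Python, not compiler output: `fetch_uniform` is a hypothetical stand-in for an access that needs a single index shared by all active lanes, and the chain activates one lane at a time so that requirement holds trivially.

```python
WARP_SIZE = 16


def fetch_uniform(resources, index, active_lanes, results):
    """Stand-in for an access whose index must be the same for all active lanes."""
    for lane in active_lanes:
        results[lane] = resources[index]


def fetch_divergent(resources, lane_indices):
    """Emulate a divergent indirect access with an if-chain over lane IDs."""
    lanes = range(len(lane_indices))
    results = [None] * len(lane_indices)
    for chain_lane in lanes:
        # Branch "if lane_id == chain_lane": exactly one lane stays active,
        # so the index is uniform across the active part of the warp.
        active = [lane for lane in lanes if lane == chain_lane]
        fetch_uniform(resources, lane_indices[chain_lane], active, results)
    return results


textures = ["tex0", "tex1", "tex2"]                    # hypothetical handles
indices = [lane % 3 for lane in range(WARP_SIZE)]      # divergent per-lane index
print(fetch_divergent(textures, indices))
```

The chain runs one step per lane, which illustrates why divergent indexing of this kind is costly.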
-
Texturing:
- Usually, helper threads do not need to execute texture instructions once the level-of-detail has been selected. Skipping texturing on helper threads can save memory bandwidth. [7]
- Texture projection is not supported. [7]
-
Uniform/constant restrictions: [7]
- An instruction may access no more than a single 64-bit uniform slot.
- An instruction may access no more than 64 bits of combined uniforms and constants.
- An instruction may only access uniforms in the default immediate mode.
- An instruction may access no more than a single special immediate (e.g. lane_id).
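A toy validator shows how three of these rules could be checked on one instruction. The operand model is entirely hypothetical: each operand is a `(kind, value)` tuple, uniforms are addressed by 32-bit word index (so the 64-bit slot is `index // 2`), constants are 32-bit words, and reading the same word twice counts once. The immediate-mode rule is not modelled.

```python
def check_operands(operands):
    """Return a list of rule violations for one instruction's source operands."""
    uniform_slots = set()  # 64-bit uniform slots touched
    words = set()          # distinct 32-bit uniform/constant words
    specials = set()       # special immediates such as lane_id

    for kind, value in operands:
        if kind == "uniform":
            uniform_slots.add(value // 2)
            words.add((kind, value))
        elif kind == "constant":
            words.add((kind, value))
        elif kind == "special":
            specials.add(value)
        # Any other kind (for example "register") is unrestricted here.

    errors = []
    if len(uniform_slots) > 1:
        errors.append("more than one 64-bit uniform slot")
    if 32 * len(words) > 64:
        errors.append("more than 64 bits of combined uniforms and constants")
    if len(specials) > 1:
        errors.append("more than one special immediate")
    return errors


# Two halves of the same uniform slot plus a register: allowed.
print(check_operands([("uniform", 4), ("uniform", 5), ("register", 0)]))  # []
# Uniform words from two different slots: rejected.
print(check_operands([("uniform", 4), ("uniform", 6)]))
```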