ARM-Mali-Valhall.md

Valhall is the 4th generation of the Mali GPU architecture.

Valhall Gen1

Examples

  • Mali-G57
  • Mali-G77

References

1.1. Arm's New Mali-G77 & Valhall GPU Architecture
1.2. Arm Mali-G77 Performance Counters Reference Guide, [backup]
1.3. Vulkan features for Mali-G57
1.4. Mali-G57 Benchmarks

Notes

  • 2 work queues: non-fragment, fragment. [1.2]

  • fragment density map

  • One subgroup can fill multiple triangles, but only with the same instanceIndex. [1.4]

  • AFBC new formats: [1.4]

    • B10G11R11_UFLOAT_PACK32
    • RG16F
    • RG16_UNorm
  • core config [4]:

    • 2 ALU
    • 128 fp16/cy (64 per ALU)
    • 64 fp32/cy (32 per ALU)
    • 2 frag/cy
    • 2 pix/cy
    • 4 tex/cy

Valhall Gen2

Examples

  • Mali-G68
  • Mali-G78

SoC

  • Google Tensor (Mali-G78 MP20)

References

2.1. Arm Announces The Mali-G78 GPU
2.2. Mali-G78 Performance Counters Reference Guide, [backup]
2.3. Vulkan features for Mali-G78
2.4. Reverse-engineering the Mali G78

Notes

  • core config [4]:

    • 2 ALU
    • 128 fp16/cy (64 per ALU)
    • 64 fp32/cy (32 per ALU)
    • 2 frag/cy
    • 2 pix/cy
    • 4 tex/cy
  • Mali-G78 MP20 VK_ARM_shader_core_builtins, VK_ARM_shader_core_properties [2.3]:

    • shaderCoreCount: 20
    • shaderWarpsPerCore: 32 -- maximum number of simultaneously executing warps on a shader core
    • fmaRate: 32 -- maximum number of single-precision fused multiply-add operations per clock per shader core.
    • pixelRate: 2 -- maximum number of pixels output per clock per shader core.
    • texelRate: 4 -- maximum number of texels per clock per shader core.

Valhall Gen3

Examples

  • Mali-G310
  • Mali-G510
  • Mali-G610
  • Mali-G710

SoC

  • Rockchip RK3588 (Mali-G610 MC4)
  • Google Tensor G2 (Mali-G710 MP7)

References

3.1. Arm Mali-G610 Performance Counters Reference Guide, [backup]
3.2. Arm Announces New Mali-G710, G610, G510 & G310 Mobile GPU Families
3.3. Mali-G510
3.4. Vulkan features for Mali-G710, G710 MC10, G610, G610 MC6

Notes

  • G610, G710 L2 cache: configurable from 512 KB to 2 MB, in 2 or 4 slices of 256 KB or 512 KB each.

  • Scalability: 7 to 16 cores

  • The Command Stream Frontend (CSF) replaces the Job Manager.

  • 3 hardware work queues: compute, vertex, fragment.

  • Arm Fixed Rate Compression (AFRC), 4x4 block lossy compression for textures and framebuffer. [3.3]

  • All RGBA16 formats are compatible with AFBC. [3]

  • 128 GFLOPS (FP32) per core at 1000 MHz.

  • G710 core config [4]:

    • 4 ALU
    • 256 fp16/cy (64 per ALU)
    • 128 fp32/cy (32 per ALU)
    • 4 frag/cy
    • 4 pix/cy
    • 8 tex/cy
  • Mali-G710 MP7 VK_ARM_shader_core_builtins, VK_ARM_shader_core_properties [3.4]:

    • shaderCoreCount: 7
    • shaderWarpsPerCore: 64 -- maximum number of simultaneously executing warps on a shader core
    • fmaRate: 64 -- maximum number of single-precision fused multiply-add operations per clock per shader core.
    • pixelRate: 4 -- maximum number of pixels output per clock per shader core.
    • texelRate: 8 -- maximum number of texels per clock per shader core.
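
The throughput figures above are related: `fmaRate` counts FP32 fused multiply-adds per clock per core, and an FMA is conventionally counted as two floating-point operations. The following is a minimal sketch of that arithmetic, not a measurement. The function name is made up for illustration, and the 1000 MHz clock is only the reference value used in the note above; real devices run at other frequencies.

```python
def peak_fp32_gflops(fma_rate: int, clock_mhz: float, core_count: int = 1) -> float:
    """Theoretical peak FP32 throughput in GFLOPS.

    fma_rate   -- FP32 FMAs per clock per shader core (fmaRate)
    clock_mhz  -- shader core clock in MHz (assumed, device-specific)
    core_count -- number of shader cores (shaderCoreCount)
    """
    flops_per_fma = 2  # one multiply plus one add
    return fma_rate * flops_per_fma * clock_mhz * core_count / 1000.0


# Mali-G710: fmaRate = 64 gives 128 GFLOPS per core at 1000 MHz
print(peak_fp32_gflops(64, 1000.0))     # 128.0
# Mali-G710 MP7 at the same illustrative clock
print(peak_fp32_gflops(64, 1000.0, 7))  # 896.0
```

These are upper bounds; sustained throughput depends on the shader mix and memory behaviour.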

Valhall Gen4

Examples

  • Mali-G615
  • Mali-G715
  • Immortalis-G715

SoC

  • Google Tensor G3, G4 (Mali-G715 MP7)

References

4.1. Arm Mali-G615 Performance Counters Reference Guide, [backup]
4.2. The Valhall Shader Core, [backup]
4.3. Vulkan features for Mali-G715, Mali-G615 MC6

Notes

  • The FMA and CVT pipelines are 16-wide; the SFU pipeline is 4-wide and runs at one quarter of the throughput of the other two. [4.2]

  • Valhall maintains native support for int8, int16, and fp16 data types. These data types can be packed using SIMD instructions to fill each 32-bit data processing lane. This arrangement maintains the power efficiency and performance that is provided by the types that are narrower than 32-bits. [4.2]

  • A single 16-wide warp maths unit can therefore perform 32x fp16/int16 operations per clock cycle, or 64x int8 operations per clock cycle. [4.2]

  • Fragment shading rate, i.e. variable rate shading (VRS).

  • LSU: [4.2]

    • 64-byte cache line.
    • 16KB L1 data cache per core
    • Warp unit accesses are optimized to reduce unique cache access requests. Data can be returned in a single cycle if all threads access data inside the same cache line.
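
To illustrate the LSU note on cache lines, the sketch below counts how many distinct 64-byte lines a single warp-wide access touches. It only models the address arithmetic and says nothing about actual hardware timing. The function name, base address and access patterns are assumptions chosen for the example.

```python
CACHE_LINE_BYTES = 64  # cache line size from the LSU notes
WARP_SIZE = 16         # threads per warp on Valhall


def unique_cache_lines(addresses, access_bytes=4):
    """Return the set of cache-line indices touched by one warp-wide access.

    addresses    -- per-thread byte addresses (one per active thread)
    access_bytes -- size of each thread's access in bytes
    """
    lines = set()
    for addr in addresses:
        first = addr // CACHE_LINE_BYTES
        last = (addr + access_bytes - 1) // CACHE_LINE_BYTES
        lines.update(range(first, last + 1))
    return lines


# 16 threads reading consecutive 4-byte values from an aligned base: one line
coalesced = [0x1000 + 4 * t for t in range(WARP_SIZE)]
# 16 threads reading with a 64-byte stride: every thread hits its own line
strided = [0x1000 + 64 * t for t in range(WARP_SIZE)]

print(len(unique_cache_lines(coalesced)))  # 1
print(len(unique_cache_lines(strided)))    # 16
```

Only the first pattern satisfies the single-cache-line condition described above.
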
  • The varying unit can interpolate 32 bits for every thread in a warp. [4.2]

  • G715 core config [4]:

    • 4 ALU
    • 512 fp16/cy (128 per ALU)
    • 256 fp32/cy (64 per ALU)
    • 4 frag/cy
    • 4 pix/cy
    • 8 tex/cy
  • Mali-G715 MP7 VK_ARM_shader_core_builtins, VK_ARM_shader_core_properties [4.3]:

    • shaderCoreCount: 7
    • shaderWarpsPerCore: 64 -- maximum number of simultaneously executing warps on a shader core
    • fmaRate: 128 -- maximum number of single-precision fused multiply-add operations per clock per shader core.
    • pixelRate: 4 -- maximum number of pixels output per clock per shader core.
    • texelRate: 8 -- maximum number of texels per clock per shader core.
  • Mali-G615 MC6 VK_ARM_shader_core_builtins, VK_ARM_shader_core_properties [4.3]:

    • shaderCoreCount: 6
    • shaderWarpsPerCore: 64 -- maximum number of simultaneously executing warps on a shader core
    • fmaRate: 128 -- maximum number of single-precision fused multiply-add operations per clock per shader core.
    • pixelRate: 4 -- maximum number of pixels output per clock per shader core.
    • texelRate: 8 -- maximum number of texels per clock per shader core.
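
The SIMD packing note in this section (32 fp16/int16 or 64 int8 operations per clock for a 16-wide unit) follows from how many values fit in one 32-bit lane. A minimal sketch of that calculation, with a hypothetical helper name:

```python
LANE_BITS = 32   # width of one data processing lane
WARP_WIDTH = 16  # lanes in the FMA and CVT pipelines


def ops_per_clock(type_bits: int, warp_width: int = WARP_WIDTH) -> int:
    """Operations per clock for one 16-wide unit with SIMD-packed narrow types."""
    if type_bits <= 0 or LANE_BITS % type_bits != 0:
        raise ValueError("type width must evenly divide a 32-bit lane")
    return warp_width * (LANE_BITS // type_bits)


for name, bits in [("fp32", 32), ("fp16/int16", 16), ("int8", 8)]:
    print(name, ops_per_clock(bits))
# fp32 16
# fp16/int16 32
# int8 64
```

This is the per-unit rate only; the per-core figures in the core config lists also depend on how many such units each generation has.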

Valhall (all gens)

References

  1. Instruction Set Architecture, [backup]
  2. Mesa driver details
  3. Arm GPU Best Practices Developer Guide, [backup]
  4. Arm GPU Datasheet, [backup]
  5. PanCSF: A new DRM driver for Mali CSF-based GPUs
  6. Writing an open source GPU driver - without the hardware
  7. reverse-engineered Mali Valhall ISA

Notes

  • scalar

  • 16 threads per warp. [1.2, 4.2]

  • Fragment Task with 32x32 pixels region. [1.2]

  • MSAA: 4x, 8x, 16x

  • AFBC (v1.3) with 4x4 block.

  • Transaction Elimination with 16x16 pixel block size.

  • Valhall Gen1 and Gen2 shader cores process 4 texels and 2 pixels per clock; the G710 and G715 cores double this to 8 texels and 4 pixels per clock. [4]

  • Mali Valhall GPU shader cores allow variable numbers of threads to be created, depending on the number of work registers that are used by the in-flight shader programs.

    • 0-32 registers - Maximum thread capacity
    • 33-64 registers - Half thread capacity
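
The register thresholds above can be expressed as a small lookup. This is a sketch under stated assumptions: the helper name is hypothetical, and the 1024-thread maximum in the usage lines is an illustrative number, not a figure taken from these notes.

```python
def thread_capacity(work_registers: int, max_threads: int) -> int:
    """Threads a shader core can keep resident for a given work register count.

    work_registers -- work registers used by the shader program
    max_threads    -- the core's maximum thread capacity (device-specific)
    """
    if not 0 <= work_registers <= 64:
        raise ValueError("register count outside the 0-64 range listed in the notes")
    if work_registers <= 32:
        return max_threads       # maximum thread capacity
    return max_threads // 2      # half thread capacity


print(thread_capacity(24, 1024))  # 1024
print(thread_capacity(40, 1024))  # 512
```

A shader that goes from 32 to 33 work registers therefore halves the number of threads available to hide latency.
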
  • A Valhall core can perform 32 FP32 FMAs, read 4 bilinear filtered texture samples, blend 2 fragments, and write 2 pixels per clock. [4.2]

  • Each Processing Engine (PE) executes the programmable shader instructions. [4.2]

  • Each PE includes 3 arithmetic processing pipelines: [4.2]

    • FMA pipeline which is used for complex maths operations
    • CVT pipeline which is used for simple maths operations
    • SFU pipeline which is used for special functions
  • Has accelerated hardware blending for FP16 and R11G11B10 formats. Simple blends of those formats are accelerated, but advanced blends (logic/min/max) are not. [3]

  • AFBC compatible with any 32 bit or smaller formats. [3]

  • Branching:

    • Divergence of threads within a warp carries a performance penalty. Divergence is handled in hardware, but the compiler must insert some hints to ensure divergence is handled correctly. [7]
    • Indirect access to attributes and texture handles must not be divergent. If divergent access is required, the compiler must lower to an if-chain predicated on lane ID. [7]
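
The if-chain lowering mentioned above can be emulated on the CPU to show why it makes the access uniform. The sketch below is a conceptual model only, not compiler output: the function, the `resources` table and the texture names are all hypothetical. Each step of the chain corresponds to one `lane_id == N` predicate, so only one lane is active and the index is trivially uniform across the active lanes.

```python
WARP_SIZE = 16  # threads per warp on Valhall


def lowered_indirect_access(indices, resources):
    """Emulate an if-chain predicated on lane ID for a divergent index.

    indices   -- per-lane index into `resources` (may differ between lanes)
    resources -- hypothetical table standing in for attributes or texture handles
    Returns the per-lane results and the number of chain steps executed.
    """
    results = [None] * len(indices)
    steps = 0
    for chain_lane in range(len(indices)):
        # Predicate `lane_id == chain_lane`: exactly one lane is active.
        active = [lane for lane in range(len(indices)) if lane == chain_lane]
        # Inside the branch the index is uniform across the active lanes.
        active_indices = {indices[lane] for lane in active}
        assert len(active_indices) == 1
        index = active_indices.pop()
        for lane in active:
            results[lane] = resources[index]
        steps += 1
    return results, steps


textures = ["albedo", "normal", "roughness"]
per_lane_index = [lane % 3 for lane in range(WARP_SIZE)]
values, steps = lowered_indirect_access(per_lane_index, textures)
print(values[:4], steps)
```

In this model a divergent access costs one step per lane, whereas a uniform index would need a single access for the whole warp.
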
  • Texturing:

    • Usually, helper threads do not need to execute texture instructions once the level-of-detail has been selected. Skipping texturing on helper threads can save memory bandwidth. [7]
    • Texture projection is not supported. [7]
  • Uniform/constant restrictions: [7]

    • An instruction may access no more than a single 64-bit uniform slot.
    • An instruction may access no more than 64-bits of combined uniforms and constants.
    • An instruction may only access uniforms in the default immediate mode.
    • An instruction may access no more than a single special immediate (e.g. lane_id).
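
The restrictions above lend themselves to a simple validity check. The sketch below is an illustration, not the Mesa validator: the `Operand` type and its fields are invented for the example, it assumes that two consecutive 32-bit uniform words share one 64-bit slot, and it does not model the "default immediate mode" rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    kind: str       # "register", "uniform", "constant" or "special"
    value: int = 0  # 32-bit word index for uniforms, literal or ID otherwise


def check_operands(operands):
    """Return the list of violated restrictions (empty if none are violated)."""
    errors = []
    uniform_words = {op.value for op in operands if op.kind == "uniform"}
    constants = {op.value for op in operands if op.kind == "constant"}
    specials = {op.value for op in operands if op.kind == "special"}

    # Assumption: uniform words 2n and 2n+1 live in the same 64-bit slot.
    slots = {word // 2 for word in uniform_words}
    if len(slots) > 1:
        errors.append("more than one 64-bit uniform slot")
    # Each distinct 32-bit uniform word or constant counts as 32 bits.
    if 32 * (len(uniform_words) + len(constants)) > 64:
        errors.append("more than 64 bits of combined uniforms and constants")
    if len(specials) > 1:
        errors.append("more than one special immediate")
    return errors


print(check_operands([Operand("uniform", 0), Operand("uniform", 1)]))
print(check_operands([Operand("uniform", 0), Operand("uniform", 2)]))
```

The first call reads both halves of one slot and passes; the second spans two slots and is rejected.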