Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] compute shader support #1200

Draft
wants to merge 26 commits into
base: master
Choose a base branch
from
Draft

[WIP] compute shader support #1200

wants to merge 26 commits into from

Conversation

floooh
Copy link
Owner

@floooh floooh commented Jan 25, 2025

  • general stuff:
    • GL backend: ifdef-compute-not-available
    • GL backend: use glMemoryBarrier (see: https://arm-software.github.io/opengl-es-sdk-for-android/compute_intro.html)
    • change the internal names of the various *_info structs to not be abbreviated so much
    • rename sg_features.storage_buffer to sg_features.compute
    • write a compute demo which samples from regular textures (just to check if texture sampling works in compute passes)
    • port wgpu-compute-boids
    • drive-by: remove return value from _sg_*_apply_bindings(), all implementations always return true anyway
    • investigate D3D11 Refcounting Leaks on shutdown (doh, missing d3d11_shutdown()
  • update buffer init behaviour:
    • no longer initialize desc.data size from desc.size (and stricter validation for data.size) - code cleanup, not directly related to compute shader
    • GL specific hazard tracking:
      • track a per-buffer flag when a storage buffer was bound for read/write (cleared
        cleared when bound for read-only
      • if a storage buffer is bound in any pass-type and no matter if read/only or read/write and the read/write flag in the buffer was set, issue a glMemoryBarrier()) - don't need other glMemoryBarrier types since storage buffers cannot currently be re-used as vertex- or index-buffer.
      • remove the other unneeded current hazard-tracking code
    • allow creating immutable buffers without content (will be initialized to zero)
      • metal
      • d3d11
      • gl
      • wgpu
  • compute shader objects
    • metal
    • d3d11
      • on sg_shader_storage_buffer rename hlsl_register_t_n to hlsl_register_t_or_u_n
    • gl
    • wgpu
  • compute pipelines
    • metal
      • investigate: setThreadgroupMemoryLength
      • semi-related: set bound buffer immutability:
        • in render pipelines: all bound vertex- and storage-buffers are MTLMutabilityImmutable
        • in compute pipelines: read-only storage buffers are MTLMutabilityImmutable
    • d3d11
    • gl
    • wgpu
  • compute passes
    • NOTE: concurrent compute pass encoders require bumping the minimal supported macOS base version to 10.14
    • track gpu-written storage-buffers in compute passes
    • metal
      • synchronize written storage buffers (managed buffers only - this blits the gpu-written data back into the cpu-side managed buffer, do this at end of a compute pass)
    • d3d11
    • gl
    • wgpu
  • apply_bindings
    • metal
      • figure out if samplers are supposed to work in compute passes, e.g. MTLComputeCommandEncoder allows to bind samplers but how does that work without the partial-derivatives?
      • apply images and samplers in compute passes
    • d3d11
      • ??? - works without d3d11 validation layer warnings, maybe doesn't apply to "raw buffers"? hmm.. "Buffer resources used for output from the compute shader must be created with the D3D11_BIND_RENDER_TARGET flag. Such resources may be read from, however."
      • ??? works without d3d11 validation layer warnings, maybe doesn't apply to "raw buffers"? "Buffer resources created with the D3D11_BIND_SHADER_RESOURCE flag may only be used as inputs to the compute shader."
    • gl
    • wgpu
  • dispatch
    • metal
      • assume iPhone8 (e.g. non-uniform thread size feature) (probably don't need this with the below changes)
      • rewrite: take workgroup size from shader reflection
      • rewrite: dispatch takes num_groups, not num_threads
    • d3d11
    • gl
    • wgpu
  • validation layer / assert:
    • (NOTE: we probably don't need that) WriteOnce ^ ReadMultiple rule for storage resources within a compute pass, this allows to run dispatches within a compute pass in parallel and only sync in sg_end_pass():
      • a storage resource can either be bound once as read/write binding or multiple times as readonly binding, but not both
      • a storage resource with read/write binding can only be used once
      • a storage resource with readonly binding can be used multiple times
      • how to:
        • during a compute pass, track all read/write and readonly bindings separately
        • in the storage resource, set used_as_readwrite and used_as_readonly flags
        • in sg_end_pass() for each tracked resource, do any necessary synchronization work and clear the used_as_* flags
    • render/compute shader vs SG_SHADERSTAGE_COMPUTE
    • validate sg_shader_desc.compute_workgroup_size (must be > 0, question is just: what's the upper bound?)
    • don't call sg_dispatch() in render passes
    • check that the num_groups_* in the dispatch call doesn't exceed 0xFFFF (see D3D11 functional spec)
    • don't call sg_draw(), sg_apply_viewport, sg_apply_scissor_rect in compute passes
    • storage buffer bindings on vertex- or fragment-stage must be readonly
    • in sg_begin_pass() don't set attachments or swapchain in compute passes
    • in sg_apply_bindings() don't bind vertex/index buffers, textures and samplers (textures and samplers should be fine)
    • in sg_make_shader(): don't mix texture, sampler and writable storage buffer bindings
    • don't use read/write storage resources in render pipeline objects
    • ...?
  • trace hooks and stats tracking
    • stats: num_dispatch
    • update sokol_gfx_imgui.h
    • ...?

Questions and open problems:

  • [YES] should we just allow multiple writes to the same resource and insert the necessary barriers in the APIs which require it?

  • [DONE] ||in HLSL, GLSL and WGSL the shader defines the workgroup size, but in Metal that's defined on the CPU side in the dispatch call.

    • GLSL: layout(local_size_x = X​, local_size_y = Y​, local_size_z = Z​) in;
    • HLSL: numthreads[x, y, z];
    • WGSL: @workgroup_size(x,y,z)
    • then the dispatch call differs by either providing the number of threads, or the number of workgroup invocation in each dimension (the latter implies that number of threads is an integer multiple of number of workgroups)
    • Solution:
      • follow WebGPU convention
      • need to provide workgroup size as shader reflection information in sg_shader_desc, and this will be passed into the Metal dispatch call... (but needs to be evaluated against the max allowed sizes)
      • need to figure out the max-safe-workgroup-size values (again, what does WebGPU do?)
      • ok, WebGPU does the dispatchWorkgroups thing, e.g. if workgroup_size is 64, then: passEncoder.dispatchWorkgroups(Math.ceil(numParticles / 64));
      • ...still need to figure the max safe values for workgroup size though.
  • [LATER] storage textures now or later?

  • [LATER] split SG_BUFFERTYPE into a bool-flags struct? (to allow storage+vertex buffers), alternatively: just do vertex pulling from storage buffers...?

@floooh floooh self-assigned this Jan 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant