
Compute example #22

Draft: wants to merge 8 commits into master
Conversation

@Shfty (Collaborator) commented Jun 2, 2023

Thanks for putting in the effort on this - I've not had a chance to test it yet, but if it runs and produces the expected output then I'd say it's a strong start!

@samoylovfp (Author)

Not yet. I need to read more about how bevy handles shaders and haven't had the energy to do so yet. Planning on taking another stab at it this weekend or the next.

@samoylovfp (Author)

I hope I am missing something obvious, but the ways of achieving the result that I can see all feel a bit backwards.

The current problem is that the bevy compute pipeline requires a bevy shader handle, which in the bevy compute example is obtained by loading the shader from a path via the AssetServer. For a rust-gpu shader we first need to convert it into a bevy Shader before we can build the compute pipeline.
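
For reference, the relevant part of the upstream bevy compute example looks roughly like this (a paraphrase of the bevy 0.10-era game_of_life pipeline setup; exact field names vary between bevy versions, and the bind group layout is assumed to be created elsewhere):

    use std::borrow::Cow;
    use bevy::prelude::*;
    use bevy::render::render_resource::*;

    // Paraphrased from the upstream compute example: the pipeline can only be
    // queued once we have a Handle<Shader>, which upstream comes straight from
    // the AssetServer. That is the step with no obvious equivalent for a
    // RustGpuBuilderOutput.
    fn queue_compute_pipeline(world: &mut World, layout: BindGroupLayout) -> CachedComputePipelineId {
        let shader: Handle<Shader> = world.resource::<AssetServer>().load("shaders/game_of_life.wgsl");
        let pipeline_cache = world.resource::<PipelineCache>();
        pipeline_cache.queue_compute_pipeline(ComputePipelineDescriptor {
            label: None,
            layout: vec![layout],
            push_constant_ranges: Vec::new(),
            shader,
            shader_defs: vec![],
            entry_point: Cow::from("init"),
        })
    }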

I initially thought it would be easiest to just do this processing in the render graph node's "update" method, but it seems the render sub-app's World does not provide Assets<Shader> or Assets<RustGpuBuilderOutput>, so I cannot simply take the shader builder output, convert it, and put it into Assets<Shader>. It might be possible to convert them in the bevy app and then "extract" them into the render world, but that feels even more backwards than the other options.

I think at this point the most straightforward way is to provide a custom AssetLoader that builds a bevy Shader out of the RustGpuBuilderOutput. The bevy compute example would then require very little change: only registering the asset loader and naming the shader, perhaps using the "#fragment" suffix in the asset path to indicate the entry point. I'll try that.
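
A minimal sketch of such a loader, assuming the bevy 0.10-era AssetLoader API and that the asset bytes are already raw SPIR-V (RustGpuShaderLoader and the .spv extension are placeholders; a real loader would first unpack the RustGpuBuilderOutput, and the "#fragment" entry-point handling is left out):

    use bevy::asset::{AssetLoader, Error, LoadContext, LoadedAsset};
    use bevy::render::render_resource::Shader;
    use bevy::utils::BoxedFuture;

    /// Hypothetical loader that turns a compiled rust-gpu module into a bevy Shader asset.
    #[derive(Default)]
    pub struct RustGpuShaderLoader;

    impl AssetLoader for RustGpuShaderLoader {
        fn load<'a>(
            &'a self,
            bytes: &'a [u8],
            load_context: &'a mut LoadContext,
        ) -> BoxedFuture<'a, Result<(), Error>> {
            Box::pin(async move {
                // A real loader would pull these bytes out of the
                // RustGpuBuilderOutput; here we assume the file is raw SPIR-V.
                let shader = Shader::from_spirv(bytes.to_vec());
                load_context.set_default_asset(LoadedAsset::new(shader));
                Ok(())
            })
        }

        fn extensions(&self) -> &[&str] {
            &["spv"]
        }
    }

It would then be registered on the app with add_asset_loader, and the compute example could keep obtaining its Handle<Shader> through the AssetServer exactly as it does today.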

@samoylovfp (Author)

I ended up in a situation where I think it should work, but it doesn't, and it doesn't even complain that anything is wrong.
I suspect I might have messed up the signature of the shader entry point; I'll try to debug it somehow.

@samoylovfp (Author)

I'll take another stab in two weeks unless someone figures it out sooner

@tombh commented Jun 14, 2023

Sounds like great progress, I'm excited to try it out!

@samoylovfp (Author)

I wasn't able to debug why nothing is showing, and I lost the remainder of my motivation reading through the SPIR-V specification. I might take another stab in a few months.

@tombh commented Jul 6, 2023

Ha yeah SPIR-V is pretty esoteric. Awesome work @samoylovfp 🙇

@johnny-smitherson commented Nov 6, 2023

Hey, I got this to work with some extensive hacking and vendoring, and also updated it to bevy 0.11.

The problem with this PR is that the shader accepts a storage buffer:

    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] texture: &[Vec4],

but the bevy app binds a storage texture. The solution was to change the shader to accept an Image!(2D, format=...) instead:

    #[spirv(descriptor_set = 0, binding = 0)] texture: &Image!(2D, format=rgba8_snorm, sampled=false),
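
On the bevy side the matching entry in the bind group layout is then a storage texture rather than a storage buffer; roughly like this, using the wgpu types that bevy re-exports (the read-write access mode is an assumption, the actual branch may differ):

    use bevy::render::render_resource::*;

    // Storage texture at descriptor set 0, binding 0, matching
    // `Image!(2D, format = rgba8_snorm, sampled = false)` on the shader side.
    fn texture_layout_entry() -> BindGroupLayoutEntry {
        BindGroupLayoutEntry {
            binding: 0,
            visibility: ShaderStages::COMPUTE,
            ty: BindingType::StorageTexture {
                access: StorageTextureAccess::ReadWrite,
                format: TextureFormat::Rgba8Snorm,
                view_dimension: TextureViewDimension::D2,
            },
            count: None,
        }
    }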

Here is a branch where it renders simplex noise at 11 FPS at 1280x720 (about 10x slower than single-threaded CPU):

[noiseCapture: animated screen capture of the simplex noise render]

#23

@tombh commented Nov 6, 2023

That's great! Did you mean to post a link to the branch, or were you just mentioning it? I see the link now.

Do you think the notably slower frame rate is because of Bevy's rust-gpu integration? Or just because your implementation is prioritising proof of concept for now?

@johnny-smitherson

That old noise crate was the only no_std crate that worked; see rust-gpu/shader/lib/noise.

To see whether it's really slow or not, we'd have to translate it and compare with WGSL on the same GPU; otherwise the comparison with the CPU is meaningless (who knows what intrinsics magically show up on the CPU side?).

Anyway, it's a starting point for doing your own computation and benchmarks.

@johnny-smitherson

If the maintainer is still interested, I can make separate PRs for each of the vendored codebases. Otherwise it's really a bother to work with many little crates spread around when I need to upgrade dependencies in each and every one.

@johnny-smitherson commented Nov 6, 2023

The timings of GPU vs. CPU for the noise crate:

  • CPU, cargo run (debug build): 270 ms
  • CPU, cargo run --release: 22 ms
  • GPU (NVIDIA): 90 ms (both with and without --release when building the shader crate)

I think the noise crate is a little too much compute for each thread. I'm sure other tasks are better suited for this, though I'd first translate some WGSL/GLSL compute benchmarks into Rust and see if there are major losses in SPIR-V/SPIR-T/whatever.

EDIT: Actually only 4x slower than the CPU; I think it's working correctly.

@johnny-smitherson commented Nov 6, 2023

Here is Game of Life in 80 lines, running at 4K at 60 FPS.

Ported from: https://github.com/bevyengine/bevy/blob/v0.12.0/assets/shaders/game_of_life.wgsl

Setting "NO VSYNC" in bevy doesn't actually let the game go over 60 FPS; there's probably some way we can trace into the compute pass and get the compute shader's runtime from there?

[screen capture of the Game of Life simulation]

#![no_std]
#![feature(asm_experimental_arch)]

use spirv_std::{
    spirv,
    glam::{UVec3, IVec2, Vec4}, Image,
};

fn hash(value: u32) -> u32 {
    let mut state = value;
    state = state ^ 2747636419;
    state = state * 2654435769;
    state = state ^ state >> 16;
    state = state * 2654435769;
    state = state ^ state >> 16;
    state = state * 2654435769;
    return state;
}

fn randomFloat(value: u32) -> f32 {
    return (hash(value) as f32) / 4294967295.0;
}

pub type Image_2D_SNORM =  Image!(2D, format=rgba8_snorm, sampled=false);

fn is_alive(location: IVec2, offset_x: i32, offset_y: i32, image: &Image_2D_SNORM) -> i32 {
    let value = image.read(location + IVec2::new(offset_x, offset_y));
    return value.x as i32;
}

fn count_alive(location: IVec2, image: &Image_2D_SNORM) -> i32 {
    return is_alive(location, -1, -1, image) +
           is_alive(location, -1,  0, image) +
           is_alive(location, -1,  1, image) +
           is_alive(location,  0, -1, image) +
           is_alive(location,  0,  1, image) +
           is_alive(location,  1, -1, image) +
           is_alive(location,  1,  0, image) +
           is_alive(location,  1,  1, image);
}



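// Entry point: fill the texture with a random initial pattern of alive/dead cells.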
#[spirv(compute(threads(8,8)))]
pub fn init(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(num_workgroups)] num: UVec3,
    #[spirv(descriptor_set = 0, binding = 0)] texture: &Image_2D_SNORM,
) {

    let coord = IVec2::new(id.x as i32, id.y as i32);
    let randomNumber = randomFloat(id.y * num.x + id.x);
    let alive = randomNumber > 0.9;
    let alive_f = alive as i32 as f32;
    let pixel = Vec4::new(alive_f, alive_f, alive_f, 1.0);
    unsafe {
        texture.write(coord, pixel);
    }
}

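// Entry point: count each cell's live neighbours and apply Conway's rules to produce the next generation.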
#[spirv(compute(threads(8,8)))]
pub fn update(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(num_workgroups)] num: UVec3,
    #[spirv(descriptor_set = 0, binding = 0)] texture: &Image_2D_SNORM,
) {

    let coord = IVec2::new(id.x as i32, id.y as i32);
    let n_alive = count_alive(coord, texture);
    let alive = n_alive == 3 || n_alive == 2 && is_alive(coord, 0, 0, texture) == 1;
    let alive_f = alive as i32 as f32;
    let pixel = Vec4::new(alive_f, alive_f, alive_f, 1.0);

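    // Wait until all threads in this workgroup have read their neighbours before writing the new state.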
    unsafe { spirv_std::arch::workgroup_memory_barrier_with_group_sync() };

    unsafe {
        texture.write(coord, pixel);
    }

}

@johnny-smitherson commented Nov 6, 2023

I've also thrown in some of my game logic (ballistic solution for 1st-order viscosity) and it works exactly as expected, with speeds comparable to CPU Rust (5x slower) and no value errors.

I can't wait not to learn WGSL

thanks for the code!

@johnny-smitherson

Also, I've set up a Docker build process on the fork, so you don't have to install the six-month-old nightly Rust on the host.

Another note: for compute shaders, the entry_points.json machinery in the bevy app / rust-gpu builder doesn't need to exist, so rust-gpu-builder probably doesn't have to pull in bevy and bevy-gpu-builder-shared (which pulls in bevy reflection), since we don't want any interop with internal types.

So it might be easier to use rust-gpu directly, without any bevy-specific code, for generating compute shaders, given that we still have to set up all the bind groups by hand.

Maybe after the entry_points.json machinery is updated to work with bevy 0.11/0.12 we can look at automatically generating the layout and binding code for the bevy side (similar to what's being done for materials).

@tombh commented Nov 11, 2023

That's a lot of great info and insight. Even if it's slower, it's just great to know that it all works. I'm still a newbie to all this, so it'll take me a while to pore over everything.
