Compute renderer displays garbage at high IRs #2047

Open

FireNX70 opened this issue May 16, 2024 · 29 comments
@FireNX70

Resolutions higher than 6x are very prone to displaying garbage. 6x displays garbage only in certain spots, though Zelda Spirit Tracks triggers it pretty easily. 5x seems fine. Tested on a 1080Ti + Win10, with a72b79a, so 5df83c9 doesn't fix it.

Some examples (attached screenshots):
- pkmn_bw_6x: Pokemon BW, Driftveil Bridge, 6x
- pkmn_pt_spot_5x: Pokemon Platinum, 5x
- pkmn_pt_spot_6x: Pokemon Platinum, 6x
- pkmn_pt_next_to_spot_6x: Pokemon Platinum, 6x
- pkmn_pt_next_to_spot_7x: Pokemon Platinum, 7x
- tloz_ph_6x: Zelda Phantom Hourglass, 6x
- tloz_ph_7x: Zelda Phantom Hourglass, 7x
- tloz_st_5x: Zelda Spirit Tracks, 5x
- tloz_st_6x: Zelda Spirit Tracks, 6x

@RSDuck
Member

RSDuck commented May 16, 2024

Sorry, I have an Intel iGPU, so I can't debug these things.

They probably just need even more work tiles, or some other index starts overflowing. I need to come up with a smarter way of allocating the number of work tiles, or at least give an option to override it.

@FireNX70
Author

I tried doubling MaxWorkTiles and it didn't fix it. Increasing the size of FinalTileMemory didn't do anything either.

@takeshineale128

No issues on my GTX 1660 GPU at 16x IR on Win 11 with driver version 552.44.

@FireNX70
Author

I've tested both updating the driver and using a T1000. Neither fixed this. So it's not the driver version or the architecture. I really don't think the OS would make a difference, especially if it's Win10 vs Win11. Were you really testing with the compute renderer?

@takeshineale128

opengl

@FireNX70
Author

There are two OpenGL renderers. Are you testing with a dev version or are you using 0.9.5?

@takeshineale128

0.9.5

@RSDuck
Member

RSDuck commented May 17, 2024

> Increasing the size of FinalTileMemory didn't do anything either.

FinalTileMemory isn't dynamically allocated, so it cannot overflow.

Not really sure what the issue could be here. It definitely seems to be tile-related, and the index should not overflow (yet).

@nadiaholmquist
Collaborator

Can't seem to reproduce this on an AMD GPU (Steam Deck). Will try on other stuff later.

@user18081972

user18081972 commented May 17, 2024

> opengl

@takeshineale128 The OpenGL renderer we are talking about here is the compute shader one, not the classic one.

@user18081972

AvalonCode.mp4

Here's a video example; hopefully it helps.

@FireNX70
Author

FireNX70 commented May 17, 2024

@nadiaholmquist I've tested with a Vega 8 (3500U) on Win10 with an old driver from 2023 and it works fine even at 16x, although attempting to switch resolutions while the game was running completely froze the renderer. I've also tested with LLVMpipe in a VM, and that works fine at 16x as well. It looks like this is a bug specific to Nvidia. I'll test Nvidia on Linux with the proprietary driver; I expect it will behave just like it does on Win10.

Edit: Nvidia on Linux with the proprietary driver gives the exact same result as on Win10.

@RSDuck
Member

RSDuck commented May 17, 2024

Oh no, there might still be undefined behaviour somewhere in the shaders. I'm going to run the shaders with Mesa and a debug context eventually to see whether it warns about anything.

@takeshineale128

takeshineale128 commented May 18, 2024

Not using dev builds; my 0.9.5 build has no OpenGL option for classic or compute shader, so it's probably the classic one.

@KostaSaizo7

Blue Dragon - Awakened Shadow
Windows 10
Nvidia GPU
Compute shader x8 (x7 and below is fine)
https://www.youtube.com/shorts/TX5Nq3RdX5Q

@user18081972

I decided to downgrade my GPU driver (Nvidia RTX 3080) to the oldest supported one (version 456.71) to see if it might be a 'regression'.
The same issue occurs on this driver version from October 7, 2020.

@FireNX70
Author

FireNX70 commented Jun 2, 2024

glDispatchComputeIndirect will not log OpenGL errors even if the compute work group counts are too high, and checking these would be annoying since I'd have to read the buffer back from VRAM. However, RenderDoc did capture the values used for these. At 6x none of the calls ever go above the minimum of 65535. At 7x there is one call that does exceed that minimum (glDispatchComputeIndirect(<1, 1, 79414>)). This exceeds the maximum supported by my 1080Ti with the 552 driver (it sticks to the minimum of 65535 in the Z axis). I tried replacing the glDispatchComputeIndirect call in the loop with glDispatchCompute(1, 1, 65535). This DID have an effect on the garbage, but it did not fix it. It's likely this is the problem on Nvidia. The local sizes all seem fine.

Edit: Here's a gist covering all the testing I've been doing: https://gist.github.com/FireNX70/10c72169ea0105f998bab6f51443d42e
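For reference, the per-axis dispatch limits the driver exposes can be queried at runtime. A minimal sketch, assuming a GL 4.3+ context (variable names are mine):

```cpp
// GL_MAX_COMPUTE_WORK_GROUP_COUNT is an indexed query: one value per axis.
GLint maxGroupCount[3];
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 0, &maxGroupCount[0]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 1, &maxGroupCount[1]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 2, &maxGroupCount[2]);
// The spec only guarantees 65535 per axis; the 1080Ti described above reports
// a much larger X limit but sticks to 65535 for Y and Z.
```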

@RSDuck
Member

RSDuck commented Jun 2, 2024

thank you for debugging that.

Regarding the barriers: that is obviously a mistake. GL_SHADER_STORAGE_BUFFER is defined as 0x90D2, which should result in a GL error; but if the driver accepts it and only uses the valid bits, it probably still works out, since most GPUs don't have barriers that fine-grained anyway, so there's probably at least one valid bit set which will insert a correct barrier.
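As a sketch of what the corrected call would look like, assuming the mix-up is passing the buffer-binding target where glMemoryBarrier expects bitfield flags:

```cpp
// glMemoryBarrier takes GLbitfield flags, not a buffer binding target.
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);    // makes SSBO writes visible
// not: glMemoryBarrier(GL_SHADER_STORAGE_BUFFER); // 0x90D2 is a binding target
```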

Huh, I always assumed you could dispatch as many work groups as fit inside a 32-bit int (as the count is stored in one). The reason it didn't fix it could be any of several things; if you don't change any of the bindings, you will end up just doing the same work multiple times (possibly at the same time).

I need to look at what parallel-rdp does again; maybe it has a solution for this. I simplified some things, e.g. we always process all polygons at once, because the count is limited to a fixed amount on the DS. One possible easy out would be to just increase the tile size to 16x16 at some point. That would also solve the possible issue of the tile buffer index overrunning.

@FireNX70
Author

FireNX70 commented Jun 3, 2024

> thank you for debugging that.

No problem; once this is fixed I can fully switch to melonDS.

> Huh, I always assumed you could dispatch as many work groups as fit inside a 32-bit int (as the count is stored in one). The reason it didn't fix it could be any of several things; if you don't change any of the bindings, you will end up just doing the same work multiple times (possibly at the same time).

There's a required minimum limit of 65535, and implementations are free to establish higher limits. I haven't checked what limits Mesa and Radeon report; Nvidia has a higher limit on the X dimension but sticks to 65535 for the Y and Z dimensions.

> I need to look at what parallel-rdp does again; maybe it has a solution for this. I simplified some things, e.g. we always process all polygons at once, because the count is limited to a fixed amount on the DS. One possible easy out would be to just increase the tile size to 16x16 at some point. That would also solve the possible issue of the tile buffer index overrunning.

I'm not well versed in graphics or OpenGL, certainly not enough to understand what the compute renderer is doing. Having said that, it looks to me as though the right thing to do would be to cap VariantWorkCount's Z component to the respective GL_MAX_COMPUTE_WORK_GROUP_COUNT and do multiple passes per variant if needed. Easy to say, probably a pain in the ass to implement. I don't think larger tiles would guarantee the work group count stays below the limit at higher internal resolutions.
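Purely as an illustration of that idea (the names and the offset uniform are hypothetical, and as the next comment explains, the per-variant count only exists GPU-side, so a CPU-side loop like this doesn't apply directly):

```cpp
#include <algorithm>

// Hypothetical sketch: clamp the Z dimension to the queried limit and issue
// several dispatches, telling the shader its Z offset through a uniform.
void DispatchClampedZ(GLuint totalZ, GLuint maxZ, GLint zOffsetUniform)
{
    for (GLuint base = 0; base < totalZ; base += maxZ)
    {
        glUniform1ui(zOffsetUniform, base); // shader adds this to gl_WorkGroupID.z
        glDispatchCompute(1, 1, std::min(maxZ, totalZ - base));
    }
}
```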

@RSDuck
Member

RSDuck commented Jun 3, 2024

The problem is that the amount of work done per variant is not known to the CPU at the time it issues the indirect compute calls (that's the reason they are indirect in the first place). Letting the GPU do the binning, then waiting on that and dispatching as many compute jobs as necessary would work, but it would just waste CPU time on waiting.

After sleeping on this, there is technically another solution: use the other two axes as well, and then combine all the axes back into one index in the shader.
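A minimal sketch of that axis-folding idea, with illustrative names (in melonDS the split would have to happen GPU-side, in whatever shader writes the indirect dispatch buffer, but the arithmetic is the same):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Spread `jobs` work groups across three axes so no single axis exceeds the limit.
std::array<uint32_t, 3> SplitDispatch(uint32_t jobs, uint32_t maxPerAxis)
{
    if (jobs == 0) return {0, 0, 0};
    uint32_t x = std::min(jobs, maxPerAxis);
    uint32_t y = std::min((jobs + x - 1) / x, maxPerAxis);
    uint32_t z = (jobs + x * y - 1) / (x * y);
    return {x, y, z};
}

// The consuming shader then rebuilds the flat job index and skips the overshoot:
//   uint flat = gl_WorkGroupID.z * (countX * countY)
//             + gl_WorkGroupID.y * countX
//             + gl_WorkGroupID.x;
//   if (flat >= jobs) return;
```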

I quite like the idea of increasing the tile size at very high resolutions though. The increased tile size means fewer tiles are needed to fill the same area, and as long as the total number of tiles is less than 2^16, no single variant can result in a compute job larger than that.
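Rough numbers for that (my own arithmetic, using the DS's 256x192 framebuffer):

```cpp
// At 16x scale with 16x16 tiles the tile grid is 256x192 tiles, which stays
// below the 2^16 ceiling mentioned above.
constexpr int Scale = 16, TileSize = 16;
constexpr int TilesPerLine = 256 * Scale / TileSize; // 256
constexpr int TileLines    = 192 * Scale / TileSize; // 192
static_assert(TilesPerLine * TileLines < (1 << 16), "tile index fits in 16 bits");
```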

What of course remains a potential issue is that the tile buffer can overflow, as it has a fixed size. It would be possible to conservatively estimate the size needed each frame by using the bounding boxes of each polygon, but with some polygon shapes that might result in excessive allocations.

The two solutions don't exclude each other; I guess I can also just implement both.

@FireNX70
Author

FireNX70 commented Jun 3, 2024

Scaling the tile size with the internal resolution sounds good. Do keep in mind there's a limit on the local size too, although the minimum of 1024 should be fine (1024x1024 for a single tile would be huge). As for the tile buffer: IMO it's better to be safe than sorry, especially if the estimate doesn't overshoot by a lot.

Edit: 11x (almost 4K on a single screen) uses about 2650 MB of VRAM right now. It's not nothing, but there's some margin for the tile buffer to grow.

@FireNX70
Author

I've been messing around with scaled tiles. 16x16 tiles work fine; 32x32 tiles (the maximum size, since the tile size is used for the X and Y local sizes in certain shaders and I'm sticking to the minimum GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS of 1024) present a couple of problems.

glDispatchCompute(TilesPerLine*TileLines/32, 1, 1), used for ShaderClearCoarseBinMask, is a problem because with 32x32 tiles the total tile count (TilesPerLine*TileLines) will not divide by 32 nicely. I think we divide TilesPerLine * TileLines by 32 because of the shader's local size. That division works fine with both 16 and 48, but neither is optimal for Nvidia (warp size of 32) or Radeon (RDNA wave size is 32, GCN wave size is 64). I'm not sure whether the divider for CoarseBinStride has to match the local size too; if so, this only works if it's 16, which is as far from optimal as 48 for Nvidia and RDNA but probably worse for GCN.
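One common way around the uneven division, as a sketch (LocalSize and TotalTiles are illustrative names, and the shader-side guard is an assumption about how the overshoot would be skipped):

```cpp
// Round the group count up instead of relying on an exact division, and let the
// shader ignore the extra invocations.
const int LocalSize  = 32;                        // ShaderClearCoarseBinMask's local size
const int TotalTiles = TilesPerLine * TileLines;
glDispatchCompute((TotalTiles + LocalSize - 1) / LocalSize, 1, 1);
// In the shader: if (gl_GlobalInvocationID.x >= TotalTiles) return;
```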

glDispatchCompute((gpu.GPU3D.RenderNumPolygons + 31) / 32, ScreenWidth/CoarseTileW, ScreenHeight/CoarseTileH), used for ShaderBinCombined, is also a problem. With 32x32 tiles, CoarseTileH ends up being 128. At lower tile sizes it's 32 or 64, both of which 192 is a multiple of, so the vertical resolution was guaranteed to be a multiple of CoarseTileH. 192 is not a multiple of 128, so a CoarseTileH of 128 becomes a problem at certain resolutions, specifically odd multiples of the base resolution (192*scale/128 = 3*scale/2, which is only an integer when the scale is even). This, I think, is what causes a blank strip at the bottom of the screen with odd scales. Making CoarseTileCountY variable should fix it, but it introduced more bands (the number of bands matches the scale). With coarse tiles 6 tiles tall (when using 32x32 tiles) the height of a coarse tile is 192, which ensures the internal resolution is a multiple of CoarseTileH, so the number of coarse tile lines matches the scale. What I think is really happening is that there's one two-tile-thick band per coarse tile, because a hardcoded 4 somewhere means 2 rows of tiles are ignored per coarse tile.

Just in case here's what I've done: https://github.com/FireNX70/melonDS/tree/try-fixing-the-compute-renderer

@user18081972

user18081972 commented Jun 10, 2024

I built your branch, and it fixes the issue up until 10x IR; from there, static blue lines appear in ChibiRobo.

And in Avalon Code, static stripes of missing pixels appear.

@FireNX70
Author

Yeah, those are the stripes I was talking about. Starting at 10x it uses 32x32 tiles.

@FireNX70
Author

I think what's happening with the stripes is that BinCombined's local_size_x must match CoarseTileCountX * CoarseTileCountY. A quick hack to do that (and changing the associated glDispatchCompute call to add 47 and divide by 48 instead of 31 and 32) gets rid of the stripes, but it causes a bunch of geometry to fail to render.

I've also found (through testing with 16x16 tiles) that the scale at which garbage may start being displayed is 3.5x with 8x8 tiles (so 7x with 16x16 tiles in practice). The worst-case scenario seems to be MKDS' desert track: if the fireballs (which are only there on the last 2 laps) get extremely close to the camera, and the scale is high enough and the tile size small enough, you'll get garbage. Driving through them seems to be the best way to trigger this. The next-worst scenario, Pokemon BW's Driftveil Bridge, only starts breaking at 5x (8x8) and is fine at 4x (8x8), 7x (16x16) and 8x (16x16).

@FireNX70
Author

FireNX70 commented Jun 12, 2024

I think I got it working. Work count X just wasn't related to the local size for BinCombined. The only remaining problem is ClearCoarseBinMask, but I think changing the local size to 48 and dividing TilesPerLine*TileLines by 48 should fix that.

@user18081972

On the Pokemon BW2 Driftveil Bridge I do indeed get corruption, at 4x, 8x and 16x; every other resolution works perfectly.
(tested on: FireNX70@6f8ce9f)

@user18081972

user18081972 commented Jul 16, 2024

I tested it again with FireNX70@6f8ce9f, with Pokemon BW2 on the drawbridge in the same spot, but this time on an Intel Arc A380 GPU (6 GB). It works without any issues, on all resolutions, with the compute shader renderer, so Arc is also unaffected.

@patataofcourse
Contributor

This happens with the melonDS 1.0 RC on Built to Scale DS (the first game in Rhythm Heaven), at 8x and upwards.


This is on Windows 10 with an Nvidia RTX 4050.
