Compute renderer displays garbage at high IRs #2047
Sorry, I have an Intel iGPU, I can't debug these things. They probably just need even more work tiles, or some other index starts overflowing. I need to come up with a smarter way of allocating the work tile amount, or at least give an option to override it.
I tried doubling MaxWorkTiles and it didn't fix it. Increasing the size of FinalTileMemory didn't do anything either.
no issues on my gtx 1660 gpu at 16x ir on win 11 with driver version v552.44
I've tested both updating the driver and using a T1000. Neither fixed this. So it's not the driver version or the architecture. I really don't think the OS would make a difference, especially if it's Win10 vs Win11. Were you really testing with the compute renderer?
opengl
There are two OpenGL renderers. Are you testing with a dev version, or are you using 0.9.5?
0.9.5
FinalTileMemory isn't dynamically allocated, so it cannot overflow. Not really sure what the issue could be here. It definitely seems to be tile related, and the index should not overflow (yet).
Can't seem to reproduce this on an AMD GPU (Steam Deck). Will try on other stuff later.
@takeshineale128 |
AvalonCode.mp4
Here's a video example, hopefully it helps.
@nadiaholmquist I've tested with a Vega 8 (3500U) on Win10 with an old driver from 2023 and it works fine even at 16x, although attempting to switch resolutions while the game was running completely froze the renderer. I've also tested with LLVMpipe in a VM and it also works fine even at 16x. It looks like it's a bug specific to Nvidia. I'll test Nvidia on Linux with the proprietary driver, I expect that it will behave just like it does in Win10. Edit: Nvidia on Linux with the proprietary driver gives the exact same result as on Win10.
Oh no, there still might be undefined behaviour somewhere in the shaders. I'm going to run the shaders with Mesa and a debug context eventually to see whether it warns about something.
Blue Dragon - Awakened Shadow |
I decided to downgrade my GPU driver (Nvidia RTX 3080) to the oldest supported one (version 456.71) to see if it might be a 'regression'.
Edit: Here's a gist with all the testing I've been doing: https://gist.github.com/FireNX70/10c72169ea0105f998bab6f51443d42e
Thank you for debugging that. Regarding the barriers, that is obviously a mistake. GL_SHADER_STORAGE_BUFFER is defined as 0x90D2, which should result in a GL error; but even if the driver accepts it and only uses the valid bits, most GPUs don't have barriers that fine-grained anyway, so there's probably at least one bit set which will insert a correct barrier. Huh, I always assumed you could use as many work groups as fit inside a 32-bit int (as the count is stored in one). The reason it didn't fix it could be multiple things. If you don't change any of the bindings, you will end up just doing the same work multiple times (possibly at the same time). I need to look into what parallel-rdp does again, maybe it has a solution to this. I simplified some things, e.g. we always process all polygons at once, because it is limited to a fixed amount on the DS. One possible easy out would probably be to just increase the tile size to 16x16 at some point. That would also solve the possible issue of the tile buffer index overrunning.
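For what it's worth, comparing the two enum values involved (taken from the standard GL headers) shows the mistaken constant doesn't even overlap the bit the barrier call actually needs, so a driver that silently masks off invalid bits would still not insert the SSBO barrier. A minimal check:

```cpp
#include <cassert>

// Enum values as defined in the standard OpenGL headers (GL/glcorearb.h).
// GL_SHADER_STORAGE_BUFFER is a buffer *binding target*; the correct
// argument for glMemoryBarrier() is GL_SHADER_STORAGE_BARRIER_BIT.
constexpr unsigned kShaderStorageBuffer     = 0x90D2; // GL_SHADER_STORAGE_BUFFER
constexpr unsigned kShaderStorageBarrierBit = 0x2000; // GL_SHADER_STORAGE_BARRIER_BIT

// If a driver only keeps the valid bits of 0x90D2, the one bit this
// barrier actually needed (the SSBO bit) is not among the ones set.
constexpr bool ssboBitPresent =
    (kShaderStorageBuffer & kShaderStorageBarrierBit) != 0;

static_assert(!ssboBitPresent, "0x90D2 does not contain the SSBO barrier bit");
```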
No problem, once this is fixed I can fully switch to MelonDS.
There's a minimum required limit of 65535. Implementations are free to establish higher limits. I haven't checked what limits Mesa and Radeon report, Nvidia has a higher limit on the X dimension but sticks to 65535 for the Y and Z dims.
I'm not well versed in graphics or OpenGL, certainly not enough to understand what the compute renderer is doing. Having said that, it looks to me as though the right thing to do would be to cap VariantWorkCount's Z component to the respective GL_MAX_COMPUTE_WORK_GROUP_COUNT and do multiple passes per variant if needed. Easy to say, probably a pain in the ass to implement. I don't think larger tiles would guarantee the work group count wouldn't go above the limit at higher internal resolutions.
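The "multiple passes per variant" idea could be sketched roughly like this (the function name and shape are hypothetical, not actual melonDS code): split a group count that exceeds the per-axis limit into several dispatches, each at or below GL_MAX_COMPUTE_WORK_GROUP_COUNT.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: split a total work-group count into chunks no
// larger than the driver's GL_MAX_COMPUTE_WORK_GROUP_COUNT limit.
// Each returned entry would become one glDispatchCompute call.
std::vector<uint32_t> SplitDispatch(uint32_t totalGroups, uint32_t maxPerDispatch)
{
    std::vector<uint32_t> passes;
    while (totalGroups > 0)
    {
        uint32_t n = totalGroups < maxPerDispatch ? totalGroups : maxPerDispatch;
        passes.push_back(n);
        totalGroups -= n;
    }
    return passes;
}
```

With the spec-minimum limit of 65535, a variant needing 150000 groups would become three passes of 65535, 65535 and 18930 groups.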
The problem is that the amount of work done per variant is not known to the CPU at the time it issues the indirect compute calls (that's the reason they are indirect in the first place). Letting the GPU do the binning, then waiting on that and then dispatching as many compute jobs as necessary would work, but it would just waste CPU time on waiting. After sleeping on this, there is technically another solution: use the other two axes as well, and then combine all the axes into one index in the shader. I quite like the idea of increasing the tile size at very high resolutions though. The increased tile size results in a lower amount of tiles still filling the same area; as long as the total amount of tiles is less than 2^16, no single variant can result in a larger compute job than that. What of course remains a potential issue is that the tile buffer can overflow, as it has a fixed size. It would be possible to conservatively estimate the size necessary each frame by using the bounding boxes of each polygon, but with some polygon shapes that might result in excessive allocations. The two solutions don't exclude each other, I guess I can also just implement them both.
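Combining the axes back into one index is plain arithmetic; in the shader it would use gl_WorkGroupID, but the math can be shown in C++ just as well (names here are illustrative, and the per-axis counts are whatever the CPU side picks, each within the 65535 minimum limit):

```cpp
#include <cstdint>

// Sketch of spreading one large dispatch across all three work-group
// axes and recombining them into a single linear index in the shader.
// Equivalent GLSL would be:
//   index = gl_WorkGroupID.z * (groupsX * groupsY)
//         + gl_WorkGroupID.y * groupsX
//         + gl_WorkGroupID.x;
constexpr uint32_t FlattenWorkGroupID(uint32_t idX, uint32_t idY, uint32_t idZ,
                                      uint32_t groupsX, uint32_t groupsY)
{
    return idZ * groupsX * groupsY + idY * groupsX + idX;
}
```

Every work group then gets a unique linear index even though no single axis exceeds its limit.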
Scaling the tile size with the internal resolution sounds good. Do keep in mind there's a limit on the local size too, although the minimum of 1024 should be fine (1024x1024 for a single tile would be huge). As for the tile buffer: IMO it's better to be safe than sorry, especially if it doesn't overestimate by a lot. Edit: 11x (almost 4K on a single screen) uses about 2650 MB of VRAM right now. It's not nothing, but there's some margin for the tile buffer to grow.
I've been messing around with scaled tiles. 16x16 tiles work fine, 32x32 tiles (the max size, since the tile size is used for the X and Y local sizes in certain shaders and I'm sticking to the default GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS of 1024) presents a couple of problems.
Just in case, here's what I've done: https://github.com/FireNX70/melonDS/tree/try-fixing-the-compute-renderer
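A quick illustration of why 32x32 is the ceiling here, assuming the spec-guaranteed minimum of 1024 for GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS mentioned above:

```cpp
// The tile size doubles as local_size_x and local_size_y in some of
// the shaders, so tileSize * tileSize is bounded by
// GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS. Sticking to the minimum the
// spec guarantees (1024) makes 32x32 the largest portable tile.
constexpr int kMaxInvocations = 1024; // spec-required minimum

constexpr bool TileSizeFits(int tileSize)
{
    return tileSize * tileSize <= kMaxInvocations;
}
```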
Yeah, those are the stripes I was talking about. Starting at 10x it uses 32x32 tiles. |
I think what's happening with the stripes is that BinCombined's local_size_x must match CoarseTileCountX * CoarseTileCountY. A quick hack to do that (and changing the associated glDispatchCompute call to add 47 and divide by 48 instead of adding 31 and dividing by 32) gets rid of the stripes, but causes a bunch of geometry to fail to render. I've also found (through testing with 16x16 tiles) that the scale at which garbage may start being displayed would be 3.5x with 8x8 tiles (7x with 16x16 tiles in practice). The worst case scenario seems to be MKDS' desert track. If the fire balls (which are only there on the last 2 laps) get extremely close to the camera, the scale is high enough and the tile size is small enough, you'll get garbage. Driving through them seems to be the best way to trigger this. The next worst scenario, Pokemon BW's Driftveil Bridge, only starts breaking at 5x (8x8) and is fine at 4x (8x8), 7x (16x16) and 8x (16x16).
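For clarity, the dispatch-size tweak described above is the usual ceiling division, with the local size as a parameter (48 standing in for CoarseTileCountX * CoarseTileCountY in this sketch):

```cpp
// "Add 47 and divide by 48" is GroupCount(n, 48): a ceiling division
// that rounds the dispatch up so every element is covered by some
// work group of the given local size.
constexpr unsigned GroupCount(unsigned items, unsigned localSize)
{
    return (items + localSize - 1) / localSize;
}
```

Using the local size itself in the rounding keeps the dispatch and the shader's local_size_x in sync, which is exactly what the stripes suggest was broken.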
I think I got it working. Work count X just wasn't related to the local size for BinCombined. The only remaining problem is ClearCoarseBinMask, but I think changing the local size to 48 and dividing TilesPerLine*TileLines by 48 should fix that.
On the Pokemon BW2 Driftveil Bridge I do indeed get corruption, at 4x, 8x and 16x; any other resolution works perfectly.
I tested it again, with FireNX70@6f8ce9f.
Resolutions higher than 6x are very prone to displaying garbage. 6x will display garbage in certain spots, although Zelda Spirit Tracks triggers this pretty easily. 5x seems fine. Tested on a 1080Ti + Win10. Tested with a72b79a, so 5df83c9 doesn't fix it.
Some examples
Pokemon BW Driftveil Bridge 6x
Pokemon Platinum 5x
Pokemon Platinum 6x
Pokemon Platinum 6x
Pokemon Platinum 7x
Zelda Phantom Hourglass 6x
Zelda Phantom Hourglass 7x
Zelda Spirit Tracks 5x
Zelda Spirit Tracks 6x