Compute renderer displays garbage at high IRs #2047
Sorry, I have an Intel iGPU, I can't debug these things. They probably just need even more work tiles, or some other index starts overflowing. I need to come up with a smarter way of allocating the work tile amount, or at least give an option to override it.
I tried doubling MaxWorkTiles and it didn't fix it. Increasing the size of FinalTileMemory didn't do anything either.
no issues on my gtx 1660 gpu at 16x ir on win 11 with driver version v552.44
I've tested both updating the driver and using a T1000. Neither fixed this. So it's not the driver version or the architecture. I really don't think the OS would make a difference, especially if it's Win10 vs Win11. Were you really testing with the compute renderer?
opengl
There are two OpenGL renderers. Are you testing with a dev version, or are you using 0.9.5?
0.9.5
FinalTileMemory isn't dynamically allocated, so it cannot overflow. Not really sure what the issue could be here. It definitely seems to be tile related, and the index should not overflow (yet).
Can't seem to reproduce this on an AMD GPU (Steam Deck). Will try on other stuff later.
@takeshineale128 |
AvalonCode.mp4
Here's a video example, hopefully it helps.
@nadiaholmquist I've tested with a Vega 8 (3500U) on Win10 with an old driver from 2023 and it works fine even at 16x, although attempting to switch resolutions while the game was running completely froze the renderer. I've also tested with LLVMpipe in a VM and it also works fine even at 16x. It looks like it's a bug specific to Nvidia. I'll test Nvidia on Linux with the proprietary driver, I expect that it will behave just like it does in Win10. Edit: Nvidia on Linux with the proprietary driver gives the exact same result as on Win10.
Oh no, there still might be undefined behaviour somewhere in the shaders. I'm going to run the shaders with Mesa and a debug context eventually to see whether it warns about something.
Blue Dragon - Awakened Shadow |
I decided to downgrade my GPU driver (Nvidia RTX 3080) to the oldest supported one (version 456.71) to see if it might be a 'regression'.
Edit: Here's a gist with all the testing I've been doing: https://gist.github.com/FireNX70/10c72169ea0105f998bab6f51443d42e
Thank you for debugging that. Regarding the barriers, that is obviously a mistake. GL_SHADER_STORAGE_BUFFER is defined as 0x90D2, which should result in a GL error; but even if the driver accepts it and only uses the valid bits, most GPUs don't have barriers that fine-grained anyway, so there's probably at least one bit set which will insert a correct barrier. Huh, I always assumed you could use as many work groups as fit inside a 32-bit int (as the count is stored in one). The reason it didn't fix it could be multiple things. If you don't change any of the bindings, you will end up just doing the same work multiple times (possibly at the same time). I need to look into what parallel-rdp does again, maybe it has a solution to this. I simplified some things, e.g. we always process all polygons at once, because it is limited to a fixed amount on the DS. One possible easy out would probably be to just increase the tile size to 16x16 at some point. That would also solve the possible issue of the tile buffer index overrunning.
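For what it's worth, comparing the two enum values involved (taken from the standard GL headers) shows the mistaken constant doesn't even overlap the bit the barrier call actually needs, so a driver that silently masks off invalid bits would still not insert the SSBO barrier. A minimal check:

```cpp
#include <cassert>

// Enum values as defined in the standard OpenGL headers (GL/glcorearb.h).
// GL_SHADER_STORAGE_BUFFER is a buffer *binding target*; the correct
// argument for glMemoryBarrier() is GL_SHADER_STORAGE_BARRIER_BIT.
constexpr unsigned kShaderStorageBuffer     = 0x90D2; // GL_SHADER_STORAGE_BUFFER
constexpr unsigned kShaderStorageBarrierBit = 0x2000; // GL_SHADER_STORAGE_BARRIER_BIT

// If a driver only keeps the valid bits of 0x90D2, the one bit this
// barrier actually needed (the SSBO bit) is not among the ones set.
constexpr bool ssboBitPresent =
    (kShaderStorageBuffer & kShaderStorageBarrierBit) != 0;

static_assert(!ssboBitPresent, "0x90D2 does not contain the SSBO barrier bit");
```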
No problem, once this is fixed I can fully switch to MelonDS.
There's a minimum required limit of 65535. Implementations are free to establish higher limits. I haven't checked what limits Mesa and Radeon report, Nvidia has a higher limit on the X dimension but sticks to 65535 for the Y and Z dims.
I'm not well versed in graphics or OpenGL, certainly not enough to understand what the compute renderer is doing. Having said that, it looks to me as though the right thing to do would be to cap VariantWorkCount's Z component to the respective GL_MAX_COMPUTE_WORK_GROUP_COUNT and do multiple passes per variant if needed. Easy to say, probably a pain in the ass to implement. I don't think larger tiles would guarantee the work group count wouldn't go above the limit at higher internal resolutions.
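The "multiple passes per variant" idea could be sketched roughly like this (the function name and shape are hypothetical, not actual melonDS code): split a group count that exceeds the per-axis limit into several dispatches, each at or below GL_MAX_COMPUTE_WORK_GROUP_COUNT.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: split a total work-group count into chunks no
// larger than the driver's GL_MAX_COMPUTE_WORK_GROUP_COUNT limit.
// Each returned entry would become one glDispatchCompute call.
std::vector<uint32_t> SplitDispatch(uint32_t totalGroups, uint32_t maxPerDispatch)
{
    std::vector<uint32_t> passes;
    while (totalGroups > 0)
    {
        uint32_t n = totalGroups < maxPerDispatch ? totalGroups : maxPerDispatch;
        passes.push_back(n);
        totalGroups -= n;
    }
    return passes;
}
```

With the spec-minimum limit of 65535, a variant needing 150000 groups would become three passes of 65535, 65535 and 18930 groups.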
The problem is that the amount of work done per variant is not known to the CPU at the time it issues the indirect compute calls (that's the reason they are indirect in the first place). Letting the GPU do the binning, then waiting on that and then dispatching as many compute jobs as necessary would work, but it would just waste CPU time on waiting. After sleeping on this, there is technically another solution: use the other two axes as well, and then combine all the axes into one index in the shader. I quite like the idea of increasing the tile size at very high resolutions though. The increased tile size results in a lower amount of tiles still filling the same area; as long as the total amount of tiles is less than 2^16, no single variant can result in a larger compute job than that. What of course remains a potential issue is that the tile buffer can overflow, as it has a fixed size. It would be possible to conservatively estimate the size necessary each frame by using the bounding boxes of each polygon, but with some polygon shapes that might result in excessive allocations. The two solutions don't exclude each other, I guess I can also just implement them both.
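Combining the axes back into one index is plain arithmetic; in the shader it would use gl_WorkGroupID, but the math can be shown in C++ just as well (names here are illustrative, and the per-axis counts are whatever the CPU side picks, each within the 65535 minimum limit):

```cpp
#include <cstdint>

// Sketch of spreading one large dispatch across all three work-group
// axes and recombining them into a single linear index in the shader.
// Equivalent GLSL would be:
//   index = gl_WorkGroupID.z * (groupsX * groupsY)
//         + gl_WorkGroupID.y * groupsX
//         + gl_WorkGroupID.x;
constexpr uint32_t FlattenWorkGroupID(uint32_t idX, uint32_t idY, uint32_t idZ,
                                      uint32_t groupsX, uint32_t groupsY)
{
    return idZ * groupsX * groupsY + idY * groupsX + idX;
}
```

Every work group then gets a unique linear index even though no single axis exceeds its limit.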
Scaling the tile size with the internal resolution sounds good. Do keep in mind there's a limit on the local size too, although the minimum of 1024 should be fine (1024x1024 for a single tile would be huge). As for the tile buffer: IMO it's better to be safe than sorry, especially if it doesn't overestimate by a lot. Edit: 11x (almost 4K on a single screen) uses about 2650 MB of VRAM right now. It's not nothing, but there's some margin for the tile buffer to grow.
I've been messing around with scaled tiles. 16x16 tiles work fine, 32x32 tiles (the max size, since the tile size is used for the X and Y local sizes in certain shaders and I'm sticking to the default GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS of 1024) presents a couple of problems.
Just in case, here's what I've done: https://github.com/FireNX70/melonDS/tree/try-fixing-the-compute-renderer
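A quick illustration of why 32x32 is the ceiling here, assuming the spec-guaranteed minimum of 1024 for GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS mentioned above:

```cpp
// The tile size doubles as local_size_x and local_size_y in some of
// the shaders, so tileSize * tileSize is bounded by
// GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS. Sticking to the minimum the
// spec guarantees (1024) makes 32x32 the largest portable tile.
constexpr int kMaxInvocations = 1024; // spec-required minimum

constexpr bool TileSizeFits(int tileSize)
{
    return tileSize * tileSize <= kMaxInvocations;
}
```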
Yeah, those are the stripes I was talking about. Starting at 10x it uses 32x32 tiles. |
I think what's happening with the stripes is that BinCombined's local_size_x must match CoarseTileCountX * CoarseTileCountY. A quick hack to do that (and changing the associated glDispatchCompute call to add 47 and divide by 48 instead of adding 31 and dividing by 32) gets rid of the stripes, but causes a bunch of geometry to fail to render. I've also found (through testing with 16x16 tiles) that the scale at which garbage may start being displayed would be 3.5x with 8x8 tiles (7x with 16x16 tiles in practice). The worst case scenario seems to be MKDS' desert track. If the fire balls (which are only there on the last 2 laps) get extremely close to the camera, the scale is high enough and the tile size is small enough, you'll get garbage. Driving through them seems to be the best way to trigger this. The next worst scenario, Pokemon BW's Driftveil Bridge, only starts breaking at 5x (8x8) and is fine at 4x (8x8), 7x (16x16) and 8x (16x16).
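For clarity, the dispatch-size tweak described above is the usual ceiling division, with the local size as a parameter (48 standing in for CoarseTileCountX * CoarseTileCountY in this sketch):

```cpp
// "Add 47 and divide by 48" is GroupCount(n, 48): a ceiling division
// that rounds the dispatch up so every element is covered by some
// work group of the given local size.
constexpr unsigned GroupCount(unsigned items, unsigned localSize)
{
    return (items + localSize - 1) / localSize;
}
```

Using the local size itself in the rounding keeps the dispatch and the shader's local_size_x in sync, which is exactly what the stripes suggest was broken.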
I think I got it working. Work count X just wasn't related to the local size for BinCombined. The only remaining problem is ClearCoarseBinMask, but I think changing the local size to 48 and dividing TilesPerLine*TileLines by 48 should fix that.
On the Pokemon BW2 Driftveil Bridge I do indeed get corruption, at 4x, 8x and 16x; any other resolution works perfectly.
I tested it again, with FireNX70@6f8ce9f.
Resolutions higher than 6x are very prone to displaying garbage. 6x will display garbage in certain spots, although Zelda Spirit Tracks triggers this pretty easily. 5x seems fine. Tested on a 1080Ti + Win10. Tested with a72b79a, so 5df83c9 doesn't fix it.
Some examples
Pokemon BW Driftveil Bridge 6x
Pokemon Platinum 5x
Pokemon Platinum 6x
Pokemon Platinum 6x
Pokemon Platinum 7x
Zelda Phantom Hourglass 6x
Zelda Phantom Hourglass 7x
Zelda Spirit Tracks 5x
Zelda Spirit Tracks 6x