This repository has been archived by the owner on Jul 10, 2023. It is now read-only.

Multipass render to texture with options - implemented #1460

Merged
merged 26 commits into from
Feb 19, 2019

Conversation

mar753
Contributor

@mar753 mar753 commented Nov 11, 2018

I was a huge fan of THPS2, so many memories ;), but when I saw these artifacts on my Android phone:
[screenshot: rendering artifacts in THPS2]
I decided that they must be fixed!
The current version of reicast supports GLES 2.0. A plain glReadPixels() call is time-consuming and sometimes prevents smooth 60 fps gameplay. I even upgraded the reicast application to GLES 3.0 and tried using a single PBO, but the performance increase was not noticeable, at least on my device with a Qualcomm Snapdragon 625. (On GLES 2.0 we can use eglCreateImageKHR() as well. Unfortunately GraphicBuffer is not available in the NDK without compiling against Android native code; a workaround is to load the libui.so library dynamically and get GraphicBuffer from there, see https://github.com/fuyufjh/GraphicBuffer.)

I know that 'render to texture' support was disabled because of performance issues, so I have prepared several options, which I hope will be a good solution. There is a new menu select (in the Experimental section) called 'Render to texture', with the following options:

  • 'Disabled - skip frames' - this option skips RTT frames, exactly as it works now (image above)
  • 'Zeros' - this option writes zeros to the texture (for all RGBA channels) that will be used as a shadow or anything else
    [screenshot: 'Zeros' option]
  • 'Ones' - this option writes ones to the texture (for all RGBA channels) that will be used as a shadow or anything else
    [screenshot: 'Ones' option]
  • 'Shadow circle' - (default) this option creates a circle (a disk, in fact) in a 128x128 buffer using sin/cos functions and writes this data to the texture, which is then used as a nice pseudo-shadow
    [screenshot: 'Shadow circle' option]
  • 'Full' - this option implements full 'render to texture' support with two-pass rendering (first: RTT, second: scene)
    [screenshot: 'Full' option]
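
A minimal sketch of how a 'Shadow circle' buffer could be filled, not the code from this PR. The name makeShadowDisk, the one-byte-per-pixel layout, the default radius and the row-span fill are assumptions for illustration; the description above only states that a disk is drawn into a 128x128 buffer using sin/cos.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns a size*size buffer holding `inside` for
// pixels within the disk and 0 elsewhere. Assumes size > 0.
std::vector<uint8_t> makeShadowDisk(int size = 128, int radius = 48, uint8_t inside = 0xFF)
{
    std::vector<uint8_t> buf(static_cast<std::size_t>(size) * size, 0);
    const int c = size / 2;             // disk centre
    radius = std::min(radius, c - 1);   // keep the disk inside the buffer
    if (radius <= 0)
        return buf;

    const double pi = 3.14159265358979323846;
    const int steps = 4 * radius;       // dense enough that every row is visited
    for (int s = 0; s <= steps; s++)
    {
        // Sweep the angle from -pi/2 (top row) to +pi/2 (bottom row).
        const double a = -pi / 2 + pi * s / steps;
        const int y = c + static_cast<int>(std::lround(radius * std::sin(a)));
        const int halfWidth = static_cast<int>(std::lround(radius * std::cos(a)));
        // Fill the horizontal span of the disk on this row.
        for (int x = c - halfWidth; x <= c + halfWidth; x++)
            buf[static_cast<std::size_t>(y) * size + x] = inside;
    }
    return buf;
}
```

Filling whole row spans avoids the holes that plotting individual polar samples can leave; the resulting buffer would then be uploaded as the RTT texture.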

In terms of performance:
With 'Disabled - skip frames', performance is exactly the same as in the current version (it uses exactly the same code). There are also no noticeable performance differences in the other modes, except 'Full', which is sometimes much slower (drawing is disabled in every mode except 'Full'). Limiting the draw distance might help here.
When using 'Full', only two-pass rendering is available at the moment. It would be nice to implement multipass RTT as well.

Feel free to fetch this branch and test!

Cheers,
Marcel

@CLAassistant

CLAassistant commented Nov 11, 2018

CLA assistant check
All committers have signed the CLA.

@mar753 mar753 changed the title Mar753/render to texture with options Render to texture with options - implemented Nov 11, 2018
@mar753 mar753 force-pushed the mar753/render-to-texture-with-options branch from 0f6f94c to 4d39c10 Compare November 11, 2018 22:22
@skmp
Owner

skmp commented Nov 11, 2018

This pull request introduces 2 alerts when merging 4d39c10 into 19ca0ed - view on LGTM.com

new alerts:

  • 1 for Multiplication result converted to larger type
  • 1 for Short global name

Comment posted by LGTM.com

@baka0815 baka0815 requested a review from flyinghead November 12, 2018 06:45
Contributor

@flyinghead flyinghead left a comment

@mar753 Thank you for this proposal and for your work.

I'm afraid this PR cannot be merged as is. Let me explain why.

Render-to-texture is currently disabled in the master branch, presumably because of performance issues and/or an incomplete implementation.
If enabled, render-to-texture will render the current scene to an OpenGL texture, stored only on the GPU, while keeping track of the destination address of said texture in VRAM (fb_rtt.TexAddr). When the next scene references a texture at this address, the renderer will simply return the previously generated RTT texture.

This fixes 99% of the games using RTT and is efficient since the texture data buffer doesn't have to be transferred back to VRAM from the GPU memory. So my first complaint about this PR is that it doesn't seem to handle this very general case. I'd rather have RTT working for most games first, and only then work on the exceptions.

Now, this system works for most games, but it doesn't work with THPS and THPS2 (and maybe a few others). Why? Because THPS and THPS2 use render-to-texture in a "special" way: they first render the skater's shadow into the red channel of an RGB 565 texture. Then, when using this texture in a subsequent scene, they load it as an ARGB 4444 texture, thereby using the red channel of the rendered texture as the alpha channel of the new texture (and losing one bit of precision along the way).
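
To make the bit-level effect concrete, here is a small sketch assuming the usual 16-bit layouts (RGB565 with red in the top 5 bits, ARGB4444 with alpha in the top 4 bits). The helper names packRGB565 and alphaAsARGB4444 are made up for this illustration and do not come from reicast.

```cpp
#include <cstdint>

// Pack 5/6/5-bit channel values into an RGB565 word (red in bits 15..11).
constexpr uint16_t packRGB565(unsigned r5, unsigned g6, unsigned b5)
{
    return static_cast<uint16_t>(((r5 & 0x1F) << 11) | ((g6 & 0x3F) << 5) | (b5 & 0x1F));
}

// Alpha nibble obtained when the same 16 bits are read as ARGB4444
// (alpha in bits 15..12): the upper 4 bits of the 5-bit red value.
constexpr unsigned alphaAsARGB4444(uint16_t word)
{
    return (word >> 12) & 0xF;
}
```

For example, a shadow rendered with red = 0x15 (binary 10101) is later seen with alpha = 0xA (binary 1010): the lowest red bit falls into the ARGB4444 red nibble, which is the lost bit of precision mentioned above.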

There's no easy way to do this in OpenGL, and detecting this behaviour is quite difficult. The solution, in this particular case, is to copy the rendered texture buffer back into VRAM and let the normal texture cache load it when needed. There is no need to manually hack the texture data as long as it's properly copied into the destination VRAM location with the requested bitpack format. This is how the actual Dreamcast hardware works, so doing it this way should take care of any "exotic" use of textures such as the THPS[2] case.

If I'm not mistaken, this is not what this PR is doing. Instead the channel switching between red and alpha is hardcoded and thus can only work for this particular game.

You might want to have a look at my branch (fh/mymaster) where proper render-to-texture is implemented including reading back the texture buffer data into VRAM when a particular option is enabled (RenderToTextureBuffer, which is enabled by default for THPS and THPS2).

@mar753
Contributor Author

mar753 commented Nov 15, 2018

Thank you for your feedback @flyinghead!

Regarding my implementation, I did some debugging beforehand and yes, I know that the move from the red channel to alpha without the LSB (RGB565 -> 'ARGB4444') needed by THPS2 happens automatically when we read pixels directly to the vram[fb_rtt.TexAddr<<3] location without any swapping, but it must be done within the same RTT frame rendering cycle (like in your code). I have even determined that fb_rtt.TexAddr<<3 for the shadow in THPS2 equals 0x4D0180, at least on my Android device; this can be used for debugging purposes.

In my implementation I have moved glReadPixels() to the next frame rendering cycle on purpose (after the RTT frame rendering cycle and the preparation of the next, scene frame). By doing this I was able to significantly increase performance, and thus the framerate (tested on a Qualcomm Snapdragon 625), with fully enabled single-pass RTT (first: RTT frame, next: scene frame, then: RTT frame, etc.). Performance is better because while the CPU prepares the next frame, the RTT rendering/drawing from the previous frame executes in parallel. In contrast, calling glReadPixels() within the RTT frame rendering cycle causes a stall on glReadPixels() until rendering/drawing is finished.
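
The deferral idea can be sketched without any GL calls. This is only an illustration under assumptions: the class DeferredReadback and its methods are hypothetical, and the callback stands in for whatever code would call glReadPixels() and copy the pixels to VRAM.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical bookkeeping for one pending RTT readback.
class DeferredReadback
{
public:
    // End of an RTT frame: the draw calls are already queued on the GPU,
    // so only remember what has to be read back later.
    void schedule(uint32_t texAddr, std::function<void()> readPixels)
    {
        flush();                      // assumption: never drop an older readback
        texAddr_ = texAddr;
        pending_ = std::move(readPixels);
    }

    // Next frame, after the CPU-side preparation: the GPU has had time to
    // finish, so the blocking read should stall for less time.
    bool flush()
    {
        if (!pending_)
            return false;
        std::function<void()> job = std::move(pending_);
        pending_ = nullptr;
        job();
        return true;
    }

    bool hasPending() const { return static_cast<bool>(pending_); }
    uint32_t pendingAddr() const { return texAddr_; }

private:
    uint32_t texAddr_ = 0;
    std::function<void()> pending_;
};
```

The trade-off is the one described above: the read no longer happens while vram[fb_rtt.TexAddr<<3] is still the natural destination, so extra state has to be carried into the next frame.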

Unfortunately this approach forces me to do the color channel move (red->alpha) manually, because when the next frame (scene drawing) is being processed, the vram[fb_rtt.TexAddr<<3] location is no longer valid for the RTT frame: it now belongs to the current frame, so I cannot write to it. But... I am doing:

for (u32 i = 0; i < w * h; i++)
{
	// ARGB4444 -> RGBA4444: move the top nibble (alpha) to the bottom
	buf[i] = ((*dataPointer & 0xF000) >> 12) | ((*dataPointer & 0x0FFF) << 4);
	dataPointer++;
}
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0, GL_RGBA, GL_UNSIGNED_SHORT_4_4_4_4, buf);

which is exactly what TextureCacheData::Update() in gltex.cpp would do:

if (texID) {
	//upload to OpenGL !
	glBindTexture(GL_TEXTURE_2D, texID);
	GLuint comps=textype==GL_UNSIGNED_SHORT_5_6_5?GL_RGB:GL_RGBA;
	glTexImage2D(GL_TEXTURE_2D, 0,comps , w, h, 0, comps, textype, temp_tex_buffer);
	if (tcw.MipMapped && settings.rend.UseMipmaps)
		glGenerateMipmap(GL_TEXTURE_2D);
}

A little higher up in this function is the code responsible for the red->alpha channel swap:

if(texconv!=0)
{
	texconv(&pbt,(u8*)&vram[sa],stride,h);
}

What is more, gl_GetTexture() begins with:

if (tcw.TexAddr==fb_rtt.TexAddr && fb_rtt.tex)
{
	return fb_rtt.tex;
}

and it will always return here in my solution (when using RTT), because fb_rtt.tex will always be defined, so TextureCacheData::Update() will not be executed each time, which would be an unnecessary waste of time.

What is more, I have limited the glDelete* calls in BindRTT(). The performance gain is hardly noticeable, but it may matter under some specific conditions.

But you are right about the hardcoded red->alpha swap/move in my code: when the texture we are rendering to is RGB565, I always do the swap and convert the texture to RGBA4444, which can cause artifacts. I have tested several games and the problem was not visible, probably because they used other formats for RTT rendering, like RGBA5551 or RGBA4444, from the beginning. Anyway, the artifacts can only be visible when RTT is in use, so changing the 'Render to texture' select to 'Zeros' in my implementation will prevent any artifacts: in the RGB565 case the rendered texture will be transparent, so it will not be visible anyway, if I am thinking correctly. I can make this option the default.

Do you know of any games that use RGB565 for RTT rendering in the 'normal' way that I can test (maybe some games with mirrors)?

@flyinghead
Contributor

Crazy Taxi uses RTT for the pause menu in game.
Many games use RTT for crossfade transitions. See the opening demos of Skies of Arcadia and Virtua Tennis for example.
Rez makes heavy use of RTT for graphic effects and level transitions.

@skmp
Owner

skmp commented Dec 30, 2018

This pull request introduces 3 alerts when merging 302bfcd into bbc54e4 - view on LGTM.com

new alerts:

  • 2 for Local variable hides global variable
  • 1 for Multiplication result converted to larger type

Comment posted by LGTM.com

@mar753
Contributor Author

mar753 commented Jan 1, 2019

Render to texture (RTT) with multipass rendering support has been implemented.
While an RTT frame is processed, drawing is initiated, but glReadPixels() is invoked in the nearest following non-RTT frame, where the RTT texture is needed for screen rendering. This way the GPU and CPU can work simultaneously and stalls are minimized, which improves performance significantly. It is recommended to enable 'Synchronous rendering' as well: I saw some issues with level transitions in Rez without this option enabled.

A fully OpenGL ES 2.0 compatible solution was implemented for reading pixels from a framebuffer object (FBO).
For example, when the texture (or color renderbuffer) bound to the FBO has the RGBA5551 format/type, a GL_IMPLEMENTATION_COLOR_READ_TYPE query is executed just before glReadPixels() to find out which image type the GPU supports. If GL_UNSIGNED_SHORT_5_5_5_1 is returned, we proceed the standard way; if anything else is returned, the GL_RGBA format and GL_UNSIGNED_BYTE type are used to read the pixels, since OpenGL ES 2.0 compatible hardware must always support this combination. We then need to convert the result back to 16-bit color values. This fixes an issue with the ARM Mali-450 MP GPU, where (as far as I saw) GL_RGB/GL_UNSIGNED_SHORT_5_6_5 is the only 16-bit format/type supported for reading back from a framebuffer object.
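
The CPU-side conversion in the fallback path could look roughly like the sketch below. It assumes the GL_UNSIGNED_SHORT_5_5_5_1 bit layout (RRRRRGGGGGBBBBBA), truncation of the low color bits and an alpha cut-off at 128; the function names are hypothetical and the PR's actual conversion may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One RGBA8888 pixel -> 16-bit RGBA5551 (R in bits 15..11, A in bit 0).
inline uint16_t toRGBA5551(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 3) << 6) | ((b >> 3) << 1) | (a >> 7));
}

// Converts a tightly packed RGBA8888 buffer, as a GL_RGBA/GL_UNSIGNED_BYTE
// read would produce, into 16-bit pixels.
std::vector<uint16_t> convertRGBA8888ToRGBA5551(const std::vector<uint8_t>& rgba)
{
    std::vector<uint16_t> out(rgba.size() / 4);
    for (std::size_t i = 0; i < out.size(); i++)
        out[i] = toRGBA5551(rgba[4 * i], rgba[4 * i + 1], rgba[4 * i + 2], rgba[4 * i + 3]);
    return out;
}
```

In the real code the choice between this path and a direct 16-bit read would depend on the result of the GL_IMPLEMENTATION_COLOR_READ_TYPE query described above.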

There is an issue on the Adreno 506 with the RGBA5551 format/type: I observed a huge performance drop when using it, even though the GPU reports it as supported (GL_IMPLEMENTATION_COLOR_READ_TYPE query). The format/type is supported, but there is probably some expensive internal format conversion which makes it unusable, at least here. The issue is very clearly visible in Rez. Using a framebuffer object with a GL_UNSIGNED_BYTE texture, and then reading it with glReadPixels() using the same format/type combination, seems to work around the issue. It is strange that 32-bit color rendering is faster than 16-bit RGBA5551 rendering, but that is the case here. I have not altered the TextureCacheData struct (gltex.cpp) or the TexCache.h/cpp files to support 32-bit colors, because that would be a lot of changes just for this fix and because we do not need 'texconv' in this case: the 32-bit image is written directly to the texture after the read (bypassing TextureCacheData's Update() method).

To achieve this blur effect in Rez:

[screenshot: blur effect in Rez]

blending is used in gldraw.cpp:
glBlendFunc(SrcBlendGL[gp->tsp.SrcInstr],DstBlendGL[gp->tsp.DstInstr]);
with both parameters set to GL_ONE (additive blending).
For this to work properly with different image formats/types, I needed to modify the BindRTT() function to create the texture/renderbuffer/framebuffer and initialize it only once, during initialization for a given texture address.
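
For reference, the arithmetic behind GL_ONE/GL_ONE blending on an 8-bit channel can be sketched as a saturating add. This assumes a normalized fixed-point color buffer, where results are clamped; blendAdditive is a name made up for the illustration.

```cpp
#include <algorithm>
#include <cstdint>

// GL_ONE, GL_ONE: result = src * 1 + dst * 1, clamped to the channel range.
inline uint8_t blendAdditive(uint8_t src, uint8_t dst)
{
    return static_cast<uint8_t>(std::min(255, static_cast<int>(src) + static_cast<int>(dst)));
}
```

Because every pass adds onto what is already in the render target, the target's contents must survive between passes, which is consistent with creating it only once per texture address.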

I have added RTT swizzle texture support too. It is used in Virtua Tennis during the logo and demo video transitions to achieve a texture width of 640 pixels.

As far as I saw in the source code, there is no standard code convention and a lot of the C++ code is mixed with C code, but I have tried to keep a similar convention and limit the refactoring.

The default option for 'Render to Texture' is set to 'Zeros'. The 'Full' option must be selected to enable multipass RTT.

The solution was tested with several games that use RTT (including Rez, Virtua Tennis, THPS2 and Crazy Taxi) and is working fine.

Testing platforms:

  • Huawei P8 lite, Android 6 (ARM Mali-450 MP)
  • Xiaomi Redmi Note 4 Snapdragon, Android 7 (Adreno 506)
  • Xiaomi Pocophone F1, Android 9 (Adreno 630)

Feel free to test.

@mar753 mar753 changed the title Render to texture with options - implemented Multipass render to texture with options - implemented Jan 7, 2019
@skmp
Owner

skmp commented Jan 7, 2019

This pull request introduces 3 alerts when merging 2aadb3c into 6936bc2 - view on LGTM.com

new alerts:

  • 2 for Local variable hides global variable
  • 1 for Multiplication result converted to larger type

Comment posted by LGTM.com

@mar753
Contributor Author

mar753 commented Jan 7, 2019

I have added stencil support as well (if the "OES_packed_depth_stencil" extension is available). This fixes the dimmed screen issue in Crazy Taxi.

If you would like to test this solution, please fetch my branch: https://github.com/mar753/reicast-emulator/tree/multisample_rtt_implementation
because current master doesn't work on my Xiaomi devices (it works only on the P8 lite): when I try to load a game, I end up in the BIOS settings.

@flyinghead Could you take a look at this PR?

@baka0815
Contributor

@flyinghead were all your points worked on or is there still something left from your pov?

@skmp
Owner

skmp commented Jan 11, 2019

This pull request introduces 3 alerts when merging 10d6f25 into 3c57177 - view on LGTM.com

new alerts:

  • 2 for Local variable hides global variable
  • 1 for Multiplication result converted to larger type

Comment posted by LGTM.com

@mar753
Contributor Author

mar753 commented Jan 14, 2019

@flyinghead
OK, the requested feature has been added. Render to texture for RGB565 (except for cases like THPS2) and ARGB4444 will reuse the rendered texture (on the GPU) without calling glReadPixels(), which should improve performance.

@mar753
Contributor Author

mar753 commented Jan 19, 2019

As I have checked, FBOs are not supported in OpenGL 2.x without the "GL_EXT_framebuffer_object" extension, so this last commit is not needed, because at least OpenGL 3.0 is required anyway.

@mar753
Contributor Author

mar753 commented Jan 21, 2019

Just to clarify: I meant desktop OpenGL 2.x (NOT OpenGL ES 2.x).

@baka0815
Contributor

@flyinghead are all your remarks addressed?

Since the default for RTT is 0, the behaviour is the same as before, so I have no problem merging this one.
I tested THPS and the results are as described in the PR.
I also tested Crazy Taxi and Shenmue and didn't notice any regressions (however I don't know if they use RTT).

@flyinghead
Contributor

Sorry I haven't had time to dedicate to reviewing this. I assume the changes answered my concerns.

Interestingly, you mention that fb_alpha_threshold and kval_bit could change between the moment the texture is rendered and the moment it is used (I must say I overlooked this case). Have you noticed this happening in some games?
I don't have the answer, but if it never happens, this would eliminate a round-trip to the GPU in all common cases (except for THPS2, obviously).

Also, I don't think the Circle, All Ones and All Zeros options are needed. We need three options:

  • Disabled
  • Fast (cache rendered textures on the GPU)
  • Full (read back pixels from GPU to VRAM, then back. Needed for THPS2 and such)

The Fast option assumes that fb_alpha_threshold and kval_bit usually don't change and that all texture types can be cached on the GPU.

@mar753
Contributor Author

mar753 commented Jan 25, 2019

Actually I would prefer to stick with my current solution; I will explain why.
First of all, the 'Disabled' option just keeps the behaviour that is currently in the master branch (for some rare situations). 'Zeros' is a must for mobile platforms to avoid artifacts like those in the first screenshot in this PR (probably not an issue in the desktop version); this option has no impact on performance, just like 'Ones'.
Both 'Zeros' and 'Ones' were useful for debugging games which use RTT. It is possible that in some situations 'Ones' can be useful to avoid artifacts or just to get past some points in a game.
The Dreamcast/PowerVR specification says that fb_alpha_threshold and kval_bit must be taken into account when working with the ARGB1555 and KRGB1555 texture formats, respectively. I am not sure whether any games actually make use of this (maybe Rez does not, but I cannot be sure, because I am not able to test every scenario), but I am not in favor of a solution that is not entirely correct. For me correctness is the most important thing; only then can we work on performance improvements. Anyway, rather few games use the A(K)RGB1555 format, and performance will not be spectacularly better if we keep the texture on the GPU. Another argument against the 'Fast' option is that it will not fix the THPS2 case. For this case we have the 'Shadow circle' option, which was quite handy during debugging as well and, most importantly, has no (or minimal) impact on performance. So we can play THPS2 or similar games on older/slower mobile devices without any (or with a minimal) performance drop, and this is surely a better option than no shadow at all ('Zeros').
These options can be available, but no one forces us to use them.
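
To illustrate why these two register values matter, here is a hedged sketch of packing an 8-bit-per-channel pixel into the 1555 formats. It assumes the alpha bit is set when the pixel's alpha is greater than or equal to fb_alpha_threshold and that the K bit is taken from bit 7 of fb_kval; both helper names are hypothetical and the exact hardware rules should be checked against the PowerVR documentation.

```cpp
#include <cstdint>

// ARGB1555: the single alpha bit comes from comparing the 8-bit alpha
// against fb_alpha_threshold (assumed rule: alpha >= threshold -> 1).
inline uint16_t packARGB1555(uint8_t r, uint8_t g, uint8_t b, uint8_t a, uint8_t fbAlphaThreshold)
{
    const unsigned alphaBit = (a >= fbAlphaThreshold) ? 1u : 0u;
    return static_cast<uint16_t>((alphaBit << 15) | ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}

// KRGB1555: the top bit ignores the pixel's alpha and is taken from
// bit 7 of fb_kval instead.
inline uint16_t packKRGB1555(uint8_t r, uint8_t g, uint8_t b, uint8_t fbKval)
{
    const unsigned kBit = (fbKval >> 7) & 1u;
    return static_cast<uint16_t>((kBit << 15) | ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}
```

Since the result depends on register values at the time the frame is written, a texture cached on the GPU cannot reproduce it without a CPU pass, which is the correctness concern raised above.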

@mar753 mar753 force-pushed the mar753/render-to-texture-with-options branch from cb3b0e7 to 88b9deb Compare January 31, 2019 20:09
@mar753
Contributor Author

mar753 commented Feb 8, 2019

Reicast does not handle odd resolutions. We can see the effect on the Xiaomi POCOPHONE F1, whose longer screen dimension is 2159 pixels. There is a narrow line visible (Rayman 2):

[screenshot: narrow line artifact in Rayman 2]

I have fixed that in the "Handle odd screen resolution (POCOPHONE fix)" commit.

I have also added dynamic resolution recalculation when the screen resolution changes during emulation. This fix was needed when lowered rendering resolution is active.

It's ready to be reviewed.

@mar753
Contributor Author

mar753 commented Feb 11, 2019

Added some minor fixes (e.g. to the odd screen resolution handling).

Vertical scaling (SCALER_CTL) is now handled. This fixes the issue on the Crazy Taxi pause screen, where the background is not properly aligned (see the red arrow):

[two screenshots: Crazy Taxi pause screen]

@mar753
Contributor Author

mar753 commented Feb 11, 2019

Moreover, as I have compared on Android, native reicast (this repository) with my RTT solution from this PR is faster than the RetroArch reicast core, even though the resolution is higher (tested with the violet level of 'Rez').

@baka0815 @skmp @dmiller423 This pull request is complete and ready to be merged.

Owner

@skmp skmp left a comment

@mar753 did a really quick scroll through the diff and added some comments.

In future PRs, would be great to

  • avoid nonfunctional whitespace changes to make reviewing easier. Trying to understand what was actually changed in gl_tex.cpp was a bit of a challenge.
  • Split changes that are not directly related to the main PR purpose into separate PRs (eg odd-sized-rendering workaround)

Here's a couple more questions

  • Are textures always read back to the host ram and then re-uploaded?
  • How does this compare with @flyinghead RTT changes / fh/mymaster? There's quite some divergence from master.

I think it's up to @flyinghead to have final say on this. If it doesn't upset the code too much otherwise, I'd merge it.

Review comments (resolved): core/rend/gles/gles.cpp (1, outdated), core/rend/gles/gltex.cpp (3)
@mar753
Contributor Author

mar753 commented Feb 11, 2019

@skmp Find my answers below

@mar753 did a really quick scroll through the diff and added some comments.

In future PRs, would be great to

  • avoid nonfunctional whitespace changes to make reviewing easier. Trying to understand what was actually changed in gl_tex.cpp was a bit of a challenge.

You are right. I prefer spaces to tabs (which look strange on the GitHub page) in source files, but to be consistent with the rest of the files I have used tabs for indentation.

  • Split changes that are not directly related to the main PR purpose into separate PRs (eg odd-sized-rendering workaround)

I agree to do this in future PRs; I just thought that changes like the odd screen resolution fix were so small that they did not need separate PRs.

Here's a couple more questions

  • Are textures always read back to the host ram and then re-uploaded?
  • How does this compare with @flyinghead RTT changes / fh/mymaster? There's quite some divergence from master.

I will briefly describe how it works.
In the 'fh/mymaster' solution, when an RTT frame is processed, rendering and glReadPixels() are invoked within the same frame processing cycle (one after the other). As we know, glReadPixels() is a blocking operation, so the CPU stalls on this call until rendering has completed and the GPU returns the pixel data. I have tested it, and this solution can cause framerate drops on slower/mid-range portable devices.
In my implementation, when an RTT frame is processed I start rendering to a texture, but I do not invoke glReadPixels() right after that; it is invoked in the next frame processing cycle. This way, while rendering executes on the GPU, we can execute code on the CPU without wasting precious clock cycles. That is why some additional code and variables are needed. What is more, those stored variables can be reused when gl_GetTexture() is called, until they are invalidated by the next RTT rendering cycle (with the same address). Moreover, in the recent version of my RTT implementation, ARGB4444 and RGB565 textures are stored in OpenGL's video memory, and when they are needed we just pass a texture ID, so glReadPixels() is not needed at all (other formats like KRGB1555 need it, because OpenGL hardware cannot substitute 'K' with a given 'fb_kval[7]' value; this must be done on the CPU, and for that we need the glReadPixels() call).
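
The resulting policy can be summarized in a few lines. This is a sketch under assumptions, not the PR's code: the enum, the function name and the reinterpretedLater flag (standing for cases like THPS2) are hypothetical, and treating ARGB1555 like KRGB1555 is inferred from the fb_alpha_threshold discussion earlier in this thread.

```cpp
// Render-target formats discussed in this thread.
enum class RttFormat { RGB565, ARGB4444, ARGB1555, KRGB1555 };

// Returns true when the rendered pixels have to go through glReadPixels()
// and the CPU, false when the GPU texture can be reused directly.
inline bool needsReadback(RttFormat fmt, bool reinterpretedLater)
{
    if (reinterpretedLater)
        return true;    // e.g. RGB565 later sampled as ARGB4444 (THPS2)
    // The 1555 formats depend on fb_alpha_threshold / fb_kval, which are
    // applied on the CPU.
    return fmt == RttFormat::ARGB1555 || fmt == RttFormat::KRGB1555;
}
```

How the special THPS2-like case is detected is not described here, so the flag is left as an input.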

I think it's up to @flyinghead to have final say on this. If it doesn't upset the code too much otherwise, I'd merge it.

@mar753
Contributor Author

mar753 commented Feb 11, 2019

@skmp
Requested changes added, next steps are yours.

@mar753
Contributor Author

mar753 commented Feb 13, 2019

@flyinghead Can you approve this PR?

@mar753
Contributor Author

mar753 commented Feb 16, 2019

@baka0815 @dmiller423 @skmp
PR is accepted, thus we have a green light to merge this solution.

@baka0815
Contributor

Since @skmp is ok with this and @flyinghead approved it, I'm merging this.

Thanks @mar753 for your work!

@baka0815 baka0815 merged commit 41907bc into skmp:master Feb 19, 2019
@skmp skmp mentioned this pull request Mar 28, 2019
12 tasks

5 participants