Set GL_MaxFramesAllowed=1 instead of GL_YIELD=usleep #641

Merged: 2 commits merged into next on Jul 11, 2021

Conversation

@kwand commented Jun 18, 2021

This essentially concludes my issue #592 by implementing the original change mentioned in my first post (instead of the other, later implementations I tried; in the time since, I've decided those were only marginal improvements, with testing ultimately inconclusive, at the cost of a (personally) unacceptable amount of changes to the code and code separation.*)

*I'll mention more about the "inconclusive testing" later, for anyone reading this and thinking the ideas mentioned in #592 have some merit. I do still think they have merit, but the cost of implementation is high - high enough that I no longer think it is practical. A hardware upgrade, honestly, might be more practical (compared to a more consistent 59-60Hz if fixed, since the smoothness issue with picom really only affects 60Hz and lower displays), given how much more affordable high refresh rate displays have become recently.

I have conclusive data this time from testing with NVIDIA's Nsight Systems. Nothing too major in terms of improvements - other than much-reduced max frame times (~63-82ms -> ~19-25ms, on a 160Hz display) and a bit less CPU time - but I think it is enough to warrant the changes.

I will preface this by saying that I still don't quite understand why we're (specifically) using GL_YIELD=usleep. I am aware of why something was needed - glFinish busy-waits and causes high CPU utilization - but I'm unsure why disabling triple buffering (with this MaxFramesAllowed flag) was not considered as an alternative, since it essentially renders glFinish a dummy call that returns immediately and forces the swap-buffers call to always block until vertical refresh (which is the same result we appear to be trying to achieve with glFinish).
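
For context, here is a minimal sketch of the kind of change involved. The function name apply_nvidia_workaround is made up for illustration; picom applies its NVIDIA-specific workarounds in src/backend/driver.c, and the environment variable has to be set before the GL context is created for the driver to pick it up:

```c
#include <stdlib.h>

/* Illustrative sketch only; not picom's actual code. */
void apply_nvidia_workaround(void) {
	/* Previous workaround: make glFinish yield via usleep instead of
	 * busy-waiting. */
	/* setenv("__GL_YIELD", "usleep", 1); */

	/* This PR: allow only one frame in flight, so that glXSwapBuffers
	 * blocks until the previous swap has completed at vblank and
	 * glFinish effectively becomes a no-op. */
	setenv("__GL_MaxFramesAllowed", "1", 1);
}
```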

Anyways, the data:

Testing notes:

  • I managed to do the testing on a Ryzen 9 5950X, RTX 3090 system driving a 3440x1440 160Hz ultrawide display (with G-Sync enabled), so I'm quite sure the differences in the testing below are not related to the hardware (other than perhaps picom struggling a bit more with the ultrawide resolution; I have tested separately that picom is capable of running at 160Hz at native resolution during idle/light tasks).
  • The test (run twice): quickly dragging a floating instance of the kitty terminal (in a random/erratic fashion) over a playing 60Hz YouTube video in Chrome, on i3wm. This seems to be the best way of forcing as many redraws as possible, with noticeable real-world differences (since I, and some other issues, have mentioned noticing "laggy" floating windows while a video is playing in the background. Though, on second thought, I could also try something similar from Blurbusters while dragging a floating window later).
  • Launch command: picom --experimental-backends --benchmark 2400 (Note: 2400 means benchmarking 2400 frames, or roughly 15s assuming we can maintain 160Hz).

picom next branch (GL_YIELD=usleep)
[Nsight Systems frame timeline screenshots]

Note: Yellow-coloured frames seem to indicate that they took slightly longer than ~16.6ms, i.e. one frame duration at 60Hz. Not sure why this is the default colour coding from NVIDIA - I don't see any options to change it in Nsight either, unfortunately. Red-coloured frames are ones that took 2-3 frame durations at 60Hz to render (i.e. ~40-50ms).

picom next branch with GL_MaxFramesAllowed=1
[Nsight Systems frame timeline screenshots]

Overall frame rate is higher here, with lower average frame times and much lower max frame times. Frame stability is also much improved (no red-coloured frames at all, and only a few yellow-coloured ones). While CPU utilization seems to be higher*, we seem to be spending that time doing actual useful work (instead of calling usleep every few ms). Randomly observing a selection of frames from both reports seems to confirm this:

*I thought this might have been caused by the changes, but on second thought - after examining a few more selections of frames - both seem to be about the same, minus the time spent calling usleep.

picom next branch (selection of 8 frames)
[Nsight Systems screenshot: selection of 8 frames]

picom next branch with MaxFramesAllowed (8 frames)
[Nsight Systems screenshot: selection of 8 frames]

EDIT: Hmm, just noticed this, but it appears Nsight is saying I selected 10 CPU frames... except I'm not seeing where it counts 10. (Looks like 8 to me).

picom 8.2
Also, for fun (and because I forgot that the current picom release is still 8.2 and installed the wrong version), here is picom 8.2. It has by far the worst results - but I think that's mostly due to the changes to floating window movement that I believe were introduced after version 8.2 (which of course benefit the results of the post-8.2 builds).

[Nsight Systems frame timeline screenshots]


As a final note, I do have the report files created by Nsight and originally planned to upload them, but unfortunately they only seem to be openable in Nsight, with no viable export options that provide the same readability. @yshui, please let me know if you or others want a copy of the files (if you have Nsight installed), as well as the best place to upload them to (for ease of access).

EDIT: Also, I forgot to mention: this is just one testing result, from one system. More testing is of course needed (to see if there are really benefits from making this change). If anyone else can test using Nsight (it requires registering an NVIDIA Developer account, but the process was relatively simple IIRC), those results would be quite valuable as well.

(And, of course, this fix only affects NVIDIA GPUs, since the original GL_YIELD=usleep setting is likewise applied only when NVIDIA drivers are detected.)

@codecov bot commented Jun 18, 2021

Codecov Report

Merging #641 (fe7876c) into next (e92403b) will increase coverage by 0.00%.
The diff coverage is 0.00%.

❗ Current head fe7876c differs from pull request most recent head 976d074. Consider uploading reports for the commit 976d074 to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##             next     #641   +/-   ##
=======================================
  Coverage   39.54%   39.55%           
=======================================
  Files          46       46           
  Lines        9588     9529   -59     
=======================================
- Hits         3792     3769   -23     
+ Misses       5796     5760   -36     
Impacted Files Coverage Δ
src/backend/driver.c 33.33% <0.00%> (ø)
src/backend/gl/glx.c 37.04% <0.00%> (-0.13%) ⬇️
src/win.c 68.01% <0.00%> (-0.85%) ⬇️
src/picom.c 67.78% <0.00%> (-0.61%) ⬇️
src/backend/dummy/dummy.c 76.56% <0.00%> (-0.37%) ⬇️
src/backend/gl/gl_common.h 22.72% <0.00%> (-0.35%) ⬇️
src/backend/xrender/xrender.c 0.00% <0.00%> (ø)
src/backend/gl/gl_common.c 15.34% <0.00%> (+0.49%) ⬆️

@yshui (Owner) commented Jun 20, 2021

Thanks for your work and the detailed analysis!

but I'm unsure why disabling triple buffering (with this MaxFramesAllowed flag) was not considered as an alternative

At first I assumed MaxFramesAllowed is just going to be the NVIDIA driver doing some sort of glFinish equivalent internally.

My rationale is this: MaxFramesAllowed has to block waiting for the current frame to finish swapping, just like glFinish. If the driver has a more efficient blocking mechanism, why would they also use it for glFinish?

Is my assumption wrong? Does MaxFramesAllowed=1 not use usleep()/busy-loop internally?

@kwand (Author) commented Jun 25, 2021

Yikes, I meant to reply the next day after doing testing to confirm, but unfortunately I didn't have much time, and it seems like I won't have much time in the future either, so it's probably best I reply to you now based on what I remember from previous testing (though that was 3-4 months ago; I don't believe there have been any changes that affect this behaviour, but there have certainly been changes in how well I remember what I tested back then :P)

At first I assumed MaxFramesAllowed is just going to be the NVIDIA driver doing some sort of glFinish equivalent internally.

My rationale is this: MaxFramesAllowed has to block waiting for the current frame to finish swapping, just like glFinish. If the driver has a more efficient blocking mechanism, why would they also use it for glFinish?

First, I assume you meant to say "why wouldn't they also use it for glFinish"? (Otherwise, I'm not quite sure what you mean).

Is my assumption wrong? Does MaxFramesAllowed=1 not use usleep()/busy-loop internally?

I think we can rule out its using a busy-loop internally, since 'disabling triple buffering'* does not come anywhere near the CPU utilization of a busy-waiting glFinish (without usleep). I tested this months ago on the i7-8700K/RTX 2080 system:

  • glFinish gave ~10% CPU utilization (and possibly more, I don't quite remember - my memory says up to 30%, but that also seems a bit too absurd to me now. I also don't quite remember what numbers similar issues reported back in the day when picom took over laptops/weaker CPUs because of glFinish), as viewed through gotop.
  • glFinish with usleep or MaxFramesAllowed=1 gave similar numbers of ~3% CPU utilization, with maybe a 0.5-1% difference between the two at times (MaxFramesAllowed=1 having the lower number - but the average difference was negligible IIRC, at least on a CPU like the 8700K).
  • Also, if I recall correctly, this is CPU utilization at idle (with the exception of the gotop window updating).

*I am assuming NVIDIA's documentation about this flag is accurate - I have no idea what it does behind the scenes, but I think working with this assumption is safe. I was going to link to the documentation page, but for some reason I can't seem to find it anymore. I guess we'll have to assume this information from the KWin developers is accurate, which was the only other source I could find about __GL_MaxFramesAllowed.

As for whether or not it is usleeping internally - that I can't say for sure, nor do I think it's possible to detect via Nsight - but I think it likely is not. First, I don't think NVIDIA expected glFinish to be used as frequently as it is (in picom), or to be used to enforce v-sync - the GLX protocol documentation only requires glFinish to block until all previously issued operations are complete, not that it be resource-friendly. Of course, busy-waiting is still a terrible implementation, but I believe this hints at NVIDIA expecting it to be used only sparingly (where it is then less of a problem CPU-utilization-wise).

Setting __GL_MaxFramesAllowed=1, on the other hand, is completely different from glFinish. Your wording here is a bit confusing:

My rationale is this: MaxFramesAllowed has to block waiting for the current frame to finish swapping, just like glFinish. If the driver has a more efficient blocking mechanism, why would they also use it for glFinish?

Your wording seems to imply that MaxFramesAllowed is another function like glFinish? It is not a function at all. Setting the flag only causes glXSwapBuffers to block until v-sync, which, as far as I understand, was the default behaviour before graphics drivers added triple-buffering functionality (on Linux). So it would seem to me that glXSwapBuffers and glFinish have completely different behaviours: glXSwapBuffers (with triple buffering disabled) only has to block until v-sync - functionality that already existed before triple buffering - while glFinish seems like it needs to query whether or not all previously issued commands have completed. So it doesn't seem obvious to me that NVIDIA would use similar implementations for both behaviours, even if they both block (though there are possibly some optimizations they could have made for glFinish).
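
To make the distinction concrete, here is a rough sketch of the two present paths as I understand them (illustrative only - this is not picom's actual code, and the driver-side behaviour described in the comments is my assumption based on the documentation above):

```c
#include <GL/gl.h>
#include <GL/glx.h>

/* Old behaviour (__GL_YIELD=usleep): glFinish blocks, yielding with
 * usleep, until every queued GL command has executed; the swap is then
 * queued and may complete later (triple buffering). */
void present_with_glfinish(Display *dpy, GLXDrawable drawable) {
	glFinish();
	glXSwapBuffers(dpy, drawable);
}

/* New behaviour (__GL_MaxFramesAllowed=1): only one frame may be in
 * flight, so glXSwapBuffers itself blocks until the previous swap has
 * completed at vblank, and an explicit glFinish adds nothing. */
void present_without_glfinish(Display *dpy, GLXDrawable drawable) {
	glXSwapBuffers(dpy, drawable);
}
```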

Of course, this is all a bit theoretical and based on my memory. I'll edit this once I get the chance to do some testing again on this 5950x system. (I can tell you now that, at idle, Glances reports 0.3-1% CPU utilization with MaxFramesAllowed=1)

EDIT: I found some time to do the testing - immediately after I finished writing this. I think giving you some numbers from Glances should be enough since we're just noting picom's CPU utilization. Ran a few quick compiles and monitored the numbers in Glances:

  • As mentioned before, MaxFramesAllowed=1 has 0.3-1% CPU utilization at idle.
  • glFinish (without usleep) gives CPU utilization of 3-10% at idle. Moving the mouse around causes 10% utilization, and simply switching workspaces in i3wm causes spikes up to 25%. (Perhaps I was wrong about the 30% utilization on the 8700K being just an absurd artifact of my memory...)
  • glFinish (with usleep) hovers around 0.3-1% CPU utilization at idle. Doesn't seem to ever break 1% either, so on par with MaxFramesAllowed=1.

Looks like my memory is still fine, and everything behaves about as expected (except for the fact that I forgot about how bad glFinish was without usleep...)

EDIT 2: Might have spoken a bit too soon. MaxFramesAllowed=1 does seem to have slightly higher peaks than glFinish (with usleep) at around 2%, even though it still hovers mostly around 0.3-1%. glFinish with usleep also does break 1%, peaking at 1.3% (haven't observed anything higher yet).

This did get me thinking - since it appears that setting GL_YIELD=usleep affects more than just glFinish, I decided to try out a build with both GL_YIELD=usleep and __GL_MaxFramesAllowed=1. Utilization-wise, the numbers are great: it doesn't seem to ever break 0.7% CPU utilization and actually disappears most of the time into the list of many 0.0% utilization processes. Frame-timing performance, of course, is another question. (I'm thinking it may perform slightly worse than without usleep, since we must be getting those utilization savings at the cost of something.)

EDIT 3: Disregard that last idea I had about combining usleep and MaxFramesAllowed - performance is atrocious. Seems like usleep is the problem and something we should just get rid of (along with glFinish - since it's a useless dummy call now that returns immediately) if we want responsiveness:
[Nsight Systems screenshot]

Though, I think this testing confirms that they are not all doing the same thing internally, since all three (glFinish, glFinish with usleep, MaxFramesAllowed=1) have different CPU utilization patterns.

@kwand force-pushed the __GL_MaxFramesAllowed branch 2 times, most recently from 0531d5a to 9b90353 on July 4, 2021 at 00:46
@kwand (Author) commented Jul 4, 2021

Removed glFinish altogether for NVIDIA (since it doesn't do anything other than waste a few ms). Tests forthcoming, though I expect the results to simply show that it no longer wastes time doing something redundant.

Ignore the mess that I created above; had some trouble rebasing and made an even bigger mess. All is well now. :)

EDIT: I've done some preliminary testing - mostly running a quick 24-frame 'benchmark' in benchmark mode with Nsight and observing the calls, and watching CPU utilization through Glances while just using picom.

Nothing much to note, other than that glFinish is indeed no longer being used. We also seem to have lowered CPU utilization by removing glFinish: picom stays at 0.3-0.7% CPU utilization while I type this (compared to 0.3-1% previously). Most of the time it does not even appear in the >0.0% usage list in Glances, which is great and comparable to the (terrible performance-wise, but great CPU-utilization-wise) combination of usleep + MaxFramesAllowed.

@yshui (Owner) commented Jul 6, 2021

Again thanks for your analyses.

Your wording seems to imply MaxFramesAllowed is another function like glFinish? Because it is not a function at all.

Sorry, I wasn't very clear. What I meant was: if you set MaxFramesAllowed=1, then glXSwapBuffers would have to first block waiting for rendering to finish, then block waiting for v-sync. Basically it needs to do the same thing we currently use glFinish to do. To be fair, given the quality of the NVIDIA driver, I wouldn't be surprised if it does indeed have different implementations for the same functionality.

I wonder if waiting for the vsync (by using, e.g. OML_sync_control) would have a better result than glFinish? I think I have tested that but I am not sure.

I will merge this, but I do wish we could find a more general solution.

@kwand (Author) commented Jul 10, 2021

Sorry, I wasn't very clear. What I meant was: if you set MaxFramesAllowed=1, then glXSwapBuffers would have to first block waiting for rendering to finish, then block waiting for v-sync. Basically it needs to do the same thing we currently use glFinish to do. To be fair, given the quality of the NVIDIA driver, I wouldn't be surprised if it does indeed have different implementations for the same functionality.

Ah, yes. I see your point now. It does seem to have the same functionality, but yeah, as to why they behave differently, it probably has something to do with NVIDIA and their (unfortunate) Linux driver quality.

I wonder if waiting for the vsync (by using, e.g. OML_sync_control) would have a better result than glFinish? I think I have tested that but I am not sure.

I can try looking into that - I'm not quite familiar with how it works yet, though. (Also, I may be misremembering, but I think I might have seen you mention it somewhere before - along the lines of it having the same behaviour as glFinish, i.e. busy-waiting, or that it didn't work with NVIDIA. It's been a while - I don't quite remember much at this point :P)
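
For reference, a rough sketch of what waiting for the next vblank via GLX_OML_sync_control might look like (just the extension's API as I understand it - untested with picom, and only usable if the driver advertises the extension):

```c
#include <GL/glx.h>
#include <GL/glxext.h>
#include <stdint.h>

/* Hypothetical sketch: block until the next vblank using OML_sync_control. */
void wait_for_next_vblank(Display *dpy, GLXDrawable drawable) {
	PFNGLXGETSYNCVALUESOMLPROC get_sync_values =
	    (PFNGLXGETSYNCVALUESOMLPROC)glXGetProcAddress(
	        (const GLubyte *)"glXGetSyncValuesOML");
	PFNGLXWAITFORMSCOMLPROC wait_for_msc =
	    (PFNGLXWAITFORMSCOMLPROC)glXGetProcAddress(
	        (const GLubyte *)"glXWaitForMscOML");
	if (!get_sync_values || !wait_for_msc)
		return; /* extension not available */

	int64_t ust, msc, sbc;
	/* Read the current media stream counter (frame counter)... */
	get_sync_values(dpy, drawable, &ust, &msc, &sbc);
	/* ...and block until it has advanced by one, i.e. the next vblank. */
	wait_for_msc(dpy, drawable, msc + 1, 0, 0, &ust, &msc, &sbc);
}
```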

A general solution indeed would be nice, but it looks like that probably won't happen until NVIDIA decides to release open-source drivers or at least follows more standard behaviour in line with other Linux graphics drivers.

@kwand marked this pull request as ready for review on July 10, 2021 at 00:54
@yshui (Owner) commented Jul 10, 2021

@kwand Are there related issues to this PR? If there are, can you close them in commit messages? And can you rebase your branch on top of next (instead of merging next into your branch), thanks!

@kwand force-pushed the __GL_MaxFramesAllowed branch from fe7876c to 9b90353 on July 10, 2021 at 22:13
@kwand force-pushed the __GL_MaxFramesAllowed branch from 9b90353 to 976d074 on July 10, 2021 at 22:15
@kwand (Author) commented Jul 10, 2021

@kwand Are there related issues to this PR? If there are, can you close them in commit messages? And can you rebase your branch on top of next (instead of merging next into your branch), thanks!

Forgot about that! It's fixed now.

I don't think there are any issues that this PR closes, but you did remind me that I had an issue open about the same thing (but using a much different approach: #592) that I forgot to close. I'll close it now, since I don't have much desire to revisit that messy approach again, even if it could potentially shave off another 10ms.*

*In the case of 60Hz only. It is practically useless at higher refresh rates, since I had to add a base delay of ~2-3ms for that approach to avoid missing vblank and incurring more than one frame of delay, which is worse than what we have now (<1 frame of delay, but blocking times varying between 1ms and 16ms). Ideally, we would block for only ~1-2ms, very close to vblank. Perhaps we can still try this approach again in the future and take actual hints from KWin's implementation; I just haven't had time to fully understand how they did it.
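
Just to illustrate what I mean by that scheduling idea, here is a very rough sketch (all names and numbers are made up for illustration; this is not KWin's or picom's code): start rendering a small margin before the predicted vblank, so the swap only has to block for ~1-2ms.

```c
#include <stdint.h>

/* Hypothetical helper: compute when to start rendering so we finish just
 * before the next predicted vblank. All inputs/outputs are in microseconds. */
uint64_t next_render_start_us(uint64_t last_vblank_us, uint64_t refresh_us,
                              uint64_t expected_render_us) {
	const uint64_t margin_us = 1000; /* safety margin to avoid missing vblank */
	uint64_t next_vblank_us = last_vblank_us + refresh_us;
	uint64_t budget_us = expected_render_us + margin_us;
	if (budget_us >= refresh_us)
		return last_vblank_us; /* can't fit within one refresh: start ASAP */
	return next_vblank_us - budget_us;
}
```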
