-
-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set GL_MaxFramesAllowed=1 instead of GL_YIELD=usleep #641
Conversation
Codecov Report
@@ Coverage Diff @@
## next #641 +/- ##
=======================================
Coverage 39.54% 39.55%
=======================================
Files 46 46
Lines 9588 9529 -59
=======================================
- Hits 3792 3769 -23
+ Misses 5796 5760 -36
|
Thanks for your work and the detailed analysis!
At first I assumed MaxFramesAllowed is just going to be the NVIDIA driver doing some sort of glFinish equivalent internally. My rational is this: MaxFramesAllowed has to block waiting for the current frame to finish swapping, just like glFinish. If the driver has a more efficient blocking mechanism, why would they also use it for glFinish? Is my assumption wrong? Does MaxFramesAllowed=1 not use usleep()/busy-loop internally? |
Yikes, I meant to reply the next day after doing testing to confirm, but unfortunately didn't have much time and it seems like I won't have much time in the future either, so it's probably best I reply to you now based on what I remember from previous testing (though, this is 3-4 months ago. I don't believe there have been any changes that affect this behaviour though, but there have certainly been changes in how well I remember what I tested back then :P)
First, I assume you meant to say "why wouldn't they also use it for glFinish"? (Otherwise, I'm not quite sure what you mean).
I think we can rule out it using busy-looping internally, since 'disabling triple buffering'* does not have anywhere near the CPU utilization of busy-waiting glFinish (without usleep). I tested this months ago on the i7-8700K/RTX 2080 system:
*I am assuming NVIDIA's documentation about this flag is accurate - I have no idea what it does behind the scenes, but I think working with this assumption is safe. I was going to link to the documentation page, but for some reason, I can't seem to find it anymore. I guess we'll have to go with assuming this information from the KWin developers is accurate, which was the only other source I could find about __GL_MaxFramesAllowed) As for whether or not it is usleeping internally - that I can't say for sure, nor do I think it's possible to detect it via Nsight - but I think it is likely not. First, I don't think NVIDIA expected glFinish to be used as frequently as it is (in picom) or to enforce v-sync - since the glX protocol documentation only requires glFinish to block until all previously issued operations are complete, not that it needs to be resource-friendly. Of course, busy-waiting is still a terrible implementation, but I believe this hints towards NVIDIA expecting it to only be used sparingly (where it is then less of a problem CPU utilization wise). Setting _GL_MaxFramesAllowed=1 on the other hand is completely different from glFinish. Your wording here is a bit confusing:
Your wording seems to imply MaxFramesAllowed is another function like glFinish? Because it is not a function at all. Setting the flag only causes glXSwapBuffers to block until v-sync, which as far as I understand was the default behavior prior to graphics drivers adding in triple-buffering functionality (for Linux). So, it would seem to me that glXSwapBuffers and glFinish have completely different behaviours. glXSwapBuffers (with triple buffering disabled) only has to block until v-sync - and this is also previously existing functionality before triple buffering. glFinish on the other hand seems like it needs to query whether or not all of the previously issued commands have been completed or not. So, it doesn't seem obvious to me that NVIDIA would use similar implementations for both behaviours, even if they both block (though there are possibly some optimizations they could have made for glFinish). Of course, this is all a bit theoretical and based on my memory. I'll edit this once I get the chance to do some testing again on this 5950x system. (I can tell you now that, at idle, Glances reports 0.3-1% CPU utilization with MaxFramesAllowed=1) EDIT: I found some time to do the testing - immediately after I finished writing this. I think giving you some numbers from Glances should be enough since we're just noting picom's CPU utilization. Ran a few quick compiles and monitored the numbers in Glances:
Looks like my memory is still fine, and everything behaves about as expected (except for the fact that I forgot about how bad glFinish was without usleep...) EDIT 2: Might have spoken a bit too soon. MaxFramesAllowed=1 does seem to have slightly higher peaks than glFinish (with usleep) at around 2%, even though it still hovers mostly around 0.3-1%. glFinish with usleep also does break 1%, peaking at 1.3% (haven't observed anything higher yet). This did get me thinking - since it appears that setting GL_YIELD=usleep affects more than just glFinish, I decided to try out a compile with both GL_YIELD=usleep and __GL_MaxFramesAllowed=1. Utilization-wise, the numbers are great. Doesn't seem to ever break 0.7% in CPu utilization and actually disappears most of the time into the list of many 0.0% utilization processes. Frame timings performance, of course, is another question. (I'm thinking it may perform slightly worse than without usleep, since we must be getting those utilization savings at the cost of something) EDIT 3: Disregard that last idea I had about combining usleep and MaxFramesAllowed - performance is atrocious. Seems like usleep is the problem and something we should just get rid of (along with glFinish - since it's a useless dummy call now that returns immediately) if we want responsiveness: Though, I think this testing confirms that they are not all doing the same thing internally, since all three (glFinish, glFinish with usleep, MaxFramesAllowed=1) have different CPU utilization patterns. |
0531d5a
to
9b90353
Compare
Removed glFinish altogether for NVIDIA (since it doesn't do anything other than wasting a few ms). Tests forthcoming, though I expect the results to just be that it no longer wastes time doing some redundant. Ignore the mess that I created above; had some trouble rebasing and made an even bigger mess. All is well now. :) EDIT: Done some preliminary testing - mostly running a quick 24 frame 'benchmark' in benchmark mode with Nsight and observing the calls, and observing CPU utilization through glances while just using picom. Nothing much to note, other than glFinish is indeed no longer being used. We also seemed to have lowered CPU utilization by removing glFinish. picom stays at 0.3-0.7% CPU utilization while typing this (compared to 0.3-1% previously). Mostly, it usually does not even appear in the >0.0% usage in glances, which is great and comparable to (the terrible performance-wise, but great CPU-utilization-wise) combination of usleep + MaxFramesAllowed. |
Again thanks for your analyses.
Sorry I wasn't very clear. What I meant was, if you set I wonder if waiting for the vsync (by using, e.g. OML_sync_control) would have a better result than I will merge this, but I do wish we could find a more general solution. |
Ah, yes. I see your point now. It does seem to have the same functionality, but yeah, as to why they behave differently, it probably has something to do with NVIDIA and their (unfortunate) Linux driver quality.
I can try looking into that - not quite familiar with how it works yet though. (Also, I may be misremembering, but I think I might have seen you mentioning it before somewhere - along the lines of it having same behaviour as glFinish: busy-waiting. Or that it didn't work with NVIDIA. It's a been a while - I don't quite remember much at this point :P) A general solution indeed would be nice, but it looks like that probably won't happen until NVIDIA decides to release open-source drivers or at least follows more standard behaviour in line with other Linux graphics drivers. |
@kwand Are there related issues to this PR? If there are, can you close them in commit messages? And can you rebase your branch on top of |
fe7876c
to
9b90353
Compare
9b90353
to
976d074
Compare
Forgot about that! It's fixed now. I don't think there are any issues that this PR closes, but you did remind me that I had an issue opened about the same thing (but using a much different approach: #592) that I forgot to close. I'll close it now, since I don't have much desire to revisit that messy approach again, even if it can potentially shave off another 10ms.* *In the case of 60Hz only. It is practically useless with higher refresh rates since I had to put in a base amount of ~2-3ms delay for that approach to avoid missing vblank and having greater than 1 frame delays, which is worse than what we have now (<1 frame delays, but blocking times vary between 1ms to 16ms. Ideally, we would block for only ~1-2ms and very close to vblank times. Perhaps we can still try this approach again in the future, and take actual hints from KWin's implementation. I just haven't had time to fully understand how they did it) |
This essentially concludes my issue #592, implementing the original change mentioned in my first post (instead of the other, later implementations I tried, which in the time since, I've decided they were only marginal improvements, with testing ultimately being inconclusive - at a cost of (personally) unacceptable amount of changes to the code/code separation.*)
*I'll mention more about the "inconclusive testing" later, for anyone reading this and thinking the ideas mentioned in #592 have some merit. Which I do still think they have, but the cost of implementation is high. High enough that I no longer think it is practical. A hardware upgrade, honestly, might be more practical (compared to more consistent 59-60Hz if fixed - as the smoothness issue with picom really only affects 60Hz and lower displays) given how much more afforable high refresh rate displays have become recently.
I have conclusive data this time from testing with NVIDIA's Nsight Systems. Nothing too major in terms of improvements - other than much-reduced max frame times (~63-82ms -> ~19-25ms, on a 160Hz display) and a bit less CPU time - but I think it is enough to warrant the changes.
I will preface this by saying that I still don't quite understand why we're (specifically) using USLEEP. I am aware of why it was needed - since glFinish busy-waits and causes high CPU utilization - but I'm unsure why disabling triple buffering (with this MaxFramesAllowed flag) was not considered as an alternative, as this essentially renders glFinish a dummy call that returns immediately and forces the swap buffers command to always block/wait until vertical refresh (which is the same result we appear to be trying to achieve with glFinish).
Anyways, the data:
Testing notes:
picom next branch (GL_YIELD=usleep)
Note: Yellow colored frames seem to indicate that they took slightly longer than ~16.6 ms i.e. 1 frame under 60Hz. Not sure why this is the default color coding from Nvidia - I don't see any options to change this in Nsight either, unfortunately. Red colored frames are ones which took 2-3 frames under 60Hz to render (i.e. ~40-50ms).
picom next branch with GL_MaxFramesAllowed=1
Overall frame-rate is higher here, with lower average frame times and much lower max-frame times. Frame stability is also much improved (no red-coloured frames at all, with only a few yellow-coloured ones).
While CPU utlization seems to be higher*, we seem to be spending that time using actual useful work (instead of calling usleep every few ms). Randomly observing a selection of frames from both reports seem to confirm this:*Thought this might have been caused by the changes, but on second thought - after examining a few more selections of frames, they both seem to be about the same, minus time spent calling usleep.
picom next branch (selection of 8 frames)
picom next branch with MaxFramesAllowed (8 frames)
EDIT: Hmm, just noticed this, but it appears Nsight is saying I selected 10 CPU frames... except I'm not seeing where it counts 10. (Looks like 8 to me).
picom 8.2
Also, for fun (and also b/c I forgot that the current picom release version is still 8.2 and installed the wrong version), picom 8.2. This has, by far, the worst results - but I think it's mostly due to the new changes to floating window movement that I believe were introduced after version 8.2 (which of course benefit the results of the post-8.2 builds)
As a final note, I do have the report files created by Nsight and originally planned to upload them, but unfortunately they only seem to be able to opened by Nsight, with no viable export options that provide the same readability. @yshui, please let me know if you or others want a copy of the files (if you have Nsight installed), as well as the best place to upload them to (for best ease of access).
EDIT: Also, forgot to mention. This is just one testing result, from one system. More testing is of course needed! (to see if there are really benefits from making this change). If anyone else can testing using Nsight (requires registering an NVIDIA Developer account, but the process was relatively simple IIRC), those results would be quite valuable as well.
(And, of course, this fix only affects NVIDIA GPUs since the original GL_YIELD=usleep flag setting is also a fix specific to when NVIDIA drivers are detected)