Using "__GL_MaxFramesAllowed=1" for NVIDIA (and other possible improvements to 'fix' frame timings and input lag) #592
Comments
I'm unfamiliar with the compositing portion of the code, but a quick look at render.c seems to indicate that the default is glFlush(), and glFinish() is only activated if picom is started with the --vsync-use-glfinish flag, as seen in the man page and in the corresponding bool in config.h. I'm mentioning this because trying to use __GL_MaxFramesAllowed=1 as a bypass for glFinish() might not actually be achieving anything, since glFinish() isn't being used in the first place. I've compiled using the __GL_MaxFramesAllowed=1 flag and don't see much stutter with floating windows, in i3 at least.
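For reference, the conditional being described is roughly of this shape - a minimal sketch with made-up names (vsync_opts, finish_render), not picom's actual render.c code:

```c
#include <stdbool.h>
#include <GL/gl.h>

/* Illustrative stand-in for the relevant option; not picom's real structs. */
struct vsync_opts {
	bool vsync_use_glfinish; /* set by --vsync-use-glfinish */
};

/* End-of-render sketch: glFlush() by default, glFinish() only if requested. */
static void finish_render(const struct vsync_opts *o) {
	if (o->vsync_use_glfinish)
		glFinish(); /* wait for the GPU to finish all submitted work (busy-waits on NVIDIA) */
	else
		glFlush();  /* default: hand the queued commands to the driver and return */
}
```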
@mitchell-gil96 Are you using the experimental backends? Apologies if I haven't mentioned this somewhere before, but I'm mainly talking about the future improvement of the experimental backends. There's a glFinish() call in the new backend files, and it is mentioned in numerous commits made by @yshui. (See the wiki as well for a description of the current approach, which also mentions glFinish.)

(Also, did you set the flag in the appropriate function as mentioned? I'm just confirming, as the way you've described it sounds like you added it as a compile flag, which I hope you didn't do, as that would have no effect.)
I always use the experimental backends to get blur on glx. I wasn't aware render.c was legacy code. I ran a basic grep search on my cloned directory and found most instances of glFinish within that file, and completely missed the call in /backend/gl/glx.c. I edited line 18 of src/backend/driver.c before compiling. For good measure, I again recompiled while removing the glFinish() call at line 469 of glx.c. Will report back after a full day of use, but so far it doesn't seem to present any issues. Unless I misunderstood your original issue, setting the MaxFramesAllowed=1 flag essentially makes glFinish obsolete, passing the blocking to glXSwapBuffers?
Looking forward to hearing back! I'm not sure how much of a difference you'll notice - it should honestly be small, though hopefully still noticeable.

On a related note, I have done some more testing recently on the older versions (latest release, from Oct) that yielded some surprising results. It appears that the jitter from using the USLEEP flag is dependent on the CPU clock speed. I thought that running a CPU stress test at 100% might expose some of the flaws of the USLEEP flag (such as late returns and missed vblanks), but surprisingly it performs better than normal. Though, using the MaxFramesAllowed flag still seems to offer the best experience (almost no jitter + smooth), even under the stress test. I haven't tested whether running a stress test on an unmodified version of the latest branch gives similar performance to this MaxFrames patch yet, since, as mentioned previously, there seems to be less jitter due to changes since the last release.

EDIT: Just returned from testing (with the same setup, i.e. the stress test) on the latest branch, unmodified, with USLEEP enabled. Under the stress test, it seems to offer only similar performance to the last release (with USLEEP enabled) - which was odd, since I somewhat expected it to perform better and perhaps even beat out this MaxFramesAllowed solution. (Also re-tested the MaxFramesAllowed version: the experience was again similar to what I previously reported, and still better than using USLEEP on the latest branch.)
For NVIDIA, yes. For all other drivers, however, the glFinish call has to be kept, as glXSwapBuffers might not block* even with this flag set. (Actually, I'm unsure if this flag even has an effect on the other drivers, since I've only read that this is a workaround for NVIDIA to get glXSwapBuffers to block.)

*Though, history-wise - according to what I've read in the linked KDE development issues - glXSwapBuffers() used to be a blocking call on all drivers until 5-10 years ago. The original specification also does not require drivers to implement it as a blocking call. As to why it used to be blocking previously, I'm not sure. Probably because triple buffering was not supported then? I'm not sure about this, but that is what we're doing by setting the MaxFrames flag: disabling triple buffering - for picom at least - to force the call to block.

If this works and performs better, I like to view it as a more stable way of blocking than glFinish - which for some reason NVIDIA implemented using busy-waiting - and without the need for the USLEEP flag (which could be affecting more calls than just glFinish) that might potentially be causing picom to return late and miss vblanks. I say "potentially" as I don't have firm evidence that this is happening yet; I'm still trying to collect some concrete measurements like KDE has done here, but I'm having a bit of trouble tracing the events.
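One crude way to check whether the flag actually changes the blocking behaviour on a given driver is to time the swap call itself - a sketch (not picom code), assuming a GLX context is current on the drawable:

```c
#include <time.h>
#include <GL/glx.h>

/* If __GL_MaxFramesAllowed=1 makes glXSwapBuffers() block until vblank, the
 * measured time should approach one refresh interval (~16.7 ms at 60 Hz);
 * otherwise the call usually returns almost immediately. */
static double timed_swap_ms(Display *dpy, GLXDrawable drawable) {
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	glXSwapBuffers(dpy, drawable);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```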
On an unrelated note, I just discovered the use-damage option. Though, that's not to say a rework that implements KDE's design wouldn't be worth it - I think it would, and it seems that many recognize their current design, as released in January (Plasma 5.21), as superior to GNOME's Mutter and all other existing compositors. With use-damage there's already an improvement; using KDE's design might cut things down by a bit more + offer better performance on older machines that take longer to render things.
it is known that it has some issues with flickering |
On the new backends? (i.e. with --experimental-backends?) I don't see an existing issue that reports flickering on the new backends. Can you link it, or is this a new issue?

(Also, I have so far not experienced flickering in the 6 hours since I turned it on. Will report back later if I do experience it during the week.)
I'm not sure what you mean by "affection", but thanks for the second link - which mentions some residue issues* even with the experimental backends turned on.

*though noticeably less (which seems to be my experience as well, except I haven't had flickering issues just yet. I suspect that I probably might, though, given their reports)

I do wonder if switching to a KDE design would help fix the residual problems of use-damage. Ideally, as I mentioned before, I'd like to use it.
Ah, I'm sorry, English is not my primary language. I meant how ...
Ah, I see what you mean now. I understand your point, but I believe the fixes that ...
After two whole days of running with this for hours, I can report zero issues so far. I actually noticed that in i3 my windows now close a fraction of a second faster after turning on use-damage, as well as disabling glx-no-stencil (not sure which option did it). This is with some shadows, blur, glx-no-rebind-pixmap = true, glx-copy-from-front = false, the experimental backends, and the __GL_MaxFramesAllowed=1 flag, as well as completely removing glFinish() before compiling. Floating windows have no stutter when dragging, though I do see some ghosting. I'm on a 2013 IPS monitor which heavily ghosts anyway (even on Windows), so I can't tell if picom is causing jitter or it's just my monitor being bad.
I was originally thinking of doing some more testing before reporting back here, but it seems that I won't have much time to work on this in the coming weeks - so it might be best for me to report now, in the hope that someone can work on this in the meantime.

I've written a first and second draft on this (wrote a second as the first was lost during a power outage), which may or may not be clearer than what follows. But to include everything that I've discovered so far (+ some external links on similar implementations of the "KDE" algorithm, which are apparently being worked on right now for GNOME and some Wayland compositors), here it is in point form (as I'm now short on time):
Finally...
(Well, this took longer to write than I expected. And certainly, this is not brief. Hopefully, it is still readable.

EDIT: Forgot to note that I've been running my own fork as a daily driver for a few days now. I seem to notice an input latency difference, and I do have some results from Typometer to back this up, but the difference is not that great since I'm completely missing the GPU time + I'm not quite sure if my crude algorithm is missing frames and how often. I'll try doing some more profiling when I have time again.)
OK, this bugged me enough that I poured a bit more of the little spare time I had into this: kwand@0de29ef

I had wanted to see earlier if I could re-purpose the delayed drawing timers and callbacks that were originally used for ... According to my initial tests + starting to use it as a daily driver, it appears that this is indeed possible - and I can get rid of that ugly ...

I still don't have GPU timings though, and I highly suspect my CPU render timings are not correct (occasional reported render times of 14ms seem a bit high... especially since it usually reports 1-3ms, and I can confirm this in Nsight).

EDIT: Did a bit more testing (this time with --benchmark mode), and it appears not having the GPU timings is fatal - as in, the whole purpose of this design is defeated, since we take into account the max. time taken to render across the last few frames before deciding how much to delay. It appears that CPU timings are quite stable, and it is the GPU timings that vary (which can go as high as 7ms - but this seems a bit odd given I'm running an RTX 2080. Is there some inefficiency in the GLX code? For context, I was trying to debug stuttering with smooth scrolling in Chrome - and it was a good few frames of stutter (delayed too close to the vblank for several frames, as GPU time increased). Though, it seems odd that it can take up to 7ms to render a new frame from me just scrolling around, unless this is in fact normal. I'm not sure.)
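For anyone curious what re-purposing the delayed drawing timers looks like in practice, here is a rough libev sketch of the idea; delayed_draw_cb and schedule_delayed_draw are illustrative names, not identifiers from the linked commit:

```c
#include <ev.h>

/* Instead of painting as soon as the frame event arrives, arm a one-shot
 * timer and render in its callback, closer to the next vblank. */
static void delayed_draw_cb(EV_P_ ev_timer *w, int revents) {
	(void)w;
	(void)revents;
	/* paint + swap buffers would happen here */
}

static void schedule_delayed_draw(struct ev_loop *loop, ev_timer *draw_timer,
                                  double delay_sec) {
	ev_timer_stop(loop, draw_timer);
	ev_timer_init(draw_timer, delayed_draw_cb, delay_sec, 0.); /* one-shot */
	ev_timer_start(loop, draw_timer);
}
```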
So, I tested your improved-frame-timing branch, and I want to thank you for your work! It really feels better than what I have in the master branch on my NVIDIA GeForce GTX 1050 Ti (NVIDIA 460.67). Please keep working on this.
Would this site perhaps provide a means of extracting GPU timings? https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference |
Wow, I'm very glad to hear that! To be honest, I wasn't expecting this at all, as I still wasn't sure if it was just me noticing a difference :P (Though, credit for most of this work really belongs to the ones who came up with and implemented similar algorithms. I wouldn't have any idea of what I was doing if it wasn't for the clear explanations written by people like Vlad Zahoridii)
That's not quite what we need; we need OpenGL timer queries, since currently (if I understand correctly) only the GLX backend utilizes the GPU. (Xrender seems to be CPU-based?)

It's fine though, because I do have good news! I've managed to finally get OpenGL timer queries working, and the GPU timings now agree with the results in Nsight's profiler to three decimal places of a millisecond (that's to say, we have the same numbers - possibly with greater accuracy. Nsight only gives results as x.xxx ms, so I can't really check if the number I get back in nanoseconds matches completely. But the point is that we have more than enough precision for draw delaying :) )

Oddly, it appears that it's the CPU timings that aren't quite correct. Somehow, they can be off by 0.3 to 1.5 ms. This isn't that big of a problem, as I account for a buffer of 2ms on top of the maximum drawing time of the last N frames, where N is some hardcoded constant. Though, I am considering increasing this buffer (even though I don't quite want to - I might look for some alternatives first), as using the libev callback timer with a certain delay doesn't exactly ensure that picom resumes exactly after the specified delay time has passed - it can be off by 0.1-0.7ms, which, on top of the CPU timing error, can exceed that 2ms buffer and is enough for us to miss frames in the middle of an increasing GPU workload.

The frame missing is not bad though - it's much better than it was before, as we're now actually taking GPU times into account. Using it as a daily driver isn't bad at all, although there does seem to be some micro-stuttering (if that's even the correct term to use here - but it is much less annoying than the stuttering caused by several continuously missed frames that occurred previously). I'm going to have to look a bit more into what's causing the inaccurate CPU timings, but otherwise, performance is quite good. We're truly getting (in the case of 60Hz), on average, around ~3-16ms of latency* (with ...)

And once I clean up the code + add some safety checks, ensure that this delaying algorithm is used only for the new backends, ...

(I'll maybe add some initial commits if someone is interested in testing this early. Be warned that the issues mentioned above exist, and something bad may happen (but hopefully not - I haven't tested this myself. But please don't try this...) if you attempt to run this on anything other than the experimental GLX backend at 60Hz.)

*NOTE: I'm still not completely sure about these latency numbers as they stand for vgit-dac85 (perhaps I should get to compiling an unmodified picom to compare).
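For reference, the timer-query measurement described here looks roughly like the following - a sketch rather than the exact code in the fork; gpu_timer_begin/gpu_timer_end_ms are illustrative names:

```c
#include <GL/glew.h> /* any loader exposing the GL 3.3 / ARB_timer_query entry points */

/* A real implementation would keep a small ring of query objects and poll
 * GL_QUERY_RESULT_AVAILABLE so that reading the result never stalls the
 * pipeline the way a glFinish would. */
static GLuint gpu_query;

static void gpu_timer_begin(void) {
	if (gpu_query == 0)
		glGenQueries(1, &gpu_query);
	glBeginQuery(GL_TIME_ELAPSED, gpu_query);
}

static double gpu_timer_end_ms(void) {
	GLuint64 elapsed_ns = 0;
	glEndQuery(GL_TIME_ELAPSED);
	/* Blocking read for brevity; poll availability in real code. */
	glGetQueryObjectui64v(gpu_query, GL_QUERY_RESULT, &elapsed_ns);
	return (double)elapsed_ns / 1e6;
}
```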
I've added that initial commit to my fork, with GPU timing support and some changed constants:
The same warnings as before apply - though I've added a check to ensure early on that the delaying code won't be visited at all for the legacy backends. Still no support (yet) for the Xrender backend or refresh rates other than 60Hz. Code cleanup to follow, including trying to implement this for the Xrender backend (if possible).

I'm not entirely happy with the algorithm that figures out how much we should delay by - there is some room for improvement - though I'm starting to doubt whether the 1-2ms gains that would result are worth it. It might be a more productive use of time to look into whether the backends can be improved any further, simply reducing the render times, rather than improving the prediction algorithm. (Perhaps by experimenting with other possible backends, like Vulkan? I'm not sure if this helps for compositing.)
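The prediction step itself can be sketched like this, assuming a fixed refresh interval (e.g. ~16.67 ms at 60 Hz); the history length, 2 ms safety buffer and bookkeeping are illustrative stand-ins, not the fork's exact constants:

```c
/* Delay the next draw so that the worst recent render time plus a safety
 * margin still fits before the next vblank. */
#define RENDER_HISTORY 8    /* frames of history to consider */
#define SAFETY_MS      2.0  /* buffer for timer wake-up jitter and timing error */

static double render_ms[RENDER_HISTORY];
static int render_idx;

static void record_render_time(double cpu_ms, double gpu_ms) {
	render_ms[render_idx] = cpu_ms > gpu_ms ? cpu_ms : gpu_ms;
	render_idx = (render_idx + 1) % RENDER_HISTORY;
}

static double next_draw_delay_ms(double refresh_interval_ms) {
	double worst = 0.0;
	for (int i = 0; i < RENDER_HISTORY; i++)
		if (render_ms[i] > worst)
			worst = render_ms[i];
	double delay = refresh_interval_ms - worst - SAFETY_MS;
	return delay > 0.0 ? delay : 0.0; /* never schedule past the next vblank */
}
```

The resulting delay would then be handed to the delayed-draw timer sketched earlier.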
I'm starting to use your improved-frame-timing fork because it seems to minimize my problem with resizing an OpenGL terminal (alacritty) with a high-poll-rate mouse (see: #623). Note: I've tried with G-SYNC enabled too, and it improves things as well, but resizing is still slow - though not as slow as with the original picom.
Partially addressed (at least the title) in PR #641. Currently, I have no desire to revisit the old approach used here, using previous render times to guess future render times and delay accordingly, as it is too messy for my liking. |
Wow! This PR has definitively solved all my issues with a high-poll-rate mouse! Using mpv, I can even move a transparent window smoothly on top of two 60fps movies! Great job @kwand!
Have we tried using the "__GL_MaxFramesAllowed" flag instead of the current USLEEP workaround for NVIDIA drivers? Apparently, setting this environment variable to 1 will cause glXSwapBuffers() to block*, achieving the same functionality as glFinish() currently does, but without the busy-waiting problem on NVIDIA or the need for the USLEEP workaround.
As mentioned in #588, I'm currently running a patched version of picom that sets this flag to 1, instead of using USLEEP. And unlike the (embarrassing) issue of #588, I can very noticeably see the difference between the patched and unpatched versions.

Moving windows around in floating mode on i3 is a noticeable, jittery mess on the latest release, though the jitter is (somehow) somewhat better on the latest branch. To my eyes, the jitter is mostly eliminated by using "__GL_MaxFramesAllowed=1" - and this seems to be the same path that KDE took with their compositor after previously using USLEEP (though they seem to have now switched to a completely new compositing algorithm - more on this later).
I simply commented out setting __GL_YIELD to usleep and added this line to the driver workarounds code instead:
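Based on the description, the change would be along these lines - a reconstruction using setenv(), like the __GL_YIELD workaround it replaces, not necessarily the exact committed line:

```c
#include <stdlib.h>

/* Reconstruction of the described change: in the NVIDIA driver-workaround
 * path, instead of forcing __GL_YIELD=usleep, limit the driver to queueing a
 * single frame so that glXSwapBuffers() blocks. Needs to run before the GLX
 * context is created for the driver to pick it up. */
static void nvidia_maxframes_workaround(void) {
	setenv("__GL_MaxFramesAllowed", "1", 1);
}
```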
I will note that the jitter noticed on the latest release (on Manjaro repos at least; vgit-dac85) and on the latest branch doesn't seem to be constant. What I mean is, though there is a very noticeable jitter upon moving the floating window for the first few seconds (~3-5 secs), the jitter also mostly disappears if you persist in moving around the window for longer (~10+ seconds).
I assume this is simply related to how the USLEEP flag works? (As far as I understand it, with USLEEP the NVIDIA driver will stop its busy-waiting on glFinish if it doesn't return within a certain threshold, go to sleep, and hand control back to the CPU. I guess that persistently moving windows around raises this threshold - at the cost of CPU performance due to the busy-waiting?)
It seems to me that if we set this flag (i.e. disable triple buffering), we will get the intended blocking until the buffer swap occurs at vblank - except it will be glXSwapBuffers() itself that blocks for NVIDIA, not glFinish(), and without the busy-waiting.
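Schematically, the difference being described is which call paces the frame loop - a sketch, not picom's actual present path; render_frame() is an assumed helper:

```c
#include <GL/glx.h>

void render_frame(void); /* assumed helper: issue the GL draw calls for one frame */

/* Schematic frame loop. The point above is only about which call provides the
 * blocking that paces the loop to the display's refresh rate. */
void frame_loop(Display *dpy, GLXDrawable d, int use_glfinish) {
	for (;;) {
		render_frame();
		glXSwapBuffers(dpy, d);
		if (use_glfinish) {
			/* Status quo: with default frame queueing the swap returns early,
			 * so glFinish() is used to wait - and on NVIDIA it busy-waits. */
			glFinish();
		}
		/* With __GL_MaxFramesAllowed=1, only one frame may be in flight, so
		 * glXSwapBuffers() above already blocks (sleeping, not spinning) and
		 * no glFinish() is needed. */
	}
}
```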
I'm not sure if my experience is universal though, so there are some additional points worth investigating.
As I may have alluded to earlier, I don't think this is much of a permanent solution to input lag and jitter. Along with KDE, it seems that GNOME's Mutter compositor has also tried using this flag recently (8 months ago), though both have since abandoned it, as it also seems to have non-ideal problems of its own reported by NVIDIA users (what these problems as reported on KDE/GNOME translate into for picom, however, I don't know). I will note that none of them have gone back to using USLEEP or glFinish; they also seem to work differently from picom (they don't seem to use blocking), so it might not make sense to compare.

I've read a lot about KDE's current approach - which seems the most promising approach to improving picom - as well as many positive comments from KDE users (that everything is much smoother and with less latency), which is definitely surprising given how problematic the KDE/NVIDIA combination has been for years. Here are some of the most helpful links to understanding how their compositing now works - in particular, the approach used by picom (and previously KDE) and the 'fix' to the problem of frame delays inherent with the blocking method picom is currently using.

Unfortunately, I'm not too familiar with the picom codebase and how everything is handled yet, so my question is this: @yshui: Is there anything picom can learn and use from KDE's new approach (without having to rewrite the entire compositor)?

Also, apologies if I've totally misrepresented how picom currently does compositing. I'm mainly relying on the wiki and a cursory look at the code (and acknowledging that glFinish has now temporarily been enabled for NVIDIA since April, contra the old info in the wiki).