
Using "__GL_MaxFramesAllowed=1" for NVIDIA (and other possible improvements to 'fix' frame timings and input lag) #592

Closed
kwand opened this issue Feb 9, 2021 · 21 comments

Comments


kwand commented Feb 9, 2021

Have we tried using the "__GL_MaxFramesAllowed" flag instead of the current USLEEP workaround for NVIDIA drivers? Apparently, setting this environment variable to 1 will cause glXSwapBuffers() to block*, achieving the same functionality as glFinish() currently does, but without the busy-waiting problem on NVIDIA or the need for the USLEEP workaround.

As mentioned in #588, I'm currently running a patched version of picom that sets this flag to 1, instead of using USLEEP. And unlike the (embarrassing) issue of #588, I can very noticeably see the difference between the patched and unpatched versions.

Moving windows around in floating mode on i3 is a noticeable, jittery mess on the latest release, though the jitter is (somehow) somewhat better on the latest branch. To my eyes, the jitter is mostly eliminated by using "__GL_MaxFramesAllowed=1" - and this seems to be the same path that KDE took with their compositor after previously using USLEEP (though they have since switched to a completely new compositing algorithm - more on this later).

I simply commented out setting __GL_YIELD to usleep and added this line to the driver workarounds code instead:

setenv("__GL_MaxFramesAllowed", "1", true);

I will note that the jitter noticed on the latest release (on Manjaro repos at least; vgit-dac85) and on the latest branch doesn't seem to be constant. What I mean is, though there is a very noticeable jitter upon moving the floating window for the first few seconds (~3-5 secs), the jitter also mostly disappears if you persist in moving around the window for longer (~10+ seconds).

I assume this is simply related to how the USLEEP flag works? (As far as I understand it, with USLEEP the NVIDIA driver will stop busy-waiting on glFinish if it doesn't return within a certain threshold, go to sleep, and hand control back to the CPU. I guess that persistently moving windows around raises this threshold - at the cost of CPU performance due to the busy-waiting?)

It seems to me that if we set this flag (i.e. disable triple buffering), we get the intended blocking until the buffer swap occurs at vblank - except now it is glXSwapBuffers() itself that blocks for NVIDIA, not glFinish(), and without the busy-waiting.

I'm not sure if my experience is universal though, so some additional points for investigation are:

  • Is this simply my own experience, or is there a noticeable impact for others using this patch?
  • What do we lose by disabling triple buffering for NVIDIA? I've heard that triple buffering shouldn't be used for a compositor in the first place (by default, __GL_MaxFramesAllowed seems to be set to 2 for NVIDIA drivers), though I don't think I'm qualified to say whether that is true.
  • I didn't remove the glFinish() call in my patch, and I'm not quite sure what the effects are of leaving it there or removing it (for the NVIDIA use-case only - obviously, removing it entirely would break the current implementation for other drivers). Theoretically, if my understanding is correct, it shouldn't be doing anything (for NVIDIA), as it is only called after the buffer swap is complete, so it should simply return immediately.
  • *Allegedly, this feature was 'added' in version 435, but I couldn't find a definitive source on this from NVIDIA aside from second-hand forum discussions. This may be a problem for older NVIDIA drivers if true - but whether or not it is true should be investigated first.
  • (Also, what changes have been made since vgit-dac85 that would result in less jitter compared to the previous release?)

As I may have alluded to earlier, I don't think this is much of a permanent solution to input lag and jitter. Along with KDE, it seems that GNOME's Mutter compositor also tried using this flag recently (8 months ago), though both have since abandoned it, as it appears to cause non-ideal problems of its own for NVIDIA users (what those problems, as reported on KDE/GNOME, translate into for picom, however, I don't know).

I will note that none of them have gone back to using USLEEP or using glFinish.

I've read a lot about KDE's current approach - which seems the most promising route to improving picom - as well as many positive comments from KDE users (that everything is much smoother and with less latency), which is definitely surprising given how problematic the KDE/NVIDIA combination has been for years.

Here are some of the most helpful links to understanding how their compositing now works:

Unfortunately, I'm not too familiar with the picom codebase and how everything is handled yet, so my question is this: @yshui: Is there anything picom can learn and use from KDE's new approach? (without having to rewrite the entire compositor)

Also, apologies if I've totally misrepresented how picom currently does compositing. I'm mainly relying on the wiki and a cursory look at the code (and acknowledging that glFinish has now temporarily been enabled for NVIDIA since April, contra the old info in the wiki).

@kwand kwand changed the title Using "_GL_MaxFramesAllowed=1" for NVIDIA (and other possible improvements to 'fix' input lag) Using "_GL_MaxFramesAllowed=1" for NVIDIA (and other possible improvements to 'fix' frame timings and input lag) Feb 9, 2021
@kwand kwand changed the title Using "_GL_MaxFramesAllowed=1" for NVIDIA (and other possible improvements to 'fix' frame timings and input lag) Using "__GL_MaxFramesAllowed=1" for NVIDIA (and other possible improvements to 'fix' frame timings and input lag) Feb 10, 2021
@MitchMG2

I'm unfamiliar with the compositing portion of the code, but a quick look at render.c seems to indicate that the default is glFlush(), and glFinish() is only activated if picom is started with the --vsync-use-glfinish flag, as seen in the man page and the corresponding bool in config.h. I'm mentioning this because trying to use __GL_MaxFramesAllowed=1 as a bypass for glFinish() might not actually be achieving anything, because glFinish() isn't being used in the first place.

I've compiled with the __GL_MaxFramesAllowed=1 flag set, and don't see much stutter with floating windows in i3 at least.


kwand commented Feb 13, 2021

@mitchell-gil96 Are you using the experimental backends (--experimental-backends)?

Apologies if I haven't mentioned this somewhere before, but I'm mainly talking about the future improvement of picom. Almost all new features are already being exclusively added to the new backend. render.c is on track for deprecation and it shouldn't be used if the experimental/new backends are enabled.

There's a glFinish() call in the new backend files, and it is mentioned in numerous commits made by @yshui. (See the wiki as well for a description of the current approach, which also mentions glFinish.)

(Also, did you set the flag in the appropriate function as mentioned? I'm just confirming, as the way you've described it sounds like you added it as a compile flag, which I hope you didn't do as that would have no effect)


MitchMG2 commented Feb 13, 2021

I always use the experimental backends to get blur on GLX. I wasn't aware render.c was legacy code. I ran a basic grep search on my cloned directory and found most instances of glFinish within that file, and completely missed the call in /backend/gl/glx.c.

I edited line 18 of src/backend/driver.c before compiling. For good measure, I also recompiled with the glFinish() call at line 469 of glx.c removed. Will report back after a full day of use, but so far it doesn't seem to present any issues.

Unless I misunderstood your original issue, setting the max frames 1 flag essentially makes glFinish obsolete, passing the blocking to glXSwapBuffers?


kwand commented Feb 14, 2021

Looking forward to hearing back! I'm not sure how much of a difference you'll notice - it should honestly be small though hopefully still noticeable.

On a related note, I have done some more testing recently on the older versions (latest release from Oct) that yielded some surprising results. It appears that the jitter from using the USLEEP flag is dependent on the CPU clock speed. I thought that running a CPU stress test at 100% might expose some of the flaws with the USLEEP flag (such as late returning and missing vblanks), but surprisingly it performs better than normal. Though, using the MaxFramesAllowed flag seems to still offer the best experience (almost no jitter + smooth), and even under the stress test. I haven't tested whether running a stress test on an unmodified version of the latest branch gives similar performance to this MaxFrames patch yet, since as mentioned previously I noted there seems to be less jitter due to changes since the last release.

EDIT: Just returned from testing (with the same setup i.e. stress test) on the latest branch, unmodified with USLEEP enabled. Under the stress test, it seems to offer only similar performance to the last release (with USLEEP enabled) - which was odd since I somewhat expected it to perform better and perhaps even beat out this MaxFramesAllowed solution. (Also re-tested the MaxFramesAllowed version and the experience was again similar to what I previously reported, and still better than using USLEEP on the latest branch)

setting the max frames 1 flag essentially makes glFinish obsolete, passing the blocking to glXSwapBuffers?

For NVIDIA, yes. For all other drivers, however, the glFinish call has to be kept, as glXSwapBuffers might not block* even with this flag set. (Actually, I'm unsure whether this flag even has an effect on other drivers, since I've only read about it as a workaround for NVIDIA to get glXSwapBuffers to block.)

*Though, history-wise - according to what I've read in the linked KDE development issues - glXSwapBuffers() used to be a blocking call on all drivers until 5-10 years ago. The original specification also does not require drivers to implement it as a blocking call. As to why it used to block, I'm not sure - probably because triple buffering was not supported then? I'm not sure about this, but that is effectively what we're doing by setting the MaxFrames flag: disabling triple buffering - for picom at least - to force the call to block.

If this works and performs better, I like to view it as a more stable way of blocking than glFinish - which for some reason NVIDIA implemented using busy-waiting - and without the need for the USLEEP flag (which could be affecting more calls than just glFinish) that might potentially be causing picom to return late and miss vblanks.
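
In pseudo-patch form, what I mean for the swap/sync step is something like this (a sketch only - `dpy`, `window`, and `drivers` stand in for whatever the GLX backend actually has in scope at that point, and I haven't verified this is the exact right spot):

```c
#include <GL/glx.h>

// Sketch: present the frame, then only fall back to glFinish() on drivers
// where glXSwapBuffers() is not known to block until the swap completes.
static void present_frame(Display *dpy, GLXDrawable window, enum driver drivers) {
	glXSwapBuffers(dpy, window);
	if (!(drivers & DRIVER_NVIDIA)) {
		// Non-NVIDIA: the swap may return immediately, so keep glFinish()
		// to block until the frame has actually been presented.
		glFinish();
	}
	// On NVIDIA with __GL_MaxFramesAllowed=1, glXSwapBuffers() already
	// blocks, so an extra glFinish() would only add needless delay.
}
```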

I say "potentially" as I don't have firm evidence that this happening yet; I'm still trying to collect some concrete measurements like KDE has done here, but I'm having a bit of trouble tracing the events.


kwand commented Feb 15, 2021

On an unrelated note, I just discovered that use-damage = true has much better performance, without any downsides observed so far, on the experimental backends (I switched to the experimental backends only a few weeks ago but kept pretty much the same config; I had problems with damage on the old backends). So, in some respects, I'm no longer personally counting on a major rework to get better frame timings, as I'm quite satisfied with the performance now with use-damage = true.

That's not to say a rework that implements KDE's design wouldn't be worth it - I think it would be, and it seems that many recognize their current design, as released in January (Plasma 5.21), as superior to GNOME's Mutter and all other existing compositors.

With use-damage = true, frame delays seem to be down from 5-6+ frames to 1-3+ frames. Though, admittedly, this is eyeballed from moving a window around in i3 floating mode and seeing what the delay is from the cursor moving to the window - so I wouldn't take my numbers too seriously.

Using KDE's design might cut this down by a bit more + offer better performance on older machines that take longer to render things.

@absolutelynothelix

I just discovered that use-damage = true has much better performance, without any downsides observed so far, on the experimental backends

it is known that it has some issues with flickering


kwand commented Feb 15, 2021

it is known that it has some issues with flickering

On the new backends? (i.e. --experimental-backends). I'm aware there are flickering issues on the old backends, which is why I turned use-damage off previously. But they seem to have been fixed on the new backends.

I don't see an existing issue that reports flickering (on the new backends). Can you link it or is this a new issue?

(Also, I have not experienced any flickering so far in the 6 hours since I turned it on. Will report back later if I do experience it during the week.)


absolutelynothelix commented Feb 15, 2021

not sure, i'm not really following the flickering topic, and as i can see, people wasn't discussing --experimental-backends affection on this

see at least #401, #375, #242, #237


kwand commented Feb 15, 2021

I'm not sure what you mean by "affection", but thanks for the second link - which mentions some residual issues* even with the experimental backends turned on.

*though noticeably less (which seems to be my experience as well, except I haven't had flickering issues just yet. I suspect that I probably might though given their reports)

I do wonder if switching to a KDE-style design would help fix the residual problems of use-damage, as it does seem to me that there still might be too much frame/input delay going on, which may or may not be averaging beyond 1 frame. Unfortunately, I still can't collect measurements yet.

Ideally, as I mentioned before, I'd like to use gpuvis and check how rendering proceeds between vblanks, as KDE's development team has done. But unfortunately, it seems like there are some problems doing so while using an NVIDIA GPU. It does seem to work for AMD/Intel GPUs and iGPUs though - in case anyone is willing to grab measurements similar to Vlad Zahorodnii's work on KDE. See the screenshots in the follow-up discussion there. gpuvis seems to be quite a powerful tool.

@absolutelynothelix

I'm not sure what you mean by "affection"

ah, i'm sorry, english is not my primary language. i meant how --experimental-backends affects on these issues, people were not discussing this for some reason, though experimental backends should already exist back then.


kwand commented Feb 15, 2021

I'm not sure what you mean by "affection"

ah, i'm sorry, english is not my primary language. i meant how --experimental-backends affects on these issues, people were not discussing this for some reason, though experimental backends should already exist back then.

Ah, I see what you mean now. I understand your point, but I believe the fixes that --experimental-backends now provides were only added a few months ago (which is why you see reports only from December/January claiming it has (mostly) fixed the problem, while there were no mentions previously. A few of the issues you linked are quite old, going back to last June and earlier, when I don't think these fixes had been added to the new backends yet).

@MitchMG2

After two whole days of running with this for hours, I can report zero issues so far. I actually noticed that in i3 my windows now close a fraction of a second faster after turning on use-damage, as well as disabling glx-no-stencil (not sure which option did it).

This is with some shadows, blur, glx-no-rebind-pixmap = true, glx-copy-from-front = false, the GLX backend with --experimental-backends, and the __GL_MaxFramesAllowed=1 flag, as well as completely removing the glFinish() call before compiling.

Floating windows have no stutter when dragging, though I do see some ghosting. I'm on a 2013 IPS monitor which ghosts heavily anyway (even on Windows), so I can't tell if picom is causing jitter or if it's just my monitor being bad.

sandsmark added a commit to sandsmark/picom that referenced this issue Feb 25, 2021
sandsmark added a commit to sandsmark/picom that referenced this issue Mar 12, 2021

kwand commented Mar 20, 2021

I was originally thinking of doing some more testing before reporting back here, but it seems that I won't have much time to work on this in the coming weeks - so it might be best for me to report now, in the hope that someone can work on this in the meantime.

I've written a first and second draft of this (wrote a second as the first was lost during a power outage), which may or may not be clearer than what follows. But to include everything that I've discovered so far (+ some external links on similar implementations of the "KDE" algorithm, which are apparently being worked on right now for GNOME and some Wayland compositors), here it is in point form (as I'm now short on time):

  • I recently managed to get my hands on NVIDIA's Nsight Systems and used their profiler on picom. Thankfully, this time I've been able to get some useable data on frame timings.

    • Decided to profile picom on something that would update the screen at around 60fps (as my monitor refresh is 60Hz); went with the first 60fps video I could find on YT and added some random mouse movement during the profiling just to get some non-constant updates.
    • Here are the results of that profile on vgit-dac85, or rather a zoomed-in snippet as the rest of the results look almost identical. (I can provide the full file if anyone is interested, and has the software necessary i.e. Nsight Systems to open it)
  • After profiling, I'm no longer sure of what __GL_MaxFramesAllowed does, if anything at all. Hopefully, as you can see, glXSwapBuffers seems to block already even though I'm quite sure vgit-dac85 does not do anything to the __GL_MaxFramesAllowed flag.

    • I also reran the same profiling on the latest branch (next) modified with __GL_MaxFramesAllowed flag set. The results were identical, except I now saw glFinish calls - which was surprising since I thought calling glFinish() had been enabled for NVIDIA since April 2020 (based on commit history). No glFinish calls showed up for vgit-dac85, even though I believe it was released in October?
    • Note that glFinish is of course a useless call here - for NVIDIA at least. SwapBuffers already seems to block up to the vblank, and glFinish is only used to ensure picom blocks until vblank. It just seems to add an unnecessary ~1ms delay. I removed glFinish completely afterwards and reprofiled - the results were now completely identical to vgit-dac85.
  • Lots of questions came to mind after profiling; questions I'm not sure I'm qualified to answer, as I'm almost completely unfamiliar with how frame rendering precisely works, from the X server to the GPU. I'll list a few here, in the hope that someone finds at least some of them useful and not stupid questions (resulting from my own ignorance):

    • It seems that the GPU does not start processing a frame until after a delay of one vblank? (Notice that under "Frame duration", frame #674, for example, does not seem to be processed until what appears to be the last frame, #673, is - if I understand correctly - 'swapped to the display'? Apologies if this does not make sense, but hopefully you see the disconnect between the frame numbers for the CPU and GPU; the GPU is one frame behind.)
    • While this does not seem as bad as I originally expected, it appears that picom in fact has a constant 2-frame input delay? I say this because it seems that picom updates itself with the X server right after a vblank, then after some processing hands the frame off to the GPU for additional processing and presenting to the display, and then blocks because of glXSwapBuffers - but the GPU does not even start this processing until after the next vblank. That is, more concretely, it seems that the GPU only begins processing frame #673 while the CPU/picom is beginning to prepare frame #674 - and frame #673 is not actually presented to the display until the end of frame #674. If I understand this correctly, it takes about two frames for input processed by picom to show up on screen.
    • Also, what is the effect of glXSwapBuffers (seemingly) blocking most of the time? Does this mean that any input/other update events that occur while glXSwapBuffers is blocking will not be processed until it returns, i.e. after the next vblank?
      • If this is true, it seems like our input latency is even worse than 2 frames, or ~32ms for a 60Hz display. Since glXSwapBuffers seems to block consistently for around ~13-15ms, the true delay of any events that occur while glXSwapBuffers is blocking would be around (0 to ~15ms) + 32ms = ~32-47ms of latency, i.e. close to three frames.
    • What's stranger is that this isn't the only behaviour for frame rendering. Every time I profile, I do get a few rare runs of one to four consecutive frames whose 'rendering pattern' looks like this. Here, it seems like GPU processing of the same frame completes just before vblank occurs, which seems to mean that in these cases there is no 1-frame delay between CPU and GPU processing. We also get the minimal amount of time being spent blocking on glXSwapBuffers. (Apologies that I couldn't zoom in further without something else being cut off, but the 'blue call' under "OpenGL API" is the glXSwapBuffers call.)

Finally...

  • I tried my hand at implementing (a very crude version of) KDE's current frame rendering algorithm. I'm again linking the blog post by the main dev behind the KDE changes as a source for understanding their algorithm, though I recently discovered that many other compositors have implemented or are implementing similar things:

  • Here's my fork where I implemented some delaying, and the results of profiling this fork. If my reasoning on frame timings is correct (which, honestly, I'm not at all sure it is, given this is not what I usually code), it seems that with delaying we can get much better latency. 'Delaying' here means using previous render times as a guide and delaying picom's updating with the X server (I'm assuming this is what handle_pending_updates does in draw_callback_impl) for as long as possible - but within reason, such that we can still process the frame on both the CPU and GPU and send it off by the next vblank. That is, we want to keep input delay within ~16ms, i.e. less than one frame, for events that occurred since the last vblank and before we updated with the X server. Of course, we still get more than one frame of delay for anything that happens once draw_callback_impl has started, as it won't be processed until the next call of draw_callback_impl - which is after the next vblank. (Apologies if I'm not doing a good job explaining this - please do read some of the links above as they do a much better job than me. There's also a rough sketch of the idea after this list.)

  • Now, the problems my crude implementation has i.e. why this is far from usable code:

    • It uses nanosleep to do the delaying. Ideally, I guess this should be reimplemented using callbacks with libev, since the rest of the code does this. (I'm just not familiar with libev or the picom.c file outside of draw_callback_impl. I only have a rough idea of what the rest of the code does right now.)
    • It doesn't seem to be calculating rendering times correctly, i.e. the time from the start of draw_callback_impl to when processing finishes on the GPU. The numbers that my code is calculating are either consistently lower (2-3 ms lower) or absurdly higher (24ms?!?) than what I see in the profiling results with Nsight Systems. (That is, well within 16ms, and usually around 5-8ms rendering time, on the CPU and GPU combined.)
      • Based on my limited knowledge, and after some debugging, I think this is due to the fact that I'm only measuring CPU time. From what I can see in Nsight Systems, the GPU usually does not start processing until after the CPU calls glXSwapBuffers. So, I'm completely missing GPU processing time, which seems to vary between 1-4ms, with my basic configuration of some light shadows + fading (no blurring or transparency).
      • I've currently added 5ms to the render times to temporarily account for GPU render time (even though this on average makes glXSwapBuffers block for much longer than it has to. I've tried buffers smaller than 5ms, but this usually causes delays that make us miss the next vblank).
      • As for the absurdly high reported render times of 24ms and more, I'm quite frankly at a loss as to what is causing this. Perhaps I'm just not doing the timing in the right places? (To prevent these "render times" - which I've checked do not actually take this long in the profiling results - from contaminating the render log, I have temporarily added a hack that just ignores any render time greater than 16ms, which is my monitor's refresh interval.)
    • I've hardcoded the refresh interval (NS_PER_SEC / 60). I haven't looked into whether update_refresh_rate() works (since it seems to be used only for sw-opti, a now-deprecated option) or if the refresh rate and refresh interval vars in the session struct work. (Yet.)
  • NOTE: If anyone is interested in running my fork, please make sure you add back glFinish IF you are on a non-NVIDIA system (warning: I have not tested this on non-NVIDIA system yet - though, theoretically it should work if either glFinish or glXSwapBuffers blocks) AND set your refresh rate properly in the renlog functions (you can just replace 60 with your refresh rate)
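
For anyone who doesn't want to dig through the fork, here's the gist of the delaying idea boiled down (this is not my actual code - the window size and buffer are just the placeholder values discussed above, the refresh interval is still hardcoded to 60Hz, and the ev_timer part is how I expect it to look once the nanosleep hack is replaced):

```c
#include <ev.h>

#define N_LAST_FRAMES 30                   // how many recent frames to look at (placeholder)
#define REFRESH_INTERVAL_MS (1000.0 / 60)  // hardcoded 60Hz, same caveat as above
#define RENDER_BUFFER_MS 5.0               // slack for unmeasured GPU time (see above)

static double render_times_ms[N_LAST_FRAMES];  // ring buffer of recent render times
static int render_time_idx = 0;

// Record how long the last frame took, from the start of draw_callback_impl
// until rendering finished.
static void record_render_time(double ms) {
	render_times_ms[render_time_idx++ % N_LAST_FRAMES] = ms;
}

// Predict the next frame's render time as the worst case over the window.
static double predicted_render_time_ms(void) {
	double max = 0;
	for (int i = 0; i < N_LAST_FRAMES; i++) {
		if (render_times_ms[i] > max) {
			max = render_times_ms[i];
		}
	}
	return max;
}

// After a vblank, don't fetch updates and render immediately: wait (via an
// ev_timer rather than nanosleep) for as long as we can afford, so that the
// frame is produced as close to the next vblank as we dare.
static void schedule_next_render(struct ev_loop *loop, ev_timer *draw_timer,
                                 void (*draw_cb)(struct ev_loop *, ev_timer *, int)) {
	double budget_ms = predicted_render_time_ms() + RENDER_BUFFER_MS;
	double delay_ms = REFRESH_INTERVAL_MS - budget_ms;
	if (delay_ms < 0) {
		delay_ms = 0;  // rendering eats the whole interval; draw right away
	}
	ev_timer_init(draw_timer, draw_cb, delay_ms / 1000.0, 0.);
	ev_timer_start(loop, draw_timer);
}
```

The real thing also has to rearm the timer every vblank and feed the measured CPU (and, eventually, GPU) time back in via record_render_time, but that's the core of it.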

(Well, this took longer to write than I expected. And certainly, this is not brief. Hopefully, it is still readable. EDIT: Forgot to note that I've been running my own fork as a daily driver for a few days now. I seem to notice an input latency difference, and I do have some results from Typometer to back this up, but the difference is not that great since I'm completely missing the GPU time + I'm not quite sure if my crude algorithm is missing frames and how often. I'll try doing some more profiling when I have time again.)


kwand commented Mar 22, 2021

OK, this bugged me enough that I poured a bit more of the little spare time I had into this. kwand@0de29ef

I had wanted to see earlier if I could re-purpose the delayed drawing timers and callbacks that were originally used for --sw-opti, which IIRC is slated for deprecation (and users have been warned about it for a while now).

According to my initial tests + starting to use it as a daily driver, it appears that this is indeed possible - and I can get rid of that ugly nanosleep call (which seemed to interfere with things already previously).

I still don't have GPU timings though, and I highly suspect my CPU render timings are not correct (occasional reported render times of 14ms seem a bit high... especially since it usually reports 1-3ms, and I can confirm this in Nsight).

EDIT: Did a bit more testing (this time with --benchmark mode), and it appears not having the GPU timings is fatal - as in, the whole purpose of this design is defeated, since we take into account the maximum time taken to render across the last few frames before deciding how much to delay. It appears that CPU timings are quite stable, and it is the GPU timings that vary (they can go as high as 7ms - which seems a bit odd given I'm running an RTX 2080. Is there some inefficiency in the GLX code?).

For context, I was trying to debug stuttering with smooth scrolling in Chrome - and it was a good few frames of stutter (delayed too close to the vblank for several frames, as GPU time increased). Though, it seems odd that it can take up to 7ms to render a new frame from me just scrolling around, unless this is in fact normal. I'm not sure.

@Cabopust

So, I tested your improved-frame-timing branch, and I want to thank you for your work! It really feels better than what I have in the master branch on my NVIDIA GeForce GTX 1050 Ti (NVIDIA 460.67). Please keep working on this.

@MitchMG2

Would this site perhaps provide a means of extracting GPU timings?
Currently looking at it now, but I have absolutely no idea what I'm looking for. If you can point out what specific GPU timings you need, I can try to find them. It's a lot of documentation to go through.

https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference


kwand commented Mar 23, 2021

So, I tested your improved-frame-timing branch, and I want to thank you for your work! It really feels better than what I have in the master branch on my NVIDIA GeForce GTX 1050 Ti (NVIDIA 460.67). Please keep working on this.

Wow, I'm very glad to hear that! To be honest, I wasn't expecting this at all, as I still wasn't sure if it was just me noticing a difference :P (Though, credit for most of this work really belongs to the ones who came up with and implemented similar algorithms. I wouldn't have any idea of what I was doing if it wasn't for the clear explanations written by people like Vlad Zahorodnii.)

Would this site perhaps provide a means of extracting GPU timings?
Currently looking at it now, but I have absolutely no idea what I'm looking for. If you can point out what specific GPU timings you need, I can try to find them. It's a lot of documentation to go through.

https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference

That's not quite what we need; we need OpenGL timer queries since currently (if I understand correctly) only the GLX backend utilizes the GPU. (Xrender seems to be CPU-based?)

It's fine though, because I do have good news! I've managed to finally get OpenGL timer queries working, and the GPU timings now match the results in Nsight's profiler to three decimal places of a millisecond (that's to say, we get the same numbers - possibly with greater accuracy. Nsight only gives results as x.xxx ms, so I can't really check whether the number I get back in nanoseconds matches completely. But the point is that we have more precision than we need for draw delaying :) )
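
For reference, the timer queries look roughly like this (a bare-bones sketch of GL_TIME_ELAPSED usage rather than the exact code in my fork; it needs GL 3.3 / ARB_timer_query, and the result should be read back on a later frame to avoid stalling):

```c
#include <stdbool.h>
#include <GL/glew.h>  // or whichever GL loader/header declares the query functions

static GLuint gpu_query;
static bool gpu_query_inflight = false;

// Call once after the GL context is created.
static void gpu_timer_init(void) {
	glGenQueries(1, &gpu_query);
}

// Bracket the backend's draw calls for one frame.
static void gpu_timer_begin(void) {
	glBeginQuery(GL_TIME_ELAPSED, gpu_query);
}

static void gpu_timer_end(void) {
	glEndQuery(GL_TIME_ELAPSED);
	gpu_query_inflight = true;
}

// Later (e.g. at the start of the next frame), fetch the GPU time.
// Returns milliseconds, or a negative value if the result isn't ready yet.
static double gpu_timer_poll_ms(void) {
	if (!gpu_query_inflight) {
		return -1;
	}
	GLint available = 0;
	glGetQueryObjectiv(gpu_query, GL_QUERY_RESULT_AVAILABLE, &available);
	if (!available) {
		return -1;
	}
	GLuint64 ns = 0;
	glGetQueryObjectui64v(gpu_query, GL_QUERY_RESULT, &ns);
	gpu_query_inflight = false;
	return (double)ns / 1e6;  // nanoseconds -> milliseconds
}
```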

Oddly, it appears that it's the CPU timings that aren't quite correct. Somehow, they can be off by 0.3 to 1.5 ms. This isn't that big of a problem, as I account for a buffer of 2ms on top of the maximum drawing time of the last N frames, where N is some hardcoded constant. Though, I am considering increasing this buffer (even though I don't quite want to - I might look for some alternatives first), as using the libev callback timer with a certain delay doesn't exactly ensure that picom resumes exactly after the specified delay has passed - it can be off by 0.1-0.7ms, which, combined with the CPU timing error, can exceed that 2ms buffer and is enough for us to miss frames in the middle of an increasing GPU workload.

The frame missing is not bad though - it's much better than it was before, as we're now actually taking GPU times into account. Using it as a daily driver isn't bad at all, although there does seem to be some micro-stuttering (if that's even the correct term to use here - but it is much less annoying than the stuttering caused by several continuously missed frames that occurred previously).

I'm going to have to look a bit more into what's causing the inaccurate CPU timings, but otherwise, performance is quite good. We're truly getting (in the case of 60Hz), on average, around ~3-16ms of latency (the time spent blocking in glXSwapBuffers, plus 16ms of latency if the X server changes happen during those 1-16ms where it is blocking - this should usually be around ~3-5ms, unless we miss a frame, which means ~17-19ms of latency, and at most 32ms - but misses should be isolated incidents, with the next frame very likely not being a miss given the algorithm). Anyhow, this seems much better than the ~32-47ms (or sometimes ~16-32ms) 'average' latency from before.* The frame drops aren't that noticeable, at least to my eyes.

And once I clean up the code + add some safety checks, ensure that this delaying algorithm is used only for --experimental-backends (I personally have no plans to port it to the legacy backends unless this is both possible and simple), and add the same functionality to the (experimental) Xrender backend, I think we can proceed to a PR.

(I'll maybe add some initial commits if someone is interested in testing this early. Be warned that the issues mentioned above exist, and something bad may happen (but hopefully not - I haven't tested this myself, but please don't try it...) if you attempt to run this on anything other than --experimental-backends with the GLX backend and an NVIDIA GPU.)


*NOTE: I'm still not completely sure about these latency numbers as they stand for vgit-dac85 (perhaps I should get around to compiling an unmodified next branch binary to test against). But I'm quite sure the minimum latency there is 16ms, since the GPU only starts rendering a frame after a vblank, when the CPU has already moved on to the next frame.


kwand commented Mar 23, 2021

I've added that initial commit to my fork, with GPU timing support and some changed constants:

  • Increased the buffer to 4ms - in reality, this is more like a 2ms buffer, as the CPU timings are still off by around 0.3-1ms (no idea why yet) and returning from the delay callback can take up to ~1ms.
  • I've also reduced the number of last presented frames whose render times we take into account before figuring out how much to delay, from 30 to 15. This should mean somewhat better latency, with a risk of slightly more frame drops than before - but not by much, as there's quite a bit of noise in the render times. We could have a 7ms render time followed by several 1-3ms render times; since we previously took the maximum of the render times of the last 30 frames, this meant we would essentially be predicting that the 1-3ms renders will also take 7ms, and thus only delaying by 16.67ms - (7ms + ~2ms buffer) ≈ 7.7ms when we could have easily delayed by ~11ms instead. (I'm more concerned about latency incurred by the time we spend blocking on glXSwapBuffers. The more we can delay, the less time we spend blocking.)

The same warnings as before apply - though I've added a check to ensure early on that the delaying code won't be reached at all for the legacy backends. Still no support (yet) for the Xrender backend or refresh rates other than 60Hz.
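
On the hardcoded 60Hz: if anyone wants to experiment before I get to it, the refresh rate can in principle be pulled from XRandR along these lines (a sketch only - picom may already cache this somewhere in the session struct, and this legacy XRRConfigCurrentRate API won't handle per-output rates on multi-monitor setups):

```c
#include <X11/Xlib.h>
#include <X11/extensions/Xrandr.h>

// Returns the current refresh interval in nanoseconds, or a 60Hz fallback
// if the query fails.
static long refresh_interval_ns(Display *dpy) {
	long interval = 1000000000L / 60;  // fallback: 60Hz
	XRRScreenConfiguration *cfg =
	    XRRGetScreenInfo(dpy, DefaultRootWindow(dpy));
	if (cfg != NULL) {
		short rate = XRRConfigCurrentRate(cfg);  // in Hz
		if (rate > 0) {
			interval = 1000000000L / rate;
		}
		XRRFreeScreenConfigInfo(cfg);
	}
	return interval;
}
```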

Code cleanup to follow, including trying to implement this for the Xrender backend (if possible). I'm not entirely happy with the algorithm that figures out how much we should delay by - there is some room for improvement - though I'm starting to doubt whether the 1-2ms gains that would result are worth it. It might be a more productive use of time to look into whether the backends can be improved further, to simply reduce the render times, rather than improving the prediction algorithm. (Perhaps by experimenting with other possible backends, like Vulkan? I'm not sure if that helps for compositing.)


zen2 commented May 2, 2021

I'm starting to use your improved-frame-timing fork because it seems to minimize my problem with resizing an OpenGL terminal (alacritty) using a high-poll-rate mouse (see #623).

Note: I've also tried with G-SYNC enabled, and it improves things too, but resizing is still slow - though not as much as with the original picom.


kwand commented Jul 10, 2021

Partially addressed (at least the title) in PR #641. Currently, I have no desire to revisit the old approach used here, using previous render times to guess future render times and delay accordingly, as it is too messy for my liking.

@kwand kwand closed this as completed Jul 10, 2021

zen2 commented Jul 14, 2021

Partially addressed (at least the title) in PR #641. Currently, I have no desire to revisit the old approach used here, using previous render times to guess future render times and delay accordingly, as it is too messy for my liking.

Wow! This PR has definitively solved all my issues with a high-poll-rate mouse!
Everything works like a charm at full 144 Hz. No glitches, it's so smooth. You made my day ^^
So kwand's GL_MaxFramesAllowed branch was better than the original picom 8.2, but the current picom with this PR is really perfection! I've never had such a smooth experience with a compositor in 20 years of Linux!

Using mpv, I can even smoothly move a transparent window on top of two 60fps movies!
Note: using two vaapi mpv vo drivers, or one gpu (shader) + one vaapi, is OK. With one vdpau or two gpu vo drivers, it's sluggish, but I suppose that's an extreme test that has more to do with the NVIDIA drivers than with picom.

So, good job @kwand!
