
Accurate event for when a swapchain image is visible on screen #370

Closed
haasn opened this issue Sep 18, 2016 · 24 comments

@haasn

haasn commented Sep 18, 2016

I see no way currently to figure out when a swapchain image is actually visible on the screen.

Imagine an application which needs 4ms to execute a draw call and is running on a 16ms vsync display. Here's what a timeline could look like (correct me if I'm wrong), supposing that we start the application immediately after a vsync has already happened. (A rough C sketch of the setup steps follows the list.)

  1. t=0ms: Swapchain is created and all command buffers drawing to its images are recorded. Each image is guarded by its own semaphore.
  2. t=0ms: The application acquires the next image for use, which will signal the semaphore and (optionally) a fence
  3. t=0ms: The application submits a draw command which will wait for the signal, remove it, and resignal it once done
  4. t=0ms: The application queues up the image for presentation, which will wait for signal and remove it again
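
For concreteness, here's a rough C sketch of steps 2-4, using the two-semaphore pattern clarified later in this thread (assuming pre-created semaphores acquired and done, a pre-recorded command buffer per image in cmdbuf[], and handles dev, queue, swapchain; error handling omitted):

    uint32_t i;
    vkAcquireNextImageKHR(dev, swapchain, UINT64_MAX, acquired,
                          VK_NULL_HANDLE /* or a fence */, &i);     /* step 2 */

    VkPipelineStageFlags stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    VkSubmitInfo submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1,   .pWaitSemaphores = &acquired,
        .pWaitDstStageMask = &stage,
        .commandBufferCount = 1,   .pCommandBuffers = &cmdbuf[i],
        .signalSemaphoreCount = 1, .pSignalSemaphores = &done,
    };
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);               /* step 3 */

    VkPresentInfoKHR present = {
        .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
        .waitSemaphoreCount = 1, .pWaitSemaphores = &done,
        .swapchainCount = 1, .pSwapchains = &swapchain,
        .pImageIndices = &i,
    };
    vkQueuePresentKHR(queue, &present);                             /* step 4 */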

After this batch of setup, the following things happen:
3. t=0ms: The semaphore is signalled right away, and (optionally) the fence is triggered, indicating that the image acquired in step 2 is available for use. The semaphore being signalled allows the draw command to start
4. t=4ms: The draw command finishes, and signals the semaphore again. This allows the GPU to start using the image for presentation (removing the signal). But it is not visible yet, because the next page flip has not yet occurred
5. t=16ms: The GPU flips pages and actually starts displaying the image on screen.
... at this point it is assumed that the application also does whatever is necessary for drawing the next frame
6. t=32ms: The GPU flips pages again and stops using the surface (signalling the semaphore). Assume it takes 1ms for the image to get freed up and be reusable again
7. t=33ms: The application would be able to acquire the image again (i.e. triggering the fence)

To summarize: on the CPU side of things, I can get accurate information about the following points in time:

  1. The image is ready for use (t=0ms, t=33ms)
  2. The draw command completes, but the image is not yet visible (e.g. by triggering an event from the end of the command buffer)

But I can't seem to get any reliable information about t=16ms, i.e. when the frame I just submitted is actually visible. This is important to me because I need to measure display latency and effective refresh rate accurately.

The problem gets worse if I use a large swapchain. For example, suppose my swapchain is size 4.

In the first world, i.e. where I wait on the fence indicating that the image is ready for use again, I would measure differences in frame times something like this:

  1. t=0ms -> delta = 0ms
  2. t=0ms -> delta = 0ms
  3. t=0ms -> delta = 0ms
  4. t=0ms -> delta = 0ms
  5. t=17ms -> delta = 17ms
  6. t=33ms -> delta = 16ms
  7. t=49ms -> delta = 16ms
    ...

In the second world, i.e. where I trigger an event once I've finished rendering and wait on that to complete, I would measure frame times like this:

  1. t=4ms -> delta = 4ms
  2. t=8ms -> delta = 4ms
  3. t=12ms -> delta = 4ms
  4. t=16ms -> delta = 4ms
  5. t=20ms -> delta = 4ms
  6. t=36ms -> delta = 16ms
  7. t=52ms -> delta = 16ms
    ...

Basically, they all converge to the true vsync timing (16ms) in the limit, but the measurements at the start will always be off since the GPU can already acquire the next image and/or render to it well in advance of when it will actually be used.

How do you advise accomplishing what I want? (Measuring the real delay between submitting a frame and it being visible on screen)

@ianelliottus

Hi @haasn,

What you are asking for is very reasonable. Unfortunately, we don't have a solution for you at this time. Khronos is working on it, but I'm sorry to say that there's no estimated time for when we'll be shipping a solution.

I'd like to understand what you want, to compare it with other requests we've received. Is your goal to call vkQueuePresentKHR() and then be able to find out when the image(s) are actually presented? Something different? Something additional?

To clarify, your description makes it sound like you are using the same semaphore for multiple purposes, which is not correct. Was that just to make it easier for you to describe?

Thanks for your input/feedback!
Ian Elliott

@haasn
Author

haasn commented Sep 21, 2016

Is your goal to call vkQueuePresentKHR() and then be able to find out when the image(s) are actually presented? Something different? Something additional?

My ultimate goal is to keep audio and video playback synchronized while minimizing glitches due to repeated or dropped frames, which requires measuring 1. the display refresh rate, and 2. frame skips.

In the case of 1., I do not want to rely on the EDID information or “reported” display refresh rate alone; I want to measure it in realtime, since both can be inconsistent and subtly different from the true rate.

In the case of 2., I need to know when I've dropped a vsync due to rendering too slowly. For example, imagine a program which uses a swapchain of size 4, acquires 4 images, submits 4 draw calls, and makes 4 vkQueuePresentKHR calls at t=0ms. Depending on how long the draw calls actually took to execute, the semaphore guarding each rendered image might not be triggered in time for the corresponding page flip, so it might be the case that frame 2 was displayed for 32ms instead of 16ms. In this case, I need to know about it, so I can resynchronize the video and audio.

It's worth noting that the approaches I have already outlined (waiting on an event which the end of my rendering command emits, and waiting on a fence signalling the next image was acquired) both cover my needs already, in the limit. The only complaint I have about them is that the timing gets thrown off near the beginning of playback, which I'm trying to minimize since it can throw off stuff like averaging filters for the duration of the averaging window.

If I had to design an API for this myself, I would loosely suggest the following:

  1. Add a ‘fence’ parameter to vkQueuePresentKHR, which gets signalled once the queued image is made visible. This works fine for FIFO-like presentation modes, but it won't necessarily work reliably for mailbox swapchains, because a queued image might never become visible (e.g. with triple buffering). Personally, I have no use for mailbox mode, but it might still be worth thinking about.
  2. Add an API call to block until the next page flip, and perhaps also report the index of the swapchain image that got made visible in this page flip, or -1 if none.

Of the two, option 2 is probably the more powerful approach, since it solves a number of problems:

  1. Lets me explicitly detect missed vsyncs without having to rely on comparing the actual and expected timing myself
  2. Lets me know in retrospect how long it took for a given frame to become visible (by recording the timestamps and indices returned)
  3. Lets me easily wait however long it takes until a given image is visible (by looping)
  4. Lets me synchronize the start of playback to a vsync boundary (which can solve a few edge cases)
  5. Works well even with mailbox-style swapchains

It also requires no changes to existing API calls. So all things considered, that's the approach I'd be happiest with, I think. (A hypothetical signature for option 2 is sketched below.)
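
Purely for illustration, here is a hypothetical C signature for option 2 - to be clear, nothing like this exists in any Vulkan spec or extension:

    /* HYPOTHETICAL -- a sketch of option 2, not a real Vulkan entry point. */
    VkResult vkWaitForNextPageFlipEXT(
        VkDevice       device,
        VkSwapchainKHR swapchain,
        uint64_t       timeout,       /* nanoseconds, as for vkWaitForFences */
        uint32_t      *pImageIndex);  /* image made visible by this flip,
                                         or UINT32_MAX if none */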

To clarify, your description makes it sound like you are using the same semaphore for multiple purposes, which is not correct. Was that just to make it easier for you to describe?

I wrote this post before having a solid understanding of the rules for semaphore use and ordering. You're right in that you'd usually use a pair of semaphores for each image in a swapchain. That said, I think you could re-use the same semaphore for both directions as long as it's done on the same VkQueue. Either way, I don't think the distinction is meaningful for this problem.

@ianelliottus

@haasn, thanks for your input! It makes sense and will help us design a good solution.

To clarify, your description makes it sound like you are using the same semaphore for multiple purposes, which is not correct. Was that just to make it easier for you to describe?
I wrote this post before having a solid understanding of the rules for semaphore use and ordering. ... Either way, I don't think the distinction is meaningful for this problem.

Yes, it was orthogonal to the main topic (just an FYI).

@ghost

ghost commented May 18, 2017

Without knowing much about Vulkan, I think such an API should provide the following mechanisms:

  • retrieve the current refresh rate
  • retrieve the time of the most recent swap event
    • a vsync counter, which is incremented on each "physical" screen refresh
    • or by real time (would be useful for gsync/freesync)
    • possibly with a flag that determines whether an image, whose targeted display time is in the past, should be displayed anyway for at least 1 refresh, or should be dropped
  • possibly retrieve or set the current display latency (if that even makes sense on Vulkan's level, I don't know) (the MS DXGI API has this)
  • possibly allow very quick refresh rate changing (useful for displays and connects which support it, and could be emulated by gsync/freesync) (the MS DXGI API has this)
  • set explicitly when a queued image should be displayed
    • by a vsync number (for example for intentionally skipping a number of refresh cycles)
    • or by real time (would be useful for gsync/freesync)
  • feedback when a previously queued image was displayed
    • reporting the time (again by vsync time or real time (or both))
    • there should be some sort of event mechanism, that avoids the need to block in the caller
    • of course it must be possible to associate this feedback with input images (some APIs report feedback for a single past frame, and make it surprisingly tricky to associate it with a user's swap call)

Here are some links to other display APIs, which try to deal with this, for better or worse, in no specific order:
https://cgit.freedesktop.org/wayland/wayland-protocols/tree/stable/presentation-time/presentation-time.xml
https://www.khronos.org/registry/OpenGL/extensions/OML/GLX_OML_sync_control.txt
https://msdn.microsoft.com/en-us/library/windows/desktop/bb173060.aspx (and others)
http://http.download.nvidia.com/XFree86/vdpau/doxygen/html/group___vdp_presentation_queue.html

@ratchetfreak

VK_GOOGLE_display_timing covers nearly all of that already...

@haasn
Author

haasn commented May 20, 2017

It's worth pointing out that VK_KHX_display_control also covers some of this, notably vkRegisterDisplayEventEXT (allows signalling a VkFence when a frame becomes visible) and vkGetSwapchainCounterEXT (allows counting the number of vblanks on a display).

More interestingly, nvidia has added support for VK_KHX_display_control in their new 381.22 drivers, while there's no support for VK_GOOGLE_display_timing (yet).
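
A minimal sketch of how those two calls combine, using the VK_EXT_display_control names (this assumes the extension is enabled, a VkDisplayKHR is already known, the swapchain was created with the vblank surface counter enabled, and the function pointers have been loaded):

    /* Register a one-shot fence that signals at the next first-pixel-out: */
    VkDisplayEventInfoEXT event_info = {
        .sType = VK_STRUCTURE_TYPE_DISPLAY_EVENT_INFO_EXT,
        .displayEvent = VK_DISPLAY_EVENT_TYPE_FIRST_PIXEL_OUT_EXT,
    };
    VkFence vblank_fence;
    vkRegisterDisplayEventEXT(device, display, &event_info, NULL, &vblank_fence);
    vkWaitForFences(device, 1, &vblank_fence, VK_TRUE, UINT64_MAX);
    vkDestroyFence(device, vblank_fence, NULL);  /* not reusable; see below */

    /* Query how many vblanks have elapsed on the swapchain's surface: */
    uint64_t vblanks;
    vkGetSwapchainCounterEXT(device, swapchain,
                             VK_SURFACE_COUNTER_VBLANK_BIT_EXT, &vblanks);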

@haasn
Author

haasn commented Aug 29, 2017

Upon re-approaching this problem, I noticed that this is not just a requirement for “accurate” vsync timing the way mpv does it - it is in fact a basic requirement for simply metering rendering to the display rate at all (i.e. implementing vsync).

The vulkan samples I'm looking at (e.g. cube.c from LunarG/VulkanSamples) seem to essentially do this:

  1. while (outstanding_fences == max_frame_latency) { wait(fences); }
  2. vkAcquireNextImageKHR(signal=acquired, fence=NULL);
  3. vkQueueSubmit(wait=acquired, signal=done, fence=fences[i]);
  4. vkQueuePresentKHR(wait=done, index=i);

But this appears to have a rather serious bug: It only waits on the vkQueueSubmit to complete, not on the actual vkQueuePresentKHR. So if you imagine a GPU that renders a cube at 1000 fps, the fences would all fire 1ms after the corresponding vkQueueSubmit, and thus the only factor metering rendering speed here is the implicit assumption that vkAcquireNextImageKHR will block if the available swapchain images are all stuck in the presentation queue (due to the vkQueuePresentKHR calls). But the spec explicitly states that you cannot rely on vkAcquireNextImageKHR blocking to meter rendering speed, because implementations could have an arbitrary upper bound (or even no upper bound whatsoever) on the size of the swapchain. After all, the only thing the application can do is set the minimum swapchain size, not the maximum.

If even demo applications like cube.c seem to get this wrong, then I'm at a complete loss for what Khronos expects the correct behavior to look like. It seems like fixing this at the source would require vkQueuePresentKHR to signal a fence either once the frame leaves the presentation queue (and becomes the active front buffer), or once it's done being presented (and the contents have effectively been fully sent to the display). Alternatively, vkQueuePresentKHR could be redesigned to be part of a command buffer - so you could call vkCmdQueuePresentKHR(cmdbuf, image, swapchain);, and this command would be marked as “pending” until the image is no longer in use.

@haasn
Author

haasn commented Aug 29, 2017

It also seems like VK_EXT_display_control may not be as good a solution for this problem as I had originally anticipated: It requires a VkDisplayKHR, which I can't necessarily easily figure out. (Shouldn't the VkSurface have this information?) It also has a very, very awkward design. (For some reason it seems to violate vulkan API conventions by requiring that the pAllocator be non-NULL. I don't have a custom allocator though; can't it just use malloc like everything else? Or is that because it expects me to allocate a new fence for every vsync? Why can't it just re-use the same fence like literally every other command?)

@cubanismo

@haasn, I agree the first pixel event in VK_EXT_display_control is not a good solution for this problem. It simply generates a signal when the next vblank occurs. That doesn't necessarily correspond to when any prior-submitted presentation command completes. I agree the Google display timing spec is a closer match to your needs, but it is unlikely we will ever implement it outside of Android. Its semantics don't align with the capabilities available to us across other operating systems. We'll continue to work on a general solution for this problem within the Khronos working groups. As @ianelliottus mentioned, we're aware it's a sorely needed bit of functionality missing from the current specs.

There's nothing special about the allocator requirements of the functions in VK_EXT_display_control. They will fall back to the system allocator if pAllocator is NULL. If you're seeing issues with that, let me know, and ideally provide some code snippets illustrating the problem. This would be a bug.

Yes, the notifications in VK_EXT_display_control require using VK_KHR_display. Note that doesn't mean you need to be using a swapchain that presents to a VK_KHR_display. VK_KHR_display just allows enumerating displays, and VK_EXT_display_control lets you wait for events on those displays. You'd still need some way to figure out which display your window system is using for a given swapchain, though, in order to correlate the events back to your presentation commands. You could do this on X11 with the RANDR correlation function provided in VK_EXT_acquire_xlib_display. I'm not aware of definitive solutions available for other platforms at the moment, but you could compare display names with some native API to make an educated guess.
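
For X11, a rough sketch of that RANDR correlation, assuming an instance created with VK_EXT_acquire_xlib_display (and its dependencies) enabled, and an RROutput already identified via XRRGetScreenResources/XRRGetOutputInfo:

    #include <X11/extensions/Xrandr.h>
    #define VK_USE_PLATFORM_XLIB_XRANDR_EXT
    #include <vulkan/vulkan.h>

    static VkDisplayKHR display_for_output(VkInstance instance,
                                           VkPhysicalDevice gpu,
                                           Display *dpy, RROutput output)
    {
        PFN_vkGetRandROutputDisplayEXT get_display =
            (PFN_vkGetRandROutputDisplayEXT)vkGetInstanceProcAddr(
                instance, "vkGetRandROutputDisplayEXT");
        VkDisplayKHR display = VK_NULL_HANDLE;
        if (get_display && get_display(gpu, dpy, output, &display) == VK_SUCCESS)
            return display;  /* correlate events on this display with presents */
        return VK_NULL_HANDLE;
    }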

Yes a new fence does need to be allocated for each vblank. This design choice of creating a fence when requesting the events was made because these fences were different enough from regular fences that we would essentially have to do the equivalent of re-creating the fence anyway within the driver to convert an existing fence into a vblank event, and I needed to ensure they were not shareable using the new fence export extensions. The need to create a new fence every time was a side effect of that. In retrospect, I wish I'd created a new object type entirely to handle these notifications, and allowed them to be reusable. If there's ever a KHR version of this functionality, that's likely the direction I'll recommend.

@haasn
Author

haasn commented Aug 29, 2017

There's nothing special about the allocator requirements of the functions in VK_EXT_display_control. They will fall back to the system allocator if pAllocator is NULL. If you're seeing issues with that, let me know, and ideally provide some code snippets illustrating the problem. This would be a bug.

From the spec:

pAllocator must be a pointer to a valid VkAllocationCallbacks structure

This goes against the convention of most other pAllocator functions, which all state:

If pAllocator is not NULL, pAllocator must be a pointer to a valid VkAllocationCallbacks structure

So it's actually a spec-documented deviation, not an implementation bug. The validation layers also confirm this:

vk [ParameterValidation] 4: vkRegisterDisplayEventEXT: required parameter pAllocator specified as NULL (obj 0x0 (unknown object), loc 0xb3)

(But perhaps it's a bug in the specification)

@cubanismo

That is indeed a bug in the spec. Thanks for pointing it out. I'll get it fixed.

haasn added a commit to haasn/mp that referenced this issue Aug 30, 2017
This time based on RA. 2017 is the year of the vulkan desktop!

Current problems / limitations / improvement opportunities:

1. The entire thing depends on VK_NV_glsl_shader, which is a god-awful
   nvidia-exclusive hack that barely works and is held together with
   duct tape and prayers. Long-term, we really, REALLY need to figure
   out a way to use a GLSL->SPIR-V middleware like glslang. The problem
   with glslang in particular is that it's a gigantic pile of awful, but
   maybe time will help here..

2. We don't use async transfer at all. This is very difficult, but
   doable in theory with the newer design. Would require refactoring
   vk_cmdpool slightly, and also expanding ra_vk.active_cmd to include
   commands on the async queue as well. Also, async compute is pretty
   much impossible to benefit from because we need to pingpong with
   serial dependencies anyway. (Sorry AMD users, you fell for the async
   compute meme)

3. Lots of resource deallocation callbacks are thread-safe (because
   the vulkan device itself is, and once we've added a free callback
   we're pretty much guaranteed to never use that resource again from
   within mpv). As such, we could call those cleanup callbacks from a
   different thread. This would make stuff slightly more responsive when
   deallocating lots of resources at once. (e.g. resizing swapchain)

4. The custom memory allocator is pretty naive. It's prone to
   under-allocating memory, allocation thrashing, freeing slabs too
   aggressively, and general slowness due to allocating from the same
   thread. In addition to making it smarter, we should also make it
   multi-threaded: ideally it would free slabs from a different thread,
   and also pre-allocate slabs from a different thread if it reaches
   some critical "low" threshold on the amount of available bytes.
   (Perhaps relative to the current heap size). These limitations
   manifest themselves as occasional choppy performance when changing
   the window size.

5. The swapchain code and ANGLE's swapchain code could share common
   options somehow. Left out for now because I don't want to deal with
   that headache for the time being.

6. The swapchain/flipping code violates the vulkan spec, by assuming
   that the presentation queue will be bounded (in cases where rendering
   is significantly faster than vsync). But apparently, there's simply
   no better way to do this right now, to the point where even the
   stupid cube.c examples from LunarG etc. do it wrong.
   (cf. KhronosGroup/Vulkan-Docs#370)
haasn added a commit to mpv-player/mpv that referenced this issue Sep 26, 2017
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!

Current problems / limitations / improvement opportunities:

1. The swapchain/flipping code violates the vulkan spec, by assuming
   that the presentation queue will be bounded (in cases where rendering
   is significantly faster than vsync). But apparently, there's simply
   no better way to do this right now, to the point where even the
   stupid cube.c examples from LunarG etc. do it wrong.
   (cf. KhronosGroup/Vulkan-Docs#370)

2. The memory allocator could be improved. (This is a universal
   constant)

3. Could explore using push descriptors instead of descriptor sets,
   especially since we expect to switch descriptors semi-often for some
   passes (like interpolation). Probably won't make a difference, but
   the synchronization overhead might be a factor. Who knows.

4. Parallelism across frames / async transfer is not well-defined, we
   either need to use a better semaphore / command buffer strategy or a
   resource pooling layer to safely handle cross-frame parallelism.
   (That said, I gave resource pooling a try and was not happy with the
   result at all - so I'm still exploring the semaphore strategy)

5. We aggressively use pipeline barriers where events would offer a much
   more fine-grained synchronization mechanism. As a result of this, we
   might be suffering from GPU bubbles due to too-short dependencies on
   objects. (That said, I'm also exploring the use of semaphores as an
   ordering tactic which would allow cross-frame time slicing in theory)

Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.

NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).

The queue count is also currently capped to 1, because the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
@haasn
Author

haasn commented Nov 23, 2018

It's been over a year. Has there been any progress on this issue? (Also, sorry for the commit spam.)

One thing I tried doing to solve this bug in practice (if not in theory) is to use the time that vkAcquireNextImageKHR takes to block as a sort of “wait for vblank” substitute, but that has the side effect of actually acquiring the next image, which is not always what I want to do when waiting for the swap_buffers to complete. (In particular, it breaks horribly if there's a window resize operation in between this swapchain acquisition and the rendering of the next frame.) Without a way to "un-acquire" images, it therefore does not solve my problem even in practice.

I had a closer look at the (now renamed) VK_EXT_display_control, but it still doesn't map very well to my use case:

  1. If there's a delay between submitting a frame and swap_buffers, the swap_buffers call should immediately return. If I implement the swap_buffers call as "wait for next vblank", it will always block, even if it shouldn't have to.

  2. There's still no clear way to associate a VkSwapchainKHR with a VkDisplayKHR.

That said, in theory it might be possible to combine the vblank event with the swapchain counters, in the following manner (a rough sketch in code follows the list):

  • when the queue submission succeeds, record the current vblank counter
  • when calling swap_buffers(), first check whether the vblank counter has increased enough relative to the recorded value that we can assume the image must already have been made visible, and if so, return
  • otherwise, wait for the next vblank
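
A rough sketch of that heuristic (device, swapchain and display are assumed to be at hand; submit_vblank is the counter value recorded when the present was queued, and wait_for_next_vblank() is a hypothetical wrapper around vkRegisterDisplayEventEXT + vkWaitForFences):

    void swap_buffers(void)
    {
        uint64_t now;
        vkGetSwapchainCounterEXT(device, swapchain,
                                 VK_SURFACE_COUNTER_VBLANK_BIT_EXT, &now);
        if (now > submit_vblank)
            return;  /* enough vblanks have passed; image presumably visible */

        /* Otherwise, block until the next vblank (hypothetical wrapper): */
        wait_for_next_vblank(device, display);
    }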

But this does not seem like a clean solution, nor do I know how well it extends to e.g. mailbox-style swapchains. Things that could help me include:

  1. A way to map a VkSurfaceKHR or VkSwapchainKHR to a VkDisplayKHR, or alternatively something equivalent to VkDisplayKHR's vblank event that works for a VkSwapchainKHR. For example, something like a fence that gets signalled when the internal state of a VkSwapchainKHR advances as a result of a vsync (i.e. when an image gets dequeued from the swapchain).

  2. Alternatively, a way to query the internal status of a swapchain: how many images are queued? How many are available? That way I could make a more informed decision about whether to resort to waiting until the vblank counter advances.

  3. A fence that gets signalled when the status of a surface counter changes (e.g. the vblank counter). Right now the only way to wait until that changes is a busy wait loop.

@ianelliottus

We are working on this in Khronos, and hope to have a new extension that solves this in the early part of next year.

@singron

singron commented Mar 17, 2019

Any progress to report on this?

@singron

singron commented Mar 18, 2019

It looks like the newly announced 0.9 provisional OpenXR spec has the required features for proper frame timing. It seems a little odd that vulkan applications without AR/VR have to integrate with OpenXR, but whatever.

  1. xrWaitFrame blocks to synchronize the render thread with the swapchain, and returns the predicted display time of the next frame.
  2. xrBeginFrame (Not sure why this exists)
  3. xrEndFrame submits the frame to be rendered for a given display time (I'm guessing you mostly use the predicted display time you got from xrWaitFrame). A minimal sketch of this loop follows.
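
A minimal sketch of that loop against the provisional OpenXR C API (assuming an already-created XrSession; names may shift before the spec is finalized):

    XrFrameWaitInfo wait_info = { .type = XR_TYPE_FRAME_WAIT_INFO };
    XrFrameState frame_state = { .type = XR_TYPE_FRAME_STATE };
    xrWaitFrame(session, &wait_info, &frame_state);  /* blocks; yields
                                                        predictedDisplayTime */

    XrFrameBeginInfo begin_info = { .type = XR_TYPE_FRAME_BEGIN_INFO };
    xrBeginFrame(session, &begin_info);

    /* ... render content targeting frame_state.predictedDisplayTime ... */

    XrFrameEndInfo end_info = {
        .type = XR_TYPE_FRAME_END_INFO,
        .displayTime = frame_state.predictedDisplayTime,
        .environmentBlendMode = XR_ENVIRONMENT_BLEND_MODE_OPAQUE,
        /* .layerCount and .layers omitted for brevity */
    };
    xrEndFrame(session, &end_info);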

Since OpenXR seems to have endorsements from AMD, NVIDIA, Intel, and Microsoft, it seems to be the most likely way forward that will actually be implemented, unless Khronos is planning to announce a Vulkan-specific extension to duplicate this functionality.

@ghost

ghost commented Jan 23, 2021

3-4 years later, Vulkan present types are still a joke, VSync timing isn't a thing, and the only way to mitigate this is by abusing mailbox plus overpowered hardware resources to bruteforce the timing. What the hell is going on?

@osor-io

osor-io commented May 24, 2021

Has there been any more progress on this @ianelliottus @cubanismo? Or is there a different recommended way to get the functionality of VK_GOOGLE_display_timing?

I've seen the need for this pop up and be mentioned for quite a while now, but it doesn't seem like the needle has moved, sadly 😢

@krOoze
Contributor

krOoze commented May 24, 2021

@osor-io WIP: #1364

@stonesthrow
Contributor

Yes, please refer to #1364 for a solution; this thread is dead and can be closed.

@stonesthrow
Contributor

please refer to #1364 for solution

@Triang3l

Triang3l commented Jul 13, 2021

please refer to #1364 for solution

Would it help with safely destroying the semaphore awaited by vkQueuePresentKHR (in a situation when a full vkQueueWaitIdle or vkDeviceWaitIdle is overkill), or is it limited to just querying time intervals?

@krOoze
Contributor

krOoze commented Jul 13, 2021

@Triang3l It is broken. As per #152, not even vk*WaitIdle might be enough.

The extension does add another way to infer the semaphore state. Nevertheless, this is something that should be fixed in core 1.0, not by usage of extensions. Besides, busy-waiting on vkGetPastPresentationTimingEXT might be no better than vk*WaitIdle.

It would also be treading on thin ice a bit. vkDestroySemaphore says the whole batch must be finished before destroying the semaphore, whatever that means in the case of a present op.

@stonesthrow
Contributor

There is an update, long in coming, to address these scenarios. Specifically semaphore states for one. Its priority

@Triang3l

There is an update, long in coming, to address these scenarios. Specifically semaphore states for one. Its priority

Oh, nice, thank you for helping in resolving this confusing part! What is the current "industry standard" solution to this issue, by the way? Acquiring all images (not sure if that's possible for the mailbox mode) and awaiting all fences? Full WaitIdle? Or would just destroying the swapchain before the semaphores be enough (or is vkDestroySwapchainKHR also affected by this lack of a fence, and doesn't have implicit lifetime tracking)?
