Syncval: using vertex/index buffers a frame later than they were written to causes bogus error. #8655

nicebyte · 2024-10-05T22:40:54Z

Environment:

OS: Windows 10 22H2
GPU and driver version: NVidia GeForce RTX 2080 Super driver version 31.0.15.5212
SDK or header version if building from repo: 1.3.290.0
Options enabled (synchronization, best practices, etc.): Synchronization, Submit time validation, Fine Grained Locking, Core, Handle Wrapping, Object Lifetime, Stateless Parameter, Thread Safety

Describe the Issue

The application records a command buffer that writes data into two GPU buffers using vkCmdCopyBuffer. This command buffer is submitted on frame 0.
The application does not render anything on frame 0 (just an empty renderpass).
On frame 1, the application records and submits a command buffer that issues pipeline barriers for the buffers written during frame 0, and uses those buffers as vertex and index buffers in an indexed draw call.
The synchronization validation reports a read-after-write hazard for one of the buffers on frame 1.

To recap, the timeline looks like this:

[0: vkCmdCopy; empty renderpass] ---> [1: vkCmdPipelineBarrier2; vkCmdDrawIndexed ]

I think that the read-after-write hazard report is erroneous.
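To make the expectation concrete, here is a toy model of read-after-write tracking. It is only a sketch: it is not how syncval is implemented, and the stage/access labels, the `ToyHazardTracker` class, and the `run` helper are all hypothetical names invented for illustration (the actual masks used by the application are only in the screenshot). The point it illustrates is that, in a model like this, the number of idle frames between the copy and the draw has no effect on the result; only the presence of a matching barrier does.

```python
from dataclasses import dataclass, field

# Hypothetical (stage, access) labels for this toy model only.
TRANSFER_WRITE = ("TRANSFER", "TRANSFER_WRITE")
VERTEX_READ = ("VERTEX_INPUT", "VERTEX_ATTRIBUTE_READ")
INDEX_READ = ("VERTEX_INPUT", "INDEX_READ")


@dataclass
class ToyHazardTracker:
    # buffer name -> (stage, access) of the most recent write
    pending_writes: dict = field(default_factory=dict)
    # buffer name -> set of usages the most recent write was made visible to
    visible_to: dict = field(default_factory=dict)

    def write(self, buf, usage):
        self.pending_writes[buf] = usage
        self.visible_to[buf] = set()

    def barrier(self, buf, src, dst_usages):
        # A barrier only helps if its source scope matches the pending write.
        if self.pending_writes.get(buf) == src:
            self.visible_to[buf].update(dst_usages)

    def read(self, buf, usage):
        # Returns None when the read is safe, or a hazard label otherwise.
        if buf in self.pending_writes and usage not in self.visible_to[buf]:
            return "READ_AFTER_WRITE"
        return None


def run(idle_frames, with_barrier):
    t = ToyHazardTracker()
    # Frame 0: copy into both buffers.
    t.write("vertex_buffer", TRANSFER_WRITE)
    t.write("index_buffer", TRANSFER_WRITE)
    for _ in range(idle_frames):
        pass  # empty render passes touch neither buffer
    # Later frame: barriers (optional), then an indexed draw.
    if with_barrier:
        t.barrier("vertex_buffer", TRANSFER_WRITE, [VERTEX_READ])
        t.barrier("index_buffer", TRANSFER_WRITE, [INDEX_READ])
    return [t.read("vertex_buffer", VERTEX_READ),
            t.read("index_buffer", INDEX_READ)]
```

In this model, `run(1, True)` and `run(3, True)` give the same answer, and so do `run(1, False)` and `run(3, False)`, which is the behavior I would expect from a hazard that is really there (or really not there).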

What makes me think this is a bug in syncval

I have double and triple checked that the application emits the required vkCmdPipelineBarrier2 command where appropriate (breakpoints in the debugger, looked at renderdoc captures, etc.). My stage and access masks look correct as well (see screenshot).
When there is no wait - i.e., the app uses the buffers within the same frame that the buffers are written to, there is no syncval error.
Finally: if, instead of only skipping drawing on frame 0, I skip drawing on frames 0, 1 and 2, the syncval error goes away! Skipping just 0 and 1 still has the syncval error.

The last two points together convince me this is a bug. If there genuinely was synchronization missing, I believe syncval should report it no matter how long has passed since the buffers were written.

I am attaching two gfxreconstruct captures:

Here, app skips drawing only on frame 0: syncval-error.zip -- with syncval enabled, replaying this capture gives an error.
And here, the app skips drawing on frames 0, 1 and 2: syncval-noerror.zip -- replay does not trigger an error.

Expected behavior

Syncval should not have flagged this workload.


artem-lunarg · 2024-10-07T14:51:04Z

@nicebyte I reproduced the behavior using the first capture with SDK 1.3.290. Thanks for providing them!

Then I tested with the latest validation code and it behaves differently. You can check it or wait for the soon to be released new SDK.

Good news: in the latest code syncval does not report the errors.

Potentially bad news: The 1.3.290 SDK does not report errors with the Standard preset (core validation), but the latest code does. The errors are about a command buffer being reset or begun while it is still in use. This is not a surprise, because we recently reworked the implementation of in-use resource tracking and it now covers some use cases that were not detected previously. Of course, there is room for bugs too. If you have the opportunity to check the new validation, please let us know if you think the new validation errors are false positives.

nicebyte · 2024-10-07T15:24:47Z

@artem-lunarg can I download the preview sdk from somewhere or do I need to build from source?

artem-lunarg · 2024-10-07T15:40:32Z

@nicebyte You can build the latest code (https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/BUILD.md). Also the new SDK should be publicly available in the first half of this week if you prefer this option.

nicebyte · 2024-10-08T00:20:46Z

So I've had time to try out the latest VVL built from HEAD today, and found some interesting things:

my reported false positive is indeed gone.
the command buffer related errors that you mentioned are not reported when i run my applications. they are validation-clean.
however, errors ARE reported for the gfxreconstruct replay!

so i think the original bug that i opened this for is already fixed, but we've established that something is up with gfxrecon (or VVL's interpretation of gfxrecon API usage) :)

nicebyte · 2024-10-08T03:04:16Z

to be more precise, what i'm seeing is: my application has NO sync errors, but if I capture it with gfxrecon and replay - the replay DOES have errors.

artem-lunarg · 2024-10-08T10:56:15Z

I'm investigating API dump from the capture, the first impression is that commands in the capture can cause this issue. I'll spend a bit more time to be sure and then will post here a structure of the commands/synchronization to confirm it's not something that the app does and is potentially caused by the capture tool.

artem-lunarg · 2024-10-08T14:24:10Z

full-apidump.txt
mini-apidump.txt

Attached full API dump and also mini version with commands related to synchronization and command buffers.

artem-lunarg · 2024-10-08T14:49:16Z

Here is a problem description based on mini API dump. This describes all usages of the command buffer 00000199A6269130 and how it leads to the validation error in frame 3 (zero-based) when vkResetCommandBuffer is called (the last command in the file).

Frame 0:
  vkAllocateCommandBuffers -> 00000199A6269130
  vkResetCommandBuffer(00000199A6269130)
  vkBeginCommandBuffer(00000199A6269130)
  vkEndCommandBuffer(00000199A6269130)
  vkQueueSubmit(cmd=00000199A6269130)
  vkQueueWaitIdle()

  vkQueueSubmit(cmd=some_other_command_buffers, fence=00000199A62619A0)

  vkResetCommandBuffer(00000199A6269130)
  vkBeginCommandBuffer(00000199A6269130)
  vkEndCommandBuffer(00000199A6269130)
  vkQueueSubmit(cmd=00000199A6269130, signal_semaphore=00000199A61C5FD0, fence=null)
  vkQueuePresentKHR(wait_semaphore=00000199A61C5FD0)

Frame 1:
  // 00000199A6269130 is not used this frame

Frame 2:
  // 00000199A6269130 is not used this frame

Frame 3:
  vkWaitForFences(00000199A62619A0)
  vkResetFences(00000199A62619A0)
  vkQueueSubmit(cmd=some_other_command_buffers, fence=00000199A62619A0)

  vkResetCommandBuffer(00000199A6269130)

These are the commands related to command buffer 00000199A6269130 as they were recorded by the capture tool. vkResetCommandBuffer in frame 3 has no guarantee that the command buffer is no longer in use: no fence was used in the third submit in frame 0, so that GPU work is not synchronized with the CPU and can still be in flight.

The second submit in frame 0 uses fence 00000199A62619A0 and then we wait on it in frame 3, but this ensures only that the second submit in frame 0 finished, but not the 3rd one.

vkAcquireNextImageKHR does not use fences so it does not participate in the synchronization with the GPU.

Based on this the error reported by the capture looks like a valid error.
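The reasoning above can be replayed with a small model. This is only a sketch under stated assumptions: a single queue, and a fence wait that retires the submit carrying that fence plus everything submitted before it. The `check_resets` function, the event tuples, and the `cb_fence` name are hypothetical and are not part of the validation layers or of gfxreconstruct; only the handle values come from the mini API dump.

```python
def check_resets(events):
    """Return command buffers that are reset while a submit using them
    may still be in flight (toy model, single queue assumed)."""
    pending = []      # (command_buffer, fence) in submission order
    violations = []
    for event in events:
        kind = event[0]
        if kind == "submit":
            _, cb, fence = event
            pending.append((cb, fence))
        elif kind == "wait_idle":
            # vkQueueWaitIdle: everything submitted so far has finished.
            pending.clear()
        elif kind == "wait_fence":
            _, fence = event
            # The wait covers the submit that signals this fence and all
            # earlier submits, but nothing submitted after it.
            hits = [i for i, (_, f) in enumerate(pending) if f == fence]
            if hits:
                del pending[: hits[-1] + 1]
        elif kind == "reset":
            _, cb = event
            if any(p_cb == cb for p_cb, _ in pending):
                violations.append(cb)
    return violations


CB = "00000199A6269130"
OTHER = "some_other_command_buffers"
FENCE = "00000199A62619A0"

# Command sequence from the mini API dump (frames 0 and 3).
captured = [
    ("submit", CB, None),       # frame 0, first submit
    ("wait_idle",),
    ("submit", OTHER, FENCE),   # frame 0, second submit
    ("submit", CB, None),       # frame 0, third submit, fence=null
    ("wait_fence", FENCE),      # frame 3
    ("submit", OTHER, FENCE),
    ("reset", CB),              # reported by validation
]

# Hypothetical variant: the third submit gets its own fence ("cb_fence"),
# and that fence is waited on before the reset.
with_cb_fence = [
    ("submit", CB, None),
    ("wait_idle",),
    ("submit", OTHER, FENCE),
    ("submit", CB, "cb_fence"),
    ("wait_fence", FENCE),
    ("submit", OTHER, FENCE),
    ("wait_fence", "cb_fence"),
    ("reset", CB),
]
```

In this model, `check_resets(captured)` flags 00000199A6269130, while `check_resets(with_cb_fence)` returns an empty list, which matches the conclusion that the reset in frame 3 is not protected by the wait on 00000199A62619A0.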

@nicebyte If you can spot a difference here from what your program does please let me know.

Otherwise I think we can close this issue and I can notify gfx reconstruct team.

nicebyte · 2024-10-08T16:38:41Z

Yeah, that definitely seems suspicious. My application never calls vkResetCommandBuffer, it calls vkFreeCommandBuffers followed by vkResetCommandPool. Also, my application never calls vkQueueWaitIdle.

artem-lunarg · 2024-10-08T16:43:45Z

Okay, it should be a side effect of the capture tool. I'll let the team know.

artem-lunarg · 2024-10-08T17:24:57Z

@nicebyte the team is looking into it. One workaround is to use the --swapchain captured replay option. It does not report errors but reports something else related to replay (I'm not very familiar with that project). Another option is --swapchain offscreen: it does not report errors, but it does not show a window during replay; it can still be useful for validation. Just in case you encounter similar issues again!

bradgrantham-lunarg · 2024-10-08T17:50:36Z

@nicebyte the team is looking into it, one workaround is to use --swapchain captured replay option. It does not report errors but reports something else related to replay (I'm not very familiar with that project).

Thanks for the issue report! It looks like we do have invalid usage for the "virtual swapchain" mode in replay, and will work that out. In the meantime, as @artem-lunarg says, the workaround is --swapchain captured but that may cause incorrect rendering on replay, most likely if you use a present mode other than FIFO. It looks like you use FIFO in your capture, so you're probably safe (but even FIFO can acquire swapchain indices out-of-order which will cause replay to render incorrectly, FYI)

nicebyte changed the title ~~Syncval: using vertex/index buffers a frame later than they were written to causes bogus syncval error.~~ Syncval: using vertex/index buffers a frame later than they were written to causes bogus error. Oct 5, 2024

artem-lunarg added the Synchronization Synchronization Validation Object Issue label Oct 5, 2024

artem-lunarg self-assigned this Oct 7, 2024

artem-lunarg closed this as completed Oct 8, 2024

bradgrantham-lunarg mentioned this issue Oct 8, 2024

Fix invalid usage by waiting on CommandBuffer fence before resetting CommandBuffer LunarG/gfxreconstruct#1795

Merged
