Syncval: using vertex/index buffers a frame later than they were written to causes bogus error. #8655

Closed
nicebyte opened this issue Oct 5, 2024 · 12 comments
Assignees
Labels
Synchronization Synchronization Validation Object Issue

Comments

@nicebyte
nicebyte commented Oct 5, 2024

Environment:

  • OS: Windows 10 22H2
  • GPU and driver version: NVIDIA GeForce RTX 2080 Super, driver version 31.0.15.5212
  • SDK or header version if building from repo: 1.3.290.0
  • Options enabled (synchronization, best practices, etc.): Synchronization, Submit time validation, Fine Grained Locking, Core, Handle Wrapping, Object Lifetime, Stateless Parameter, Thread Safety

Describe the Issue

  • The application records a command buffer that writes data into two GPU buffers using vkCmdCopyBuffer. This command buffer is submitted on frame 0.
  • The application does not render anything on frame 0 (just an empty renderpass).
  • On frame 1, the application records and submits a command buffer that issues pipeline barriers for the buffers written during frame 0, and uses those buffers as vertex and index buffers in an indexed draw call.
  • The synchronization validation reports a read-after-write hazard for one of the buffers on frame 1.

To recap, the timeline looks like this:

[0: vkCmdCopyBuffer; empty renderpass] ---> [1: vkCmdPipelineBarrier2; vkCmdDrawIndexed]

I think that the read-after-write hazard report is erroneous.

What makes me think this is a bug in syncval

  • I have double and triple checked that the application emits the required vkCmdPipelineBarrier2 command where appropriate (breakpoints in the debugger, looked at renderdoc captures, etc.). My stage and access masks look correct as well (see screenshot):
    [image: screenshot of the barrier's stage and access masks]

  • When there is no wait (i.e., the app uses the buffers within the same frame in which they are written), there is no syncval error.

  • Finally: if, instead of only skipping drawing on frame 0, I skip drawing on frames 0, 1 and 2, the syncval error goes away! Skipping just 0 and 1 still has the syncval error.

The last two points together convince me this is a bug. If synchronization were genuinely missing, I believe syncval should report it no matter how much time has passed since the buffers were written.

I am attaching two gfxreconstruct captures:

  • Here, app skips drawing only on frame 0: syncval-error.zip -- with syncval enabled, replaying this capture gives an error.
  • And here, the app skips drawing on frames 0, 1 and 2: syncval-noerror.zip -- replay does not trigger an error.

Expected behavior

Syncval should not have flagged this workload.

@nicebyte nicebyte changed the title from "Syncval: using vertex/index buffers a frame later than they were written to causes bogus syncval error." to "Syncval: using vertex/index buffers a frame later than they were written to causes bogus error." on Oct 5, 2024
@artem-lunarg artem-lunarg added the Synchronization Synchronization Validation Object Issue label Oct 5, 2024
@artem-lunarg
Contributor

artem-lunarg commented Oct 7, 2024

@nicebyte I reproduced the behavior using the first capture with SDK 1.3.290. Thanks for providing them!

Then I tested with the latest validation code and it behaves differently. You can check it yourself or wait for the new SDK, which will be released soon.

Good news: in the latest code syncval does not report the errors.

Potentially bad news: the 1.3.290 SDK does not report errors with the Standard preset (core validation), but the latest code does. The errors are about a command buffer being reset or begun while it is still in use. This is not a surprise, because we recently reworked the implementation of in-use resource tracking and it now covers some use cases that were not detected previously. Of course, there is room for bugs too. If you have the opportunity to try the new validation, please let us know if you think the new validation errors are false positives.

@artem-lunarg artem-lunarg self-assigned this Oct 7, 2024
@nicebyte
Author

nicebyte commented Oct 7, 2024

@artem-lunarg can I download the preview sdk from somewhere or do I need to build from source?

@artem-lunarg
Contributor

@nicebyte You can build the latest code (https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/BUILD.md). Also the new SDK should be publicly available in the first half of this week if you prefer this option.

@nicebyte
Author

nicebyte commented Oct 8, 2024

So I've had time to try out the latest VVL built from HEAD today, and found some interesting things:

  1. my reported false positive is indeed gone.
  2. the command buffer related errors that you mentioned are not reported when i run my applications. they are validation-clean.
  3. however, errors ARE reported for the gfxreconstruct replay!

so i think the original bug that i opened this for is already fixed, but we've established that something is up with gfxrecon (or VVL's interpretation of gfxrecon API usage) :)

@nicebyte
Author

nicebyte commented Oct 8, 2024

to be more precise, what i'm seeing is: my application has NO sync errors, but if I capture it with gfxrecon and replay - the replay DOES have errors.

@artem-lunarg
Contributor

artem-lunarg commented Oct 8, 2024

I'm investigating the API dump from the capture. My first impression is that the commands in the capture can cause this issue. I'll spend a bit more time to be sure, and then I will post the structure of the commands and synchronization here to confirm that it's not something the app does and is potentially caused by the capture tool.

@artem-lunarg
Contributor

full-apidump.txt
mini-apidump.txt

Attached are the full API dump and a mini version containing only the commands related to synchronization and command buffers.

@artem-lunarg
Contributor

artem-lunarg commented Oct 8, 2024

Here is a description of the problem based on the mini API dump. It lists all usages of the command buffer 00000199A6269130 and shows how they lead to the validation error in frame 3 (zero-based) when vkResetCommandBuffer is called (the last command in the file).

Frame 0:
  vkAllocateCommandBuffers -> 00000199A6269130
  vkResetCommandBuffer(00000199A6269130)
  vkBeginCommandBuffer(00000199A6269130)
  vkEndCommandBuffer(00000199A6269130)
  vkQueueSubmit(cmd=00000199A6269130)
  vkQueueWaitIdle()

  vkQueueSubmit(cmd=some_other_command_buffers, fence=00000199A62619A0)

  vkResetCommandBuffer(00000199A6269130)
  vkBeginCommandBuffer(00000199A6269130)
  vkEndCommandBuffer(00000199A6269130)
  vkQueueSubmit(cmd=00000199A6269130, signal_semaphore=00000199A61C5FD0, fence=null)
  vkQueuePresentKHR(wait_semaphore=00000199A61C5FD0)

Frame 1:
  // 00000199A6269130 is not used this frame

Frame 2:
  // 00000199A6269130 is not used this frame

Frame 3:
  vkWaitForFences(00000199A62619A0)
  vkResetFences(00000199A62619A0)
  vkQueueSubmit(cmd=some_other_command_buffers, fence=00000199A62619A0)

  vkResetCommandBuffer(00000199A6269130)

These are the commands related to command buffer 00000199A6269130 as they were recorded by the capture tool. The vkResetCommandBuffer call in frame 3 has no guarantee that the command buffer is no longer in use: the third submit in frame 0 did not use a fence, so that GPU work is not synchronized with the CPU and can still be in flight.

The second submit in frame 0 uses fence 00000199A62619A0, which we then wait on in frame 3, but that only ensures that the second submit in frame 0 has finished, not the third one.

vkAcquireNextImageKHR does not use fences so it does not participate in the synchronization with the GPU.

Based on this, the error reported for the capture looks like a valid error.
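
The reasoning above can be sketched as a small, Vulkan-free model of what the CPU is allowed to assume about one queue. This is only an illustration under stated assumptions, not how the validation layers implement in-use tracking: QueueModel and ReplayCaptureAndCheckFinalReset are hypothetical names, a single queue is assumed, and only the calls listed in the mini API dump are replayed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one queue: which submissions does the CPU know are done?
class QueueModel {
public:
    // Records a submit. An empty fence name models fence = null.
    void Submit(const std::vector<std::string>& cmd_buffers, const std::string& fence = "") {
        submissions_.push_back(cmd_buffers);
        if (!fence.empty()) fence_to_submit_[fence] = submissions_.size() - 1;
    }
    // A signaled fence covers its own submit and everything submitted to the
    // same queue before it, but nothing submitted after it.
    void WaitForFence(const std::string& fence) {
        auto it = fence_to_submit_.find(fence);
        if (it != fence_to_submit_.end()) completed_ = std::max(completed_, it->second + 1);
    }
    // Models vkQueueWaitIdle: everything submitted so far is done.
    void WaitIdle() { completed_ = submissions_.size(); }
    // True if some submit using `cb` is not known to have finished.
    bool MayBeInUse(const std::string& cb) const {
        for (std::size_t i = completed_; i < submissions_.size(); ++i)
            for (const std::string& c : submissions_[i])
                if (c == cb) return true;
        return false;
    }
private:
    std::vector<std::vector<std::string>> submissions_;
    std::map<std::string, std::size_t> fence_to_submit_;
    std::size_t completed_ = 0;  // submissions [0, completed_) are known done
};

// Replays the sequence from the mini API dump and reports whether the
// vkResetCommandBuffer in frame 3 can hit a command buffer still in flight.
bool ReplayCaptureAndCheckFinalReset() {
    const std::string cb = "00000199A6269130";
    const std::string fence = "00000199A62619A0";
    QueueModel queue;
    // Frame 0
    queue.Submit({cb});               // first submit
    queue.WaitIdle();
    queue.Submit({"other"}, fence);   // second submit, with fence
    queue.Submit({cb});               // third submit, fence = null (semaphore only)
    // Frames 1 and 2: cb is not used.
    // Frame 3
    queue.WaitForFence(fence);        // covers submits 1 and 2 only
    queue.Submit({"other"}, fence);   // fence is reset and reused
    return queue.MayBeInUse(cb);      // state at the final vkResetCommandBuffer
}
```

In this model the final reset is flagged because the only fence the CPU waits on was signaled by a submit that precedes the third submit in frame 0. Attaching a fence to that third submit and waiting on it (or waiting for the queue to go idle) would clear the flag.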

@nicebyte If you can spot a difference here from what your program does please let me know.

Otherwise, I think we can close this issue and I can notify the GFXReconstruct team.

@nicebyte
Author

nicebyte commented Oct 8, 2024

Yeah, that definitely seems suspicious. My application never calls vkResetCommandBuffer, it calls vkFreeCommandBuffers followed by vkResetCommandPool. Also, my application never calls vkQueueWaitIdle.

@artem-lunarg
Contributor

Okay, then it should be a side effect of the capture tool. I'll let the team know.

@artem-lunarg
Contributor

@nicebyte the team is looking into it. One workaround is to use the --swapchain captured replay option. It does not report the errors, but it reports something else related to replay (I'm not very familiar with that project). Another option is --swapchain offscreen: it does not report errors either, but it does not show a window during replay. It can still be useful for validation. Just in case you encounter similar issues again!

@bradgrantham-lunarg

> @nicebyte the team is looking into it, one workaround is to use --swapchain captured replay option. It does not report errors but reports something else related to replay (I'm not very familiar with that project).

Thanks for the issue report! It looks like we do have invalid usage in the "virtual swapchain" mode in replay, and we will work that out. In the meantime, as @artem-lunarg says, the workaround is --swapchain captured, but that may cause incorrect rendering on replay, most likely if you use a present mode other than FIFO. It looks like you use FIFO in your capture, so you're probably safe (but even FIFO can acquire swapchain indices out of order, which will cause replay to render incorrectly, FYI).
