Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash with partially bound texture arrays with variable descriptor count #2206

Closed
mbechard opened this issue Apr 13, 2024 · 11 comments
Closed
Labels
Bug Completed Issue has been fixed, or enhancement implemented.

Comments

@mbechard
Copy link
Contributor

If I have a texture array that has been sized 10, but I'm only providing 8 elements in a call to vkUpdateDescriptorSets(), it doesn't seem like the other elements in the array in MVKDescriptorSet get cleared out. This can cause a crash with deleted resources, since bindings for elements 8 and 9 will still try to get bound in MVKDescriptorSetLayoutBinding::bind. That function loops over the entire length of the variableDescriptorCount, and doesn't seem to account for partially bound sizes.
Happy to do a PR, just looking for confirmation my assumptions are correct.
I would think the solution is to call MVKImageDescriptor::reset() on every element of the array that isn't set in the Update. Is that correct?
Thanks

@mbechard
Copy link
Contributor Author

mbechard commented Apr 15, 2024

Thinking about this more. I think when a MVK resource is destroyed it needs to know all of the partially bound descriptors that reference it, and clears out those references. Otherwise dangling references to resources can occur when converting the MVK state into Metal calls.
My test case only uses variable sized descriptors, but I'm thinking this crash would affect statically sized descriptors as well that are partially bound.

@billhollings
Copy link
Contributor

Perhaps I'm missing something here.

vkUpdateDescriptorSets() is specifically designed to update only the descriptors that you indicate, and leave the others unchanged.

What behaviour are you seeing on other Vulkan platforms? What kind of crash are you seeing here? Do you have a call trace, or a small app that can demonstrate the issue you're encountering?

I think when a MVK resource is destroyed it needs to know all of the partially bound descriptors that reference it

This is existing behaviour. The resources are retained until they are removed from the descriptors. However, Vulkan does require that an app not destroy resources while they are in-flight in an executing command buffer, and that a descriptor holding a destroyed resource must not be used further while it holds that resource.

@mbechard
Copy link
Contributor Author

mbechard commented Apr 16, 2024

It works fine on Windows for all GPUs there. There are two crashes so far I've seen that show up.
Here is the call stack from the first crash. The mtlTexture passed into setFragmentTexture is invalid.

#0	0x00000001a23e5834 in objc_msgSend ()
#1	0x00000001a2fcd6e8 in -[MTLDebugRenderCommandEncoder setFragmentTexture:atIndex:] ()
#2	0x00000001187fe6f8 in MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20::operator()(MVKCommandEncoder*, MVKMTLTextureBinding&) const at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.mm:1110
#3	0x00000001187fe6b0 in decltype(std::declval<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&>()(std::declval<MVKCommandEncoder*>(), std::declval<MVKMTLTextureBinding&>())) std::__1::__invoke[abi:v15006]<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&, MVKCommandEncoder*, MVKMTLTextureBinding&>(MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&, MVKCommandEncoder*&&, MVKMTLTextureBinding&) at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/invoke.h:394
#4	0x00000001187fe654 in void std::__1::__invoke_void_return_wrapper<void, true>::__call<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&, MVKCommandEncoder*, MVKMTLTextureBinding&>(MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20&, MVKCommandEncoder*&&, MVKMTLTextureBinding&) at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/invoke.h:479
#5	0x00000001187fe620 in std::__1::__function::__alloc_func<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20, std::__1::allocator<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20>, void (MVKCommandEncoder*, MVKMTLTextureBinding&)>::operator()[abi:v15006](MVKCommandEncoder*&&, MVKMTLTextureBinding&) at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/function.h:185
#6	0x00000001187fd49c in std::__1::__function::__func<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20, std::__1::allocator<MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int)::$_20>, void (MVKCommandEncoder*, MVKMTLTextureBinding&)>::operator()(MVKCommandEncoder*&&, MVKMTLTextureBinding&) at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/function.h:359
#7	0x00000001187e033c in std::__1::__function::__value_func<void (MVKCommandEncoder*, MVKMTLTextureBinding&)>::operator()[abi:v15006](MVKCommandEncoder*&&, MVKMTLTextureBinding&) const at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/function.h:512
#8	0x00000001187e02dc in std::__1::function<void (MVKCommandEncoder*, MVKMTLTextureBinding&)>::operator()(MVKCommandEncoder*, MVKMTLTextureBinding&) const at /Applications/Xcode14.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/c++/v1/__functional/function.h:1197
#9	0x00000001187d8fec in void MVKResourcesCommandEncoderState::encodeBinding<MVKMTLTextureBinding, MVKSmallVectorImpl<MVKMTLTextureBinding, mvk_smallvector_allocator<MVKMTLTextureBinding, 8> > >(MVKSmallVectorImpl<MVKMTLTextureBinding, mvk_smallvector_allocator<MVKMTLTextureBinding, 8> >&, bool&, std::__1::function<void (MVKCommandEncoder*, MVKMTLTextureBinding&)>) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.h:455
#10	0x00000001187d891c in MVKGraphicsResourcesCommandEncoderState::encodeBindings(MVKShaderStage, char const*, bool, std::__1::function<void (MVKCommandEncoder*, MVKMTLBufferBinding&)>, std::__1::function<void (MVKCommandEncoder*, MVKMTLBufferBinding&, MVKArrayRef<unsigned int const>)>, std::__1::function<void (MVKCommandEncoder*, MVKMTLTextureBinding&)>, std::__1::function<void (MVKCommandEncoder*, MVKMTLSamplerStateBinding&)>) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.mm:901
#11	0x00000001187d9e14 in MVKGraphicsResourcesCommandEncoderState::encodeImpl(unsigned int) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.mm:1088
#12	0x00000001186cdef0 in MVKCommandEncoderState::encode(unsigned int) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandEncoderState.h:93
#13	0x00000001186cde14 in MVKCommandEncoder::finalizeDrawState(MVKGraphicsStage) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/Commands/MVKCommandBuffer.mm:714

Here is the call stack from the second. This one is much rarer. In this one the MVKImageViewPlane::_imageView is NULL.
What's interesting here though is that at this point the _planes member of MVKImageView seems to be empty. And since getMTLTexture can only be called if _planes.size() > 0, I think it's a race condition where the MVKImageView is getting destroy() called on it while this code is executing. Likely because I've called vkDestroyImageView in another thread.

#1	0x000000011868dcc4 in MVKImageViewPlane::getMTLTexture() at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKImage.mm:1566
#2	0x000000011871ec54 in MVKImageView::getMTLTexture(unsigned char) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKImage.h:574
#3	0x000000011873a0f4 in MVKImageDescriptor::bind(MVKCommandEncoder*, VkPipelineBindPoint, MVKDescriptorSetLayoutBinding*, unsigned int, bool*, MVKShaderResourceBinding&, MVKArrayRef<unsigned int>, unsigned int&) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptor.mm:918
#4	0x0000000118735f44 in MVKDescriptorSetLayoutBinding::bind(MVKCommandEncoder*, VkPipelineBindPoint, MVKDescriptorSet*, MVKShaderResourceBinding&, MVKArrayRef<unsigned int>, unsigned int&) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptor.mm:222
#5	0x000000011861a55c in MVKDescriptorSetLayout::bindDescriptorSet(MVKCommandEncoder*, VkPipelineBindPoint, unsigned int, MVKDescriptorSet*, MVKShaderResourceBinding&, MVKArrayRef<unsigned int>, unsigned int&) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKDescriptorSet.mm:46
#6	0x000000011873f4c8 in MVKPipelineLayout::bindDescriptorSets(MVKCommandEncoder*, VkPipelineBindPoint, MVKArrayRef<MVKDescriptorSet*>, unsigned int, MVKArrayRef<unsigned int>) at /Users/malcolm/dev/devel/MoltenVK/MoltenVK/MoltenVK/GPUObjects/MVKPipeline.mm:60

I also notice that if I use MVK_CONFIG_PREFILL_METAL_COMMAND_BUFFERS_STYLE_IMMEDIATE_ENCODING, or ensure all my resource destructions happen in my main thread (instead of a worker thread), I can't make the crash occur. So that's another hint it's a race condition.

What I think is happening is that the worker thread is clearing up resources at the same time the the descriptor binding operations are occurring inside of MoltenVK. For the first case the MVKGraphicsResourcesCommandEncoderState was given a MTLTexture that has since been deleted. For the second case the MVKImageView had it's _planes object cleaned up between the start of the getMTLTexture and the end of it (since it shouldn't be able to get into MVKImageViewPlane::getMTLTexture() when _planes.size() == 0), which it is when I inspect the stack at this point.

So the question is, is it legal to destroy resources pointed to by a descriptor, while that descriptor is part of a descriptor set that is being used, even though that particular descriptor isn't referenced by the shader?

@billhollings
Copy link
Contributor

So the question is, is it legal to destroy resources pointed to by a descriptor, while that descriptor is part of a descriptor set that is being used, even though that particular descriptor isn't referenced by the shader?

As I mention above, Vulkan does require that an app not destroy resources while they are in-flight in an executing command buffer, and that a descriptor holding a destroyed resource must not be used further while it holds that destroyed resource:

The application must not destroy any other type of Vulkan object until all uses of that object by the device (such as via command buffer execution) have completed.

The following Vulkan objects must not be destroyed while any command buffers using the object are in the pending state:

  • ...
  • VkBuffer
  • VkBufferView
  • VkImage
  • VkImageView
  • ...

It's possible that you may not be encountering this on other platforms, because the representation within the platform command encoder may not depend on the Vulkan resource object the way it does in MoltenVK and Metal. To improve performance, MoltenVK deliberately uses a MTLCommandBuffer that does not retain the references to the MTLTextures and MTLBuffers.

Having said that, MoltenVK does retain the resource objects within descriptors, but if you also destroy the descriptor set on your worker thread, then it might all disappear too early.

@mbechard
Copy link
Contributor Author

mbechard commented Apr 17, 2024

Thanks for your continued attention to this. The spec also states:

VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT indicates that descriptors in this binding that are not dynamically used need not contain valid descriptors at the time the descriptors are consumed. A descriptor is dynamically used if any shader invocation executes an instruction that performs any memory access using the descriptor. If a descriptor is not dynamically used, any resource referenced by the descriptor is not considered to be referenced during command execution.

This clarification was only added recently, in spec v1.3.210.

There is also this validation layer issue that mentions a similar workflow:
KhronosGroup/Vulkan-Docs#1794

I do understand why this would be problematic for MoltenVK though, as the translation layer between it and Metal requires accessing resources on the CPU to do the translation that would otherwise not be accessed by other implementations.

Is my understanding correct? If so, I can try to think of a solution.

@tranvanlam
Copy link

Thanks for your continued attention to this. The spec also states:

VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT indicates that descriptors in this binding that are not dynamically used need not contain valid descriptors at the time the descriptors are consumed. A descriptor is dynamically used if any shader invocation executes an instruction that performs any memory access using the descriptor. If a descriptor is not dynamically used, any resource referenced by the descriptor is not considered to be referenced during command execution.

This clarification was only added recently, in spec v1.3.210.

There is also this validation layer issue that mentions a similar workflow: KhronosGroup/Vulkan-Docs#1794

I do understand why this would be problematic for MoltenVK though, as the translation layer between it and Metal requires accessing resources on the CPU to do the translation that would otherwise not be accessed by other implementations.

Is my understanding correct? If so, I can try to think of a solution.

@mbechard Have you found a solution yet? Can you share that solution with me in here?

@mbechard
Copy link
Contributor Author

What I ended up doing is for each descriptor, I keep track of which elements have been bound. When I update the array for a new usage, I make sure I unbind elements (or bind them to known valid entries) for anything that has been previously bound. So even if your array allocation is large, you can still leave most of it partially-bound entries untouched, and only maintain the ones you are actually making use of.
It won't work for all workflows though, but it worked for me.

@billhollings
Copy link
Contributor

PR #2320 should fix this. Please retest with latest MoltenVK and close this issue if the problem is fixed.

@billhollings
Copy link
Contributor

It's been a month. Closing now.

@mbechard
Copy link
Contributor Author

mbechard commented Sep 30, 2024

Sorry, this fell through the cracks. Honestly I feel like the cure is worse than the symptom on this one. I think in both cases you are stuck doing non-Vulkan spec behavior. With this new change, GPU resources aren't freed until descriptor sets stop referencing them. So the choices are;
(1). Workaround for old behavior: Keep track of all entries in the array that are bound, and whenever you use that descriptor, ensure that those entries are valid (you can ignore entries that were never used).
(2). Required new management for new behavior: All resources now need to keep track of every descriptor that references them, and when they want to be freed, go through all of those descriptors and do an update to cause the descriptor to no longer reference them.

Dealing with (2) seems much more difficult than (1), and I worry that for people that aren't running into (1), they are going to be getting hit by (2) without realizing it.

(Sorry about the errant references to issues 1 and 2 from this repo in my original post in this message)

@billhollings
Copy link
Contributor

I've created the separate enhancement issue #2359 to deal with the unreleased memory consumed by the dead resources that are still retained by a descriptor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Completed Issue has been fixed, or enhancement implemented.
Projects
None yet
Development

No branches or pull requests

3 participants