Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vulkan: Fix incorrect access to the buffers on Android #84852

Merged
merged 1 commit into from
Dec 4, 2023

Conversation

Alex2782
Copy link
Contributor

@Alex2782 Alex2782 commented Nov 13, 2023

Fixes #78715
Fixes #75599
Fixes #80371
Fixes #84355

Error description: #78715 (comment) / #78715 (comment)


old tests

Godot 4.2 beta5 78715 and 75599
"Mesh Array_Index" is not read correctly (it sometimes contains 'attrib_data' or 'vertex_data'). The app crashes when "soft_body_3d" is inserted.

crash.Design.ohne.Titel.mp4

this PR 78715 and 75599
"Mesh Array_Index" is always read correctly and there are no crashes when soft_body_3d is inserted.

Screen_recording_20231113_174153.mp4

80371 Godot 4.1.3

cubes.4.1.3-Screen_recording_20231113_16101.mp4

80371 Godot 4.2 (beta.custom, this PR)

cubes.custom.Screen_recording_20231113_161639.mp4

@Alex2782 Alex2782 requested a review from a team as a code owner November 13, 2023 17:00
@akien-mga akien-mga changed the title Fixes incorrect access to the buffers on Android Vulkan: Fixes incorrect access to the buffers on Android Nov 13, 2023
@akien-mga akien-mga changed the title Vulkan: Fixes incorrect access to the buffers on Android Vulkan: Fix incorrect access to the buffers on Android Nov 13, 2023
@Alex2782

This comment was marked as outdated.

@clayjohn
Copy link
Member

The functions vmaInvalidateAllocation / vmaFlushAllocation require too much performance. The FPS goes from 60 to 40 because 'issue 75599' queries the data via _process. With VK_MEMORY_PROPERTY_HOST_COHERENT_BIT this was no problem.

	void *buffer_mem;
	VkResult vkerr = vmaMapMemory(allocator, tmp_buffer.allocation, &buffer_mem);
	ERR_FAIL_COND_V_MSG(vkerr, Vector<uint8_t>(), "vmaMapMemory failed with error " + itos(vkerr) + ".");

#ifdef __ANDROID__
	vmaInvalidateAllocation(allocator, tmp_buffer.allocation, 0, VK_WHOLE_SIZE);
#endif

	Vector<uint8_t> buffer_data;
	{
		buffer_data.resize(p_size);
		uint8_t *w = buffer_data.ptrw();
		memcpy(w, buffer_mem, p_size);
	}

#ifdef __ANDROID__
	vmaFlushAllocation(allocator, tmp_buffer.allocation, 0, VK_WHOLE_SIZE);
#endif

	vmaUnmapMemory(allocator, tmp_buffer.allocation);

	_buffer_free(&tmp_buffer);

Screen_recording_20231113_231241.mp4

Then let's just request COHERENT memory. I would add an optional parameter to _buffer_allocate() called p_coherent (default to false). When true it should add the coherent flag to the preferred flags

@Alex2782
Copy link
Contributor Author

now: requiredFlags = VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;
like: google-hellovk and VMA_MEMORY_USAGE_CPU_ONLY

Tests:

  • without FPS limit between 63 and 69
  • without set_text(debug) (for issue 75599, RichText, gdscript) 20 FPS more
  • without issue 75599 self.get_mesh() and mesh_data.surface_get_arrays(0) up to 120 fps (system limit)

Alex2782

This comment was marked as duplicate.

@akien-mga akien-mga added this to the 4.3 milestone Nov 14, 2023
@akien-mga akien-mga added the cherrypick:4.2 Considered for cherry-picking into a future 4.2.x release label Nov 14, 2023
@RandomShaper
Copy link
Member

RandomShaper commented Nov 14, 2023

I'd suggest simplifyng this PR by making _buffer_allocate() always add VK_MEMORY_PROPERTY_HOST_COHERENT_BIT in case it's given VMA_MEMORY_USAGE_AUTO_PREFER_HOST. Regarding VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, as far as I know, VMA will add it automatically.

And, as someone else has pointed out, _insert_staging_block() would also benefit from VK_MEMORY_PROPERTY_HOST_COHERENT_BIT for the CPU-to-GPU direction.

On desktop, you don't need to ask for VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, but you should unless you want to rely on driver behavior. Quoting VMA docs:

Also, Windows drivers from all 3 PC GPU vendors (AMD, Intel, NVIDIA) currently provide HOST_COHERENT flag on all memory types that are HOST_VISIBLE, so on PC you may not need to bother.

My guess is that there wouldn't be a performance loss on platforms that already add the proper bit unconditionally. And on platforms that don't, we have to add it explicitly anyway. Therefore, let's always ask for coherent memory when it comes to data transfers via mapped memory.

clayjohn
clayjohn previously approved these changes Nov 14, 2023
Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach looks good to me. Let's merge this early in the 4.3 release cycle and cherrypick to 4.2.1

@Alex2782
Copy link
Contributor Author

the new changes have an impact on more functions

texture_get_data -> _buffer_allocate(...., VMA_MEMORY_USAGE_AUTO_PREFER_HOST....);
_texture_update and _buffer_update -> _insert_staging_block

new Test project test.buffer.zip

MacOs and Android no problems.
Godot Android Editor (issue 80371) not tested.

Screen-2023-11-14-152844.mp4

@Alex2782
Copy link
Contributor Author

#include "thirdparty/vulkan/vk_mem_alloc.h"

@akien-mga Can this change be left in?

  • not necessary for scons and cmake
  • VsCode finds "vk_mem_alloc.h"
  • Android Studio: "vk_mem_alloc.h" (NOT), <vk_mem_alloc.h> (NOT)
  • "thirdparty/vulkan/vk_mem_alloc.h" (OK), <thirdparty/vulkan/vk_mem_alloc.h> (OK)

@sakrel
Copy link
Contributor

sakrel commented Nov 14, 2023

The functions vmaInvalidateAllocation / vmaFlushAllocation require too much performance. The FPS goes from 60 to 40 because 'issue 75599' queries the data via _process. With VK_MEMORY_PROPERTY_HOST_COHERENT_BIT this was no problem.

	void *buffer_mem;
	VkResult vkerr = vmaMapMemory(allocator, tmp_buffer.allocation, &buffer_mem);
	ERR_FAIL_COND_V_MSG(vkerr, Vector<uint8_t>(), "vmaMapMemory failed with error " + itos(vkerr) + ".");

#ifdef __ANDROID__
	vmaInvalidateAllocation(allocator, tmp_buffer.allocation, 0, VK_WHOLE_SIZE);
#endif

	Vector<uint8_t> buffer_data;
	{
		buffer_data.resize(p_size);
		uint8_t *w = buffer_data.ptrw();
		memcpy(w, buffer_mem, p_size);
	}

#ifdef __ANDROID__
	vmaFlushAllocation(allocator, tmp_buffer.allocation, 0, VK_WHOLE_SIZE);
#endif

	vmaUnmapMemory(allocator, tmp_buffer.allocation);

	_buffer_free(&tmp_buffer);

Screen_recording_20231113_231241.mp4

hmm, I don't think we should disregard the vmaInvalidateAllocation / HOST_VISIBLE | HOST_CACHED approach just yet.
The ARM documentation strongly suggests that HOST_VISIBLE | HOST_CACHED should be the preferred memory type for CPU readback.

Use HOST_VISIBLE | HOST_COHERENT | HOST_CACHED memory for resources which are read back on to the CPU. Use HOST_VISIBLE | HOST_CACHED if it is not available.

Do not read back data from uncached memory on the CPU.

Uncached readbacks can be much slower than cached reads due to increased CPU processing costs.

Check that all CPU-read buffers are using cached memory.

https://developer.arm.com/documentation/101897/0301/CPU-overheads/Vulkan-CPU-memory-mapping

However, the documentation also mentions that manually invaliding/flushing the cache can be expensive:

vkFlushMappedRanges() when the CPU has finished writing data for the GPU.
vkInvalidateMappedRanges() when reading back the data that the GPU has written.
However, both calls are expensive to use, so must be used sparingly.

I wonder if the vmaInvalidateAllocation / HOST_VISIBLE | HOST_CACHED approach just has a bigger base cost, but scales better with bigger buffer sizes?

Also, I think, the vmaFlushAllocation in your code is unnecessary because the CPU never writes to the mapped buffer.

@clayjohn
Copy link
Member

hmm, I don't think we should disregard the vmaInvalidateAllocation / HOST_VISIBLE | HOST_CACHED approach just yet.
The ARM documentation strongly suggests that HOST_VISIBLE | HOST_CACHED should be the preferred memory type for CPU readback.

I think you mixed that up. The document you linked recommends using HOST_VISIBLE | HOST_COHERENT | HOST_CACHED over HOST_VISIBLE | HOST_CACHED It also recommends avoiding vkInvalidateMappedRanges()

@sakrel
Copy link
Contributor

sakrel commented Nov 14, 2023

I think you mixed that up. The document you linked recommends using HOST_VISIBLE | HOST_COHERENT | HOST_CACHED over HOST_VISIBLE | HOST_CACHED

Yes, this memory type is recommended if supported by the hardware:

The HOST_VISIBLE | HOST_COHERENT | HOST_CACHED memory type:
Provides cached storage on the CPU, which is also coherent with the GPU view of the memory without needing manual synchronization.
Supported by Arm Bifrost and later GPUs if the chipset supports the hardware coherency protocol between the CPU and GPU.

This random android device for example doesn't support HOST_VISIBLE | HOST_COHERENT | HOST_CACHED: https://vulkan.gpuinfo.org/displayreport.php?id=25724#memory

@clayjohn
Copy link
Member

clayjohn commented Nov 14, 2023

I think you mixed that up. The document you linked recommends using HOST_VISIBLE | HOST_COHERENT | HOST_CACHED over HOST_VISIBLE | HOST_CACHED

Yes, this memory type is recommended if supported by the hardware:

The HOST_VISIBLE | HOST_COHERENT | HOST_CACHED memory type:
Provides cached storage on the CPU, which is also coherent with the GPU view of the memory without needing manual synchronization.
Supported by Arm Bifrost and later GPUs if the chipset supports the hardware coherency protocol between the CPU and GPU.

This random android device for example doesn't support HOST_VISIBLE | HOST_COHERENT | HOST_CACHED: https://vulkan.gpuinfo.org/displayreport.php?id=25724#memory
gpu

But it does support DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT (Memory Type 0) which is the type of memory that we request with this PR.

The documentation only recommends HOST_VISIBLE | HOST_CACHED as a fallback in case HOST_VISIBLE | HOST_COHERENT | HOST_CACHED is not available. But we don't actually care about HOST_CACHED since we are doing a one time operation. We aren't leaving the memory mapped. So the caching is an unnecessary overhead.

In my mind HOST_VISIBLE | HOST_COHERENT is the optimal memory type here as it is always available (on Mali anyway)

@darksylinc
Copy link
Contributor

As sakrel said:

The functions vmaInvalidateAllocation / vmaFlushAllocation require too much performance. The FPS goes from 60 to 40 because 'issue 75599' queries the data via _process. With VK_MEMORY_PROPERTY_HOST_COHERENT_BIT this was no problem.

  • For reading (which is what issue Mobile Renderer makes Meshes unreadable/uneditable on Android #75599 is about) you always want HOST_CACHED.
    • However Godot does something ill-advised which is map, then memcpy, then unmap.
    • This mostly negates the benefits of HOST_CACHED, since the first thing you do is get the memory out of a non-cached region and copy it into a cached region.
  • For writing, you usually don't need HOST_CACHED nor HOST_COHERENT.
    • Whether you want them depends on how you do the writes.

I noticed the code flushes WHOLE_SIZE instead of flushing only the necessary ranges (Note to self: WAIT A MINUTE! VMA does not do buffer suballocation!?!?).
I don't know if VMA prevents this or not, but you must NOT call vkInvalidateMappedMemoryRanges if the memory is coherent. Validation layers will complain. If you do this on coherent memory, it is undefined behavior but it may explain performance difference.

Also make sure validation layers are disabled, because validation layers will make sure non-coherent memory returns garbage to more easily identify bugs.

@clayjohn
Copy link
Member

@darksylinc What is the benefit of using HOST_CACHED for a one-time only read operation? That's the context we are dealing with here. I agree it makes sense when we will be doing multiple memory accesses before unmapping, but in this specific use-case, is there any point?

@darksylinc
Copy link
Contributor

What is the benefit of using HOST_CACHED for a one-time only read operation?

It's more as to how the CPU operates and how we are doing the read.

When we do (in pseudo x86 asm but applies to arm as well):

loop:
   mov eax, [non_cached_address]
   mov [cached_address], eax
   add non_cached_address, 4
   add cached_address, 4
   goto loop

If "non_cached_address" is actually cached, the CPU reads 64 bytes (a cache line) into L1, and probably prefetches much more into L2. So the next 16 iterations (we're reading 4 bytes at a time) read by the mov will be super fast because it's already in L1. On the 17th iteration it will be slightly slower but it still be very fast because it fetches the next 64 bytes from L2. And so on.

If "non_cached_address" is uncached, every 4 bytes is a trip to RAM. I don't know if Out of Order Execution and speculative execution can apply here (if they do, then it can hide several of the next 4 bytes reads). The CPU will have to wait for every read operation to arrive.

Now memcpy is a tricky one, because the insides of it are intrinsic driven based on things like CPU being used on, and size of the transfer in bytes.

But usually memcpy in x86 ends up using either rep stos or a loop with a huge vmovaps with AVX2 that reads either 32 or 64 bytes at a time. On aarch64 I suspect there is something similar for bulk reading.

This heavily reduces the impact of a lack of cache because it's bulk-reading from RAM.

So to summarize: I suspect that the impact of HOST_CACHED (or lack of it) when we just use memcpy is very low.

If Godot had been designed differently, the user would be reading directly from the mapped memory. This would save a lot of bandwidth (because we save a memcpy), but since we don't control how the user going to read that memory, then using HOST_CACHED would be very important.

@Alex2782
Copy link
Contributor Author

+VK_MEMORY_PROPERTY_HOST_CACHED_BIT

During my tests, I did not notice any improvements or deterioration.

@RandomShaper
Copy link
Member

For writing, you usually don't need HOST_CACHED nor HOST_COHERENT.

  • Whether you want them depends on how you do the writes.

Don't you need HOST_COHERENT for CPU-to-GPU transfer as well (unless you want to explicitly call vkFlushMappedMemoryRanges()? Or am I misunderstanding the spec?

Taking the opportunity to ask, is VK_ACCESS_HOST_WRITE_BIT needed somewhere for the GPU to read stuff from the staging buffer?

@RandomShaper
Copy link
Member

+VK_MEMORY_PROPERTY_HOST_CACHED_BIT

During my tests, I did not notice any improvements or deterioration.

I'd suggest considering the cached flag out of the scope of this PR. This patch is about correctness. We can consider the cached thing a potential optimization belonging to a separate discussion, involving the broader topic of how Godot deals with mapped memory.

@clayjohn clayjohn dismissed their stale review November 15, 2023 07:55

Changes since approved

@Alex2782
Copy link
Contributor Author

without HOST_CACHED again

Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me again.

HOST_CACHED might be useful in the future if we are holding the map to memory on the host side. But right now we just memcpy the memory and move on. HOST_VISIBLE | HOST_COHERENT | HOST_CACHED is not always available, so it definitely shouldn't be a required memory format. In the future we can add it is a preferred option, if we move away from doing a full memcpy

@akien-mga
Copy link
Member

akien-mga commented Dec 4, 2023

Thanks! Great work tracking down this critical issue!

@YuriSizov
Copy link
Contributor

Cherry-picked for 4.2.1.

@JekSun97

This comment was marked as off-topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants