Vulkan: Fix incorrect access to the buffers on Android #84852

Alex2782 · 2023-11-13T17:00:31Z

Fixes #78715
Fixes #75599
Fixes #80371
Fixes #84355

Error description: #78715 (comment) / #78715 (comment)

Test project test.buffer.zip for 78715, 75599 and 84355
41 MB download (drive.google.com) android_editor-debug.apk for 80371

old tests

Godot 4.2 beta5 78715 and 75599
"Mesh Array_Index" is not read correctly (it sometimes contains 'attrib_data' or 'vertex_data'). The app crashes when "soft_body_3d" is inserted.

crash.Design.ohne.Titel.mp4

this PR 78715 and 75599
"Mesh Array_Index" is always read correctly and there are no crashes when soft_body_3d is inserted.

Screen_recording_20231113_174153.mp4

80371 Godot 4.1.3

cubes.4.1.3-Screen_recording_20231113_16101.mp4

80371 Godot 4.2 (beta.custom, this PR)

cubes.custom.Screen_recording_20231113_161639.mp4

drivers/vulkan/rendering_device_vulkan.h

drivers/vulkan/rendering_device_vulkan.cpp

clayjohn · 2023-11-13T22:27:49Z

The functions vmaInvalidateAllocation / vmaFlushAllocation require too much performance. The FPS goes from 60 to 40 because 'issue 75599' queries the data via _process. With VK_MEMORY_PROPERTY_HOST_COHERENT_BIT this was no problem.

	void *buffer_mem;
	VkResult vkerr = vmaMapMemory(allocator, tmp_buffer.allocation, &buffer_mem);
	ERR_FAIL_COND_V_MSG(vkerr, Vector<uint8_t>(), "vmaMapMemory failed with error " + itos(vkerr) + ".");

#ifdef __ANDROID__
	vmaInvalidateAllocation(allocator, tmp_buffer.allocation, 0, VK_WHOLE_SIZE);
#endif

	Vector<uint8_t> buffer_data;
	{
		buffer_data.resize(p_size);
		uint8_t *w = buffer_data.ptrw();
		memcpy(w, buffer_mem, p_size);
	}

#ifdef __ANDROID__
	vmaFlushAllocation(allocator, tmp_buffer.allocation, 0, VK_WHOLE_SIZE);
#endif

	vmaUnmapMemory(allocator, tmp_buffer.allocation);

	_buffer_free(&tmp_buffer);

Screen_recording_20231113_231241.mp4

Then let's just request COHERENT memory. I would add an optional parameter to _buffer_allocate() called p_coherent (default to false). When true it should add the coherent flag to the preferred flags

Alex2782 · 2023-11-13T23:30:41Z

now: requiredFlags = VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;
like: google-hellovk and VMA_MEMORY_USAGE_CPU_ONLY

Tests:

without FPS limit between 63 and 69
without set_text(debug) (for issue 75599, RichText, gdscript) 20 FPS more
without issue 75599 self.get_mesh() and mesh_data.surface_get_arrays(0) up to 120 fps (system limit)

drivers/vulkan/rendering_device_vulkan.cpp

RandomShaper · 2023-11-14T13:03:43Z

I'd suggest simplifyng this PR by making _buffer_allocate() always add VK_MEMORY_PROPERTY_HOST_COHERENT_BIT in case it's given VMA_MEMORY_USAGE_AUTO_PREFER_HOST. Regarding VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, as far as I know, VMA will add it automatically.

And, as someone else has pointed out, _insert_staging_block() would also benefit from VK_MEMORY_PROPERTY_HOST_COHERENT_BIT for the CPU-to-GPU direction.

On desktop, you don't need to ask for VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, but you should unless you want to rely on driver behavior. Quoting VMA docs:

Also, Windows drivers from all 3 PC GPU vendors (AMD, Intel, NVIDIA) currently provide HOST_COHERENT flag on all memory types that are HOST_VISIBLE, so on PC you may not need to bother.

My guess is that there wouldn't be a performance loss on platforms that already add the proper bit unconditionally. And on platforms that don't, we have to add it explicitly anyway. Therefore, let's always ask for coherent memory when it comes to data transfers via mapped memory.

clayjohn

This approach looks good to me. Let's merge this early in the 4.3 release cycle and cherrypick to 4.2.1

Alex2782 · 2023-11-14T14:48:27Z

the new changes have an impact on more functions

texture_get_data -> _buffer_allocate(...., VMA_MEMORY_USAGE_AUTO_PREFER_HOST....);
_texture_update and _buffer_update -> _insert_staging_block

new Test project test.buffer.zip

MacOs and Android no problems.
Godot Android Editor (issue 80371) not tested.

Screen-2023-11-14-152844.mp4

Alex2782 · 2023-11-14T15:07:37Z

#include "thirdparty/vulkan/vk_mem_alloc.h"

@akien-mga Can this change be left in?

not necessary for scons and cmake
VsCode finds "vk_mem_alloc.h"
Android Studio: "vk_mem_alloc.h" (NOT), <vk_mem_alloc.h> (NOT)
"thirdparty/vulkan/vk_mem_alloc.h" (OK), <thirdparty/vulkan/vk_mem_alloc.h> (OK)

sakrel · 2023-11-14T15:39:39Z

The functions vmaInvalidateAllocation / vmaFlushAllocation require too much performance. The FPS goes from 60 to 40 because 'issue 75599' queries the data via _process. With VK_MEMORY_PROPERTY_HOST_COHERENT_BIT this was no problem.

	void *buffer_mem;
	VkResult vkerr = vmaMapMemory(allocator, tmp_buffer.allocation, &buffer_mem);
	ERR_FAIL_COND_V_MSG(vkerr, Vector<uint8_t>(), "vmaMapMemory failed with error " + itos(vkerr) + ".");

#ifdef __ANDROID__
	vmaInvalidateAllocation(allocator, tmp_buffer.allocation, 0, VK_WHOLE_SIZE);
#endif

	Vector<uint8_t> buffer_data;
	{
		buffer_data.resize(p_size);
		uint8_t *w = buffer_data.ptrw();
		memcpy(w, buffer_mem, p_size);
	}

#ifdef __ANDROID__
	vmaFlushAllocation(allocator, tmp_buffer.allocation, 0, VK_WHOLE_SIZE);
#endif

	vmaUnmapMemory(allocator, tmp_buffer.allocation);

	_buffer_free(&tmp_buffer);

Screen_recording_20231113_231241.mp4

hmm, I don't think we should disregard the vmaInvalidateAllocation / HOST_VISIBLE | HOST_CACHED approach just yet.
The ARM documentation strongly suggests that HOST_VISIBLE | HOST_CACHED should be the preferred memory type for CPU readback.

Use HOST_VISIBLE | HOST_COHERENT | HOST_CACHED memory for resources which are read back on to the CPU. Use HOST_VISIBLE | HOST_CACHED if it is not available.

Do not read back data from uncached memory on the CPU.

Uncached readbacks can be much slower than cached reads due to increased CPU processing costs.

Check that all CPU-read buffers are using cached memory.

https://developer.arm.com/documentation/101897/0301/CPU-overheads/Vulkan-CPU-memory-mapping

However, the documentation also mentions that manually invaliding/flushing the cache can be expensive:

vkFlushMappedRanges() when the CPU has finished writing data for the GPU.
vkInvalidateMappedRanges() when reading back the data that the GPU has written.
However, both calls are expensive to use, so must be used sparingly.

I wonder if the vmaInvalidateAllocation / HOST_VISIBLE | HOST_CACHED approach just has a bigger base cost, but scales better with bigger buffer sizes?

Also, I think, the vmaFlushAllocation in your code is unnecessary because the CPU never writes to the mapped buffer.

clayjohn · 2023-11-14T15:49:08Z

hmm, I don't think we should disregard the vmaInvalidateAllocation / HOST_VISIBLE | HOST_CACHED approach just yet.
The ARM documentation strongly suggests that HOST_VISIBLE | HOST_CACHED should be the preferred memory type for CPU readback.

I think you mixed that up. The document you linked recommends using HOST_VISIBLE | HOST_COHERENT | HOST_CACHED over HOST_VISIBLE | HOST_CACHED It also recommends avoiding vkInvalidateMappedRanges()

sakrel · 2023-11-14T15:55:37Z

I think you mixed that up. The document you linked recommends using HOST_VISIBLE | HOST_COHERENT | HOST_CACHED over HOST_VISIBLE | HOST_CACHED

Yes, this memory type is recommended if supported by the hardware:

The HOST_VISIBLE | HOST_COHERENT | HOST_CACHED memory type:
Provides cached storage on the CPU, which is also coherent with the GPU view of the memory without needing manual synchronization.
Supported by Arm Bifrost and later GPUs if the chipset supports the hardware coherency protocol between the CPU and GPU.

This random android device for example doesn't support HOST_VISIBLE | HOST_COHERENT | HOST_CACHED: https://vulkan.gpuinfo.org/displayreport.php?id=25724#memory

clayjohn · 2023-11-14T16:26:15Z

I think you mixed that up. The document you linked recommends using HOST_VISIBLE | HOST_COHERENT | HOST_CACHED over HOST_VISIBLE | HOST_CACHED

Yes, this memory type is recommended if supported by the hardware:

The HOST_VISIBLE | HOST_COHERENT | HOST_CACHED memory type:
Provides cached storage on the CPU, which is also coherent with the GPU view of the memory without needing manual synchronization.
Supported by Arm Bifrost and later GPUs if the chipset supports the hardware coherency protocol between the CPU and GPU.

This random android device for example doesn't support HOST_VISIBLE | HOST_COHERENT | HOST_CACHED: https://vulkan.gpuinfo.org/displayreport.php?id=25724#memory
gpu

But it does support DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT (Memory Type 0) which is the type of memory that we request with this PR.

The documentation only recommends HOST_VISIBLE | HOST_CACHED as a fallback in case HOST_VISIBLE | HOST_COHERENT | HOST_CACHED is not available. But we don't actually care about HOST_CACHED since we are doing a one time operation. We aren't leaving the memory mapped. So the caching is an unnecessary overhead.

In my mind HOST_VISIBLE | HOST_COHERENT is the optimal memory type here as it is always available (on Mali anyway)

darksylinc · 2023-11-14T16:36:54Z

As sakrel said:

The functions vmaInvalidateAllocation / vmaFlushAllocation require too much performance. The FPS goes from 60 to 40 because 'issue 75599' queries the data via _process. With VK_MEMORY_PROPERTY_HOST_COHERENT_BIT this was no problem.

For reading (which is what issue Mobile Renderer makes Meshes unreadable/uneditable on Android #75599 is about) you always want HOST_CACHED.
- However Godot does something ill-advised which is map, then memcpy, then unmap.
- This mostly negates the benefits of HOST_CACHED, since the first thing you do is get the memory out of a non-cached region and copy it into a cached region.
For writing, you usually don't need HOST_CACHED nor HOST_COHERENT.
- Whether you want them depends on how you do the writes.

I noticed the code flushes WHOLE_SIZE instead of flushing only the necessary ranges (Note to self: WAIT A MINUTE! VMA does not do buffer suballocation!?!?).
I don't know if VMA prevents this or not, but you must NOT call vkInvalidateMappedMemoryRanges if the memory is coherent. Validation layers will complain. If you do this on coherent memory, it is undefined behavior but it may explain performance difference.

Also make sure validation layers are disabled, because validation layers will make sure non-coherent memory returns garbage to more easily identify bugs.

clayjohn · 2023-11-14T16:53:03Z

@darksylinc What is the benefit of using HOST_CACHED for a one-time only read operation? That's the context we are dealing with here. I agree it makes sense when we will be doing multiple memory accesses before unmapping, but in this specific use-case, is there any point?

darksylinc · 2023-11-14T17:31:47Z

What is the benefit of using HOST_CACHED for a one-time only read operation?

It's more as to how the CPU operates and how we are doing the read.

When we do (in pseudo x86 asm but applies to arm as well):

loop:
   mov eax, [non_cached_address]
   mov [cached_address], eax
   add non_cached_address, 4
   add cached_address, 4
   goto loop

If "non_cached_address" is actually cached, the CPU reads 64 bytes (a cache line) into L1, and probably prefetches much more into L2. So the next 16 iterations (we're reading 4 bytes at a time) read by the mov will be super fast because it's already in L1. On the 17th iteration it will be slightly slower but it still be very fast because it fetches the next 64 bytes from L2. And so on.

If "non_cached_address" is uncached, every 4 bytes is a trip to RAM. I don't know if Out of Order Execution and speculative execution can apply here (if they do, then it can hide several of the next 4 bytes reads). The CPU will have to wait for every read operation to arrive.

Now memcpy is a tricky one, because the insides of it are intrinsic driven based on things like CPU being used on, and size of the transfer in bytes.

But usually memcpy in x86 ends up using either rep stos or a loop with a huge vmovaps with AVX2 that reads either 32 or 64 bytes at a time. On aarch64 I suspect there is something similar for bulk reading.

This heavily reduces the impact of a lack of cache because it's bulk-reading from RAM.

So to summarize: I suspect that the impact of HOST_CACHED (or lack of it) when we just use memcpy is very low.

If Godot had been designed differently, the user would be reading directly from the mapped memory. This would save a lot of bandwidth (because we save a memcpy), but since we don't control how the user going to read that memory, then using HOST_CACHED would be very important.

Alex2782 · 2023-11-14T18:20:13Z

+VK_MEMORY_PROPERTY_HOST_CACHED_BIT

During my tests, I did not notice any improvements or deterioration.

RandomShaper · 2023-11-15T04:54:22Z

For writing, you usually don't need HOST_CACHED nor HOST_COHERENT.

Whether you want them depends on how you do the writes.

Don't you need HOST_COHERENT for CPU-to-GPU transfer as well (unless you want to explicitly call vkFlushMappedMemoryRanges()? Or am I misunderstanding the spec?

Taking the opportunity to ask, is VK_ACCESS_HOST_WRITE_BIT needed somewhere for the GPU to read stuff from the staging buffer?

RandomShaper · 2023-11-15T05:02:39Z

+VK_MEMORY_PROPERTY_HOST_CACHED_BIT

During my tests, I did not notice any improvements or deterioration.

I'd suggest considering the cached flag out of the scope of this PR. This patch is about correctness. We can consider the cached thing a potential optimization belonging to a separate discussion, involving the broader topic of how Godot deals with mapped memory.

Changes since approved

Alex2782 · 2023-11-15T13:47:17Z

without HOST_CACHED again

clayjohn

Looks good to me again.

HOST_CACHED might be useful in the future if we are holding the map to memory on the host side. But right now we just memcpy the memory and move on. HOST_VISIBLE | HOST_COHERENT | HOST_CACHED is not always available, so it definitely shouldn't be a required memory format. In the future we can add it is a preferred option, if we move away from doing a full memcpy

akien-mga · 2023-12-04T22:10:04Z

Thanks! Great work tracking down this critical issue!

YuriSizov · 2023-12-05T13:47:39Z

Cherry-picked for 4.2.1.

Alex2782 requested a review from a team as a code owner November 13, 2023 17:00

akien-mga reviewed Nov 13, 2023

View reviewed changes

drivers/vulkan/rendering_device_vulkan.h Show resolved Hide resolved

akien-mga changed the title ~~Fixes incorrect access to the buffers on Android~~ Vulkan: Fixes incorrect access to the buffers on Android Nov 13, 2023

akien-mga added bug platform:android topic:rendering labels Nov 13, 2023

akien-mga changed the title ~~Vulkan: Fixes incorrect access to the buffers on Android~~ Vulkan: Fix incorrect access to the buffers on Android Nov 13, 2023

Alex2782 commented Nov 13, 2023

View reviewed changes

drivers/vulkan/rendering_device_vulkan.h Show resolved Hide resolved

drivers/vulkan/rendering_device_vulkan.h Show resolved Hide resolved

drivers/vulkan/rendering_device_vulkan.h Show resolved Hide resolved

clayjohn reviewed Nov 13, 2023

View reviewed changes

drivers/vulkan/rendering_device_vulkan.cpp Outdated Show resolved Hide resolved

This comment was marked as outdated.

Sign in to view

Alex2782 force-pushed the fix_vulkan_buffer_android branch from 86f530c to 761f48d Compare November 13, 2023 23:17

This comment was marked as duplicate.

Sign in to view

lmurray reviewed Nov 14, 2023

View reviewed changes

drivers/vulkan/rendering_device_vulkan.cpp Outdated Show resolved Hide resolved

akien-mga added this to the 4.3 milestone Nov 14, 2023

akien-mga added the cherrypick:4.2 Considered for cherry-picking into a future 4.2.x release label Nov 14, 2023

clayjohn mentioned this pull request Nov 14, 2023

Surface Tool doesn't generate mesh on mobile renderer with Android 10 #84355

Closed

Alex2782 force-pushed the fix_vulkan_buffer_android branch from 761f48d to 4ba3c49 Compare November 14, 2023 14:35

clayjohn previously approved these changes Nov 14, 2023

View reviewed changes

Alex2782 force-pushed the fix_vulkan_buffer_android branch from 4ba3c49 to 2667e2f Compare November 14, 2023 18:17

Vulkan: Fix incorrect access to the buffers on Android

04a142c

Alex2782 force-pushed the fix_vulkan_buffer_android branch from 2667e2f to 04a142c Compare November 15, 2023 13:46

clayjohn approved these changes Nov 15, 2023

View reviewed changes

RandomShaper approved these changes Nov 15, 2023

View reviewed changes

bitsawer mentioned this pull request Dec 1, 2023

MeshDataTool.create_from_surface() : Index r[i] is out of bounds #85622

Closed

Alex2782 mentioned this pull request Dec 2, 2023

Android editor: Somewhat random graphical corruptions when viewing meshes in inspector #83481

Closed

akien-mga merged commit a9ba869 into godotengine:master Dec 4, 2023

clayjohn mentioned this pull request Dec 4, 2023

Vulkan Mobile: Tilemap has rendering issues on MediaTek Dimensity SoC #85750

Closed

Alex2782 deleted the fix_vulkan_buffer_android branch December 5, 2023 02:04

YuriSizov removed the cherrypick:4.2 Considered for cherry-picking into a future 4.2.x release label Dec 5, 2023

This was referenced Dec 6, 2023

Vulkan: TileMaps sometimes render incorrectly on Android #61789

Open

TileMap rendering error after Android export. #71020

Open

This was referenced Dec 7, 2023

ArrayMesh.surface_get_arrays and PrimitiveMesh.get_mesh_arrays return incorrect data on Android #85889

Closed

create_from_surface is broken on Android #85913

Closed

This comment was marked as off-topic.

Sign in to view

leanmendoza mentioned this pull request Dec 21, 2023

feat: implement ethereum controller decentraland/godot-explorer#122

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vulkan: Fix incorrect access to the buffers on Android #84852

Vulkan: Fix incorrect access to the buffers on Android #84852

Alex2782 commented Nov 13, 2023 •

edited

Loading

This comment was marked as outdated.

clayjohn commented Nov 13, 2023

Alex2782 commented Nov 13, 2023

This comment was marked as duplicate.

RandomShaper commented Nov 14, 2023 •

edited

Loading

clayjohn left a comment

Alex2782 commented Nov 14, 2023

Alex2782 commented Nov 14, 2023

sakrel commented Nov 14, 2023

clayjohn commented Nov 14, 2023

sakrel commented Nov 14, 2023 •

edited

Loading

clayjohn commented Nov 14, 2023 •

edited

Loading

darksylinc commented Nov 14, 2023

clayjohn commented Nov 14, 2023

darksylinc commented Nov 14, 2023

Alex2782 commented Nov 14, 2023

RandomShaper commented Nov 15, 2023

RandomShaper commented Nov 15, 2023

Alex2782 commented Nov 15, 2023

clayjohn left a comment

akien-mga commented Dec 4, 2023 •

edited

Loading

YuriSizov commented Dec 5, 2023

This comment was marked as off-topic.

Vulkan: Fix incorrect access to the buffers on Android #84852

Vulkan: Fix incorrect access to the buffers on Android #84852

Conversation

Alex2782 commented Nov 13, 2023 • edited Loading

This comment was marked as outdated.

clayjohn commented Nov 13, 2023

Alex2782 commented Nov 13, 2023

This comment was marked as duplicate.

RandomShaper commented Nov 14, 2023 • edited Loading

clayjohn left a comment

Choose a reason for hiding this comment

Alex2782 commented Nov 14, 2023

Alex2782 commented Nov 14, 2023

sakrel commented Nov 14, 2023

clayjohn commented Nov 14, 2023

sakrel commented Nov 14, 2023 • edited Loading

clayjohn commented Nov 14, 2023 • edited Loading

darksylinc commented Nov 14, 2023

clayjohn commented Nov 14, 2023

darksylinc commented Nov 14, 2023

Alex2782 commented Nov 14, 2023

RandomShaper commented Nov 15, 2023

RandomShaper commented Nov 15, 2023

Alex2782 commented Nov 15, 2023

clayjohn left a comment

Choose a reason for hiding this comment

akien-mga commented Dec 4, 2023 • edited Loading

YuriSizov commented Dec 5, 2023

This comment was marked as off-topic.

Alex2782 commented Nov 13, 2023 •

edited

Loading

RandomShaper commented Nov 14, 2023 •

edited

Loading

sakrel commented Nov 14, 2023 •

edited

Loading

clayjohn commented Nov 14, 2023 •

edited

Loading

akien-mga commented Dec 4, 2023 •

edited

Loading