Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance Vulkan PSO caching #80296

Merged
merged 1 commit into from
Sep 2, 2023

Conversation

RandomShaper
Copy link
Member

@RandomShaper RandomShaper commented Aug 5, 2023

My goal was just to address any potential threading issue, but I couldn't but overhaul the whole thing to address various other issues. Highlights:

  • Godot doesn't need to manage its own pipeline cache header. The Vulkan implementation is expected to do so, which includes considering a cache out of date after a driver update, for instance. UPDATE 2: This is not as true as I though. See the discussion below.
  • The editor and the running game were writing the PSO cache to the same file, within a given project. Reusing at runtime a PSO created at edit time is good, but it isn't being properly handled in the current implementation (some file modified timestamp and vkMergePipelineCaches would likely have to be involved). Therefore, I've taken the simple path of keeping both separate. UPDATE (2023-08-30): The file name now also includes the device name so each one has its own cache, preventing invalidation when switching GPUs.
  • Some additional synchronization is also avoided in this PR, via VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT and use of LocalVector instead of Vector.
  • The check about the cache having grown at least as much as needed for it to be resaved (see was doing integer math, so fractions of MiB in the project setting were being ignored https://github.com/godotengine/godot/compare/master...RandomShaper:overhaul_vk_pso_cache?expand=1#diff-0d547c4e6687e653e54df1623d33ab291396cca8c6265101c211a028e610b1e9L9135).
  • The logic to wait and/or store an udpated cache has been partially rewritten for better clarity.

NOTE: I'm marking this as Needs testing because at the moment of writing this I'm not sure yet this fixes the bug referenced below (I still don't know what part of the current implementation is causing it).

UPDATE: Maybe the cause was the load of the cache at https://github.com/godotengine/godot/pull/80296/files#diff-0d547c4e6687e653e54df1623d33ab291396cca8c6265101c211a028e610b1e9L9138 while the writing task may still be running.
UPDATE 2: It wasn't.

Fixes #78749.
UPDATE (2023-08-30): Fixes #81150.

@darksylinc
Copy link
Contributor

Hi Pedro!

Godot doesn't need to manage its own pipeline cache header. The Vulkan implementation is expected to do so, which includes considering a cache out of date after a driver update, for instance.

This goes against good Vulkan practices. See Arseny Kapoulkine's Creating a Robust Pipeline Cache with Vulkan and Robust pipeline cache serialization.

I'd recommend going against this direction (in fact Godot should do the opposite, perform a full checksum validation of the blob).

Some additional synchronization is also avoided in this PR, via VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT

You added this flag, but then call add_native_task((&_save_pipeline_cache, this, false, "PipelineCacheSave") where the pipeline object is accessed without a mutex to protect cache_object from being accessed at the same time as Vulkan.

While vkGetPipelineCacheData is being called in the worker task, if the main thread calls render_pipeline_create or compute_pipeline_create, it's going to corrupt it.

@RandomShaper
Copy link
Member Author

I wasn't aware of the drivers not being reliable in how they manage the cache. I'll restore the relevant parts of the code.

Regarding the external sync flag, the spec says:

If flags of pCreateInfo includes VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT, all commands that modify the returned pipeline cache object must be externally synchronized.

So, unless it's failing to mention that the function that gets the contents of the cache needs sync as well, it should be fine. It's as if the implementation was still doing some sort of sync even without that flag, but less exhaustive. However, I may be misunderstanding or it may be another real-world issue that diverges from the spec. What do you think?

@darksylinc
Copy link
Contributor

darksylinc commented Aug 5, 2023

Hi!

I wasn't aware of the drivers not being reliable in how they manage the cache. I'll restore the relevant parts of the code.

There are two difficult things in computer science: caches and naming things. 🤣

Vulkan Caches are one of the ugliest part of the API, like everything that has caches.

So, unless it's failing to mention that the function that gets the contents of the cache needs sync as well, it should be fine. It's as if the implementation was still doing some sort of sync even without that flag, but less exhaustive. However, I may be misunderstanding or it may be another real-world issue that diverges from the spec. What do you think?

Externally synchronized means you guarantee you won't be doing write-access from a thread while another thread is accessing (read or write) to the object.

The example I gave used render_pipeline_create because the following can happen:

Thread A Thread B
vkGetPipelineCacheData( device, pipelineCache, ... ); vkCreateGraphicsPipelines( device, pipelineCache, ... );

You have Thread A requesting to read from pipelineCache, but Thread B wants to write to pipelineCache. If you request externally synchronized flag, you're telling Vulkan the driver should not worry about this scenario.

Therefore while the driver is, internally inside vkCreateGraphicsPipelines, it will modify pipelineCache while in the other thread the driver is, internally inside vkGetPipelineCacheData, copying the cache to the array you submitted.

This is a race condition. This is assuming both threads are using the same VkPipelineCache handle. It's ok to use different handles.

If you were to use external sync, you'd have to place a mutex in both threads (and anything else that calls a Vulkan function that requests a VkPipelineCache as parameter).

This is why vkMergePipelineCaches exists: If we have 8 threads all of them calling vkCreateGraphicsPipelines in parallel at the same time; if all 8 threads use the same cache the following scenarios can happen:

  1. You don't set external sync flag. The driver performs a lock that lasts for the whole duration of vkCreateGraphicsPipelines. Essentially this is the same as running 1 thread because only 1 vkCreateGraphicsPipelines can execute at a time
  2. You don't set external sync flag. The driver performs a lock that only lasts as little as possible when the cache is used. This is better than the previous option, but still far from optimal (Amdahl's law).
  3. You set the external sync flag, you don't use a mutex: Extreme corruption or crashes may happen
  4. You set the external sync flag, you use a mutex to protect every vkCreateGraphicsPipelines call. Why are you using threads? This is the same situation as 1.
  5. You set the external sync flag, you use 8 VkPipelineCaches (i.e. 1 per thread). When you're done calling vkCreateGraphicsPipelines and all threads are done, you merge the caches.
    • One disadvantage is that if Thread A and Thread B meet the exact same parameters (and it was not yet in the cache), they will both independently compile a PSO and this will only get caught at the end when vkMergePipelineCaches gets called.
    • However the main purpose of Vulkan PSO caches exist mostly to reuse compilations from a previous run saved on disk.

Update: In case you're wondering where all this comes from, this is (slightly cryptically) explained in 3.6. Threading Behavior

@RandomShaper
Copy link
Member Author

RandomShaper commented Aug 5, 2023

Thanks for taking the time to explain it. I've re-read that section and I found the missing piece for my mental model (bolds mine):

If two commands operate on the same object and at least one of the commands declares the object to be externally synchronized, then the caller must guarantee not only that the commands do not execute simultaneously, but also that the two commands are separated by an appropriate memory barrier (if needed).

The thing is that I was expecting to see vkGetPipelineCacheData()'s pipelineCache parameter specified as requiring external synchronization). However, now I finally realize that such treatment is only applied to functions writing to objects. The requirement for external sync on Vulkan functions reading objects is a consequence of the writers needing sync and the above paragraph, so the spec doesn't really need to explicitly state it. (This has been my fault, by I would have liked that the spec still mentioned somehow in the reader functions which writing, sync-requiring ones they may interact with.)

In this particular case (with VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT used), vkGetPipelineCacheData() needs external sync because other functions (vkCreate*Pipelines()) are specified as needing sync on the pipelines cache argument, and Godot can run one of them at the same time as the former.

In conclusion, I've updated the PR keeping the flag but locking the class mutex around the use of vkGetPipelineCacheData(). That should still be more optimal than letting Vulkan redundantly lock (given the current implementation of RenderingDevice already locks a lot itself). Please take a look.

PS: I've renamed from "Overhaul" to "Enhance", given that is is ended up being humbler than initially expected. 😁

@RandomShaper RandomShaper changed the title Overhaul Vulkan PSO cache Enhance Vulkan PSO caching Aug 5, 2023
@RandomShaper RandomShaper force-pushed the overhaul_vk_pso_cache branch 11 times, most recently from bfdbd74 to f3158d7 Compare August 5, 2023 18:59
@darksylinc
Copy link
Contributor

darksylinc commented Aug 5, 2023

Hi!

Looks better. Just a two remarks I noticed after I wrote my comment.

Remark 1

I said two commands that have read access should not need sync. However the Vulkan spec doesn't seem to make such guarantee (it only talks about "access", it never talks about "read" vs "write" access).

So the VK spec doesn't really guarantee vkGetPipelineCacheData is read-only and thus can be called concurrently (optimistically IMO any decent implementation should only do read-only operations). What I'm saying that if we're being pedantic; then you should also lock the other call to vkGetPipelineCacheData

Remark 2

I'm new to Godot, however I see that there are macros called _THREAD_SAFE_LOCK_ & _THREAD_SAFE_UNLOCK_ instead of directly calling _thread_safe_.lock() & _thread_safe_.unlock().

Personally I prefer the following idiom (using unnamed scopes) but it doesn't seem to be common in Godot's codebase:

void function()
{
    int a = 0;
    {
        _THREAD_SAFE_METHOD_
        vkGetPipelineCacheData( ... );
    }
}

One last comment about PSO caches

PSO caches are often used and abused a lot.

But knowing who demanded it to be implemented tells a lot of what the caches do or how they should behave (i.e. the more we stray outside of the original intended use case, the more resistance we may encounter):

When gamedevs requested PSO caches, the primary use case everyone involved had in mind was the following: PS4 ports (Vulkan is based on Mantle, and Mantle is inspired on GNM, PS4's API).

For example if you try "Detroit: Become Human", on first launch you'll be met with a ~30-minute shader cache compilation stage that uses all of your CPU cores.

What is happening is the following:

  1. On first launch, game launches all threads and compiles all PSOs.
  2. The PSO is cached, then merged from all threads, then saved to disk.
  3. On subsequent runs, since everything (or almost everything) is already cached; the cache is essentially "read only" and vkCreateGraphicsPipelines should return immediately.

Of course if we look at Godot's use case (or any other popular engine like Unity or UE4/5), our use case looks nothing like that. In fact we'd be happy if all threads could share write access on the same cache with semi-frequent reads (not so "read only")

Since then drivers have worked to optimized for the torture cases that normal engines have been doing in the real world; however it's always good to know what is "the path of least resistance" is supposed to look like.

What this means

Any decent Vulkan caching implementation would use a two-stage cache:

  1. First a global read-only cache is looked for
  2. If the entry is not found, the 2nd cache is locked and an entry looked for; if non is found, then create such cache

If this implementation is used, then NOT using external sync may be faster as long as the cache is read-only most of the time, because that's the optimal path.

@RandomShaper RandomShaper force-pushed the overhaul_vk_pso_cache branch 2 times, most recently from 69a7e4e to ce23a24 Compare August 5, 2023 19:03
@darksylinc
Copy link
Contributor

I updated my comment with a "What this means" section @RandomShaper

@RandomShaper
Copy link
Member Author

I'm new to Godot, however I see that there are macros called _THREAD_SAFE_LOCK_ & _THREAD_SAFE_UNLOCK_ instead of directly calling _thread_safe_.lock() & _thread_safe_.unlock().

I had to use the hacky way because that's now an static method.

So the VK spec doesn't really guarantee vkGetPipelineCacheData is read-only and thus can be called concurrently (optimistically IMO any decent implementation should only do read-only operations). What I'm saying that if we're being pedantic; then you should also lock the other call to vkGetPipelineCacheData

If they don't state any sync requirement for that function, aren't they implicitly guaranteeing concurrent calls are safe (assuming no other functions possibly modifying the cache can run without proper sync)?

@darksylinc
Copy link
Contributor

If they don't state any sync requirement for that function, aren't they implicitly guaranteeing concurrent calls are safe (assuming no other functions possibly modifying the cache can run without proper sync)?

The thing is that everything is governed by the section 3.6. Threading Behavior in the spec unless stated otherwise by the function's documentation.

And once we request EXTERNALLY_SYNC flag, it is governed by this paragraph:

Vulkan commands use simple stores to update the state of Vulkan objects. A
parameter declared as externally synchronized may have its contents updated at any time during
the host execution of the command. If two commands operate on the same object and at least one of
the commands declares the object to be externally synchronized, then the caller must guarantee
not only that the commands do not execute simultaneously, but also that the two commands are
separated by an appropriate memory barrier (if needed).

Because the documentation doesn't say that vkGetPipelineCacheData does not update/store state into the VkPipelineCache object (even though it is obvious... unless there is an implementation detail I'm missing); then, if we are being pedantic, we have to treat it like any other function that performs write access.

@darksylinc
Copy link
Contributor

Or you could request clarification to Khronos so that the spec is amended so that this detail is guaranteed.

@sakrel
Copy link
Contributor

sakrel commented Aug 5, 2023

VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT is provided by VK_EXT_pipeline_creation_cache_control (which has been promoted to Vulkan 1.3). I think you need to enable the extension and query VkPhysicalDevicePipelineCreationCacheControlFeaturesEXT before you can use it on drivers that do not support Vulkan 1.3.

@RandomShaper RandomShaper force-pushed the overhaul_vk_pso_cache branch from ce23a24 to 509a395 Compare August 21, 2023 07:52
@RandomShaper
Copy link
Member Author

Updated, honoring the most recent feedback.

@RandomShaper RandomShaper force-pushed the overhaul_vk_pso_cache branch from d788707 to bcf33cf Compare August 30, 2023 10:01
@akien-mga
Copy link
Member

CC @darksylinc if you can do another review pass. This should solve intermittent crashes in CI tests which are quite annoying.

@darksylinc
Copy link
Contributor

LGTM

Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me too. I trust darksylinc's review and RandomShaper's testing. So let's go ahead and merge this when convenient.

@akien-mga akien-mga merged commit 497ca8c into godotengine:master Sep 2, 2023
@akien-mga
Copy link
Member

Thanks!

@RandomShaper RandomShaper deleted the overhaul_vk_pso_cache branch September 12, 2023 08:59
darksylinc added a commit to darksylinc/godot that referenced this pull request Sep 16, 2023
PR godotengine#80296 introduced a regression because it checks if the
VK_EXT_pipeline_creation_cache_control extension has been enabled before
using it, but turns out the process is a bit more convoluted than that
(a Vulkan driver may support the extension but then say the feature is
not supported)
mandryskowski pushed a commit to mandryskowski/godot that referenced this pull request Oct 11, 2023
PR godotengine#80296 introduced a regression because it checks if the
VK_EXT_pipeline_creation_cache_control extension has been enabled before
using it, but turns out the process is a bit more convoluted than that
(a Vulkan driver may support the extension but then say the feature is
not supported)
@clayjohn clayjohn removed the cherrypick:4.1 Considered for cherry-picking into a future 4.1.x release label Jan 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants